Vietnamese is a monosyllabic language in which each syllable is separated by a space in writing. Even though a word can have one or more syllables, you can write every word just by knowing all the syllables. But how many syllables are there in the Vietnamese language? This post tries to answer that question.
Disclaimer: I’m not a linguist. Based on the guidelines for forming a syllable, which are usually taught in primary school in Vietnam, and on related Wikipedia entries, I try to find all potential candidates that can be regarded as Vietnamese syllables.
I created a list of Vietnamese syllables by combining all known onsets and rimes. The result is a list of 17,974 unique syllables (download). More than half of them are not used in real life, but this makes sure no important syllable is left behind. Syllables from minority languages are not included, even though some are widely used, even in official documents.
There aren’t many works on Vietnamese linguistics that can be adapted to language processing. Fortunately, I found that the Wikipedia entry for the Vietnamese language is quite informative. The Vietnamese version, Tiếng Việt, is also worth reading if you can read Vietnamese. However, the information is quite difficult to follow for people who are not familiar with linguistics.
Instead of going deep into the linguistics, we can create all Vietnamese syllables by establishing a few rules and a structure. Even though most of them may not be ‘real’, this list can be useful for many Vietnamese text and speech processing tasks.
If by any chance you’re learning Vietnamese and came across this post, this list may not suit your needs. Instead, I would recommend a list of the most popular syllables based on usage statistics, although some information in this post might still be helpful for language learners.
Structure of a syllable
A simple syllable structure is proposed for this task. A syllable consists of two parts: the onset and the rime. The onset is optional, while the rime is essential for the syllable to be valid. A rime is always associated with one of the six tones of Vietnamese.
Inspired by the ‘Common Vietnamese rimes’ table in the Wikipedia entry for Vietnamese phonology, I created a similar table for rime construction. The difference is that the Wikipedia table is based on phonology (it uses the IPA, the International Phonetic Alphabet), while my table is based on orthography.
A rime is always associated with one tone. The table above shows the rimes in their blank tone variation. Vietnamese has 6 tones: blank (no tone mark), acute, grave, hook, tilde and dot. Not all rimes can be used with every tone.
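To make the idea of tone variations concrete, here is a minimal Python sketch that derives the six variations of a rime from its blank form using Unicode combining marks. The tone labels, the `add_tone` and `tone_variations` helpers, and the `vowel_index` parameter are assumptions made for illustration; the post does not say how the tones were actually generated, and choosing which vowel carries the mark is left to the caller here.

```python
import unicodedata

# Combining marks for the five marked tones; "blank" adds nothing.
# These labels are illustrative, not taken from the post's table.
TONE_MARKS = {
    "blank": "",
    "acute": "\u0301",
    "grave": "\u0300",
    "hook": "\u0309",
    "tilde": "\u0303",
    "dot": "\u0323",
}


def add_tone(rime: str, vowel_index: int, tone: str) -> str:
    """Return `rime` with `tone` placed on the letter at `vowel_index`."""
    # Work on the precomposed form so one index means one written letter.
    rime = unicodedata.normalize("NFC", rime)
    marked = rime[: vowel_index + 1] + TONE_MARKS[tone] + rime[vowel_index + 1 :]
    # NFC folds the letter and the combining mark into a single character.
    return unicodedata.normalize("NFC", marked)


def tone_variations(rime: str, vowel_index: int, tones=tuple(TONE_MARKS)) -> list:
    """Return one toned rime per requested tone."""
    return [add_tone(rime, vowel_index, tone) for tone in tones]


# The tone mark goes on the second letter (index 1) of this rime.
print(tone_variations("iêm", 1))
```

A rime that only accepts some tones can be handled by passing a shorter `tones` tuple, for example `tone_variations(rime, index, tones=("acute", "dot"))`.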
The rime construction table has 3 regions: red, blue and yellow. The blue region contains rimes which can be used with all 6 tones. The rimes in the red region can only be used with 2 tones, acute and dot (the blank example is there only for convenience). Rimes in the yellow region can be used with all 6 tones, but they cannot be preceded by an onset.
In case you’re wondering, the blue region consists of 102 rimes, the red region has 55 rimes and the yellow region contains 5 rimes.
Another table was prepared to show the available onsets. Onsets are split into 3 types. Type 1 are onsets with one letter (excluding the ones in Type 3); note that đ is the only character that does not exist in English. Type 2 are onsets with 2 letters; qu and gi are the two onsets which end with a vowel, and they cause some problems later.
Type 3 are onsets which are paired together. There are 3 pairs: ngh/ng, gh/g and k/c. For rimes starting with i, e or ê, the former onsets (ngh, gh, k) are used, while the latter are used for the rest.
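The choice between the two members of a Type 3 pair can be sketched in a few lines of Python. This is only an illustration under the assumption that the deciding letters are i, e and ê in any tone variation; `strip_tone` and `pick_onset` are hypothetical helper names, not part of the original work.

```python
import unicodedata

# Tone marks: grave, acute, tilde, hook above, dot below.
# The circumflex (U+0302) is deliberately absent so that ê stays ê.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

# Letters assumed to select the former onset of a pair: i, e, ê.
FRONT_VOWELS = {"i", "e", "\u00ea"}


def strip_tone(text: str) -> str:
    """Remove tone marks but keep vowel-quality marks such as the circumflex."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def pick_onset(pair: tuple, rime: str) -> str:
    """Pick the former onset before i, e or ê, and the latter otherwise."""
    former, latter = pair
    first_letter = strip_tone(rime)[0]
    return former if first_letter in FRONT_VOWELS else latter


print(pick_onset(("k", "c"), "ếm"))     # rime starts with a toned ê
print(pick_onset(("ngh", "ng"), "an"))  # rime starts with a
```

Stripping the tone before the comparison keeps the rule independent of which of the tone variations the rime carries.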
As mentioned earlier, when appending a rime to an onset that ends with a vowel (qu, gi), there are some rules that need to be followed:
- If a rime starting with i (or one of its tone variations) is appended to the onset gi, we eliminate the i of gi. For example, gi + iếm gives giếm.
- The same applies to qu and rimes starting with u.
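The joining rules for onsets that end with a vowel can be sketched as a small Python function. The `join` helper is hypothetical, and the handling of qu is an assumption: it is treated here as the exact analogue of the gi rule, with a rime-initial u collapsing into the u of qu.

```python
import unicodedata

# Tone marks: grave, acute, tilde, hook above, dot below.
TONE_MARKS = {"\u0300", "\u0301", "\u0303", "\u0309", "\u0323"}

# (onset, first base letter of the rime) combinations where the doubled
# vowel is collapsed. The qu entry is an assumption mirroring the gi rule.
COLLAPSING = {("gi", "i"), ("qu", "u")}


def strip_tone(text: str) -> str:
    """Remove tone marks but keep marks such as the circumflex or horn."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def join(onset: str, rime: str) -> str:
    """Append `rime` to `onset`, dropping the onset's vowel when it repeats."""
    rime = unicodedata.normalize("NFC", rime)
    first = strip_tone(rime[0])
    if (onset, first) in COLLAPSING:
        # Keep the rime intact so its tone mark survives; drop the onset vowel.
        return onset[:-1] + rime
    return onset + rime


print(join("gi", "iếm"))  # the doubled i is written once
print(join("gi", "ếm"))   # no rule applies, plain concatenation
```

Both calls produce the same written syllable, which is why a later deduplication step is needed.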
One could argue that there are more rules; for example, qu may not precede rimes starting with ư. But I decided to ignore overly specific rules like this, as the objective is to make sure no Vietnamese syllable is left behind, so we choose recall over precision. If someone else could follow the precision path, it would be very helpful.
There are 24 onsets in total, as we treat each pair (Type 3) as one onset to simplify the calculation.
How many syllables are there in the Vietnamese language?
Now that all components have been laid out, we can calculate the number of syllables. As a recap: there are 24 onsets plus the none case, where the onset is left blank. There are 3 groups of rimes: the blue group has 102 rimes with 6 tone variations, the red group has 55 rimes with 2 tone variations, and the yellow group has 5 rimes with 6 tone variations but cannot be preceded by an onset. So our formula would be:
( blue x 6 + red x 2 ) x ( onsets + 1 ) + ( yellow x 6 ) = ( 102 x 6 + 55 x 2 ) x ( 24 + 1 ) + ( 5 x 6 ) = 18080
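As a quick sanity check, the same count can be reproduced step by step in Python, using only the figures from the recap (102 blue rimes with 6 tones, 55 red rimes with 2 tones, 5 yellow rimes with 6 tones, 24 onsets). The variable names are my own.

```python
# Counts taken from the recap above.
ONSETS = 24
BLUE_RIMES, BLUE_TONES = 102, 6
RED_RIMES, RED_TONES = 55, 2
YELLOW_RIMES, YELLOW_TONES = 5, 6

# Toned rimes that may follow an onset: 102 * 6 + 55 * 2 = 722.
toned_rimes = BLUE_RIMES * BLUE_TONES + RED_RIMES * RED_TONES

# Each of them combines with 24 onsets or with no onset: 722 * 25 = 18050.
with_optional_onset = toned_rimes * (ONSETS + 1)

# Yellow rimes never take an onset: 5 * 6 = 30.
onsetless_only = YELLOW_RIMES * YELLOW_TONES

total = with_optional_onset + onsetless_only
print(total)  # 18080
```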
There is one problem, though. In the previous step with gi, we created some duplicate syllables even though the rimes are different. For example, gi + iếm and gi + ếm create the identical syllable giếm. We need to fix this by eliminating the duplicates. What remains is 17,974 unique syllables.
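The deduplication step can be illustrated on a toy inventory in Python. The three onsets and two rimes below are a tiny sample chosen for the example, not the real tables, and the `join` function is a simplified stand-in that only covers the gi case.

```python
from itertools import product


def join(onset: str, rime: str) -> str:
    """Simplified joining rule: gi plus a rime starting with i keeps one i."""
    if onset == "gi" and rime.startswith("i"):
        return "g" + rime
    return onset + rime


onsets = ["", "gi", "t"]  # "" stands for the none case
rimes = ["iếm", "ếm"]     # two different toned rimes

# Every onset is combined with every rime, as in the formula.
candidates = [join(onset, rime) for onset, rime in product(onsets, rimes)]

# A set keeps each written syllable once.
unique = sorted(set(candidates))

print(len(candidates), len(unique))  # 6 5
```

On the full inventory, the same step takes the 18,080 candidates down to 17,974, so 106 duplicates are removed.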
Our little journey could end here, but if you want to develop a competitive language system, there are more things you need to worry about:
- Minority languages: names of provinces of Vietnam, such as Kon Tum, come from minority languages, so you can expect these words to appear a lot. Unfortunately, syllables such as kon are not in our list, as they are not part of the national language.
- Another case is foreign words which have been Vietnamized and are used so often that people don’t notice anymore. An example is the word ôtô. To be compatible with the official language it should be written as ô tô, but the former form is still very popular.
To develop a successful Vietnamese text or speech system, you might want to take these cases into consideration.
By following a simple procedure, we have created a list of 17,974 syllables. This resource can be useful for a lot of Vietnamese text processing tasks, even though many shortcomings remain. In the next post, I will survey and analyze the usage of Vietnamese syllables.