Because Vietnamese is written in the Latin alphabet with additional characters, it is easy and common to mix English into the text. This tends to cause problems for language processing systems: they must handle not only the source language but also other words that might appear.
By analyzing content published in 2015 and 2016 by 5 popular Vietnamese news sources, I found that only 7184 unique syllables (download) are actually used, and they cover more than 94% of the content. And here is a cool graphic.
To analyze the usage of Vietnamese, I chose 5 websites that are among the most popular online news sources and have consistent data available for 2015 and 2016. As Vietnamese syllables are the focus, some categories (e.g. world news, technology, automobiles, …) were removed because they might contain a high proportion of foreign words.
Data were normalized (removing abnormal characters, fixing encoding, …) before proceeding to the next step. I used a normalization pipeline I developed myself for this task; I will not lay out the details here, as there is enough material for its own post.
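The actual pipeline is not described in this post, so the following is only a minimal sketch of what such a normalization step might look like. The function name `normalize_text` and the specific rules (NFC composition, dropping control and format characters, collapsing whitespace) are my assumptions for illustration, not the pipeline used for these results.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Hypothetical normalizer, NOT the pipeline used in this post."""
    # Compose base letters with their combining marks so that the same
    # Vietnamese syllable always has a single code point sequence.
    text = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in text:
        if ch.isspace():
            # Treat every kind of whitespace (tabs, newlines, NBSP) as a space.
            cleaned.append(" ")
        elif unicodedata.category(ch).startswith("C"):
            # Drop control/format characters such as zero-width spaces.
            continue
        else:
            cleaned.append(ch)
    # Collapse runs of spaces and trim both ends.
    return re.sub(r" +", " ", "".join(cleaned)).strip()
```

NFC matters for Vietnamese in particular because the same syllable can be encoded either precomposed or as a base letter plus combining marks, and the two forms would otherwise be counted as different tokens.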
Instead of combining the data into one big text corpus, each site was analyzed separately, as this might reveal some interesting information. For the statistics, the content of each site was gathered, paragraphs were split into sentences, and sentences were then split into tokens. Tokens were classified into three groups:
- VI: tokens that pass as Vietnamese. The list of all Vietnamese syllables was used, with the addition of these 5 syllables:
kon, as they are very common.
- NU: tokens that contain at least one digit, e.g.:
GMT+7, … This group is mostly ignored in my statistics.
- OT: tokens that belong to neither VI nor NU. We can expect this group to contain several classes, and it is the main challenge for Vietnamese language processing systems.
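The splitting and the three groups above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the results: `VI_SYLLABLES` is a tiny hypothetical stand-in for the full syllable list, the sentence splitter and tokenizer are deliberately naive, and lowercasing before lookup is my own choice.

```python
import re

# Hypothetical stand-in for the full list of Vietnamese syllables.
VI_SYLLABLES = {"tôi", "đọc", "báo", "lúc", "dùng"}


def split_sentences(paragraph):
    """Naive splitter: break after ., !, ? or … when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?…])\s+", paragraph.strip()) if s]


def tokenize(sentence):
    """Whitespace tokenizer that trims surrounding punctuation only."""
    tokens = (t.strip(".,;:!?\"'()[]…") for t in sentence.split())
    return [t for t in tokens if t]


def classify(token):
    """Assign a token to NU, VI or OT as described in the list above."""
    if any(ch.isdigit() for ch in token):
        return "NU"  # contains at least one digit, e.g. GMT+7
    if token.lower() in VI_SYLLABLES:
        return "VI"  # found in the syllable list
    return "OT"      # everything else


for sentence in split_sentences("Tôi đọc báo lúc 7h. Tôi dùng Facebook!"):
    print([(tok, classify(tok)) for tok in tokenize(sentence)])
```

Trimming punctuation only at the edges keeps tokens such as GMT+7 in one piece, so they land in NU instead of being broken apart.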
The frequency of each unique token (from now on I will call them vocabs) is calculated from the text corpora. As mentioned before, all vocabs belonging to the NU group are removed. Instead of taking all unique tokens into account, we only care about the ones that appear at least several times. I chose a different frequency limit for each site; the idea is to get just above 20000 vocabs from each site.
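One way to pick such a limit automatically is sketched below. The function names are hypothetical and the selection rule (the largest limit that still keeps at least the target number of vocabs) is my reading of "just above 20000", so treat it as an assumption.

```python
from collections import Counter


def count_vocabs(tokens):
    """Count unique tokens, skipping the NU group (tokens containing a digit)."""
    return Counter(t for t in tokens if not any(ch.isdigit() for ch in t))


def pick_frequency_limit(counts, target_size=20000):
    """Largest frequency limit that still keeps at least target_size vocabs."""
    ranked = sorted(counts.values(), reverse=True)
    if len(ranked) < target_size:
        return 1  # not enough vocabs: keep everything
    # Every vocab at least as frequent as the target_size-th one is kept.
    return ranked[target_size - 1]


def apply_limit(counts, limit):
    """Keep only vocabs that appear at least `limit` times."""
    return {tok: n for tok, n in counts.items() if n >= limit}


# Toy example with a target of 2 vocabs instead of 20000.
tokens = ["a"] * 5 + ["b"] * 3 + ["c"] * 3 + ["d"] + ["7h"] * 9
counts = count_vocabs(tokens)
limit = pick_frequency_limit(counts, target_size=2)
print(limit, sorted(apply_limit(counts, limit)))
```

Ties at the limit explain why the resulting vocab size usually ends up slightly above the target instead of exactly on it.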
| Frequency limit | #VI | #OT | VI coverage | VI-only sentences | OT-once sentences |
All 5 sites show a similar pattern in every measurement, which gives us more confidence in the results. The most important result here is the number of Vietnamese syllables that are actually used: each of the 5 sites has about 6000 syllables, and combined they yield 7184 unique syllables. Although the number of syllables is small, they cover more than 94% of all tokens used in these articles.
The figure below gives us more information about Vietnamese syllable usage. As the figure illustrates, a small number of top syllables covers most of the content. For language learners I would recommend the top 2000 syllables as learning material, as they cover more than 90% of everyday use.
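A coverage curve of this kind can be computed from the syllable counts alone. The sketch below uses made-up toy counts and hypothetical function names; it only shows the calculation, not the data behind the figure.

```python
from itertools import accumulate


def cumulative_coverage(vi_counts, total_tokens):
    """Share of all tokens covered by the top-k syllables, for k = 1, 2, ..."""
    ranked = sorted(vi_counts.values(), reverse=True)
    return [running / total_tokens for running in accumulate(ranked)]


def syllables_needed(coverage, target):
    """Smallest k whose top-k syllables reach the target coverage, else None."""
    for k, covered in enumerate(coverage, start=1):
        if covered >= target:
            return k
    return None


# Toy counts (assumed): 3 syllables in a corpus of 100 tokens.
toy_counts = {"a": 50, "b": 30, "c": 10}
curve = cumulative_coverage(toy_counts, total_tokens=100)
print(curve, syllables_needed(curve, 0.8))
```

Dividing by the total token count, and not only by the VI token count, keeps the curve comparable with the overall coverage figure, since the remaining share belongs to OT tokens.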
For language processing systems, is it safe to handle only Vietnamese syllables? The answer is: it depends. As shown in the table above, although the 7184 syllables cover more than 94% of all occurrences, only about 50% of all sentences consist of Vietnamese-only vocabs. Put simply, in the case of Automatic Speech Recognition, if you manage to train a perfect recognizer that handles only Vietnamese syllables, you can achieve 94% syllable accuracy but at most 50% sentence accuracy. So you should decide which is more important for your project.
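The gap between the two numbers is easy to reproduce on a toy corpus. In this sketch the function name and the sample data are assumptions; sentences are given as token lists with NU tokens already removed, and the corpus is assumed to be non-empty.

```python
def coverage_stats(sentences, vi_syllables):
    """Return (token-level VI coverage, share of VI-only sentences)."""
    total_tokens = vi_tokens = vi_only_sentences = 0
    for tokens in sentences:
        hits = sum(1 for t in tokens if t.lower() in vi_syllables)
        total_tokens += len(tokens)
        vi_tokens += hits
        # A single OT token is enough to disqualify the whole sentence.
        if tokens and hits == len(tokens):
            vi_only_sentences += 1
    return vi_tokens / total_tokens, vi_only_sentences / len(sentences)


vi = {"tôi", "đọc", "báo", "dùng"}
corpus = [["tôi", "đọc", "báo"], ["tôi", "dùng", "facebook"]]
token_cov, sentence_cov = coverage_stats(corpus, vi)
print(f"token coverage {token_cov:.0%}, VI-only sentences {sentence_cov:.0%}")
```

Here 5 of 6 tokens are Vietnamese syllables, yet only 1 of 2 sentences is Vietnamese-only, which is the same effect as the 94% versus 50% above on a much smaller scale.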
To develop more sophisticated systems for Vietnamese, an in-depth analysis of out-of-domain vocabs would be very valuable. I don’t think I can give that analysis here, but I can offer some initial observations based on the results I got. Combining all 5 sites gives 34,000 OT vocabs, which are the most popular non-Vietnamese-syllable tokens. These vocabs appear in more than 30% of all sentences and belong to several different classes:
- Tone-stripped syllables:
- Vietnamized words:
- Foreign words and names:
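As an example of how the first class might be spotted, the sketch below removes diacritics from known syllables and checks OT tokens against the result. It assumes that "tone-stripped" means all diacritics are dropped (tone marks as well as vowel marks, plus đ written as d); the function names and the tiny syllable set are hypothetical.

```python
import unicodedata


def strip_tones(syllable):
    """Remove all combining marks and map đ/Đ to d/D."""
    decomposed = unicodedata.normalize("NFD", syllable)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # đ has no Unicode decomposition, so it is handled explicitly.
    return no_marks.replace("đ", "d").replace("Đ", "D")


def tone_stripped_forms(vi_syllables):
    """Stripped forms that are not valid syllables themselves."""
    return {strip_tones(s) for s in vi_syllables} - set(vi_syllables)


def looks_tone_stripped(token, stripped_forms):
    """True if an OT token matches a stripped form of a known syllable."""
    return token.lower() in stripped_forms


vi = {"việt", "nam", "đường"}  # hypothetical tiny syllable list
forms = tone_stripped_forms(vi)
print(sorted(forms), looks_tone_stripped("Viet", forms))
```

A match is only a hint, since a short foreign word can coincide with a stripped syllable, so the other classes in the list still need their own handling.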
And don’t forget all the symbols (mathematical, currency, …) that were removed earlier in the normalization step. To develop a Vietnamese language processing system, there is more than just Vietnamese syllables to worry about.
As the Internet has become a standard utility for Vietnamese people, the Vietnamese language has evolved and adapted to a new era of globalization, which means more foreign words, more internet slang, and more spelling errors. To develop competitive Vietnamese language processing systems, you need to be aware of the real-life usage of the Vietnamese language instead of relying only on textbooks and assumptions.