  • Unicode character categories and the CJK ideograph complications

    Because most programming languages use ASCII characters for keywords, it is common practice to normalize text to ASCII to reduce unexpected behavior. While this works well for English, it is problematic for other languages because the process discards useful information. To develop a robust natural language processing (NLP) system that works with native scripts, we can look at Unicode, a well-established universal character encoding, and the way it handles multilingual text. The main focus is on Chinese, Japanese and Korean.

  • An end-to-end Vietnamese speech recognition recipe using the ESPnet toolkit

    There is a shortage of open resources for Vietnamese language processing systems. In this latest effort, I introduce a recipe for end-to-end Vietnamese speech recognition using the ESPnet toolkit. On the VIVOS corpus, the recipe shows a higher error rate than conventional systems because the amount of training data is insufficient. The recipe is available in the ESPnet repository and can be easily adapted to other Vietnamese corpora.

  • A curated list of Japanese, Korean and Vietnamese open speech corpora

    I have curated a list of open Japanese, Korean and Vietnamese speech corpora for academic use. While speech processing systems have been improving rapidly and achieve outstanding results for major languages like English and Chinese, development for other languages is not as active. This list was created to make it easier to jump-start a speech processing project and to spark interest in the research and development of speech processing systems.

  • The quest for a programmable English dictionary

    Ever wanted to add a dictionary feature to your applications but didn’t know where to start? It turns out the Internet is a wonderful place and has everything you need to craft your own English dictionary, either for your own use or to integrate into your products.

  • An insight into Vietnamese syllable usage

    As Vietnamese uses the Latin alphabet with extra characters for its writing system, it’s easy and common to mix English into the text. This tends to be problematic for language processing systems, as they not only need to handle the source language but also be aware of foreign words that might appear.

  • All syllables in the Vietnamese language

    Vietnamese is a monosyllabic language in which each syllable is separated by a space in writing. Even though words can have one or more syllables, you can write every word just by knowing all the syllables. But how many syllables are there in the Vietnamese language? This post sets out to answer that question.