Luong Hieu Thi's blog

Jan 2, 2023
How I take notes with Joplin

After several disruptive global events in the last couple of years, I decided to take digital notes more actively. After one year of taking notes, I have come to believe that note-taking is one of those life-changing activities you can start immediately and at zero cost (exercise is another one). I keep all my notes in Joplin, an open-source multi-platform note-taking application, and have developed several plugins to cater to my specific use cases.
Jul 22, 2021
Unicode character categories and the CJK ideograph complications

As most programming languages use ASCII characters for keywords, it is a common practice to normalize text to ASCII to reduce unexpected behaviors. While this works great for English, it is problematic for other languages as this process discards useful information. To develop a robust natural language processing (NLP) system that works with native scripts, we can look at Unicode, a well-established universal character encoding, and the way it handles multi-lingual problems. Specifically, Chinese, Japanese and Korean are the main focus.
Oct 22, 2019
An end-to-end Vietnamese speech recognition recipe using ESPnet toolkit

There is a shortage of open resources for Vietnamese language processing systems. In this latest effort, I introduce a recipe for end-to-end Vietnamese speech recognition using the ESPnet toolkit. Its performance on the VIVOS corpus shows a high error rate compared with conventional systems, as the amount of training data is not sufficient. The recipe is available in the ESPnet repository and can be easily adapted to other Vietnamese corpora.
Apr 22, 2018
A curated list of Japanese, Korean and Vietnamese open speech corpora

I have curated a list of open speech corpora for academic use in Japanese, Korean and Vietnamese. While speech processing systems have achieved outstanding results at a rapid pace for major languages like English and Chinese, development for other languages is not as active. This list was created to make it easier to jump-start a speech processing project and to spark interest in research and development of speech processing systems.
Apr 23, 2017
The quest for a programmable English dictionary

Ever wanted to add a dictionary feature to your applications but didn’t know where to start? It turns out the Internet is a wonderful place and has everything you need to craft your own English dictionary, either for your own use or to integrate into your products.
Apr 3, 2017
An insight into Vietnamese syllables usage

As Vietnamese uses the Latin alphabet with extra characters for its writing system, it’s easy and common to mix English into the text. This tends to be problematic for language processing systems, as they need not only to handle the source language but also to be aware of foreign words that might appear.
Mar 21, 2017
All syllables in Vietnamese language

Vietnamese is a monosyllabic language in which syllables are separated by spaces in writing. Even though words can have one or more syllables, you can write every word just by knowing all the syllables. But how many syllables are there in the Vietnamese language? This post sets out to answer that question.
