An end-to-end Vietnamese speech recognition recipe using the ESPnet toolkit

Open resources for Vietnamese language processing are scarce. In this post I introduce a recipe for end-to-end Vietnamese speech recognition using the ESPnet toolkit. On the VIVOS corpus, the recipe shows a higher error rate than conventional systems because the amount of training data is insufficient. The recipe is available in the ESPnet repository and can easily be adapted to other Vietnamese corpora.

End-to-End Vietnamese ASR

tl;dr

I created an end-to-end (E2E) Vietnamese speech recognition recipe for the VIVOS corpus using the ESPnet toolkit. The recipe, vivos, has been merged into master, so you can try it yourself. The results show that the E2E approach performs worse than the pipeline approach because of the insufficient amount of training data. However, in the long run, when more training data is available, E2E might be a better direction for Vietnamese ASR, especially for dealing with code-mixing, as a large number of English words are used in everyday conversation.

End-to-End speech recognition and the ESPnet toolkit

ESPnet [1] is an E2E speech processing toolkit focused on E2E Automatic Speech Recognition (ASR) and E2E Text-to-Speech (TTS). It uses Chainer and PyTorch as deep learning backends while following the Kaldi style of bash-script recipes. Because ESPnet handles the deep learning backend and the E2E approach is simple by nature, it is relatively easy to create a speech system for a new language.

To create a recipe for a new language, our job is to define what goes in and what comes out at the two “ends” of the system. For the acoustic input we can follow the standard setup. For the linguistic unit we can use characters (some systems use words, subwords or even bytes). As Vietnamese uses an extended Latin alphabet, we can use Unicode characters as the target output.

We need to convert

Nhận dạng tiếng nói

into

N H Ậ N <space> D Ạ N G <space> T I Ế N G <space> N Ó I

ESPnet already provides such a function for English, and it works fine with Unicode, although there are small problems with toupper and tolower for characters with diacritics.
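The conversion shown above can be sketched in a few lines of Python. This is only an illustration, not the script that ESPnet ships: the function name text_to_char_tokens and the default space symbol are assumptions made for this example. The text is normalized to NFC first so that every letter with diacritics is a single code point before it is upper-cased and split.

```python
import unicodedata


def text_to_char_tokens(text, space_symbol="<space>"):
    """Turn a transcript into space-separated upper-case character tokens.

    Hypothetical helper for illustration; not part of ESPnet.
    """
    # NFC makes each accented letter one code point, so iterating over a
    # word yields whole letters rather than base letters plus combining marks.
    normalized = unicodedata.normalize("NFC", text.strip()).upper()
    tokens = []
    for word_index, word in enumerate(normalized.split()):
        if word_index > 0:
            tokens.append(space_symbol)  # explicit word-boundary token
        tokens.extend(word)  # one token per character
    return " ".join(tokens)


if __name__ == "__main__":
    print(text_to_char_tokens("Nhận dạng tiếng nói"))
    # N H Ậ N <space> D Ạ N G <space> T I Ế N G <space> N Ó I
```

Python's str.upper handles the Vietnamese letters correctly once the text is in NFC, which sidesteps the upper/lower-casing issue mentioned above in this sketch.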

Recipe for Vietnamese ASR

The vivos recipe is prepared to run seamlessly from start to finish. Execute run.sh and it will download the data, train the E2E model and decode the test sets without any further input from the user. I follow a very simple setup using Connectionist Temporal Classification (CTC). The recipe also optionally supports a character-based or syllable-based Recurrent Neural Network Language Model (RNNLM) trained on the transcripts of the training data. As the corpus is very limited (15 hours), the recipe is kept simple, just to test the idea of E2E Vietnamese speech recognition.
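For readers unfamiliar with CTC, the following minimal sketch shows how a frame-level label sequence is turned into an output sequence in the simplest (greedy) case: consecutive repeats are merged, then blank symbols are dropped. It is not taken from ESPnet, whose decoder is more elaborate; the function name and the "<blank>" symbol are assumptions for this example.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_labels, blank="<blank>"):
    """Collapse per-frame CTC labels into an output label sequence.

    Illustrative helper only; the blank symbol name is an assumption.
    """
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: remove the blank symbol that CTC uses as a separator.
    return [label for label in merged if label != blank]


if __name__ == "__main__":
    frames = ["<blank>", "N", "N", "<blank>", "Ó", "Ó", "I", "<blank>"]
    print(ctc_greedy_collapse(frames))
    # ['N', 'Ó', 'I']
```

A blank between two identical labels keeps them apart, which is how CTC can output a doubled letter.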

As the VIVOS corpus includes official training and test sets as well as the results of a pipeline system, we can directly compare the results of the two approaches.

Character and syllable error rates of E2E and pipeline approaches

	E2E CTC (ESPnet), no LM	E2E CTC (ESPnet), char LM	E2E CTC (ESPnet), syllable LM	Pipeline (Kaldi), mDNN+SAT (+tone)
%CER	22.2	19.1	20.2	N/A
%SyER	54.7	38.7	42.3	9.48

Compared with the best system of the Kaldi recipe (hybrid HMM-DNN with SAT features and tonal information [2]), the E2E system lags behind in syllable error rate in all configurations. This is expected, as an E2E model requires a lot more training data to achieve good performance.
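To make the two metrics in the table concrete, here is a minimal sketch of how a character error rate (CER) and a syllable error rate (SyER) can be computed from a reference and a hypothesis using edit distance. It is not the scoring tool used by the recipe, and all function names are assumptions for this example. Here spaces count as characters for CER, and syllables are taken to be whitespace-separated tokens.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_unit in enumerate(ref, 1):
        cur = [i]
        for j, hyp_unit in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_unit != hyp_unit),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def error_rate(ref_units, hyp_units):
    """Errors as a percentage of the reference length (must be non-empty)."""
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def character_error_rate(ref, hyp):
    # NFC so that a letter with diacritics counts as one character.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    return error_rate(list(ref), list(hyp))


def syllable_error_rate(ref, hyp):
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    return error_rate(ref.split(), hyp.split())


if __name__ == "__main__":
    ref = "nhận dạng tiếng nói"
    hyp = "nhận dang tiếng nói"  # one missing tone mark
    print(round(character_error_rate(ref, hyp), 2))  # 5.26
    print(syllable_error_rate(ref, hyp))             # 25.0
```

The toy example also hints at why SyER is so much higher than CER in the table: a single wrong character makes the whole syllable count as an error.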

Discussion

About vivos recipe

The performance of the E2E system is a lot worse than that of the pipeline system. One reason is the simplicity of the recipe. If you really want to push the performance, there are several simple tactics you could try.

Try different architectures: besides the CTC network, ESPnet also supports an attention network and a hybrid of the two.
Train the RNNLM with external text corpora: the LM of the Kaldi recipe is trained on a 500 MB text corpus, while the LM of the ESPnet recipe is trained only on the transcripts of the training data (1.1 MB).
Use a more sophisticated linguistic unit: a linguistic unit that takes the unique characteristics of the Vietnamese language (e.g., tones) into account might help as well.
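As one possible illustration of the last tactic, the sketch below separates the tone of a syllable from its letters using Unicode decomposition, so that the tone becomes its own output token. This is a hypothetical scheme for illustration only, not something the vivos recipe implements or that has been evaluated here; the function name and the tone token names are assumptions.

```python
import unicodedata

# Combining marks that encode the Vietnamese tones. The other diacritics
# (circumflex, breve, horn) change the vowel itself and are kept on the letter.
TONE_MARKS = {
    "\u0300": "<huyen>",  # grave accent
    "\u0301": "<sac>",    # acute accent
    "\u0309": "<hoi>",    # hook above
    "\u0303": "<nga>",    # tilde
    "\u0323": "<nang>",   # dot below
}


def split_tone(syllable):
    """Return the toneless letters of a syllable followed by a tone token.

    Hypothetical tone-aware unit; token names are made up for this sketch.
    """
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = "<ngang>"  # level tone: no tone mark present
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)
    # Recompose so that vowels such as "â" or "ê" are single characters again.
    toneless = unicodedata.normalize("NFC", "".join(kept))
    return list(toneless) + [tone]


if __name__ == "__main__":
    for syllable in "nhận dạng tiếng nói".split():
        print(split_tone(syllable))
```

With such a unit, the output vocabulary contains the toneless letters plus six tone tokens instead of every letter and tone combination. Whether this actually helps recognition would have to be tested.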

About future direction

While the performance of the E2E approach on the VIVOS corpus is underwhelming, this does not mean that E2E is a bad direction for practical Vietnamese ASR systems, because VIVOS is a laboratory dataset: the speech is relatively clean and the recording prompts were carefully prepared to contain mostly Vietnamese syllables.

However, Vietnamese syllables are actually not the main challenge for Vietnamese speech processing systems, as they are quite easy to handle with expert or even non-expert knowledge [2]. The real challenge is the high-volume and inconsistent use of foreign words mixed into the native language.

The simplicity of the E2E approach might help alleviate this problem, as both native syllables and borrowed words can be handled in the same manner instead of depending on two separate sources of expert knowledge. All of these challenges can then be addressed with one unified and simple (but potentially expensive) solution: collecting a lot of diverse, real-life data. Unfortunately, Vietnamese lacks a large-scale open speech corpus, so there is not much more we can do at the moment.

Conclusion

While E2E is the latest advance in speech processing systems, that does not mean it is the best solution for every scenario. For speech recognition tasks with a well-defined and limited problem, the conventional pipeline, which makes use of prior expert and non-expert knowledge, still yields the best and most reliable performance.

However, in the long run, E2E might be a good approach for Vietnamese ASR systems, especially for the general speech recognition task.

References

[1] Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, "ESPnet: End-to-End Speech Processing Toolkit", Proc. Interspeech'18, pp. 2207-2211 (2018)

[2] Hieu-Thi Luong and Hai-Quan Vu, "A non-expert Kaldi recipe for Vietnamese Speech Recognition System", Proc. WLSI-3 & OIAF4HLT-2 (2016)