This is a curated list of open speech corpora for Japanese, Korean and Vietnamese that can be used for academic purposes. While speech processing systems have improved rapidly for major languages like English and Chinese, development for other languages is not as active. This list was created to make it easier to jump-start a speech processing project and to spark interest in research and development of speech processing systems.

Japanese, Korean and Vietnamese

Japanese, Korean and Vietnamese were heavily influenced by Chinese in the past, but their modern forms have moved in different directions, which creates unique and challenging problems for speech processing systems. Japanese still uses Chinese characters (Kanji) alongside Hiragana, Katakana and Romaji as its writing systems. Korean uses Hangul as its main writing system, but Chinese characters are still recognized in the culture. Vietnamese has completely abandoned Chinese characters and is written exclusively in an extended Roman alphabet, yet many borrowed words (sounds) are still used in everyday life, and most people do not notice their origin. Moreover, all three languages are experiencing code-mixing, mostly with English, as the Internet becomes a utility for everyone.

This post presents a curated list of speech corpora for these three languages. All of these corpora should be usable for academic purposes. For commercial use, go to the corpus homepage and contact the owners directly. This post will be updated when new corpora are found.


JSUT single

  • Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo
  • Type: Single speaker, Female (Native Japanese)
  • Amount: 10 hours
  • Audio quality: 48kHz, recorded in anechoic room
  • License: can be used for research
  • Link: JSUT
  • Release year: 2017


KSS Dataset single

  • Description: Korean Single speaker Speech Dataset
  • Type: Single speaker, Female (Professional voice actress)
  • Amount: 12 hours, 12853 utterances
  • Audio quality: 44.1kHz
  • License: non-commercial use only
  • Link: KSS Dataset
  • Release year: 2018

Zeroth Korean multiple

  • Description: Audio data of Project Zeroth for Korean Speech Recognition
  • Type: Multiple speakers (Crowdsourcing)
  • Amount: 76.6 hours, 35139 utterances, 137 speakers, 16472 unique sentences
  • Audio quality: crowdsourced via MoreCoin (recorded on Android phones)
  • License: CC BY 4.0
  • Link: Zeroth Project, alias: Openslr - Zeroth Korean
  • Release year: 2018

Pansori-TEDxKR multiple

  • Description: Korean speech corpus generated from Korean language TEDx talks
  • Type: Multiple speakers (TEDx talks)
  • Amount: ~3 hours, 41 speakers
  • Audio quality: TEDx talks
  • License: CC BY-NC-ND 4.0
  • Link: Pansori TEDxKR Corpus, alias: Openslr - Pansori-TEDxKR
  • Release year: 2019


VIVOS multiple

  • Description: Vietnamese speech corpus for speech recognition
  • Type: Multiple speakers (Volunteers)
  • Amount: 15 hours, 12420 utterances, 65 speakers
  • Audio quality: 16kHz, quiet room
  • License: CC BY-NC-SA 4.0
  • Link: VIVOS
  • Release year: 2017
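
To compare the corpora listed above at a glance, the figures from each entry can be put into a small table in code. The following is only a minimal sketch: the `Corpus` class and the helper names are hypothetical and chosen for illustration, the numbers are copied from the entries above, and Pansori-TEDxKR is entered as 3.0 hours even though the list gives it as approximately 3 hours.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Corpus:
    """Hypothetical record holding the fields given in the list above."""
    name: str
    language: str
    hours: float
    utterances: Optional[int]  # None where the list gives no utterance count
    license: str


# Figures copied from the entries above; Pansori-TEDxKR is "~3 hours" in the list.
CORPORA: List[Corpus] = [
    Corpus("JSUT", "Japanese", 10.0, None, "research use"),
    Corpus("KSS Dataset", "Korean", 12.0, 12853, "non-commercial"),
    Corpus("Zeroth Korean", "Korean", 76.6, 35139, "CC BY 4.0"),
    Corpus("Pansori-TEDxKR", "Korean", 3.0, None, "CC BY-NC-ND 4.0"),
    Corpus("VIVOS", "Vietnamese", 15.0, 12420, "CC BY-NC-SA 4.0"),
]


def mean_utterance_seconds(corpus: Corpus) -> Optional[float]:
    """Average utterance length in seconds, or None if the count is unknown."""
    if corpus.utterances is None:
        return None
    return corpus.hours * 3600 / corpus.utterances


def hours_by_language(corpora: List[Corpus]) -> Dict[str, float]:
    """Total listed hours of audio per language."""
    totals: Dict[str, float] = {}
    for corpus in corpora:
        totals[corpus.language] = totals.get(corpus.language, 0.0) + corpus.hours
    return totals


if __name__ == "__main__":
    for corpus in CORPORA:
        mean = mean_utterance_seconds(corpus)
        length = "n/a" if mean is None else f"{mean:.2f} s"
        print(f"{corpus.name:15s} {corpus.hours:6.1f} h  mean utterance: {length}")
    print(hours_by_language(CORPORA))
```

Under these figures, the average utterance comes out at roughly 3.4 seconds for KSS, 7.8 seconds for Zeroth Korean and 4.3 seconds for VIVOS, which gives a rough idea of how the read sentences differ in length between corpora.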

Update History

  • 20190215: Added a new Korean corpus Pansori-TEDxKR
  • 20180422: Initial post with 4 corpora: JSUT, KSS, Zeroth-Korean, VIVOS