I maintain a curated list of open speech corpora for Japanese, Korean and Vietnamese that can be used for academic purposes. While speech processing systems have been achieving outstanding results at a rapid pace for major languages like English and Chinese, development for other languages has not been as active. This list was created to make it easier to jump-start a speech processing project and to spark interest in research and development of speech processing systems.

Japanese, Korean and Vietnamese

Japanese, Korean and Vietnamese were all heavily influenced by Chinese in the past, but their modern forms have moved in different directions, which creates unique and challenging problems for speech processing systems. Japanese still uses Chinese characters (Kanji) alongside Hiragana, Katakana and Romaji as writing systems. Korean uses Hangul as its main writing system, although Chinese characters are still recognized in the culture. Vietnamese has completely abandoned Chinese characters and is written exclusively in an extended Roman alphabet, yet many borrowed words (sounds) are still used in everyday life, and most people do not notice their origin. Moreover, all three languages are experiencing code-mixing, mostly with English, as the Internet becomes a utility for everyone.

This post presents a curated list of speech corpora for these 3 languages. All of the corpora should be usable for academic purposes. For commercial use, go to the corpus homepage and contact the owners directly. This post will be updated when new corpora are found.

Japanese

JSUT #single

  • Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo
  • Type: Single speaker, Female (Native Japanese)
  • Amount: 10 hours
  • Audio quality: 48kHz, recorded in anechoic room
  • License: can be used for research
  • Link: JSUT
  • Release year: 2017

JVS #multiple

  • Description: Japanese versatile speech corpus
  • Type: Multiple speakers (Professional speakers)
  • Amount: 30 hours, 100 speakers
  • Audio quality: 24kHz, studio recording
  • License: can be used for research
  • Link: JVS
  • Release year: 2019

CSS10-ja #single

  • Description: A collection of single speaker speech datasets for 10 languages - Japanese
  • Type: Single speaker, Male (Native Japanese)
  • Amount: 14.9 hours
  • Audio quality: 22kHz, LibriVox audiobook
  • License: CC0, public domain
  • Link: CSS10-ja
  • Release year: 2019

JSUT-book #single

  • Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo, audiobook
  • Type: Single speaker, Female (non-professional Japanese speaker)
  • Amount: ~1 hour
  • Audio quality: 48kHz
  • License: can be used for research
  • Link: JSUT-book
  • Release year: 2020

JSSS #single

  • Description: Japanese speech corpus for summarization and simplification
  • Type: Single speaker, Female (non-professional Japanese speaker)
  • Amount: ~8 hours
  • Audio quality: 24kHz
  • License: can be used for research
  • Link: JSSS
  • Release year: 2020

JMD #multiple #dialect

  • Description: Japanese multi-dialect corpus for speech synthesis
  • Type: Several speakers (native dialect speakers)
  • Amount: 2 speakers, ~2 hours per speaker
  • Audio quality: 24kHz
  • License: can be used for research
  • Link: JMD
  • Release year: 2021

J-KAC #single

  • Description: Japanese Kamishibai and audiobook corpus
  • Type: Single speaker, Male (Professional speaker)
  • Amount: ~9 hours
  • Audio quality: 48kHz
  • License: research only
  • Link: J-KAC
  • Release year: 2021
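
Because every entry uses the same fields and hashtags, the Japanese list above can be mirrored as small records and filtered when picking a corpus. The following is only a minimal sketch: the `Corpus` class and the `select` helper are hypothetical names introduced for illustration, and the values are copied from the entries above (sampling rates as listed, so CSS10-ja is entered as 22 kHz).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    """One entry of the list (hypothetical structure, not an official schema)."""
    name: str
    tags: frozenset
    sample_rate_hz: int
    release_year: int


# Values copied from the Japanese entries in this post.
JAPANESE = [
    Corpus("JSUT", frozenset({"single"}), 48_000, 2017),
    Corpus("JVS", frozenset({"multiple"}), 24_000, 2019),
    Corpus("CSS10-ja", frozenset({"single"}), 22_000, 2019),
    Corpus("JSUT-book", frozenset({"single"}), 48_000, 2020),
    Corpus("JSSS", frozenset({"single"}), 24_000, 2020),
    Corpus("JMD", frozenset({"multiple", "dialect"}), 24_000, 2021),
    Corpus("J-KAC", frozenset({"single"}), 48_000, 2021),
]


def select(corpora, tag, min_sample_rate_hz=0):
    """Return names of corpora that carry `tag` and meet the sampling-rate floor."""
    return [
        c.name
        for c in corpora
        if tag in c.tags and c.sample_rate_hz >= min_sample_rate_hz
    ]


if __name__ == "__main__":
    # Single-speaker corpora recorded at 48 kHz.
    print(select(JAPANESE, "single", 48_000))
    # Corpora tagged as dialect data.
    print(select(JAPANESE, "dialect"))
```

The same pattern extends to the Korean and Vietnamese entries by adding records with their tags and sampling rates. License terms are left out of the sketch on purpose, since they should always be checked on the corpus homepage.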

Korean

KSS Dataset #single

  • Description: Korean Single speaker Speech Dataset
  • Type: Single speaker, Female (Professional voice actress)
  • Amount: 12 hours, 12853 utterances
  • Audio quality: 44.1kHz
  • License: non-commercial use only
  • Link: KSS Dataset
  • Release year: 2018

Zeroth Korean #multiple

  • Description: Audio data of Project Zeroth for Korean Speech Recognition
  • Type: Multiple speakers (Crowdsourcing)
  • Amount: 76.6 hours, 35139 utterances, 137 speakers, 16472 unique sentences
  • Audio quality: crowdsourced using MoreCoin (recorded on Android phones)
  • License: CC BY 4.0
  • Link: Zeroth Project, alias: Openslr - Zeroth Korean
  • Release year: 2018

Pansori-TEDxKR #multiple

  • Description: Korean speech corpus generated from Korean language TEDx talks
  • Type: Multiple speakers (TEDx talks)
  • Amount: ~3 hours, 41 speakers
  • Audio quality: 16kHz, TEDx talks
  • License: CC BY-NC-ND 4.0
  • Link: Pansori TEDxKR Corpus, alias: Openslr - Pansori-TEDxKR
  • Release year: 2019

Deeply Korean Read Speech corpus #multiple

  • Description: Pairs of Koreans reading scripts with 3 text sentiments using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
  • Type: Multiple speakers
  • Amount: ~3 hours, ~2000 utterances (a 1% subset of a commercial corpus)
  • Audio quality: Studio apartment, Dance studio, Anechoic chamber
  • License: CC BY-NC-ND 4.0
  • Link: Deeply Korean read speech corpus, Openslr
  • Release year: 2021

Deeply parent-child vocal interaction dataset #multiple

  • Description: Interactions between parent-child pairs (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
  • Type: Multiple speakers
  • Amount: ~16 hours, ~20000 utterances (a 1% subset of a commercial corpus)
  • Audio quality: Studio apartment, Dance studio, Anechoic chamber
  • License: CC BY-NC-ND 4.0
  • Link: Deeply parent-child vocal interaction dataset, Openslr
  • Release year: 2021

Vietnamese

VIVOS #multiple

  • Description: Vietnamese speech corpus for speech recognition
  • Type: Multiple speakers (volunteers)
  • Amount: 15 hours, 12420 utterances, 65 speakers
  • Audio quality: 16kHz, quiet room
  • License: CC BY-NC-SA 4.0
  • Link: VIVOS
  • Release year: 2017

VinBigdata-VLSP2020-100h #multiple

  • Description: The speech corpus for the automatic speech recognition task in VLSP-2020
  • Type: Multiple speakers (reading & spontaneous speech)
  • Amount: ~100 hours (~20 hours of reading speech and ~80 hours of spontaneous speech)
  • Audio quality: 16kHz, reading speech was recorded with smartphone in various environments, spontaneous speech was crawled and manually transcribed
  • License: UNSPECIFIED! Use it at your own risk
  • Link: VinBigdata-VLSP2020-100h
  • Release year: 2020
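
For the corpora in this post that list both total hours and an utterance count (KSS, Zeroth Korean and VIVOS), a rough average utterance length follows from simple arithmetic. The sketch below is illustrative only: `mean_utterance_seconds` is a hypothetical helper, and since the hour figures in the entries are rounded, the results are approximations rather than official statistics.

```python
def mean_utterance_seconds(hours, utterances):
    """Approximate average utterance length in seconds."""
    return hours * 3600 / utterances


# (hours, utterances) copied from the "Amount" field of each entry.
STATS = {
    "KSS": (12, 12853),
    "Zeroth Korean": (76.6, 35139),
    "VIVOS": (15, 12420),
}

if __name__ == "__main__":
    for name, (hours, utterances) in STATS.items():
        seconds = mean_utterance_seconds(hours, utterances)
        print(f"{name}: about {seconds:.1f} s per utterance")
```

A quick estimate like this can help when deciding on maximum sequence lengths or batch sizes before downloading a corpus.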

Update History

  • 20210628: Added several Japanese corpora JSSS, JMD, J-KAC
  • 20210209: Added a new Japanese single-speaker corpus JSUT-book and 2 Korean multi-speaker corpora Deeply Korean read speech corpus, Deeply parent-child vocal interaction dataset
  • 20201209: Added a new Vietnamese multi-speaker corpus VinBigdata-VLSP2020-100h
  • 20190925: Added a new Japanese single-speaker corpus CSS10-ja
  • 20190902: Added a new Japanese multi-speaker corpus JVS
  • 20190215: Added a new Korean multi-speaker corpus Pansori-TEDxKR
  • 20180422: Initial post with 4 corpora JSUT, KSS, Zeroth-Korean, VIVOS