This is a curated list of open speech corpora for academic use, covering Japanese, Korean and Vietnamese. While speech processing systems keep achieving outstanding results for major languages such as English and Chinese, development for other languages is less active. This list was created to make it easier to jump-start a speech processing project and to spark interest in research and development of speech processing systems.

Japanese, Korean and Vietnamese

Japanese, Korean and Vietnamese were all heavily influenced by Chinese in the past, but their modern forms have moved in different directions, which creates unique and challenging problems for speech processing systems. Japanese still uses Chinese characters (Kanji) alongside Hiragana, Katakana and Romaji as its writing systems. Korean uses Hangul as its main writing system, although Chinese characters are still recognized in the culture. Vietnamese has completely abandoned Chinese characters and is written exclusively in an extended Roman alphabet, yet many borrowed words (sounds) are still used in everyday life, and most people do not notice their origin. Moreover, all three languages are experiencing code-mixing, mostly with English, as the Internet becomes a utility for everyone.

This post presents a curated list of speech corpora for these three languages. All of the corpora should be usable for academic purposes. For commercial use, go to the corpus homepage and contact the owners directly. This post will be updated when new corpora are found.

Available tags: #single, #multiple, #dialect, #polyglot, #in-the-wild, #code-switch, #bilingual

Japanese


JSUT

  • Tags: #single
  • Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo
  • Type: Single speaker, Female (Native Japanese)
  • Amount: 10 hours
  • Audio quality: 48kHz, recorded in anechoic room
  • License: can be used for research
  • Link: JSUT
  • Release year: 2017

JVS

  • Tags: #multiple
  • Description: Japanese versatile speech corpus
  • Type: Multiple speaker (Professional speakers)
  • Amount: 30 hours, 100 speakers
  • Audio quality: 24kHz, studio recording
  • License: can be used for research
  • Link: JVS
  • Release year: 2019

CSS10-ja

  • Tags: #single
  • Description: A collection of single speaker speech datasets for 10 languages - Japanese
  • Type: Single speaker, Male (Native Japanese)
  • Amount: 14.9 hours
  • Audio quality: 22kHz, LibriVox audiobook
  • License: CC0, public domain
  • Link: CSS10-ja
  • Release year: 2019

JSUT-book

  • Tags: #single
  • Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo, audiobook
  • Type: Single speaker, Female (non-professional Japanese speaker)
  • Amount: ~1 hour
  • Audio quality: 48kHz
  • License: can be used for research
  • Link: JSUT-book
  • Release year: 2020

JSSS

  • Tags: #single
  • Description: Japanese speech corpus for summarization and simplification
  • Type: Single speaker, Female (non-professional Japanese speaker)
  • Amount: ~8 hours
  • Audio quality: 24kHz
  • License: can be used for research
  • Link: JSSS
  • Release year: 2020

Also check:

TEDxJP-10K

  • Tags: #multiple
  • Description: Japanese speech dataset for ASR evaluation built from Japanese TEDx videos and their subtitles
  • Type: Multiple speakers, TEDx talks
  • Amount: 10,000 segments from videos in the YouTube “TEDx talks in Japanese” playlist
  • Audio quality: varying
  • License: N/A
  • Link: TEDxJP-10k
  • Release year: 2020

LaboroTVSpeech

  • Tags: #multiple
  • Description: A large-scale Japanese speech corpus on TV recordings
  • Type: Multiple speakers, TV recordings and their subtitles
  • Amount: over 2,000 hours of speech
  • Audio quality: 16 kHz
  • License: N/A
  • Link: LaboroTVSpeech
  • Release year: 2020

JMD

  • Tags: #multiple #dialect
  • Description: Japanese multi-dialect corpus for speech synthesis
  • Type: Several speakers, voices of native dialect speakers
  • Amount: 2 speakers, ~2 hours per speaker
  • Audio quality: 24kHz
  • License: can be used for research
  • Link: JMD
  • Release year: 2021

J-KAC

  • Tags: #single
  • Description: Japanese Kamishibai and audiobook corpus
  • Type: Single speaker, Male (Professional speaker)
  • Amount: ~9 hours
  • Audio quality: 48kHz
  • License: research only
  • Link: J-KAC
  • Release year: 2021

JTubeSpeech

  • Tags: #multiple #in-the-wild
  • Description: Corpus of Japanese speech collected from YouTube
  • Type: YouTube scraping, natural and synthetic speech (TTS)
  • Amount: 10,000 hours, lots of speakers
  • Audio quality: varying
  • License: N/A
  • Link: JTubeSpeech
  • Release year: 2021

tri-jek

  • Tags: #single #polyglot
  • Description: Japanese-English-Korean tri-lingual speech corpus
  • Type: Single speaker, Female, Japanese (native), Korean (native), English
  • Amount: 11 hours (ja: 2.8, kr: 6.7, and en: 1.5 hours)
  • Audio quality: 24kHz
  • License: can be used for research
  • Link: tri-jek
  • Release year: 2021

Kokoro

  • Tags: #single
  • Description: Kokoro Speech Dataset is a public domain Japanese speech dataset
  • Type: Single Speaker, Male, native Japanese, Librivox audiobook
  • Amount: ~60 hours
  • Audio quality: 22.05 kHz
  • License: CC0, public domain
  • Link: Kokoro-Speech-Dataset
  • Release year: 2021

JECS

  • Tags: #single, #bilingual, #code-switch
  • Description: Japanese-English bilingual code-switching corpus
  • Type: Single speaker, Male, bilingual; parallel English and Japanese utterances plus code-switched utterances with acted emotion
  • Amount: 2.5 hours in total
  • Audio quality: 24kHz
  • License: Can be used for research
  • Link: jecs
  • Release year: 2022

SpeedSpeech-JA-2022

  • Tags: #multiple
  • Description: Speech-rate conversion corpus, each sentence read at different speeds by the same speaker
  • Type: One male and one female professional narrator
  • Amount: 324 sentences per speech rate per speaker
  • Audio quality: 48 kHz, 24-bit
  • License: CC BY-NC 4.0
  • Link: SpeedSpeech-JA-2022
  • Release year: 2022

SMASH corpus

  • Tags: #multiple
  • Description: A spontaneous speech corpus recording third-person audio commentaries on gameplay
  • Type: Players’ conversations (Super Smash Bros. Ultimate), game screen capture, third-person commentaries and transcripts
  • Amount: ~3.2 hours of speech, multiple matches
  • Audio quality: 16 kHz
  • License: Can be used for research
  • Link: smash
  • Release year: 2022

Korean


Seoul Corpus

  • Tags: #multiple
  • Description: The Korean Corpus of Spontaneous Speech
  • Type: Multiple speakers, age/gender groups, interviews, labeling
  • Amount: 42.8 hours, 40 speakers
  • Audio quality: 22.05kHz
  • License: CC BY-NC 2.0
  • Link: Seoul Corpus - OpenSLR
  • Release year: 2015

KSS Dataset

  • Tags: #single
  • Description: Korean Single speaker Speech Dataset
  • Type: Single speaker, Female (Professional voice actress)
  • Amount: 12 hours, 12853 utterances
  • Audio quality: 44.1kHz
  • License: non-commercial use only
  • Link: KSS Dataset
  • Release year: 2018

Zeroth Korean

  • Tags: #multiple
  • Description: Audio data of Project Zeroth for Korean Speech Recognition
  • Type: Multiple speakers (Crowdsourcing)
  • Amount: 76.6 hours, 35139 utterances, 137 speakers, 16472 unique sentences
  • Audio quality: crowdsourced recordings made with MoreCoin (Android phones as recording devices)
  • License: CC BY 4.0
  • Link: Zeroth Project, alias: Openslr - Zeroth Korean
  • Release year: 2018

Pansori-TEDxKR

  • Tags: #multiple
  • Description: Korean speech corpus generated from Korean language TEDx talks
  • Type: Multiple speakers (TEDx talks)
  • Amount: ~3 hours, 41 speakers
  • Audio quality: 16kHz, TEDx talks
  • License: CC BY-NC-ND 4.0
  • Link: Pansori TEDxKR Corpus, alias: Openslr - Pansori-TEDxKR
  • Release year: 2019

Deeply Korean Read Speech corpus

  • Tags: #multiple
  • Description: Korean speakers reading scripts that pair 3 text sentiments with 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphones.
  • Type: Multiple speakers
  • Amount: ~3 hours, ~2000 utterances (a 1% subset of a commercial corpus)
  • Audio quality: Studio apartment, Dance studio, Anechoic chamber
  • License: CC BY-NC-ND 4.0
  • Link: Deeply Korean read speech corpus, Openslr
  • Release year: 2021

Deeply parent-child vocal interaction dataset

  • Tags: #multiple
  • Description: Interactions between parent-child pairs (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphones.
  • Type: Multiple speakers
  • Amount: ~16 hours, ~20000 utterances (a 1% subset of a commercial corpus)
  • Audio quality: Studio apartment, Dance studio, Anechoic chamber
  • License: CC BY-NC-ND 4.0
  • Link: Deeply parent-child vocal interaction dataset, Openslr
  • Release year: 2021

tri-jek

  • Tags: #single #polyglot

Details are in the Japanese section.

Vietnamese


VIVOS

  • Tags: #multiple
  • Description: Vietnamese speech corpus for speech recognition
  • Type: Multiple speakers (volunteers)
  • Amount: 15 hours, 12420 utterances, 65 speakers
  • Audio quality: 16kHz, quiet room
  • License: CC BY-NC-SA 4.0
  • Link: VIVOS
  • Release year: 2016
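
Because every entry above uses the same fields and tags, the list can be turned into a small searchable table. The following is a minimal sketch in Python: the `Corpus` class, the `find_corpora` helper, and the language codes `ja`, `ko` and `vi` are hypothetical names chosen for illustration, and only a few entries are copied from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    """One catalog entry (hypothetical structure, not an official schema)."""
    name: str
    language: str           # assumed codes: "ja", "ko", "vi"
    tags: frozenset         # tags from the list, without the leading '#'
    hours: float            # amount of speech as stated in the entry
    sample_rate_khz: float  # audio quality as stated in the entry


# A handful of entries copied from the list above (not the full catalog).
CORPORA = [
    Corpus("JSUT", "ja", frozenset({"single"}), 10.0, 48.0),
    Corpus("JVS", "ja", frozenset({"multiple"}), 30.0, 24.0),
    Corpus("JECS", "ja", frozenset({"single", "bilingual", "code-switch"}), 2.5, 24.0),
    Corpus("Seoul Corpus", "ko", frozenset({"multiple"}), 42.8, 22.05),
    Corpus("KSS", "ko", frozenset({"single"}), 12.0, 44.1),
    Corpus("VIVOS", "vi", frozenset({"multiple"}), 15.0, 16.0),
]


def find_corpora(corpora, language=None, required_tags=()):
    """Return the entries matching a language and carrying all required tags."""
    required = set(required_tags)
    return [
        c for c in corpora
        if (language is None or c.language == language) and required <= c.tags
    ]


if __name__ == "__main__":
    for corpus in find_corpora(CORPORA, language="ja", required_tags=["single"]):
        print(corpus.name, corpus.hours, "hours")
```

For instance, `find_corpora(CORPORA, language="ja", required_tags=["single"])` selects JSUT and JECS from this sample. License terms are deliberately left out of the sketch; they should always be checked on each corpus homepage.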

Update History


  • 20220916: Updated the VIVOS link and added several Japanese speech corpora: TEDxJP-10K, LaboroTVSpeech, Kokoro, JECS, SpeedSpeech-JA-2022, SMASH corpus
  • 20211123: Removed VinBigdata-VLSP2020-100h, added a new Korean corpus Seoul Corpus, a new Japanese corpus JTubeSpeech, and a new polyglot corpus tri-jek.
  • 20210628: Added several Japanese corpora JSSS, JMD, J-KAC
  • 20210209: Added a new Japanese single-speaker corpus JSUT-book and 2 Korean multi-speaker corpora Deeply Korean read speech corpus, Deeply parent-child vocal interaction dataset
  • 20201209: Added a new Vietnamese multi-speaker corpus VinBigdata-VLSP2020-100h
  • 20190925: Added a new Japanese single-speaker corpus CSS10-ja
  • 20190902: Added a new Japanese multi-speaker corpus JVS
  • 20190215: Added a new Korean multi-speaker corpus Pansori-TEDxKR
  • 20180422: Initial post with 4 corpora JSUT, KSS, Zeroth-Korean, VIVOS