A curated list of Japanese, Korean and Vietnamese open speech corpora

This is a curated list of open speech corpora for Japanese, Korean and Vietnamese that can be used for academic purposes. While speech processing systems have achieved outstanding results at a rapid pace for major languages like English and Chinese, development for other languages is not as active. This list was created to make it easier to jump-start a speech processing project and to spark interest in research and development of speech processing systems.

Japanese, Korean and Vietnamese

Japanese, Korean and Vietnamese were all heavily influenced by Chinese in the past, but their modern forms have moved in different directions, which creates unique and challenging problems for speech processing systems. Japanese still uses Chinese characters (Kanji) along with Hiragana, Katakana and Romaji as writing systems. Korean uses Hangul as its main writing system, though Chinese characters are still recognized in the culture. Vietnamese has completely abandoned Chinese characters and is written exclusively in an extended Roman alphabet, yet many borrowed words (sounds) are still used in everyday life, and most people do not notice their origin. Moreover, all three languages are experiencing code-mixing, mostly with English, as the Internet becomes a utility for everyone.

This post presents a curated list of speech corpora for these three languages. These corpora should be usable for academic purposes. For commercial use, go to the corpus homepage and contact the owners directly. This post will be updated when new corpora are found.

Available tags: #single, #multiple, #dialect, #polyglot, #in-the-wild, #code-switch, #bilingual
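
The tags can serve as a simple filter when choosing a corpus. The following is a minimal sketch in Python, assuming a hypothetical in-memory list with a few entries copied by hand from this post. The `Corpus` class, the `CORPORA` list and the `filter_by_tag` helper are illustrative names only and are not part of any corpus tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Corpus:
    name: str
    language: str
    tags: frozenset


# A few entries copied by hand from this post (not exhaustive).
CORPORA = [
    Corpus("JSUT", "Japanese", frozenset({"single"})),
    Corpus("JVS", "Japanese", frozenset({"multiple"})),
    Corpus("JMD", "Japanese", frozenset({"multiple", "dialect"})),
    Corpus("JECS", "Japanese", frozenset({"single", "bilingual", "code-switch"})),
    Corpus("KSS Dataset", "Korean", frozenset({"single"})),
    Corpus("VIVOS", "Vietnamese", frozenset({"multiple"})),
]


def filter_by_tag(corpora, tag, language=None):
    """Return names of corpora carrying `tag`, optionally limited to one language."""
    tag = tag.lstrip("#")  # accept both "#single" and "single"
    return [
        c.name
        for c in corpora
        if tag in c.tags and (language is None or c.language == language)
    ]


print(filter_by_tag(CORPORA, "#single"))
print(filter_by_tag(CORPORA, "multiple", language="Japanese"))
```

With the sample entries above, the first call lists the single-speaker corpora across all three languages, and the second narrows the multi-speaker corpora down to Japanese.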

Japanese

JSUT

Tags: #single
Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo
Type: Single speaker, Female (Native Japanese)
Amount: 10 hours
Audio quality: 48kHz, recorded in anechoic room
License: can be used for research
Link: JSUT
Release year: 2017

JVS

Tags: #multiple
Description: Japanese versatile speech corpus
Type: Multiple speaker (Professional speakers)
Amount: 30 hours, 100 speakers
Audio quality: 24kHz, studio recording
License: can be used for research
Link: JVS
Release year: 2019

CSS10-ja

Tags: #single
Description: A collection of single speaker speech datasets for 10 languages - Japanese
Type: Single speaker, Male (Native Japanese)
Amount: 14.9 hours
Audio quality: 22kHz, LibriVox audiobook
License: CC0, public domain
Link: CSS10-ja
Release year: 2019

JSUT-book

Tags: #single
Description: Japanese speech corpus of Saruwatari Lab, University of Tokyo, audiobook
Type: Single speaker, Female (non-professional Japanese speaker)
Amount: ~1 hour
Audio quality: 48kHz
License: can be used for research
Link: JSUT-book
Release year: 2020

JSSS

Tags: #single
Description: Japanese speech corpus for summarization and simplification
Type: Single speaker, Female (non-professional Japanese speaker)
Amount: ~8 hours
Audio quality: 24kHz
License: can be used for research
Link: JSSS
Release year: 2020

Also check:

JSSS-misc: misc tasks of JSSS corpus

TEDxJP-10K

Tags: #multiple
Description: Japanese speech dataset for ASR evaluation built from Japanese TEDx videos and their subtitles
Type: Multiple speakers, TEDx talks
Amount: 10,000 segments from videos in the YouTube “TEDx talks in Japanese” playlist
Audio quality: varying
License: N/A
Link: TEDxJP-10k
Release year: 2020

LaboroTVSpeech

Tags: #multiple
Description: A large-scale Japanese speech corpus on TV recordings
Type: Multiple speakers, TV recordings and their subtitles
Amount: over 2,000 hours of speech
Audio quality: 16 kHz
License: N/A
Link: LaboroTVSpeech
Release year: 2020

JMD

Tags: #multiple, #dialect
Description: Japanese multi-dialect corpus for speech synthesis
Type: Several speakers, voices of native dialect speakers
Amount: 2 speakers, ~2 hours per speaker
Audio quality: 24kHz
License: can be used for research
Link: JMD
Release year: 2021

J-KAC

Tags: #single
Description: Japanese Kamishibai and audiobook corpus
Type: Single speaker, Male (Professional speaker)
Amount: ~9 hours
Audio quality: 48kHz
License: research only
Link: J-KAC
Release year: 2021

JTubeSpeech

Tags: #multiple, #in-the-wild
Description: Corpus of Japanese speech collected from YouTube
Type: YouTube scraping, natural and synthetic speech (TTS)
Amount: 10,000 hours, lots of speakers
Audio quality: varying
License: N/A
Link: JTubeSpeech
Release year: 2021

tri-jek

Tags: #single, #polyglot
Description: Japanese-English-Korean tri-lingual speech corpus
Type: Single speaker, Female, Japanese (native), Korean (native), English
Amount: 11 hours (ja: 2.8, kr: 6.7, and en: 1.5 hours)
Audio quality: 24kHz
License: can be used for research
Link: tri-jek
Release year: 2021

Kokoro

Tags: #single
Description: Kokoro Speech Dataset is a public domain Japanese speech dataset
Type: Single speaker, Male, native Japanese, LibriVox audiobook
Amount: ~60 hours
Audio quality: 22.05 kHz
License: CC0, public domain
Link: Kokoro-Speech-Dataset
Release year: 2021

JECS

Tags: #single, #bilingual, #code-switch
Description: Japanese-English bilingual code-switching corpus
Type: Single speaker, Male, bilingual speakers, parallel English and Japanese utterances + code-switch utterance with acted emotion
Amount: 2.5 hours in total
Audio quality: 24kHz
License: Can be used for research
Link: jecs
Release year: 2022

SpeedSpeech-JA-2022

Tags: #multiple
Description: Speech-rate conversion corpus, with each sentence read at different speeds by the same speaker
Type: One male and one female professional narrator
Amount: 324 sentences per speed rate per speaker
Audio quality: 48 kHz, 24-bit
License: CC BY-NC 4.0
Link: SpeedSpeech-JA-2022
Release year: 2022

SMASH corpus

Tags: #multiple
Description: A spontaneous speech corpus recording third-person audio commentaries on gameplay
Type: Players’ conversations (Super Smash Bros. Ultimate), game screen capture, third-person commentaries and transcripts
Amount: ~3.2 hours of speech, multiple matches
Audio quality: 16 kHz
License: Can be used for research
Link: smash
Release year: 2022

Korean

Seoul Corpus

Tags: #multiple
Description: The Korean Corpus of Spontaneous Speech
Type: Multiple speakers, age/gender groups, interviews, labeling
Amount: 42.8 hours, 40 speakers
Audio quality: 22.05kHz
License: CC BY-NC 2.0
Link: Seoul Corpus - OpenSLR
Release year: 2015

KSS Dataset

Tags: #single
Description: Korean Single speaker Speech Dataset
Type: Single speaker, Female (Professional voice actress)
Amount: 12 hours, 12853 utterances
Audio quality: 44.1kHz
License: non-commercial use only
Link: KSS Dataset
Release year: 2018

Zeroth Korean

Tags: #multiple
Description: Audio data of Project Zeroth for Korean Speech Recognition
Type: Multiple speakers (Crowdsourcing)
Amount: 76.6 hours, 35139 utterances, 137 speakers, 16472 unique sentences
Audio quality: crowdsourced recordings made with MoreCoin (Android phones as recording devices)
License: CC BY 4.0
Link: Zeroth Project, alias: Openslr - Zeroth Korean
Release year: 2018

Pansori-TEDxKR

Tags: #multiple
Description: Korean speech corpus generated from Korean language TEDx talks
Type: Multiple speakers (TEDx talks)
Amount: ~3 hours, 41 speakers
Audio quality: 16kHz, TEDx talks
License: CC BY-NC-ND 4.0
Link: Pansori TEDxKR Corpus, alias: Openslr - Pansori-TEDxKR
Release year: 2019

Deeply Korean Read Speech corpus

Tags: #multiple
Description: Korean speakers reading scripts that carry 3 text sentiments, using 3 vocal sentiments. Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
Type: Multiple speakers
Amount: ~3 hours, ~2000 utterances (a 1% subset of a commercial corpus)
Audio quality: Studio apartment, Dance studio, Anechoic chamber
License: CC BY-NC-ND 4.0
Link: Deeply Korean read speech corpus, Openslr
Release year: 2021

Deeply parent-child vocal interaction dataset

Tags: #multiple
Description: Interactions between parent-child pairs (reading fairy tales, singing children’s songs, conversing, and others). Recorded in 3 types of places, at 3 distinct distances, with 2 types of smartphone.
Type: Multiple speakers
Amount: ~16 hours, ~20000 utterances (a 1% subset of a commercial corpus)
Audio quality: Studio apartment, Dance studio, Anechoic chamber
License: CC BY-NC-ND 4.0
Link: Deeply parent-child vocal interaction dataset, Openslr
Release year: 2021

tri-jek

Tags: #single, #polyglot

See the Japanese section for details.

Vietnamese

VIVOS

Tags: #multiple
Description: Vietnamese speech corpus for speech recognition
Type: Multiple speakers (volunteers)
Amount: 15 hours, 12420 utterances, 65 speakers
Audio quality: 16kHz, quiet room
License: CC BY-NC-SA 4.0
Link: VIVOS
Release year: 2016

Update History

20220916: Updated the VIVOS link and added several Japanese speech corpora: TEDxJP-10K, LaboroTVSpeech, Kokoro, JECS, SpeedSpeech-JA-2022, SMASH corpus
20211123: Removed VinBigdata-VLSP2020-100h, added a new Korean corpus Seoul Corpus, a new Japanese corpus JTubeSpeech, and a new polyglot corpus tri-jek
20210628: Added several Japanese corpora JSSS, JMD, J-KAC
20210209: Added a new Japanese single-speaker corpus JSUT-book and 2 Korean multi-speaker corpora Deeply Korean read speech corpus, Deeply parent-child vocal interaction dataset
20201209: Added a new Vietnamese multi-speaker corpus VinBigdata-VLSP2020-100h
20190925: Added a new Japanese single-speaker corpus CSS10-ja
20190902: Added a new Japanese multi-speaker corpus JVS
20190215: Added a new Korean multi-speaker corpus Pansori-TEDxKR
20180422: Initial post with 4 corpora JSUT, KSS, Zeroth-Korean, VIVOS