Adapting and Controlling DNN-based Speech Synthesis using Input Codes

Speech samples supporting the ICASSP 2017 paper "Adapting and Controlling DNN-based Speech Synthesis using Input Codes".

Seven models were used to generate the speech samples, each with a different input-code strategy:

              Speaker Code (S)    Gender Code (G)    Age Code (A)
Model         Type       Size     Type       Size    Type       Size
ONE-S         One-hot    112      N/A        N/A     N/A        N/A
ONE-SGA'      One-hot    112      One-hot    2       One-hot    7
ONE-SGA       One-hot    112      Numeric    1       Numeric    1
RND112-SGA    Random     112      Numeric    1       Numeric    1
RND008-SGA    Random     8        Numeric    1       Numeric    1
DCC112-SGA    DCC        112      Numeric    1       Numeric    1
DCC008-SGA    DCC        8        Numeric    1       Numeric    1
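To make the table concrete, the sketch below builds input-code vectors for the ONE-SGA' and ONE-SGA configurations. It is only an illustration: the function names, the 0/1 gender convention, and the use of raw numbers for the numeric gender and age codes are assumptions, since this page does not state how the codes are scaled or ordered.

```python
import numpy as np

NUM_SPEAKERS = 112  # size of the one-hot speaker code in the table
NUM_AGE_BINS = 7    # size of the one-hot age code in ONE-SGA'


def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v


def code_one_sga_prime(speaker_idx, is_female, age_bin):
    """ONE-SGA'-style code: all three factors one-hot (112 + 2 + 7 = 121 dims).

    The gender ordering (male = 0, female = 1) is an assumption.
    """
    return np.concatenate([
        one_hot(speaker_idx, NUM_SPEAKERS),
        one_hot(int(is_female), 2),
        one_hot(age_bin, NUM_AGE_BINS),
    ])


def code_one_sga(speaker_idx, gender_value, age_value):
    """ONE-SGA-style code: one-hot speaker plus scalar gender and age (114 dims).

    The scalar values are passed through unscaled, which is an assumption.
    """
    return np.concatenate([
        one_hot(speaker_idx, NUM_SPEAKERS),
        np.array([gender_value, age_value], dtype=np.float32),
    ])
```

Because the numeric codes are plain scalars, they can take values outside the range seen in training, which a one-hot code cannot do.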

1. Multi-Speaker

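Samples for speakers in the training set. "Natural" is recorded speech. In the system labels, each letter after the model name gives the setting of one code, in the order speaker, gender, age: 'a' indicates the average value for that feature, while 'c' indicates the correct value.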

1 2 3 4
Natural ► Play ► Play ► Play ► Play
ONE-a ► Play ► Play ► Play ► Play
ONE-c ► Play ► Play ► Play ► Play
ONE-ccc' ► Play ► Play ► Play ► Play
ONE-ccc ► Play ► Play ► Play ► Play
RND112-ccc ► Play ► Play ► Play ► Play
RND008-ccc ► Play ► Play ► Play ► Play
DCC112-ccc ► Play ► Play ► Play ► Play
DCC008-ccc ► Play ► Play ► Play ► Play

2. Adaptation

Samples for speakers not included in the training set. 'e' indicates an estimated value for that feature, found using the back-propagation algorithm.
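The idea of estimating a code by back-propagation can be sketched as gradient descent on the code vector while the network weights stay frozen. The toy network below (one tanh hidden layer, mean squared error, fixed learning rate) and all names in it are assumptions made for illustration; this page does not describe the architecture, loss, or optimiser actually used.

```python
import numpy as np


def forward(X, code, Wx, Wc, b, Wo):
    """Toy acoustic model: the same code vector is fed to every frame."""
    H = np.tanh(X @ Wx.T + code @ Wc.T + b)  # (frames, hidden)
    return H, H @ Wo.T                       # hidden activations, outputs


def mse(Y, T):
    """Half mean squared error per frame."""
    return 0.5 * np.mean(np.sum((Y - T) ** 2, axis=1))


def estimate_code(X, T, Wx, Wc, b, Wo, code_init, lr=0.05, steps=500):
    """Update only the input code; the weights Wx, Wc, b, Wo are never changed."""
    code = code_init.astype(np.float64).copy()
    n = X.shape[0]
    for _ in range(steps):
        H, Y = forward(X, code, Wx, Wc, b, Wo)
        err = Y - T                          # dLoss/dY, shape (frames, out)
        dH = (err @ Wo) * (1.0 - H ** 2)     # back-propagate through tanh
        grad_code = (dH @ Wc).sum(axis=0) / n
        code -= lr * grad_code
    return code


# Synthetic demo: recover a code for an "unseen speaker" from adaptation data.
rng = np.random.default_rng(0)
n_in, n_code, n_hidden, n_out, n_frames = 5, 8, 16, 4, 200
Wx = 0.3 * rng.standard_normal((n_hidden, n_in))
Wc = 0.3 * rng.standard_normal((n_hidden, n_code))
b = np.zeros(n_hidden)
Wo = 0.3 * rng.standard_normal((n_out, n_hidden))

X = rng.standard_normal((n_frames, n_in))
true_code = rng.standard_normal(n_code)
_, T = forward(X, true_code, Wx, Wc, b, Wo)

start = np.zeros(n_code)  # stands in for an "average" starting code
loss_before = mse(forward(X, start, Wx, Wc, b, Wo)[1], T)
estimated = estimate_code(X, T, Wx, Wc, b, Wo, start)
loss_after = mse(forward(X, estimated, Wx, Wc, b, Wo)[1], T)
print(loss_before, loss_after)
```

Only the code vector receives updates, so the number of adapted parameters equals the code size (for example 8 or 112 in the models listed above).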

1 2 3 4
Natural ► Play ► Play ► Play ► Play
ONE-a ► Play ► Play ► Play ► Play
ONE-e ► Play ► Play ► Play ► Play
ONE-ecc' ► Play ► Play ► Play ► Play
ONE-ecc ► Play ► Play ► Play ► Play
RND112-ecc ► Play ► Play ► Play ► Play
RND008-ecc ► Play ► Play ► Play ► Play
DCC112-ecc ► Play ► Play ► Play ► Play
DCC008-ecc ► Play ► Play ► Play ► Play


3. Manipulation

3.1. Gender

Samples of speakers included in the training set, with the gender code switched from male to female and vice versa. For model ONE-SGA', an extreme value cannot be used, since its gender code is a one-hot vector.

Male Female
Natural ► Play Natural ► Play
ONE-c ► Play ONE-c ► Play
ONE-cFc' ► Play ONE-cMc' ► Play
ONE-cFc ► Play ONE-cMc ► Play
RND112-cFc ► Play RND112-cMc ► Play
RND008-cFc ► Play RND008-cMc ► Play
DCC112-cFc ► Play DCC112-cMc ► Play
DCC008-cFc ► Play DCC008-cMc ► Play

3.2. Age

Samples of a speaker in the training set. A male speaker in the 41-50 age range was chosen. For model ONE-SGA', the values 15 and 75 were used, because its age code is a 1-of-k vector and cannot be assigned an extreme value.

Male (41-50)
Natural ► Play
ONE-c ► Play
ONE-cc15' ► Play ONE-cc75' ► Play
ONE-cc05 ► Play ONE-cc85 ► Play
RND112-cc05 ► Play RND112-cc85 ► Play
RND008-cc05 ► Play RND008-cc85 ► Play
DCC112-cc05 ► Play DCC112-cc85 ► Play
DCC008-cc05 ► Play DCC008-cc85 ► Play

3.3. Interpolation

The remaining samples were generated with the DCC008-SGA model. In this section, the values of the Speaker Code, Age Code and Gender Code are interpolated from one value to another within a single utterance.

3.3.a. Speaker interpolation

The Speaker Code was interpolated between two speakers in the training set who have the same gender and age.
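One simple way to move between two speaker codes within an utterance is a per-frame linear blend, sketched below. The linear schedule and the function name are assumptions; this page only states that the code is interpolated from one value to another.

```python
import numpy as np


def interpolate_codes(code_start, code_end, num_frames):
    """Return a (num_frames, code_size) trajectory from code_start to code_end."""
    code_start = np.asarray(code_start, dtype=np.float64)
    code_end = np.asarray(code_end, dtype=np.float64)
    # alpha rises linearly from 0 at the first frame to 1 at the last frame.
    alpha = np.linspace(0.0, 1.0, num_frames)[:, None]
    return (1.0 - alpha) * code_start + alpha * code_end
```

Each row of the result would be appended to the linguistic input of the corresponding frame, so the voice drifts from the first speaker to the second as the utterance progresses.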

21-30 years old Male
Sample 1
Sample 2

61-70 years old Female
Sample 1
Sample 2

3.3.b. Age and Gender interpolation

The Gender Code or Age Code was interpolated from one extreme value to another, while the other two codes were kept at their correct values.
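As a rough sketch, sweeping one numeric code while holding the others fixed could look like the following. The layout of the input code (speaker code followed by gender and then age), the linear sweep, and the function name are assumptions; the default range matches the gender range listed with the samples.

```python
import numpy as np


def sweep_scalar_code(speaker_code, gender, age, num_frames,
                      sweep="gender", start=-2.0, end=3.0):
    """Build per-frame codes with one scalar swept and the rest held constant."""
    if sweep not in ("gender", "age"):
        raise ValueError("sweep must be 'gender' or 'age'")
    base = np.concatenate([np.asarray(speaker_code, dtype=np.float64),
                           [float(gender), float(age)]])
    frames = np.tile(base, (num_frames, 1))
    column = -2 if sweep == "gender" else -1  # assumed layout: [speaker, gender, age]
    frames[:, column] = np.linspace(start, end, num_frames)
    return frames
```

For the age sweep, the same function would be called with `sweep="age"` and the age range listed with the samples.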

Sample 1
DCC008-ccc
Gender (-2->3)
Age (-50->200)

Sample 2
DCC008-ccc
Gender (-2->3)
Age (-50->200)