Adapting and Controlling DNN-based Speech Synthesis using Input Codes

Speech samples supporting the ICASSP 2017 paper "Adapting and Controlling DNN-based Speech Synthesis using Input Codes".

Seven models were used to generate the speech samples, each using a different input-code strategy:

	Speaker Code (S)		Gender Code (G)		Age Code (A)
Models	Type	Size	Type	Size	Type	Size
ONE-S	One-hot	112	N/A	N/A	N/A	N/A
ONE-SGA'	One-hot	112	One-hot	2	One-hot	7
ONE-SGA	One-hot	112	Numeric	1	Numeric	1
RND112-SGA	Random	112	Numeric	1	Numeric	1
RND008-SGA	Random	8	Numeric	1	Numeric	1
DCC112-SGA	DCC	112	Numeric	1	Numeric	1
DCC008-SGA	DCC	8	Numeric	1	Numeric	1
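
As a rough illustration of the code layouts in the table above, the following sketch assembles the input code for ONE-SGA' (all one-hot) and ONE-SGA (one-hot speaker, numeric gender and age). The sizes 112, 2, 7 and 1 come from the table; the function names, the decade-wide age bins (11-20 up to 71-80), and the numeric scales for gender and age are assumptions made for this example, not details confirmed by the page. The Random and DCC speaker codes are not sketched because their construction is not described here.

```python
import numpy as np

N_SPEAKERS = 112  # one-hot speaker code size (from the table)
N_AGE_BINS = 7    # one-hot age code size in ONE-SGA' (from the table)

def one_hot(index, size):
    """Return a 1-of-k vector with a single 1.0 at `index`."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def age_bin(age_years):
    # Assumption: seven decade-wide bins, 11-20, 21-30, ..., 71-80.
    return int(np.clip((age_years - 11) // 10, 0, N_AGE_BINS - 1))

def code_one_sga_prime(speaker_idx, is_female, age_years):
    """ONE-SGA' layout: one-hot speaker, one-hot gender, one-hot age."""
    return np.concatenate([
        one_hot(speaker_idx, N_SPEAKERS),
        one_hot(int(is_female), 2),
        one_hot(age_bin(age_years), N_AGE_BINS),
    ])

def code_one_sga(speaker_idx, gender_value, age_value):
    """ONE-SGA layout: one-hot speaker, one numeric gender, one numeric age."""
    return np.concatenate([
        one_hot(speaker_idx, N_SPEAKERS),
        [gender_value],  # assumed scale, e.g. 0.0 = male, 1.0 = female
        [age_value],     # assumed scale, e.g. age in years
    ])
```

Under these assumptions the ONE-SGA' code has 112 + 2 + 7 = 121 dimensions and the ONE-SGA code has 112 + 1 + 1 = 114. Only the numeric variants can take values outside the range seen in training.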

1. Multi-Speaker

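Samples for speakers in the training set. "Natural" is the recorded speech. 'a' indicates that the average value was used for that code, while 'c' indicates the correct value. In suffixes with three letters, the letters correspond to the speaker, gender, and age codes, in that order.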

	1	2	3	4
Natural	► Play	► Play	► Play	► Play
ONE-a	► Play	► Play	► Play	► Play
ONE-c	► Play	► Play	► Play	► Play
ONE-ccc'	► Play	► Play	► Play	► Play
ONE-ccc	► Play	► Play	► Play	► Play
RND112-ccc	► Play	► Play	► Play	► Play
RND008-ccc	► Play	► Play	► Play	► Play
DCC112-ccc	► Play	► Play	► Play	► Play
DCC008-ccc	► Play	► Play	► Play	► Play

2. Adaptation

Samples for speakers not included in the training set. 'e' indicates that the value of that code was estimated using the back-propagation algorithm.
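
The idea of estimating a code by back-propagation can be sketched as follows: the network weights stay fixed and only the input code is updated by gradient descent on the error against the new speaker's data. This is a minimal toy in NumPy, not the paper's implementation. The one-hidden-layer network, its dimensions, the learning rate, the zero starting point, and all names (`forward`, `estimate_code`, `p`) are assumptions chosen for illustration.

```python
import numpy as np

def forward(X, code, p):
    """Toy stand-in acoustic model: one tanh hidden layer, linear output."""
    H = np.tanh(X @ p["Wx"].T + code @ p["Wc"].T + p["b"])
    return H, H @ p["Wo"].T

def estimate_code(X, targets, p, n_steps=500, lr=0.02):
    """Gradient descent on the input code only; the weights in p never change."""
    code = np.zeros(p["Wc"].shape[1])  # assumed starting point
    losses = []
    for _ in range(n_steps):
        H, Y = forward(X, code, p)
        err = Y - targets
        losses.append(0.5 * np.mean(np.sum(err ** 2, axis=1)))
        dZ = (err @ p["Wo"]) * (1.0 - H ** 2)  # back-propagate through tanh
        grad = (dZ @ p["Wc"]).mean(axis=0)     # gradient w.r.t. the code
        code = code - lr * grad
    return code, losses

# Synthetic demo: random fixed weights and an "unseen speaker" with a hidden code.
rng = np.random.default_rng(0)
p = {
    "Wx": 0.3 * rng.standard_normal((16, 5)),  # linguistic input -> hidden
    "Wc": 0.3 * rng.standard_normal((16, 8)),  # 8-dim speaker code -> hidden
    "b": np.zeros(16),
    "Wo": 0.3 * rng.standard_normal((3, 16)),  # hidden -> acoustic output
}
X = rng.standard_normal((50, 5))               # 50 frames of input features
true_code = rng.standard_normal(8)
_, targets = forward(X, true_code, p)          # adaptation data
code_est, losses = estimate_code(X, targets, p)
```

In this toy setting the loss recorded in `losses` should fall as the code is fitted. The estimated code need not equal `true_code`, since different codes can produce similar outputs.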

	1	2	3	4
Natural	► Play	► Play	► Play	► Play
ONE-a	► Play	► Play	► Play	► Play
ONE-e	► Play	► Play	► Play	► Play
ONE-ecc'	► Play	► Play	► Play	► Play
ONE-ecc	► Play	► Play	► Play	► Play
RND112-ecc	► Play	► Play	► Play	► Play
RND008-ecc	► Play	► Play	► Play	► Play
DCC112-ecc	► Play	► Play	► Play	► Play
DCC008-ecc	► Play	► Play	► Play	► Play

3. Manipulation

3.1. Gender

Samples of speakers in the training set, with the gender code switched from male to female and vice versa. For the ONE-SGA' model, an extreme value cannot be used because its gender code is a one-hot vector.

	Male		Female
Natural	► Play	Natural	► Play
ONE-c	► Play	ONE-c	► Play
ONE-cFc'	► Play	ONE-cMc'	► Play
ONE-cFc	► Play	ONE-cMc	► Play
RND112-cFc	► Play	RND112-cMc	► Play
RND008-cFc	► Play	RND008-cMc	► Play
DCC112-cFc	► Play	DCC112-cMc	► Play
DCC008-cFc	► Play	DCC008-cMc	► Play

3.2. Age

Samples of speakers in the training set. A male speaker aged 41-50 was chosen. For the ONE-SGA' model, the values 15 and 75 were used because its age code is a 1-of-k vector and cannot be assigned an extreme value.

	Male (41-50)
Natural	► Play
ONE-c	► Play
ONE-cc15'	► Play	ONE-cc75'	► Play
ONE-cc05	► Play	ONE-cc85	► Play
RND112-cc05	► Play	RND112-cc85	► Play
RND008-cc05	► Play	RND008-cc85	► Play
DCC112-cc05	► Play	DCC112-cc85	► Play
DCC008-cc05	► Play	DCC008-cc85	► Play

3.3. Interpolation

The remaining samples were generated using the DCC008-SGA model. In this section, the values of the Speaker Code, Age Code, and Gender Code are interpolated from one value to another within a single utterance.
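
One simple way to realize such an interpolation is to give every frame of the utterance its own code, moving linearly from a start value to an end value. The sketch below assumes linear, frame-level interpolation. The page does not state which schedule was actually used, and the function name `interpolate_code` is hypothetical.

```python
import numpy as np

def interpolate_code(start_code, end_code, n_frames):
    """Return an (n_frames, code_dim) array moving linearly from start to end."""
    start = np.asarray(start_code, dtype=float)
    end = np.asarray(end_code, dtype=float)
    weights = np.linspace(0.0, 1.0, n_frames)[:, None]  # 0 at first frame, 1 at last
    return (1.0 - weights) * start + weights * end

# Example: sweep a 1-dimensional gender code from -2 to 3 over six frames.
gender_track = interpolate_code([-2.0], [3.0], 6)
```

The same function applies to the 8-dimensional speaker code of DCC008-SGA by passing two 8-dimensional vectors. Codes that are not being interpolated would simply be repeated unchanged for every frame.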

3.3.a. Speaker interpolation

The Speaker Code was interpolated between two speakers in the training set who have the same gender and age.

	21-30 years old Male
Sample 1
Sample 2

	61-70 years old Female
Sample 1
Sample 2

3.3.b. Age and Gender interpolation

The Gender Code or Age Code was interpolated from one extreme value to another, while the other two codes were kept at their correct values.

	Sample 1
DCC008-ccc
Gender (-2->3)
Age (-50->200)

	Sample 2
DCC008-ccc
Gender (-2->3)
Age (-50->200)