Voice conversion experiment

Preparation of environment

Before starting (macOS)

If macOS is your target OS, you have to install the Command Line Tools before starting, because git, make, g++, etc. are required in this experiment. If these commands have not been installed yet, type the following command.

xcode-select --install

Before starting (Windows)

If Windows is your target OS, first you have to install WSL (Windows Subsystem for Linux). After setting up a distribution (e.g. Ubuntu or Debian), you have to install several commands which are required in the experiment. To install them, type the following command.

sudo apt install build-essential curl git tcsh
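If WSL itself has not been set up yet, recent versions of Windows can usually install it together with a distribution from an administrator PowerShell (the exact behavior depends on your Windows version; this step is an assumption, not part of the original procedure):

wsl --install -d Ubuntu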

Recording

You have already recorded samples for this experiment with the provided script. The recorded samples are stored in the speech/ directory, and each sample is named like a0001.wav.
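You can quickly confirm that the recordings are in place (a minimal check, assuming a POSIX shell; this experiment uses 50 samples, a0001.wav to a0050.wav):

ls speech/*.wav | wc -l  # should print 50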

Directory structure

First, to prepare an environment for the experiment, download a skeleton of the directory structure and expand it by the following procedure.

curl -O https://www.gavo.t.u-tokyo.ac.jp/~dsk_saito/download/vc-experiment.tar.gz 
tar xvzf vc-experiment.tar.gz
cd vc-experiment

Now vc-experiment/ is your working directory. The structure of the directory is as follows.

vc-experiment
├── bin
├── data
│   ├── src
│   │   └── wav
│   └── tgt
│       └── wav
├── include
├── lib
└── src

Copy recorded samples

To realize voice conversion from a source speaker to a target speaker, you have to prepare samples of both. The recorded samples that you want to use as source and target should be copied to data/src/wav and data/tgt/wav, respectively.
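For example (a sketch; the source paths are placeholders for wherever your two sets of recordings actually live):

cp /path/to/source-speaker/speech/*.wav data/src/wav/
cp /path/to/target-speaker/speech/*.wav data/tgt/wav/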

Installation of required tools

WORLD

WORLD is free software for high-quality speech analysis, manipulation, and synthesis. You can obtain the sources from its GitHub repository. To compile WORLD, carry out the following procedure.

cd src
git clone https://github.com/mmorise/World.git world
cd world
curl -O https://www.gavo.t.u-tokyo.ac.jp/~dsk_saito/download/world.patch
patch -p1 < world.patch
make
cd examples/analysis_synthesis
make

The above process builds two programs, analysis and synthesis, in the directory src/world/build. In addition, ctest, a test tool for manipulation, is also built there. Copy the three programs to bin/ by the following procedure, assuming vc-experiment/ is your current directory.

install src/world/build/ctest bin/
install src/world/build/analysis bin/
install src/world/build/synthesis bin/

SPTK

The Speech Signal Processing Toolkit (SPTK) is a suite of speech signal processing tools for UNIX environments. You can obtain the sources from the SPTK webpage.

curl -L http://downloads.sourceforge.net/sp-tk/SPTK-3.11.tar.gz > src/SPTK-3.11.tar.gz

To compile SPTK, the ordinary build procedure for UNIX programs is carried out.

cd src
tar xvzf SPTK-3.11.tar.gz
cd SPTK-3.11
./configure --prefix=$(pwd)/../../../vc-experiment
make 
make install

Now the suite of tools is installed to bin/ (under vc-experiment/).

After installation [Important]

For your convenience, add vc-experiment/bin to the PATH environment variable (assuming vc-experiment/ is your current directory).

export PATH=$(pwd)/bin:$PATH
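This setting lasts only for the current shell session. To make it persistent, you can append the line to your shell's startup file (a common approach, assuming bash):

echo "export PATH=$(pwd)/bin:\$PATH" >> ~/.bashrc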

Analysis & synthesis

Pitch and spectrum manipulation

Voice modification, realized by manually manipulating pitch and spectral parameters, is one of the simplest cases of voice conversion. Let's try the following commands to manipulate the recorded voices. You can check the meanings of the arguments by typing ctest without arguments.

ctest
# usage: ctest <input.wav> <output.wav> <f0 scale scalar> <spec scale scalar>
mkdir manipulate_result
ctest data/src/wav/a0001.wav manipulate_result/output01.wav 1.0 1.0 # without manipulation
ctest data/src/wav/a0001.wav manipulate_result/output02.wav 2.0 1.0 # pitch becomes higher
ctest data/src/wav/a0001.wav manipulate_result/output03.wav 0.5 1.0 # pitch becomes lower
ctest data/src/wav/a0001.wav manipulate_result/output04.wav 1.0 1.2 # lighter voice
ctest data/src/wav/a0001.wav manipulate_result/output05.wav 1.0 0.8 # heavier voice
ctest data/src/wav/a0001.wav manipulate_result/output06.wav 2.0 1.2 # combination v1
ctest data/src/wav/a0001.wav manipulate_result/output07.wav 0.5 0.8 # combination v2

Let's listen to the samples generated in the manipulate_result/ directory.
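For example, from the command line (the player command depends on your platform; these are assumptions, and on WSL it may be easier to open the files with a Windows player):

afplay manipulate_result/output02.wav   # macOS
aplay manipulate_result/output02.wav    # Linux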

Statistical voice conversion

Overview

As shown in the section above, voice modification is achieved by manipulating the parameters that represent voice characteristics. Hence the key issue of statistical voice conversion is how to learn a manipulation strategy from the given training data. The manipulation strategy is represented as a regression model, in which the parameters of the target speaker are estimated from the input parameters of the source speaker. In this experiment, a Gaussian mixture model (GMM) is adopted as the regression model. To build the model, the following steps are carried out.

  1. Feature extraction: The waveform is decomposed into multiple parameters, i.e. F0 (pitch), spectral, and aperiodicity parameters.
  2. Data alignment (dynamic time warping): Paired data are constructed from the source and target data. The duration difference between the two utterances is absorbed by the dynamic time warping (DTW) algorithm.
  3. Estimation of GMM parameters: The model parameters of the GMM are inferred from the constructed paired data.
  4. Voice conversion for test data: Using the estimated GMM, statistical regression of the parameters is realized (see the reference formula below).
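For reference, the classical GMM-based mapping used in step 4 estimates the target feature vector \(\hat{\boldsymbol{y}}\) from a source vector \(\boldsymbol{x}\) as a posterior-weighted sum of per-component linear regressions (this is the standard textbook formulation, not something specific to the tools used below):

\[\hat{\boldsymbol{y}}=\sum_{m=1}^{M} P(m\mid\boldsymbol{x})\left(\boldsymbol{\mu}_m^{(y)}+\boldsymbol{\Sigma}_m^{(yx)}\left(\boldsymbol{\Sigma}_m^{(xx)}\right)^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}_m^{(x)}\right)\right)\]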

Feature extraction

This feature extraction phase consists of two steps: speech analysis based on WORLD, and parameter compression based on SPTK. First, for all the samples in data/, i.e. the samples from both the source and target speakers, the waveform is decomposed into multiple parameters: F0 (pitch), spectral, and aperiodicity parameters.

mkdir data/{src,tgt}/{sp,f0,ap} # preparing directory
for i in data/{src,tgt}/wav/* ; do analysis ${i} ${i//wav/f0} ${i//wav/sp} ${i//wav/ap} ; done 

A .f0 file represents pitch information: the fundamental frequency of each time frame is stored as an 8-byte double. A .sp file represents spectral envelope information: the first 4-byte integer is the sampling frequency and the next 8-byte double is the frame shift period, so these 12 bytes can be viewed as a header; the spectral envelope of each frame is then stored as 1025 values (8-byte doubles) in the case of a 48 kHz-sampled waveform. A .ap file represents aperiodicity parameters: there is no header, and the aperiodicity of each frame is stored as 1025 values (8-byte doubles) in the case of a 48 kHz-sampled waveform.
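To sanity-check the extracted parameters, you can dump one of the binary files as text with SPTK's x2x (an optional inspection step, not part of the original procedure):

x2x +da data/src/f0/a0001.f0 | head   # first 10 F0 values [Hz], one per frame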

In this experiment, a conversion model is constructed only for the .sp information.

In the next step, the spectral envelope information is encoded into low-dimensional parameters called mel-cepstra. The mel-cepstrum is derived from cepstral analysis, which is an efficient way to extract envelope information. In the experiment, 1025-dimensional spectral envelopes are converted to 36-dimensional mel-cepstral vectors as follows.

mkdir data/{src,tgt}/mcep
for i in data/{src,tgt}/sp/* ; do bcut +c -s 12 ${i} | x2x +df | mcep -a 0.55 -m 35 -l 2048 -q 4 > ${i//sp/mcep} ; done

The first command, bcut, removes the header information; the next, x2x, converts double binaries to float ones; and mcep converts the spectral envelopes to mel-cepstral vectors. The resulting vectors consist of a 1-dimensional power coefficient and 35 cepstral coefficients for each frame. When the conversion model is constructed, the power coefficients are ignored. Hence, these parts are removed by the following procedure.

mkdir data/{src,tgt}/dat
for i in data/{src,tgt}/mcep/* ; do bcp -l 36 -s 1 $i > ${i//mcep/dat} ; done
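As a quick sanity check (optional; assumes a bash-compatible shell), each .mcep file stores 36 floats (4 bytes each) per frame, so the number of frames can be computed from the file size:

echo $(( $(wc -c < data/src/mcep/a0001.mcep) / (4 * 36) ))   # number of frames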

Dynamic time warping

Dynamic time warping (DTW) is an alignment technique between two temporal sequences which may vary in speed (see the Wikipedia article for details). In the case of voice conversion, even if the source and target speakers read the same sentence, the lengths of the utterances will differ from each other. To construct paired data for the regression model, we apply DTW to the two sequences of mel-cepstral vectors as follows.

mkdir -p data/joint/dat
for i in data/tgt/dat/* ; do dtw -l 35 $i ${i/tgt/src} > ${i/tgt/joint} ; done 

Sequences of joint vectors are stored in data/joint/dat directory.

GMM estimation

A Gaussian mixture model (GMM) is a probability distribution consisting of a weighted sum of multiple Gaussian distributions (see the Wikipedia article for details). Since a single Gaussian has a close connection to linear regression, a GMM can capture non-linear relationships through piecewise linear transformation. In classical voice conversion methods, GMMs are widely adopted as regression models. In this experiment, 45 pairs of sequences, i.e. a0001.dat to a0045.dat, are used as training data, and the 5 remaining sentences are used for testing. Using the constructed joint vector sequences, you can estimate the GMM parameters as follows.
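(For reference, a GMM with \(M\) components models the density as \(p(\boldsymbol{z})=\sum_{m=1}^{M} w_m\,\mathcal{N}(\boldsymbol{z};\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)\), where the mixture weights \(w_m\) sum to one. The joint vectors \(\boldsymbol{z}\) here are 70-dimensional because the source and target mel-cepstral vectors are 35-dimensional each, which is why -l 70 and -B 35 35 appear below.)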

for i in {1..45}; do cat data/joint/dat/a00$(printf %02d $i).dat ; done > data/joint/train.dat
mkdir model
gmm -l 70 -m 4 -B 35 35 -c1 data/joint/train.dat > model/gmm_test.dat

The -m option specifies the number of mixture components of the GMM; in the above case, 4 Gaussian components are included. The -b option specifies the maximum number of iterations: even if the model has not converged, the estimation stops when the number of iterations reaches the set value.

The number of mixtures corresponds to the flexibility or complexity of the regression model. Let's vary this value from 1 to 64 with the following procedure. Note that convergence may take a somewhat long time.

for i in {1,2,4,8,16,32,64} ; do gmm -l 70 -m ${i} -B 35 35 -c1 data/joint/train.dat > model/gmm${i}.dat ; done 

The different models, each corresponding to a specific number of mixtures, are stored in the model directory.

Voice conversion based on GMM for test data

Using the constructed GMM, let's convert new input from the source speaker as follows. In the following example, gmm4.dat is adopted as the conversion model.

mkdir -p data/output/dat
for i in data/src/dat/a00{46..50}.dat ; do vc -l 35 -m 4 model/gmm4.dat < $i > ${i/src/output}; done 

The obtained results are stored in data/output/dat/. These data do not include power coefficients. To handle the power coefficients, the following procedure is carried out. The displayed values are examples; they depend on your own data. The first and second values printed by vstat are the mean and variance, respectively. The final values are derived by mean and variance scaling, \(y=\mu_y+\frac{\sigma_y}{\sigma_x}(x-\mu_x)\). (Note that vstat prints variances, so to follow the formula exactly you should use their square roots \(\sigma=\sqrt{\text{variance}}\) in the sopr chain; with variances as close as in this example, the difference is small.)

#power extraction
mkdir -p data/{src,tgt}/pow
for i in data/{src,tgt}/mcep/* ; do bcp -l 36 -s 0 -e 0 $i > ${i//mcep/pow} ; done
# calculate mean & variance (src)
for i in {1..45}; do cat data/src/pow/a00$(printf %02d $i).pow ; done | vstat | x2x +fa
-6.67382
1.90849
# calculate mean & variance (tgt)
for i in {1..45}; do cat data/tgt/pow/a00$(printf %02d $i).pow ; done | vstat | x2x +fa
-6.53854
2.05689
# mean & variance scaling by sopr
mkdir -p data/output/pow
for i in data/src/pow/a00{46..50}.pow ; do sopr -s -6.67382 $i | sopr -d 1.90849 | sopr -m 2.05689 | sopr -a -6.53854 > ${i/src/output} ; done 

The sopr command performs scalar operations on every value in the stream: -a, -s, -m, and -d denote addition, subtraction, multiplication, and division, respectively.
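A tiny illustration of sopr's behavior (optional; just a toy example on a few ASCII values):

echo "1 2 3" | x2x +af | sopr -m 2 | x2x +fa   # prints 2, 4, and 6, one per line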

The obtained power coefficients and mel-cepstral coefficients should be merged.

mkdir -p data/output/mcep
for i in data/output/pow/* ; do merge -l 35 -L 1 -s 0 ${i} < ${i//pow/dat} > ${i//pow/mcep} ; done

The obtained feature vectors are finally converted into spectral envelopes which are suitable for WORLD synthesis.

mkdir data/output/sp
# WORLD header extraction
bcut +c -s 0 -e 11 data/src/sp/a0001.sp > data/world.head
# mcep to WORLD sp
for i in data/output/mcep/* ; do mgc2sp -a 0.55 -m 35 -l 2048 -o 3 ${i} | x2x +fd > data/tmp ; cat data/world.head data/tmp > ${i//mcep/sp} ; rm data/tmp ; done 

Finally, new synthesized utterances are obtained by WORLD.

mkdir -p data/output/wav
for i in {46..50}; do synthesis data/src/f0/a00${i}.f0 data/output/sp/a00${i}.sp data/src/ap/a00${i}.ap data/output/wav/a00${i}.wav ; done 

The generated samples are stored in data/output/wav. Since the procedure in this experiment is quite simple, the quality of the samples may not be very good. Nevertheless, you have now achieved a simple form of statistical voice conversion.

Assignment

Vary the number of mixtures used for conversion of the test samples, and discuss the differences between the obtained results. Please upload your data to your own storage service, such as Google Drive or Dropbox, and then send the link.
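As a starting point, here is a minimal sketch of a conversion loop over all the trained models (the data/output${m} directory layout is an assumption, not prescribed by the experiment; apply the remaining power, merge, and synthesis steps to each output directory in the same way):

for m in 1 2 4 8 16 32 64 ; do
  mkdir -p data/output${m}/dat
  for i in data/src/dat/a00{46..50}.dat ; do vc -l 35 -m ${m} model/gmm${m}.dat < $i > ${i/src/output${m}} ; done
done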

Contact: Daisuke Saito [ dsk_saito (at) gavo.t.u-tokyo.ac.jp ]