Automatically trained singing synthesis
I completed my Ph.D. in September 1999, under the supervision of Prof.
Keikichi Hirose, at the Department of Information and Communication
Engineering, the University of Tokyo.
The title of the thesis:
"High Quality Singing Synthesis using the Selection-based Synthesis Scheme"
This work describes improvements to singing synthesis systems: enhancing
singing synthesis quality, reducing the amount of manual work needed to learn a
new voice, and extending the range of training data from which the system can
learn and produce new voices.
In order to improve the naturalness and range of expression of the synthesized
singing, a large training database approach was adopted for singing synthesis.
Large database synthesis has been shown to improve speech synthesis quality and
naturalness. In a similar manner, this work attempts to improve singing
synthesis by using a large training database, with specific modifications aimed
at singing synthesis. The system is shown to be able to produce high quality
synthetic singing.
Since singing is fundamentally different from speech, processing of the special
characteristics of singing was introduced.
The synthesis of vibrato, an important expressive device in Western classical
singing, is a possible source of quality degradation, as it involves large
prosodic and spectral variations. A method for explicitly handling vibrato
synthesis was introduced, and shown to improve synthesis quality.
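The thesis's actual vibrato model is not given here; as a rough illustration, vibrato is commonly viewed as a periodic modulation of F0 around the note pitch. The sketch below generates such a contour; the function name and the default rate and depth values are illustrative assumptions, not taken from the thesis:

```python
import math

def vibrato_f0(base_f0_hz, duration_s, rate_hz=5.5, depth_cents=60, fs=100):
    """Generate an F0 contour (one value per frame, fs frames per second)
    with sinusoidal vibrato: the pitch oscillates around base_f0_hz at
    rate_hz, with a peak deviation of depth_cents (musical cents)."""
    n = int(duration_s * fs)
    contour = []
    for i in range(n):
        t = i / fs
        cents = depth_cents * math.sin(2 * math.pi * rate_hz * t)
        contour.append(base_f0_hz * 2 ** (cents / 1200))
    return contour

# One second of A4 (440 Hz) with vibrato
f0 = vibrato_f0(440.0, 1.0)
```

Working in cents rather than Hz keeps the perceived modulation depth the same at any base pitch, which is one reason such a log-frequency formulation is common.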
The prosody of singing has higher variance and more complex temporal
structures than speech. While for speech it may be enough to represent a
prosodic feature of one phoneme by just its average value, for singing more
detail is needed. A finer modeling of the time structure of the prosodic
features inside one phoneme was introduced in order to improve the synthesis
quality.
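One simple way to picture this finer time structure (the exact parameterization used in the thesis is not reproduced here) is to summarize a per-frame contour by several sub-segment averages instead of a single phoneme-wide mean; the function name and three-way split below are illustrative:

```python
def subsegment_means(values, n_parts=3):
    """Summarize a per-frame prosodic contour inside one phoneme by the
    mean of each of n_parts equal time slices, instead of a single
    phoneme-wide average. Requires len(values) >= n_parts."""
    k = len(values)
    means = []
    for p in range(n_parts):
        lo = p * k // n_parts
        hi = (p + 1) * k // n_parts
        seg = values[lo:hi]
        means.append(sum(seg) / len(seg))
    return means
```

A rising and a falling contour with the same overall mean get different representations under this scheme, which a single average value cannot distinguish.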
The singer's formant is a perceptually important voice characteristic
occurring in the voices of trained Western classical singers. A feature
corresponding to the level of the singer's formant during singing was added to
the unit selection process, and was shown to improve synthetic singing quality.
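The singer's formant shows up as an energy concentration around 3 kHz. The thesis's exact feature definition is not given here; as an illustrative stand-in, one could measure the relative energy in an assumed 2-4 kHz band of a magnitude spectrum:

```python
import math

def singers_formant_level(mag_spectrum, sample_rate, band=(2000.0, 4000.0)):
    """Ratio, in dB, of spectral energy inside the assumed singer's-formant
    band to the total energy. mag_spectrum holds n magnitude bins covering
    0 .. sample_rate/2 linearly."""
    n = len(mag_spectrum)
    total = 0.0
    in_band = 0.0
    for i, m in enumerate(mag_spectrum):
        freq = i * (sample_rate / 2.0) / (n - 1)  # bin centre frequency
        energy = m * m
        total += energy
        if band[0] <= freq <= band[1]:
            in_band += energy
    return 10.0 * math.log10(in_band / total)
```

A higher (less negative) value indicates a stronger singer's formant; such a scalar can be added to the feature vector used for unit selection.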
The synthesis method is based on the selection synthesis scheme, where, given a
target song, the best matching units from the training database are selected,
and used to synthesize the target. Before selection synthesis can be used, the
selection parameters have to be trained, defining the relative weight of the
various features used for the selection process. The computational load of this
training increases as the number of features used for selection increases. As
new features were added to the selection process to address singing-specific
phenomena, the basic training scheme became impractical. Modifications to the
training method were introduced, and are shown to improve the training results.
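The core of selection synthesis can be sketched as a weighted nearest-unit search. The toy below shows only the per-target ("target cost") part; a full system, including the one described here, also weighs a concatenation cost between adjacent units and searches with dynamic programming. All names are illustrative:

```python
def select_units(targets, database, weights):
    """Toy weighted unit selection: for each target feature vector, pick
    the database unit minimizing the weighted sum of absolute feature
    differences. The weights are exactly what the training stage
    described above has to learn."""
    chosen = []
    for tgt in targets:
        best = min(database,
                   key=lambda u: sum(w * abs(a - b)
                                     for w, a, b in zip(weights, u, tgt)))
        chosen.append(best)
    return chosen
```

Because every candidate weight vector has to be evaluated over the whole database, the cost of training the weights grows quickly with the number of features, which is the practical problem the modified training method addresses.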
In order for selection synthesis to work well, a large, phonetically segmented
database is necessary. Phonetic segmentation, done manually, can take months of
work, thus making this synthesis method impractical. Previous works showed that
speech recognition techniques can be applied to single-speaker speech
databases to produce automatic segmentation. When applied to singing, however,
this approach produced much less accurate segmentation. Knowledge of the music
sheet was therefore incorporated to constrain the segmentation, improving model
convergence and limiting gross errors.
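One simple form such a constraint can take (the thesis's exact mechanism is not reproduced here) is to restrict each note-linked phonetic boundary to a search window around its notated onset time; the function name and tolerance value are illustrative assumptions:

```python
def boundary_windows(note_onsets_s, tolerance_s=0.15):
    """Turn score note onsets (seconds) into per-boundary search windows
    that constrain automatic segmentation: each phonetic boundary tied to
    a note is only searched within +/- tolerance_s of the notated onset."""
    return [(max(0.0, t - tolerance_s), t + tolerance_s)
            for t in note_onsets_s]
```

Keeping the segmenter inside these windows rules out gross errors such as a boundary drifting a whole note away, while still letting it refine the exact position acoustically.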
The system was designed to be practical, and especially to minimize the
manual work required for training the system.
The primary training mode of the system uses recordings of a singer singing
along with a MIDI-created playback, which is recorded separately.
Two additional training scenarios are considered. In the first, the singer is
recorded separately, but sings along with a playback for which no MIDI
data is available. In the second, the singer's recording is mixed with an
instrument, and no MIDI data is available. As numerous recordings corresponding
to these two additional recording conditions already exist, enabling the
training of the system from these kinds of recordings can significantly increase
the amount of recordings which can be used to train the system, so that many
more voices can be easily trained.
In order to realize the first training scenario, an algorithm to align an audio
recording to music sheet information was developed, and shown to perform well.
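A standard framework for this kind of audio-to-score alignment is dynamic time warping (DTW); the sketch below is a generic DTW, not the thesis's specific algorithm, and the feature sequences and distance function are placeholders:

```python
def dtw_align(audio_feats, score_feats, dist=lambda a, b: abs(a - b)):
    """Classic dynamic time warping between a per-frame audio feature
    sequence and a per-note score feature sequence; returns the
    minimum-cost monotonic warping path as (audio_idx, score_idx) pairs."""
    n, m = len(audio_feats), len(score_feats)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(audio_feats[i - 1], score_feats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # audio advances
                                 cost[i][j - 1],      # score advances
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack the cheapest path from the end to the start.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path
```

The monotonicity of the path is what makes it usable as an alignment: audio time and score position can only move forward together.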
For the second training scenario, separation of the singer and the piano signals
is necessary. A method of separating singer and piano sounds was developed. In
this method, the harmonic part of the piano sound is modeled by a
semi-parametric model. A way to automatically train the model, and to
estimate the model parameters for a given instance was developed, and shown to
improve separation quality in certain conditions, compared with methods from
the literature. The inharmonic part of the piano sound is modeled using transient
modeling, which was shown to be able to suppress the transient parts, with minimal
degradation to the rest of the signal.
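As a much-simplified stand-in for the harmonic part of such a model (the thesis's semi-parametric formulation is not reproduced here), the sketch below estimates sine and cosine amplitudes at each harmonic of a known piano-note F0 by projection over a frame and subtracts them, leaving the residual:

```python
import math

def remove_harmonics(signal, f0, fs, n_harmonics=5):
    """Estimate a quasi-stationary harmonic model (sine + cosine at each
    multiple of f0) over one frame by projection, then subtract it from
    the signal. Works best when the frame spans whole periods of f0."""
    n = len(signal)
    residual = list(signal)
    for h in range(1, n_harmonics + 1):
        w = 2 * math.pi * h * f0 / fs
        c = [math.cos(w * t) for t in range(n)]
        s = [math.sin(w * t) for t in range(n)]
        # Projection onto the (approximately orthogonal) basis pair.
        a = 2.0 / n * sum(x * ci for x, ci in zip(residual, c))
        b = 2.0 / n * sum(x * si for x, si in zip(residual, s))
        residual = [x - a * ci - b * si
                    for x, ci, si in zip(residual, c, s)]
    return residual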
The system was implemented on a general purpose computer, and was shown to be
able to learn a new voice completely automatically, and synthesize songs, with a
natural and expressively rich voice quality. The improvement in singing quality
was confirmed by listening experiments.
The general scheme of the system (training part) is shown here:
Ph.D. thesis - "High Quality Singing
Synthesis using the Selection-based Synthesis Scheme" (unofficial version)
Some published papers related to the Ph.D. thesis:
ICSLP98 - "Separation of singing and piano sounds"
EUROSPEECH99 - "Efficient weight training
for selection based synthesis"
ICASSP2000 - "Synthesis of Vibrato Singing"
Letzte Hoffnung (from the song cycle
"Winterreise" by F. Schubert) -- (Male singer, 22050 Hz
WAV file, 2.2 MB)