Automatically trained singing synthesis

I completed my Ph.D. in September 1999, under the supervision of Prof. Keikichi Hirose, at the Department of Information and Communication Engineering, the University of Tokyo.

The title of the thesis:

"High Quality Singing Synthesis using the Selection-based Synthesis Scheme".


Thesis abstract:

This work describes improvements to singing synthesis systems: enhancing synthesis quality, reducing the amount of manual work needed to learn a new voice, and extending the range of training data that the system can use to learn and produce new voices.

In order to improve the naturalness and range of expression of the synthesized singing, a large training database approach was adopted for singing synthesis.

Large database synthesis has been shown to improve speech synthesis quality and naturalness. In a similar manner, this work attempts to improve singing synthesis by using a large training database, with specific modifications aimed at singing synthesis. The system is shown to be able to produce high quality singing.

Since singing is fundamentally different from speech, special processing of the characteristics particular to singing was introduced.

The synthesis of vibrato, an important expressive device in western classical singing, is a possible source of quality degradation, as it involves large prosodic and spectral variations. A method for explicitly handling vibrato synthesis was introduced, and shown to improve synthesis quality.
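
As a rough illustration (not the thesis method itself), vibrato can be approximated as a slow sinusoidal modulation of the fundamental frequency; the rate and depth values below are hypothetical defaults:

    import numpy as np

    def vibrato_f0(f0_base, duration, frame_rate=100.0, rate_hz=5.5, depth_cents=80.0):
        """Sketch: an F0 contour with sinusoidal vibrato.
        f0_base: nominal pitch in Hz; frame_rate: contour frames per second;
        rate_hz: vibrato rate; depth_cents: peak pitch deviation in cents."""
        t = np.arange(0.0, duration, 1.0 / frame_rate)
        cents = depth_cents * np.sin(2.0 * np.pi * rate_hz * t)
        return f0_base * 2.0 ** (cents / 1200.0)  # cents -> frequency ratio

    # Example: an A4 (440 Hz) held for one second with typical vibrato.
    contour = vibrato_f0(440.0, 1.0)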

The prosody of singing has higher variance and more complex temporal structure than that of speech. While for speech it may be enough to represent a prosodic feature of one phoneme by just its average value, singing requires more detail. A finer modeling of the time structure of the prosodic features within one phoneme was introduced in order to improve the selection results.
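
As a sketch of this idea (the exact parameterization in the thesis may differ), the contour inside each phoneme can be summarized at several equally spaced points instead of being collapsed to a single mean:

    import numpy as np

    def phoneme_prosody_vector(f0_contour, n_points=5):
        """Summarize the F0 contour of one phoneme at n_points equally
        spaced positions instead of by a single average value."""
        f0 = np.asarray(f0_contour, dtype=float)
        positions = np.linspace(0.0, len(f0) - 1.0, n_points)
        return np.interp(positions, np.arange(len(f0)), f0)

    # A rising-then-falling contour: the mean hides the shape,
    # while the sampled vector preserves the time structure.
    contour = [200, 210, 225, 235, 230, 215, 205]
    print(np.mean(contour))                 # single-value representation
    print(phoneme_prosody_vector(contour))  # finer time structure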

The singer's formant is a perceptually important voice characteristic occurring in the voices of trained western classical singers. A feature corresponding to the level of the singer's formant during singing was added to the unit selection process, and was shown to improve synthetic singing quality.
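
One plausible way to compute such a feature, given here as a hedged sketch rather than the thesis definition, is the ratio of spectral energy in the singer's-formant region (assumed here to be roughly 2-4 kHz) to the total energy of a frame:

    import numpy as np

    def singers_formant_level(frame, sr, band=(2000.0, 4000.0)):
        """Sketch: level of the singer's-formant band, computed as the
        ratio (in dB) of spectral energy in the band to the total energy
        of a windowed frame. The 2-4 kHz band is an assumption."""
        windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
        power = np.abs(np.fft.rfft(windowed)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        in_band = (freqs >= band[0]) & (freqs <= band[1])
        return 10.0 * np.log10(power[in_band].sum() / power.sum())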

The synthesis method is based on the selection synthesis scheme: given a target song, the best-matching units from the training database are selected and used to synthesize the target. Before selection synthesis can be used, the selection parameters have to be trained, defining the relative weights of the various features used in the selection process. The computational load of this training grows with the number of selection features, so as new features were added to address singing-specific phenomena, the basic training scheme became impractical. Modifications to the training method were introduced, and are shown to improve the training results.
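
In broad strokes, selection minimizes a weighted sum of target costs (mismatch between a candidate unit and the target specification) and concatenation costs (mismatch at the join with the previous unit); the weights are exactly what the training stage must set. The sketch below shows this cost structure only; the feature vectors are illustrative, and a real system would search the candidates with dynamic programming:

    import numpy as np

    def selection_cost(target_feats, unit_feats, prev_unit_feats, w_target, w_concat):
        """Sketch of the per-unit cost in selection synthesis: a weighted
        distance between the target specification and a candidate unit
        (target cost), plus a weighted mismatch at the join with the
        previously selected unit (concatenation cost). The weight vectors
        w_target and w_concat are what weight training must determine."""
        target_cost = np.dot(w_target, np.abs(target_feats - unit_feats))
        concat_cost = np.dot(w_concat, np.abs(unit_feats - prev_unit_feats))
        return target_cost + concat_cost

    # Hypothetical 3-dimensional features, e.g. (F0, duration, power):
    target = np.array([220.0, 0.12, -20.0])
    unit = np.array([225.0, 0.10, -21.0])
    prev = np.array([223.0, 0.11, -20.5])
    cost = selection_cost(target, unit, prev, np.ones(3), np.ones(3))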

In order for selection synthesis to work well, a large, phonetically segmented database is necessary. Phonetic segmentation done manually can take months of work, making this synthesis method impractical. Previous work showed that speech recognition techniques can be applied to single-speaker speech databases to produce automatic segmentation. When applied to singing, however, this approach produced a much less accurate segmentation. Knowledge from the musical score was therefore incorporated to constrain the segmentation, improving model convergence and limiting gross errors.
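
One simple way to exploit the score, shown here as a sketch of the general idea rather than the thesis algorithm, is to allow each segment boundary only within a tolerance window around the note timing implied by the score and the known tempo:

    def boundary_windows(note_onsets_sec, tolerance=0.2):
        """Sketch: given note onset times implied by the score and tempo,
        allow each segment boundary only within +/- tolerance seconds of
        its expected onset, keeping automatic segmentation from drifting
        and producing gross errors."""
        return [(max(0.0, t - tolerance), t + tolerance) for t in note_onsets_sec]

    # Example: a score implying note onsets at 0.0, 0.5 and 1.2 seconds.
    print(boundary_windows([0.0, 0.5, 1.2]))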

The system was designed to be practical, and especially to minimize the manual work required to train it. The primary training mode of the system uses recordings of a singer singing along to a MIDI-generated playback, with the singer's voice recorded separately from the playback.

Two additional training scenarios are considered. In the first, the singer is still recorded separately, but sings along to a playback for which no MIDI data is available. In the second, the singer's recording is mixed with an instrument, and no MIDI data is available. Since numerous recordings matching these two conditions already exist, enabling training from such recordings significantly increases the amount of material available for training the system, so that many more voices can be trained easily.

In order to realize the first training scenario, an algorithm to align an audio recording with musical score information was developed, and shown to perform satisfactorily.
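
A common baseline for this kind of audio-to-score alignment, given here as an illustrative sketch rather than the thesis algorithm, is dynamic time warping between per-frame features of the recording and features of a reference rendered from the score:

    import numpy as np

    def dtw_cost(audio_feats, score_feats):
        """Sketch: dynamic time warping between per-frame feature vectors
        of the recording and of a reference rendered from the score.
        Returns the total alignment cost; backtracking through D would
        recover the actual time alignment."""
        n, m = len(audio_feats), len(score_feats)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                local = np.linalg.norm(audio_feats[i - 1] - score_feats[j - 1])
                D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]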

For the second training scenario, separation of the singer and piano signals is necessary, and a method for separating singer and piano sounds was developed. In this method, the harmonic part of the piano sound is modeled by a semi-parametric model. A way to automatically train the model and to estimate its parameters for a given instance was developed, and shown to improve separation quality under certain conditions compared with methods from the literature. The inharmonic part of the piano sound is modeled using transient modeling, which was shown to suppress the transient parts with minimal degradation to the rest of the signal.
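
As a rough illustration of the harmonic part only (the semi-parametric model of the thesis is not reproduced here), a note's harmonic component can be approximated as a sum of sinusoids at integer multiples of the fundamental; subtracting such an estimate from the mixture leaves the voice plus the inharmonic residue:

    import numpy as np

    def harmonic_model(f0, amps, phases, duration, sr=22050):
        """Sketch: the harmonic part of a piano note as a sum of sinusoids
        at integer multiples of f0 (amps and phases are per-harmonic).
        Subtracting this estimate from the mixture leaves the voice plus
        the inharmonic (transient) part of the piano sound."""
        t = np.arange(int(duration * sr)) / sr
        out = np.zeros_like(t)
        for k, (a, p) in enumerate(zip(amps, phases), start=1):
            out += a * np.sin(2.0 * np.pi * k * f0 * t + p)
        return out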

The system was implemented on a general-purpose computer, and was shown to be able to learn a new voice completely automatically and to synthesize songs with a natural and expressively rich voice quality. The improvement in singing quality was confirmed by listening experiments.

[Figure: general scheme of the system (training part)]

Ph.D. thesis - "High Quality Singing Synthesis using the Selection-based Synthesis Scheme" (unofficial version)

Some published papers related to the Ph.D. thesis:

ICSLP98 - "Separation of singing and piano sounds"

EUROSPEECH99 - "Efficient weight training for selection based synthesis"

ICASSP2000 - "Synthesis of Vibrato Singing"


Synthesis Examples:

Letzte Hoffnung (from the song cycle "Winterreise" by F. Schubert) -- (male singer, 22050 Hz WAV file, 2.2 MB)