Invited Speakers
There will be a one-hour invited talk by Prof. Hiroya Fujisaki at the Opening Session.
Prof. Wolfgang Hess (University of Bonn) will give a summary of the conference at the Closing Session.
Invited speakers for the oral sessions:
  • Phonology and Phonetics of Prosody
    • Prof. John Ohala (University of California, Berkeley)
    • Prof. Daniel Hirst (University of Provence)
  • Para- and Non-Linguistic Information Conveyed by Prosody
    • Prof. Klaus R. Scherer (University of Geneva)
    • Dr. Kikuo Maekawa (National Institute for Japanese Language)
  • Physiology and Pathology of Prosody
    • Dr. Kiyoshi Honda (ATR)
  • Prosody and Voice Quality
    • Dr. Ailbhe Ni Chasaide and Dr. Christer Gobl (Trinity College Dublin)
  • Control of Prosody for High-quality and Expressive Speech Synthesis
    • Prof. Bjorn Granstrom (KTH)
    • Prof. Yoshinori Sagisaka (Waseda University)
  • Prosody in Speech Recognition, Understanding, and Summarization
    • Dr. Elizabeth Shriberg (SRI/ICSI)
Invited Talks:
Prosody and Phonology
John Ohala (University of California, Berkeley)

Emotional expression in prosody: A review and an agenda for future research
Klaus R. Scherer and Tanja Baenziger (University of Geneva)

Ever since the teaching of rhetoric by Greek and Roman philosophers, the powerful effect of emotion on speech, with respect to both voice quality and prosody, has been highlighted. While empirical research during recent decades has documented the acoustic correlates of different states of speaker affect, work on emotional expressivity in prosody has been somewhat neglected. While there are many suggestions for emotion-specific prosodic patterns in the literature, most of these are based on suggestive examples rather than systematic research. This talk will provide an overview of the mechanisms underlying the emotional effects on speech, emphasizing the role of voice quality and prosody, review a number of landmark studies, and identify methodological difficulties. The major issues, including the possibility of using emotional prosody in synthesis, will be illustrated with current work from our Geneva laboratory. In concluding, suggestions for further work in this area, requiring close collaboration between linguists, phoneticians, speech scientists, engineers, and psychologists, will be offered.

Production and perception of "paralinguistic" information
Kikuo Maekawa (National Institute for Japanese Language)
The phonetic manifestation of paralinguistic information (PI), such as a speaker's attitude and intention, is a unique property of speech communication.
The production and perception of six PI types were examined in Japanese.
In production, acoustic and articulatory analyses revealed that the speech signal and the underlying articulatory gestures differed systematically and considerably depending on the specified PI. Further, it was shown that the planning of PI can be classified into two different processes: one that makes reference to the phonological structure of the utterance, and one that does not.
As for perception, identification experiments followed by multidimensional scaling (MDS) analysis revealed that native subjects could identify the PI types correctly in a three-dimensional perceptual space, and regression analysis revealed high correlations between the acoustic measures and the perceptual dimensions.
Lastly, cross-linguistic perception experiments followed by MDS analyses revealed the partly language-dependent nature of PI perception. This is congruent with the finding that the production of PI makes partial reference to the phonological structure of the utterance.
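
As a minimal sketch of this kind of analysis pipeline (not the study's own code or data), the example below applies multidimensional scaling (MDS) to a hypothetical perceptual dissimilarity matrix for six PI types and then regresses hypothetical acoustic measures onto the recovered perceptual dimensions.

```python
# Minimal sketch (not the authors' code): MDS on a perceptual
# dissimilarity matrix, then regression of acoustic measures onto
# the recovered perceptual dimensions. All data here are invented.
import numpy as np
from sklearn.manifold import MDS
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical 6x6 dissimilarity matrix among six PI types,
# e.g. derived from identification confusions (symmetric, zero diagonal).
d = rng.random((6, 6))
dissim = (d + d.T) / 2
np.fill_diagonal(dissim, 0.0)

# Embed the six PI types in a three-dimensional perceptual space.
mds = MDS(n_components=3, dissimilarity="precomputed", random_state=0)
space = mds.fit_transform(dissim)          # shape: (6, 3)

# Hypothetical acoustic measures per PI type (e.g. mean F0, F0 range,
# duration); regress them onto each perceptual dimension.
acoustic = rng.random((6, 3))
for dim in range(3):
    model = LinearRegression().fit(acoustic, space[:, dim])
    r2 = model.score(acoustic, space[:, dim])
    print(f"dimension {dim + 1}: R^2 = {r2:.2f}")
```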

Physiological factors causing tonal characteristics of speech: from global to local prosody
Kiyoshi Honda (ATR Human Information Science Laboratories)

Voice fundamental frequency (F0) determines the tonal quality of vowels, and its rises and falls comprise a part of prosody in speech. This seemingly simple linear function, however, derives from numerous physiological activities and thus lacks a definitive account. This talk will review older studies and recent discoveries regarding the causal factors of F0 patterns, and discuss possible explanations for phrasal declination, lexical accent, and local F0 fluctuations. A special focus will be placed on the following topics. (1) Historical arguments on the two actions of the cricothyroid joint for stretching the vocal folds: whether both actions exist and how they contribute to F0 patterns will be revisited with MRI observations. (2) Causal mechanisms of so-called micro-prosody, i.e., F0 fluctuations due to voicing and vowel articulation: whether such local prosodic patterns derive automatically from the relevant anatomical structures or from deliberate efforts of the speaker to enhance speech perception will be discussed on the basis of EMG data.

Voice quality and f0 in prosody: towards a holistic account
Ailbhe Ni Chasaide and Christer Gobl (Centre for Language and Communication Studies, University of Dublin, Trinity College, Ireland)
This paper presents a discussion of the role of voice quality in prosody. Illustrations from past production and perception data by the authors indicate that source parameters other than f0 are an inherent part of prosody, implicated in both its linguistic and paralinguistic functions. While current prosodic (intonational) analyses of a language are couched solely in terms of f0 dynamics, it is argued here that a fuller understanding of the underlying production and perceptual correlates of prosodic terms such as pitch accent, declination, focus and boundaries might result from an integrative approach, in which f0 and voice quality, two dimensions of the voice source, are treated together and related to the temporal/rhythmic structure of utterances. Such an approach may serve to bring together the currently fragmented accounts of two core aspects of prosodic functioning: its role (i) in signalling linguistic, contrastive and discourse-related information and (ii) in communicating speaker affect, i.e. mood, emotional state and attitude. While the illustrations presented here provide initial hypotheses, a newly initiated project on Irish prosody will seek to incorporate such a holistic approach to prosodic analysis.

Audiovisual representation of prosody in expressive speech communication
Bjorn Granstrom and David House (Royal Institute of Technology)

Prosody in a single speaking style, often read speech, has been studied extensively in acoustic speech. During the past few years we have expanded our interest in two directions: (1) prosody in expressive speech communication and (2) prosody as an audiovisual expression. Understanding the interactions between visual expressions (primarily in the face) and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is for obvious reasons tightly connected to the acoustics (e.g. lip and jaw movements), but there are other articulatory movements that do not show up on the outside of the face. Furthermore, many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level, in which the timing of the gestures could play an important role. In this presentation we will give some examples of recent work, primarily at KTH, addressing these questions. We will report on methods for the acquisition and modeling of visual and acoustic data, and on some evaluation experiments in which audiovisual prosody is tested. The context of much of our work in this area is to create an animated talking agent capable of displaying realistic communicative behavior and suitable for use in conversational spoken language systems, e.g. a virtual language teacher.

Speech synthesis with attitude
Yoshinori Sagisaka, Takumi Yamashita and Yoko Kokenawa (Waseda University)

It has been pointed out that conventional synthesizers fall far short of generating the natural conversational speech responses expected in QA systems, where human-like speech output is highly desirable.
As we do not yet have any effective model of conversational speech generation, it remains unclear how to meet this demand. In this paper, we present our research efforts towards natural conversational speech generation. We have analyzed the F0 contours of short responses containing adverbs expressing degree in realistic conversational situations. Generation and perception experiments showed coherent control characteristics and the possibility of speech synthesis with natural prosody. Throughout the experiments, we noticed that not only are output contexts quite informative by themselves, but attitudinal information is also a key cue for obtaining natural conversational prosody.
It is speculated that attitudinal information obtained during output content generation will be quite useful for further advanced control.
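
As a minimal sketch of how an F0 contour for a short response utterance might be extracted (the pYIN tracker from librosa and the file name are assumptions for illustration, not the authors' tools):

```python
# Minimal sketch (tooling assumed, not the authors' method):
# extract an F0 contour from a short response utterance with pYIN.
import librosa
import numpy as np

# "response.wav" is a hypothetical recording of a short reply.
y, sr = librosa.load("response.wav", sr=16000)

f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz
    fmax=librosa.note_to_hz("C6"),   # ~1047 Hz
    sr=sr,
)

# Summarize the contour over voiced frames only.
voiced_f0 = f0[voiced_flag]
print(f"mean F0: {np.nanmean(voiced_f0):.1f} Hz, "
      f"range: {np.nanmax(voiced_f0) - np.nanmin(voiced_f0):.1f} Hz")
```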

Direct Modeling of Prosody:
An Overview of Applications in Automatic Speech Processing
Elizabeth Shriberg (SRI/ICSI)

We describe a gdirect modelingh approach to using prosody in various speech technology tasks. The approach does not in-volve any hand-labeling or modeling of prosodic events such as pitch accents or boundary tones. Instead, prosodic features are extracted directly from the speech signal and from the output of an automatic speech recognizer. Machine learning techniques then determine a prosodic model, which is integrated with lexical and other information to predict the target classes of interest. We discuss task-specific modeling and results for a line of research covering four general application areas: (1) structural tagging (finding sentence boundaries, disfluencies), (2) pragmatic and paralinguistic tagging (classifying dialog acts, emotion, and ghot spotsh), (3) speaker recognition, and (4) word recognition itself. To provide an idea of performance on real-world data, we focus on spontaneous (rather than read or acted) speech from a variety of contexts?including human-human telephone conversations, game-playing, human-computer dialog, and multi-party meetings.