Master's thesis

"Detection of accents and their use in recognition of dialog acts"



My master's thesis is written in German. Download the German, gzipped PostScript version here (1.6 MB).

The following sections give a first impression of our work (with Dr. Elmar Nöth, Dr. Anton Batliner, et al.). More detailed information is available in this English, gzipped PostScript Verbmobil report (0.2 MB).

The thesis covers the following topics:

Speech production and prosody 

This chapter introduces the anatomy and mechanisms of speech production and defines the term "prosody". Prosody is concerned with suprasegmental events in spoken language, that is, events that span units larger than a single isolated sound (phoneme). Examples are the speech melody or the articulation of a syllable, a word, a phrase, or a sentence.

We restrict ourselves to the detection of prosodic accents. In our definition, a prosodic accent is simply word stress. A word can be stressed in different ways: by changing the pitch, the loudness, the duration, or a combination of these.


Extraction of prosodic features 

We are interested in detecting prosodic accents in spontaneously spoken German. To do so, we have to define and compute prosodic features. In my thesis I use the 276 prosodic features developed by Dr. Andreas Kießling and described in his Ph.D. thesis. They rely on an automatic time alignment and cover the duration, the pitch, and the energy (or loudness) of words and/or syllables.
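To make the idea of word-level prosodic features concrete, here is a minimal sketch. It is not the 276-feature set mentioned above. It assumes a hypothetical input format: a time alignment given as (word, start frame, end frame) triples with an exclusive end, plus per-frame pitch values (0.0 for unvoiced frames) and per-frame energy values. The function name, the feature names, and the 10 ms frame length are illustrative assumptions.

```python
from statistics import mean


def word_features(words, f0, energy, frame_ms=10):
    """Compute a few simple duration, pitch, and energy features per word.

    words:  list of (label, start_frame, end_frame), end exclusive
            (assumed alignment format; every word spans at least one frame)
    f0:     per-frame pitch in Hz, 0.0 marks an unvoiced frame (assumption)
    energy: per-frame energy values
    """
    feats = []
    for label, start, end in words:
        # Only voiced frames carry usable pitch information.
        voiced = [v for v in f0[start:end] if v > 0]
        feats.append({
            "word": label,
            "duration_ms": (end - start) * frame_ms,
            "f0_mean": mean(voiced) if voiced else 0.0,
            "f0_range": max(voiced) - min(voiced) if voiced else 0.0,
            "energy_mean": mean(energy[start:end]),
        })
    return feats
```

Each dictionary could then be flattened into a feature vector for a classifier. A real feature set would also normalize for speaker and speaking rate, which this sketch leaves out.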

Verbmobil project 

My thesis is embedded in the Verbmobil project, which aims to construct an automatic German-English translator. For more information, please refer to the Verbmobil homepage.

Automatic labeling of prosodic accents

The Verbmobil project has a large recorded collection of spontaneous speech data. After recording, the dialogs were transliterated and are now available as ASCII text together with the original recordings. A small subset of the database has been annotated with prosodic information (e.g. pauses or word stress) by listener judgment.

Dr. Anton Batliner and I developed a rule-based system to label word stress automatically. What is special about our system is that it needs only the ASCII word chain (the transliteration), information about pauses, and a lexicon annotated with parts of speech.
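The following toy sketch only illustrates how these three inputs (word chain, pause positions, part-of-speech lexicon) can drive a rule-based labeler. The single rule used here, "accent the last content word of each pause-delimited phrase", is a made-up placeholder and not one of the rules of our system. The tag names and the function name are likewise assumptions.

```python
# Hypothetical part-of-speech classes treated as content words.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def label_accents(words, lexicon, pause_after):
    """Label each word 'A' (accented) or 'U' (unaccented).

    words:       list of tokens (the transliterated word chain)
    lexicon:     dict mapping lower-cased token -> part-of-speech tag
    pause_after: set of word indices that are followed by a pause
    """
    labels = ["U"] * len(words)
    phrase_start = 0
    for i in range(len(words)):
        # A phrase ends at a pause or at the end of the utterance.
        if i in pause_after or i == len(words) - 1:
            # Placeholder rule: accent the last content word of the phrase.
            for j in range(i, phrase_start - 1, -1):
                if lexicon.get(words[j].lower()) in CONTENT_POS:
                    labels[j] = "A"
                    break
            phrase_start = i + 1
    return labels
```

A realistic rule set would distinguish more part-of-speech classes and accent types. The point here is only that no acoustic signal is needed to produce the labels.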

Information about pauses can be computed automatically and robustly, as Dr. Andreas Kießling and Dr. Ralf Kompe show in their Ph.D. theses. Alternatively, Dr. Anton Batliner annotated pause information by inspecting only the transliteration.

On this basis we constructed our rule-based labeling system for spontaneous German speech in the context of the Verbmobil project. The experiments show a close relationship between the word stress labels obtained by listener judgment and those produced by our rule-based approach.

Our approach has the advantage that a large database can be labeled in a very short time. Its drawback is that the labels are somewhat less exact than the listener judgment labels. Measured against the listener judgment labels, which are themselves sometimes wrong, our labels are correct in 76% of cases.
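An agreement figure of this kind can be computed as the fraction of positions at which two label sequences coincide. The sketch below assumes one label per word and equally long sequences; the function name is illustrative, and the exact evaluation procedure used in the thesis may differ.

```python
def agreement(reference, predicted):
    """Return the fraction of positions where both label sequences agree."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)
```

For example, `agreement(list("AUUA"), list("AUAA"))` compares four labels, three of which match.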


Recognition of prosodic accents using neural networks

After (re-)labeling the databases, we extract the above-mentioned prosodic features and train artificial neural networks (ANNs). We restrict ourselves to one kind of ANN, the multilayer perceptron (MLP). As our software tool we use the freely available ANN simulator SNNS. Our experiments show that we can achieve a recognition rate of 80% for listener judgment labels and 78% for labels produced by our rule-based approach. Combining the two MLPs slightly improves the recognition rates.
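For readers unfamiliar with MLPs, the sketch below shows the forward pass of a network with one hidden layer and a single sigmoid output, which can be read as a score for "accented". It is plain Python for illustration only and does not use SNNS. The layer layout, the activation function, and all names are assumptions, and training is omitted.

```python
import math


def sigmoid(x):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer MLP with a single output unit.

    x:        feature vector (list of floats)
    w_hidden: one weight row per hidden unit
    b_hidden: one bias per hidden unit
    w_out:    one output weight per hidden unit
    b_out:    output bias
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
```

A word would be classified as accented when the output exceeds a threshold such as 0.5. The weights would come from training on the labeled feature vectors.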

Recognition of prosodic accents using MLP and n-grams

Thanks to the automatic labeling of accent information (word stress), we are able to label the large, previously unlabeled database. This enables us to train 3-grams for accent recognition. We achieve a recognition rate of about 87%. We can further improve this rate slightly, to 88%, by combining MLP and 3-gram into an MLP/3-gram hybrid.
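One possible way to combine the two knowledge sources is sketched below. It makes several assumptions not stated above: the 3-gram is estimated over sequences of accent labels ('A'/'U') with add-one smoothing, the MLP delivers a per-word probability of "accented", both scores are added in the log domain, and decoding is greedy from left to right rather than a full search. The actual hybrid in the thesis may be built differently.

```python
import math
from collections import Counter

LABELS = ("A", "U")  # accented / unaccented


def train_trigrams(sequences):
    """Count label trigrams and their two-label contexts."""
    tri, bi = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq)
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            bi[context] += 1
    return tri, bi


def trigram_prob(label, context, tri, bi):
    """P(label | context) with add-one smoothing over the label set."""
    return (tri[context + (label,)] + 1) / (bi[context] + len(LABELS))


def decode(p_accented, tri, bi):
    """Greedy hybrid decoding: log classifier score plus log 3-gram score."""
    history = ["<s>", "<s>"]
    result = []
    for p_acc in p_accented:
        context = (history[-2], history[-1])
        scores = {}
        for label in LABELS:
            p_cls = p_acc if label == "A" else 1.0 - p_acc
            # Clamp to avoid log(0) for saturated classifier outputs.
            scores[label] = (math.log(max(p_cls, 1e-12))
                             + math.log(trigram_prob(label, context, tri, bi)))
        best = max(LABELS, key=scores.get)
        result.append(best)
        history.append(best)
    return result
```

When the classifier is undecided (probability 0.5), the 3-gram alone determines the label. A confident classifier output can override the 3-gram.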

The use of accents to distinguish dialog acts

In natural language processing, sentences cannot be treated as the basic units, because speakers often interrupt or correct themselves without starting a new sentence. The main criterion for understanding is not the grammar but the intention of a phrase. The Verbmobil project introduces the dialog act to determine the meaning of an utterance. For example, "Good morning", "Hello", and "Hi, Stephanie" belong to the dialog act "Greet".

Our task was to determine the influence of prosodic accent information (word stress) on the classification of dialog acts. Our experiments on a subset of the Verbmobil dialog acts show that taking prosodic information into account improves the recognition rate of dialog acts. These results show the importance of prosodic information in the field of automatic speech recognition.

  Matthias Nutt, August 1998