"Detection of accents and their use in recognition of dialog acts"
My master's thesis is written in German.
Download the German, gzipped PostScript version
here (1.6 MB).
It covers the topics summarized in the sections below.
The following lines give you a first impression of our work
(Dr. Elmar Nöth, Dr. Anton Batliner et al.).
You can find more detailed information
in this English, gzipped PostScript
Verbmobil report (0.2 MB).
Speech production and prosody
This chapter introduces the anatomy and mechanisms of speech production.
It further defines the term "prosody". Prosody is concerned with suprasegmental
events in spoken language, that is, events that span units larger than
a single isolated sound (phoneme). Examples are the speech melody or the
articulation of a syllable, a word, a phrase, or a sentence.
We restrict ourselves to the detection of prosodic accents. In our
definition, a prosodic accent is simply word stress. A word can be
stressed in different ways: word stress is realized by changing the pitch,
the loudness, the duration, or a combination of these.
Extraction of prosodic features
We are interested in the detection of prosodic accents in spontaneously
spoken German. Therefore we have to define and compute prosodic
features. In my thesis I use the 276 prosodic features developed by Dr.
Andreas Kießling and described in his Ph.D. thesis. They rely on an
automatic time alignment and cover the duration, the pitch, and the energy
(or loudness) of words and/or syllables.
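To illustrate the kind of word-level features involved, the following
Python sketch computes a duration and a log energy per word from a time
alignment. It is not Kießling's feature set: the function name
`word_features`, the alignment format, and the energy formula are
assumptions made for this example, and pitch is left out.

```python
import math


def word_features(samples, sample_rate, alignment):
    """Compute duration (seconds) and log energy (dB) for each aligned word.

    samples:     list of amplitude values (assumed format)
    sample_rate: samples per second
    alignment:   list of (word, start_sec, end_sec) tuples (assumed format)
    """
    features = []
    for word, start, end in alignment:
        lo = int(round(start * sample_rate))
        hi = int(round(end * sample_rate))
        segment = samples[lo:hi]
        # Mean squared amplitude; the small floor avoids log(0) on silence.
        mean_sq = sum(x * x for x in segment) / max(len(segment), 1)
        features.append({
            "word": word,
            "duration": end - start,
            "log_energy": 10.0 * math.log10(mean_sq + 1e-12),
        })
    return features


# Toy signal: half a second of silence followed by half a second of sound.
signal = [0.0] * 8000 + [0.5] * 8000
for f in word_features(signal, 16000, [("ja", 0.0, 0.5), ("Montag", 0.5, 1.0)]):
    print(f)
```

A louder or longer word yields a larger value in the corresponding
feature, which is what a stress detector can pick up on.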
My thesis is embedded in the Verbmobil project, which aims to construct
an automatic German-English translator. For more information please refer
to the Verbmobil homepage.
Automatic labeling of prosodic accents
The Verbmobil project has a large recorded collection of spontaneous
speech data. After recording, the dialogs were transliterated
and are now available as ASCII text together with the original recordings.
A small subset of the database has been annotated with prosodic information
(e.g. pause or word stress information) based on listener judgment.
Dr. Anton Batliner and I developed a rule-based system to automatically
label word stress. What is special about our system is that it needs only
the ASCII word chain (the transliteration), information about pauses, and
a lexicon annotated with parts of speech.
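The Python sketch below shows how these three inputs could feed a
rule-based labeler. The actual rules of our system are not given here;
the single rule used (accent the last content word before each pause),
the tag set, and the name `label_accents` are invented purely for
illustration.

```python
# Assumed tag set for "content words"; not the Verbmobil lexicon's tags.
CONTENT_POS = {"NOUN", "VERB", "ADJ"}


def label_accents(words, pause_after, lexicon):
    """Return one label per word: 'A' (accented) or 'U' (unaccented).

    words:       list of word strings (the transliteration)
    pause_after: list of booleans, True if a pause follows the word
    lexicon:     dict mapping lowercased word -> part-of-speech tag
    """
    labels = ["U"] * len(words)
    phrase_start = 0
    for i in range(len(words)):
        is_last = i == len(words) - 1
        if pause_after[i] or is_last:
            # Illustrative rule: walk backwards through the phrase and
            # accent the last content word found.
            for j in range(i, phrase_start - 1, -1):
                if lexicon.get(words[j].lower()) in CONTENT_POS:
                    labels[j] = "A"
                    break
            phrase_start = i + 1
    return labels


words = ["we", "meet", "on", "Monday", "in", "the", "office"]
pauses = [False, False, False, True, False, False, False]
lexicon = {"we": "PRON", "meet": "VERB", "on": "PREP", "monday": "NOUN",
           "in": "PREP", "the": "DET", "office": "NOUN"}
print(label_accents(words, pauses, lexicon))
```

No audio is touched at any point, which is why such a labeler can run
over a large transliterated database quickly.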
Information about pauses can be computed automatically and robustly, as
Dr. Andreas Kießling and Dr. Ralf Kompe show in their Ph.D. theses. An
alternative is the approach taken by Dr. Anton Batliner, who annotated
pause information by inspecting only the transliteration.
On this basis we constructed our rule-based labeling system for spontaneous
German speech in the context of the Verbmobil project. The experiments
show that there is a close relationship between the labeling of word stress
by listener judgment and by our rule-based approach.
Our approach has the advantage that a large database can be labeled in a
very short time. Its drawback is that the labels are a bit less exact than
the listener judgment labels. Measured against the listener judgment
labels, which are themselves sometimes wrong, our system labels 76%
correctly.
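Assuming this figure is a per-word agreement rate between the two
labelings, it could be computed as in the following Python sketch. The
function name `agreement_rate` and the 'A'/'U' label alphabet are
assumptions of this example.

```python
def agreement_rate(reference, hypothesis):
    """Fraction of words on which two label sequences agree."""
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    if not reference:
        raise ValueError("label sequences must not be empty")
    matches = sum(r == h for r, h in zip(reference, hypothesis))
    return matches / len(reference)


listener = ["A", "U", "U", "A", "U"]    # toy listener judgment labels
rule_based = ["A", "U", "A", "A", "U"]  # toy rule-based labels
print(agreement_rate(listener, rule_based))
```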
Recognition of prosodic accents using neural networks
After (re-)labeling the databases we extract the above-mentioned prosodic
features and train artificial neural networks (ANN). We restrict ourselves
to a special kind of ANN, the multilayer perceptron (MLP). As a software
tool we use the freely available ANN simulator SNNS. Our experiments show
that we achieve a recognition rate of 80% for listener judgment labels
and 78% for labels produced by our rule-based approach. A combination of
the two MLPs slightly improves the recognition rates.
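How the two MLPs are combined is not described here. One common option,
shown in the Python sketch below as an assumption, is to average the
accent probabilities that the two networks output for each word and then
apply a threshold. The names `combine_posteriors` and `decide`, the weight,
and the threshold are all hypothetical.

```python
def combine_posteriors(p_a, p_b, weight=0.5):
    """Weighted average of two per-word accent probabilities."""
    if len(p_a) != len(p_b):
        raise ValueError("both networks must score the same words")
    return [weight * a + (1.0 - weight) * b for a, b in zip(p_a, p_b)]


def decide(posteriors, threshold=0.5):
    """Map accent probabilities to 'A' (accented) or 'U' (unaccented)."""
    return ["A" if p >= threshold else "U" for p in posteriors]


p_listener_mlp = [0.9, 0.4, 0.2]  # toy outputs of the first MLP
p_rule_mlp = [0.7, 0.7, 0.1]      # toy outputs of the second MLP
print(decide(combine_posteriors(p_listener_mlp, p_rule_mlp)))
```

In this toy case the second word is rejected by the first network alone
but accepted once both opinions are averaged.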
Recognition of prosodic accents using MLP and 3-grams
Thanks to the automatic labeling of accent information (word stress) we
are able to label the large, previously unlabeled database. This enables
us to train 3-grams for accent recognition. We achieve a recognition rate
of about 87%. We can further improve this rate slightly, to 88%, by
combining MLP and 3-gram into the MLP/3-gram hybrid.
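The following Python sketch shows one way such a hybrid could work. It
assumes that the 3-gram is a trigram model over accent labels, that the
MLP delivers an accent probability per word, and that both scores are
added in the log domain and decoded with a Viterbi search. None of these
details are taken from the thesis; the smoothing, the weight `lm_weight`,
and all names are hypothetical.

```python
import math
from collections import Counter

LABELS = ("A", "U")  # accented / unaccented


def train_trigram(sequences):
    """Estimate P(l3 | l1, l2) with add-one smoothing from label sequences."""
    tri, bi = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>", "<s>"] + list(seq)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1

    def prob(l1, l2, l3):
        return (tri[(l1, l2, l3)] + 1) / (bi[(l1, l2)] + len(LABELS))

    return prob


def decode(mlp_posteriors, prob, lm_weight=1.0):
    """Viterbi search; score = log P_mlp + lm_weight * log P_3gram."""
    # State = the last two labels; value = (score, best path into state).
    best = {("<s>", "<s>"): (0.0, [])}
    for p_acc in mlp_posteriors:
        acoustic = {"A": p_acc, "U": 1.0 - p_acc}
        new_best = {}
        for (l1, l2), (score, path) in best.items():
            for l3 in LABELS:
                s = (score
                     + math.log(max(acoustic[l3], 1e-12))
                     + lm_weight * math.log(prob(l1, l2, l3)))
                key = (l2, l3)
                if key not in new_best or s > new_best[key][0]:
                    new_best[key] = (s, path + [l3])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]


# Toy training data in which two accents are never adjacent.
prob = train_trigram(["UAUUAU", "UUAUAU", "UAUAUU", "UUAUUA"])
posteriors = [0.2, 0.9, 0.55, 0.1]
print(decode(posteriors, prob, lm_weight=0.0))  # MLP only
print(decode(posteriors, prob))                 # MLP/3-gram hybrid
```

With the trigram switched off, the third word is labeled as accented
because its probability is just above 0.5. With the trigram switched on,
the label context outweighs that weak evidence and the word is labeled as
unaccented.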
The use of accents to distinguish dialog acts
In natural language processing, sentences cannot be viewed as the basic
units, because speakers often interrupt or correct themselves without
starting a new sentence. The main criterion for understanding is not the
grammar but the intention of a phrase. The Verbmobil project introduces the
dialog act to determine the meaning of an utterance. For example,
"Good morning", "Hello" and "Hi, Stephanie" belong to the dialog act "Greet".
Our job was to determine the influence of prosodic accent information
(word stress) on the classification of dialog acts. Our experiments on
a subset of the Verbmobil dialog acts show that taking prosodic information
into account improves the recognition rate of dialog acts. These results
show the importance of prosodic information in the field of automatic
speech recognition.
Matthias Nutt, August 1998