Improved robustness towards spontaneous speech is essential for many current and future applications of automatic speech understanding. With a growing degree of spontaneity, it becomes increasingly important to deal with spontaneous speech phenomena, such as agrammatical utterances, pauses, filled pauses and non-verbals, slurring, pronunciation variants, word fragments, and out-of-vocabulary (OOV) words. Furthermore, in most applications of spontaneous speech recognizers, the aim is not necessarily to deliver an exact word-by-word transcription of an utterance, but rather to provide an intermediate representation that leads to optimal understanding performance in a complex speech understanding system. This also involves information on the prosodic structure of an utterance, which has been completely ignored in previous approaches to spontaneous speech recognition.
The primary goal of this thesis is to examine the possibilities of using enhanced speech and language models for spontaneous speech in the framework of state-of-the-art, hidden Markov model (HMM)-based speech recognizers. It is demonstrated that this framework is sufficiently powerful to produce a more useful representation of a speech signal than merely an unstructured sequence of words. Specifically, the spontaneous speech recognizer presented in this thesis produces a sequence of words (or a word graph) that includes phrase boundary markers, which indicate the syntactic-prosodic structure of an utterance, as well as semantically tagged OOV word labels. By integrating this information into the speech recognition search process, the word recognition accuracy is implicitly improved, because both the occurrence of phrase boundaries and the occurrence and semantic classification of an OOV word are of considerable importance for determining the language model probabilities of neighbouring words. Moreover, by incorporating additional prosodic information into the recognizer to improve phrase boundary detection, prosody can also help to reduce word recognition errors.
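The effect of boundary markers and OOV class labels on the language model probabilities of neighbouring words can be illustrated with a toy example. The sketch below is not the model developed in this thesis; it is a minimal additively smoothed bigram model, and the token names `<B>` (phrase boundary) and `<OOV:city>` (semantically tagged OOV label), the three-sentence corpus, and the function names are all hypothetical choices made for illustration. It only shows that, once such markers are treated as ordinary tokens in the training data, the probability of a following word depends on whether a boundary was hypothesized before it.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams over token sequences.

    Boundary markers and OOV class labels are treated as ordinary tokens.
    """
    history = Counter()
    bigram = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1
            bigram[(prev, cur)] += 1
    return history, bigram


def prob(history, bigram, vocab_size, prev, cur, alpha=1.0):
    """Additively smoothed bigram probability P(cur | prev)."""
    return (bigram[(prev, cur)] + alpha) / (history[prev] + alpha * vocab_size)


# Hypothetical training data: "<B>" marks a phrase boundary and
# "<OOV:city>" stands in for an unknown word tagged as a city name.
corpus = [
    ["yes", "<B>", "i", "fly", "to", "<OOV:city>", "<B>", "on", "monday"],
    ["no", "<B>", "i", "fly", "to", "<OOV:city>", "<B>", "on", "friday"],
    ["i", "fly", "to", "hamburg", "<B>", "on", "monday"],
]

# Everything that can be predicted: all tokens plus the end-of-sentence symbol.
vocab = {tok for sent in corpus for tok in sent} | {"</s>"}
history, bigram = train_bigram(corpus)
V = len(vocab)

# Probability of "on" directly after a hypothesized phrase boundary ...
p_with_boundary = prob(history, bigram, V, "<B>", "on")
# ... versus directly after the OOV label, i.e. with no boundary in between.
p_without_boundary = prob(history, bigram, V, "<OOV:city>", "on")

print(f"P(on | <B>)        = {p_with_boundary:.3f}")
print(f"P(on | <OOV:city>) = {p_without_boundary:.3f}")
```

In this toy corpus the word "on" is considerably more probable after a boundary marker than directly after the OOV label, so a search process that hypothesizes the boundary assigns a better language model score to the correct continuation. A class label such as `<OOV:city>` similarly lets the model share statistics across all unknown city names instead of treating each one as an unseen event.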
This thesis also addresses some problems of recognizing spontaneous speech which are not directly related to phrase boundary detection and OOV word classification. All approaches reported in this thesis, however, have three important properties in common. First, they are stochastic, corpus-based approaches, i.e. they can be trained automatically, provided that suitable training data is available. Second, they can be directly integrated into the recognition process of any state-of-the-art speech recognizer without radical modifications to the decoding algorithm and without any postprocessing or rescoring of the recognition result. Third, all models take into consideration the restrictions of real-time (or at least close to real-time) decoding.