Hidden Markov Models
Hidden Markov Models (HMMs), although known for decades, have seen a surge of interest in recent years and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neuroscience, computational biology, bioinformatics, seismology, environmental protection and engineering. I hope that readers will find this book useful and helpful for their own research.
Implementation and Evaluation of Acoustic Distance Measures for Syllables
Munier C. Implementation and Evaluation of Acoustic Distance Measures for Syllables. Bielefeld (Germany): Bielefeld University; 2011. In this work, several acoustic similarity measures for syllables are motivated and subsequently evaluated.
The Mahalanobis distance, used as the local distance measure in a dynamic time warping approach to measuring acoustic distances, is able to discriminate syllables and thus allows for syllable classification with an accuracy typical for the classification of small acoustic units (60 percent for a nearest neighbor classification of a set of ten syllables using samples of a single speaker). This measure can be improved by several techniques, which however impair its execution speed (using more mixture components to estimate covariances from a Gaussian mixture model, using full covariance matrices instead of diagonal ones). Experimental evaluation makes evident that a well-performing syllable segmentation algorithm, allowing accurate estimation of syllable boundaries, is essential for the correct computation of acoustic distances by the similarity measures developed in this work. Further approaches to similarity measures, motivated by their use in timbre classification of music pieces, do not show adequate syllable discrimination abilities.
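As a sketch of the approach described above (not the author's implementation): dynamic time warping between two syllables' feature-frame sequences, with the Mahalanobis distance as the local frame distance. The frame values, dimensionality, and the diagonal covariance estimate are all illustrative assumptions.

```python
import numpy as np

def mahalanobis(x, y, vi):
    """Mahalanobis distance between two feature frames, given an inverse covariance vi."""
    d = x - y
    return float(np.sqrt(d @ vi @ d))

def dtw_distance(a, b, vi):
    """Length-normalised DTW distance between frame sequences a (n, dim) and b (m, dim)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = mahalanobis(a[i - 1], b[j - 1], vi)
            # standard DTW recursion: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)

# toy usage: random stand-ins for MFCC frames of two syllables,
# with a diagonal covariance estimated from the pooled frames
rng = np.random.default_rng(0)
dim = 4
frames_a = rng.normal(size=(10, dim))
frames_b = rng.normal(size=(12, dim))
pooled = np.vstack([frames_a, frames_b])
vi = np.linalg.inv(np.diag(pooled.var(axis=0)))  # inverse of diagonal covariance
d = dtw_distance(frames_a, frames_b, vi)
```

A nearest-neighbor classifier as evaluated in the work would then assign a test syllable the label of the training syllable with the smallest such distance.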
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other applications able to operate in real-world environments, such as mobile communication services and smart homes.
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
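The probabilistic combination of prosodic and lexical information mentioned above can be illustrated with a minimal sketch (not the paper's actual model): a log-linear interpolation of per-word boundary posteriors from a hypothetical prosodic model and a hypothetical language model. The weight `lam` is an assumed free parameter that would be tuned on held-out data.

```python
import numpy as np

def combine_boundary_posteriors(p_prosody, p_lm, lam=0.5):
    """Log-linearly interpolate two per-word P(boundary) estimates.

    p_prosody, p_lm: arrays of boundary posteriors in (0, 1), one per word.
    lam: interpolation weight between the prosodic and lexical models.
    """
    p_prosody = np.asarray(p_prosody, dtype=float)
    p_lm = np.asarray(p_lm, dtype=float)
    # combine in the log domain for both classes (boundary / no boundary)
    log_yes = lam * np.log(p_prosody) + (1 - lam) * np.log(p_lm)
    log_no = lam * np.log1p(-p_prosody) + (1 - lam) * np.log1p(-p_lm)
    # renormalise so the two class scores again sum to one
    m = np.maximum(log_yes, log_no)
    yes = np.exp(log_yes - m)
    no = np.exp(log_no - m)
    return yes / (yes + no)

# both models lean towards a boundary; the combined posterior also does
p = combine_boundary_posteriors([0.7], [0.8])
```

A thresholded combined posterior would then mark sentence or topic boundaries at each word position.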
Speech-driven animation using multi-modal hidden Markov models
The main objective of this thesis was the synthesis of speech-synchronised motion, in particular head motion. The hypothesis that head motion can be estimated from the speech signal was confirmed. In order to achieve satisfactory results, a motion capture database was recorded, a definition of head motion in terms of articulation was formulated, a continuous stream mapping procedure was developed, and finally the synthesis was evaluated. Based on previous research into non-verbal behaviour, basic types of head motion were devised that could function as modelling units. The stream mapping method investigated in this thesis is based on Hidden Markov Models (HMMs), which employ modelling units to map between continuous signals. The objective evaluation of the modelling parameters confirmed that head motion types could be predicted from the speech signal with an accuracy above chance, close to 70%. Furthermore, a special type of HMM called a trajectory HMM was used because it enables the synthesis of continuous output. However, head motion is a stochastic process, and therefore the trajectory HMM was further extended to allow for non-deterministic output. Finally, the resulting head motion synthesis was perceptually evaluated. The effects of the "uncanny valley" were also considered in the evaluation, confirming that rendering quality has an influence on our judgement of the movement of virtual characters. In conclusion, a general method for synthesising speech-synchronised behaviour was devised that can be applied to a whole range of behaviours.
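A minimal sketch of the HMM-style mapping idea, under illustrative assumptions (this is not the thesis code): given per-frame log-likelihoods of each head-motion unit for a stream of speech features, Viterbi decoding recovers the most likely sequence of motion units, which a synthesis stage could then render as continuous head motion.

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most likely head-motion unit sequence given per-frame scores.

    log_emit: (T, K) log-likelihood of each of K motion units per speech frame
    log_trans: (K, K) log transition probabilities between units
    log_init: (K,) log initial unit probabilities
    """
    T, K = log_emit.shape
    delta = np.empty((T, K))
    psi = np.zeros((T, K), dtype=int)  # best predecessor per state and time
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # [i, j]: come from i, go to j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # backtrack from the best final state
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# toy example: two motion units, sticky transitions, emissions that
# favour unit 0 for the first two frames and unit 1 afterwards
log_emit = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.8, 0.2], [0.2, 0.8]]))
log_init = np.log(np.array([0.5, 0.5]))
path = viterbi(log_emit, log_trans, log_init)  # -> [0, 0, 1, 1]
```

In the trajectory-HMM setting described above, the decoded unit sequence would additionally constrain the generated trajectory to be smooth and continuous rather than piecewise constant.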
- …