15,505 research outputs found
Why has (reasonably accurate) Automatic Speech Recognition been so hard to achieve?
Hidden Markov models (HMMs) have been successfully applied to automatic
speech recognition for more than 35 years in spite of the fact that a key HMM
assumption -- the statistical independence of frames -- is obviously violated
by speech data. In fact, this data/model mismatch has inspired many attempts to
modify or replace HMMs with alternative models that are better able to take
into account the statistical dependence of frames. However it is fair to say
that in 2010 the HMM is the consensus model of choice for speech recognition
and that HMMs are at the heart of both commercially available products and
contemporary research systems. In this paper we present a preliminary
exploration aimed at understanding how speech data depart from HMMs and what
effect this departure has on the accuracy of HMM-based speech recognition. Our
analysis uses standard diagnostic tools from the field of statistics --
hypothesis testing, simulation and resampling -- which are rarely used in the
field of speech recognition. Our main result, obtained by novel manipulations
of real and resampled data, demonstrates that real data have statistical
dependency and that this dependency is responsible for significant numbers of
recognition errors. We also demonstrate, using simulation and resampling, that
if we `remove' the statistical dependency from data, then the resulting
recognition error rates become negligible. Taken together, these results
suggest that a better understanding of the structure of the statistical
dependency in speech data is a crucial first step towards improving HMM-based
speech recognition
Emotional State Categorization from Speech: Machine vs. Human
This paper presents our investigations on emotional state categorization from
speech signals with a psychologically inspired computational model against
human performance under the same experimental setup. Based on psychological
studies, we propose a multistage categorization strategy which allows
establishing an automatic categorization model flexibly for a given emotional
speech categorization task. We apply the strategy to the Serbian Emotional
Speech Corpus (GEES) and the Danish Emotional Speech Corpus (DES), where human
performance was reported in previous psychological studies. Our work is the
first attempt to apply machine learning to the GEES corpus where the human
recognition rates were only available prior to our study. Unlike the previous
work on the DES corpus, our work focuses on a comparison to human performance
under the same experimental settings. Our studies suggest that
psychology-inspired systems yield behaviours that, to a great extent, resemble
what humans perceived and their performance is close to that of humans under
the same experimental setup. Furthermore, our work also uncovers some
differences between machine and humans in terms of emotional state recognition
from speech.Comment: 14 pages, 15 figures, 12 table
Detecting User Engagement in Everyday Conversations
This paper presents a novel application of speech emotion recognition:
estimation of the level of conversational engagement between users of a voice
communication system. We begin by using machine learning techniques, such as
the support vector machine (SVM), to classify users' emotions as expressed in
individual utterances. However, this alone fails to model the temporal and
interactive aspects of conversational engagement. We therefore propose the use
of a multilevel structure based on coupled hidden Markov models (HMM) to
estimate engagement levels in continuous natural speech. The first level is
comprised of SVM-based classifiers that recognize emotional states, which could
be (e.g.) discrete emotion types or arousal/valence levels. A high-level HMM
then uses these emotional states as input, estimating users' engagement in
conversation by decoding the internal states of the HMM. We report experimental
results obtained by applying our algorithms to the LDC Emotional Prosody and
CallFriend speech corpora.Comment: 4 pages (A4), 1 figure (EPS
Exploiting correlogram structure for robust speech recognition with multiple speech sources
This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture with the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as
tightly coupled processes. In the first stage sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located on the delay
that corresponds to multiple pitch periods. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local
pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together
with the spectral representation, are employed by a `speech fragment decoder' which employs `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions,
which results in significantly better recognition accuracy
Emotion Recognition from Acted and Spontaneous Speech
Dizertační práce se zabývá rozpoznáním emočního stavu mluvčích z řečového signálu. Práce je rozdělena do dvou hlavních častí, první část popisuju navržené metody pro rozpoznání emočního stavu z hraných databází. V rámci této části jsou představeny výsledky rozpoznání použitím dvou různých databází s různými jazyky. Hlavními přínosy této části je detailní analýza rozsáhlé škály různých příznaků získaných z řečového signálu, návrh nových klasifikačních architektur jako je například „emoční párování“ a návrh nové metody pro mapování diskrétních emočních stavů do dvou dimenzionálního prostoru. Druhá část se zabývá rozpoznáním emočních stavů z databáze spontánní řeči, která byla získána ze záznamů hovorů z reálných call center. Poznatky z analýzy a návrhu metod rozpoznání z hrané řeči byly využity pro návrh nového systému pro rozpoznání sedmi spontánních emočních stavů. Jádrem navrženého přístupu je komplexní klasifikační architektura založena na fúzi různých systémů. Práce se dále zabývá vlivem emočního stavu mluvčího na úspěšnosti rozpoznání pohlaví a návrhem systému pro automatickou detekci úspěšných hovorů v call centrech na základě analýzy parametrů dialogu mezi účastníky telefonních hovorů.Doctoral thesis deals with emotion recognition from speech signals. The thesis is divided into two main parts; the first part describes proposed approaches for emotion recognition using two different multilingual databases of acted emotional speech. The main contributions of this part are detailed analysis of a big set of acoustic features, new classification schemes for vocal emotion recognition such as “emotion coupling” and new method for mapping discrete emotions into two-dimensional space. The second part of this thesis is devoted to emotion recognition using multilingual databases of spontaneous emotional speech, which is based on telephone records obtained from real call centers. The knowledge gained from experiments with emotion recognition from acted speech was exploited to design a new approach for classifying seven emotional states. The core of the proposed approach is a complex classification architecture based on the fusion of different systems. The thesis also examines the influence of speaker’s emotional state on gender recognition performance and proposes system for automatic identification of successful phone calls in call center by means of dialogue features.
- …