1,191 research outputs found
Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor
Using supporting backchannel (BC) cues can make human-computer interaction
more social. BCs provide feedback from the listener to the speaker, signaling
that the speaker is still being listened to. BCs can be expressed in different
ways depending on the modality of the interaction, for example as gestures or
acoustic cues. In this work we consider only acoustic cues. We propose an
approach to detecting BC opportunities based on acoustic input features such as
power and pitch. While other work in the field relies on hand-written rule sets
or specialized features, we use artificial neural networks, which can derive
higher-order features from the input features themselves. In our setup, we
first used a fully connected feed-forward network to establish an updated
baseline relative to our previously proposed setup. We then extended this setup
with Long Short-Term Memory (LSTM) networks, which have been shown to
outperform feed-forward setups on various tasks. Our best system achieved an
F1-score of 0.37 using power and pitch features; adding linguistic information
via word2vec increased the score to 0.39.
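As a minimal sketch of the kind of recurrent setup the abstract describes (written in PyTorch; the layer sizes, feature layout, and training targets below are illustrative assumptions, not the authors' configuration), an LSTM can map frame-level power and pitch features to a per-frame probability of a backchannel opportunity:

```python
# Minimal sketch (not the authors' exact architecture): an LSTM that maps
# frame-level acoustic features (e.g. power and pitch, 2 values per frame)
# to a per-frame probability of a backchannel opportunity.
import torch
import torch.nn as nn

class BackchannelLSTM(nn.Module):
    def __init__(self, n_features=2, hidden=70, layers=2):  # sizes are arbitrary
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, frames, n_features)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h))  # (batch, frames, 1) BC probability

# Hypothetical usage: 32 utterances of 500 frames, power + pitch per frame.
model = BackchannelLSTM()
feats = torch.randn(32, 500, 2)
targets = torch.randint(0, 2, (32, 500, 1)).float()  # toy BC/no-BC labels
probs = model(feats)
loss = nn.functional.binary_cross_entropy(probs, targets)
```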
Feature extraction based on bio-inspired model for robust emotion recognition
Emotional state identification is an important issue for achieving more natural speech-based interactive systems. Ideally, these systems should also be able to work in real environments, in which some kind of noise is generally present. Several bio-inspired representations have been applied to artificial systems for speech processing under noise conditions. In this work, an auditory signal representation is used to obtain a novel bio-inspired set of features for emotional speech signals. These characteristics, together with other spectral and prosodic features, are used for emotion recognition under noise conditions. Neural models were trained as classifiers and results were compared to the well-known mel-frequency cepstral coefficients. Results show that the proposed representations make it possible to significantly improve the robustness of an emotion recognition system. The results were also validated in a speaker-independent scheme and with two emotional speech corpora.
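The bio-inspired auditory representation itself is specific to the paper and is not reproduced here; the sketch below only illustrates the kind of MFCC reference system the abstract compares against, i.e. utterance-level MFCC statistics fed to a small neural classifier. The synthetic data, corpus size, and classifier settings are assumptions.

```python
# Illustrative MFCC baseline only (the paper's bio-inspired features are not
# reproduced here). Synthetic noise stands in for labelled emotional speech.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

def mfcc_stats(y, sr=16000, n_mfcc=13):
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Utterance-level mean and std over frames -> fixed-size feature vector.
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

rng = np.random.default_rng(0)
waves = [rng.normal(size=16000).astype(np.float32) for _ in range(20)]  # 1 s clips
labels = ["neutral" if i % 2 else "angry" for i in range(20)]           # toy labels
X = np.stack([mfcc_stats(y) for y in waves])
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300).fit(X, labels)
```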
A Generative Product-of-Filters Model of Audio
We propose the product-of-filters (PoF) model, a generative model that
decomposes audio spectra as sparse linear combinations of "filters" in the
log-spectral domain. PoF makes similar assumptions to those used in the classic
homomorphic filtering approach to signal processing, but replaces hand-designed
decompositions built of basic signal processing operations with a learned
decomposition based on statistical inference. This paper formulates the PoF
model and derives a mean-field method for posterior inference and a variational
EM algorithm to estimate the model's free parameters. We demonstrate PoF's
potential for audio processing on a bandwidth expansion task, and show that PoF
can serve as an effective unsupervised feature extractor for a speaker
identification task.
Comment: ICLR 2014 conference-track submission. Added link to the source code.
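As a concrete reading of the decomposition described above (notation ours, not the paper's), the sketch below generates a spectrum frame as a sparse nonnegative combination of filters in the log-spectral domain; in the actual model the filters and activation priors would be learned by the variational EM procedure, which is not shown.

```python
# Sketch of the generative assumption stated in the abstract: each log-spectrum
# frame is a sparse linear combination of L "filters", so the spectrum itself
# is a product of per-filter terms. All sizes and priors here are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
F, L = 257, 40                      # frequency bins, number of filters
filters = rng.normal(size=(F, L))   # log-domain filters (would be learned)
a = rng.gamma(shape=0.2, scale=1.0, size=L)   # sparse nonnegative activations

log_spectrum = filters @ a          # linear combination in the log domain
spectrum = np.exp(log_spectrum)     # equivalently, a product of filter terms
```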
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
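For illustration, one classic metric-based change-point criterion is the delta Bayesian Information Criterion (ΔBIC), which compares Gaussian models of two adjacent feature windows against a single model of their union; the sketch below shows that test with assumed window sizes and MFCC-like features (the survey covers a broader family of such distance measures, not only this one).

```python
# ΔBIC speaker-change test between two adjacent feature windows (a sketch;
# window length, penalty weight lam, and features are assumptions).
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """ΔBIC > 0 suggests a speaker change between the two windows."""
    def logdet(Z):
        return np.linalg.slogdet(np.cov(Z, rowvar=False))[1]
    X = np.vstack([X1, X2])
    n1, n2, n = len(X1), len(X2), len(X)
    d = X.shape[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(X) - n1 * logdet(X1) - n2 * logdet(X2)) - penalty

# Hypothetical usage: 100-frame, 13-dimensional windows around a candidate point.
rng = np.random.default_rng(0)
left, right = rng.normal(0, 1, (100, 13)), rng.normal(2, 1, (100, 13))
print("change point likely" if delta_bic(left, right) > 0 else "same speaker")
```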