Speaker-independent emotion recognition exploiting a psychologically-inspired binary cascade classification schema
In this paper, a psychologically-inspired binary cascade classification schema is proposed for speech emotion recognition. Performance is enhanced because commonly confused pairs of emotions are distinguished from one another. Extracted features are related to statistics of pitch, formants, and energy contours, as well as spectrum, cepstrum, perceptual and temporal features, autocorrelation, MPEG-7 descriptors, Fujisaki's model parameters, voice quality, jitter, and shimmer. Selected features are fed as input to a K-nearest neighbor classifier and to support vector machines. Two kernels are tested for the latter: linear and Gaussian radial basis function. The recently proposed speaker-independent experimental protocol is tested on the Berlin emotional speech database for each gender separately. The best emotion recognition accuracy, achieved by support vector machines with a linear kernel, equals 87.7%, outperforming state-of-the-art approaches. Statistical analysis is first carried out with respect to the classifiers' error rates and then to evaluate the information expressed by the classifiers' confusion matrices. © Springer Science+Business Media, LLC 2011
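The cascade idea can be sketched as a two-stage tree of binary decisions: a first classifier picks a coarse group of emotions, and a second resolves the emotion within that group. The following is a minimal Python sketch using scikit-learn's linear SVM; the arousal-based grouping of four emotions shown here is a hypothetical illustration, not the schema from the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical grouping: stage 1 separates high- vs low-arousal emotions;
# stage 2 classifiers resolve the emotion within each arousal group.
GROUPS = {0: ["anger", "happiness"], 1: ["sadness", "boredom"]}


class BinaryCascade:
    def __init__(self):
        self.stage1 = LinearSVC()                        # arousal-group decision
        self.stage2 = {g: LinearSVC() for g in GROUPS}   # within-group decision

    def fit(self, X, y):
        y = np.asarray(y)
        group_of = {e: g for g, emos in GROUPS.items() for e in emos}
        g_labels = np.array([group_of[e] for e in y])
        self.stage1.fit(X, g_labels)
        for g in GROUPS:
            mask = g_labels == g
            self.stage2[g].fit(X[mask], y[mask])
        return self

    def predict(self, X):
        g_pred = self.stage1.predict(X)
        out = np.empty(len(X), dtype=object)
        for g in GROUPS:
            mask = g_pred == g
            if mask.any():
                out[mask] = self.stage2[g].predict(X[mask])
        return out
```

Because each stage only has to make one binary (or small within-group) decision, commonly confused emotion pairs never meet in the same flat multi-class problem.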
Modeling Group Dynamics for Personalized Robot-Mediated Interactions
The field of human-human-robot interaction (HHRI) uses social robots to
positively influence how humans interact with each other. This objective
requires models of human understanding that consider multiple humans in an
interaction as a collective entity and represent the group dynamics that exist
within it. Understanding group dynamics is important because these can
influence the behaviors, attitudes, and opinions of each individual within the
group, as well as the group as a whole. Such an understanding is also useful
when personalizing an interaction between a robot and the humans in its
environment, where a group-level model can facilitate the design of robot
behaviors that are tailored to a given group, the dynamics that exist within
it, and the specific needs and preferences of the individual interactants. In
this paper, we highlight the need for group-level models of human understanding
in human-human-robot interaction research and how these can be useful in
developing personalization techniques. We survey existing models of group
dynamics and categorize them into models of social dominance, affect, social
cohesion, and conflict resolution. We highlight the important features these
models utilize, evaluate their potential to capture interpersonal aspects of a
social interaction, and highlight their value for personalization techniques.
Finally, we identify directions for future work, and make a case for models of
relational affect as an approach that can better capture group-level
understanding of human-human interactions and be useful in personalizing
human-human-robot interactions.
Inaudible Adversarial Perturbation: Manipulating the Recognition of User Speech in Real Time
Automatic speech recognition (ASR) systems have been shown to be vulnerable
to adversarial examples (AEs). Recent successes all assume that users will not
notice or disrupt the attack process, despite the music- or noise-like sounds
and spontaneous responses from voice assistants that the attacks produce.
Nonetheless, in practical user-present scenarios, user awareness may nullify
existing attack attempts that trigger unexpected sounds or ASR activity. In
this paper, we seek to
bridge the gap in existing research and extend the attack to user-present
scenarios. We propose VRIFLE, an inaudible adversarial perturbation (IAP)
attack via ultrasound delivery that can manipulate ASRs as a user speaks. The
inherent differences between audible sounds and ultrasounds make IAP delivery
face unprecedented challenges such as distortion, noise, and instability. In
this regard, we design a novel ultrasonic transformation model to enhance the
crafted perturbation to be physically effective and even survive long-distance
delivery. We further enable VRIFLE's robustness by adopting a series of
augmentations over user and real-world variations during the generation process.
In this way, VRIFLE enables effective real-time manipulation of the ASR
output from different distances and under arbitrary user speech, with an
alter-and-mute strategy that suppresses the impact of user disruption. Our
extensive experiments in both digital and physical worlds verify VRIFLE's
effectiveness under various configurations, robustness against six kinds of
defenses, and universality in a targeted manner. We also show that VRIFLE can
be delivered with a portable attack device and even everyday-life loudspeakers.
Comment: Accepted by NDSS Symposium 202
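The augmentation-during-generation strategy can be illustrated with a generic expectation-over-transformation loop: the perturbation is optimized while random gain, time shift, and noise are applied at each step, so that it survives real-world variation. Below is a minimal NumPy sketch against a toy linear "ASR" stand-in; the model, dimensions, and augmentation ranges are all hypothetical assumptions, and the paper's actual ultrasonic transformation model is far more involved.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C = 16000, 3                           # toy: 1 s of 16 kHz audio, 3 "commands"
W = rng.normal(0.0, 0.008, size=(C, N))   # stand-in linear ASR: wave -> logits


def asr_logits(wave):
    return W @ wave


def random_augment(wave, rng):
    """Simulate delivery variation: random gain, small time shift, noise."""
    gain = rng.uniform(0.8, 1.2)
    shift = int(rng.integers(0, 8))
    aug = np.roll(wave * gain, shift) + 0.01 * rng.normal(size=wave.shape)
    return aug, gain, shift


def craft_perturbation(speech, target, steps=300, eps=0.1, lr=0.02):
    """Optimize an additive perturbation that drives the toy ASR to `target`
    in expectation over random augmentations (EOT-style sign steps)."""
    delta = np.zeros_like(speech)
    onehot = np.eye(C)[target]
    local = np.random.default_rng(1)
    for _ in range(steps):
        aug, gain, shift = random_augment(speech + delta, local)
        z = asr_logits(aug)
        p = np.exp(z - z.max())
        p /= p.sum()                                   # softmax probabilities
        # chain rule back through the gain and the roll to the perturbation
        grad = gain * np.roll(W.T @ (p - onehot), -shift)
        delta = np.clip(delta - lr * np.sign(grad), -eps, eps)
    return delta
```

The `np.clip` bound plays the role of keeping the perturbation small; in the actual attack the constraint is inaudibility via ultrasound delivery rather than an amplitude budget.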
AudioGen: Textually Guided Audio Generation
We tackle the problem of generating audio samples conditioned on descriptive
text captions. In this work, we propose AudioGen, an auto-regressive
generative model that generates audio samples conditioned on text inputs.
AudioGen operates on a learnt discrete audio representation. The task of
text-to-audio generation poses multiple challenges. Due to the way audio
travels through a medium, differentiating ``objects'' can be a difficult task
(e.g., separating multiple people simultaneously speaking). This is further
complicated by real-world recording conditions (e.g., background noise,
reverberation, etc.). Scarce text annotations impose another constraint,
limiting the ability to scale models. Finally, modeling high-fidelity audio
requires encoding audio at high sampling rate, leading to extremely long
sequences. To alleviate the aforementioned challenges we propose an
augmentation technique that mixes different audio samples, driving the model to
internally learn to separate multiple sources. We curated 10 datasets
containing different types of audio and text annotations to handle the scarcity
of text-audio data points. For faster inference, we explore the use of
multi-stream modeling, allowing the use of shorter sequences while maintaining
a similar bitrate and perceptual quality. We apply classifier-free guidance to
improve adherence to text. Compared to the evaluated baselines, AudioGen
performs better on both objective and subjective metrics. Finally, we explore the
ability of the proposed method to generate audio continuation conditionally and
unconditionally. Samples: https://tinyurl.com/audiogen-text2audi
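Two of the ingredients above are simple to illustrate: the source-mixing augmentation and classifier-free guidance over next-token logits. A minimal sketch follows; the function names and the caption-merging template are hypothetical, and AudioGen's actual pipeline operates on learned discrete audio tokens rather than raw waveforms.

```python
import numpy as np


def mix_examples(wav_a, cap_a, wav_b, cap_b, snr_db=0.0):
    """Mixing augmentation: overlay two recordings at a given SNR so the
    model must learn to separate sources; captions are merged."""
    gain = 10 ** (-snr_db / 20)          # scale of the second source
    n = min(len(wav_a), len(wav_b))
    mixed = wav_a[:n] + gain * wav_b[:n]
    peak = np.abs(mixed).max()
    if peak > 1.0:                       # renormalize to avoid clipping
        mixed = mixed / peak
    return mixed, f"{cap_a} and {cap_b}"


def guided_logits(cond_logits, uncond_logits, scale=3.0):
    """Classifier-free guidance at sampling time: extrapolate the
    text-conditioned logits away from the unconditioned ones."""
    return uncond_logits + scale * (cond_logits - uncond_logits)
```

With `scale > 1`, guidance sharpens the model's adherence to the text prompt at the cost of sample diversity; `scale = 1` recovers plain conditional sampling.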
Affect-based indexing and retrieval of multimedia data
Digital multimedia systems are creating many new opportunities for rapid access to content archives. In order to explore these collections using search, the content must be annotated with significant features. An important and often overlooked aspect of human interpretation of multimedia data is the affective dimension. The hypothesis of this thesis is that affective labels of content can be extracted automatically from within multimedia data streams, and that these can then be used for content-based retrieval and browsing. A novel system is presented for extracting affective features from video content and mapping them onto a set of keywords with predetermined emotional interpretations. These labels are then used to demonstrate affect-based retrieval on a range of feature films. Because of the subjective nature of the words people use to describe emotions, an approach towards an open-vocabulary query system utilizing the electronic lexical database WordNet is also presented. This gives flexibility for search queries to be extended to include keywords without predetermined emotional interpretations using a word-similarity measure. The thesis presents the framework and design for the affect-based indexing and retrieval system along with experiments, analysis, and conclusions.
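The open-vocabulary query step can be sketched generically: an arbitrary emotion word is mapped onto the system's predetermined affect labels through a word-similarity measure. The thesis uses WordNet; in the sketch below the measure is an injected callable so the example stays self-contained, and the toy similarity scores in the usage note are invented for illustration.

```python
def expand_query(query_word, affect_labels, similarity, threshold=0.5):
    """Rank the system's predetermined affect labels by similarity to an
    arbitrary query word; keep only those above a similarity threshold."""
    scored = [(label, similarity(query_word, label)) for label in affect_labels]
    kept = [(label, s) for label, s in scored if s >= threshold]
    return [label for label, s in sorted(kept, key=lambda t: -t[1])]
```

With NLTK, `similarity` could be a wrapper around WordNet path similarity between the first synsets of the two words, which yields the open-vocabulary behavior described above without requiring every query word to carry a predetermined emotional interpretation.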