9 research outputs found
End-to-end Recurrent Denoising Autoencoder Embeddings for Speaker Identification
Speech 'in-the-wild' is a handicap for speaker recognition systems due to the
variability induced by real-life conditions, such as environmental noise and
emotions in the speaker. Taking advantage of representation learning, in this
paper we aim to design a recurrent denoising autoencoder that extracts robust
speaker embeddings from noisy spectrograms to perform speaker identification.
The proposed end-to-end architecture uses a feedback loop to encode information
regarding the speaker into low-dimensional representations extracted by a
spectrogram denoising autoencoder. We employ data augmentation by additively
corrupting clean speech with real-life environmental noise and make use of a
database with real stressed speech. We show that joint optimization of the
denoiser and the speaker identification module outperforms both independent
optimization of the two modules and hand-crafted features under stress and
noise distortions.
Comment: 8 pages + 2 of references + 5 of images. Submitted on Monday 20th of July to Elsevier Signal Processing Short Communication
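For illustration only, a minimal sketch of the joint-optimization idea, assuming a GRU encoder whose bottleneck both drives a spectrogram decoder and feeds a speaker classifier; the paper's exact architecture, feedback loop and dimensions are not reproduced here:

```python
import torch
import torch.nn as nn

class JointDenoiserSpeakerID(nn.Module):
    """Denoising autoencoder whose bottleneck also feeds a speaker classifier."""
    def __init__(self, n_mels=80, emb_dim=128, n_speakers=50):
        super().__init__()
        self.encoder = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, n_mels, batch_first=True)
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, noisy):                      # noisy: (batch, frames, n_mels)
        hidden, last = self.encoder(noisy)         # per-frame codes + final state
        denoised, _ = self.decoder(hidden)         # reconstructed clean spectrogram
        logits = self.classifier(last.squeeze(0))  # speaker logits from the embedding
        return denoised, logits

model = JointDenoiserSpeakerID()
noisy = torch.randn(4, 200, 80)                    # toy noisy mel spectrograms
clean = torch.randn(4, 200, 80)                    # matching clean targets
speakers = torch.randint(0, 50, (4,))              # toy speaker labels
denoised, logits = model(noisy)
# Joint loss: denoising (MSE) plus speaker identification (cross-entropy).
loss = nn.functional.mse_loss(denoised, clean) + nn.functional.cross_entropy(logits, speakers)
loss.backward()
```

Training both terms together is what "joint optimization" means here: gradients from the speaker loss shape the denoiser's bottleneck, unlike the independently optimized baseline.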
Speaker recognition under stress conditions
Proceedings of: IberSPEECH 2018, 21-23 November 2018, Barcelona, Spain
Speaker recognition systems exhibit a decrease in performance when the input speech is not captured in optimal circumstances, for example when the user is under emotional or stress conditions. The objective of this paper is to measure the effects of stress on speech in order to ultimately mitigate its consequences on a speaker recognition task. In this paper, we develop a stress-robust speaker identification system using data selection and augmentation by means of the manipulation of the original speech utterances. Extensive experimentation has been carried out to assess the effectiveness of the proposed techniques. First, we conclude that the best performance is always obtained when naturally stressed samples are included in the training set; second, when these are not available, substituting and augmenting them with synthetically generated stress-like samples improves the performance of the system.
This work is partially supported by the Spanish Government-MinECo projects TEC2014-53390-P and TEC2017-84395-P
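As a hedged illustration of the kind of utterance manipulation the abstract refers to (the paper's actual transformations and parameter values are not specified here; raised pitch and a faster speaking rate are merely common correlates of stressed speech):

```python
import librosa

def stress_like(path, n_steps=2.0, rate=1.1):
    """Return a stress-like variant of a clean utterance (illustrative values)."""
    y, sr = librosa.load(path, sr=16000)                        # load at 16 kHz
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # raise pitch ~2 semitones
    y = librosa.effects.time_stretch(y, rate=rate)              # ~10% faster speaking rate
    return y, sr
```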
Ecology & computer audition: applications of audio technology to monitor organisms and environment
Among the 17 Sustainable Development Goals (SDGs) proposed within the 2030 Agenda and adopted by all the United Nations member states, the 13th SDG is a call for action to combat climate change. Moreover, SDGs 14 and 15 call for the protection and conservation of life below water and life on land, respectively. In this work, we provide a literature-founded overview of application areas in which computer audition, a powerful but in this context so far hardly considered technology combining audio signal processing and machine intelligence, is employed to monitor our ecosystem, with the potential to identify ecologically critical processes or states. We distinguish between applications related to organisms, such as species richness analysis and plant health monitoring, and applications related to the environment, such as melting ice monitoring or wildfire detection. This work positions computer audition in relation to alternative approaches by discussing methodological strengths and limitations, as well as ethical aspects. We conclude with an urgent call to action to the research community for a greater involvement of audio intelligence methodology in future ecosystem monitoring approaches.
Machine learning of visual attention models in the gastronomic domain
This Final Degree Project aims to develop an SVM model that predicts human visual attention in gastronomic videos. The model is trained on features and human visual fixations captured from culinary videos. To check its performance, its accuracy is evaluated against visual attention data from healthy subjects while they watch those gastronomic videos.
As the main objective is finding the best solution to our problem, several phases have been followed: recording visual attention data from the videos into a database, modelling the features best adapted to our task of interest (cooking recipes), and designing and choosing the most appropriate classifier for our problem, a support vector machine. Finally, the model has been evaluated by comparing it to reference algorithms such as Itti & Koch's and Harel & Koch's.
Ingeniería de Sistemas Audiovisuales
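A minimal sketch of the classification setup described above, with synthetic stand-ins for the per-location features and the fixated/non-fixated labels that would come from the eye-tracking data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))        # stand-in per-location feature vectors
y = rng.integers(0, 2, size=1000)      # 1 = fixated by subjects, 0 = not fixated

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
svm = SVC(kernel="rbf", probability=True).fit(X_tr, y_tr)
saliency = svm.predict_proba(X_te)[:, 1]   # per-location visual attention score
print("held-out accuracy:", svm.score(X_te, y_te))
```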
Multimodal Affective Computing in Wearable Devices with Applications in the Detection of Gender-based Violence
Mención Internacional en el título de doctor
According to the World Health Organization (WHO), 1 out of every 3 women suffers from physical or sexual violence at some point in her life, reflecting the effect of Gender-based Violence (GBV) in the world. In Spain in particular, more than 1,100 women were murdered between 2003 and 2022, victims of gender-based violence.
There is an urgent need for solutions to this prevailing problem in our society. They may involve appropriate investment in technological research, alongside legislative, educational and economic efforts. An Artificial Intelligence (AI) driven solution that comprehensively analyses aspects such as the person's emotional state together with the context or external situation (e.g. circumstances, location), and thereby automatically detects when a woman's safety is in danger, could provide an automatic and fast response to ensure women's safety.
Thus, this PhD thesis stems from the need to detect gender-based violence risk
situations for women, addressing the problem from a multidisciplinary point of
view by bringing together Artificial Intelligence (AI) technologies and a gender
perspective. More specifically, we focus on the auditory modality, analysing speech data produced by the user, given that voice can be recorded unobtrusively and can serve both as a personal identifier and as an indicator of the affective states reflected in it.
The immediate response of a human being in a situation of risk or danger is the fight-flight-freeze response. Several physiological changes affect the body, including breathing, heart rate and muscle activation, the latter involving the complex speech production apparatus and thus our vocalisation characteristics. Given all these physical and physiological changes and their involuntary nature when in a situation of risk, we considered relying on physiological signals such as pulse, perspiration and respiration, as well as speech, to detect the emotional state of a person, with the intention of recognising fear as a possible consequence of being in a threatening situation. To this end, we developed “Bindi”: an end-to-end, AI-driven, inconspicuous, connected, edge computation-based, and
wearable solution targeting the automatic detection of GBV situations. It consists
of two smart devices that monitor the physiological variables and the acoustic environment, including the voice, of an individual, connected to a smartphone and a cloud system able to call for help.
Ideally, in order to build a Machine Learning or Deep Learning system for the automatic detection of risk situations from auditory data, we would like to have speech from the target user recorded under realistic conditions.
In our first steps, we encountered the difficulty of a lack of suitable data: no speech datasets of real (not acted) fear existed, or were available, in the literature. Real, original, spontaneous, in-the-wild emotional speech is what our application ideally requires. We therefore chose stress, the emotion closest to the target scenario for which data collection is feasible, in order to flesh out the algorithms and acquire the knowledge needed. Thus, we describe and justify the use of datasets containing this emotion as the starting point of our investigation. Additionally, we describe the need to create our own set of datasets to fill this gap in the literature.
Then, members of our UC3M4Safety team captured the UC3M4Safety
Audiovisual Stimuli Database, a dataset of 42 audiovisual stimuli to elicit emotions.
Using them, we contributed to the community with the collection of WEMAC, a multi-modal dataset comprising a laboratory-based experiment in which women volunteers were exposed to the UC3M4Safety Audiovisual Stimuli Database. It aims to induce real emotions by using a virtual reality headset while the user's physiological signals, speech signals and self-reports are collected.
But recording realistic, spontaneous emotional speech under fearful conditions is very difficult, if not impossible. To get as close as possible to these conditions, and hopefully record fearful speech, the UC3M4Safety team created the WE-LIVE database. With it, we collected physiological, auditory and contextual signals from women in real-life conditions, as well as labels of their emotional reactions to everyday events in their lives, using the current Bindi system (wristband, pendant, mobile application and server).
In order to detect GBV risk situations through speech, we first need to detect the voice of the specific user we are interested in among all the information contained in the audio signal: a speaker recognition task. Thus, we aim to track the user's voice, separating it from the other speakers in the acoustic scene, while trying to avoid the influence of emotions or ambient noise, as these factors could be detrimental to the identification of the speaker.
We study speaker recognition systems under two variability conditions: 1) speaker identification under stress conditions, to see how much stress affects speaker recognition systems, and 2) speaker recognition under real-life noisy conditions, isolating the speaker's identity from all the additional information contained in the audio signal.
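For illustration, a sketch of the scoring step such a speaker recognition system typically relies on, assuming an embedding extractor is already available; the stand-in vectors and the 0.6 threshold are placeholders, not values from the thesis:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
enrolment = rng.standard_normal(128)        # stand-in embedding of the target user
segments = rng.standard_normal((10, 128))   # embeddings of incoming audio segments
scores = np.array([cosine(enrolment, s) for s in segments])
is_user = scores > 0.6                      # threshold to be tuned on validation data
```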
We also dive into the development of the Bindi system for the recognition of fear-related emotions. We describe the architectures of Bindi versions 1.0 and 2.0, the evolution from one to the other, and their implementation. We explain the approach followed for the design of a cascade multimodal system for Bindi 1.0, as well as the design of a complete Internet of Things system with edge, fog and cloud computing components for Bindi 2.0, specifically detailing how we designed the intelligence architectures in the Bindi devices for fear detection in the user.
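Purely as a sketch of the cascade/edge-fog-cloud idea just described (placeholder models and thresholds, not Bindi's actual implementation): a lightweight on-device model screens each window continuously, and only suspicious windows are escalated to a heavier remote model.

```python
def edge_score(window):
    """Cheap on-device (edge) model; placeholder score."""
    return 0.3

def cloud_score(window):
    """Heavier fog/cloud model, invoked only on escalation; placeholder score."""
    return 0.8

def cascade(window, edge_threshold=0.5, alarm_threshold=0.7):
    if edge_score(window) < edge_threshold:
        return "no-risk"          # resolved locally on the wearable, nothing transmitted
    # Suspicious window: escalate to the heavier remote model.
    return "alarm" if cloud_score(window) > alarm_threshold else "no-risk"

print(cascade(window=None))
```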
We then perform monomodal inference, first targeting the detection of realistic stress through speech. Later, as our core experimentation, we work with WEMAC on the task of fear detection using data fusion strategies. The experimental results show an average fear recognition accuracy of 63.61% with the Leave-hAlf-Subject-Out (LASO) method, a speaker-adapted, subject-dependent training classification strategy. To the best of the UC3M4Safety team's knowledge, this is the first time that a multimodal fusion of physiological and speech data has been applied to fear recognition in this GBV context. Moreover, this is the first time a LASO model combining fear recognition, multisensorial signal fusion and virtual reality stimuli has been presented. We also explored how the gender-based violence victim condition could be detected from speech paralinguistic cues alone.
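A minimal sketch of one possible late-fusion strategy consistent with the description above; the thesis evaluates several fusion strategies, and the weighted average, weights and numbers here are illustrative assumptions only:

```python
import numpy as np

def late_fusion(p_physio, p_speech, w=0.6):
    """Weighted average of per-modality fear posteriors."""
    return w * p_physio + (1.0 - w) * p_speech

p_physio = np.array([0.70, 0.20, 0.55])   # fear posteriors from the physiological model
p_speech = np.array([0.60, 0.40, 0.35])   # fear posteriors from the speech model
fused = late_fusion(p_physio, p_speech)
decision = fused > 0.5                    # fear / no-fear per analysis window
```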
Overall, this thesis explores the use of audio technology and artificial intelligence to prevent and combat gender-based violence. We hope to have lit the way in the speech community and beyond, and that our experimentation, findings and conclusions can help future research. The ultimate goal of this work is to ignite the community's interest in developing solutions to the very challenging problem of GBV.
I would like to thank the following institutions for the financial support received to complete this thesis:
• Department of Research and Innovation of Madrid Regional Authority, for the
EMPATIA-CM research project (reference Y2018/TCS-5046).
• Spanish State Research Agency, for the SAPIENTAE4Bindi Project (reference PDC2021-121071-I00) funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR".
• Community of Madrid YEI Predoctoral Program, for the Predoctoral Research
Personnel in Training (PEJD-2019-PRE/TIC-16295) scholarship.
• Spanish Ministry of Universities, for the University Teacher in Training
["Formación de Personal Universitario (FPU)"] grant FPU19/00448 and the
Supplementary Short-stay Mobility 2020 Grant for beneficiaries of the
University Teacher in Training program ["Ayudas complementarias de movilidad
Estancias Breves 2020 destinadas a beneficiarios del programa de Formación del
Profesorado Universitario (FPU)"].
• German Academic Exchange Service (DAAD), for the Short Term Grant Scholarship 2020.
Programa de Doctorado en Multimedia y Comunicaciones por la Universidad Carlos III de Madrid y la Universidad Rey Juan Carlos
Chair: Elisabeth André. Secretary: Elena Romero Perales. Member: Emilia María Parada Cabaleiro
Bindi: Affective internet of things to combat gender-based violence
The main research motivation of this article is the fight against gender-based violence and the achievement of gender equality from a technological perspective. The solution proposed in this work goes beyond currently existing panic buttons, which need to be manually operated by the victims under difficult circumstances. Instead, Bindi, our end-to-end autonomous multimodal system, relies on artificial intelligence methods to automatically identify violent situations, based on detecting fear-related emotions, and to trigger a protection protocol, if necessary. To this end, Bindi integrates modern state-of-the-art technologies, such as the Internet of Bodies, affective computing, and cyber-physical systems, leveraging: 1) affective Internet of Things (IoT) with auditory and physiological commercial off-the-shelf smart sensors embedded in wearable devices; 2) hierarchical multisensorial information fusion; and 3) the edge-fog-cloud IoT architecture. This solution is evaluated using our own dataset named WEMAC, a very recently collected and freely available collection of data comprising the auditory and physiological responses of 47 women to several emotions elicited using a virtual reality environment. On this basis, this work provides an analysis of multimodal late fusion strategies to combine the physiological and speech data processing pipelines and identify the best intelligence engine strategy for Bindi. In particular, the best data fusion strategy reports an overall fear classification accuracy of 63.61% for a subject-independent approach. Both a power consumption study and an audio data processing pipeline to detect violent acoustic events complement this analysis. This research is intended as an initial multimodal baseline that facilitates further work with real-life elicited fear in women.
This work was supported in part by the Department of Research and Innovation of Madrid Regional Authority, in the EMPATIA-CM Research Project (Reference Y2018/TCS-5046); in part by MCIN/AEI/10.13039/501100011033 under Grant PDC2021-121071-I00; in part by the European Union NextGenerationEU/PRTR; in part by the Spanish Ministry of Universities with the FPU under Grant FPU19/00448; and in part by the Madrid Government (Comunidad de Madrid-Spain) through the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M26), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).
UC3M4Safety Database - WEMAC: Audio features
EMPATIA-CM (Comprehensive Protection of Gender-based Violence Victims through Multimodal Affective Computing) is a research project that aims to broadly understand the reactions of Gender-based Violence Victims to situations of danger, generate mechanisms for the automatic detection of these situations, and study how to react in a comprehensive, coordinated and effective way to protect them in the best possible way.
Audio signals collected during a laboratory experiment to measure participants' emotional responses to 14 audiovisual stimuli.
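As the concrete feature set is not specified in this record, here is a hedged sketch of extracting log-mel features, a common representation for such audio releases; the signal, sampling rate and frame parameters below are illustrative stand-ins:

```python
import numpy as np
import librosa

sr = 16000
y = np.random.default_rng(0).standard_normal(3 * sr).astype(np.float32)  # stand-in 3 s recording
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=512, n_mels=64)
log_mel = librosa.power_to_db(mel)         # shape: (n_mels, n_frames)
```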
WEMAC: Women and Emotion Multi-modal Affective Computing dataset
Among the seventeen Sustainable Development Goals (SDGs) proposed within the
2030 Agenda and adopted by all the United Nations member states, the Fifth SDG
is a call for action to turn Gender Equality into a fundamental human right and
an essential foundation for a better world. It includes the eradication of all
types of violence against women. Within this context, the UC3M4Safety research team aims to develop Bindi: a cyber-physical system with embedded Artificial Intelligence algorithms for real-time user monitoring towards the detection of affective states, with the ultimate goal of achieving the early detection of risk situations for women. On this basis, we make use of wearable affective computing including smart sensors, data encryption for the secure and accurate collection of presumed crime evidence, as well as remote connection to protecting agents. Towards the development of such a system, the recordings of different laboratory and in-the-wild datasets are in progress. These are contained within the UC3M4Safety Database. Thus, this paper presents and details the first release of WEMAC, a novel multi-modal dataset, which comprises a laboratory-based experiment in which 47 women volunteers were exposed to validated audio-visual stimuli to induce real emotions using a virtual reality headset, while physiological signals, speech signals and self-reports were acquired. We believe this dataset will serve and assist research on multi-modal affective computing using physiological and speech information.
UC3M4Safety Database description
EMPATIA-CM (Comprehensive Protection of Gender-based Violence Victims through Multimodal Affective Computing) is a research project that aims to broadly understand the reactions of Gender-based Violence Victims to situations of danger, generate mechanisms for the automatic detection of these situations, and study how to react in a comprehensive, coordinated and effective way to protect them in the best possible way.
The project is divided into five objectives that demonstrate the need and added value of the multidisciplinary approach:
* Understand the reaction mechanisms of the Gender-based Violence Victims to risky situations.
* Investigate, design and verify algorithms to automatically detect Risk Situations in Gender-based Violence Victims.
* Design and implement the Automatic Detection System for Risk Situations in Gender-based Violence Victims.
* Investigate a new protocol to protect Gender-based Violence Victims with a holistic approach.
* Use the data collected by the System for detecting hazardous situations in Gender-based Violence Victims.
This project is funded by the Comunidad de Madrid, Consejería de Ciencia, Universidades e Innovación, under the Programa de proyectos sinérgicos de I+D en nuevas y emergentes áreas científicas en la frontera de la ciencia y de naturaleza interdisciplinar, co-financed by the Operational Programmes of the European Social Fund and the European Regional Development Fund, 2014-2020, of the Comunidad de Madrid (EMPATÍA-CM, Ref: Y2018/TCS-5046).