559 research outputs found
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose
is to extract either one or more target speech signals, respectively, from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance
MULTIMODAL EMOTION ANALYSIS WITH FOCUSED ATTENTION
Emotion analysis, a subset of sentiment analysis, involves the study of a wide array of emotional indicators. In contrast to sentiment analysis, which restricts its focus to positive and negative sentiments, emotion analysis extends beyond these limitations to a diverse spectrum of emotional cues. Contemporary trends in emotion analysis lean toward multimodal approaches that leverage audiovisual and text modalities. However, implementing multimodal strategies introduces its own set of challenges, marked by a rise in model complexity and an expansion of parameters, thereby creating a need for a larger volume of data. This thesis responds to this challenge by proposing a robust model tailored for emotion recognition, specifically focusing on leveraging audio and text data. Our approach is centered on using audio spectrogram transformers (AST), and the powerful BERT language model to extract distinctive features from both auditory and textual modalities followed by feature fusion. Despite the absence of the visual component, employed by state-of-the-art (SOTA) methods, our model demonstrates comparable performance levels achieving an f1 score of 0.67 when benchmarked against existing standards on the IEMOCAP dataset [1] which consists of 12-hour audio recordings broken down into 5255 scripted and 4784 spontaneous turns, with each turn labeled by emotions such as anger, neutral, frustration, happy, and sad. In essence, We propose a fully attention-focused multimodal approach for effective emotion analysis for relatively smaller datasets leveraging lightweight data sources like audio and text highlighting the efficacy of our proposed model. For reproducibility, the code is available at 2AI Lab’s GitHub repository: https://github.com/2ai-lab/multimodal-emotion
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field
Syväoppiminen puhutun kielen tunnistamisessa
This thesis applies deep learning based classification techniques to identify natural languages from speech. The primary motivation behind this thesis is to implement accurate techniques for segmenting multimedia materials by the languages spoken in them.
Several existing state-of-the-art, deep learning based approaches are discussed and a subset of the discussed approaches are selected for quantitative experimentation. The selected model architectures are trained on several well-known spoken language identification datasets containing several different languages. Segmentation granularity varies between models, some supporting input audio lengths of 0.2 seconds, while others require 10 second long input to make a language decision.
Results from the thesis experiments show that an unsupervised representation of acoustic units, produced by a deep sequence-to-sequence auto encoder, cannot reach the language identification performance of a supervised representation, produced by a multilingual phoneme recognizer. Contrary to most existing results, in this thesis, acoustic-phonetic language classifiers trained on labeled spectral representations outperform phonotactic classifiers trained on bottleneck features of a multilingual phoneme recognizer. More work is required, using transcribed datasets and automatic speech recognition techniques, to investigate why phoneme embeddings did not outperform simple, labeled spectral features.
While an accurate online language segmentation tool for multimedia materials could not be constructed, the work completed in this thesis provides several insights for building feasible, modern spoken language identification systems. As a side-product of the experiments performed during this thesis, a free open source spoken language identification software library called "lidbox" was developed, allowing future experiments to begin where the experiments of this thesis end.Tämä diplomityö keskittyy soveltamaan syviä neuroverkkomalleja luonnollisten kielien automaattiseen tunnistamiseen puheesta. Tämän työn ensisijainen tavoite on toteuttaa tarkka menetelmä multimediamateriaalien ositteluun niissä esiintyvien puhuttujen kielien perusteella.
Työssä tarkastellaan useampaa jo olemassa olevaa neuroverkkoihin perustuvaa lähestymistapaa, joista valitaan alijoukko tarkempaan tarkasteluun, kvantitatiivisten kokeiden suorittamiseksi. Valitut malliarkkitehtuurit koulutetaan käyttäen eri puhetietokantoja, sisältäen useampia eri kieliä. Kieliosittelun hienojakoisuus vaihtelee käytettyjen mallien mukaan, 0,2 sekunnista 10 sekuntiin, riippuen kuinka pitkän aikaikkunan perusteella malli pystyy tuottamaan kieliennusteen.
Diplomityön aikana suoritetut kokeet osoittavat, että sekvenssiautoenkoodaajalla ohjaamattomasti löydetty puheen diskreetti akustinen esitysmuoto ei ole riittävä kielen tunnistamista varten, verrattuna foneemitunnistimen tuottamaan, ohjatusti opetettuun foneemiesitysmuotoon. Tässä työssä havaittiin, että akustisfoneettiset kielentunnistusmallit saavuttavat korkeamman kielentunnistustarkkuuden kuin foneemiesitysmuotoa käyttävät kielentunnistusmallit, mikä eroaa monista kirjallisuudessa esitetyistä tuloksista. Diplomityön tutkimuksia on jatkettava, esimerkiksi litteroituja puhetietokantoja ja puheentunnistusmenetelmiä käyttäen, jotta pystyttäisiin selittämään miksi foneemimallin tuottamalla esitysmuodolla ei saatu parempia tuloksia kuin yksinkertaisemmalla, taajuusspektrin esitysmuodolla.
Tämän työn aikana puhutun kielen tunnistaminen osoittautui huomattavasti haasteellisemmaksi kuin mitä työn alussa oli arvioitu, eikä työn aikana onnistuttu toteuttamaan tarpeeksi tarkkaa multimediamateriaalien kielienosittelumenetelmää. Tästä huolimatta, työssä esitetyt lähestymistavat tarjoavat toimivia käytännön menetelmiä puhutun kielen tunnistamiseen tarkoitettujen, modernien järjestelmien rakentamiseksi. Tämän diplomityön sivutuotteena syntyi myös puhutun kielen tunnistamiseen tarkoitettu avoimen lähdekoodin kirjasto nimeltä "lidbox", jonka ansiosta tämän työn kvantitatiivisia kokeita voi jatkaa siitä, mihin ne tämän työn päätteeksi jäivät
A Survey on Deep Multi-modal Learning for Body Language Recognition and Generation
Body language (BL) refers to the non-verbal communication expressed through
physical movements, gestures, facial expressions, and postures. It is a form of
communication that conveys information, emotions, attitudes, and intentions
without the use of spoken or written words. It plays a crucial role in
interpersonal interactions and can complement or even override verbal
communication. Deep multi-modal learning techniques have shown promise in
understanding and analyzing these diverse aspects of BL. The survey emphasizes
their applications to BL generation and recognition. Several common BLs are
considered i.e., Sign Language (SL), Cued Speech (CS), Co-speech (CoS), and
Talking Head (TH), and we have conducted an analysis and established the
connections among these four BL for the first time. Their generation and
recognition often involve multi-modal approaches. Benchmark datasets for BL
research are well collected and organized, along with the evaluation of SOTA
methods on these datasets. The survey highlights challenges such as limited
labeled data, multi-modal learning, and the need for domain adaptation to
generalize models to unseen speakers or languages. Future research directions
are presented, including exploring self-supervised learning techniques,
integrating contextual information from other modalities, and exploiting
large-scale pre-trained multi-modal models. In summary, this survey paper
provides a comprehensive understanding of deep multi-modal learning for various
BL generations and recognitions for the first time. By analyzing advancements,
challenges, and future directions, it serves as a valuable resource for
researchers and practitioners in advancing this field. n addition, we maintain
a continuously updated paper list for deep multi-modal learning for BL
recognition and generation: https://github.com/wentaoL86/awesome-body-language
- …