261 research outputs found
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field
Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection
Sound events often occur in unstructured environments where they exhibit wide
variations in their frequency content and temporal structure. Convolutional
neural networks (CNN) are able to extract higher level features that are
invariant to local spectral and temporal variations. Recurrent neural networks
(RNNs) are powerful in learning the longer term temporal context in the audio
signals. CNNs and RNNs as classifiers have recently shown improved performances
over established methods in various sound recognition tasks. We combine these
two approaches in a Convolutional Recurrent Neural Network (CRNN) and apply it
on a polyphonic sound event detection task. We compare the performance of the
proposed CRNN method with CNN, RNN, and other established methods, and observe
a considerable improvement for four different datasets consisting of everyday
sound events.Comment: Accepted for IEEE Transactions on Audio, Speech and Language
Processing, Special Issue on Sound Scene and Event Analysi
Deep Spoken Keyword Spotting:An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is getting increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify main challenges
and provide recommendations for future research directions
Generative Adversarial Network with Convolutional Wavelet Packet Transforms for Automated Speaker Recognition and Classification
Speech is an effective mode of communication that always conveys abundant and pertinent information, such as the gender, accent, and other distinguishing characteristics of the speaker. These distinctive characteristics allow researchers to identify human voices using artificial intelligence (AI) techniques, which are useful for forensic voice verification, security and surveillance, electronic voice eavesdropping, mobile banking, and mobile purchasing. Deep learning (DL) and other advances in hardware have piqued the interest of researchers studying automatic speaker identification (SI). In recent years, Generative Adversarial Networks (GANs) have demonstrated exceptional ability in producing synthetic data and improving the performance of several machine learning tasks. The capacity of Convolutional Wavelet Packet Transform (CWPT) and Generative Adversarial Networks are combined in this paper to propose a novel way of enhancing the accuracy and robustness of Speaker Recognition and Classification systems. Audio signals are dissected using the Convolutional Wavelet Packet Transform into a multi-resolution, time-frequency representation that faithfully preserves local and global characteristics. The improved audio features better precisely describe speech traits and handle pitch, tone, and pronunciation variations that are frequent in speaker recognition tasks. Using GANs to create synthetic speech samples, our suggested method GAN-CWPT enriches the training data and broadens the dataset's diversity. The generator and discriminator components of the GAN architecture have been tweaked to produce realistic speech samples with attributes quite similar to genuine speaker utterances. The new dataset enhances the Speaker Recognition and Classification system's robustness and generalization, even in environments with little training data. We conduct extensive tests on standard speaker recognition datasets to determine how well our method works. The findings demonstrate that, compared to conventional methods, the GAN-CWPTs combination significantly improves speaker recognition, classification accuracy, and efficiency. Additionally, the suggested model GAN-CWPT exhibits stronger generalization on unknown speakers and excels even with loud and poor audio inputs
Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
Self-supervised learning (SSL) proficiency in speech-related tasks has driven
research into utilizing discrete tokens for speech tasks like recognition and
translation, which offer lower storage requirements and great potential to
employ natural language processing techniques. However, these studies, mainly
single-task focused, faced challenges like overfitting and performance
degradation in speech recognition tasks, often at the cost of sacrificing
performance in multi-task scenarios. This study presents a comprehensive
comparison and optimization of discrete tokens generated by various leading SSL
models in speech recognition and synthesis tasks. We aim to explore the
universality of speech discrete tokens across multiple speech tasks.
Experimental results demonstrate that discrete tokens achieve comparable
results against systems trained on FBank features in speech recognition tasks
and outperform mel-spectrogram features in speech synthesis in subjective and
objective metrics. These findings suggest that universal discrete tokens have
enormous potential in various speech-related tasks. Our work is open-source and
publicly available to facilitate research in this direction
Automatic Recognition of Non-Verbal Acoustic Communication Events With Neural Networks
Non-verbal acoustic communication is of high importance to humans and animals: Infants use the voice as a primary communication tool. Animals of all kinds employ acoustic communication, such as chimpanzees, which use pant-hoot vocalizations for long-distance communication.
Many applications require the assessment of such communication for a variety of analysis goals. Computational systems can support these areas through automatization of the assessment process. This is of particular importance in monitoring scenarios over large spatial and time scales, which are infeasible to perform manually.
Algorithms for sound recognition have traditionally been based on conventional machine learning approaches. In recent years, so-called representation learning approaches have gained increasing popularity. This particularly includes deep learning approaches that feed raw data to deep neural networks. However, there remain open challenges in applying these approaches to automatic recognition of non-verbal acoustic communication events, such as compensating for small data set sizes.
The leading question of this thesis is: How can we apply deep learning more effectively to automatic recognition of non-verbal acoustic communication events? The target communication types were specifically (1) infant vocalizations and (2) chimpanzee long-distance calls.
This thesis comprises four studies that investigated aspects of this question:
Study (A) investigated the assessment of infant vocalizations by laypersons. The central goal was to derive an infant vocalization classification scheme based on the laypersons' perception. The study method was based on the Nijmegen Protocol, where participants rated vocalization recordings through various items, such as affective ratings and class labels. Results showed a strong association between valence ratings and class labels, which was used to derive a classification scheme.
Study (B) was a comparative study on various neural network types for the automatic classification of infant vocalizations. The goal was to determine the best performing network type among the currently most prevailing ones, while considering the influence of their architectural configuration. Results showed that convolutional neural networks outperformed recurrent neural networks and that the choice of the frequency and time aggregation layer inside the network is the most important architectural choice.
Study (C) was a detailed investigation on computer vision-like convolutional neural networks for infant vocalization classification. The goal was to determine the most important architectural properties for increasing classification performance. Results confirmed the importance of the aggregation layer and additionally identified the input size of the fully-connected layers and the accumulated receptive field to be of major importance.
Study (D) was an investigation on compensating class imbalance for chimpanzee call detection in naturalistic long-term recordings. The goal was to determine which compensation method among a selected group improved performance the most for a deep learning system. Results showed that spectrogram denoising was most effective, while methods for compensating relative imbalance either retained or decreased performance.:1. Introduction
2. Foundations in Automatic Recognition of Acoustic Communication
3. State of Research
4. Study (A): Investigation of the Assessment of Infant Vocalizations by Laypersons
5. Study (B): Comparison of Neural Network Types for Automatic Classification of Infant Vocalizations
6. Study (C): Detailed Investigation of CNNs for Automatic Classification of Infant Vocalizations
7. Study (D): Compensating Class Imbalance for Acoustic Chimpanzee Detection With Convolutional Recurrent Neural Networks
8. Conclusion and Collected Discussion
9. AppendixNonverbale akustische Kommunikation ist für Menschen und Tiere von großer Bedeutung: Säuglinge nutzen die Stimme als primäres Kommunikationsmittel. Schimpanse verwenden sogenannte 'Pant-hoots' und Trommeln zur Kommunikation über weite Entfernungen.
Viele Anwendungen erfordern die Beurteilung solcher Kommunikation für verschiedenste Analyseziele. Algorithmen können solche Bereiche durch die Automatisierung der Beurteilung unterstützen. Dies ist besonders wichtig beim Monitoring langer Zeitspannen oder großer Gebiete, welche manuell nicht durchführbar sind.
Algorithmen zur Geräuscherkennung verwendeten bisher größtenteils konventionelle Ansätzen des maschinellen Lernens. In den letzten Jahren hat eine alternative Herangehensweise Popularität gewonnen, das sogenannte Representation Learning. Dazu gehört insbesondere Deep Learning, bei dem Rohdaten in tiefe neuronale Netze eingespeist werden. Jedoch gibt es bei der Anwendung dieser Ansätze auf die automatische Erkennung von nonverbaler akustischer Kommunikation ungelöste Herausforderungen, wie z.B. die Kompensation der relativ kleinen Datenmengen.
Die Leitfrage dieser Arbeit ist: Wie können wir Deep Learning effektiver zur automatischen Erkennung nonverbaler akustischer Kommunikation verwenden? Diese Arbeit konzentriert sich speziell auf zwei Kommunikationsarten: (1) vokale Laute von Säuglingen (2) Langstreckenrufe von Schimpansen.
Diese Arbeit umfasst vier Studien, welche Aspekte dieser Frage untersuchen:
Studie (A) untersuchte die Beurteilung von Säuglingslauten durch Laien. Zentrales Ziel war die Ableitung eines Klassifikationsschemas für Säuglingslaute auf der Grundlage der Wahrnehmung von Laien. Die Untersuchungsmethode basierte auf dem sogenannten Nijmegen-Protokoll. Hier beurteilten die Teilnehmenden Lautaufnahmen von Säuglingen anhand verschiedener Variablen, wie z.B. affektive Bewertungen und Klassenbezeichnungen. Die Ergebnisse zeigten eine starke Assoziation zwischen Valenzbewertungen und Klassenbezeichnungen, die zur Ableitung eines Klassifikationsschemas verwendet wurde.
Studie (B) war eine vergleichende Studie verschiedener Typen neuronaler Netzwerke für die automatische Klassifizierung von Säuglingslauten. Ziel war es, den leistungsfähigsten Netzwerktyp unter den momentan verbreitetsten Typen zu ermitteln. Hierbei wurde der Einfluss verschiedener architektonischer Konfigurationen innerhalb der Typen berücksichtigt. Die Ergebnisse zeigten, dass Convolutional Neural Networks eine höhere Performance als Recurrent Neural Networks erreichten. Außerdem wurde gezeigt, dass die Wahl der Frequenz- und Zeitaggregationsschicht die wichtigste architektonische Entscheidung ist.
Studie (C) war eine detaillierte Untersuchung von Computer Vision-ähnlichen Convolutional Neural Networks für die Klassifizierung von Säuglingslauten. Ziel war es, die wichtigsten architektonischen Eigenschaften zur Steigerung der Erkennungsperformance zu bestimmen. Die Ergebnisse bestätigten die Bedeutung der Aggregationsschicht. Zusätzlich Eigenschaften, die als wichtig identifiziert wurden, waren die Eingangsgröße der vollständig verbundenen Schichten und das akkumulierte rezeptive Feld.
Studie (D) war eine Untersuchung zur Kompensation der Klassenimbalance zur Erkennung von Schimpansenrufen in Langzeitaufnahmen. Ziel war es, herauszufinden, welche Kompensationsmethode aus einer Menge ausgewählter Methoden die Performance eines Deep Learning Systems am meisten verbessert. Die Ergebnisse zeigten, dass Spektrogrammentrauschen am effektivsten war, während Methoden zur Kompensation des relativen Ungleichgewichts die Performance entweder gleichhielten oder verringerten.:1. Introduction
2. Foundations in Automatic Recognition of Acoustic Communication
3. State of Research
4. Study (A): Investigation of the Assessment of Infant Vocalizations by Laypersons
5. Study (B): Comparison of Neural Network Types for Automatic Classification of Infant Vocalizations
6. Study (C): Detailed Investigation of CNNs for Automatic Classification of Infant Vocalizations
7. Study (D): Compensating Class Imbalance for Acoustic Chimpanzee Detection With Convolutional Recurrent Neural Networks
8. Conclusion and Collected Discussion
9. Appendi
- …