466 research outputs found

Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

Author: Geiger Jürgen
Jin Wenyu
Mousa Amr El-Desoky
Pohjalainen Jouni
Schuller Björn
Zhang Zixing
Publication date: 01/01/2018

Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

arXiv.org e-Print Archive

OPUS Augsburg

Crossref

Front-end technologies for robust ASR in reverberant environments—spectral enhancement-based dereverberation and auditory modulation filterbank features

Author: A Mohamed
A Sehr
B Atal
B Cauchi
BH Juang
BT Meyer
D Povey
EAP Habets
G Hinton
G Langner
I Kodrasi
K Lebart
KE Muller
MR Schroeder
MR Schädler
N Moritz
R Martin
SB David
T Dau
T Gerkmann
T Nakatani
T Yoshioka
Y Ephraim
Publication venue: 'Springer Science and Business Media LLC'

Crossref

SeizureNet: Multi-Spectral Deep Feature Learning for Seizure Type Classification

Author: D Silverman
KM Tsiouris
L Itti
M Längkvist
Q Lin
S Montabone
TN Alotaiby
UR Acharya
Publication date: 29/09/2020

Automatic classification of epileptic seizure types in electroencephalogram (EEG) data can enable more precise diagnosis and efficient management of the disease. This task is challenging due to factors such as low signal-to-noise ratios, signal artefacts, high variance in seizure semiology among epileptic patients, and limited availability of clinical data. To overcome these challenges, in this paper, we present SeizureNet, a deep learning framework which learns multi-spectral feature embeddings using an ensemble architecture for cross-patient seizure type classification. We used the recently released TUH EEG Seizure Corpus (V1.4.0 and V1.5.2) to evaluate the performance of SeizureNet. Experiments show that SeizureNet can reach a weighted F1 score of up to 0.94 for seizure-wise cross validation and 0.59 for patient-wise cross validation for scalp EEG based multi-class seizure type classification. We also show that the high-level feature embeddings learnt by SeizureNet considerably improve the accuracy of smaller networks through knowledge distillation for applications with low-memory constraints.
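The weighted F1 score quoted in the abstract above is, in its usual definition, the support-weighted mean of the per-class F1 scores. The following is a minimal sketch of that general definition, not the authors' evaluation code; the function name `weighted_f1` and the example labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (hypothetical helper)."""
    support = Counter(y_true)          # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1      # weight each class by its support
    return score


# Illustrative labels only; these are not data from the paper.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
result = weighted_f1(y_true, y_pred)
```

Because each class is weighted by its number of true samples, frequent classes dominate the score, which matters when class counts are imbalanced.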

arXiv.org e-Print Archive

Crossref

Detecting autism, emotions and social signals using AdaBoost

Author: Busa-Fekete Róbert
Gosztolya Gábor
Tóth László
Publication venue: Interspeech
Publication date: 01/01/2013

SZTE Publicatio Repozitórium - SZTE - Repository of Publications

Low latency modeling of temporal contexts for speech recognition

Author: Peddinti Vijayaditya
Publication venue: 'The Busan Gyeongnam Mathematical Society'
Publication date: 22/05/2018

This thesis focuses on the development of neural network acoustic models for large vocabulary continuous speech recognition (LVCSR) to satisfy the design goals of low latency and low computational complexity. Low latency enables online speech recognition, and low computational complexity helps reduce the computational cost during both training and inference. Long-span sequential dependencies and sequential distortions in the input vector sequence are a major challenge in acoustic modeling. Recurrent neural networks have been shown to effectively model these dependencies. Specifically, bidirectional long short-term memory (BLSTM) networks provide state-of-the-art performance across several LVCSR tasks. However, the deployment of bidirectional models for online LVCSR is non-trivial due to their large latency, and unidirectional LSTM models are typically preferred. In this thesis we explore the use of hierarchical temporal convolution to model long-span temporal dependencies. We propose a sub-sampled variant of these temporal convolution neural networks, termed time-delay neural networks (TDNNs). These sub-sampled TDNNs reduce the computational complexity by ~5x, compared to TDNNs, during frame-randomized pre-training. These models are shown to be effective in modeling long-span temporal contexts; however, there is a performance gap compared to (B)LSTMs. As recent advancements in acoustic model training have eliminated the need for frame-randomized pre-training, we modify the TDNN architecture to use higher sampling rates, as the increased computation can be amortized over the sequence. These variants of sub-sampled TDNNs provide performance superior to unidirectional LSTM networks, while also affording a lower real-time factor (RTF) during inference. However, we show that the BLSTM models outperform both the TDNN and LSTM models. We propose a hybrid architecture interleaving temporal convolution and LSTM layers, which is shown to outperform the BLSTM models. Further, we improve these BLSTM models by using higher frame rates at lower layers and show that the proposed TDNN-LSTM model performs similarly to these superior BLSTM models, while reducing the overall latency to 200 ms. Finally, we describe an online system for reverberation-robust ASR, using the above-described models in conjunction with other data augmentation techniques like reverberation simulation, which simulates far-field environments, and volume perturbation, which helps tackle volume variation even without gain normalization.
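The volume perturbation mentioned at the end of this abstract can be illustrated as scaling each training waveform by a random gain. This is a minimal sketch of the general idea only, not the thesis's implementation; the function name `perturb_volume` and the gain range are assumptions chosen for illustration.

```python
import random


def perturb_volume(samples, low=0.125, high=2.0, rng=None):
    """Scale a waveform by one random gain drawn uniformly from [low, high].

    The default range is an assumed, illustrative choice, not a value
    taken from the thesis.
    """
    rng = rng or random.Random()
    gain = rng.uniform(low, high)
    return [gain * s for s in samples], gain


# Illustrative usage with a toy waveform and a seeded generator.
waveform = [0.0, 0.5, -0.5, 0.25]
perturbed, gain = perturb_volume(waveform, rng=random.Random(0))
```

Applying a different gain to each utterance exposes the acoustic model to a range of input levels during training.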

Johns Hopkins University

JScholarship

Audio Deepfake Detection: A Survey

Author: Tao Jianhua
Wang Chenglong
Yi Jiangyan
Zhang Chu Yuan
Zhang Xiaohui
Zhao Yan
Publication date: 28/08/2023

Audio deepfake detection is an emerging and active topic. A growing body of literature has studied deepfake detection algorithms and achieved effective performance, yet the problem is far from solved. Although some reviews exist, there has been no comprehensive survey that provides researchers with a systematic overview of these developments with a unified evaluation. Accordingly, in this survey paper, we first highlight the key differences across various types of deepfake audio, then outline and analyse competitions, datasets, features, classifications, and evaluation of state-of-the-art approaches. For each aspect, the basic techniques, advanced developments and major challenges are discussed. In addition, we perform a unified comparison of representative features and classifiers on the ASVspoof 2021, ADD 2023 and In-the-Wild datasets for audio deepfake detection. The survey shows that future research should address the lack of large-scale datasets in the wild, the poor generalization of existing detection methods to unknown fake attacks, and the interpretability of detection results.

arXiv.org e-Print Archive