10 research outputs found
Automatic language identification using deep neural networks
Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. I. López-Moreno, J. González-DomÃnguez, P. Oldrich, D. R. MartÃnez, J. González-RodrÃguez, "Automatic language identification using deep neural networks", IEEE International Conference on Acoustics, Speech, and Signal Processing ICASSP, Florence (Italy), 2014This work studies the use of deep neural networks (DNNs)
to address automatic language identification (LID). Motivated
by their recent success in acoustic modelling, we adapt DNNs
to the problem of identifying the language of a given spoken
utterance from short-term acoustic features. The proposed approach
is compared to state-of-the-art i-vector based acoustic
systems on two different datasets: Google 5M LID corpus and
NIST LRE 2009. Results show how LID can largely benefit
from using DNNs, especially when a large amount of training
data is available. We found relative improvements up to 70%,
in Cavg, over the baseline system
Exploiting i–vector posterior covariances for short–duration language recognition
Linear models in i-vector space have shown to be an effective solution not only for speaker identification, but also for language recogniton. The i-vector extraction process, however, is affected by several factors, such as noise level, the acoustic content of the utterance and the duration of the spoken segments. These factors influence both the i-vector estimate and its uncertainty, represented by the i-vector posterior covariance matrix. Modeling of i-vector uncertainty with Probabilistic Linear Discriminant Analysis has shown to be effective for short-duration speaker identification. This paper extends the approach to language recognition, analyzing the effects of i-vector covariances on a state-of-the-art Gaussian classifier, and proposes an effective solution for the reduction of the average detection cost (Cavg) for short segments
Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks
Zazo R, Lozano-Diez A, Gonzalez-Dominguez J, T. Toledano D, Gonzalez-Rodriguez J (2016) Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks. PLoS ONE 11(1): e0146917. doi:10.1371/journal.pone.0146917Long Short Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vector and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (similar to 3s). In this contribution we present an open-source, end-to-end, LSTM RNN system running on limited computational resources (a single GPU) that outperforms a reference i-vector system on a subset of the NIST Language Recognition Evaluation (8 target languages, 3s task) by up to a 26%. This result is in line with previously published research using proprietary LSTM implementations and huge computational resources, which made these former results hardly reproducible. Further, we extend those previous experiments modeling unseen languages (out of set, OOS, modeling), which is crucial in real applications. Results show that a LSTM RNN with OOS modeling is able to detect these languages and generalizes robustly to unseen OOS languages. Finally, we also analyze the effect of even more limited test data (from 2.25s to 0.1s) proving that with as little as 0.5s an accuracy of over 50% can be achieved.This work has been supported by project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz (TEC2012-37585-C02-01), funded by Ministerio de Economia y Competitividad, Spain
Frame-by-frame language identification in short utterances using deep neural networks
This is the author’s version of a work that was accepted for publication in Neural Networks. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neural Networks, VOL 64, (2015) DOI 10.1016/j.neunet.2014.08.006This work addresses the use of deep neural networks (DNNs) in automatic language identification (LID) focused on short test utterances. Motivated by their recent success in acoustic modelling for speech recognition, we adapt DNNs to the problem of identifying the language in a given utterance from the short-term acoustic features. We show how DNNs are particularly suitable to perform LID in real-time applications, due to their capacity to emit a language identification posterior at each new frame of the test utterance. We then analyse different aspects of the system, such as the amount of required training data, the number of hidden layers, the relevance of contextual information and the effect of the test utterance duration. Finally, we propose several methods to combine frame-by-frame posteriors. Experiments are conducted on two different datasets: the public NIST Language Recognition Evaluation 2009 (3 s task) and a much larger corpus (of 5 million utterances) known as Google 5M LID, obtained from different Google Services. Reported results show relative improvements of DNNs versus the i-vector system of 40% in LRE09 3 second task and 76% in Google 5M LID
On the Use of Deep Feedforward Neural Networks for Automatic Language Identification
In this work, we present a comprehensive study on the use of deep neural networks (DNNs) for automatic language identification (LID). Motivated by the recent success of using DNNs in acoustic modeling for speech recognition, we adapt DNNs to the problem of identifying the language in a given utterance from its short-term acoustic features. We propose two different DNN- based approaches. In the first one, the DNN acts as an end-to-end LID classifier, receiving as input the speech features and providing as output the estimated probabilities of the target languages. In the second approach, the DNN is used to extract bottleneck features that are then used as inputs for a state-of-the-art i-vector system. Experiments are conducted in two different scenarios: the complete NIST Language Recognition Evaluation dataset 2009 (LRE’09) and a subset of the Voice of America (VOA) data from LRE’09, in which all languages have the same amount of training data. Results for both datasets demonstrate that the DNN-based systems significantly outperform a state-of-art i-vector system when dealing with short-duration utterances. Furthermore, the combination of the DNN-based and the classical i-vector system leads to additional performance improvements (up to 45% of relative improvement in both EER and Cavg on 3s and 10s conditions, respectively)
ADGym: Design Choices for Deep Anomaly Detection
Deep learning (DL) techniques have recently found success in anomaly
detection (AD) across various fields such as finance, medical services, and
cloud computing. However, most of the current research tends to view deep AD
algorithms as a whole, without dissecting the contributions of individual
design choices like loss functions and network architectures. This view tends
to diminish the value of preliminary steps like data preprocessing, as more
attention is given to newly designed loss functions, network architectures, and
learning paradigms. In this paper, we aim to bridge this gap by asking two key
questions: (i) Which design choices in deep AD methods are crucial for
detecting anomalies? (ii) How can we automatically select the optimal design
choices for a given AD dataset, instead of relying on generic, pre-existing
solutions? To address these questions, we introduce ADGym, a platform
specifically crafted for comprehensive evaluation and automatic selection of AD
design elements in deep methods. Our extensive experiments reveal that relying
solely on existing leading methods is not sufficient. In contrast, models
developed using ADGym significantly surpass current state-of-the-art
techniques.Comment: NeurIPS 2023. The first three authors contribute equally. Code
available at https://github.com/Minqi824/ADGy
Deep Neural Network Architectures for Large-scale, Robust and Small-Footprint Speaker and Language Recognition
Tesis doctoral inédita leÃda en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de TecnologÃa Electrónica y de las Comunicaciones. Fecha de lectura : 27-04-2017Artificial neural networks are powerful learners of the information embedded in speech signals.
They can provide compact, multi-level, nonlinear representations of temporal sequences
and holistic optimization algorithms capable of surpassing former leading paradigms. Artificial
neural networks are, therefore, a promising technology that can be used to enhance our
ability to recognize speakers and languages–an ability increasingly in demand in the context
of new, voice-enabled interfaces used today by millions of users. The aim of this thesis is to
advance the state-of-the-art of language and speaker recognition through the formulation,
implementation and empirical analysis of novel approaches for large-scale and portable
speech interfaces. Its major contributions are: (1) novel, compact network architectures
for language and speaker recognition, including a variety of network topologies based on
fully-connected, recurrent, convolutional, and locally connected layers; (2) a bottleneck combination
strategy for classical and neural network approaches for long speech sequences; (3)
the architectural design of the first, public, multilingual, large vocabulary continuous speech
recognition system; and (4) a novel, end-to-end optimization algorithm for text-dependent
speaker recognition that is applicable to a range of verification tasks. Experimental results
have demonstrated that artificial neural networks can substantially reduce the number of
model parameters and surpass the performance of previous approaches to language and
speaker recognition, particularly in the cases of long short-term memory recurrent networks
(used to model the input speech signal), end-to-end optimization algorithms (used to predict
languages or speakers), short testing utterances, and large training data collections.Las redes neuronales artificiales son sistemas de aprendizaje capaces de extraer la información
embebida en las señales de voz. Son capaces de modelar de forma eficiente secuencias
temporales complejas, con información no lineal y distribuida en distintos niveles semanticos,
mediante el uso de algoritmos de optimización integral con la capacidad potencial de mejorar
los sistemas aprendizaje automático existentes. Las redes neuronales artificiales son, pues,
una tecnologÃa prometedora para mejorar el reconocimiento automático de locutores e
idiomas; siendo el reconocimiento de de locutores e idiomas, tareas con cada vez más
demanda en los nuevos sistemas de control por voz, que ya utilizan millones de personas. Esta
tesis tiene como objetivo la mejora del estado del arte de las tecnologÃas de reconocimiento
de locutor y de idioma mediante la formulación, implementación y análisis empÃrico de
nuevos enfoques basados en redes neuronales, aplicables a dispositivos portátiles y a su uso
en gran escala. Las principales contribuciones de esta tesis incluyen la propuesta original de:
(1) arquitecturas eficientes que hacen uso de capas neuronales densas, localmente densas,
recurrentes y convolucionales; (2) una nueva estrategia de combinación de enfoques clásicos
y enfoques basados en el uso de las denominadas redes de cuello de botella; (3) el diseño del
primer sistema público de reconocimiento de voz, de vocabulario abierto y continuo, que es
además multilingüe; y (4) la propuesta de un nuevo algoritmo de optimización integral para
tareas de reconocimiento de locutor, aplicable también a otras tareas de verificación. Los
resultados experimentales extraÃdos de esta tesis han demostrado que las redes neuronales
artificiales son capaces de reducir el número de parámetros usados por los algoritmos de
reconocimiento tradicionales, asà como de mejorar el rendimiento de dichos sistemas de
forma substancial. Dicha mejora relativa puede acentuarse a través del modelado de voz
mediante redes recurrentes de memoria a largo plazo, el uso de algoritmos de optimización
integral, el uso de locuciones de evaluation de corta duración y mediante la optimización del
sistema con grandes cantidades de datos de entrenamiento
ADBench: Anomaly Detection Benchmark
Given a long list of anomaly detection algorithms developed in the last few
decades, how do they perform with regard to (i) varying levels of supervision,
(ii) different types of anomalies, and (iii) noisy and corrupted data? In this
work, we answer these key questions by conducting (to our best knowledge) the
most comprehensive anomaly detection benchmark with 30 algorithms on 57
benchmark datasets, named ADBench. Our extensive experiments (98,436 in total)
identify meaningful insights into the role of supervision and anomaly types,
and unlock future directions for researchers in algorithm selection and design.
With ADBench, researchers can easily conduct comprehensive and fair evaluations
for newly proposed methods on the datasets (including our contributed ones from
natural language and computer vision domains) against the existing baselines.
To foster accessibility and reproducibility, we fully open-source ADBench and
the corresponding results.Comment: NeurIPS 2022. All authors contribute equally and are listed
alphabetically. Code available at https://github.com/Minqi824/ADBenc