24 research outputs found

    Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks

    Full text link
    Zazo R, Lozano-Diez A, Gonzalez-Dominguez J, Toledano DT, Gonzalez-Rodriguez J (2016) Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks. PLoS ONE 11(1): e0146917. doi:10.1371/journal.pone.0146917

    Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vectors and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (~3 s). In this contribution we present an open-source, end-to-end LSTM RNN system running on limited computational resources (a single GPU) that outperforms a reference i-vector system on a subset of the NIST Language Recognition Evaluation (8 target languages, 3 s task) by up to 26%. This result is in line with previously published research that used proprietary LSTM implementations and huge computational resources, which made those former results hardly reproducible. Further, we extend those previous experiments by modeling unseen languages (out-of-set, OOS, modeling), which is crucial in real applications. Results show that an LSTM RNN with OOS modeling is able to detect these languages and generalizes robustly to unseen OOS languages. Finally, we also analyze the effect of even more limited test data (from 2.25 s down to 0.1 s), proving that with as little as 0.5 s an accuracy of over 50% can be achieved.

    This work has been supported by project CMC-V2: Caracterizacion, Modelado y Compensacion de Variabilidad en la Señal de Voz (TEC2012-37585-C02-01), funded by Ministerio de Economia y Competitividad, Spain.
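    A minimal sketch (PyTorch) of the kind of end-to-end LSTM language classifier the abstract describes: acoustic frames in, per-language posteriors out. The feature dimension, hidden size, eight target languages and the extra out-of-set (OOS) class are illustrative assumptions, not the paper's exact configuration.

        # Hypothetical end-to-end LSTM classifier for short-utterance LID.
        import torch
        import torch.nn as nn

        class LSTMLanguageID(nn.Module):
            def __init__(self, n_feats=39, hidden=512, n_languages=8, model_oos=True):
                super().__init__()
                n_classes = n_languages + (1 if model_oos else 0)  # extra out-of-set class
                self.lstm = nn.LSTM(n_feats, hidden, batch_first=True)
                self.out = nn.Linear(hidden, n_classes)

            def forward(self, frames):              # frames: (batch, time, n_feats)
                _, (h_last, _) = self.lstm(frames)  # keep the last hidden state
                return self.out(h_last[-1])         # logits over languages (+ OOS)

        # Example: a batch of ~3 s utterances (about 300 frames at 10 ms per frame).
        model = LSTMLanguageID()
        utterances = torch.randn(4, 300, 39)
        posteriors = model(utterances).softmax(dim=-1)  # shape (4, 9)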

    An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition.

    No full text
    Language recognition systems based on bottleneck features have recently become the state-of-the-art in this research field, as shown by their success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e. trained for the task of automatic speech recognition (ASR). This DNN aims to compress information in one of its layers, known as the bottleneck (BN) layer, which is used to obtain a new frame representation of the audio signal. This representation has been proven to be useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system, instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies analyzing in a systematic way how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill in this gap, analyzing language recognition results with different topologies for the DNN used to extract the bottleneck features, and comparing them against a reference system based on a more classical cepstral representation of the input signal with a total variability model. This way, we obtain useful knowledge about how the DNN configuration influences the performance of bottleneck feature-based language recognition systems.
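    A minimal sketch (PyTorch) of the bottleneck-feature idea described above: a feedforward DNN trained on phonetic targets (the ASR task), with one narrow hidden layer whose activations replace cepstral features as the frame representation fed to the language recognition backend. The input context, layer widths, bottleneck size and number of phonetic targets are assumptions for illustration, not the topologies studied in the paper.

        # Hypothetical ASR-trained DNN with a bottleneck (BN) layer used as a feature extractor.
        import torch
        import torch.nn as nn

        class BottleneckDNN(nn.Module):
            def __init__(self, n_input=440, n_hidden=1500, n_bottleneck=80, n_phones=3000):
                super().__init__()
                self.front = nn.Sequential(
                    nn.Linear(n_input, n_hidden), nn.Sigmoid(),
                    nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
                )
                self.bottleneck = nn.Linear(n_hidden, n_bottleneck)   # BN layer
                self.back = nn.Sequential(
                    nn.Sigmoid(),
                    nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
                    nn.Linear(n_hidden, n_phones),                    # phonetic-unit targets
                )

            def forward(self, frames):                 # used while training for ASR
                return self.back(self.bottleneck(self.front(frames)))

            def extract_bn_features(self, frames):     # frame-level features for LID
                with torch.no_grad():
                    return self.bottleneck(self.front(frames))

        dnn = BottleneckDNN()
        stacked_frames = torch.randn(100, 440)                 # e.g. 11 stacked frames x 40 dims
        bn_features = dnn.extract_bn_features(stacked_frames)  # (100, 80), replaces MFCC input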

    Phoneme frame accuracy of DNN (upper part of the figure) and language recognition systems (lower part) for different test durations (3, 10 and 30s) when the bottleneck layer size (number of hidden units) varies.

    No full text

    Representation of language recognition system structure.

    No full text
    This is a graphical representation of the language recognition systems, both the reference (cepstral feature-based) system and the bottleneck feature-based system.

    Cluster of target languages and approximate amount of data per language in the NIST LRE 2015 training dataset.

    No full text

    Example of DNN architecture with bottleneck layer.

    No full text
    This is a graphical representation of the topology of a DNN with a BN layer, whose outputs (activation values) are used as input feature vectors for the language recognition system.

    Phoneme frame accuracy of DNN (upper part of the figure) and language recognition systems (lower part) for different test durations (3, 10 and 30s) when the bottleneck layer moves from first to fourth layer in a four hidden layer topology.

    No full text

    Phoneme frame accuracy of DNN (upper part of the figure) and language recognition systems (lower part) for different test durations (3, 10 and 30s) with different number of hidden layers of the DNN.

    No full text

    Deep neural network (DNN).

    No full text
    This is a graphical representation of a standard feedforward DNN architecture. The DNN is fed with an input vector x of dimension D, which is transformed by the hidden layers h_j (each composed of N_j hidden units) according to a function g and the parameters of the DNN (weight matrices W and bias vectors b). Finally, the output layer O provides the output of the DNN for the target task (in the case of classification, the probability that an input vector belongs to each class C).
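    A minimal NumPy sketch of the feedforward computation this caption describes: each hidden layer applies h_j = g(W_j h_{j-1} + b_j) and the output layer applies a softmax to give per-class probabilities. The sigmoid non-linearity and the layer sizes are illustrative assumptions.

        # Hypothetical forward pass of a feedforward DNN classifier.
        import numpy as np

        def g(a):                                   # hidden-layer non-linearity (sigmoid assumed)
            return 1.0 / (1.0 + np.exp(-a))

        def dnn_forward(x, weights, biases):
            """x: input vector of dimension D; weights/biases: one (W_j, b_j) pair per layer."""
            h = x
            for W, b in zip(weights[:-1], biases[:-1]):
                h = g(W @ h + b)                    # hidden layers h_j
            o = weights[-1] @ h + biases[-1]        # output layer O (logits)
            return np.exp(o - o.max()) / np.exp(o - o.max()).sum()   # probabilities over classes C

        rng = np.random.default_rng(0)
        sizes = [40, 256, 256, 10]                  # D = 40 inputs, two hidden layers, C = 10 classes
        Ws = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
        bs = [np.zeros(m) for m in sizes[1:]]
        probs = dnn_forward(rng.standard_normal(40), Ws, bs)   # sums to 1 over the 10 classes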