Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks
Zazo R, Lozano-Diez A, Gonzalez-Dominguez J, T. Toledano D, Gonzalez-Rodriguez J (2016) Language Identification in Short Utterances Using Long Short-Term Memory (LSTM) Recurrent Neural Networks. PLoS ONE 11(1): e0146917. doi:10.1371/journal.pone.0146917

Long Short-Term Memory (LSTM) Recurrent Neural Networks (RNNs) have recently outperformed other state-of-the-art approaches, such as i-vectors and Deep Neural Networks (DNNs), in automatic Language Identification (LID), particularly when dealing with very short utterances (~3 s). In this contribution we present an open-source, end-to-end LSTM RNN system running on limited computational resources (a single GPU) that outperforms a reference i-vector system on a subset of the NIST Language Recognition Evaluation (8 target languages, 3 s task) by up to 26%. This result is in line with previously published research using proprietary LSTM implementations and huge computational resources, which made those earlier results hardly reproducible. Further, we extend those previous experiments to the modeling of unseen languages (out-of-set, OOS, modeling), which is crucial in real applications. Results show that an LSTM RNN with OOS modeling is able to detect these languages and generalizes robustly to unseen OOS languages. Finally, we also analyze the effect of even more limited test data (from 2.25 s down to 0.1 s), showing that with as little as 0.5 s an accuracy of over 50% can be achieved.

This work has been supported by project CMC-V2: Caracterización, Modelado y Compensación de Variabilidad en la Señal de Voz (TEC2012-37585-C02-01), funded by Ministerio de Economía y Competitividad, Spain.
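To make the end-to-end setup described in the abstract concrete, the following is a minimal sketch (not the authors' implementation) of an LSTM-based language identifier: acoustic feature frames are processed by an LSTM and the final hidden state is mapped to per-language scores. All names, layer sizes, and the "8 targets + 1 out-of-set class" choice are illustrative assumptions, written here in PyTorch.

```python
# Minimal sketch of an end-to-end LSTM language identifier (illustrative only).
import torch
import torch.nn as nn

class LSTMLanguageID(nn.Module):
    def __init__(self, feat_dim=20, hidden_dim=512, n_languages=8, model_oos=True):
        super().__init__()
        n_classes = n_languages + (1 if model_oos else 0)  # optional out-of-set class
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, n_classes)

    def forward(self, frames):
        # frames: (batch, n_frames, feat_dim); ~300 frames for a 3 s utterance
        # assuming a 10 ms frame shift
        _, (h_last, _) = self.lstm(frames)
        return self.classifier(h_last[-1])  # unnormalized per-language scores

# Example: score a small batch of 3 s utterances with random features
scores = LSTMLanguageID()(torch.randn(4, 300, 20))
posteriors = scores.softmax(dim=-1)
```

Scoring shorter utterances (e.g. 0.5 s) simply means feeding fewer frames; no architectural change is needed, which is one attraction of the end-to-end recurrent approach.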
An analysis of the influence of deep neural network (DNN) topology in bottleneck feature based language recognition.
Language recognition systems based on bottleneck features have recently become the state-of-the-art in this research field, showing their success in the last Language Recognition Evaluation (LRE 2015) organized by NIST (U.S. National Institute of Standards and Technology). This type of system is based on a deep neural network (DNN) trained to discriminate between phonetic units, i.e. trained for the task of automatic speech recognition (ASR). This DNN aims to compress information in one of its layers, known as the bottleneck (BN) layer, which is used to obtain a new frame representation of the audio signal. This representation has been proven to be useful for the task of language identification (LID). Thus, bottleneck features are used as input to the language recognition system, instead of a classical parameterization of the signal based on cepstral feature vectors such as MFCCs (Mel Frequency Cepstral Coefficients). Despite the success of this approach in language recognition, there is a lack of studies analyzing in a systematic way how the topology of the DNN influences the performance of bottleneck feature-based language recognition systems. In this work, we try to fill this gap by analyzing language recognition results obtained with different topologies of the DNN used to extract the bottleneck features, comparing them with each other and against a reference system based on a more classical cepstral representation of the input signal with a total variability model. This way, we obtain useful knowledge about how the DNN configuration influences the performance of bottleneck feature-based language recognition systems.
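The pipeline described above can be summarized with a small sketch (again not the paper's implementation): a feedforward DNN is trained to classify phonetic units for ASR, one of its layers is deliberately narrow, and the activations of that bottleneck layer are later extracted as frame-level features for the LID back-end. Layer sizes, the 80-unit bottleneck, and the sigmoid nonlinearity are assumptions for illustration, written in PyTorch.

```python
# Sketch of a phoneme-classification DNN with a bottleneck (BN) layer
# whose activations replace MFCCs as LID input features (illustrative only).
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    def __init__(self, input_dim=440, hidden_dim=1500, bottleneck_dim=80, n_phone_states=3000):
        super().__init__()
        self.front = nn.Sequential(               # hidden layers before the BN layer
            nn.Linear(input_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, hidden_dim), nn.Sigmoid(),
        )
        self.bottleneck = nn.Linear(hidden_dim, bottleneck_dim)  # narrow BN layer
        self.back = nn.Sequential(                # hidden layers after the BN layer
            nn.Sigmoid(),
            nn.Linear(bottleneck_dim, hidden_dim), nn.Sigmoid(),
            nn.Linear(hidden_dim, n_phone_states),  # phonetic-unit outputs (ASR task)
        )

    def forward(self, x):
        # Used during ASR training: scores over phonetic units
        return self.back(self.bottleneck(self.front(x)))

    def extract_bn_features(self, x):
        # Used afterwards: frame-level bottleneck features for the LID system
        with torch.no_grad():
            return self.bottleneck(self.front(x))
```

The topology questions studied in the paper (bottleneck size, number of hidden layers, position of the BN layer) correspond to varying `bottleneck_dim`, the depth of `front`/`back`, and where `self.bottleneck` sits in the stack.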
Phoneme frame accuracy of DNN (upper part of the figure) and language recognition systems (lower part) for different test durations (3, 10 and 30s) when the bottleneck layer size (number of hidden units) varies.
Representation of language recognition system structure.
This is a graphical representation of the language recognition systems: both the reference (cepstral feature-based) system and the bottleneck feature-based system.
Cluster of target languages and approximate amount of data per language in the NIST LRE 2015 training dataset.
Example of DNN architecture with bottleneck layer.
This is a graphical representation of the topology of a DNN with a BN layer, whose outputs (activation values) are used as input feature vectors for the language recognition system.
Phoneme frame accuracy of DNN (upper part of the figure) and language recognition systems (lower part) for different test durations (3, 10 and 30s) with different numbers of hidden layers in the DNN.
Phoneme frame accuracy of DNN (upper part of the figure) and language recognition systems (lower part) for different test durations (3, 10 and 30s) when the bottleneck layer moves from the first to the fourth layer in a four-hidden-layer topology.
Language recognition performance (average EER of all clusters) for the evaluation data of NIST LRE 2015.
Deep neural network (DNN).
This is a graphical representation of a standard feedforward DNN architecture. The DNN is fed with an input vector x of dimension D, which is transformed by the hidden layers h_j (composed of N_j hidden units) according to a function g and the parameters of the DNN (weight matrices W and bias vectors b). Finally, the output layer O provides the output of the DNN for the target task (in the case of classification, the probability that an input vector belongs to each class C).
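The forward pass described in this caption can be written out explicitly: each hidden layer computes h_j = g(W_j h_{j-1} + b_j), and the output layer O applies a softmax to give class probabilities. The sketch below uses the same symbols; the sigmoid choice for g and all dimensions are assumptions for illustration.

```python
# Sketch of the feedforward DNN forward pass from the caption (illustrative only).
import numpy as np

def g(z):                       # hidden-unit nonlinearity (sigmoid assumed)
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_forward(x, weights, biases):
    # x: input vector of dimension D; weights/biases: one (W_j, b_j) pair per layer
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers: h_j = g(W_j h_{j-1} + b_j)
        h = g(W @ h + b)
    return softmax(weights[-1] @ h + biases[-1])  # output layer O: P(class C | x)

# Example: D=10 input, two hidden layers of 16 units, 5 output classes
D, N1, N2, C = 10, 16, 16, 5
rng = np.random.default_rng(0)
Ws = [rng.standard_normal(s) for s in [(N1, D), (N2, N1), (C, N2)]]
bs = [np.zeros(n) for n in (N1, N2, C)]
probs = dnn_forward(rng.standard_normal(D), Ws, bs)
```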