A Neural Network Approach for Mixing Language Models
The performance of Neural Network (NN)-based language models is steadily
improving due to the emergence of new architectures, which are able to learn
different natural language characteristics. This paper presents a novel
framework, which shows that a significant improvement can be achieved by
combining different existing heterogeneous models in a single architecture.
This is done through 1) a feature layer, which separately learns different
NN-based models and 2) a mixture layer, which merges the resulting model
features. In doing so, this architecture benefits from the learning
capabilities of each model with no noticeable increase in the number of model
parameters or the training time. Extensive experiments conducted on the Penn
Treebank (PTB) and the Large Text Compression Benchmark (LTCB) corpus showed a
significant reduction of the perplexity when compared to state-of-the-art
feedforward as well as recurrent neural network architectures.
Comment: Published at IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) 2017. arXiv admin note: text overlap with
arXiv:1703.0806
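To make the two-layer structure above concrete, the following NumPy sketch merges the hidden features of a feedforward and a recurrent component in a single mixture layer before one shared softmax. All layer sizes, the weight initializations, and the specific combination rule (concatenation followed by a learned projection) are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Minimal forward-pass sketch: a feature layer holding two component models
# and a mixture layer that merges their features before one shared softmax.
rng = np.random.default_rng(0)
V, d_emb, d_fnn, d_rnn, d_mix = 10_000, 100, 200, 200, 300  # illustrative sizes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Feature layer parameters (one block per component model).
W_fnn = rng.standard_normal((4 * d_emb, d_fnn)) * 0.01        # FNN over the last 4 words
W_rnn = rng.standard_normal((d_emb + d_rnn, d_rnn)) * 0.01    # simple recurrent cell
# Mixture layer and shared output layer.
W_mix = rng.standard_normal((d_fnn + d_rnn, d_mix)) * 0.01
W_out = rng.standard_normal((d_mix, V)) * 0.01

def mixed_lm_step(hist_emb, cur_emb, h_prev):
    """hist_emb: concatenated embeddings of the last 4 words; cur_emb: current word."""
    f = np.tanh(hist_emb @ W_fnn)                              # FNN features
    h = np.tanh(np.concatenate([cur_emb, h_prev]) @ W_rnn)     # RNN features
    m = np.tanh(np.concatenate([f, h]) @ W_mix)                # mixture layer
    return softmax(m @ W_out), h                               # next-word distribution, new state

probs, h = mixed_lm_step(rng.standard_normal(4 * d_emb),
                         rng.standard_normal(d_emb),
                         np.zeros(d_rnn))
```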
A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models
Training large vocabulary Neural Network Language Models (NNLMs) is a
difficult task due to the explicit requirement of the output layer
normalization, which typically involves the evaluation of the full softmax
function over the complete vocabulary. This paper proposes a Batch Noise
Contrastive Estimation (B-NCE) approach to alleviate this problem. This is
achieved by reducing the vocabulary, at each time step, to the target words in
the batch and then replacing the softmax by the noise contrastive estimation
approach, where these words play the role of targets and noise samples at the
same time. In doing so, the proposed approach can be fully formulated and
implemented using optimal dense matrix operations. Applying B-NCE to train
different NNLMs on the Large Text Compression Benchmark (LTCB) and the One
Billion Word Benchmark (OBWB) shows a significant reduction of the training
time with no noticeable degradation of the models' performance. This paper also
presents a new baseline comparative study of different standard NNLMs on the
large OBWB on a single Titan-X GPU.
Comment: Accepted for publication at INTERSPEECH'1
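The following rough NumPy sketch illustrates the batch-restricted scoring idea: each hidden state is scored only against the output embeddings of the words that occur as targets in the current batch, so the full-vocabulary softmax is replaced by a single dense B x B matrix product. The NCE correction term and the uniform noise approximation used here are assumptions; the paper's exact objective may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, B = 50_000, 256, 128                       # vocab, hidden size, batch size

E_out = rng.standard_normal((V, d)) * 0.01       # full output embedding matrix
H = rng.standard_normal((B, d))                  # hidden states for the batch
targets = rng.integers(0, V, size=B)             # target word id per batch row

def batch_nce_loss(H, targets, E_out, k=None):
    k = k or (len(targets) - 1)                  # the other batch targets act as noise samples
    E_b = E_out[targets]                         # (B, d): only the batch vocabulary
    scores = H @ E_b.T                           # (B, B) dense score matrix
    # NCE posterior sigmoid(score - log(k * q)), with q approximated as uniform over the batch.
    logits = scores - np.log(k / len(targets))
    p = 1.0 / (1.0 + np.exp(-logits))
    eye = np.eye(len(targets), dtype=bool)
    # Diagonal entries are the true targets ("data"); off-diagonal entries are treated as noise.
    loss = -(np.log(p[eye] + 1e-9).sum() + np.log(1.0 - p[~eye] + 1e-9).sum())
    return loss / len(targets)

print(batch_nce_loss(H, targets, E_out))
```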
Sequential estimation techniques and application to multiple speaker tracking and language modeling
For many real-world applications, the considered data is given as a time sequence that becomes available in an orderly fashion, where the order incorporates important information about the entities of interest. The work presented in this thesis deals with two such cases by introducing new sequential estimation solutions. More precisely, we introduce:
I. A sequential Bayesian estimation framework to solve the multiple speaker localization, detection and tracking problem. This framework is a complete pipeline that includes 1) new observation estimators, which extract a fixed number of potential locations per time frame; 2) new unsupervised Bayesian detectors, which classify these estimates into noise/speaker classes; and 3) new Bayesian filters, which use the speaker-class estimates to track multiple speakers. This framework was developed to tackle the low detection rate of overlapping speakers and to reduce the number of constraints generally imposed in standard solutions.
II. A sequential neural estimation framework for language modeling, which overcomes some of the shortcomings of standard approaches by merging different models in a hybrid architecture. That is, we introduce two solutions that tightly merge particular models and then show how a generalization can be achieved through a new mixture model. To speed up the training of large-vocabulary language models, we introduce a new extension of the noise contrastive estimation approach to batch training.
Sequential Recurrent Neural Networks for Language Modeling
Feedforward Neural Network (FNN)-based language models estimate the
probability of the next word based on the history of the last N words, whereas
Recurrent Neural Networks (RNN) perform the same task based only on the last
word and some context information that cycles in the network. This paper
presents a novel approach, which bridges the gap between these two categories
of networks. In particular, we propose an architecture which takes advantage of
the explicit, sequential enumeration of the word history in FNN structure while
enhancing each word representation at the projection layer through recurrent
context information that evolves in the network. The context integration is
performed using an additional word-dependent weight matrix that is also learned
during the training. Extensive experiments conducted on the Penn Treebank (PTB)
and the Large Text Compression Benchmark (LTCB) corpus showed a significant
reduction of the perplexity when compared to state-of-the-art feedforward as
well as recurrent neural network architectures.
Comment: Published (INTERSPEECH 2016), 5 pages, 3 figures, 4 tables
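A minimal NumPy sketch of the idea described above, assuming the recurrent context is injected into each of the last N projected word embeddings through its own small weight matrix and is itself updated from the hidden layer; the sizes and the exact update rule are illustrative, not the published layer definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, d_emb, d_ctx, d_hid = 10_000, 4, 100, 100, 200

E = rng.standard_normal((V, d_emb)) * 0.01               # input word embeddings
U = rng.standard_normal((N, d_ctx, d_emb)) * 0.01        # per-history-position context matrices
W_h = rng.standard_normal((N * d_emb, d_hid)) * 0.01     # FNN-style hidden layer
W_c = rng.standard_normal((d_hid, d_ctx)) * 0.01         # recurrent context update
W_o = rng.standard_normal((d_hid, V)) * 0.01             # output layer

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srnn_step(history_ids, ctx):
    """history_ids: the last N word ids; ctx: recurrent context carried over time."""
    proj = [E[w] + ctx @ U[i] for i, w in enumerate(history_ids)]  # context-enhanced projections
    h = np.tanh(np.concatenate(proj) @ W_h)                        # hidden layer over the full history
    new_ctx = np.tanh(h @ W_c)                                     # context evolves in the network
    return softmax(h @ W_o), new_ctx

probs, ctx = srnn_step([12, 7, 993, 4], np.zeros(d_ctx))
```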
A Multiple Hypothesis Gaussian Mixture Filter for Acoustic Source Localization and Tracking
In this work, we address the problem of tracking an acoustic source based on measured time differences of arrival (TDOA). The classical solution to this problem consists in using a detector, which estimates the TDOA for each microphone pair, and then applying a tracking algorithm, which integrates the "measured" TDOAs in time. Such a two-stage approach presumes 1) that TDOAs can be estimated reliably; and 2) that the errors in detection behave in a well-defined fashion. The presence of noise and reverberation, however, causes larger errors in the TDOA estimates and, thereby, ultimately lowers the tracking performance. We propose to counteract this effect by considering a multiple hypothesis filter, which propagates the TDOA estimation uncertainty to the tracking stage. That is achieved by considering multiple TDOA estimates and then integrating the resulting TDOA observations in the framework of a Gaussian mixture filter. Experimental results show that the proposed filter has a significantly lower angular error than a multiple hypothesis particle filter
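The following toy NumPy sketch illustrates the multiple-hypothesis Gaussian mixture filtering idea on a scalar TDOA state: every mixture component is updated against every candidate TDOA of the current frame and re-weighted by its likelihood, with crude pruning to bound the mixture size. The random-walk motion model, the noise levels, and the pruning rule are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
q_var, r_var, max_comp = 1e-4, 1e-3, 8          # process noise, measurement noise, pruning limit

def gm_filter_step(weights, means, variances, tdoa_candidates):
    # Predict: a random-walk transition inflates each component's variance.
    variances = variances + q_var
    new_w, new_m, new_v = [], [], []
    for w, m, v in zip(weights, means, variances):
        for z in tdoa_candidates:                # one hypothesis per candidate TDOA
            s = v + r_var                        # innovation variance
            k = v / s                            # Kalman gain
            lik = np.exp(-0.5 * (z - m) ** 2 / s) / np.sqrt(2 * np.pi * s)
            new_w.append(w * lik)
            new_m.append(m + k * (z - m))
            new_v.append((1 - k) * v)
    w = np.array(new_w); w /= w.sum()
    m, v = np.array(new_m), np.array(new_v)
    keep = np.argsort(w)[-max_comp:]             # crude pruning to bound the mixture size
    w = w[keep] / w[keep].sum()
    return w, m[keep], v[keep]

# One frame with three candidate TDOAs (seconds), starting from a single component.
w, m, v = gm_filter_step(np.array([1.0]), np.array([0.0]), np.array([1e-2]),
                         tdoa_candidates=[2.1e-4, -3.0e-4, 2.3e-4])
print((w * m).sum())                             # MMSE estimate of the TDOA
```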
Towards a World-English Language Model for On-Device Virtual Assistants
Neural Network Language Models (NNLMs) for Virtual Assistants (VAs) are
generally language-, region-, and in some cases, device-dependent, which
increases the effort to scale and maintain them. Combining NNLMs for one or
more of the categories is one way to improve scalability. In this work, we
combine regional variants of English to build a "World English" NNLM for
on-device VAs. In particular, we investigate the application of adapter
bottlenecks to model dialect-specific characteristics in our existing
production NNLMs and enhance the multi-dialect baselines. We find that
adapter modules are more effective in modeling dialects than specializing
entire sub-networks. Based on this insight and leveraging the design of our
production models, we introduce a new architecture for World English NNLM that
meets the accuracy, latency, and memory constraints of our single-dialect
models.
Comment: Accepted in ICASSP 202
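As a sketch of the adapter-bottleneck idea mentioned above, assuming the common down-project / nonlinearity / up-project design with a residual connection and one adapter per dialect on top of a shared layer; the dimensions and the set of dialects are illustrative assumptions, not the production configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 512, 32
dialects = ["en_US", "en_GB", "en_IN"]           # illustrative dialect set

# One small adapter (down-projection and up-projection) per dialect.
adapters = {
    name: (rng.standard_normal((d_model, d_bottleneck)) * 0.01,
           rng.standard_normal((d_bottleneck, d_model)) * 0.01)
    for name in dialects
}

def adapter_forward(h, dialect):
    """Apply the dialect-specific bottleneck on top of a shared hidden state h."""
    W_down, W_up = adapters[dialect]
    return h + np.tanh(h @ W_down) @ W_up        # residual keeps the shared behaviour intact

h_shared = rng.standard_normal(d_model)          # output of a shared NNLM layer
h_gb = adapter_forward(h_shared, "en_GB")
```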
QUANTIFYING THE BENEFITS OF SPEECH RECOGNITION FOR AN AIR TRAFFIC MANAGEMENT APPLICATION
Abstract: The project AcListant® (Active Listening Assistant), which uses
automatic speech recognition to recognize the commands in air traffic controller-to-pilot
communication, has achieved command recognition rates above 95%. These
high rates were obtained with Assistance-Based Speech Recognition (ABSR). An
Arrival Manager (AMAN) cannot exactly predict the next actions of a controller, but
it knows which commands are plausible in the current situation and which are not.
Therefore, the AMAN generates a set of possible commands every 20 seconds,
which serves as context information for the speech recognizer.
Different validation trials have been performed with controllers from Düsseldorf,
Frankfurt, Munich, Prague and Vienna in DLR's air traffic simulator in
Braunschweig from 2014 to 2015. Decision makers at air navigation service providers
(ANSPs) are not primarily interested in high recognition rates or, equivalently, low error
rates; they are interested in reducing costs and effort. Therefore, the validation
trials performed at the end of 2015 aimed at quantifying the benefits of
using speech recognition with respect to both efficiency and controller workload.
The paper describes the experiments performed to show that, with ABSR support,
controller workload for radar label maintenance could be reduced by a factor of three
and that ABSR enables fuel savings of 50 to 65 liters per flight.
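A toy Python illustration of how such AMAN context information could bias a recognizer, assuming a simple rescoring scheme in which hypotheses that match currently plausible commands receive a score boost; the command strings, scores, and boost value are invented, and the actual AcListant integration is more involved.

```python
# Set of commands the arrival manager currently considers plausible,
# refreshed roughly every 20 seconds (contents invented for illustration).
plausible_commands = {
    "DLH123 DESCEND FL120",
    "DLH123 REDUCE 220 KNOTS",
    "AFR456 TURN LEFT HEADING 210",
}

def rescore(hypotheses, boost=5.0):
    """hypotheses: list of (command text, recognizer score); higher score wins."""
    return max(hypotheses,
               key=lambda h: h[1] + (boost if h[0] in plausible_commands else 0.0))

best = rescore([("DLH123 DESCEND FL100", 11.2),
                ("DLH123 DESCEND FL120", 10.8)])
print(best[0])   # the contextually plausible command is selected
```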