Porting concepts from DNNs back to GMMs
Deep neural networks (DNNs) have been shown to outperform Gaussian mixture models (GMMs) on a variety of speech recognition benchmarks. In this paper we analyze the differences between the DNN and GMM modeling techniques and port the best ideas from DNN-based modeling to a GMM-based system. By going both deep (multiple layers) and wide (multiple parallel sub-models) and by sharing model parameters, we are able to close the gap between the two modeling techniques on the TIMIT database. Since the 'deep' GMMs retain the maximum-likelihood trained Gaussians as the first layer, advanced techniques such as speaker adaptation and model-based noise robustness can be readily incorporated. Despite their similarities, the DNNs and the deep GMMs still show a sufficient amount of complementarity to allow effective system combination.
Dual Language Models for Code Switched Speech Recognition
In this work, we present a simple and elegant approach to language modeling
for bilingual code-switched text. Since code-switching is a blend of two or
more different languages, a standard bilingual language model can be improved
upon by using structures of the monolingual language models. We propose a novel
technique called dual language models, which involves building two
complementary monolingual language models and combining them using a
probabilistic model for switching between the two. We evaluate the efficacy of
our approach using a conversational Mandarin-English speech corpus. We prove
the robustness of our model by showing significant improvements in perplexity
measures over the standard bilingual language model without the use of any
external information. Similar consistent improvements are also reflected in
automatic speech recognition error rates. Comment: Accepted at Interspeech 201
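The idea of combining two monolingual language models through a probabilistic switching model can be illustrated with a minimal sketch. This is not the authors' construction: the function name `dual_lm_logprob`, the single fixed `p_switch`, the restart of the history after a switch, and the assumption that the starting language is given are all simplifications made here for illustration.

```python
import math
from typing import Callable, Dict, List, Optional, Tuple

# A monolingual LM is any function prob(word, prev_word) -> probability.
# prev_word is None at the start of a same-language run (assumption).
MonoLM = Callable[[str, Optional[str]], float]


def dual_lm_logprob(
    tokens: List[Tuple[str, str]],
    lms: Dict[str, MonoLM],
    p_switch: float,
) -> float:
    """Log-probability of a code-switched sequence of (word, language) pairs.

    Hypothetical simplification: staying in the same language costs
    (1 - p_switch), switching costs p_switch, and after a switch the
    other monolingual model starts from an empty history.
    """
    if not 0.0 < p_switch < 1.0:
        raise ValueError("p_switch must lie strictly between 0 and 1")

    logp = 0.0
    prev_word: Optional[str] = None
    prev_lang: Optional[str] = None
    for word, lang in tokens:
        if prev_lang is None:
            history = None  # first token: starting language assumed given
        elif lang == prev_lang:
            logp += math.log(1.0 - p_switch)  # stay in the same language
            history = prev_word
        else:
            logp += math.log(p_switch)  # switch to the other language
            history = None
        logp += math.log(lms[lang](word, history))
        prev_word, prev_lang = word, lang
    return logp
```

Any pair of monolingual models exposing the `prob(word, prev_word)` interface can be plugged in, so the switching model stays separate from the per-language models.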
Adaptive smartphone-based sensor fusion for estimating competitive rowing kinematic metrics.
Competitive rowing highly values boat position and velocity data for real-time feedback during training, racing and post-training analysis. The ubiquity of smartphones with embedded position (GPS) and motion (accelerometer) sensors motivates their possible use in these tasks. In this paper, we investigate the use of two real-time digital filters to achieve highly accurate yet reasonably priced measurements of boat speed and distance traveled. Both filters combine acceleration and location data to estimate boat distance and speed; the first uses a complementary, frequency-response-based filter technique, and the second uses a Kalman filter formalism that includes adaptive, real-time estimates of effective accelerometer bias. The estimates of distance and speed from both filters were validated and compared with accurate reference data from a differential GPS system with better than 1 cm precision and a 5 Hz update rate, in experiments using two subjects (an experienced club-level rower and an elite rower) in two different boats on a 300 m course. Compared with single-channel (smartphone GPS only) measures of distance and speed, the complementary filter improved the accuracy and precision of boat speed, boat distance traveled, and distance per stroke by 44%, 42%, and 73%, respectively, while the Kalman filter improved the accuracy and precision of boat speed, boat distance traveled, and distance per stroke by 48%, 22%, and 82%, respectively. Both filters demonstrate promise as general-purpose methods to substantially improve estimates of important rowing performance metrics.
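The complementary-filter idea of combining acceleration and GPS data can be sketched with a generic first-order filter. This is a textbook form and not the filter design from the paper: the function name `complementary_speed`, the blending weight `alpha`, and the assumption that both signals are sampled at the same rate are illustrative choices.

```python
from typing import List, Sequence


def complementary_speed(
    accel: Sequence[float],
    gps_speed: Sequence[float],
    dt: float,
    alpha: float = 0.98,
) -> List[float]:
    """Fuse along-track acceleration (m/s^2) with GPS speed (m/s).

    Each step integrates the acceleration to predict the new speed
    (trusted at high frequencies) and blends that prediction with the
    GPS speed (trusted at low frequencies). alpha is a hypothetical
    tuning weight in [0, 1]; alpha = 1 ignores GPS after the first sample.
    """
    if len(accel) != len(gps_speed):
        raise ValueError("accel and gps_speed must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if not accel:
        return []

    estimates = [gps_speed[0]]  # initialise from the first GPS reading
    for a, v_gps in zip(accel[1:], gps_speed[1:]):
        predicted = estimates[-1] + a * dt  # integrate accelerometer
        estimates.append(alpha * predicted + (1.0 - alpha) * v_gps)
    return estimates
```

A larger `alpha` leans on the accelerometer and smooths GPS noise, at the cost of slower correction of integration drift; a smaller `alpha` tracks the GPS more closely.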
DeepMood: Modeling Mobile Phone Typing Dynamics for Mood Detection
The increasing use of electronic forms of communication presents new
opportunities in the study of mental health, including the ability to
investigate the manifestations of psychiatric diseases unobtrusively and in the
setting of patients' daily lives. A pilot study to explore the possible
connections between bipolar affective disorder and mobile phone usage was
conducted. In this study, participants were provided a mobile phone to use as
their primary phone. This phone was loaded with a custom keyboard that
collected metadata consisting of keypress entry time and accelerometer
movement. Individual character data with the exceptions of the backspace key
and space bar were not collected due to privacy concerns. We propose an
end-to-end deep architecture based on late fusion, named DeepMood, to model the
multi-view metadata for the prediction of mood scores. Experimental results
show that 90.31% prediction accuracy on the depression score can be achieved
based on session-level mobile phone typing dynamics, where a session typically lasts less
than one minute. This demonstrates the feasibility of using mobile phone metadata
to infer mood disturbance and severity. Comment: KDD 201
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies about extraction of bottleneck (BN) features
from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases
and triphone states for improving the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have a similar non-stationarity property, and TCL further has the
advantage of having no need for labeled data. We therefore present a TCL based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR derived BN features. Moreover,.... Comment: Copyright (c) 2019 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other work
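The uniform partitioning step described in the time-contrastive learning abstract above (each utterance split into a predefined number of segments, with class labels shared across utterances) can be sketched as follows. The function name `tcl_frame_labels` and the integer rounding rule are assumptions made for illustration; the abstract does not specify how segment boundaries are rounded.

```python
from typing import List


def tcl_frame_labels(num_frames: int, num_segments: int) -> List[int]:
    """Give each frame of an utterance the index of its uniform segment.

    Segment k of every utterance maps to class k, so labels are shared
    across utterances regardless of utterance length. If an utterance
    has fewer frames than segments, some classes simply do not occur.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    # Integer arithmetic keeps the result in the range 0..num_segments-1.
    return [i * num_segments // num_frames for i in range(num_frames)]
```

Frame-level labels produced this way need no transcription or speaker identity, which is the property the abstract highlights.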
Fast and Accurate OOV Decoder on High-Level Features
This work proposes a novel approach to the out-of-vocabulary (OOV) keyword search
(KWS) task. The proposed approach is based on using high-level features from an
automatic speech recognition (ASR) system, so-called phoneme posterior based
(PPB) features, for decoding. These features are obtained by calculating
time-dependent phoneme posterior probabilities from word lattices, followed by
their smoothing. For the PPB features we developed a novel OOV decoder that is
very fast, simple and efficient. Experimental results are presented on the
Georgian language from the IARPA Babel Program, which was the test language in
the OpenKWS 2016 evaluation campaign. The results show that in terms of maximum
term weighted value (MTWV) metric and computational speed, for single ASR
systems, the proposed approach significantly outperforms the state-of-the-art
approach based on using in-vocabulary proxies for OOV keywords in the indexed
database. The comparison of the two OOV KWS approaches on the fusion results of
the nine different ASR systems demonstrates that the proposed OOV decoder
outperforms the proxy-based approach in terms of MTWV metric given the
comparable processing speed. Other important advantages of the OOV decoder
include extremely low memory consumption and simplicity of its implementation
and parameter optimization. Comment: Interspeech 2017, August 2017, Stockholm, Sweden. 201