Exponential Moving Average Model in Parallel Speech Recognition Training
As training data grows rapidly, large-scale parallel training on multi-GPU
clusters is now widely applied to neural network learning. We present a new
approach that applies the exponential moving average method in large-scale
parallel training of neural network models. It is a non-interfering strategy:
the exponential moving average model is not broadcast to the distributed
workers to update their local models after model synchronization during
training; instead, it serves as the final model of the training system.
Fully-connected feed-forward neural networks (DNNs) and deep unidirectional
long short-term memory (LSTM) recurrent neural networks (RNNs) are
successfully trained with the proposed method for large vocabulary continuous
speech recognition on Shenma voice search data in Mandarin. The character
error rate (CER) of Mandarin speech recognition is further reduced compared
with state-of-the-art parallel training approaches.
Comment: 5 pages
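The update itself is simple enough to sketch. Below is a minimal NumPy illustration of the non-interfering EMA strategy described above; the decay value, shapes, and the fake synchronized updates are assumptions for demonstration, not the paper's code.

```python
import numpy as np

def ema_update(ema_params, synced_params, decay=0.999):
    """Update the EMA model after each synchronization step.

    The EMA model is kept on the server side only; it is never broadcast
    back to the workers (the "non-interfering" property above).
    """
    return [decay * e + (1.0 - decay) * p
            for e, p in zip(ema_params, synced_params)]

# Illustrative loop: workers synchronize, the server tracks the EMA.
params = [np.random.randn(4, 4)]        # synchronized (averaged) model
ema = [p.copy() for p in params]        # EMA starts from the initial model
for step in range(100):
    params = [p - 0.01 * np.random.randn(*p.shape) for p in params]  # fake sync'd update
    ema = ema_update(ema, params)
# `ema` serves as the final model; workers trained on `params` throughout.
```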
Frame Stacking and Retaining for Recurrent Neural Network Acoustic Model
Frame stacking is broadly applied in end-to-end neural network training like
connectionist temporal classification (CTC), and it leads to more accurate
models and faster decoding. However, it is not well suited to conventional
neural networks based on context-dependent state acoustic models if the
decoder is unchanged. In this paper, we propose a novel frame retaining method
which is applied during decoding. A system that combines frame retaining with
frame stacking reduces the time consumption of both training and decoding.
Long short-term memory (LSTM) recurrent neural networks (RNNs) using it
achieve an almost linear training speedup and reduce the real time factor
(RTF) by a relative 41%. At the same time, recognition performance shows no
degradation, or improves slightly, on the Shenma voice search dataset in
Mandarin.
Comment: 5 pages
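A rough sketch of how frame stacking and frame retaining could fit together: stacking reduces the number of network evaluations, and retaining repeats each stacked-frame posterior so an unchanged frame-synchronous decoder still sees the original frame rate. The stacking factor of 3 and all array shapes are illustrative assumptions.

```python
import numpy as np

def stack_frames(feats, n=3):
    """Concatenate every n consecutive frames into one wide frame."""
    T, d = feats.shape
    T_trim = (T // n) * n
    return feats[:T_trim].reshape(T_trim // n, n * d)

def retain_posteriors(posteriors, n=3):
    """Repeat each stacked-frame posterior n times so an unchanged
    frame-synchronous decoder sees the original frame rate."""
    return np.repeat(posteriors, n, axis=0)

feats = np.random.randn(100, 40)         # 100 frames of 40-dim features
stacked = stack_frames(feats, n=3)       # (33, 120): 3x fewer net evaluations
post = np.random.rand(33, 500)           # fake acoustic-model posteriors
decoder_input = retain_posteriors(post)  # (99, 500): back to per-frame rate
```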
Semi-supervised and Population Based Training for Voice Commands Recognition
We present a rapid design methodology that combines automated hyper-parameter
tuning with semi-supervised training to build highly accurate and robust models
for voice command classification. The proposed approach allows quick
evaluation of network architectures to fit the performance and power
constraints of the available hardware, while ensuring good hyper-parameter
choices for each network in real-world scenarios. Leveraging a vast amount of
unlabeled data with a student/teacher based semi-supervised method,
classification accuracy is improved from 84% to 94% on the validation set. For
model optimization, we explore the hyper-parameter space through population
based training and obtain an optimized model in the same time frame as it
takes to train a single model.
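As a sketch of the population based training component, the exploit/explore step might look like the following; the population structure, exploit fraction, and perturbation range are assumptions, and the semi-supervised part (a teacher pseudo-labeling unlabeled clips for a student) is omitted here.

```python
import copy
import random

def pbt_step(population, exploit_frac=0.2, perturb=(0.8, 1.2)):
    """One exploit/explore step of population based training (a sketch).

    Each member is a dict with 'score', 'hparams', and 'weights'. The worst
    members copy weights and hyper-parameters from the best performers,
    then perturb the hyper-parameters to keep exploring.
    """
    population.sort(key=lambda m: m["score"], reverse=True)
    k = max(1, int(exploit_frac * len(population)))
    for loser in population[-k:]:
        winner = random.choice(population[:k])
        loser["weights"] = copy.deepcopy(winner["weights"])    # exploit
        loser["hparams"] = {h: v * random.uniform(*perturb)    # explore
                            for h, v in winner["hparams"].items()}
    return population

# All members train in parallel, so the wall-clock cost matches one model.
pop = [{"score": random.random(),
        "hparams": {"lr": 1e-3, "dropout": 0.1},
        "weights": [0.0]} for _ in range(10)]
pop = pbt_step(pop)   # run between training intervals
```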
Collaborative Deep Learning Across Multiple Data Centers
Valuable training data is often owned by independent organizations and
located in multiple data centers. Most deep learning approaches require
centralizing the multi-datacenter data for performance reasons. In practice,
however, it is often infeasible to transfer all the data to a centralized data
center, due not only to bandwidth limitations but also to the constraints of
privacy regulations. Model averaging is a conventional choice for
data-parallel training, but previous studies have claimed it is ineffective
because deep neural networks are often non-convex. In this paper, we argue
that model averaging can
be effective in the decentralized environment by using two strategies, namely,
the cyclical learning rate and the increased number of epochs for local model
training. With the two strategies, we show that model averaging can provide
competitive performance in the decentralized mode compared to the
data-centralized one. In a practical environment with multiple data centers, we
conduct extensive experiments using state-of-the-art deep network architectures
on different types of data. Results demonstrate the effectiveness and
robustness of the proposed method.
Comment: Submitted to AAAI 201
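A minimal sketch of the two strategies as they might combine: a triangular cyclical learning rate for local training, and plain parameter averaging across data centers so that only models, not data, cross the WAN. The schedule shape, values, and layer layout are illustrative assumptions.

```python
import numpy as np

def cyclical_lr(step, base_lr=1e-4, max_lr=1e-3, cycle_len=1000):
    """Triangular cyclical learning rate for local training at each center
    (the schedule shape and values here are illustrative)."""
    pos = (step % cycle_len) / cycle_len
    tri = 1.0 - abs(2.0 * pos - 1.0)   # rises 0 -> 1, then falls back to 0
    return base_lr + (max_lr - base_lr) * tri

def average_models(models):
    """Average parameters layer-by-layer across data centers."""
    return [np.mean(layer, axis=0) for layer in zip(*models)]

# Three data centers each train locally (with more epochs than usual and
# the cyclical schedule), then only the averaged model crosses the WAN.
models = [[np.random.randn(4, 4)] for _ in range(3)]
global_model = average_models(models)
```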
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors
Multimodal research is an emerging field of artificial intelligence, and one
of the main research problems in this field is multimodal fusion. The fusion of
multimodal data is the process of integrating multiple unimodal representations
into one compact multimodal representation. Previous research in this field has
exploited the expressiveness of tensors for multimodal representation. However,
these methods often suffer from an exponential increase in dimensionality and
computational complexity introduced by the transformation of the input into a
tensor. In
this paper, we propose the Low-rank Multimodal Fusion method, which performs
multimodal fusion using low-rank tensors to improve efficiency. We evaluate our
model on three different tasks: multimodal sentiment analysis, speaker trait
analysis, and emotion recognition. Our model achieves competitive results on
all these tasks while drastically reducing computational complexity. Additional
experiments also show that our model can perform robustly for a wide range of
low-rank settings, and is indeed much more efficient in both training and
inference compared to other methods that utilize tensor representations.
Comment: * Equal contribution. 10 pages. Accepted by ACL 201
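The core of the method can be sketched as follows: the full fusion tensor is replaced by rank-wise, modality-specific factor projections whose elementwise products are summed over the rank. The shapes, the rank, and the trailing-one convention below are assumptions based on the low-rank fusion idea, not the authors' code.

```python
import numpy as np

def low_rank_fusion(inputs, factors):
    """Low-rank multimodal fusion (a sketch of the LMF idea).

    inputs:  list of modality vectors z_m, each with a trailing 1 appended
             so unimodal and bimodal terms of the tensor product survive.
    factors: per-modality factor tensors of shape (rank, d_m + 1, d_out);
             their rank-wise products replace the full fusion tensor.
    """
    rank, _, d_out = factors[0].shape
    fused = np.ones((rank, d_out))
    for z, W in zip(inputs, factors):
        fused *= np.einsum("d,rdo->ro", z, W)   # modality projection, per rank
    return fused.sum(axis=0)                    # sum over rank-1 factors

# Three modalities (e.g. text/audio/video), rank-4 factors, 8-dim output.
rng = np.random.default_rng(0)
zs = [np.append(rng.standard_normal(d), 1.0) for d in (16, 8, 12)]
Ws = [rng.standard_normal((4, d + 1, 8)) * 0.1 for d in (17, 9, 13)]
h = low_rank_fusion(zs, Ws)   # compact multimodal representation
```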
Federated Learning for Keyword Spotting
We propose a practical approach based on federated learning to solve
out-of-domain issues with continuously running embedded speech-based models
such as wake word detectors. We conduct an extensive empirical study of the
federated averaging algorithm for the "Hey Snips" wake word based on a
crowdsourced dataset that mimics a federation of wake word users. We
empirically demonstrate that using an adaptive averaging strategy inspired by
Adam in place of standard weighted model averaging substantially reduces the
number of communication rounds required to reach our target performance. The
associated upstream communication costs per user are estimated at 8 MB, which
is reasonable in the context of smart home voice assistants. Additionally, the
dataset used for these experiments is being open sourced with the aim of
fostering further transparent research in the application of federated learning
to speech data.
Comment: Accepted for publication to ICASSP 201
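One plausible reading of the adaptive averaging strategy is a server-side Adam-style update applied to the averaged client delta, sketched below; the hyper-parameters, state layout, and simulated client updates are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def server_adam_update(w, delta, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Treat the averaged client update as a pseudo-gradient and apply an
    Adam-style rule instead of plain weighted averaging (a sketch)."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * delta
    state["v"] = b2 * state["v"] + (1 - b2) * delta ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w + lr * m_hat / (np.sqrt(v_hat) + eps)

w = np.zeros(10)
state = {"t": 0, "m": np.zeros(10), "v": np.zeros(10)}
for communication_round in range(50):
    client_updates = [np.random.randn(10) * 0.1 for _ in range(8)]
    delta = np.mean(client_updates, axis=0)  # plain FedAvg would add this directly
    w = server_adam_update(w, delta, state)
```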
gpuRIR: A Python Library for Room Impulse Response Simulation with GPU Acceleration
The Image Source Method (ISM) is one of the most widely used techniques for
calculating acoustic Room Impulse Responses (RIRs). However, its computational
complexity grows quickly with the reverberation time of the room, and its
computation time can be prohibitive for applications where a huge number
of RIRs are needed. In this paper, we present a new implementation that
dramatically improves the computation speed of the ISM by using Graphics
Processing Units (GPUs) to parallelize both the simulation of multiple RIRs and
the computation of the images inside each RIR. Additional speedups were
achieved by exploiting the mixed precision capabilities of the newer GPUs and
by using lookup tables. We provide a Python library under the GNU license that
can be easily used without any knowledge of GPU programming, and we show that
it is about 100 times faster than other state-of-the-art CPU libraries. It may
become a powerful tool for many applications that need to perform a large
number of acoustic simulations, such as training machine learning systems for
audio signal processing, or for real-time room acoustics simulations for
immersive multimedia systems, such as augmented or virtual reality.
Comment: Submitted to Multimedia Tools and Applications
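Usage is intended to be a few lines of Python. The sketch below follows the library's README as best recalled; treat the exact function names and signatures (`beta_SabineEstimation`, `att2t_SabineEstimator`, `t2n`, `simulateRIR`) as assumptions to verify against the repository.

```python
import numpy as np
import gpuRIR  # https://github.com/DavidDiazGuerra/gpuRIR

room_sz = [5.0, 4.0, 3.0]              # room dimensions in meters
pos_src = np.array([[1.0, 1.0, 1.5]])  # one source position
pos_rcv = np.array([[4.0, 3.0, 1.5]])  # one receiver position
T60 = 0.5                              # reverberation time in seconds
fs = 16000                             # sampling frequency in Hz

beta = gpuRIR.beta_SabineEstimation(room_sz, T60)  # wall reflection coeffs
Tmax = gpuRIR.att2t_SabineEstimator(60.0, T60)     # RIR length at 60 dB decay
nb_img = gpuRIR.t2n(Tmax, room_sz)                 # image sources per axis
rirs = gpuRIR.simulateRIR(room_sz, beta, pos_src, pos_rcv, nb_img, Tmax, fs)
```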
Revisiting Distributed Synchronous SGD
Distributed training of deep learning models on large-scale training data is
typically conducted with asynchronous stochastic optimization to maximize the
rate of updates, at the cost of additional noise introduced from asynchrony. In
contrast, the synchronous approach is often thought to be impractical due to
idle time wasted on waiting for straggling workers. We revisit these
conventional beliefs in this paper, and examine the weaknesses of both
approaches. We demonstrate that a third approach, synchronous optimization with
backup workers, can avoid asynchronous noise while mitigating the effect of
the worst stragglers. Our approach is empirically validated and shown to
converge faster and to reach better test accuracies.
Comment: 10 pages
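A minimal sketch of the backup-worker idea: launch more workers than needed and aggregate only the first gradients to arrive. The thread-based setup and worker counts below are illustrative assumptions.

```python
import concurrent.futures as cf
import random
import time
import numpy as np

def sync_step_with_backups(grad_fns, n_needed):
    """One synchronous step with backup workers: launch every worker,
    aggregate the first n_needed gradients, ignore the stragglers."""
    grads = []
    with cf.ThreadPoolExecutor(max_workers=len(grad_fns)) as pool:
        futures = [pool.submit(fn) for fn in grad_fns]
        for fut in cf.as_completed(futures):
            grads.append(fut.result())
            if len(grads) == n_needed:
                break                   # straggler results are discarded
    return np.mean(grads, axis=0)

# 10 workers plus 2 backups: the update never waits on the 2 slowest.
def fake_worker():
    time.sleep(random.uniform(0.01, 0.2))
    return np.random.randn(4)

update = sync_step_with_backups([fake_worker] * 12, n_needed=10)
```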
Minimum Latency Training Strategies for Streaming Sequence-to-Sequence ASR
Recently, a few novel streaming attention-based sequence-to-sequence (S2S)
models have been proposed to perform online speech recognition with linear-time
decoding complexity. However, in these models, the decisions to generate tokens
are delayed compared to the actual acoustic boundaries since their
unidirectional encoders lack future information. This leads to an inevitable
latency during inference. To alleviate this issue and reduce latency, we
propose several strategies during training by leveraging external hard
alignments extracted from the hybrid model. We investigate utilizing the
alignments in both the encoder and the decoder. On the encoder side, (1)
multi-task learning and (2) pre-training with the framewise classification task
are studied. On the decoder side, we (3) remove inappropriate alignment paths
beyond an acceptable latency during the alignment marginalization, and (4)
directly minimize the differentiable expected latency loss. Experiments on the
Cortana voice search task demonstrate that our proposed methods can
significantly reduce the latency, and even improve the recognition accuracy in
certain cases on the decoder side. We also present some analysis to understand
the behaviors of streaming S2S models.
Comment: Accepted at IEEE ICASSP 202
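Strategy (4) can be sketched as an expectation over token emission times; the exact loss in the paper may differ, and the probability and boundary layout below are assumptions for illustration.

```python
import numpy as np

def expected_latency_loss(boundary_probs, ref_boundaries):
    """Differentiable expected latency, sketched.

    boundary_probs: (num_tokens, T) posterior over each token's emission
                    frame, e.g. obtained from alignment marginalization.
    ref_boundaries: (num_tokens,) reference frames from the hybrid model.
    Penalizes the expected number of frames a token is emitted *after*
    its reference acoustic boundary.
    """
    num_tokens, T = boundary_probs.shape
    frames = np.arange(T)[None, :]                             # (1, T)
    delay = np.maximum(frames - ref_boundaries[:, None], 0.0)  # late frames only
    return float((boundary_probs * delay).sum(axis=1).mean())

probs = np.random.dirichlet(np.ones(50), size=4)  # 4 tokens over 50 frames
loss = expected_latency_loss(probs, np.array([5, 15, 25, 35]))
```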
Elastic Functional Coding of Riemannian Trajectories
Visual observations of dynamic phenomena, such as human actions, are often
represented as sequences of smoothly varying features. In cases where the
feature spaces can be structured as Riemannian manifolds, the corresponding
representations become trajectories on manifolds. Analysis of these
trajectories is challenging due to the non-linearity of the underlying spaces
and the high dimensionality of the trajectories. In vision problems, given the nature of
physical systems involved, these phenomena are better characterized on a
low-dimensional manifold compared to the space of Riemannian trajectories. For
instance, if one does not impose physical constraints of the human body, in
data involving human action analysis, the resulting representation space will
have highly redundant features. Learning an effective, low-dimensional
embedding for action representations will have a huge impact in the areas of
search and retrieval, visualization, learning, and recognition. The difficulty
lies in the inherent non-linearity of the domain and the temporal variability of
actions that can distort any traditional metric between trajectories. To
overcome these issues, we use the framework based on transported square-root
velocity fields (TSRVF); this framework has several desirable properties,
including a rate-invariant metric and vector space representations. We propose
to learn an embedding such that each action trajectory is mapped to a single
point in a low-dimensional Euclidean space, and the trajectories that differ
only in temporal rates map to the same point. We utilize the TSRVF
representation, and accompanying statistical summaries of Riemannian
trajectories, to extend existing coding methods such as PCA, KSVD and Label
Consistent KSVD to Riemannian trajectories or more generally to Riemannian
functions.
Comment: Under major revision at IEEE T-PAMI, 201
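For intuition, the Euclidean special case of the square-root velocity construction can be sketched as below; the TSRVF additionally parallel-transports each scaled velocity to a common tangent space on the manifold, which this simplified sketch omits.

```python
import numpy as np

def srvf(traj, dt=1.0):
    """Square-root velocity representation of a Euclidean trajectory:
    q(t) = v(t) / sqrt(|v(t)|), the core rate-handling construction."""
    v = np.gradient(traj, dt, axis=0)                 # velocity field
    speed = np.linalg.norm(v, axis=1, keepdims=True)
    return v / np.sqrt(np.maximum(speed, 1e-12))      # square-root velocity

# Trajectories that differ only in rate map (after optimal
# re-parameterization) to the same point under the SRVF metric,
# which is what enables rate-invariant coding of actions.
t = np.linspace(0, 1, 100)[:, None]
traj = np.hstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])
q = srvf(traj, dt=t[1, 0] - t[0, 0])
```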