Character-Based Handwritten Text Transcription with Attention Networks
The paper approaches the task of handwritten text recognition (HTR) with
attentional encoder-decoder networks trained on sequences of characters, rather
than words. We experiment on lines of text from popular handwriting datasets
and compare different activation functions for the attention mechanism used for
aligning image pixels and target characters. We find that softmax attention
focuses heavily on individual characters, while sigmoid attention focuses on
multiple characters at each step of the decoding. When the sequence alignment
is one-to-one, softmax attention is able to learn a more precise alignment at
each step of the decoding, whereas the alignment generated by sigmoid attention
is much less precise. When a linear function is used to obtain attention
weights, the model predicts a character by looking at the entire sequence of
characters and performs poorly because it lacks a precise alignment between the
source and target. Future research may explore HTR in natural scene images, since the model is capable of transcribing handwritten text without needing to produce segmentations or bounding boxes of the text in images.
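A minimal numpy sketch (not the paper's code) of how the three attention activations compared above behave at a single decoding step; the scores, dimensions, and encoder features are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    T = 8                                  # encoder positions (e.g. image columns)
    scores = rng.normal(size=T)            # unnormalized alignment scores for one decode step

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    w_softmax = softmax(scores)            # peaked: mass concentrates on one character
    w_sigmoid = 1 / (1 + np.exp(-scores))  # independent gates: several positions stay active
    w_linear = scores                      # raw linear weights: no sharp alignment at all

    values = rng.normal(size=(T, 4))       # encoder features at each position
    context = w_softmax @ values           # context vector used to predict the next character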
Deep Learning on Operational Facility Data Related to Large-Scale Distributed Area Scientific Workflows
Distributed computing platforms provide a robust mechanism to perform
large-scale computations by splitting the task and data among multiple
locations, possibly located thousands of miles apart geographically. Although
such distribution of resources can lead to benefits, it also comes with its
associated problems, such as rampant duplication of file transfers that increases congestion, long job completion times, unexpected site crashes, suboptimal data transfer rates, unpredictable reliability over time, and inefficient usage of storage elements. In addition, each sub-system becomes a potential failure node that can trigger system-wide disruptions. In this vision paper, we
outline our approach to leveraging Deep Learning algorithms to discover
solutions to unique problems that arise in a system with computational
infrastructure that is spread over a wide area. The presented vision, motivated by a real scientific use case from the Belle II experiment, is to develop multilayer neural networks that tackle forecasting, anomaly detection, and optimization challenges in a complex and distributed data movement environment. Through this vision based on Deep Learning principles, we aim to achieve fewer congestion events, faster file transfer rates, and enhanced site reliability.
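As one concrete reading of the forecasting and anomaly detection goals, here is a hedged sketch using a small multilayer network over a synthetic transfer-rate series; the window size, threshold, and data are invented and are not taken from the paper:

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Synthetic stand-in for an operational metric, e.g. file transfer rate samples.
    rng = np.random.default_rng(1)
    rate = 100 + 10 * np.sin(np.arange(2000) / 50) + rng.normal(0, 2, 2000)
    rate[1500] = 20  # injected transfer stall to detect

    W = 24  # look-back window
    X = np.stack([rate[i:i + W] for i in range(len(rate) - W)])
    y = rate[W:]

    model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)
    model.fit(X[:1000], y[:1000])  # train a forecaster on the first part of the series

    # Anomaly detection: flag timesteps where the forecast misses badly.
    pred = model.predict(X[1000:])
    err = np.abs(pred - y[1000:])
    anomalies = np.where(err > 5 * err.std())[0] + 1000 + W
    print("flagged indices:", anomalies)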
Brain Intelligence: Go Beyond Artificial Intelligence
Artificial intelligence (AI) is an important technology that supports daily
social life and economic activities. It contributes greatly to the sustainable
growth of Japan's economy and helps solve various social problems. In recent years, AI has attracted attention as a key to growth in developed economies such as Europe and the United States, as well as in developing economies such as China and India.
The attention has been focused mainly on developing new artificial intelligence
information communication technology (ICT) and robot technology (RT). Although
recently developed AI technology certainly excels in extracting certain
patterns, there are many limitations. Most ICT models are overly dependent on
big data, lack a self-idea function, and are complicated. In this paper, rather
than merely developing next-generation artificial intelligence technology, we
aim to develop a new concept of general-purpose intelligence cognition
technology called Beyond AI. Specifically, we plan to develop an intelligent
learning model called Brain Intelligence (BI) that generates new ideas about
events without having experienced them, using artificial life with an imagination function. We will also demonstrate the developed BI learning model on automatic driving, precision medical care, and industrial robots.
Comment: 15 pages, Mobile Networks and Applications, 201
Unsupervised Domain Adaptation for Robust Speech Recognition via Variational Autoencoder-Based Data Augmentation
Domain mismatch between training and testing can lead to significant
degradation in performance in many machine learning scenarios. Unfortunately,
this is not a rare situation for automatic speech recognition deployments in
real-world applications. Research on robust speech recognition can be regarded
as trying to overcome this domain mismatch issue. In this paper, we address the
unsupervised domain adaptation problem for robust speech recognition, where speech from both the source and target domains is provided, but word transcripts are available only for the source domain speech. We present novel
augmentation-based methods that transform speech in a way that does not change
the transcripts. Specifically, we first train a variational autoencoder on both
source and target domain data (without supervision) to learn a latent
representation of speech. We then transform nuisance attributes of speech that
are irrelevant to recognition by modifying the latent representations, in order
to augment labeled training data with additional data whose distribution is
more similar to the target domain. The proposed method is evaluated on the
CHiME-4 dataset and reduces the absolute word error rate (WER) by as much as
35% compared to the non-adapted baseline.
Comment: Accepted to IEEE ASRU 201
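The latent-space transformation can be pictured with a toy sketch; the linear encode/decode functions stand in for the trained VAE, and the mean shift is just one simple instance of the nuisance-attribute modification the paper describes (all names and sizes here are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    D, Z = 40, 16                        # feature and latent dimensionality (made up)
    We = rng.normal(size=(D, Z)) * 0.1   # toy linear "encoder"/"decoder" weights
    encode = lambda x: x @ We            # stand-in for the trained VAE encoder mean
    decode = lambda z: z @ We.T          # stand-in for the VAE decoder

    src = rng.normal(loc=0.0, size=(100, D))  # labeled source-domain frames
    tgt = rng.normal(loc=0.5, size=(100, D))  # unlabeled target-domain frames

    z_src, z_tgt = encode(src), encode(tgt)
    shift = z_tgt.mean(axis=0) - z_src.mean(axis=0)  # estimate of the nuisance offset

    # Move source latents toward the target domain while keeping the linguistic
    # content, then decode to produce augmented "target-like" training data.
    aug = decode(z_src + shift)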
Novel Cascaded Gaussian Mixture Model-Deep Neural Network Classifier for Speaker Identification in Emotional Talking Environments
This research presents and evaluates an effective approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: a cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). The results show that the cascaded GMM-DNN classifier improves speaker identification performance across a range of emotions on two distinct speech databases: the Emirati speech database (an Arabic dataset from the United Arab Emirates) and the English Speech Under Simulated and Actual Stress (SUSAS) dataset. The proposed classifier outperforms classical classifiers such as the Multilayer Perceptron (MLP) and the Support Vector Machine (SVM) on each dataset. The speaker identification performance attained with the cascaded GMM-DNN is similar to that acquired from subjective assessment by human listeners.
Comment: 15 pages
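One plausible way to wire such a cascade, sketched with scikit-learn on synthetic frames (the paper's exact coupling of the GMM and DNN stages may differ; all sizes here are invented): the GMM's per-component posteriors become the features the neural network classifies.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    n_speakers, n_frames, dim = 5, 200, 13  # e.g. 13 MFCCs per frame (made-up sizes)
    X = np.vstack([rng.normal(loc=s, size=(n_frames, dim)) for s in range(n_speakers)])
    y = np.repeat(np.arange(n_speakers), n_frames)

    # Stage 1: a GMM models the acoustic space; its per-component posteriors
    # form the feature vector handed to the neural network.
    gmm = GaussianMixture(n_components=16, covariance_type="diag", random_state=0).fit(X)
    post = gmm.predict_proba(X)

    # Stage 2: an MLP (standing in for the deep network) classifies speakers
    # from the GMM posteriors.
    dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0).fit(post, y)
    print("train accuracy:", dnn.score(post, y))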
Unsupervised speech representation learning using WaveNet autoencoders
We consider the task of unsupervised extraction of meaningful latent
representations of speech by applying autoencoding neural networks to speech
waveforms. The goal is to learn a representation able to capture high-level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low-level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high-capacity WaveNet decoder to infer the information discarded by the encoder from previous samples. The behavior of autoencoder models depends on the kind of constraint
that is applied to the latent representation. We compare three variants: a
simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder
(VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of
learned representations in terms of speaker independence, the ability to
predict phonetic content, and the ability to accurately reconstruct individual
spectrogram frames. Moreover, for discrete encodings extracted using the
VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a
regularization scheme that forces the representations to focus on the phonetic
content of the utterance and report performance comparable with the top entries
in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.
Comment: Accepted to IEEE TASLP, final version available at http://dx.doi.org/10.1109/TASLP.2019.293886
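A toy numpy illustration of the VQ-VAE discretization step referenced above, a nearest-neighbor lookup into a codebook (training details such as the straight-through estimator and commitment loss are omitted; the sizes are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    K, Z = 64, 32                      # codebook size and latent dim (made up)
    codebook = rng.normal(size=(K, Z))
    z_e = rng.normal(size=(100, Z))    # encoder outputs for 100 frames

    # VQ bottleneck: replace each encoder output with its nearest codebook entry,
    # so the decoder sees one of K discrete symbols per frame.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    codes = d.argmin(axis=1)           # discrete token ids (candidate acoustic units)
    z_q = codebook[codes]              # quantized latents fed to the WaveNet decoder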
End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results
We replace the Hidden Markov Model (HMM), which is traditionally used in continuous speech recognition, with a bi-directional recurrent neural network
encoder coupled to a recurrent neural network decoder that directly emits a
stream of phonemes. The alignment between the input and output sequences is
established using an attention mechanism: the decoder emits each symbol based
on a context created with a subset of input symbols selected by the attention
mechanism. We report initial results demonstrating that this new approach
achieves phoneme error rates comparable to those of state-of-the-art HMM-based decoders on the TIMIT dataset.
Comment: As accepted to: Deep Learning and Representation Learning Workshop, NIPS 201
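A minimal numpy sketch of one decoding step with additive attention, in the spirit of the mechanism described (the weight shapes and states are invented stand-ins for the trained network):

    import numpy as np

    rng = np.random.default_rng(0)
    T, d = 50, 128                # encoder timesteps and state size (made up)
    h = rng.normal(size=(T, d))   # bi-directional encoder states for one utterance
    s = rng.normal(size=d)        # current decoder state

    # Additive (MLP) attention: score each encoder position against the decoder state.
    W, U, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
    e = np.tanh(s @ W + h @ U) @ v
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()  # soft alignment over input frames

    context = alpha @ h  # the context the decoder conditions on to emit the next phoneme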
Modular End-to-end Automatic Speech Recognition Framework for Acoustic-to-word Model
End-to-end (E2E) systems have played an increasingly important role in automatic speech recognition (ASR) and achieved strong performance. However, E2E systems recognize output word sequences directly from the input acoustic features and can therefore only be trained on limited acoustic data. Extra text data is widely used to improve the results of traditional artificial neural network-hidden Markov model (ANN-HMM) hybrid systems, but incorporating extra text data into standard E2E ASR systems may break the E2E property during decoding. In this paper, a novel modular E2E ASR system is proposed. The
modular E2E ASR system consists of two parts: an acoustic-to-phoneme (A2P)
model and a phoneme-to-word (P2W) model. The A2P model is trained on acoustic
data, while extra data, including large-scale text data, can be used to train the P2W model. This additional data enables the modular E2E ASR system to model not
only the acoustic part but also the language part. During the decoding phase,
the two models will be integrated and act as a standard acoustic-to-word (A2W)
model. In other words, the proposed modular E2E ASR system can be easily
trained with extra text data and decoded in the same way as a standard E2E ASR
system. Experimental results on the Switchboard corpus show that the modular
E2E model achieves a better word error rate (WER) than standard A2W models.
Comment: Accepted by IEEE TASL
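The decode-time composition can be sketched with stubs; here a table lookup stands in for the trained P2W sequence model, and the acoustic features and phone set are invented:

    import numpy as np

    PHONES = ["h", "eh", "l", "ow", "<sp>"]

    def a2p(acoustics):
        """Stand-in acoustic-to-phoneme model: frame logits -> greedy phoneme ids."""
        logits = acoustics @ np.eye(acoustics.shape[1], len(PHONES))  # toy projection
        return [PHONES[i] for i in logits.argmax(axis=1)]

    def p2w(phonemes):
        """Stand-in phoneme-to-word model; in the paper this is a model trained on text."""
        lexicon = {("h", "eh", "l", "ow"): "hello"}
        return lexicon.get(tuple(p for p in phonemes if p != "<sp>"), "<unk>")

    frames = np.eye(5)[[0, 1, 2, 3, 4]]  # fake acoustic features, one per frame
    print(p2w(a2p(frames)))              # composed at decode time, acting as A2W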
Multilingual Language Processing From Bytes
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads
text as bytes and outputs span annotations of the form [start, length, label]
where start positions, lengths, and labels are separate entries in our
vocabulary. Because we operate directly on unicode bytes rather than
language-specific words or characters, we can analyze text in many languages
with a single model. Due to the small vocabulary size, these multilingual
models are very compact, but produce results similar to or better than the
state-of-the-art in Part-of-Speech tagging and Named Entity Recognition that
use only the provided training datasets (no external data sources). Our models
are learning "from scratch" in that they do not rely on any elements of the
standard pipeline in Natural Language Processing (including tokenization), and
thus can run in a standalone fashion on raw text.
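A small illustration of the byte-in, span-out interface described above (the example sentence, label, and offsets are invented; the real model predicts start, length, and label tokens from its output vocabulary):

    text = "Bücher kosten Geld"           # any unicode text; the model never tokenizes it
    byte_ids = list(text.encode("utf-8"))  # model input: raw unicode bytes

    # A predicted annotation is a [start, length, label] triple over byte offsets,
    # with start, length, and label all drawn from the model's output vocabulary.
    start, length, label = 0, len("Bücher".encode("utf-8")), "NOUN"
    print(bytes(byte_ids[start:start + length]).decode("utf-8"), label)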
Combining Static and Dynamic Features for Multivariate Sequence Classification
Model precision in a classification task is highly dependent on the feature
space that is used to train the model. Moreover, whether the features are
sequential or static will dictate which classification method can be applied, as most machine learning algorithms are designed to deal with one type of data or the other. In real-life scenarios, however, it is often the case
that both static and dynamic features are present, or can be extracted from the
data. In this work, we demonstrate how generative models such as Hidden Markov
Models (HMM) and Long Short-Term Memory (LSTM) artificial neural networks can
be used to extract temporal information from the dynamic data. We explore how
the extracted information can be combined with the static features in order to
improve the classification performance. We evaluate the existing techniques and
suggest a hybrid approach, which outperforms other methods on several public
datasets.
Comment: Presented at IEEE DSAA 201
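A hedged sketch of the hybrid idea on synthetic data, assuming the external hmmlearn package for the generative HMMs: per-class HMM log-likelihoods serve as the extracted temporal features, concatenated with the static features before the final classifier (the sizes, features, and model choices here are invented):

    import numpy as np
    from hmmlearn.hmm import GaussianHMM  # assumes hmmlearn is installed
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    # 40 sequences (20 per class) of 30 timesteps x 2 dynamic features,
    # plus 3 static features per sequence. All data here is synthetic.
    seqs = [rng.normal(loc=c, size=(30, 2)) for c in (0, 1) for _ in range(20)]
    static = rng.normal(size=(40, 3))
    labels = np.repeat([0, 1], 20)

    # Fit one generative HMM per class on that class's sequences.
    hmms = []
    for c in (0, 1):
        X = np.vstack([s for s, y in zip(seqs, labels) if y == c])
        hmms.append(GaussianHMM(n_components=3, random_state=0).fit(X, lengths=[30] * 20))

    # Temporal features: per-class HMM log-likelihoods of each sequence.
    dyn = np.array([[m.score(s) for m in hmms] for s in seqs])

    # Hybrid: concatenate static and HMM-derived features for the final classifier.
    feats = np.hstack([static, dyn])
    clf = LogisticRegression(max_iter=1000).fit(feats, labels)
    print("train accuracy:", clf.score(feats, labels))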