6,959 research outputs found
Light Gated Recurrent Units for Speech Recognition
A field that has directly benefited from the recent advances in deep learning
is Automatic Speech Recognition (ASR). Despite the great achievements of the
past decades, however, a natural and robust human-machine speech interaction
still appears to be out of reach, especially in challenging environments
characterized by significant noise and reverberation. To improve robustness,
modern speech recognizers often employ acoustic models based on Recurrent
Neural Networks (RNNs), that are naturally able to exploit large time contexts
and long-term speech modulations. It is thus of great interest to continue the
study of proper techniques for improving the effectiveness of RNNs in
processing speech signals.
In this paper, we revise one of the most popular RNN models, namely Gated
Recurrent Units (GRUs), and propose a simplified architecture that turned out
to be very effective for ASR. The contribution of this work is two-fold: First,
we analyze the role played by the reset gate, showing that a significant
redundancy with the update gate occurs. As a result, we propose to remove the
former from the GRU design, leading to a more efficient and compact single-gate
model. Second, we propose to replace hyperbolic tangent with ReLU activations.
This variation couples well with batch normalization and could help the model
learn long-term dependencies without numerical issues.
Results show that the proposed architecture, called Light GRU (Li-GRU), not
only reduces the per-epoch training time by more than 30% over a standard GRU,
but also consistently improves the recognition accuracy across different tasks,
input features, noisy conditions, as well as across different ASR paradigms,
ranging from standard DNN-HMM speech recognizers to end-to-end CTC models.Comment: Copyright 2018 IEE
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.Comment: 15 pages, 2 pdf figure
Deep Thermal Imaging: Proximate Material Type Recognition in the Wild through Deep Learning of Spatial Surface Temperature Patterns
We introduce Deep Thermal Imaging, a new approach for close-range automatic
recognition of materials to enhance the understanding of people and ubiquitous
technologies of their proximal environment. Our approach uses a low-cost mobile
thermal camera integrated into a smartphone to capture thermal textures. A deep
neural network classifies these textures into material types. This approach
works effectively without the need for ambient light sources or direct contact
with materials. Furthermore, the use of a deep learning network removes the
need to handcraft the set of features for different materials. We evaluated the
performance of the system by training it to recognise 32 material types in both
indoor and outdoor environments. Our approach produced recognition accuracies
above 98% in 14,860 images of 15 indoor materials and above 89% in 26,584
images of 17 outdoor materials. We conclude by discussing its potentials for
real-time use in HCI applications and future directions.Comment: Proceedings of the 2018 CHI Conference on Human Factors in Computing
System
Incremental Learning Using a Grow-and-Prune Paradigm with Efficient Neural Networks
Deep neural networks (DNNs) have become a widely deployed model for numerous
machine learning applications. However, their fixed architecture, substantial
training cost, and significant model redundancy make it difficult to
efficiently update them to accommodate previously unseen data. To solve these
problems, we propose an incremental learning framework based on a
grow-and-prune neural network synthesis paradigm. When new data arrive, the
neural network first grows new connections based on the gradients to increase
the network capacity to accommodate new data. Then, the framework iteratively
prunes away connections based on the magnitude of weights to enhance network
compactness, and hence recover efficiency. Finally, the model rests at a
lightweight DNN that is both ready for inference and suitable for future
grow-and-prune updates. The proposed framework improves accuracy, shrinks
network size, and significantly reduces the additional training cost for
incoming data compared to conventional approaches, such as training from
scratch and network fine-tuning. For the LeNet-300-100 and LeNet-5 neural
network architectures derived for the MNIST dataset, the framework reduces
training cost by up to 64% (63%) and 67% (63%) compared to training from
scratch (network fine-tuning), respectively. For the ResNet-18 architecture
derived for the ImageNet dataset and DeepSpeech2 for the AN4 dataset, the
corresponding training cost reductions against training from scratch (network
fine-tunning) are 64% (60%) and 67% (62%), respectively. Our derived models
contain fewer network parameters but achieve higher accuracy relative to
conventional baselines
Artificial Intelligent-Based Wake Word Detection at Edge Device
Deep Neural Network based wake word (such as Hi Alexa or Hey Siri) systems allow increasingly accurate speech communication between humans and machines. However, this setup requires high processing power or cloud services which may not be accessible by edge devices. Currently, the accuracy of machine learning methods for cloudless edge devices in voice activation hovers below 90%. This paper explores wake word implementation on edge devices using a 2-Dimensional Convolutional Neural Network (CNN) with improved and balanced accuracy and latency. The proposed CNN model is created, trained and quantized using TensorFlow on a PC and exported to a Raspberry Pi Zero 2 W. The quantization method reduces the model size by 20% and spectral gating is adopted to lower wake word inaccuracy detection in moderately noisy environment. The proposed system achieved more than 90% wake word detection accuracy across 30 to 50 dB background noise with an average of 1.03 second of response time for the intended user. The result shows low-powered edge device still offers competitive performance for detecting wake word without cloud services
- …