5,953 research outputs found
Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension
This paper presents a waveform modeling and generation method using
hierarchical recurrent neural networks (HRNN) for speech bandwidth extension
(BWE). Different from conventional BWE methods which predict spectral
parameters for reconstructing wideband speech waveforms, this BWE method models
and predicts waveform samples directly without using vocoders. Inspired by
SampleRNN which is an unconditional neural audio generator, the HRNN model
represents the distribution of each wideband or high-frequency waveform sample
conditioned on the input narrowband waveform samples using a neural network
composed of long short-term memory (LSTM) layers and feed-forward (FF) layers.
The LSTM layers form a hierarchical structure and each layer operates at a
specific temporal resolution to efficiently capture long-span dependencies
between temporal sequences. Furthermore, additional conditions, such as the
bottleneck (BN) features derived from narrowband speech using a deep neural
network (DNN)-based state classifier, are employed as auxiliary input to
further improve the quality of generated wideband speech. The experimental
results of comparing several waveform modeling methods show that the HRNN-based
method can achieve better speech quality and run-time efficiency than the
dilated convolutional neural network (DCNN)-based method and the plain
sample-level recurrent neural network (SRNN)-based method. Our proposed method
also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in
terms of the subjective quality of the reconstructed wideband speech.
Comment: Accepted by IEEE Transactions on Audio, Speech and Language Processing
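To make the hierarchical conditioning concrete, here is a minimal, untrained numpy sketch of the data flow the abstract describes: a frame-level recurrent tier (a plain tanh RNN cell standing in for the LSTM layers) updates once per frame, and a sample-level feed-forward tier predicts each output sample from recent narrowband samples plus the frame-level state. All sizes and the random weights are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME = 16      # samples per frame handled by the coarse (frame-level) tier
HID = 32        # hidden size of the frame-level recurrent cell (assumed)
CTX = 4         # narrowband context samples seen by the sample-level tier

# Frame-level tier: a plain tanh RNN cell standing in for the LSTM layers.
W_in = rng.normal(scale=0.1, size=(HID, FRAME))
W_rec = rng.normal(scale=0.1, size=(HID, HID))

# Sample-level tier: a feed-forward map from [context, conditioning] to a sample.
W_ff = rng.normal(scale=0.1, size=(CTX + HID,))

def hrnn_generate(narrowband):
    """Predict one high-band sample per narrowband sample.

    The frame-level RNN runs once per FRAME samples (coarse temporal
    resolution); its hidden state conditions every sample-level prediction
    inside that frame, mimicking the hierarchical structure of the paper.
    """
    h = np.zeros(HID)
    out = np.zeros_like(narrowband)
    padded = np.concatenate([np.zeros(CTX), narrowband])
    for f in range(len(narrowband) // FRAME):
        frame = narrowband[f * FRAME:(f + 1) * FRAME]
        h = np.tanh(W_in @ frame + W_rec @ h)   # update coarse state once per frame
        for i in range(FRAME):
            t = f * FRAME + i
            ctx = padded[t:t + CTX]             # recent narrowband samples
            out[t] = np.tanh(np.concatenate([ctx, h]) @ W_ff)
    return out

x = rng.normal(size=128)    # stand-in narrowband waveform
y = hrnn_generate(x)
```

The point of the hierarchy is efficiency: the recurrent state is updated once per frame rather than once per sample, while the cheap sample-level tier runs at full rate.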
Audio Super Resolution using Neural Networks
We introduce a new audio processing technique that increases the sampling
rate of signals such as speech or music using deep convolutional neural
networks. Our model is trained on pairs of low and high-quality audio examples;
at test-time, it predicts missing samples within a low-resolution signal in an
interpolation process similar to image super-resolution. Our method is simple
and does not involve specialized audio processing techniques; in our
experiments, it outperforms baselines on standard speech and music benchmarks
at upscaling ratios of 2x, 4x, and 6x. The method has practical applications in
telephony, compression, and text-to-speech generation; it demonstrates the
effectiveness of feed-forward convolutional architectures on an audio
generation task.
Comment: Presented at the 5th International Conference on Learning Representations (ICLR) 2017, Workshop Track, Toulon, France
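The training setup described above can be sketched with the non-learned baseline it compares against: low/high pairs are made by decimation, and the missing samples are filled in by interpolation (the paper replaces this interpolation step with a learned convolutional network). The sine test signal and 4x ratio are our own illustrative choices.

```python
import numpy as np

def make_pair(high, ratio):
    """Create a (low, high) training pair by keeping every `ratio`-th sample."""
    return high[::ratio], high

def upscale(low, ratio):
    """Baseline: linearly interpolate the missing samples (the model in the
    paper replaces this step with a deep convolutional network)."""
    n = len(low) * ratio
    return np.interp(np.arange(n) / ratio, np.arange(len(low)), low)

t = np.linspace(0, 1, 512, endpoint=False)
high = np.sin(2 * np.pi * 4 * t)          # smooth stand-in signal
low, target = make_pair(high, ratio=4)
pred = upscale(low, ratio=4)

mse = np.mean((pred - target) ** 2)
```

On smooth signals the interpolation baseline is already close; the learned model earns its keep on content with high-frequency detail that interpolation cannot recover.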
Beyond the Memory Wall: A Case for Memory-centric HPC System for Deep Learning
As the models and the datasets to train deep learning (DL) models scale,
system architects are faced with new challenges, one of which is the memory
capacity bottleneck, where the limited physical memory inside the accelerator
device constrains the algorithms that can be studied. We propose a
memory-centric deep learning system that can transparently expand the memory
capacity available to the accelerators while also providing fast inter-device
communication for parallel training. Our proposal aggregates a pool of memory
modules locally within the device-side interconnect, which are decoupled from
the host interface and function as a vehicle for transparent memory capacity
expansion. Compared to conventional systems, our proposal achieves an average
2.8x speedup on eight DL applications and increases the system-wide memory
capacity to tens of TBs.
Comment: Published as a conference paper at the 51st IEEE/ACM International Symposium on Microarchitecture (MICRO-51), 2018
A Hitchhiker's Guide On Distributed Training of Deep Neural Networks
Deep learning has led to tremendous advancements in the field of Artificial
Intelligence. One caveat however is the substantial amount of compute needed to
train these deep learning models. Training a model on a benchmark dataset such
as ImageNet on a single machine with a modern GPU can take up to a week;
distributing training across multiple machines has been observed to bring this
time down drastically.
Recent work has brought ImageNet training time down to as little as 4
minutes by using a cluster of 2048 GPUs. This paper surveys the various
algorithms and techniques used to distribute training and presents the current
state of the art for a modern distributed training framework. More
specifically, we explore the synchronous and asynchronous variants of
distributed Stochastic Gradient Descent, various All Reduce gradient
aggregation strategies, and best practices for obtaining higher throughput and
lower latency over a cluster, such as mixed-precision training, large-batch
training, and gradient compression.
Comment: 14 pages
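Ring all-reduce, one of the aggregation strategies the survey covers, can be simulated in a few lines: a reduce-scatter phase in which each worker forwards one chunk per step until every chunk is fully summed somewhere on the ring, followed by an all-gather phase that circulates the summed chunks. The worker count and gradient sizes below are illustrative.

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring all-reduce over a list of equally shaped per-worker
    gradient arrays. Each worker sends and receives only 2*(N-1) chunks of
    size 1/N, which is what makes the ring variant bandwidth-efficient."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    # Phase 1: reduce-scatter. At step s, worker i forwards chunk (i - s) % n
    # to its right neighbour; after n-1 steps worker i holds the fully summed
    # chunk (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n
            chunks[(i + 1) % n][c] += chunks[i][c]
    # Phase 2: all-gather. The summed chunks circulate for n-1 more steps,
    # overwriting each worker's stale copies.
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c].copy()
    return [np.concatenate(ch) for ch in chunks]

workers = [np.arange(8.0) * (w + 1) for w in range(4)]   # 4 workers, 8 params
reduced = ring_allreduce(workers)   # every worker ends with the same sum
```

Averaging (rather than summing) is a final division by the worker count; frameworks differ on where they apply it.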
GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training
Data parallelism can boost the training speed of convolutional neural
networks (CNN), but could suffer from significant communication costs caused by
gradient aggregation. To alleviate this problem, several scalar quantization
techniques have been developed to compress the gradients. But these techniques
could perform poorly when used together with decentralized aggregation
protocols like ring all-reduce (RAR), mainly due to their inability to directly
aggregate compressed gradients. In this paper, we empirically demonstrate the
strong linear correlations between CNN gradients, and propose a gradient vector
quantization technique, named GradiVeQ, to exploit these correlations through
principal component analysis (PCA) for substantial gradient dimension
reduction. GradiVeQ enables direct aggregation of compressed gradients, hence
allows us to build a distributed learning system that parallelizes GradiVeQ
gradient compression and RAR communications. Extensive experiments on popular
CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient
aggregation time of the original RAR by more than 5X without noticeable
accuracy loss, and reduces the end-to-end training time by almost 50%. The
results also show that GradiVeQ is compatible with scalar quantization
techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up
gain under the same compression ratio.
Comment: Accepted at NeurIPS 2018
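The property that makes GradiVeQ compatible with ring all-reduce is that PCA compression is linear, so compressed gradients can be summed directly and decompressed once at the end. A toy numpy sketch of that idea, with synthetic gradients constructed to share a low-dimensional subspace (our own stand-in for the correlations the paper measures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated gradients: every worker's gradient lies in the same
# low-dimensional subspace, mimicking the linear correlations the paper reports.
D, K, WORKERS = 64, 4, 8
basis, _ = np.linalg.qr(rng.normal(size=(D, K)))          # shared K-dim subspace
grads = [basis @ rng.normal(size=K) for _ in range(WORKERS)]

# "Training phase": fit the principal directions on a sample of gradients.
sample = np.stack(grads)
_, _, vt = np.linalg.svd(sample - sample.mean(0), full_matrices=False)
P = vt[:K]                                                # K principal directions

compress = lambda g: P @ g                                # D -> K
decompress = lambda z: P.T @ z                            # K -> D

# Because compression is linear, workers can sum *compressed* gradients inside
# the all-reduce and decompress only once at the end.
agg = decompress(sum(compress(g) for g in grads))
exact = sum(grads)
```

Scalar quantizers lack this linearity, which is why they force decompress-sum-recompress at every hop of a decentralized protocol.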
Kafnets: kernel-based non-parametric activation functions for neural networks
Neural networks are generally built by interleaving (adaptable) linear layers
with (fixed) nonlinear activation functions. To increase their flexibility,
several authors have proposed methods for adapting the activation functions
themselves, endowing them with varying degrees of flexibility. None of these
approaches, however, have gained wide acceptance in practice, and research in
this topic remains open. In this paper, we introduce a novel family of flexible
activation functions that are based on an inexpensive kernel expansion at every
neuron. Leveraging several properties of kernel-based models, we propose
multiple variations for designing and initializing these kernel activation
functions (KAFs), including a multidimensional scheme that nonlinearly
combines information from different paths in the network. The resulting KAFs can
approximate any mapping defined over a subset of the real line, either convex
or nonconvex. Furthermore, they are smooth over their entire domain, linear in
their parameters, and they can be regularized using any known scheme, including
the use of penalties to enforce sparseness. To the best of our
knowledge, no other known model satisfies all these properties simultaneously.
In addition, we provide a relatively complete overview on alternative
techniques for adapting the activation functions, which is currently lacking in
the literature. A large set of experiments validates our proposal.
Comment: Preprint submitted to Neural Networks (Elsevier)
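A KAF of the kind described is just a kernel expansion over a fixed dictionary, linear in its mixing weights. The sketch below uses a Gaussian kernel and initializes the weights by ridge regression so the activation initially mimics softplus; the dictionary size, kernel width, and softplus target are our own illustrative assumptions, not the paper's recommended settings.

```python
import numpy as np

def kaf(x, alpha, dictionary, gamma=10.0):
    """Kernel activation function: a Gaussian-kernel expansion over a fixed
    dictionary. It is linear in `alpha`, so the mixing weights can be trained
    like any other layer parameter."""
    # pairwise kernel values K[i, j] = exp(-gamma * (x_i - d_j)^2)
    K = np.exp(-gamma * (x[:, None] - dictionary[None, :]) ** 2)
    return K @ alpha

# Fixed dictionary of points sampled uniformly around zero.
d = np.linspace(-2.0, 2.0, 20)
gamma = 10.0

# Illustrative initialization: ridge-fit alpha so the KAF starts out close to
# softplus (the "mimic an existing activation" idea, with our own target).
K_dd = np.exp(-gamma * (d[:, None] - d[None, :]) ** 2)
target = np.log1p(np.exp(d))                  # softplus values on the dictionary
alpha = np.linalg.solve(K_dd + 1e-6 * np.eye(len(d)), target)

x = np.linspace(-1.5, 1.5, 50)
y = kaf(x, alpha, d, gamma)
```

Linearity in `alpha` is what makes the regularization claims in the abstract straightforward: any penalty usable on a linear layer applies unchanged to the mixing weights.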
Deep Scattering Spectrum
A scattering transform defines a locally translation invariant representation
which is stable to time-warping deformations. It extends MFCC representations
by computing modulation spectrum coefficients of multiple orders, through
cascades of wavelet convolutions and modulus operators. Second-order scattering
coefficients characterize transient phenomena such as attacks and amplitude
modulation. A frequency transposition invariant representation is obtained by
applying a scattering transform along log-frequency. State-of-the-art
classification results are obtained for musical genre and phone classification
on the GTZAN and TIMIT databases, respectively.
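The filter-modulus-average cascade can be illustrated with a deliberately crude variant: ideal FFT band-pass masks stand in for the wavelet filters, and a global mean stands in for the low-pass. On an amplitude-modulated tone, the first-order coefficients see only the carrier band, while a second-order coefficient recovers the modulation rate, which is the transient information the abstract attributes to second-order terms. The band edges and test signal are our own choices.

```python
import numpy as np

def bandpass(x, lo, hi):
    """Ideal band-pass filtering via an FFT mask -- a crude stand-in for the
    wavelet filters of a real scattering transform."""
    X = np.fft.rfft(x)
    mask = np.zeros_like(X)
    mask[lo:hi] = 1.0
    return np.fft.irfft(X * mask, n=len(x))

def scattering(x, bands):
    """First- and second-order scattering-style coefficients: a cascade of
    band-pass filtering, modulus, and global averaging (the low-pass)."""
    s1, s2 = [], []
    for lo, hi in bands:
        u1 = np.abs(bandpass(x, lo, hi))      # first-order modulus signal
        s1.append(u1.mean())                  # S1: averaged (low-passed)
        for lo2, hi2 in bands:
            if hi2 <= lo:                     # second filter must be slower
                u2 = np.abs(bandpass(u1, lo2, hi2))
                s2.append(u2.mean())
    return np.array(s1), np.array(s2)

# Amplitude-modulated tone: S1 sees the carrier band, S2 sees the modulation.
t = np.arange(1024)
x = (1 + 0.5 * np.sin(2 * np.pi * 8 * t / 1024)) * np.sin(2 * np.pi * 100 * t / 1024)
bands = [(4, 16), (16, 64), (64, 256)]
S1, S2 = scattering(x, bands)
```

The carrier (bin 100) lights up only the third first-order band, and the 8-cycle modulation appears in the second-order coefficient pairing the carrier band with the slowest band.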
Faster Asynchronous SGD
Asynchronous distributed stochastic gradient descent methods have trouble
converging because of stale gradients. A gradient update sent to a parameter
server by a client is stale if the parameters used to calculate that gradient
have since been updated on the server. Approaches have been proposed to
circumvent this problem that quantify staleness in terms of the number of
elapsed updates. In this work, we propose a novel method that quantifies
staleness in terms of moving averages of gradient statistics. We show that this
method outperforms previous methods with respect to convergence speed and
scalability to many clients. We also discuss how an extension to this method
can be used to dramatically reduce bandwidth costs in a distributed training
context. In particular, our method allows total bandwidth usage to be reduced by
a factor of 5 with little impact on the convergence of the cost. We also describe (and
link to) a software library that we have used to simulate these algorithms
deterministically on a single machine.
Comment: 10 pages
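One plausible reading of "staleness quantified by moving averages of gradient statistics" is sketched below: a server keeps an exponential moving average of applied gradients and down-weights incoming updates that disagree with the recent trend. The cosine-agreement rule here is our own simplification for illustration, not the paper's published method.

```python
import numpy as np

class StalenessAwareServer:
    """Toy parameter server that scales each incoming gradient by how well it
    agrees with an exponential moving average of recent gradients, instead of
    counting elapsed updates. The agreement statistic is our own
    simplification of the paper's idea."""

    def __init__(self, dim, lr=0.1, beta=0.9):
        self.w = np.zeros(dim)
        self.ema = np.zeros(dim)     # moving average of applied gradients
        self.lr, self.beta = lr, beta

    def apply(self, grad):
        self.ema = self.beta * self.ema + (1 - self.beta) * grad
        # Agreement in [0, 1]: stale gradients pointing away from the current
        # gradient trend get down-weighted.
        denom = np.linalg.norm(grad) * np.linalg.norm(self.ema)
        agree = 0.5 * (1 + self.ema @ grad / denom) if denom > 0 else 1.0
        self.w -= self.lr * agree * grad
        return agree

# Sanity check on a quadratic objective f(w) = ||w - 1||^2 / 2 with fresh
# (non-stale) gradients: updates should pass through almost unscaled.
target = np.ones(4)
server = StalenessAwareServer(4)
for _ in range(200):
    server.apply(server.w - target)     # exact gradient: no staleness
err = np.linalg.norm(server.w - target)
```

With consistent gradients the agreement factor stays near 1 and the server behaves like plain SGD; a genuinely stale update that contradicts the trend is attenuated instead of applied at full strength.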
2PFPCE: Two-Phase Filter Pruning Based on Conditional Entropy
Deep Convolutional Neural Networks (CNNs) offer remarkable performance on
classification and regression tasks in many high-dimensional problems and have
been widely utilized in real-world cognitive applications. However, the high
computational cost of CNNs greatly hinders their deployment in
resource-constrained applications, real-time systems, and edge computing
platforms. To overcome this challenge, we propose a novel filter-pruning
framework, two-phase filter pruning based on conditional entropy (2PFPCE), to
compress CNN models and reduce inference time with marginal performance
degradation. In our proposed method, we formulate the filter pruning process as
an optimization problem and propose a novel filter selection criterion measured
by conditional entropy. Based on the assumption that the representation of
neurons should be evenly distributed, we also develop a maximum-entropy filter
freeze technique that can reduce overfitting. Two
filter pruning strategies, global and layer-wise, are compared.
Our experimental results show that combining these two strategies can achieve a
higher neural network compression ratio than applying either one alone under
the same accuracy-drop threshold. Two-phase pruning, that is, combining both
global and layer-wise strategies, achieves a 10x FLOPs reduction and a 46%
inference time reduction on VGG-16, with a 2% accuracy drop.
Comment: 8 pages, 6 figures
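An entropy-based selection criterion of the general flavor described can be sketched as follows: score each filter by the Shannon entropy of its activation histogram (computed on a shared value range, so near-constant filters score near zero) and prune the lowest-scoring ones. This plain entropy proxy is our own simplification of the paper's conditional-entropy criterion; the filter counts and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_entropy(acts, value_range, bins=16):
    """Shannon entropy (bits) of a filter's activation histogram, computed on
    a shared value range so near-constant filters score near zero -- a
    simplified stand-in for the paper's conditional-entropy criterion."""
    hist, _ = np.histogram(acts, bins=bins, range=value_range)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def prune_filters(layer_acts, keep_ratio=0.75):
    """Keep the highest-entropy filters; nearly dead (constant) filters,
    which carry little information, are pruned first."""
    lo = min(a.min() for a in layer_acts)
    hi = max(a.max() for a in layer_acts)
    scores = [activation_entropy(a, (lo, hi)) for a in layer_acts]
    k = int(len(layer_acts) * keep_ratio)
    keep = np.argsort(scores)[::-1][:k]
    return sorted(keep.tolist())

# Activations of 8 filters over a batch: filter 0 is almost dead (constant),
# the others produce varied responses.
acts = [rng.normal(size=1000) for _ in range(8)]
acts[0] = 0.01 + rng.normal(scale=1e-6, size=1000)
kept = prune_filters(acts)
```

A global strategy applies one such ranking across all layers at once, while a layer-wise strategy prunes a fixed fraction per layer; the abstract's result is that alternating the two compresses further at the same accuracy budget.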
Cyber Physical Systems: Prospects and Challenges
Cyber-physical systems (CPSs) embody the conception as well as the
implementation of the integration of state-of-the-art technologies in sensing,
communication, computing, and control. Such systems incorporate new trends such
as cloud computing, mobile computing, mobile sensing, new modes of
communication, wearables, etc. In this article we give an exposition of the
architecture of a typical CPS and the prospects of such systems in the
development of the modern world. We illustrate the three major challenges faced
by a CPS: the need for rigorous numerical computation, the limitation of
current wireless communication bandwidth, and the computation/storage
limitations imposed by mobility and energy consumption. We address each of
these challenges, presenting the current techniques devised to solve them.