Masked Conditional Neural Networks for Environmental Sound Classification
The ConditionaL Neural Network (CLNN) exploits the nature of the temporal
sequencing of the sound signal represented in a spectrogram, and its variant
the Masked ConditionaL Neural Network (MCLNN) induces the network to learn in
frequency bands by embedding a filterbank-like sparseness over the network's
links using a binary mask. Additionally, the masking automates the concurrent
exploration of different feature combinations, analogous to handcrafting the
optimum combination of features for a recognition task. We have evaluated the
MCLNN performance using the Urbansound8k dataset of environmental sounds.
Additionally, we present a collection of manually recorded sounds for rail and
road traffic, YorNoise, to investigate the confusion rates among machine
generated sounds possessing low-frequency components. MCLNN has achieved
competitive results without augmentation and using 12% of the trainable
parameters utilized by an equivalent model based on state-of-the-art
Convolutional Neural Networks on the Urbansound8k. We extended the Urbansound8k
dataset with YorNoise, where experiments have shown that common tonal
properties affect the classification performance.
Comment: Conditional Neural Networks, CLNN, Masked Conditional Neural
Networks, MCLNN, Restricted Boltzmann Machine, RBM, Conditional Restricted
Boltzmann Machine, CRBM, Deep Belief Nets, Environmental Sound Recognition,
ESR, YorNoise
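To make the masking idea concrete, below is a minimal sketch of how a binary, filterbank-like mask over a weight matrix could be constructed; the band placement and the bandwidth/overlap parameters are illustrative assumptions, not the exact scheme from the paper.

```python
import numpy as np

def filterbank_mask(n_features, n_hidden, bandwidth, overlap):
    # Each hidden unit connects to a contiguous band of `bandwidth`
    # input bins; successive bands are shifted by (bandwidth - overlap)
    # bins, so neighbouring units observe overlapping frequency bands.
    mask = np.zeros((n_features, n_hidden), dtype=np.float32)
    shift = bandwidth - overlap
    for h in range(n_hidden):
        start = (h * shift) % n_features
        for k in range(bandwidth):
            mask[(start + k) % n_features, h] = 1.0
    return mask

# Example: 60 mel bins, 40 hidden units, 20-bin bands with 10-bin overlap.
m = filterbank_mask(60, 40, bandwidth=20, overlap=10)
# The mask is applied element-wise to the weight matrix in every
# forward/backward pass, zeroing all out-of-band links: w_masked = w * m
```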
Conditional BERT Contextual Augmentation
We propose a novel data augmentation method for labeled sentences called
conditional BERT contextual augmentation. Data augmentation methods are often
applied to prevent overfitting and improve generalization of deep neural
network models. Recently proposed contextual augmentation augments labeled
sentences by randomly replacing words with more varied substitutions predicted
by a language model. BERT demonstrates that a deep bidirectional language model
is more powerful than either a unidirectional language model or the shallow
concatenation of a forward and a backward model. We retrofit BERT to
conditional BERT by introducing a new conditional masked language
model\footnote{The term "conditional masked language model" appears once in
the original BERT paper, where it indicates context-conditioning and is
equivalent to the term "masked language model". In our paper, "conditional
masked language model" indicates that we apply an extra label-conditional
constraint to the "masked language model".} task. The well-trained conditional
BERT can be applied to enhance contextual augmentation.
Experiments on six different text classification tasks show that our method
can be easily applied to both convolutional and recurrent neural network
classifiers to obtain clear improvements.
Comment: 9 pages, 1 figure
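As a rough illustration of the retrofit, the sketch below reuses BERT's segment (token_type) embedding channel to carry a class label while predicting a masked word. This is a sketch only: a real system would first fine-tune the conditional masked language model on the labeled corpus, and the sentence, label id, and mask position here are invented for the example.

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def conditional_augment(sentence, label_id, mask_position):
    # Mask one word, then let BERT propose a substitution while the
    # class label is broadcast through the token_type_ids channel.
    # Stock BERT has only two token types, so tasks with more labels
    # need an enlarged token-type embedding table plus fine-tuning.
    enc = tokenizer(sentence, return_tensors="pt")
    input_ids = enc["input_ids"]
    input_ids[0, mask_position] = tokenizer.mask_token_id
    token_type_ids = torch.full_like(input_ids, label_id)
    with torch.no_grad():
        logits = model(input_ids=input_ids,
                       token_type_ids=token_type_ids).logits
    input_ids[0, mask_position] = logits[0, mask_position].argmax()
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(conditional_augment("the movie was absolutely wonderful",
                          label_id=1, mask_position=4))
```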
Customizing an Adversarial Example Generator with Class-Conditional GANs
Adversarial examples are intentionally crafted data with the purpose of
deceiving neural networks into misclassification. When we talk about strategies
to create such examples, we usually refer to perturbation-based methods that
fabricate adversarial examples by applying invisible perturbations onto normal
data. The resulting data preserve their visual appearance to human observers,
yet can be totally unrecognizable to DNN models, which in turn leads to
completely misleading predictions. In this paper, however, we consider crafting
adversarial examples from existing data as a limitation to example diversity.
We propose a non-perturbation-based framework that generates native adversarial
examples from class-conditional generative adversarial networks. As such, the
generated data will not resemble any existing data and thus expand example
diversity, raising the difficulty of adversarial defense. We then extend this
framework to pre-trained conditional GANs, in which we turn an existing
generator into an "adversarial-example generator". We conduct experiments with
our approach on the MNIST and CIFAR10 datasets and obtain satisfactory
results, showing that this approach can be a potential alternative to previous
attack strategies.
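One naive way to read the proposal is as sampling from a class-conditional generator and keeping the outputs the target classifier gets wrong. The rejection-sampling sketch below captures only that intuition; G and f are hypothetical placeholders for a pre-trained conditional generator and the model under attack, whereas the paper turns the generator itself into an adversarial-example generator.

```python
import torch

def native_adversarial_examples(G, f, y, latent_dim=100, n_tries=1000):
    # Draw latent samples, generate images conditioned on class y, and
    # keep those the classifier f mislabels. The kept examples are
    # "native": they come from the generator's distribution rather than
    # from perturbing existing data.
    adversarial = []
    for _ in range(n_tries):
        z = torch.randn(1, latent_dim)
        x = G(z, torch.tensor([y]))      # looks like class y to humans
        pred = f(x).argmax(dim=1).item()
        if pred != y:                    # ...but fools the classifier
            adversarial.append(x.detach())
    return adversarial
```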
Environmental Sound Recognition using Masked Conditional Neural Networks
Neural network based architectures used for sound recognition are usually
adapted from other application domains, which may not harness sound related
properties. The ConditionaL Neural Network (CLNN) is designed to consider the
relational properties across frames in a temporal signal, and its extension the
Masked ConditionaL Neural Network (MCLNN) embeds a filterbank behavior within
the network, which forces the network to learn in frequency bands rather than
bins. Additionally, it automates the exploration of different feature
combinations, analogous to handcrafting the optimum combination of features for
a recognition task. We applied the MCLNN to the environmental sounds of the
ESC-10 dataset. The MCLNN achieved competitive accuracies compared to
state-of-the-art convolutional neural networks and hand-crafted attempts.
Comment: Restricted Boltzmann Machine, RBM, Conditional RBM, CRBM, Deep Neural
Network, DNN, Conditional Neural Network, CLNN, Masked Conditional Neural
Network, MCLNN, Environmental Sound Recognition, ESR, Advanced Data Mining and
Applications (ADMA), Year: 2017
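A minimal sketch of the conditioning idea: each frame's hidden activation depends on a window of 2n+1 neighbouring frames, each contributing through its own weight matrix. The layer below is a toy under stated assumptions (names such as CLNNLayer, n_features, and order_n are invented), not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class CLNNLayer(nn.Module):
    # The activation at frame t is conditioned on the 2n+1 frames
    # centred on t; each neighbouring frame has its own weight matrix.
    def __init__(self, n_features, n_hidden, order_n):
        super().__init__()
        self.order_n = order_n
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(n_features, n_hidden) * 0.01)
             for _ in range(2 * order_n + 1)])
        self.bias = nn.Parameter(torch.zeros(n_hidden))

    def forward(self, frames):               # frames: (time, n_features)
        out = []
        for t in range(self.order_n, frames.shape[0] - self.order_n):
            acc = self.bias
            for u, w in enumerate(self.weights):
                acc = acc + frames[t - self.order_n + u] @ w
            out.append(torch.sigmoid(acc))
        return torch.stack(out)               # (time - 2n, n_hidden)

y = CLNNLayer(n_features=60, n_hidden=40, order_n=2)(torch.randn(10, 60))
```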
Automatic Classification of Music Genre using Masked Conditional Neural Networks
Neural network based architectures used for sound recognition are usually
adapted from other application domains such as image recognition, which may not
harness the time-frequency representation of a signal. The ConditionaL Neural
Networks (CLNN) and its extension the Masked ConditionaL Neural Networks
(MCLNN) are designed for multidimensional temporal signal recognition. The CLNN
is trained over a window of frames to preserve the inter-frame relation, and
the MCLNN enforces a systematic sparseness over the network's links that mimics
a filterbank-like behavior. The masking operation induces the network to learn
in frequency bands, which decreases the network susceptibility to
frequency-shifts in time-frequency representations. Additionally, the mask
allows an exploration of a range of feature combinations concurrently,
analogous to the manual handcrafting of the optimum collection of features for a
recognition task. The MCLNN has achieved competitive performance on the
Ballroom music dataset compared to several hand-crafted attempts and has
outperformed models based on state-of-the-art Convolutional Neural Networks.
Comment: Restricted Boltzmann Machine; RBM; Conditional RBM; CRBM; Deep Belief
Net; DBN; Conditional Neural Network; CLNN; Masked Conditional Neural
Network; MCLNN; Music Information Retrieval; MIR. IEEE International
Conference on Data Mining (ICDM), 2017
SketchyGAN: Towards Diverse and Realistic Sketch to Image Synthesis
Synthesizing realistic images from human drawn sketches is a challenging
problem in computer graphics and vision. Existing approaches either need exact
edge maps, or rely on retrieval of existing photographs. In this work, we
propose a novel Generative Adversarial Network (GAN) approach that synthesizes
plausible images from 50 categories including motorcycles, horses and couches.
We demonstrate a data augmentation technique for sketches which is fully
automatic, and we show that the augmented data is helpful to our task. We
introduce a new network building block suitable for both the generator and
discriminator which improves the information flow by injecting the input image
at multiple scales. Compared to state-of-the-art image translation methods, our
approach generates more realistic images and achieves significantly higher
Inception Scores.
Comment: Accepted to CVPR 2018
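The multi-scale injection idea can be sketched as resizing the input sketch to each feature resolution and concatenating it before the convolution. The paper's actual building block additionally learns gating masks; this simplified version, with invented names, shows only the injection itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InjectionBlock(nn.Module):
    # Re-injects the original input image at the current feature scale,
    # improving information flow through generator and discriminator.
    def __init__(self, in_ch, out_ch, image_ch=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + image_ch, out_ch, 3, padding=1)

    def forward(self, features, image):
        # Resize the input to the feature resolution, concatenate along
        # the channel axis, then convolve.
        img = F.interpolate(image, size=features.shape[-2:],
                            mode="bilinear", align_corners=False)
        return F.relu(self.conv(torch.cat([features, img], dim=1)))

block = InjectionBlock(in_ch=64, out_ch=128)
out = block(torch.randn(1, 64, 32, 32), torch.randn(1, 1, 256, 256))
```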
A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN)
Numerous studies have investigated the effectiveness of neural network
quantization on pattern classification tasks. The present study, for the first
time, investigated the performance of speech enhancement (a regression task in
speech processing) using a novel exponent-only floating-point quantized neural
network (EOFP-QNN). The proposed EOFP-QNN consists of two stages:
mantissa-quantization and exponent-quantization. In the mantissa-quantization
stage, EOFP-QNN learns how to quantize the mantissa bits of the model
parameters while preserving the regression accuracy using the least mantissa
precision. In the exponent-quantization stage, the exponent part of the
parameters is further quantized without causing any additional performance
degradation. We evaluated the proposed EOFP quantization technique on two types
of neural networks, namely, bidirectional long short-term memory (BLSTM) and
fully convolutional neural network (FCN), on a speech enhancement task.
Experimental results showed that the model sizes can be significantly reduced
(the model sizes of the quantized BLSTM and FCN models were only 18.75% and
21.89%, respectively, compared to those of the original models) while
maintaining satisfactory speech-enhancement performance.
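To illustrate what mantissa quantization does to a float32 parameter, the NumPy sketch below zeroes all but the most significant mantissa bits while leaving the sign and exponent untouched; with zero kept bits, what remains is an exponent-only floating-point value. This sketch does not model EOFP-QNN's learned choice of the least usable precision.

```python
import numpy as np

def quantize_mantissa(x, keep_bits):
    # Zero all but the `keep_bits` most significant mantissa bits of a
    # float32 array; the sign bit and 8-bit exponent are untouched.
    # keep_bits == 0 yields an exponent-only floating-point value.
    assert 0 <= keep_bits <= 23          # float32 mantissa is 23 bits
    raw = x.astype(np.float32).view(np.uint32)
    mask = np.uint32((0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF)
    return (raw & mask).view(np.float32)

w = np.array([0.123456789, -3.14159265], dtype=np.float32)
print(quantize_mantissa(w, keep_bits=3))   # coarse approximations
print(quantize_mantissa(w, keep_bits=0))   # exponent-only values
```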
Listen Attentively, and Spell Once: Whole Sentence Generation via a Non-Autoregressive Architecture for Low-Latency Speech Recognition
Although attention based end-to-end models have achieved promising
performance in speech recognition, the multi-pass forward computation in
beam-search increases inference time cost, which limits their practical
applications. To address this issue, we propose a non-autoregressive end-to-end
speech recognition system called LASO (listen attentively, and spell once).
Because of the non-autoregressive property, LASO predicts each textual token in
the sequence without depending on the other tokens. Without beam-search, the
one-pass propagation greatly reduces the inference time cost of LASO. And
because the model is based on an attention-based feedforward structure, the
computation can be implemented efficiently in parallel. We conduct experiments
on the publicly available Chinese dataset AISHELL-1. LASO achieves a character
error rate of 6.4%, which outperforms the state-of-the-art autoregressive
transformer model (6.7%). The average inference latency is 21 ms, which is
1/50 of that of the autoregressive transformer model.
Comment: accepted by INTERSPEECH 2020
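The latency argument can be seen in a toy comparison: a non-autoregressive decoder fills every output position in a single forward pass instead of one pass per emitted token. The module below merely stands in for LASO's attention-based feedforward structure; the dimensions and components are assumptions.

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 256, 4000, 60
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2)
proj = nn.Linear(d_model, vocab_size)

speech_repr = torch.randn(1, max_len, d_model)  # stand-in for encoded audio
with torch.no_grad():
    logits = proj(decoder(speech_repr))         # one pass, all positions
tokens = logits.argmax(dim=-1)                  # (1, max_len) token ids
# An autoregressive decoder would instead need up to max_len sequential
# passes, one per emitted token, before beam-search multiplies the cost.
```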
Phasebook and Friends: Leveraging Discrete Representations for Source Separation
Deep learning based speech enhancement and source separation systems have
recently reached unprecedented levels of quality, to the point that performance
is reaching a new ceiling. Most systems rely on estimating the magnitude of a
target source by estimating a real-valued mask to be applied to a
time-frequency representation of the mixture signal. A limiting factor in such
approaches is a lack of phase estimation: the phase of the mixture is most
often used when reconstructing the estimated time-domain signal. Here, we
propose "magbook", "phasebook", and "combook", three new types of layers based
on discrete representations that can be used to estimate complex time-frequency
masks. Magbook layers extend classical sigmoidal units and a recently
introduced convex softmax activation for mask-based magnitude estimation.
Phasebook layers use a similar structure to give an estimate of the phase mask
without suffering from phase wrapping issues. Combook layers are an alternative
to the magbook-phasebook combination that directly estimate complex masks. We
present various training and inference schemes involving these representations,
and explain in particular how to include them in an end-to-end learning
framework. We also present an oracle study to assess upper bounds on
performance for various types of masks using discrete phase representations. We
evaluate the proposed methods on the wsj0-2mix dataset, a well-studied corpus
for single-channel speaker-independent speaker separation, matching the
performance of state-of-the-art mask-based approaches without requiring
additional phase reconstruction steps.
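As a sketch of a "book"-style layer, the module below outputs a softmax distribution over a small codebook of candidate phase values per time-frequency bin and combines them with a circular mean, which sidesteps phase wrapping at the +/- pi boundary. The paper's exact layer definitions and combination schemes differ in detail; names and sizes here are assumptions.

```python
import math
import torch
import torch.nn as nn

class PhasebookLayer(nn.Module):
    # Predicts a distribution over discrete phase candidates per bin and
    # returns the circular mean as the phase estimate.
    def __init__(self, in_dim, book_size=8):
        super().__init__()
        # Codebook of uniformly spaced phase candidates (could be learned).
        phases = torch.linspace(-math.pi, math.pi, book_size + 1)[:-1]
        self.book = nn.Parameter(phases)
        self.logits = nn.Linear(in_dim, book_size)

    def forward(self, h):                     # h: (..., in_dim)
        p = torch.softmax(self.logits(h), dim=-1)
        # Circular mean avoids phase-wrapping artifacts at +/- pi.
        s = (p * torch.sin(self.book)).sum(-1)
        c = (p * torch.cos(self.book)).sum(-1)
        return torch.atan2(s, c)              # estimated phase per bin

phase = PhasebookLayer(in_dim=128)(torch.randn(4, 257, 128))
```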
Masked Conditional Neural Networks for Automatic Sound Events Recognition
Deep neural network architectures designed for application domains other than
sound, especially image recognition, may not optimally harness the
time-frequency representation when adapted to the sound recognition problem. In
this work, we explore the ConditionaL Neural Network (CLNN) and the Masked
ConditionaL Neural Network (MCLNN) for multi-dimensional temporal signal
recognition. The CLNN considers the inter-frame relationship, and the MCLNN
enforces a systematic sparseness over the network's links to enable learning in
frequency bands rather than bins, allowing the network to be frequency-shift
invariant, mimicking a filterbank. The mask also allows considering several
combinations of features concurrently, a selection that is usually handcrafted
through exhaustive manual search. We applied the MCLNN to the environmental sound
recognition problem using the ESC-10 and ESC-50 datasets. MCLNN achieved
competitive performance, using 12% of the parameters and without augmentation,
compared to state-of-the-art Convolutional Neural Networks.
Comment: Restricted Boltzmann Machine, RBM, Conditional RBM, CRBM, Deep Belief
Net, DBN, Conditional Neural Network, CLNN, Masked Conditional Neural
Network, MCLNN, Environmental Sound Recognition, ESR