1,815 research outputs found
DNN and CNN with Weighted and Multi-task Loss Functions for Audio Event Detection
This report presents our audio event detection system submitted for Task 2,
"Detection of rare sound events", of DCASE 2017 challenge. The proposed system
is based on convolutional neural networks (CNNs) and deep neural networks
(DNNs) coupled with novel weighted and multi-task loss functions and
state-of-the-art phase-aware signal enhancement. The loss functions are
tailored for audio event detection in audio streams. The weighted loss is
designed to tackle the common issue of imbalanced data in background/foreground
classification while the multi-task loss enables the networks to simultaneously
model the class distribution and the temporal structures of the target events
for recognition. Our proposed systems significantly outperform the challenge
baseline, improving F-score from 72.7% to 90.0% and reducing detection error
rate from 0.53 to 0.18 on average on the development data. On the evaluation
data, our submission obtains an average F1-score of 88.3% and an error rate of
0.22 which are significantly better than those obtained by the DCASE baseline
(i.e. an F1-score of 64.1% and an error rate of 0.64). Comment: DCASE 2017 technical report
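The abstract does not spell the weighted loss out; as a hedged illustration, a class-weighted binary cross-entropy of the kind commonly used for imbalanced background/foreground labels might look like the following sketch (the weights `w_pos`/`w_neg` and all values are illustrative, not the authors' settings):

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_pos=10.0, w_neg=1.0, eps=1e-7):
    """Class-weighted binary cross-entropy: rare foreground frames
    (y_true == 1) are up-weighted by w_pos to counter class imbalance."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    loss = -(w_pos * y_true * np.log(y_pred)
             + w_neg * (1.0 - y_true) * np.log(1.0 - y_pred))
    return loss.mean()

# Mostly-background frame labels with one rare foreground event frame.
y_true = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
y_pred = np.array([0.1, 0.1, 0.1, 0.1, 0.6])
print(weighted_bce(y_true, y_pred))
```

Up-weighting the positive class makes a missed event frame cost far more than a missed background frame, which is the usual remedy when events are rare in the stream.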
Deep Learning for Single and Multi-Session i-Vector Speaker Recognition
The promising performance of Deep Learning (DL) in speech recognition has
motivated the use of DL in other speech technology applications such as speaker
recognition. Given i-vectors as inputs, the authors proposed an impostor
selection algorithm and a universal model adaptation process in a hybrid system
based on Deep Belief Networks (DBN) and Deep Neural Networks (DNN) to
discriminatively model each target speaker. In order to have more insight into
the behavior of DL techniques in both single and multi-session speaker
enrollment tasks, some experiments have been carried out in this paper in both
scenarios. Additionally, the parameters of the global model, referred to as
universal DBN (UDBN), are normalized before adaptation. UDBN normalization
facilitates the training of DNNs, particularly those with more than one hidden layer.
Experiments are performed on the NIST SRE 2006 corpus. It is shown that the
proposed impostor selection algorithm and UDBN adaptation process enhance the
performance of conventional DNNs by 8-20% and 16-20% in terms of EER for the
single and multi-session tasks, respectively. In both scenarios, the proposed
architectures outperform the baseline systems, obtaining up to a 17% reduction in
EER.
Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples
Deep neural networks (DNNs) have demonstrated impressive performance on a
wide array of tasks, but they are usually considered opaque since internal
structure and learned parameters are not interpretable. In this paper, we
re-examine the internal representations of DNNs using adversarial images, which
are generated by an ensemble-optimization algorithm. We find that: (1) the
neurons in DNNs do not truly detect semantic objects/parts, but respond to
objects/parts only as recurrent discriminative patches; (2) deep visual
representations are not robust distributed codes of visual concepts because the
representations of adversarial images are largely inconsistent with those of
real images despite their similar visual appearance. Both findings differ from
previous conclusions. To further improve the interpretability of
DNNs, we propose an adversarial training scheme with a consistent loss such
that the neurons are endowed with human-interpretable concepts. The induced
interpretable representations enable us to trace eventual outcomes back to
influential neurons. Therefore, human users can know how the models make
predictions, as well as when and why they make errors.
End-to-End Multimodal Speech Recognition
Transcription or sub-titling of open-domain videos is still a challenging
domain for Automatic Speech Recognition (ASR) due to challenging acoustics,
variable signal processing, and the essentially unrestricted domain of the
data. In previous work, we have shown that the visual channel --
specifically object and scene features -- can help to adapt the acoustic model
(AM) and language model (LM) of a recognizer, and we are now expanding this
work to end-to-end approaches. In the case of a Connectionist Temporal
Classification (CTC)-based approach, we retain the separation of AM and LM,
while for a sequence-to-sequence (S2S) approach, both information sources are
adapted together, in a single model. This paper also analyzes the behavior of
CTC and S2S models on noisy video data (How-To corpus), and compares it to
results on the clean Wall Street Journal (WSJ) corpus, providing insight into
the robustness of both approaches. Comment: 5 pages, 5 figures, Accepted at IEEE International Conference on
Acoustics, Speech and Signal Processing 2018 (ICASSP 2018).
Graph based manifold regularized deep neural networks for automatic speech recognition
Deep neural networks (DNNs) have been successfully applied to a wide variety
of acoustic modeling tasks in recent years. These include the applications of
DNNs either in a discriminative feature extraction or in a hybrid acoustic
modeling scenario. Despite the rapid progress in this area, a number of
challenges remain in training DNNs. This paper presents an effective way of
training DNNs using a manifold learning based regularization framework. In this
framework, the parameters of the network are optimized to preserve underlying
manifold based relationships between speech feature vectors while minimizing a
measure of loss between network outputs and targets. This is achieved by
incorporating manifold based locality constraints in the objective criterion of
DNNs. Empirical evidence is provided to demonstrate that training a network
with manifold constraints preserves structural compactness in the hidden layers
of the network. Manifold regularization is applied to train bottleneck DNNs for
feature extraction in hidden Markov model (HMM) based speech recognition. The
experiments in this work are conducted on the Aurora-2 spoken digits and the
Aurora-4 read news large vocabulary continuous speech recognition tasks. The
performance is measured in terms of word error rate (WER) on these tasks. It is
shown that the manifold regularized DNNs result in up to 37% reduction in WER
relative to standard DNNs. Comment: 12 pages including citations, 2 figures
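The manifold-based locality constraint described above can be sketched, under assumptions, as a Gaussian-affinity penalty on hidden representations; the kernel width `sigma` and the pairwise form below are illustrative, not the paper's exact criterion:

```python
import numpy as np

def manifold_penalty(H, X, sigma=1.0):
    """Graph-based manifold penalty: each pair of input feature vectors
    gets a Gaussian-kernel affinity w_ij, which weights the squared
    distance between their hidden representations h_i and h_j.  Inputs
    that are neighbours on the manifold are thus pushed to stay close
    in the hidden layer."""
    n = X.shape[0]
    penalty = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w_ij = np.exp(-np.sum((X[i] - X[j]) ** 2) / (2 * sigma ** 2))
            penalty += w_ij * np.sum((H[i] - H[j]) ** 2)
    return penalty / (n * (n - 1))

# Sketch of the combined objective: cross-entropy plus the locality term,
# with gamma a (hypothetical) trade-off hyperparameter:
#   total_loss = cross_entropy + gamma * manifold_penalty(H, X)
```

In practice such penalties are computed over mini-batch neighbourhoods rather than all pairs, since the double loop is quadratic in batch size.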
Detecting Adversarial Image Examples in Deep Networks with Adaptive Noise Reduction
Recently, many studies have demonstrated that deep neural network (DNN)
classifiers can be fooled by adversarial examples, which are crafted by
introducing small perturbations into an original sample. Accordingly, some
powerful defense techniques were proposed. However, existing defense techniques
often require modifying the target model or depend on the prior knowledge of
attacks. In this paper, we propose a straightforward method for detecting
adversarial image examples, which can be directly deployed into unmodified
off-the-shelf DNN models. We consider the perturbation to images as a kind of
noise and introduce two classic image processing techniques, scalar
quantization and smoothing spatial filter, to reduce its effect. The image
entropy is employed as a metric to implement an adaptive noise reduction for
different kinds of images. Consequently, the adversarial example can be
effectively detected by comparing the classification results of a given sample
and its denoised version, without referring to any prior knowledge of attacks.
More than 20,000 adversarial examples, crafted with different attack
techniques against several state-of-the-art DNN models, are used to evaluate
the proposed method. The experiments show that our detection method can achieve a
high overall F1 score of 96.39% and certainly raises the bar for defense-aware
attacks. Comment: 14 pages,
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8482346&isnumber=435869
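As a rough sketch of the detection pipeline described above: measure image entropy, denoise adaptively (scalar quantization plus a smoothing spatial filter), and flag the sample when the label flips. The entropy threshold, quantization levels, and filter size are assumptions, not the paper's tuned values:

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy of the grey-level histogram (pixels in [0, 1]);
    used to pick how aggressively to denoise."""
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def scalar_quantize(img, levels):
    """Scalar quantization to a fixed number of grey levels."""
    return np.round(img * (levels - 1)) / (levels - 1)

def mean_filter(img, k=3):
    """Simple smoothing spatial filter (k x k box blur, edge padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def detect_adversarial(img, classify):
    """Flag a sample as adversarial when its label changes after
    entropy-adaptive denoising (thresholds are illustrative)."""
    h = image_entropy(img)
    levels = 8 if h > 4.0 else 32   # harsher quantization for busy images
    denoised = mean_filter(scalar_quantize(img, levels))
    return classify(img) != classify(denoised)

# Demo with a synthetic gradient image and a trivial threshold classifier.
img = np.linspace(0.0, 1.0, 64).reshape(8, 8)
flagged = detect_adversarial(img, lambda im: int(im.mean() > 0.5))
```

The appeal of this scheme is that `classify` is the unmodified off-the-shelf model; only its input is pre-processed.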
Towards a Robust Deep Neural Network in Texts: A Survey
Deep neural networks (DNNs) have achieved remarkable success in various tasks
(e.g., image classification, speech recognition, and natural language
processing). However, research has shown that DNN models are vulnerable to
adversarial examples, which cause incorrect predictions by adding imperceptible
perturbations to normal inputs. Adversarial examples in the image domain have
been well investigated, but research in the text domain remains limited, and no
comprehensive survey of the field exists. In this paper, we aim at
presenting a comprehensive understanding of adversarial attacks and
corresponding mitigation strategies in texts. Specifically, we first give a
taxonomy of adversarial attacks and defenses in texts from the perspective of
different natural language processing (NLP) tasks, and then introduce how to
build a robust DNN model via testing and verification. Finally, we discuss the
existing challenges of adversarial attacks and defenses in texts and present
the future research directions in this emerging field
Unsupervised Feature Learning Based on Deep Models for Environmental Audio Tagging
Environmental audio tagging aims to predict only the presence or absence of
certain acoustic events in the acoustic scene of interest. In this paper we make
contributions to audio tagging in two parts, respectively, acoustic modeling
and feature learning. We propose to use a shrinking deep neural network (DNN)
framework incorporating unsupervised feature learning to handle the multi-label
classification task. For the acoustic modeling, a large set of contextual
frames of the chunk is fed into the DNN to perform a multi-label
classification for the expected tags, considering that only chunk (or
utterance) level rather than frame-level labels are available. Dropout and
background noise aware training are also adopted to improve the generalization
capability of the DNNs. For the unsupervised feature learning, we propose to
use a symmetric or asymmetric deep de-noising auto-encoder (sDAE or aDAE) to
generate new data-driven features from the Mel-Filter Banks (MFBs) features.
The new features, which are smoothed against background noise and more compact
with contextual information, can further improve the performance of the DNN
baseline. Compared with the standard Gaussian Mixture Model (GMM) baseline of
the DCASE 2016 audio tagging challenge, our proposed method obtains a
significant equal error rate (EER) reduction from 0.21 to 0.13 on the
development set. The proposed aDAE system achieves a relative 6.7% EER reduction
compared with the strong DNN baseline on the development set. Finally, the
results also show that our approach obtains the state-of-the-art performance
with 0.15 EER on the evaluation set of the DCASE 2016 audio tagging task, while
the EER of the challenge's first-prize system is 0.17. Comment: 10 pages, DCASE 2016 challenge
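A minimal forward-pass sketch of a denoising auto-encoder of the kind described, with assumed layer sizes (40 Mel-filter-bank bins, a 10-unit bottleneck), random untrained weights, and no training loop:

```python
import numpy as np

rng = np.random.default_rng(0)

def dae_forward(x_noisy, W_enc, b_enc, W_dec, b_dec):
    """Denoising auto-encoder forward pass: the bottleneck code h is the
    new data-driven feature; the decoder reconstructs the clean frame
    from the corrupted input."""
    h = np.tanh(x_noisy @ W_enc + b_enc)   # compact bottleneck feature
    x_hat = h @ W_dec + b_dec              # reconstruction
    return h, x_hat

dim_in, dim_h = 40, 10                     # e.g. 40 MFB bins (assumed sizes)
W_enc = rng.normal(0, 0.1, (dim_in, dim_h))
b_enc = np.zeros(dim_h)
W_dec = rng.normal(0, 0.1, (dim_h, dim_in))
b_dec = np.zeros(dim_in)

x_clean = rng.normal(0, 1, dim_in)                 # a clean MFB frame
x_noisy = x_clean + rng.normal(0, 0.1, dim_in)     # corrupt the input
h, x_hat = dae_forward(x_noisy, W_enc, b_enc, W_dec, b_dec)
loss = np.mean((x_hat - x_clean) ** 2)             # denoising objective
```

Training minimizes `loss` over many (noisy, clean) frame pairs; afterwards the bottleneck activations `h` serve as the smoothed, compact features fed to the tagging DNN.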
Out-of-Distribution Detection using Multiple Semantic Label Representations
Deep Neural Networks are powerful models that attained remarkable results on
a variety of tasks. These models are shown to be extremely efficient when
training and test data are drawn from the same distribution. However, it is not
clear how a network will act when it is fed with an out-of-distribution
example. In this work, we consider the problem of out-of-distribution detection
in neural networks. We propose to use multiple semantic dense representations
instead of a sparse representation as the target label. Specifically, we propose
to use several word representations obtained from different corpora or
architectures as target labels. We evaluated the proposed model on computer
vision, and speech commands detection tasks and compared it to previous
methods. Results suggest that our method compares favorably with previous work.
In addition, we demonstrate the efficiency of our approach in detecting wrongly
classified and adversarial examples.
VAE-based regularization for deep speaker embedding
Deep speaker embedding has achieved state-of-the-art performance in speaker
recognition. A potential problem is that these embedded vectors (called
'x-vectors') are not Gaussian, causing performance degradation with the popular
PLDA back-end scoring. In this paper, we propose a regularization approach
based on a Variational Auto-Encoder (VAE). This model transforms x-vectors to a
latent space where the mapped latent codes are more Gaussian, and hence more
suitable for PLDA scoring.
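The VAE regularization idea can be illustrated with a minimal encoder sketch; the dimensions and the untrained random weights below are assumptions, but the KL term shown is what pulls the latent codes towards a standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(x, W_mu, b_mu, W_lv, b_lv):
    """Linear VAE encoder: maps an x-vector to the parameters of a
    Gaussian posterior over the latent code."""
    mu = x @ W_mu + b_mu
    log_var = x @ W_lv + b_lv
    return mu, log_var

def reparameterize(mu, log_var):
    """Sample a latent code via the reparameterization trick."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions;
    minimizing this term makes the latent codes more Gaussian."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

dim_x, dim_z = 512, 128                 # typical x-vector/latent sizes (assumed)
W_mu = rng.normal(0, 0.01, (dim_x, dim_z)); b_mu = np.zeros(dim_z)
W_lv = rng.normal(0, 0.01, (dim_x, dim_z)); b_lv = np.zeros(dim_z)

x = rng.normal(0, 1, dim_x)             # an x-vector
mu, log_var = vae_encode(x, W_mu, b_mu, W_lv, b_lv)
z = reparameterize(mu, log_var)         # latent code passed to PLDA scoring
kl = kl_to_standard_normal(mu, log_var)
```

The full VAE objective adds a reconstruction term; after training, either `mu` or a sample `z` replaces the raw x-vector in the PLDA back-end.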