Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise
Automatic speaker verification (ASV) technology is recently finding its way
to end-user applications for secure access to personal data, smart services or
physical facilities. Similar to other biometric technologies, speaker
verification is vulnerable to spoofing attacks where an attacker masquerades as
a particular target speaker via impersonation, replay, text-to-speech (TTS) or
voice conversion (VC) techniques to gain illegitimate access to the system. We
focus on TTS and VC that represent the most flexible, high-end spoofing
attacks. Most of the prior studies on synthesized or converted speech detection
report their findings using high-quality clean recordings. Meanwhile, the
performance of spoofing detectors in the presence of additive noise, an
important consideration in practical ASV implementations, remains largely
unknown. To this end, we analyze the suitability of state-of-the-art synthetic
speech detectors under additive noise with a special focus on front-end
features. Our comparison includes eight acoustic feature sets, five related to
spectral magnitude and three to spectral phase information. Our extensive
experiments on the ASVspoof 2015 corpus reveal several important findings. Firstly,
all the countermeasures break down even at relatively high signal-to-noise
ratios (SNRs) and fail to generalize to noisy conditions. Secondly, speech
enhancement is not found helpful. Thirdly, GMM back-end generally outperforms
the more involved i-vector back-end. Fourthly, concerning the compared
features, the Mel-frequency cepstral coefficients (MFCCs) and subband spectral
centroid magnitude coefficients (SCMCs) perform the best on average, though the
winning method depends on SNR and noise type. Finally, a study with two score
fusion strategies shows that combining different feature-based systems improves
recognition accuracy for known and unknown attacks in both clean and noisy
conditions.
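To make the score-fusion idea concrete, below is a minimal Python sketch of linear score fusion over two feature-based subsystems. The subsystem names (MFCC-GMM, SCMC-GMM), the equal weighting, and the score values are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Linearly fuse per-trial detection scores from several
    feature-based countermeasure systems."""
    scores = np.vstack(score_lists)                # (n_systems, n_trials)
    if weights is None:                            # equal weights by default
        weights = np.full(len(score_lists), 1.0 / len(score_lists))
    return np.asarray(weights) @ scores            # fused score per trial

# Illustrative use with two hypothetical subsystem score arrays
# (e.g. an MFCC-GMM system and an SCMC-GMM system):
mfcc_scores = np.array([1.2, -0.4, 0.9])
scmc_scores = np.array([0.8, -1.1, 1.5])
fused = fuse_scores([mfcc_scores, scmc_scores])
```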
Speech Dereverberation Based on Integrated Deep and Ensemble Learning Algorithm
Reverberation, which is generally caused by sound reflections from walls,
ceilings, and floors, can result in severe performance degradation of acoustic
applications. Due to a complicated combination of attenuation and time-delay
effects, the reverberation property is difficult to characterize, and it
remains a challenging task to effectively retrieve the anechoic speech signals
from reverberant ones. In the present study, we propose a novel integrated
deep and ensemble learning algorithm (IDEA) for speech dereverberation. The
IDEA consists of offline and online phases. In the offline phase, we train
multiple dereverberation models, each aiming to precisely dereverberate speech
signals in a particular acoustic environment; then a unified fusion function is
estimated that aims to integrate the information of multiple dereverberation
models. In the online phase, an input utterance is first processed by each of
the dereverberation models. The outputs of all models are integrated
accordingly to generate the final anechoic signal. We evaluated the IDEA on
designed acoustic environments, including both matched and mismatched
conditions of the training and testing data. Experimental results confirm that
the proposed IDEA outperforms a single deep-neural-network-based dereverberation
model with the same model architecture and training data.
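A minimal sketch of the described online phase might look as follows, assuming each trained model exposes a hypothetical predict() method and simplifying the learned fusion function to a fixed linear combination of model outputs.

```python
import numpy as np

def idea_online(noisy_spec, models, fusion_weights):
    """Sketch of the IDEA online phase: run every environment-specific
    dereverberation model on the input, then fuse all outputs into one
    anechoic estimate. The learned fusion function is simplified here
    to a fixed linear combination."""
    outputs = np.stack([m.predict(noisy_spec) for m in models])  # (n_models, frames, bins)
    return np.tensordot(fusion_weights, outputs, axes=1)         # weighted fusion
```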
Voice Activity Detection: Merging Source and Filter-based Information
Voice Activity Detection (VAD) refers to the problem of distinguishing speech
segments from background noise. Numerous approaches have been proposed for this
purpose. Some are based on features derived from the power spectral density,
others exploit the periodicity of the signal. The goal of this paper is to
investigate the joint use of source and filter-based features. Interestingly, a
mutual information-based assessment shows superior discrimination power for the
source-related features, especially the proposed ones. These features then serve
as input to an artificial neural network-based classifier trained on a
multi-condition database. Two strategies are proposed to merge source and
filter information: feature and decision fusion. Our experiments indicate an
absolute reduction of 3% in the equal error rate when using decision fusion.
The final proposed system is compared to four state-of-the-art methods on 150
minutes of data recorded in real environments. Thanks to the robustness of its
source-related features, its multi-condition training and its efficient
information fusion, the proposed system yields a substantial increase in accuracy
over the best state-of-the-art VAD across all conditions (24% absolute on
average).
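A minimal sketch of the decision-fusion strategy could look like the following; the simple posterior averaging and the 0.5 threshold are assumptions for illustration, not necessarily the paper's exact combination rule.

```python
import numpy as np

def decision_fusion(p_source, p_filter, threshold=0.5):
    """Decision-fusion sketch: average the per-frame speech posteriors of
    two classifiers, one fed source-related features and one fed
    filter-related features, then threshold the result."""
    p_fused = 0.5 * (p_source + p_filter)   # simple average of posteriors
    return p_fused >= threshold             # boolean speech decision per frame
```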
Robust Downbeat Tracking Using an Ensemble of Convolutional Networks
In this paper, we present a novel state-of-the-art system for automatic
downbeat tracking from music signals. The audio signal is first segmented into
frames that are synchronized at the tatum level of the music. We then extract
different kinds of features based on harmony, melody, rhythm, and bass content to
feed convolutional neural networks that are adapted to take advantage of each
feature's characteristics. This ensemble of neural networks is combined to obtain
one downbeat likelihood per tatum. The downbeat sequence is finally decoded
with a flexible and efficient temporal model which takes advantage of the
metrical continuity of a song. We then evaluate our system on a large set of 9
datasets, compare its performance to 4 other published algorithms, and obtain a
significant increase of 16.8 percentage points over the second-best system, all
at a moderate cost in testing and training. The influence of each step of the
method is studied to show its strengths and shortcomings.
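As a rough illustration of decoding a downbeat sequence from per-tatum likelihoods, the sketch below assumes a fixed bar length, under which the temporal model degenerates to choosing the best bar phase; the paper's actual model is more flexible.

```python
import numpy as np

def decode_downbeats(likelihoods, bar_length=4):
    """Simplified temporal decoding: with a fixed number of tatums per
    bar, the bar position advances deterministically, so Viterbi-style
    decoding reduces to scoring each possible phase offset and keeping
    the best one. likelihoods holds one downbeat probability per tatum
    (e.g. the averaged outputs of the per-feature CNNs)."""
    n = len(likelihoods)
    log_p = np.log(np.clip(likelihoods, 1e-12, 1.0))        # downbeat log-prob
    log_q = np.log(np.clip(1.0 - likelihoods, 1e-12, 1.0))  # non-downbeat log-prob
    best_phase, best_score = 0, -np.inf
    for phase in range(bar_length):
        is_down = (np.arange(n) % bar_length) == phase
        score = log_p[is_down].sum() + log_q[~is_down].sum()
        if score > best_score:
            best_phase, best_score = phase, score
    return np.flatnonzero((np.arange(n) % bar_length) == best_phase)
```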
UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social
interactions. It is produced in a multimodal manner, through the usage of words
(text), gestures (vision) and prosodic cues (acoustic). Understanding humor
from these three modalities falls within the boundaries of multimodal language, a
recent research trend in natural language processing that models natural
language as it happens in face-to-face communication. Although humor detection
is an established research area in NLP, it remains understudied in a multimodal
context. This paper presents a diverse multimodal dataset, called
UR-FUNNY, to open the door to understanding multimodal language used in
expressing humor. The dataset and accompanying studies present a framework for
multimodal humor detection for the natural language processing community.
UR-FUNNY is publicly available for research.
Between Homomorphic Signal Processing and Deep Neural Networks: Constructing Deep Algorithms for Polyphonic Music Transcription
This paper presents a new approach to understanding how deep neural networks
(DNNs) work by applying homomorphic signal processing techniques. Focusing on
the task of multi-pitch estimation (MPE), this paper demonstrates the
equivalence relation between a generalized cepstrum and a DNN in terms of their
structures and functionality. Such an equivalence relation, together with pitch
perception theories and the recently established
rectified-correlations-on-a-sphere (RECOS) filter analysis, provides an
alternative way of explaining the role of the nonlinear activation function and
the multi-layer structure, both of which exist in a cepstrum and a DNN. To
validate the efficacy of this new approach, a new feature designed in the same
fashion is proposed as a pitch salience function. The new feature outperforms
the one-layer spectrum in the MPE task and, as predicted, it addresses the
issue of the missing fundamental effect and also achieves better robustness to
noise.
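A minimal sketch of a generalized cepstrum computed layer-wise, in the DNN-like fashion the paper describes, might look as follows; the gamma values and the rectification step are illustrative assumptions.

```python
import numpy as np

def gen_log(x, gamma):
    """Generalized logarithm: (x**gamma - 1)/gamma, which tends to
    log(x) as gamma -> 0; gamma parameterizes the nonlinearity."""
    return np.log(x) if gamma == 0 else (np.power(x, gamma) - 1.0) / gamma

def generalized_cepstrum(frame, gammas=(0.6, 0.6)):
    """Each 'layer' applies a nonlinear activation (generalized log)
    followed by a linear transform (a DFT) and rectification, mirroring
    the activation-plus-weight-matrix structure of a DNN layer."""
    rep = np.abs(np.fft.rfft(frame))              # layer 0: magnitude spectrum
    for g in gammas:
        rep = gen_log(np.maximum(rep, 1e-12), g)  # nonlinear activation
        rep = np.abs(np.fft.rfft(rep))            # linear transform + rectify
    return rep
```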
A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification
One of the biggest challenges of acoustic scene classification (ASC) is to
find proper features to better represent and characterize environmental sounds.
Environmental sounds generally involve more sound sources and exhibit less
structure in their temporal-spectral representations. However, the background of an
acoustic scene exhibits temporal homogeneity in acoustic properties, suggesting
it could be characterized by distribution statistics rather than temporal
details. In this work, we investigate the use of auditory summary statistics as
features for ASC tasks. The inspiration comes from a recent neuroscience study,
which shows that the human auditory system tends to perceive sound textures
through time-averaged statistics. Based on these statistics, we further propose
using linear discriminant analysis to eliminate redundancy among these statistics
while keeping the discriminative information, providing an extremely compact
representation for acoustic scenes. Experimental results show the outstanding
performance of the proposed feature over conventional handcrafted features.
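A reduced sketch of this pipeline is given below: simple time-averaged statistics per subband followed by LDA. The four statistics used here are a stand-in for the full auditory texture statistics, and the variable names X and y are hypothetical.

```python
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def summary_statistics(envelopes):
    """Time-averaged statistics of subband envelopes (frames x bands):
    mean, variance, skewness, and kurtosis per band."""
    return np.concatenate([envelopes.mean(axis=0), envelopes.var(axis=0),
                           skew(envelopes, axis=0), kurtosis(envelopes, axis=0)])

# LDA then projects the statistics onto class-discriminative directions,
# assuming X (n_clips x n_stats) and scene labels y are available:
#   lda = LinearDiscriminantAnalysis().fit(X, y)
#   X_compact = lda.transform(X)
```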
Mixup-Based Acoustic Scene Classification Using Multi-Channel Convolutional Neural Network
Audio scene classification, the problem of predicting class labels of audio
scenes, has attracted considerable attention in recent years. However, it
remains challenging, with shortcomings in both accuracy and efficiency. Recently,
Convolutional Neural Network (CNN)-based methods have achieved better
performance compared to traditional methods. Nevertheless, a conventional
single-channel CNN may fail to consider the fact that additional cues may be
embedded in multi-channel recordings. In this paper, we explore
the use of Multi-channel CNN for the classification task, which aims to extract
features from different channels in an end-to-end manner. We evaluate it against
the conventional CNN and traditional Gaussian Mixture Model-based methods.
Moreover, to improve the classification accuracy further, this paper explores
the use of the mixup method. In brief, mixup trains the neural network on linear
combinations of pairs of audio scene examples and their labels. By employing the
mixup approach for data augmentation, the model achieves higher prediction
accuracy and robustness compared with previous models, while the generalization
error on the evaluation data is also reduced.
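The mixup operation itself is compactly expressible; below is a minimal sketch in which the Beta parameter alpha=0.2 is a common choice rather than the paper's reported setting.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """mixup augmentation: train on convex combinations of pairs of
    inputs (e.g. multi-channel spectrograms) and their one-hot labels."""
    lam = np.random.beta(alpha, alpha)  # mixing coefficient ~ Beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2     # mixed input
    y = lam * y1 + (1.0 - lam) * y2     # mixed soft label
    return x, y
```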
Musical notes classification with Neuromorphic Auditory System using FPGA and a Convolutional Spiking Network
In this paper, we explore the capabilities of a sound
classification system that combines a novel FPGA cochlear
model implementation with a bio-inspired technique based on a
trained convolutional spiking network. The neuromorphic
auditory system that is used in this work produces a form of
representation that is analogous to the spike outputs of the
biological cochlea. The auditory system has been developed using
a set of spike-based processing building blocks in the frequency
domain. They form a set of band-pass filters in the spike domain
that splits the audio information into 128 frequency channels, 64
for each of two audio sources. Address Event Representation
(AER) is used to interface the auditory system with the
convolutional spiking network. A convolutional spiking network
layer is developed and trained on a computer with the ability
to detect two kinds of sound: artificial pure tones in the presence
of white noise and electronic musical notes. After the training
process, the presented system is able to distinguish the different
sounds in real-time, even in the presence of white noise.
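As a software analogy of the spike-based processing described above (the real system runs on an FPGA), the sketch below converts per-channel band energies into AER-style (time, address) events using a simple integrate-and-fire rule; the threshold and encoding are illustrative assumptions.

```python
import numpy as np

def to_aer_events(band_energies, threshold=1.0):
    """Emit an address-event (time step, channel address) whenever a
    channel's accumulated energy crosses a threshold, integrate-and-fire
    style. band_energies is a (n_frames, n_channels) array."""
    acc = np.zeros(band_energies.shape[1])
    events = []
    for t, frame in enumerate(band_energies):
        acc += frame
        for ch in np.flatnonzero(acc >= threshold):
            events.append((t, int(ch)))  # AER tuple: (time, address)
            acc[ch] -= threshold
    return events
```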
Machine Learning For Distributed Acoustic Sensors, Classic versus Image and Deep Neural Networks Approach
Distributed Acoustic Sensing (DAS) using fiber optic cables is a promising
new technology for pipeline monitoring and protection. In this work, we applied
and compared two approaches for event detection using DAS: a classic machine
learning approach and an approach based on image processing and deep learning.
Although acceptable performance can be achieved with both approaches, the
preliminary results show that image-based deep learning is the more promising
approach, offering a six-times-lower event detection delay and a twelve-times-lower
execution time.
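A rough sketch of the two compared pipelines is given below; the specific hand-crafted features and the reshape-to-image step are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def classic_features(window):
    """Classic-ML path: hand-crafted statistics of one DAS signal window,
    which would feed a conventional classifier."""
    zero_cross = (np.diff(np.signbit(window).astype(int)) != 0).mean()
    return np.array([window.mean(), window.std(),
                     np.abs(window).max(), zero_cross])

def to_das_image(samples, n_positions):
    """Image/deep-learning path: reshape a block of DAS samples into a
    2-D time-versus-fiber-position 'image' for a CNN classifier."""
    return samples.reshape(-1, n_positions)
```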