Listening for Sirens: Locating and Classifying Acoustic Alarms in City Scenes
This paper addresses the detection of alerting acoustic events and sound
source localisation in an urban scenario. Specifically, we are interested in
spotting the presence of horns and the sirens of emergency vehicles. In order to obtain a
reliable system able to operate robustly despite the presence of traffic noise,
which can be copious, unstructured and unpredictable, we propose to treat the
spectrograms of incoming stereo signals as images, and apply semantic
segmentation, based on a Unet architecture, to extract the target sound from
the background noise. In a multi-task learning scheme, together with signal
denoising, we perform acoustic event classification to identify the nature of
the alerting sound. Lastly, we use the denoised signals to localise the
acoustic source on the horizon plane, by regressing the direction of arrival of
the sound through a CNN architecture. Our experimental evaluation shows an
average classification rate of 94%, and a median absolute error on the
localisation of 7.5° when operating on audio frames of 0.5s, and of
2.5° when operating on frames of 2.5s. The system offers excellent
performance in particularly challenging scenarios, where the noise level is
remarkably high.
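As a rough illustration of the spectrogram-as-image idea, the sketch below masks a two-channel (stereo) magnitude spectrogram with a toy PyTorch encoder-decoder carrying one U-Net-style skip connection; all layer sizes are assumptions rather than the paper's architecture, and the classification and direction-of-arrival heads are omitted.

    # Toy U-Net-style masking of a stereo spectrogram (illustrative sizes).
    import torch
    import torch.nn as nn

    class TinyUNet(nn.Module):
        def __init__(self, in_ch=2):                  # 2 channels: left/right spectrograms
            super().__init__()
            self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
            self.down = nn.MaxPool2d(2)
            self.up = nn.ConvTranspose2d(16, 16, 2, stride=2)
            self.out = nn.Sequential(nn.Conv2d(32, in_ch, 3, padding=1), nn.Sigmoid())

        def forward(self, x):                          # x: (batch, 2, freq, time)
            e = self.enc(x)
            d = self.up(self.down(e))                  # encode, then decode back to input size
            mask = self.out(torch.cat([e, d], dim=1))  # skip connection, U-Net style
            return x * mask                            # soft mask extracts the target sound

    spec = torch.rand(1, 2, 256, 128)                  # stereo magnitude spectrogram "image"
    denoised = TinyUNet()(spec)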
End-to-End Speech Recognition From the Raw Waveform
State-of-the-art speech recognition systems rely on fixed, hand-crafted
features such as mel-filterbanks to preprocess the waveform before the training
pipeline. In this paper, we study end-to-end systems trained directly from the
raw waveform, building on two alternatives for trainable replacements of
mel-filterbanks that use a convolutional architecture. The first one is
inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015),
and the second one by the scattering transform (Zeghidour et al., 2017). We
propose two modifications to these architectures and systematically compare
them to mel-filterbanks, on the Wall Street Journal dataset. The first
modification is the addition of an instance normalization layer, which greatly
improves on the gammatone-based trainable filterbanks and speeds up the
training of the scattering-based filterbanks. The second one relates to the
low-pass filter used in these approaches. These modifications consistently
improve performances for both approaches, and remove the need for a careful
initialization in scattering-based trainable filterbanks. In particular, we
show a consistent improvement in word error rate of the trainable filterbanks
relative to comparable mel-filterbanks. This is the first time end-to-end
models trained from the raw signal significantly outperform mel-filterbanks on
a large vocabulary task under clean recording conditions.
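The sketch below shows the general shape of such a front end, assuming PyTorch: a trainable 1-D convolutional filterbank over the raw waveform followed by the instance normalization layer the abstract highlights. Filter count, kernel size and stride are illustrative, and neither the gammatone nor the scattering initialization is reproduced here.

    # Trainable time-domain filterbank with instance normalization (illustrative).
    import torch
    import torch.nn as nn

    class LearnableFrontEnd(nn.Module):
        def __init__(self, n_filters=40, kernel=400, stride=160):
            super().__init__()
            self.filters = nn.Conv1d(1, n_filters, kernel, stride=stride)  # learned filterbank
            self.norm = nn.InstanceNorm1d(n_filters, affine=True)          # the added normalization

        def forward(self, wav):                    # wav: (batch, 1, samples)
            x = self.filters(wav) ** 2             # energy in each learned band
            return self.norm(torch.log1p(x))       # compress, then normalize per utterance

    feats = LearnableFrontEnd()(torch.randn(4, 1, 16000))   # -> (4, 40, frames)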
Brian Hears: Online Auditory Processing Using Vectorization Over Channels
The human cochlea includes about 3000 inner hair cells which filter sounds at frequencies between 20 Hz and 20 kHz. This massively parallel frequency analysis is reflected in models of auditory processing, which are often based on banks of filters. However, existing implementations do not exploit this parallelism. Here we propose algorithms to simulate these models by vectorizing computation over frequency channels, which are implemented in “Brian Hears,” a library for the spiking neural network simulator package “Brian.” This approach allows us to use high-level programming languages such as Python, because with vectorized operations, the computational cost of interpretation represents a small fraction of the total cost. This makes it possible to define and simulate complex models in a simple way, while all previous implementations were model-specific. In addition, we show that these algorithms can be naturally parallelized using graphics processing units, yielding substantial speed improvements. We demonstrate these algorithms with several state-of-the-art cochlear models, and show that they compare favorably with existing, less flexible, implementations.
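The vectorization idea can be shown with a deliberately simple NumPy sketch: a bank of one-pole smoothing filters (a stand-in for real gammatone channels) whose 3000 per-channel states are all updated in a single array operation per sample, so the interpreted loop runs over time only, never over channels.

    # Vectorization over channels: one interpreted loop over time,
    # one NumPy operation updating all 3000 channel states at once.
    import numpy as np

    n_channels, n_samples, fs = 3000, 16000, 44100.0
    cf = np.logspace(np.log10(20.0), np.log10(20000.0), n_channels)  # centre frequencies
    a = np.exp(-2 * np.pi * cf / fs)       # per-channel one-pole coefficient
    x = np.random.randn(n_samples)         # mono input broadcast to every channel
    y = np.zeros(n_channels)
    out = np.empty((n_samples, n_channels))
    for t in range(n_samples):
        y = a * y + (1 - a) * x[t]         # all channels updated in one vector step
        out[t] = y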
Exploring Filterbank Learning for Keyword Spotting
Despite their great performance over the years, handcrafted speech features
are not necessarily optimal for any particular speech application.
Consequently, with greater or lesser success, optimal filterbank learning has
been studied for different speech processing tasks. In this paper, we fill a
gap by exploring filterbank learning for keyword spotting (KWS). Two approaches
are examined: filterbank matrix learning in the power spectral domain and
parameter learning of a psychoacoustically-motivated gammachirp filterbank.
Filterbank parameters are optimized jointly with a modern deep residual neural
network-based KWS back-end. Our experimental results reveal that, in general,
there are no statistically significant differences, in terms of KWS accuracy,
between using a learned filterbank and handcrafted speech features. Thus, while
we conclude that the latter are still a wise choice when using modern KWS
back-ends, we also hypothesize that this could be a symptom of information
redundancy, which opens up new research possibilities in the field of
small-footprint KWS.
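For the first of the two approaches, filterbank matrix learning in the power spectral domain can be sketched as below, assuming PyTorch: a trainable (bands x bins) matrix stands in for the fixed mel matrix, with gains kept non-negative; the deep residual KWS back-end it would be trained with is omitted.

    # Learnable filterbank matrix applied to power spectra (illustrative).
    import torch
    import torch.nn as nn

    class LearnedFilterbank(nn.Module):
        def __init__(self, n_bins=257, n_bands=40):
            super().__init__()
            self.weight = nn.Parameter(torch.rand(n_bands, n_bins) * 0.01)

        def forward(self, power_spec):                    # (batch, frames, n_bins)
            fb = torch.relu(self.weight)                  # non-negative filter gains
            return torch.log(power_spec @ fb.T + 1e-6)    # log filterbank energies

    feats = LearnedFilterbank()(torch.rand(8, 100, 257))  # -> (8, 100, 40)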
Speaker Identification and Spoken Word Recognition in Noisy Environments Using Different Techniques
In this work, we design ASR systems in software that perform speaker identification, spoken word recognition, and the combination of the two in a general noisy environment. The automatic speech recognition system is designed for a limited vocabulary of Telugu words and control commands. Experiments are conducted to find the combination of feature extraction technique and classifier model that performs best in a general noisy environment (home/office settings where noise is around 15-35 dB). A recently proposed feature extraction technique, gammatone frequency coefficients, reported as the best fit to the human auditory system, is chosen for the experiments alongside the more common MFCC and PLP features as part of the front-end process (speech feature extraction). Two artificial neural network classifiers, learning vector quantization (LVQ) networks and radial basis function (RBF) networks, together with hidden Markov models (HMMs), are chosen as part of the back-end process (training/modelling the ASRs). The performance of the ASR systems built from the nine resulting combinations (three feature extraction techniques and three classifier models) is analysed in terms of spoken word recognition and speaker identification accuracy, ASR design time, and recognition/identification response time. The test speech samples are recorded under general noisy conditions, i.e. in the presence of air conditioning noise, fan noise, computer keyboard noise and distant cross-talk. The ASR systems are designed and analysed programmatically in the MATLAB 2013(a) environment.
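The front-end stage of such a pipeline can be sketched in Python with librosa for the MFCC case; the gammatone (GFCC) and PLP variants require dedicated toolboxes and are omitted, and the bundled example clip merely stands in for a recorded Telugu command.

    # MFCC front end only (one of the three compared feature types).
    import librosa
    import numpy as np

    y, sr = librosa.load(librosa.example("trumpet"))      # stand-in for a command recording
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, frames)
    features = np.mean(mfcc, axis=1)                      # fixed-length vector per utterance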
Learning to detect dysarthria from raw speech
Speech classifiers of paralinguistic traits traditionally learn from diverse
hand-crafted low-level features, by selecting the relevant information for the
task at hand. We explore an alternative to this selection by jointly learning
the classifier and the feature extraction. Recent work on speech recognition
has shown improved performance over speech features by learning from the
waveform. We extend this approach to paralinguistic classification and propose
a neural network that can learn a filterbank, a normalization factor and a
compression power from the raw speech, jointly with the rest of the
architecture. We apply this model to dysarthria detection from sentence-level
audio recordings. Starting from a strong attention-based baseline on which
mel-filterbanks outperform standard low-level descriptors, we show that
learning the filters or the normalization and compression improves over fixed
features by 10% absolute accuracy. We also observe a gain over openSMILE
features by learning jointly the feature extraction, the normalization, and the
compression factor with the architecture. This constitutes a first attempt at
jointly learning all these operations from raw audio for a speech
classification task.
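The learned normalization factor and compression power can be sketched as a small PyTorch module operating on non-negative filterbank energies; the parameterization and initial values below are assumptions, not the paper's.

    # Learnable per-channel normalization and compression power (illustrative).
    import torch
    import torch.nn as nn

    class LearnableCompression(nn.Module):
        def __init__(self, n_channels=40):
            super().__init__()
            self.power = nn.Parameter(torch.full((n_channels,), 0.5))  # compression exponent
            self.scale = nn.Parameter(torch.ones(n_channels))          # normalization factor

        def forward(self, energy):                 # (batch, channels, frames), energy >= 0
            p = self.power.clamp(0.1, 1.0).view(1, -1, 1)
            return self.scale.view(1, -1, 1) * (energy + 1e-6) ** p

    out = LearnableCompression()(torch.rand(2, 40, 100))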
Environmental Sound Classification with Parallel Temporal-spectral Attention
Convolutional neural networks (CNN) are one of the best-performing neural
network architectures for environmental sound classification (ESC). Recently,
temporal attention mechanisms have been used in CNN to capture the useful
information from the relevant time frames for audio classification, especially
for weakly labelled data where the onset and offset times of the sound events
are not annotated. In these methods, however, the inherent spectral
characteristics and variations are not explicitly exploited when obtaining the
deep features. In this paper, we propose a novel parallel temporal-spectral
attention mechanism for CNN to learn discriminative sound representations,
which enhances the temporal and spectral features by capturing the importance
of different time frames and frequency bands. Parallel branches are constructed
to allow temporal attention and spectral attention to be applied respectively
in order to mitigate interference from the segments without the presence of
sound events. Experiments on three ESC
datasets and two acoustic scene classification (ASC) datasets show that our
method improves the classification performance and also exhibits robustness to
noise.
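The parallel branching can be sketched in PyTorch as below: one branch scores time frames, the other scores frequency bands, and the re-weighted feature maps are summed. The 1x1 convolutions, pooling axes and additive fusion are assumptions, not the paper's exact design.

    # Parallel temporal and spectral attention over a CNN feature map (illustrative).
    import torch
    import torch.nn as nn

    class ParallelTSAttention(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.temporal = nn.Conv2d(channels, 1, 1)   # per-frame scores
            self.spectral = nn.Conv2d(channels, 1, 1)   # per-band scores

        def forward(self, x):                            # x: (batch, ch, freq, time)
            w_t = torch.sigmoid(self.temporal(x).mean(dim=2, keepdim=True))  # (b,1,1,T)
            w_f = torch.sigmoid(self.spectral(x).mean(dim=3, keepdim=True))  # (b,1,F,1)
            return x * w_t + x * w_f                     # two branches, fused additively

    y = ParallelTSAttention(64)(torch.rand(2, 64, 40, 100))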