Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper, a novel cross-device text-independent speaker verification
architecture is proposed. The majority of state-of-the-art deep architectures
that are used for speaker verification tasks consider Mel-frequency cepstral
coefficients. In contrast, our proposed Siamese convolutional neural network
architecture uses Mel-frequency spectrogram coefficients to benefit from the
dependency of the adjacent spectro-temporal features. Moreover, although
spectro-temporal features have proved to be highly reliable in speaker
verification models, they only represent some aspects of the short-term
acoustic-level traits of the speaker's voice. The human voice, however, spans
several linguistic levels, such as acoustic, lexical, prosodic, and phonetic,
that can be utilized in speaker verification models. To compensate for these
inherent shortcomings of spectro-temporal features, we propose to enhance the
proposed Siamese convolutional neural network architecture by deploying a
multilayer perceptron network to incorporate the prosodic, jitter, and shimmer
features. The proposed end-to-end verification architecture performs feature
extraction and verification simultaneously. This proposed architecture displays
significant improvement over classical signal processing approaches and deep
algorithms for forensic cross-device speaker verification.
Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory,
Applications, and Systems (BTAS 2018)
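The abstract above mentions jitter and shimmer as inputs to the multilayer perceptron branch. As a rough illustration of what those two quantities measure, here is a minimal sketch using the common "local" definitions (mean absolute cycle-to-cycle difference divided by the mean). The function names and the sample values are hypothetical, and this is not claimed to be the feature extraction used in the paper.

```python
import numpy as np

def local_jitter(periods):
    """Mean absolute difference between consecutive pitch periods,
    divided by the mean period (assumed 'local jitter' definition)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Same ratio, computed on per-cycle peak amplitudes instead of periods."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Hypothetical per-cycle measurements: periods in seconds, linear amplitudes.
periods = [0.0100, 0.0102, 0.0099, 0.0101]
amplitudes = [0.80, 0.82, 0.79, 0.81]
print(local_jitter(periods), local_shimmer(amplitudes))
```

Both values are dimensionless ratios, so they can be concatenated with other prosodic descriptors before being fed to a small network.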
Feature Learning from Spectrograms for Assessment of Personality Traits
Several methods have recently been proposed to analyze speech and
automatically infer the personality of the speaker. These methods often rely on
prosodic and other hand-crafted speech processing features extracted with
off-the-shelf toolboxes. To achieve high accuracy, numerous features are
typically extracted using complex and highly parameterized algorithms. In this
paper, a new method based on feature learning and spectrogram analysis is
proposed to simplify the feature extraction process while maintaining a high
level of accuracy. The proposed method learns a dictionary of discriminant
features from patches extracted in the spectrogram representations of training
speech segments. Each speech segment is then encoded using the dictionary, and
the resulting feature set is used to perform classification of personality
traits. Experiments indicate that the proposed method achieves state-of-the-art
results with a significant reduction in complexity when compared to the most
recent reference methods. The number of features and the difficulties linked to
the feature extraction process are greatly reduced, as only one type of
descriptor is used, for which the 6 parameters can be tuned automatically. In
contrast, the simplest reference method uses 4 types of descriptors to which 6
functionals are applied, resulting in over 20 parameters to be tuned.
Comment: 12 pages, 3 figures
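The pipeline described in the abstract above (extract spectrogram patches, learn a dictionary, encode each segment with it) can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary here is built with plain unsupervised k-means rather than the discriminant learning the abstract refers to, the encoding is a normalized histogram of nearest atoms, and all function names, patch sizes, and the random "spectrogram" are hypothetical.

```python
import numpy as np

def extract_patches(spec, size, step):
    """Slide a size x size window over the spectrogram and flatten each patch."""
    rows, cols = spec.shape
    patches = [spec[r:r + size, c:c + size].ravel()
               for r in range(0, rows - size + 1, step)
               for c in range(0, cols - size + 1, step)]
    return np.array(patches)

def nearest_atom(patches, atoms):
    """Index of the closest dictionary atom (squared Euclidean) for each patch."""
    dists = ((patches[:, None, :] - atoms[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans_dictionary(patches, n_atoms, n_iter=10, seed=0):
    """Assumed stand-in for dictionary learning: a few k-means iterations."""
    rng = np.random.default_rng(seed)
    atoms = patches[rng.choice(len(patches), n_atoms, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest_atom(patches, atoms)
        for k in range(n_atoms):
            members = patches[labels == k]
            if len(members):  # keep the old atom if its cluster is empty
                atoms[k] = members.mean(axis=0)
    return atoms

def encode(spec, atoms, size, step):
    """Represent a segment as a normalized histogram of nearest-atom counts."""
    labels = nearest_atom(extract_patches(spec, size, step), atoms)
    hist = np.bincount(labels, minlength=len(atoms)).astype(float)
    return hist / hist.sum()

# Hypothetical data: a random 40 x 100 array standing in for a spectrogram.
spec = np.random.default_rng(1).random((40, 100))
atoms = kmeans_dictionary(extract_patches(spec, size=8, step=4), n_atoms=5)
code = encode(spec, atoms, size=8, step=4)
print(code)
```

The resulting fixed-length vector could then be passed to any standard classifier; the patch size, step, and dictionary size play the role of the small set of tunable parameters the abstract mentions.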
Learning spectro-temporal representations of complex sounds with parameterized neural networks
Deep Learning models have become potential candidates for auditory
neuroscience research, thanks to their recent successes on a variety of
auditory tasks. Yet these models often lack the interpretability needed to fully
understand the exact computations that have been performed. Here, we proposed a
parameterized neural network layer that computes specific spectro-temporal
modulations based on Gabor kernels (Learnable STRFs) and that is fully
interpretable. We evaluated the predictive capabilities of this layer on Speech
Activity Detection, Speaker Verification, Urban Sound Classification and Zebra
Finch Call Type Classification. We found that models based on Learnable
STRFs are on par for all tasks with different toplines, and obtain the best
performance for Speech Activity Detection. As this layer is fully
interpretable, we used quantitative measures to describe the distribution of
the learned spectro-temporal modulations. The filters adapted to each task and
focused mostly on low temporal and spectral modulations. The analyses show that
the filters learned on human speech have similar spectro-temporal parameters as
the ones measured directly in the human auditory cortex. Finally, we observed
that the tasks organized in a meaningful way: the human vocalization tasks lay
closer to each other, while bird vocalizations lay far away from the human
vocalization and urban sound tasks.
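To make the idea of a Gabor-based spectro-temporal kernel from the abstract above more concrete, the sketch below builds a real-valued 2D Gabor filter: a Gaussian envelope multiplied by a cosine carrier whose temporal and spectral modulation rates are explicit parameters. The function name, the parameter names, and the units (cycles per bin) are assumptions for illustration; this is not the layer implemented in the paper.

```python
import numpy as np

def gabor_strf(n_time, n_freq, sigma_t, sigma_f, omega_t, omega_f):
    """Real 2D Gabor kernel over (time, frequency) bins.

    sigma_t, sigma_f: envelope widths in bins (assumed parameterization).
    omega_t, omega_f: temporal and spectral modulation in cycles per bin.
    """
    # Center the axes so the kernel peaks in the middle of the window.
    t = np.arange(n_time) - (n_time - 1) / 2.0
    f = np.arange(n_freq) - (n_freq - 1) / 2.0
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-0.5 * ((T / sigma_t) ** 2 + (F / sigma_f) ** 2))
    carrier = np.cos(2.0 * np.pi * (omega_t * T + omega_f * F))
    return envelope * carrier

# Hypothetical settings: a 9 x 9 kernel with slow temporal and spectral modulation.
kernel = gabor_strf(9, 9, sigma_t=2.0, sigma_f=2.0, omega_t=0.1, omega_f=0.05)
print(kernel.shape)
```

Because every kernel is fully described by a handful of numbers (two widths and two modulation rates here), the learned values can be read off and compared across tasks, which is the kind of interpretability the abstract emphasizes.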
The joint optimization of spectro-temporal features and deep neural nets for robust ASR
status: published
Deep learning for time series classification: a review
Time Series Classification (TSC) is an important and challenging problem in
data mining. With the increase of time series data availability, hundreds of
TSC algorithms have been proposed. Among these methods, only a few have
considered Deep Neural Networks (DNNs) to perform this task. This is surprising
as deep learning has seen very successful applications in recent years. DNNs
have indeed revolutionized the field of computer vision especially with the
advent of novel deeper architectures such as Residual and Convolutional Neural
Networks. Apart from images, sequential data such as text and audio can also be
processed with DNNs to reach state-of-the-art performance for document
classification and speech recognition. In this article, we study the current
state-of-the-art performance of deep learning algorithms for TSC by presenting
an empirical study of the most recent DNN architectures for TSC. We give an
overview of the most successful deep learning applications in various time
series domains under a unified taxonomy of DNNs for TSC. We also provide an
open source deep learning framework to the TSC community where we implemented
each of the compared approaches and evaluated them on a univariate TSC
benchmark (the UCR/UEA archive) and 12 multivariate time series datasets. By
training 8,730 deep learning models on 97 time series datasets, we propose the
most exhaustive study of DNNs for TSC to date.
Comment: Accepted at Data Mining and Knowledge Discovery
Audio Deepfake Detection: A Survey
Audio deepfake detection is an emerging and active topic. A growing body of
literature has studied deepfake detection algorithms and achieved
effective performance, yet the problem is far from being solved. Although
there are some review articles, there has been no comprehensive survey that
provides researchers with a systematic overview of these developments with a
unified evaluation. Accordingly, in this survey paper, we first highlight the
key differences across various types of deepfake audio, then outline and
analyse competitions, datasets, features, classifications, and evaluation of
state-of-the-art approaches. For each aspect, the basic techniques, advanced
developments and major challenges are discussed. In addition, we perform a
unified comparison of representative features and classifiers on ASVspoof 2021,
ADD 2023 and In-the-Wild datasets for audio deepfake detection, respectively.
The survey shows that future research should address the lack of large-scale
datasets in the wild, the poor generalization of existing detection methods to
unknown fake attacks, and the interpretability of detection results.
- …