1,475 research outputs found
Robust Speech Detection for Noisy Environments
This paper presents a robust voice activity detector (VAD) based on hidden Markov models (HMMs) to improve speech recognition systems in stationary and non-stationary noise environments: inside motor vehicles (such as cars or planes) or inside buildings close to high-traffic places (such as an air traffic control (ATC) tower). In these environments, there is a high stationary noise level caused by vehicle motors and, in addition, there may be people speaking at some distance from the main speaker, producing non-stationary noise. The VAD presented in this paper is characterized by a new front-end and a noise level adaptation process that significantly increases the VAD robustness across different signal-to-noise ratios (SNRs). The feature vector used by the VAD includes the most relevant Mel Frequency Cepstral Coefficients (MFCCs), normalized log energy and delta log energy. The proposed VAD has been evaluated and compared to other well-known VADs using three databases containing different noise conditions: speech in clean environments (SNRs greater than 20 dB), speech recorded in stationary noise environments (inside or close to motor vehicles), and finally, speech in non-stationary environments (including noise from bars, television and far-field speakers). In all three cases, the detection error obtained with the proposed VAD is the lowest for all SNRs compared to Acero's VAD (the reference for this work) and other well-known VADs such as AMR, AURORA or G.729 Annex B
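The energy part of the feature vector described in the abstract above can be illustrated with a minimal sketch. It is not the authors' front-end: the frame length, hop size, peak-based normalisation and simple first-difference delta are all assumptions chosen for illustration, and the MFCC part is omitted.

```python
import math

def frame_log_energies(samples, frame_len=160, hop=80, floor=1e-10):
    """Return the log energy of each frame of a mono signal.

    frame_len, hop and floor are illustrative values, not taken from the paper.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        energies.append(math.log(energy + floor))
    return energies

def normalize_log_energy(log_e):
    """Assumed normalisation: subtract the utterance maximum (peak becomes 0)."""
    peak = max(log_e)
    return [e - peak for e in log_e]

def delta(values):
    """First-order difference between frames; the first frame's delta is 0."""
    return [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]

# Toy signal: 400 near-silent samples followed by 400 louder samples.
signal = [0.001] * 400 + [0.5 if i % 2 else -0.5 for i in range(400)]
log_e = normalize_log_energy(frame_log_energies(signal))
d_log_e = delta(log_e)
features = list(zip(log_e, d_log_e))  # one (log energy, delta) pair per frame
```

In this toy signal the delta log energy is largest at the frame where the louder segment begins, which is the kind of cue an onset-sensitive feature is meant to provide.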
Purging of silence for robust speaker identification in colossal database
The aim of this work is to develop an effective speaker recognition system for large data sets under noisy environments. The important phases involved in typical identification systems are feature extraction, training and testing. During the feature extraction phase, the speaker-specific information is processed based on the characteristics of the voice signal. In this work, effective methods for silence removal are proposed in order to achieve accurate recognition under noisy environments. Pitch and pitch-strength parameters are extracted as distinct features from the input speech spectrum. Multilinear principal component analysis (MPCA) is utilized to reduce the complexity of the parameter matrix. Silence removal using zero crossing rate (ZCR) and endpoint detection algorithm (EDA) methods is applied to the source utterance during the feature extraction phase. These features are used in the later classification phase, where identification is made on the basis of support vector machine (SVM) algorithms. Forward looking stochastic (FOLOS), an efficient large-scale SVM algorithm, has been employed for effective classification among speakers. The evaluation findings indicate that the suggested methods increase performance for large amounts of data in noisy environments
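The ZCR-based silence removal mentioned in the abstract above can be sketched as follows. This is a generic frame-level illustration, not the authors' EDA: the frame length, both thresholds and the rule for combining energy with ZCR are hypothetical choices.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)

def is_speech(frame, energy_thresh=1e-3, zcr_thresh=0.3):
    """Assumed rule: keep loud frames, plus weak frames with many crossings."""
    energy = short_time_energy(frame)
    if energy >= energy_thresh:
        return True
    return energy >= energy_thresh / 10 and zero_crossing_rate(frame) >= zcr_thresh

def remove_silence(samples, frame_len=100):
    """Concatenate the frames judged to be speech; a trailing partial frame is dropped."""
    kept = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if is_speech(frame):
            kept.extend(frame)
    return kept

# Toy utterance: silence, a loud segment, a weak high-ZCR segment, silence.
silence = [0.0] * 200
loud = [0.5 if i % 2 == 0 else -0.5 for i in range(200)]
weak = [0.02 if i % 2 == 0 else -0.02 for i in range(100)]
trimmed = remove_silence(silence + loud + weak + silence)
```

The weak segment is kept only because of its high zero crossing rate, which is why ZCR is commonly paired with energy: low-energy consonants would otherwise be discarded along with the silence.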
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified. Comment: 15 pages, 2 pdf figures
Voice biometric system security: Design and analysis of countermeasures for replay attacks.
PhD Thesis. Voice biometric systems use automatic speaker verification (ASV) technology for
user authentication. Even if it is among the most convenient means of biometric
authentication, the robustness and security of ASV in the face of spoofing attacks
(or presentation attacks) is of growing concern and is now well acknowledged
by the research community. A spoofing attack involves illegitimate access to
personal data of a targeted user. Replay is among the simplest attacks to
mount, yet difficult to detect reliably, and is the focus of this thesis.
This research focuses on the analysis and design of existing and novel countermeasures
for replay attack detection in ASV, organised in two major parts.
The first part of the thesis investigates existing methods for spoofing detection
from several perspectives. I first study the generalisability of hand-crafted features
for replay detection that show promising results on synthetic speech detection.
I find, however, that it is difficult to achieve similar levels of performance
due to the acoustically different problem under investigation. In addition, I show
how class-dependent cues in a benchmark dataset (ASVspoof 2017) can lead to
the manipulation of class predictions. I then analyse the performance of several
countermeasure models under varied replay attack conditions. I find that it is
difficult to account for the effects of various factors in a replay attack: acoustic
environment, playback device and recording device, and their interactions.
Subsequently, I developed and studied a convolutional neural network (CNN)
model that demonstrates comparable performance to the one that ranked first
in the ASVspoof 2017 challenge. Here, the experiment analyses what the CNN
has learned for replay detection using a method from interpretable machine
learning. The findings suggest that the model attends strongly to the first few
milliseconds of test recordings in order to make predictions. Then, I perform
an in-depth analysis of a benchmark dataset (ASVspoof 2017) for spoofing detection
and demonstrate that any machine learning countermeasure model can
still exploit the artefacts I identified in this dataset.
The second part of the thesis studies the design of countermeasures for ASV,
focusing on model robustness and avoiding dataset biases. First, I proposed
an ensemble model combining shallow and deep machine learning methods for
spoofing detection, and then demonstrate its effectiveness on the latest benchmark
datasets (ASVspoof 2019). Next, I proposed the use of speech endpoint detection
for reliable and robust model predictions on the ASVspoof 2017 dataset.
For this, I created a publicly available collection of hand-annotations of speech
endpoints for the same dataset, and new benchmark results for both frame-based
and utterance-based countermeasures are also developed.
I then proposed spectral subband modelling using CNNs for replay detection.
My results indicate that models that learn subband-specific information
substantially outperform models trained on complete spectrograms. Finally, I
proposed to use variational autoencoders (deep unsupervised generative models)
as an alternative backend for spoofing detection and demonstrate encouraging
results when compared with the traditional Gaussian mixture model
Dual-level segmentation method for feature extraction enhancement strategy in speech emotion recognition
The speech segmentation approach can be a significant factor in the overall performance of a Speech Emotion Recognition (SER) system. An utterance may contain more than one perceived emotion, and the boundaries between changes of emotion within an utterance are challenging to determine. Speech segmented with a conventional fixed window does not correspond to the signal changes: because the segment point is arbitrary, the resulting segment boundary may fall within a sentence or between emotional changes. This study introduces an improvement to fixed-window Relative Time Interval (RTI) segmentation by using a Signal Change (SC) segmentation approach to locate signal boundaries at signal transitions. A segment-based feature extraction enhancement strategy using a dual-level segmentation method is proposed: RTI-SC segmentation, which builds on the conventional approach. Instead of segmenting the whole utterance at the relative time interval, this study applies peak analysis to obtain segment boundaries defined by the maximum peak value within each temporary RTI segment. In peak selection, over-segmentation might occur due to connections with the input signal, affecting the boundary selection decision. Two approaches to finding the maximum peaks were implemented: first, peak selection by distance allocation, and second, peak selection by a Maximum function. Substituting the temporary RTI segment with the segment defined by signal change was intended to better capture high-level statistical features within the signal transition. The signal's prosodic, spectral, and wavelet properties were integrated to build a fine feature set based on the proposed method. Thirty-six low-level descriptors and 12 statistical features, together with their derivatives, were extracted from each segment, resulting in a fixed vector dimension.
Correlation-based Feature Subset Selection (CFS) with the Best First search method was applied for dimensionality reduction before a Support Vector Machine (SVM) with Sequential Minimal Optimization (SMO) was used for classification. The performance of the feature fusion constructed from the proposed method was evaluated through speaker-dependent and speaker-independent tests on the EMO-DB and RAVDESS databases. The results indicate that the prosodic and spectral features derived from the dual-level segmentation method offered a higher recognition rate for most speaker-independent tasks, with a significant improvement in overall accuracy to 82.2% (150 features), the highest among the segmentation approaches used in this study. The proposed method outperformed the baseline approach in single-emotion assessment with both the full feature dimensions and an optimized set. The highest accuracy for most emotions was obtained with the proposed method. On the EMO-DB database, accuracy improved for happy (67.6%), anger (89%), fear (85.5%) and disgust (79.3%), while the neutral and sadness emotions obtained accuracy similar to the baseline method (91% and 93.5%, respectively). An accuracy of 100% for the boredom emotion (female speaker) was observed in the speaker-dependent test, the highest single-emotion result reported in this study
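The dual-level idea in the abstract above (temporary RTI segments, then boundaries moved to the maximum peak inside each) can be sketched as below. This is an interpretation under stated assumptions, not the authors' method: the peak is taken as the largest absolute sample value, the number of segments is arbitrary, and the function names are hypothetical.

```python
def rti_segments(n_samples, n_segments):
    """Split [0, n_samples) into n_segments equal relative time intervals."""
    edges = [round(i * n_samples / n_segments) for i in range(n_segments + 1)]
    return list(zip(edges[:-1], edges[1:]))

def peak_boundaries(samples, n_segments):
    """Index of the largest absolute sample inside each temporary RTI segment."""
    boundaries = []
    for start, end in rti_segments(len(samples), n_segments):
        window = samples[start:end]
        offset = max(range(len(window)), key=lambda i: abs(window[i]))
        boundaries.append(start + offset)
    return boundaries

def signal_change_segments(samples, n_segments):
    """Cut the utterance at the peak boundaries instead of the fixed RTI edges."""
    cuts = sorted(set([0] + peak_boundaries(samples, n_segments) + [len(samples)]))
    return [samples[a:b] for a, b in zip(cuts[:-1], cuts[1:])]

# Toy utterance with one dominant peak in each of three RTI segments.
utterance = [0.0] * 12
utterance[2], utterance[5], utterance[10] = 1.0, -2.0, 0.5
boundaries = peak_boundaries(utterance, n_segments=3)
segments = signal_change_segments(utterance, n_segments=3)
```

Unlike the fixed RTI edges, the resulting segments have unequal lengths because each cut follows a feature of the signal rather than the clock.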
An on-line VAD based on Multi-Normalisation Scoring (MNS) of observation likelihoods
Preprint of the article published online on 31 May 2018. Voice activity detection (VAD) is an essential task in expert systems that rely on oral interfaces. The VAD module detects the presence of human speech and separates speech segments from silences and non-speech noises. The most popular current on-line VAD systems are based on adaptive parameters which seek to cope with varying channel and noise conditions. The main disadvantages of this approach are the need for some initialisation time to properly adjust the parameters to the incoming signal and uncertain performance in the case of poor estimation of the initial parameters. In this paper we propose a novel on-line VAD based only on previous training which does not introduce any delay. The technique is based on a strategy that we have called Multi-Normalisation Scoring (MNS). It consists of obtaining a vector of multiple observation likelihood scores from normalised mel-cepstral coefficients previously computed from different databases. A classifier is then used to label the incoming observation likelihood vector. Encouraging results have been obtained with a Multi-Layer Perceptron (MLP). This technique can generalise to unseen noise levels and types. A validation experiment with two current standard ITU-T VAD algorithms demonstrates the good performance of the method. Indeed, lower classification error rates are obtained for non-speech frames, while results for speech frames are similar. This work was partially supported by the EU (ERDF) under grant TEC2015-67163-C2-1-R (RESTORE) (MINECO/ERDF, EU) and by the Basque Government under grant KK-2017/00043 (BerbaOla)
- …