VOICE BIOMETRICS UNDER MISMATCHED NOISE CONDITIONS
This thesis describes research into effective voice biometrics (speaker recognition) under mismatched noise conditions. Over the last two decades, this class of biometrics has been the subject of considerable research due to its various applications in such areas as telephone banking, remote access control and surveillance. One of the main challenges associated with the deployment of voice biometrics in practice is that of undesired variations in speech characteristics caused by environmental noise. Such variations can in turn lead to a mismatch between the test material and the corresponding reference material from the same speaker, which is found to adversely affect speaker recognition accuracy.
To address the above problem, a novel approach is introduced and investigated. The proposed method is based on minimising the noise mismatch between reference speaker models and the given test utterance, and involves a new form of Test-Normalisation (T-Norm) for further enhancing matching scores under the aforementioned adverse operating conditions. Through experimental investigations based on the two main classes of speaker recognition (i.e. verification and open-set identification), it is shown that the proposed approach can significantly improve recognition accuracy under mismatched noise conditions.
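For orientation, standard T-Norm computes the mean and standard deviation of the test utterance's scores against a cohort of impostor models and normalises the target score with them. The following is a minimal generic sketch of that idea; the function names and the `score_fn` interface are illustrative assumptions, not the thesis's specific noise-compensated variant:

```python
import numpy as np

def t_norm(raw_score, test_features, cohort_models, score_fn):
    """Test-normalise a raw matching score using a cohort of impostor models.

    raw_score     : score of the test utterance against the claimed speaker model
    test_features : feature vectors extracted from the test utterance
    cohort_models : impostor speaker models comparable to the target model
    score_fn      : callable(model, features) -> raw matching score
    """
    cohort_scores = np.array([score_fn(m, test_features) for m in cohort_models])
    mu, sigma = cohort_scores.mean(), cohort_scores.std()
    return (raw_score - mu) / max(sigma, 1e-12)  # guard against a degenerate cohort spread
```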
In order to further improve the recognition accuracy under severe mismatch conditions, an enhancement of the above method is proposed. This enhancement, which involves a closer adjustment of the reference speaker models to the noise condition of the test utterance, is shown to considerably increase accuracy in extreme cases of noisy test data. Moreover, to tackle the computational burden associated with the use of the enhanced approach in open-set identification, an efficient algorithm for its realisation in this context is introduced and evaluated.
The thesis presents a detailed description of the research undertaken, describes the experimental investigations and provides a thorough analysis of the outcomes.
Autoregressive models for text-independent speaker identification in noisy environments
The closed-set speaker identification problem is defined as the search within a set of persons for the speaker of a given utterance. It is reported that the Gaussian mixture model (GMM) classifier achieves very high classification accuracies (in the range 95% to 100%) when both the training and testing utterances are recorded in a sound-proof studio, i.e., when there is neither additive noise nor spectral distortion in the speech signals.
However, in real-life applications, speech is usually corrupted by noise and band limitation. Moreover, there is often a mismatch between the recording conditions of the training and testing environments. As a result, the classification accuracy of GMM-based systems deteriorates significantly. In this thesis, we propose a two-step procedure for improving speaker identification performance in noisy environments. In the first step, we introduce a new classifier: the vector autoregressive Gaussian mixture (VARGM) model. Unlike the GMM, the new classifier models correlations between successive feature vectors. We also integrate the proposed method into the framework of the universal background model (UBM), and we develop the learning procedure according to the maximum likelihood (ML) criterion. Based on a thorough experimental evaluation, the proposed method achieves an improvement of 3 to 5% in identification accuracy.
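As a point of reference for this first step, closed-set identification with per-speaker generative models reduces to a maximum-likelihood decision over the enrolled speakers. The sketch below illustrates that baseline with ordinary GMMs from scikit-learn; the VARGM classifier proposed in the thesis additionally models correlations between successive feature vectors, which this simplified example does not:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_per_speaker, n_components=16, seed=0):
    """Fit one GMM per enrolled speaker on that speaker's training frames."""
    models = {}
    for speaker_id, frames in features_per_speaker.items():   # frames: (n_frames, n_dims)
        gmm = GaussianMixture(n_components=n_components,
                              covariance_type="diag", random_state=seed)
        models[speaker_id] = gmm.fit(frames)
    return models

def identify(models, test_frames):
    """Closed-set ML decision: pick the speaker whose model gives the highest
    average per-frame log-likelihood on the test utterance."""
    scores = {sid: gmm.score(test_frames) for sid, gmm in models.items()}
    return max(scores, key=scores.get)
```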
In the second step, we propose a new compensation technique based on the generalized maximum likelihood (GML) decision rule. In particular, we assume a general form for the distribution of the noise-corrupted utterances, which contains two types of parameters: clean-speech-related parameters and noise-related parameters. While the clean-speech-related parameters are estimated during the training phase, the noise-related parameters are estimated from the corrupted speech in the testing phase. We applied the proposed method to utterances of 50 speakers selected from the TIMIT database, artificially corrupted by convolutive and additive noise, with the signal-to-noise ratio (SNR) varying from 0 to 20 dB. Simulation results reveal that the proposed method achieves good robustness against variation in the SNR. For utterances corrupted by convolutive noise, the improvement in classification accuracy ranges from 70% at SNR = 0 dB to around 4% at SNR = 10 dB, compared to the standard ML decision rule. For utterances corrupted by additive noise, the improvement in classification accuracy ranges from 1% to 10% for SNRs ranging from 0 to 20 dB.
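Schematically, and in notation introduced here rather than taken from the thesis, the generalized ML decision described above selects

$$\hat{s} = \arg\max_{s} \; \max_{\theta_n} \; p\left(X \mid \lambda_s, \theta_n\right),$$

where $X$ is the noise-corrupted test utterance, $\lambda_s$ are the clean-speech parameters of speaker $s$ estimated during training, and $\theta_n$ are the noise-related parameters estimated from the corrupted test speech itself; the standard ML rule corresponds to dropping the inner maximisation over $\theta_n$.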
The proposed VARGM classifier is also applied to the speech emotion classification problem. In particular, we use the Berlin emotional speech database to validate the classification performance of the proposed VARGM classifier. The proposed technique provides a classification accuracy of 76%, versus 71% for the hidden Markov model, 67% for k-nearest neighbours and 55% for feed-forward neural networks. The model also gives better discrimination than the HMM between high-arousal emotions (joy, anger, fear), low-arousal emotions (sadness, boredom), and neutral emotions.
Another interesting application of the VARGM model is the blind equalization of multiple-input multiple-output (MIMO) communication channels. Based on VARGM modeling of MIMO channels, we propose a four-step equalization procedure. First, the received data vectors are fitted to a VARGM model using the expectation-maximization (EM) algorithm. The constructed VARGM model is then used to filter the received data. A Bayesian decision rule is then applied to identify the transmitted symbols up to permutation and phase ambiguities, which are finally resolved using a small training sequence. Moreover, we propose a fast and easily implementable model order selection technique. The new equalization algorithm is compared to the whitening method and found to provide a lower symbol error probability. The proposed technique is also applied to frequency-flat slow-fading channels and found to provide a more accurate estimate of the channel response than that provided by the blind de-convolution exploiting channel encoding (BDCC) method, and at a higher information rate.
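To make the first and third steps of this procedure concrete, the toy sketch below fits the received vectors of a memoryless 2x2 MIMO channel with a Gaussian mixture via EM, MAP-assigns each vector to a component, and resolves the labelling with a short pilot sequence. The channel, constellation, pilot length and mixture size are all illustrative assumptions, and the temporal (autoregressive) part of the VARGM model is deliberately omitted:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
H = (rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))) / np.sqrt(2)  # unknown channel

sent = rng.integers(0, 4, size=(5000, 2))                   # indices of the transmitted QPSK pair
noise = 0.05 * (rng.normal(size=(5000, 2)) + 1j * rng.normal(size=(5000, 2)))
received = qpsk[sent] @ H.T + noise                         # y = H s + n, row-wise

# Step 1: fit the received vectors with a Gaussian mixture via EM
# (one component per possible transmitted pair: 4 x 4 = 16 components).
X = np.hstack([received.real, received.imag])
gmm = GaussianMixture(n_components=16, covariance_type="full", random_state=0).fit(X)

# Step 2: MAP-assign every received vector to a mixture component.
labels = gmm.predict(X)

# Step 3: resolve the permutation/phase ambiguity with a short pilot sequence:
# each component is mapped to the symbol pair most frequently transmitted
# while that component was observed during the pilot.
n_pilot = 200
mapping = {}
for c in range(16):
    idx = np.where(labels[:n_pilot] == c)[0]
    if len(idx):
        pairs = [tuple(p) for p in sent[idx]]
        mapping[c] = max(set(pairs), key=pairs.count)

decoded = np.array([mapping.get(c, (0, 0)) for c in labels])
print("symbol-pair error rate:", np.mean(np.any(decoded != sent, axis=1)))
```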
Deep learning-based automatic analysis of social interactions from wearable data for healthcare applications
Social interactions of people with Late Life Depression (LLD) could be an objective measure
of social functioning due to the association between LLD and poor social functioning. The
utilisation of wearable computing technologies is a relatively new approach within healthcare
and well-being application sectors. Recently, the design and development of wearable technologies and systems for health and well-being monitoring have attracted the attention of both the clinical and scientific communities, mainly because current clinical behaviour assessments, which are typically rather sporadic, are often administered in artificial settings. As a result, they do not provide a realistic impression of a patient's condition and thus do not lead to sufficient diagnosis and care. Wearable behaviour monitors, in contrast, have the potential for continuous, objective assessment of behaviour and wider social interactions, capturing naturalistic data without constraints on the place of recording or the typical limitations of lab-based research. Such data from naturalistic ambient environments facilitate automated transmission and analysis, allowing for a more timely and accurate assessment of depressive symptoms. In response to this issue of artificial settings, this thesis focuses on
the analysis and assessment of the different aspects of social interactions in naturalistic
environments using deep learning algorithms. That could lead to improvements in both
diagnosis and treatment.
The advantages of using deep learning are that it requires no hand-crafted feature engineering, so the raw data can be used with minimal pre-processing compared to classical machine learning approaches, and that it scales and generalises well. The main dataset used in this thesis was recorded by a wrist-worn device designed at Newcastle University. This device has multiple sensors, including a microphone, a tri-axial accelerometer, a light sensor and a proximity sensor. In this thesis, only the microphone and the tri-axial accelerometer are used for the social interaction analysis; the other sensors are not used because they require additional calibration from the user, who in this case would be elderly people with depression, which was not feasible in this scenario. Novel deep learning models are proposed to
automatically analyse two aspects of social interactions: verbal interactions (acoustic communications) and physical activities (movement patterns). Verbal interactions include the total quantity of speech, who is talking to whom and when, and how much engagement the wearer contributes to conversations. The physical activity analysis includes activity recognition, the quantity of each activity, and sleep patterns.
This thesis is composed of three main stages, two of them discuss the acoustic analysis
and the third stage describes the movement pattern analysis. The acoustic analysis starts
with speech detection in which each segment of the recording is categorised as speech or
non-speech. This segment classification is achieved by a novel deep learning model that
leverages bi-directional Long Short-Term Memory with gated activation units combined
with Maxout Networks as well as a combination of two optimisers. After detecting speech
segments from audio data, the next stage is detecting how much engagement the wearer has
in any conversation throughout these speech events based on detecting the wearer of the
device using a variant model of the previous one that combines the convolutional autoencoder
with bi-directional Long Short-Term Memory. Following this, the system detects the spoken parts of the main speaker/wearer and thus the conversational turn-taking, although only turn-taking between the wearer and other speakers is included, not turn-taking among every speaker in the conversation. This stage does not take the semantics of the speech into account: due to the ethical constraints of the main dataset (the Depression dataset), it was not possible to listen to the recordings or to access any information about their content, so semantic analysis is left for future work.
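For illustration, a minimal speech/non-speech classifier in the spirit of the first acoustic stage might combine a bi-directional LSTM over per-frame acoustic features with a maxout output layer. This is a generic sketch, not the thesis's exact architecture; the layer sizes, input features and training setup are assumptions, and the dual-optimiser scheme is omitted:

```python
import torch
import torch.nn as nn

class BiLSTMMaxoutDetector(nn.Module):
    """Frame-level speech/non-speech detector: BiLSTM encoder + maxout head."""
    def __init__(self, n_features=40, hidden=64, maxout_pieces=3):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        # Maxout: project to several linear "pieces" and keep the element-wise max.
        self.pieces = nn.Linear(2 * hidden, maxout_pieces * 32)
        self.maxout_pieces = maxout_pieces
        self.out = nn.Linear(32, 1)

    def forward(self, x):                      # x: (batch, frames, n_features)
        h, _ = self.rnn(x)                     # (batch, frames, 2*hidden)
        z = self.pieces(h)                     # (batch, frames, pieces*32)
        z = z.reshape(*z.shape[:-1], self.maxout_pieces, 32).max(dim=-2).values
        return self.out(z).squeeze(-1)         # per-frame speech logits

model = BiLSTMMaxoutDetector()
logits = model(torch.randn(8, 200, 40))        # e.g. 8 clips of 200 log-mel frames
loss = nn.BCEWithLogitsLoss()(logits, torch.randint(0, 2, (8, 200)).float())
loss.backward()
```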
Stage 3 involves the physical activity analysis, that is, inferring elementary physical activities and movement patterns. These elementary patterns include sedentary actions, walking, mixed activities, cycling and using vehicles, as well as sleep patterns. The predictive model used is based on Random Forests and Hidden Markov Models. In all stages, the methods presented in this thesis have been compared to the state of the art in processing audio and accelerometer data, respectively, to thoroughly assess their contribution. Following these stages is a thorough analysis of the interplay between the acoustic interactions, the physical movement patterns and the key clinical variables of depression, building on the outcomes of the previous stages. The main reason for not using deep learning in this stage, unlike the previous stages, is that the main dataset (the Depression dataset) has no annotations for speech or activity, owing to the ethical constraints mentioned above. Furthermore, the training dataset (the Discussion dataset) has no annotations for the accelerometer data, since the data were recorded freely and no camera was attached to the device to allow annotation afterwards.
Funded by the Newton-Mosharafa Fund and the Mission Sector and Cultural Affairs, Ministry of Higher Education, Egypt.
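As an illustration of the Stage 3 pipeline described above, a minimal Random Forest plus HMM activity recogniser might classify each accelerometer window with the forest and then smooth the per-window predictions with an HMM. The label set, window features, transition probabilities and component sizes below are assumptions, not the thesis's configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

ACTIVITIES = ["sedentary", "walking", "mixed", "cycling", "vehicle"]  # assumed label set

def train_forest(window_features, window_labels, seed=0):
    """Per-window classifier: each row of window_features summarises one
    accelerometer window (e.g. mean/variance/energy per axis)."""
    return RandomForestClassifier(n_estimators=200, random_state=seed).fit(
        window_features, window_labels)

def viterbi_smooth(frame_probs, transition, prior):
    """HMM smoothing of the forest's per-window class probabilities: treats them
    as emission likelihoods and decodes the most likely activity sequence."""
    log_p = np.log(frame_probs + 1e-12)
    log_A = np.log(transition)
    delta = np.log(prior) + log_p[0]
    back = np.zeros(frame_probs.shape, dtype=int)
    for t in range(1, len(frame_probs)):
        scores = delta[:, None] + log_A            # scores[i, j]: best path ending i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_p[t]
    path = [int(delta.argmax())]
    for t in range(len(frame_probs) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# A strongly self-transitioning HMM discourages implausibly rapid activity switching.
K = len(ACTIVITIES)
transition = np.full((K, K), 0.02 / (K - 1)) + np.eye(K) * (0.98 - 0.02 / (K - 1))

# Usage (columns of predict_proba follow forest.classes_; keep that ordering consistent):
# forest = train_forest(train_X, train_y)
# probs = forest.predict_proba(test_X)
# smoothed = viterbi_smooth(probs, transition, np.full(K, 1.0 / K))
```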
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, different accents of a language exhibit more
fine-grained differences between classes than languages. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state-of-the-art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application to accent identification of the i-Vector technique, the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis
describes various stages in the development of i-Vector based accent classification that improve
the standard approaches usually applied for speaker or language identification, which are
insufficient. We demonstrate that very good accent identification performance is possible with
acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector
configuration parameters, and an optimised fusion of the resulting i-Vector classifiers we can
obtain from the same data.
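As a simple illustration of acoustic classification with i-Vectors, accent classes can be scored by comparing a test utterance's i-vector against per-accent mean i-vectors with cosine similarity. This sketch assumes the i-vectors have already been extracted and is not the thesis's specific pipeline, which adds projections and classifier fusion:

```python
import numpy as np

def length_normalise(v):
    """Project an i-vector onto the unit sphere (standard before cosine scoring)."""
    return v / (np.linalg.norm(v) + 1e-12)

def accent_centroids(ivectors, accents):
    """Mean length-normalised i-vector per accent class."""
    centroids = {}
    for accent in set(accents):
        vs = np.array([length_normalise(v) for v, a in zip(ivectors, accents) if a == accent])
        centroids[accent] = length_normalise(vs.mean(axis=0))
    return centroids

def classify(ivector, centroids):
    """Pick the accent whose centroid has the highest cosine similarity."""
    v = length_normalise(ivector)
    return max(centroids, key=lambda a: float(v @ centroids[a]))
```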
We claim to have achieved the best accent identification performance on the test corpus
for acoustic methods, with up to 90% identification rate. This performance is even better than
previously reported acoustic-phonotactic based systems on the same corpus, and is very close
to performance obtained via transcription based accent identification. Finally, we demonstrate
that the utilization of our techniques for speech recognition purposes leads to considerably
lower word error rates.
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition
Advances in Subspace-based Solutions for Diarization in the Broadcast Domain
The motivation for this thesis is the need for robust solutions to the diarization problem. These diarization techniques must add value to the growing amount of available multimedia data by accurately discriminating among the speakers present in the audio signal. Unfortunately, until recently this kind of technology was only viable under restricted conditions, and thus remained far from a general solution. The reasons behind the limited performance of diarization systems are manifold. The first cause to consider is the high complexity of human voice production, in particular regarding the physiological processes that imprint speaker-discriminative characteristics on the speech signal. This complexity makes the inverse process, estimating those characteristics from the audio, an inefficient task for current state-of-the-art techniques. Consequently, approximations must be used instead. Modelling efforts have produced increasingly elaborate models, though without seeking the ultimate physiological explanation of the speech signal; instead, these models learn relationships between acoustic signals from a large training dataset. The development of approximate models in turn gives rise to a second cause, domain variability: because the models rely on relationships learned from a particular training set, any change of domain that modifies the acoustic conditions with respect to the training data undermines the assumed relationships and can cause consistent system failures.
Our contribution to diarization technologies has focused on the broadcast domain. This domain is still a challenging environment for diarization systems, in which no simplification of the task can be assumed. Efficient modelling of the audio must therefore be developed in order to extract the speaker information and to infer the corresponding labelling. In addition, the presence of multiple acoustic conditions, due to the existence of different programmes and/or genres in the domain, requires techniques capable of adapting the knowledge acquired in a scenario where information is available to those environments where such information is limited or simply unavailable.
For this purpose, the work carried out throughout the thesis has focused on three subtasks: speaker characterisation, clustering and model adaptation. The first subtask seeks to model an audio fragment so as to obtain accurate representations of the speakers involved, highlighting their discriminative properties. In this area, a study of current modelling strategies has been carried out, paying particular attention to the limitations of the extracted representations and highlighting the kinds of errors they can produce. In addition, neural-network-based alternatives have been proposed, making use of the acquired knowledge. The second subtask is clustering, which is responsible for developing strategies that seek the optimal labelling of the speakers. The research carried out during this thesis has proposed new strategies for estimating the best partition of speakers based on subspace techniques, especially PLDA. Finally, the model adaptation subtask seeks to transfer the knowledge obtained from a training set to alternative domains where no data are available to extract it. To this end, the efforts have focused on the unsupervised extraction of speaker information from the audio to be diarized itself, which is subsequently used to adapt the models involved.
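As a rough illustration of the clustering subtask, the sketch below groups per-segment speaker embeddings with cosine-scored agglomerative clustering; this is a simplified stand-in for, not an implementation of, the PLDA-based subspace strategies developed in the thesis, and the threshold and embedding type are assumptions:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def diarize_segments(embeddings, distance_threshold=0.6):
    """Group fixed-length audio segments by speaker.

    embeddings : (n_segments, dim) array of per-segment speaker embeddings
                 (e.g. i-vectors or neural embeddings), assumed precomputed.
    Returns an integer speaker label per segment.
    """
    # Length-normalise so that cosine distance behaves sensibly.
    X = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
    clustering = AgglomerativeClustering(
        n_clusters=None,                  # let the threshold decide the speaker count
        distance_threshold=distance_threshold,
        metric="cosine",                  # scikit-learn >= 1.2; older versions use `affinity`
        linkage="average",
    )
    return clustering.fit_predict(X)
```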
Feature extraction and information fusion in face and palmprint multimodal biometrics
Multimodal biometric systems that integrate the biometric traits from several
modalities are able to overcome the limitations of single modal biometrics. Fusing
the information at an earlier level by consolidating the features given by different
traits can give a better result due to the richness of information at this stage. In this
thesis, three novel methods are derived and implemented on face and palmprint
modalities, taking advantage of the multimodal biometric fusion at feature level.
The benefits of the proposed methods are an enhanced capability to discriminate information in the fused features and to capture all of the information required to improve classification performance. The multimodal biometric system proposed here consists of several stages, namely feature extraction, fusion, recognition and classification.
Feature extraction gathers all important information from the raw images. A new local feature extraction method has been designed to extract information from the face and palmprint images in the form of sub-block windows. Multiresolution analysis using the Gabor transform and the DCT is computed for each sub-block window to produce compact local features for the face and palmprint images. Multiresolution Gabor analysis captures important information in the texture of the images, while the DCT represents the information in different frequency components. Important features with high discrimination power are then preserved by selecting several low-frequency coefficients in order to estimate the model parameters.
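A minimal sketch of this kind of block-based Gabor/DCT feature extraction could look like the following; the filter-bank parameters, block size and number of retained coefficients are assumptions, not the thesis's settings:

```python
import cv2
import numpy as np
from scipy.fftpack import dct

def block_gabor_dct_features(gray_image, block=16, n_coeffs=4):
    """Split a 2-D grayscale image into non-overlapping blocks; for each block,
    apply a small Gabor filter bank and keep the top-left (low-frequency)
    2-D DCT coefficients of each filtered block as the local feature."""
    kernels = [cv2.getGaborKernel((9, 9), sigma=2.0, theta=t, lambd=8.0, gamma=0.5, psi=0)
               for t in np.linspace(0, np.pi, 4, endpoint=False)]
    h, w = gray_image.shape
    features = []
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            patch = gray_image[y:y + block, x:x + block].astype(np.float32)
            for k in kernels:
                g = cv2.filter2D(patch, cv2.CV_32F, k)
                c = dct(dct(g, axis=0, norm="ortho"), axis=1, norm="ortho")
                features.extend(c[:n_coeffs, :n_coeffs].ravel())   # low-frequency corner
    return np.array(features)
```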
The local features extracted are fused using a new matrix interleaving method. The new fused feature vector is higher in dimensionality than the original feature vectors from either modality, so it carries high discriminating power and contains rich statistical information. The fused feature vector also provides more data points in the feature space, which is advantageous for the training process using statistical methods. The underlying statistical information in the fused feature vectors is captured using a GMM, whose model parameters are estimated from the distribution of the fused feature vectors.
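A minimal sketch of feature-level fusion by interleaving the two modality vectors and modelling the result with a GMM is shown below; the interleaving scheme, dimensions and component count are illustrative assumptions rather than the thesis's exact design:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def interleave(face_feat, palm_feat):
    """Feature-level fusion: alternate elements of the two modality vectors,
    appending any leftover tail if the lengths differ."""
    face_feat, palm_feat = np.asarray(face_feat), np.asarray(palm_feat)
    n = min(len(face_feat), len(palm_feat))
    fused = np.empty(2 * n)
    fused[0::2] = face_feat[:n]
    fused[1::2] = palm_feat[:n]
    return np.concatenate([fused, face_feat[n:], palm_feat[n:]])

def enrol(user_fused_vectors, n_components=4, seed=0):
    """One GMM per enrolled user, fitted on that user's fused training vectors."""
    return GaussianMixture(n_components=n_components, covariance_type="diag",
                           random_state=seed).fit(np.asarray(user_fused_vectors))
```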
The maximum likelihood score is used as a measure of certainty for recognition, while maximum likelihood score normalization is used for the classification process. The use of likelihood score normalization is found to suppress impostor likelihood scores when the background model parameters are estimated from a pool of users that includes statistical information from impostors. The present method achieved the highest recognition accuracies of 97% and 99.7% when tested on the FERET-PolyU and ORL-PolyU datasets respectively.
Funded by Universiti Malaysia Perlis and the Ministry of Higher Education, Malaysia.
USING DEEP LEARNING-BASED FRAMEWORK FOR CHILD SPEECH EMOTION RECOGNITION
Bodily signals through which human emotion can be detected abound, including heart rate, facial expressions, movement of the eyelids and dilation of the eyes, body posture, skin conductance, and even the speech we produce. Speech emotion recognition research started some three decades ago, and the popular Interspeech Emotion Challenge has helped to propagate this research area. However, most speech emotion recognition research focuses on adults, and there is very little research on child speech. This dissertation describes the development and evaluation of a child speech emotion recognition framework. The higher-level components of the framework are designed to sort and separate speech based on the speaker's age, ensuring that the focus is only on speech produced by children. The framework uses Baddeley's Theory of Working Memory to model a Working Memory Recurrent Network that can process and recognize emotions from speech. Baddeley's Theory of Working Memory offers one of the best explanations of how the human brain holds and manipulates temporary information, which is crucial for developing neural networks that learn effectively. Experiments were designed and performed to provide answers to the research questions, evaluate the proposed framework, and benchmark its performance against other methods. Satisfactory results were obtained from the experiments and, in many cases, our framework was able to outperform other popular approaches. This study has implications for various applications of child speech emotion recognition, such as child abuse detection and child learning robots.
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies in Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions, distributed among 5 oral sessions, a PhD session and a projects session. To ensure the quality of all the contributions, each submitted paper was reviewed by three members of the scientific review committee. All the papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine and Slovenia). Furthermore, it has been confirmed that an extension of selected papers will be published as a special issue, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", in the journal Applied Sciences, published by MDPI with full open access. In addition to the regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session.
Red Española de Tecnologías del Habla. Universidad de Valladolid.