63 research outputs found
Exploring some limits of Gaussian PLDA modeling for i-vector distributions
Gaussian PLDA (G-PLDA) modeling for i-vector based speaker verification has proven competitive with heavy-tailed PLDA (HT-PLDA) based on Student's t-distribution, while the latter is much more computationally expensive. However, its results are achieved using length normalization, which projects i-vectors onto the non-linear and finite surface of a hypersphere. This paper investigates the limits of linear Gaussian G-PLDA modeling when the data distribution is spherical. In particular, its homoscedasticity assumption is questionable: the model assumes that within-speaker variability can be estimated by a single, linear parameter. A non-probabilistic approach is proposed, competitive with the state of the art, which reveals some limits of the Gaussian modeling in terms of goodness of fit. We carry out a residual analysis, which uncovers a relation between the dispersion of a speaker class and its location and thus shows that the homoscedasticity assumption is not fulfilled.
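The length normalization referred to above is the standard unit-norm projection of i-vectors onto the hypersphere. A minimal NumPy sketch (illustrative only, not the authors' code; the toy vectors are synthetic):

```python
import numpy as np

def length_normalize(ivectors):
    """Project each i-vector onto the unit hypersphere (length normalization).

    After this step every vector lies on the finite, non-linear surface of a
    hypersphere, which is what motivates questioning linear Gaussian
    assumptions such as homoscedasticity.
    """
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / norms

# Toy example: three 4-dimensional "i-vectors" (synthetic, for illustration)
rng = np.random.default_rng(0)
iv = rng.normal(size=(3, 4))
iv_norm = length_normalize(iv)
```

After normalization, every row has unit Euclidean norm, so all vectors sit on the same spherical surface regardless of their original magnitude.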
The Effect Of Acoustic Variability On Automatic Speaker Recognition Systems
This thesis examines the influence of acoustic variability on automatic speaker recognition systems (ASRs) with three aims: i. to measure ASR performance under five commonly encountered acoustic conditions; ii. to contribute towards ASR system development with the provision of new research data; iii. to assess ASR suitability for forensic speaker comparison (FSC) application and investigative/pre-forensic use. The thesis begins with a literature review and explanation of relevant technical terms. Five categories of research experiments then examine ASR performance, reflective of conditions influencing speech quantity (inhibitors) and speech quality (contaminants), acknowledging that quality often influences quantity. Experiments pertain to: net speech duration, signal-to-noise ratio (SNR), reverberation, frequency bandwidth and transcoding (codecs). The ASR system is placed under scrutiny with examination of settings and optimum conditions (e.g. matched/unmatched test audio and speaker models). Output is examined in relation to baseline performance, and metrics assist in informing whether ASRs should be applied to suboptimal audio recordings. Results indicate that modern ASRs are relatively resilient to low and moderate levels of the acoustic contaminants and inhibitors examined, whilst remaining sensitive to higher levels. The thesis provides discussion of issues such as the complexity and fragility of the speech signal path, speaker variability, difficulty in measuring conditions, and mitigation (thresholds and settings). The application of ASRs to casework is discussed with recommendations, acknowledging the different modes of operation (e.g. investigative usage) and current UK limitations regarding presenting ASR output as evidence in criminal trials.
In summary, and in the context of acoustic variability, the thesis recommends that ASRs could be applied to pre-forensic cases, accepting that extraneous issues endure which require governance, such as validation of method (ASR standardisation) and population data selection. However, ASRs remain unsuitable for broad forensic application, with many acoustic conditions causing irrecoverable speech data loss that contributes to high error rates.
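One of the contaminant measures listed above, signal-to-noise ratio, can be illustrated with a toy computation (a NumPy sketch; the tone and noise here are synthetic stand-ins, not material from the thesis corpus):

```python
import numpy as np

def snr_db(clean, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# Toy example: one second of a 1 kHz tone at 8 kHz sampling, plus white noise
fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)          # average power 0.5
rng = np.random.default_rng(1)
noise = 0.1 * rng.standard_normal(fs)        # average power about 0.01
measured = snr_db(tone, noise)               # roughly 17 dB
```

Lower values of this measure correspond to the heavier contamination levels to which the thesis finds ASRs sensitive.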
REPRESENTATION LEARNING FOR ACTION RECOGNITION
The objective of this research work is to develop discriminative representations for human
actions. The motivation stems from the fact that there are many issues encountered while
capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration),
inter-action similarity, background motion, and occlusion of actors. Hence, obtaining
a representation which can address all the variations in the same action while maintaining
discrimination with other actions is a challenging task. In the literature, actions have been
represented using either low-level or high-level features. Low-level features describe
the motion and appearance in small spatio-temporal volumes extracted from a video. Due
to the limited space-time volume used for extracting low-level features, they are not able
to account for viewpoint and actor variations or variable length actions. On the other hand,
high-level features handle variations in actors, viewpoints, and duration but the resulting
representation is often high-dimensional which introduces the curse of dimensionality. In
this thesis, we propose new representations for describing actions by combining the advantages
of both low-level and high-level features. Specifically, we investigate various linear
and non-linear decomposition techniques to extract meaningful attributes in both high-level
and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build
action-specific dictionaries. Each dictionary retains only the discriminative information
for a particular action and hence reduces inter-action similarity. Then, a sparsity-based
classification method is proposed to classify the low-rank representation of clips obtained
using these dictionaries. We show that this representation based on dictionary learning improves
the classification performance across actions. Also, a few of the actions consist of
rapid body deformations that hinder the extraction of local features from body movements.
Hence, we propose to use a dictionary which is trained on convolutional neural network
(CNN) features of the human body in various poses to reliably identify actors from the
background. Particularly, we demonstrate the efficacy of sparse representation in the identification
of the human body under rapid and substantial deformation.
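The use of action-specific dictionaries for classification can be sketched as follows. This is a simplified illustration only: it replaces proper sparse coding with a least-squares fit and uses invented toy dictionaries, whereas the thesis learns dictionaries from high-level feature descriptors:

```python
import numpy as np

def classify_by_residual(x, dictionaries):
    """Assign x to the action whose dictionary reconstructs it best.

    dictionaries: {label: D} with atoms as columns of D. The code for x is
    approximated here by a least-squares fit; a sparse solver would be used
    in the actual dictionary-learning setting described above.
    """
    residuals = {}
    for label, D in dictionaries.items():
        code, *_ = np.linalg.lstsq(D, x, rcond=None)
        residuals[label] = np.linalg.norm(x - D @ code)
    return min(residuals, key=residuals.get)

# Toy dictionaries for two hypothetical actions (random, for illustration)
rng = np.random.default_rng(2)
D_run = rng.normal(size=(10, 3))
D_jump = rng.normal(size=(10, 3))
x = D_run @ np.array([1.0, -0.5, 2.0])  # feature lying in the "run" subspace
predicted = classify_by_residual(x, {"run": D_run, "jump": D_jump})
```

A clip whose features lie in one action's subspace yields a near-zero residual for that dictionary and a large residual for the others, which is the discriminative effect the class-specific dictionaries aim for.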
In the first two approaches, sparsity-based representation is developed to improve discriminability
using class-specific dictionaries that utilize action labels. However, developing
an unsupervised representation of actions is more beneficial as it can be used to both
recognize similar actions and localize actions. We propose to exploit inter-action similarity
to train a universal attribute model (UAM) in order to learn action attributes (common and
distinct) implicitly across all the actions. Using maximum a posteriori (MAP) adaptation,
a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains
redundant attributes of all other actions, we use factor analysis to extract a novel
low-dimensional action-vector representation for each clip. Action-vectors are shown to suppress
background motion and highlight actions of interest in both trimmed and untrimmed
clips, which contributes to action recognition without the help of any classifiers.
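Under simplifying assumptions (a standard-normal prior on the action-vector and unit residual covariance), the factor-analysis step that maps a super action-vector to a low-dimensional action-vector has the closed-form posterior mean sketched below. Names and dimensions are illustrative, not the thesis implementation:

```python
import numpy as np

def action_vector(sav, mean, T):
    """Posterior-mean estimate of the low-dimensional action-vector.

    Assumes the generative model sav = mean + T @ w with a standard-normal
    prior on w and unit residual covariance, giving
    w_hat = (I + T^T T)^{-1} T^T (sav - mean).
    """
    k = T.shape[1]
    precision = np.eye(k) + T.T @ T
    return np.linalg.solve(precision, T.T @ (sav - mean))

# Toy setup: 100-dimensional supervector, 5-dimensional action-vector
rng = np.random.default_rng(3)
T = rng.normal(size=(100, 5))     # toy factor-loading matrix
mean = rng.normal(size=100)       # toy UAM supervector mean
w_true = rng.normal(size=5)
sav = mean + T @ w_true           # noiseless super action-vector
w_hat = action_vector(sav, mean, T)
```

On this noiseless toy example the estimate recovers the latent vector up to a small shrinkage from the prior, showing how the redundant high-dimensional SAV collapses to a compact action-vector.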
It is observed during our experiments that action-vectors cannot effectively discriminate
between actions which are visually similar to each other. Hence, we subject action-vectors
to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic
LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complementary
information across action-vectors using different local features followed by discriminative
embedding provides the best classification performance. Further, we explore
non-linear embedding of action-vectors using Siamese networks especially for fine-grained
action recognition. A visualization of the hidden layer output in Siamese networks shows
its ability to effectively separate visually similar actions. This leads to better classification
performance than linear embedding on fine-grained action recognition.
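The supervised linear embedding described above can be sketched with a plain Fisher LDA in NumPy (a minimal illustration on synthetic data, not the thesis pipeline or its PLDA variant):

```python
import numpy as np

def lda_embedding(X, y, dim):
    """Supervised linear embedding via Fisher LDA.

    Finds directions maximizing between-class scatter relative to
    within-class scatter, the role LDA plays above in separating
    visually similar actions.
    """
    d = X.shape[1]
    mean = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    # Solve the generalized eigenproblem Sb v = lambda Sw v (regularized)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(evals.real)[::-1]
    return X @ evecs[:, order[:dim]].real

# Two "visually similar" toy actions: close class means, shared covariance
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 10)),
               rng.normal(0.5, 1.0, size=(50, 10))])
y = np.array([0] * 50 + [1] * 50)
Z = lda_embedding(X, y, dim=1)
```

In the embedded space the two class means separate by more than the raw per-dimension offset, which is the discriminative effect the embedding is meant to enforce.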
All of the above approaches are presented on large unconstrained datasets with hundreds
of examples per action. However, actions in surveillance videos like snatch thefts are
difficult to model because of the diverse variety of scenarios in which they occur and very
few labeled examples. Hence, we propose to utilize the universal attribute model (UAM)
trained on large action datasets to represent such actions. Specifically, we show that there
are similarities between certain actions in the large datasets with snatch thefts which help
in extracting a representation for snatch thefts using the attributes from the UAM. This
representation is shown to be effective in distinguishing snatch thefts from regular actions
with high accuracy.

In summary, this thesis proposes both supervised and unsupervised approaches for representing
actions which provide better discrimination than existing representations. The
first approach presents a dictionary learning based sparse representation for effective discrimination
of actions. Also, we propose a sparse representation for the human body based
on dictionaries in order to recognize actions with rapid body deformations. In the next
approach, a low-dimensional representation called action-vector for unsupervised action
recognition is presented. Further, linear and non-linear embedding of action-vectors is
proposed for addressing inter-action similarity and fine-grained action recognition, respectively.
Finally, we propose a representation for locating snatch thefts among thousands of
regular interactions in surveillance videos.
A study into automatic speaker verification with aspects of deep learning
Advancements in automatic speaker verification (ASV) can be attributed primarily to improvements in modelling and classification techniques, capable of capturing ever larger amounts of speech data.
This thesis begins by presenting a fairly extensive review of developments in ASV, up to the current state of the art with i-vectors and PLDA. A series of practical tuning experiments then follows. It is found, somewhat surprisingly, that even the training of the total variability matrix required for i-vector extraction is potentially susceptible to unwanted variabilities.
The thesis then explores the use of deep learning in ASV. A literature review is first made, with two training methodologies appearing evident: indirectly using a deep neural network trained for automatic speech recognition, and directly with speaker related output classes. The review finds that interest in direct training appears to be increasing, underpinned with the intent to discover new robust 'speaker embedding' representations.
Lastly, a preliminary experiment is presented, investigating the use of a deep convolutional network for speaker identification. The small set of results shows that the network successfully identifies two test speakers out of 84 possible speakers enrolled. It is hoped that subsequent research might lead to new robust speaker representations or features.
Speaker Recognition in Unconstrained Environments
Speaker recognition is applied in smart home devices, interactive voice response systems, call centers, online banking and payment solutions as well as in forensic scenarios. This dissertation is concerned with speaker recognition systems in unconstrained environments. Before this dissertation, research on making better decisions in unconstrained environments was insufficient. Aside from decision making, unconstrained environments imply two other subjects: security and privacy. Within the scope of this dissertation, these research subjects are regarded as both security against short-term replay attacks and privacy preservation within state-of-the-art biometric voice comparators in the light of a potential leak of biometric data. The aforementioned research subjects are united in this dissertation to sustain good decision making processes facing uncertainty from varying signal quality and to strengthen security as well as preserve privacy.
Conventionally, biometric comparators are trained to classify between mated and non-mated reference–probe pairs under idealistic conditions but are expected to operate well in the real world. However, the more the voice signal quality degrades, the more erroneous decisions are made. The severity of their impact depends on the requirements of a biometric application. In this dissertation, quality estimates are proposed and employed for the purpose of making better decisions on average in a formalized way (quantitative method), while the specifications of decision requirements of a biometric application remain unknown. By using the Bayesian decision framework, the specification of application-dependent decision requirements is formalized, outlining operating points: the decision thresholds. The assessed quality conditions combine ambient and biometric noise, both of which occur in commercial as well as in forensic application scenarios. Dual-use (civil and governmental) technology is investigated. As it seems infeasible to train systems for every possible signal degradation, a small number of quality conditions is used. After examining the impact of degrading signal quality on biometric feature extraction, the extraction is assumed ideal in order to conduct a fair benchmark. This dissertation proposes and investigates methods for propagating information about quality to decision making. By employing quality estimates, a biometric system's output (comparison scores) is normalized in order to ensure that each score encodes the least-favorable decision trade-off in its value. Application development is segregated from requirement specification. Furthermore, class discrimination and score calibration performance is improved over all decision requirements for real-world applications.
In contrast to the ISO/IEC 19795-1:2006 standard on biometric performance (error rates), this dissertation is based on biometric inference for probabilistic decision making (subject to prior probabilities and cost terms). This dissertation elaborates on the paradigm shift from requirements by error rates to requirements by beliefs in priors and costs. Binary decision error trade-off plots are proposed, interrelating error rates with prior and cost beliefs, i.e., formalized decision requirements. Verbal tags are introduced to summarize categories of least-favorable decisions: the plot's canvas follows from Bayesian decision theory. Empirical error rates are plotted, encoding categories of decision trade-offs by line styles. Performance is visualized in the latent decision subspace, allowing empirical evaluation under changes in prior- and cost-based decision requirements.
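The operating points mentioned above follow from Bayesian decision theory: given a target prior and the two error costs, the decision threshold on the log-likelihood ratio is fixed. A minimal sketch of that standard Bayes threshold (not code from the dissertation):

```python
import math

def bayes_threshold(p_target, cost_miss, cost_fa):
    """Bayes-optimal log-likelihood-ratio decision threshold.

    Accept (decide "mated") when the LLR exceeds
    log( C_fa * (1 - P_target) / (C_miss * P_target) ),
    the point minimizing expected cost under the stated priors and costs.
    """
    return math.log((cost_fa * (1.0 - p_target)) / (cost_miss * p_target))

# Equal priors and costs: the threshold sits at 0
theta_equal = bayes_threshold(0.5, 1.0, 1.0)
# Rare targets and expensive false alarms push the threshold up
theta_strict = bayes_threshold(0.01, 1.0, 10.0)
```

Varying the prior and cost beliefs sweeps this threshold, which is exactly the family of decision requirements the proposed trade-off plots visualize.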
Security against short-term audio replay attacks (a collage of sound units such as phonemes and syllables) is strengthened. The unit-selection attack is posed by the ASVspoof 2015 challenge (English speech data), representing the most difficult to detect voice presentation attack of this challenge. In this dissertation, unit-selection attacks are created for German speech data, where support vector machine and Gaussian mixture model classifiers are trained to detect collage edges in speech representations based on wavelet and Fourier analyses. Competitive results are reached compared to the challenged submissions.
Homomorphic encryption is proposed to preserve the privacy of biometric information in the case of database leakage. In this dissertation, log-likelihood ratio scores, representing biometric evidence objectively, are computed in the latent biometric subspace. Conventional comparators rely on the feature extraction to ideally represent biometric information, whereas latent subspace comparators are trained to find ideal representations of the biometric information in the voice reference and probe samples to be compared. Two protocols are proposed for the two-covariance comparison model, a special case of probabilistic linear discriminant analysis. Log-likelihood ratio scores are computed in the encrypted domain based on encrypted representations of the biometric reference and probe. As a consequence, the biometric information conveyed in voice samples is, in contrast to many existing protection schemes, stored protected and without information loss. The first protocol preserves privacy of end-users, requiring one public/private key pair per biometric application. The second protocol preserves privacy of end-users and comparator vendors with two key pairs. Comparators estimate the biometric evidence in the latent subspace, such that the subspace model requires data protection as well. In both protocols, log-likelihood ratio based decision making meets the requirements of the ISO/IEC 24745:2011 biometric information protection standard in terms of unlinkability, irreversibility, and renewability properties of the protected voice data.
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.
A Data-Driven Perspective on Residential Electricity Modeling and Structural Health Monitoring
In recent years, due to the increasing efficiency and availability of information technologies for collecting massive amounts of data (e.g., smart meters and sensors), a variety of advanced technologies and decision-making strategies in the civil engineering sector have shifted, in leaps and bounds, toward a data-driven manner. While there is still no consensus in industry and academia on the latest advances, challenges, and trends in some innovative data-driven methods related to, e.g., deep learning and neural networks, it is undeniable that these techniques have proven considerably effective in helping academics and engineers solve many real-life tasks related to the smart city framework.
This dissertation systematically presents the investigation and development of cutting-edge data-driven methods related to two specific areas of civil engineering, namely, Residential Electricity Modeling (REM) and Structural Health Monitoring (SHM). For both components, the presentation of this dissertation starts with a brief review of classical data-driven methods used in particular problems, gradually progresses to an exploration of the related state-of-the-art technologies, and eventually lands on our proposed novel data-driven strategies and algorithms. In addition to the classical and state-of-the-art modeling techniques focused on these two areas, this dissertation also puts great emphasis on the proposed feature extraction and selection approaches.
These approaches aim to optimize model performance and save computational resources by achieving an ideal characterization of the information embedded in the collected raw data that is most relevant to the problem objectives, especially when modeling deep neural networks. For the problems on REM, the proposed methods are validated with real recorded data from multi-family residential buildings, while for SHM the algorithms are validated with data from numerically simulated systems as well as real bridge structures.
UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD
Performance of automatic speaker verification (ASV) systems is very sensitive
to mismatch between training (source) and testing (target) domains. The
best way to address domain mismatch is to perform matched condition training
– gather sufficient labeled samples from the target domain and use them in
training. However, in many cases this is too expensive or impractical. Usually,
gaining access to unlabeled target domain data, e.g., from open source online
media, and labeled data from other domains is more feasible. This work focuses
on making ASV systems robust to uncontrolled (‘wild’) conditions, with
the help of some unlabeled data acquired from such conditions.
Given acoustic features from both domains, we propose learning a mapping
function – a deep convolutional neural network (CNN) with an encoder-decoder
architecture – between features of both the domains. We explore training the
network in two different scenarios: training on paired speech samples from
both domains and training on unpaired data. In the former case, where the
paired data is usually obtained via simulation, the CNN is treated as a non-linear
regression function and is trained to minimize L2 loss between original
and predicted features from target domain. We provide empirical evidence that
this approach introduces distortions that affect verification performance. To
address this, we explore training the CNN using adversarial loss (along with
L2), which makes the predicted features indistinguishable from the original
ones, and thus improves verification performance.
The above framework using simulated paired data, though effective, cannot
be used to train the network on unpaired data obtained by independently
sampling speech from both domains. In this case, we first train a CNN using
adversarial loss to map features from target to source. We, then, map the
predicted features back to the target domain using an auxiliary network, and
minimize a cycle-consistency loss between the original and reconstructed target
features.
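The cycle-consistency objective just described can be sketched as follows, with plain linear maps standing in for the CNN and its auxiliary network (illustrative only; the actual models are deep encoder-decoder networks trained with an additional adversarial loss):

```python
import numpy as np

def cycle_consistency_loss(x_target, map_to_source, map_to_target):
    """L1 cycle-consistency loss ||x - F(G(x))||_1 averaged over a batch.

    G maps target-domain features to the source domain; the auxiliary
    network F maps them back. Only the loss itself is illustrated here.
    """
    reconstructed = map_to_target(map_to_source(x_target))
    return np.mean(np.abs(x_target - reconstructed))

# Toy batch of target-domain features and a perfectly invertible "mapping"
rng = np.random.default_rng(5)
x = rng.normal(size=(8, 20))
A = rng.normal(size=(20, 20))
G = lambda z: z @ A                   # stand-in target-to-source map
F = lambda z: z @ np.linalg.inv(A)    # its exact inverse as the auxiliary map
loss_perfect = cycle_consistency_loss(x, G, F)
loss_broken = cycle_consistency_loss(x, G, lambda z: z)  # identity fails
```

When the auxiliary map exactly inverts the forward map the loss vanishes, and any failure to reconstruct the target features is penalized, which is what allows training on unpaired data.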
Our unsupervised adaptation approach complements its supervised counterpart,
where adaptation is done using labeled data from both domains. We
focus on three domain mismatch scenarios: (1) sampling frequency mismatch
between the domains, (2) channel mismatch, and (3) robustness to far-field and
noisy speech acquired from wild conditions.
Deep Neural Network Architectures for Large-scale, Robust and Small-Footprint Speaker and Language Recognition
Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defence: 27-04-2017.
Artificial neural networks are powerful learners of the information embedded in speech signals.
They can provide compact, multi-level, nonlinear representations of temporal sequences
and holistic optimization algorithms capable of surpassing former leading paradigms. Artificial
neural networks are, therefore, a promising technology that can be used to enhance our
ability to recognize speakers and languages, an ability increasingly in demand in the context
of new, voice-enabled interfaces used today by millions of users. The aim of this thesis is to
advance the state-of-the-art of language and speaker recognition through the formulation,
implementation and empirical analysis of novel approaches for large-scale and portable
speech interfaces. Its major contributions are: (1) novel, compact network architectures
for language and speaker recognition, including a variety of network topologies based on
fully-connected, recurrent, convolutional, and locally connected layers; (2) a bottleneck combination
strategy for classical and neural network approaches for long speech sequences; (3)
the architectural design of the first, public, multilingual, large vocabulary continuous speech
recognition system; and (4) a novel, end-to-end optimization algorithm for text-dependent
speaker recognition that is applicable to a range of verification tasks. Experimental results
have demonstrated that artificial neural networks can substantially reduce the number of
model parameters and surpass the performance of previous approaches to language and
speaker recognition, particularly in the cases of long short-term memory recurrent networks
(used to model the input speech signal), end-to-end optimization algorithms (used to predict
languages or speakers), short testing utterances, and large training data collections.