The case for automatic higher-level features in forensic speaker recognition
Abstract: Standard automatic speaker recognition approaches, which rely on cepstral features, suffer from a lack of interpretability in forensic applications. However, the growing practice of using "higher-level" features in automatic systems offers promise in this regard. We provide an overview of automatic higher-level systems and discuss potential advantages, as well as issues, for their use in the forensic context.
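The cepstral features mentioned above can be illustrated with a minimal sketch (not from the paper): the real cepstrum is the inverse Fourier transform of the log magnitude spectrum, and for voiced speech it shows a peak at the pitch period. The synthetic 100 Hz harmonic frame below is an assumed stand-in for a voiced speech frame.

```python
import numpy as np

def real_cepstrum(frame: np.ndarray) -> np.ndarray:
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # epsilon avoids log(0)
    return np.fft.irfft(log_mag, n=len(frame))

# Synthetic "voiced" frame: harmonics of 100 Hz at an 8 kHz sampling rate,
# so the pitch period is 8000 / 100 = 80 samples.
sr = 8000
t = np.arange(1024) / sr
frame = sum(np.sin(2 * np.pi * 100 * k * t) / k for k in range(1, 8))
c = real_cepstrum(frame * np.hanning(len(frame)))
# Search away from the low-quefrency spectral-envelope region.
peak = 20 + int(np.argmax(c[20:200]))
```

The cepstral peak lands near quefrency 80 samples, i.e. the pitch period; the low-quefrency part of the cepstrum carries the vocal-tract envelope that standard systems actually use as features.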
Detection of the uniqueness of a human voice: towards machine learning for improved data efficiency
The aim of this thesis is to characterise the voice parameters that can establish the identity of the person who is speaking, independent of the language used. The fundamental goal of the work is to understand how humans recognise a speaker. Voice parameters such as speech rate, natural and intended or unintended pauses, fundamental frequency, phoneme generation, and volume are considered, since the combination of all of these parameters cannot easily be imitated by another person. The working assumption is that different speakers speak differently; at the same time, it is important to remember that the same speaker’s voice will change over time. For example, a speaker cannot say the same thing in exactly the same way time after time. These variations in speech are nevertheless audible and can be measured using combinations of voice parameters.
The aim is to eliminate speakers we are not looking for. Individuals use words to communicate with other people, and they use the same method to communicate with machines. Humans successfully use speech-to-text software to talk to their telephones instead of tapping words on a keyboard. Machines have proven to be good at converting speech to text, but not at identifying who is speaking.
Problems remain in recognising an individual from their speech in a way that is reliable, repeatable and robust; otherwise a speaker could, for example, find themselves locked out of their online voice-accessed account. The risks are asymmetric: if one in 100 people is locked out of an account, that is not too serious, as customer services will ask for answers to security questions. However, if one in 100 people can get into a bank account fraudulently, that is a far bigger problem.
A speaker’s voice varies in frequency, tone, and volume sufficiently to uniquely identify an individual. Other factors also contribute to this uniqueness: the size and shape of the mouth, throat, nose, and vocal cords. Sound is produced by air passing from the lungs through the throat and vocal cords and then out through the mouth; a voice makes different sounds depending on the position of the mouth and throat. It is the variation of these attributes that allows for identification.
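One of the parameters above, the fundamental frequency produced by the vocal cords, can be estimated from a short frame with a simple autocorrelation peak pick. This is an illustrative sketch, not the thesis method; the pure sine stands in for a glottal tone, and the 60–400 Hz search range is an assumption.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin=60.0, fmax=400.0) -> float:
    """Crude fundamental-frequency estimate via the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag range for the F0 search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 120 * t)  # stand-in for a 120 Hz voiced sound
f0 = estimate_f0(tone, sr)
```

The estimate comes back within a few Hz of 120, quantised by the integer lag; real speech needs voicing detection and smoothing on top of this.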
Speaker recognition systems are already available, but their overall accuracy is limited by several issues: the extracted features are based on very short time windows of speech, and the models fail to capture useful speaker information because current speech recognition systems and their extracted features are language-dependent. Using the voice parameters above, the work here was able to eliminate 80 percent of the population when identifying a person. Recognising one person out of 100 is difficult, but identifying one out of 5 is comparatively easy.
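The elimination idea described above can be sketched as range filtering over stored voice parameters: candidates whose parameters fall outside a tolerance band around the probe's measurements are discarded before any fine-grained matching. Everything below (the population, the two parameters, the tolerances) is illustrative, not from the thesis.

```python
import numpy as np

# Hypothetical enrolled population: per-speaker (mean F0 in Hz,
# speech rate in syllables/second), drawn from plausible ranges.
rng = np.random.default_rng(0)
population = {f"spk{i}": (rng.uniform(80, 250), rng.uniform(3, 7))
              for i in range(100)}

def shortlist(probe_f0, probe_rate, pop, f0_tol=20.0, rate_tol=0.8):
    """Keep only speakers whose stored parameters are consistent with the probe."""
    return [s for s, (f0, rate) in pop.items()
            if abs(f0 - probe_f0) <= f0_tol and abs(rate - probe_rate) <= rate_tol]

kept = shortlist(150.0, 5.0, population)
# Most of the 100 enrolled speakers are eliminated; the remaining
# shortlist is what a finer-grained matcher would then have to separate.
```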
A biometric system for identity verification based on speech recognition
In this thesis I explore and study in depth biometric recognition based on speech; the objective is to build a software application that recognises an individual from a sample of their voice.
Effects of Equipment Variations on Speaker Recognition Error Rates
The purpose of this study was to examine the effects that equipment variation has on speaker recognition performance. Specifically, microphone variation is investigated. The study examines the error rates of a speaker recognition system when microphones vary between the enrollment and testing phases. The study also examines the error rates of a speaker recognition system when microphones differ in similar environments and conditions. The metrics for evaluation of effect are the false identity acceptance and false identity rejection error rates.
School of Electrical & Computer Engineering
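The two error rates used as the evaluation metric above can be computed directly from trial scores at a decision threshold; this is a generic sketch of the metric, not the study's own evaluation code, and the scores below are made up.

```python
import numpy as np

def error_rates(genuine, impostor, threshold):
    """False acceptance / false rejection rates at a given score threshold."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    frr = float(np.mean(genuine < threshold))    # true speaker rejected
    far = float(np.mean(impostor >= threshold))  # wrong speaker accepted
    return far, frr

# Illustrative similarity scores for 5 genuine and 5 impostor trials.
genuine = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor = [0.5, 0.3, 0.2, 0.55, 0.1]
far, frr = error_rates(genuine, impostor, threshold=0.55)
# far = 0.2 (one impostor at/above 0.55), frr = 0.2 (one genuine below it)
```

Sweeping the threshold trades the two rates against each other, which is how a mismatched microphone shows up as a shifted error curve rather than a single number.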
Local representations and random sampling for speaker verification
In text-independent speaker verification, research over the last decade has focused on compensating for intra-speaker variability at the modeling stage. Intra-speaker variability may be due to channel effects, phonetic content, or the speaker himself, in the form of speaking style, emotional state, health, or similar factors. Joint Factor Analysis, Total Variability Space compensation, and Nuisance Attribute Projection are among the most successful approaches to inter-session variability compensation in the literature. In this thesis, we question the assumption of a low-dimensional channel space made by these methods and propose partitioning the acoustic space into local regions, so that intra-speaker variability compensation can be done in each local region separately. Two architectures are proposed, depending on whether the subsequent modeling and scoring steps are also done locally or globally. We also focus on a particular component of intra-speaker variability, namely within-session variability. Its main source is differences in the phonetic content of speech segments within a single utterance; these may be due either to variability across acoustic events or to differences in the actual realizations of the acoustic events. We propose a method to combat these variabilities through random sampling of the training utterances. The method is shown to be effective for both short and long test utterances.
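The random-sampling idea can be sketched as follows: instead of using every frame of a training utterance, draw several random subsets of its feature frames, so that each subset has a different phonetic mix and the resulting models or scores can be combined. The frame matrix, subset count, and fraction below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic feature matrix: 500 frames of 20 cepstral coefficients each.
utterance = rng.normal(size=(500, 20))

def random_frame_subsets(frames, n_subsets=5, fraction=0.6):
    """Draw n_subsets random subsets of the utterance's frames (no replacement)."""
    n = int(len(frames) * fraction)
    return [frames[rng.choice(len(frames), size=n, replace=False)]
            for _ in range(n_subsets)]

subsets = random_frame_subsets(utterance)
# Five 300-frame subsets; each would yield its own model or score,
# and the combination averages out within-session phonetic variability.
```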
Deliverable D1.4 Visual, text and audio information analysis for hypervideo, final release
Having extensively evaluated the performance of the technologies included in the first release of the WP1 multimedia analysis tools, using content from the LinkedTV scenarios and through participation in international benchmarking activities, concrete decisions were made regarding the appropriateness and importance of each individual method or combination of methods. Combined with an updated list of information needs for each scenario, this led to a new set of analysis requirements to be addressed by the final release of the WP1 analysis techniques. To this end, coordinated efforts in three directions, namely (a) improving a number of methods in terms of accuracy and time efficiency, (b) developing new technologies, and (c) defining synergies between methods for obtaining new types of information via multimodal processing, resulted in the final set of multimedia analysis methods for video hyperlinking. Moreover, the different analysis modules have been integrated into a web-based infrastructure, allowing fully automatic linking between the WP1 technologies and the overall LinkedTV platform.
Deliverable D1.2 Visual, text and audio information analysis for hypervideo, first release
Enriching videos by offering continuative and related information via, e.g., audio streams, web pages, and other videos is typically hampered by the massive editorial work it demands. While several automatic and semi-automatic methods exist for analyzing audio/video content, one needs to decide which method offers appropriate information for our intended use-case scenarios. We review the technology options for video analysis that we have access to and describe which training material we opted for to feed our algorithms. For all methods, we offer extensive qualitative and quantitative results and give an outlook on the next steps within the project.
Speaker characterization using adult and children’s speech
Speech signals contain important information about a speaker, such as age, gender, language, accent, and emotional/psychological state. Automatic recognition of these types of characteristics has a wide range of commercial, medical and forensic applications such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. Many such applications depend on reliable systems using short speech segments without regard to the spoken text (text-independent). All these applications are also applicable using children’s speech.
This research aims to develop accurate methods and tools for identifying different characteristics of speakers. Our experiments cover speaker recognition, gender recognition, age-group classification, and accent identification; similar approaches and techniques can be applied to identify other characteristics, such as emotional/psychological state. The main focus of this research is on detecting these characteristics from children’s speech, which has previously been reported to be more challenging than adults’ speech. Furthermore, the impact of different frequency bands on the performance of several recognition systems is studied, and the performance obtained using children’s speech is compared with the corresponding results from experiments using adults’ speech.
Speaker characterization is performed by fitting a probability density function to acoustic features extracted from the speech signals. Since the distribution of acoustic features is complex, Gaussian mixture models (GMM) are applied. Due to the lack of data, parametric model adaptation methods are applied to adapt the universal background model (UBM) to the characteristics of individual utterances. An effective approach adapts the UBM to the speech signal using the Maximum-A-Posteriori (MAP) scheme; the Gaussian means of the adapted GMM are then concatenated to form a Gaussian mean super-vector for the utterance. Finally, a classification or regression algorithm is used to identify the speaker characteristics. While effective, Gaussian mean super-vectors are of high dimensionality, resulting in high computational cost and difficulty in obtaining a robust model with limited data. In the field of speaker recognition, recent advances using the i-vector framework have increased classification accuracy. This framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, applies a simple factor analysis to the GMM means.
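The MAP mean adaptation and super-vector construction described above can be sketched in a few lines. This is a toy-sized illustration under assumed settings (diagonal unit covariances, relevance factor r = 16, random "frames"), not the thesis implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
C, D = 4, 3                      # mixture components, feature dimension (toy sizes)
weights = np.full(C, 1.0 / C)    # UBM component weights
means = rng.normal(size=(C, D))  # UBM component means
var = np.ones((C, D))            # diagonal covariances (assumed identity)
frames = rng.normal(size=(200, D))  # synthetic utterance features

def map_adapt_means(frames, weights, means, var, r=16.0):
    """MAP-adapt the UBM means toward an utterance's frames."""
    # Log-density of each frame under each diagonal Gaussian component.
    diff = frames[:, None, :] - means[None, :, :]
    logp = -0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
    logp += np.log(weights)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)   # per-frame responsibilities
    n_c = post.sum(axis=0)                    # zeroth-order statistics
    ex_c = post.T @ frames / n_c[:, None]     # first-order statistics
    alpha = (n_c / (n_c + r))[:, None]        # data-dependent adaptation weight
    return alpha * ex_c + (1 - alpha) * means

# Concatenating the adapted means gives the Gaussian mean super-vector.
supervector = map_adapt_means(frames, weights, means, var).ravel()
# C * D = 12 entries here; real systems use e.g. 1024 * 60.
```

The i-vector framework mentioned in the abstract then models exactly this super-vector as a low-rank offset from the UBM super-vector.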
Automatic speaker recognition with large-margin GMMs
Most state-of-the-art speaker recognition systems are based on Gaussian Mixture Models (GMM), trained using maximum likelihood estimation and maximum a posteriori (MAP) estimation. Generative training of the GMM does not, however, directly optimize classification performance. For this reason, discriminative models, e.g., Support Vector Machines (SVM), have been an interesting alternative, since they address the classification problem directly and lead to good performance. Recently a new discriminative approach for multiway classification has been proposed: Large Margin Gaussian mixture models (LM-GMM). As in SVM, the parameters of LM-GMM are trained by solving a convex optimization problem. They differ from SVM, however, in using ellipsoids to model the classes directly in the input space, instead of half-spaces in an extended high-dimensional space. While LM-GMM have been used in speech recognition, they have not, to the best of our knowledge, been used in speaker recognition. In this thesis, we propose simplified, fast and more efficient versions of LM-GMM which exploit the properties and characteristics of speaker recognition applications and systems: the LM-dGMM (Large Margin diagonal GMM) models. In our LM-dGMM modeling, each class is initially modeled by a GMM trained by MAP adaptation of a Universal Background Model (UBM), or directly initialized by the UBM; the models' mean vectors are then re-estimated under large-margin constraints. We carried out experiments on full speaker recognition tasks under the NIST-SRE 2006 core condition. The experimental results are very satisfactory and show that our large-margin modeling approach is very promising.
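The large-margin criterion can be illustrated with a minimal hinge-loss sketch: for an observation whose true class is y, the distance to the correct class's model should beat every competitor's distance by at least a unit margin, and violations are penalized. This is a generic illustration of the margin constraint, not the thesis's LM-dGMM training algorithm; the distances below are made up.

```python
import numpy as np

def margin_loss(dists, y, margin=1.0):
    """Hinge loss over competing classes.

    dists: (n_classes,) squared Mahalanobis-style distances from one
    observation to each class model; y: index of the true class.
    A competitor c is violated when dists[c] < dists[y] + margin.
    """
    losses = np.maximum(0.0, margin + dists[y] - dists)
    losses[y] = 0.0                     # no penalty against the true class itself
    return float(losses.sum())

dists = np.array([2.0, 2.5, 6.0])      # class 0 is the true class
loss = margin_loss(dists, y=0)
# competitor 1: max(0, 1 + 2.0 - 2.5) = 0.5; competitor 2: 0, so loss = 0.5
```

Training would re-estimate the class (Gaussian) means to drive this summed loss toward zero, which is what "re-estimated under large-margin constraints" refers to above.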