
    Formant trajectories in forensic speaker recognition

    The present work investigates the performance of an approach to forensic speaker recognition based on parametric representations of formant trajectories. Quadratic and cubic polynomial functions are fitted to the formant contours of diphthongs. The resulting coefficients, as well as the first three or four components of the discrete cosine transform (DCT), are used to capture the dynamic properties of the underlying speech acoustics, and thus of the speaker characteristics. This yields a representation consisting of only a small number of decorrelated parameters, which are in turn used for forensic speaker recognition. The evaluation conducted in the study incorporates the calculation of likelihood ratios for use in the Bayesian approach to evidence evaluation; the advantages of this framework and its current limitations are discussed. The likelihood ratios are calculated with a multivariate kernel density formula developed by Aitken & Lucy (2004), which takes both between-speaker and within-speaker variability into account. Automatic calibration and fusion techniques, as used in automatic speaker identification systems, are applied to the resulting scores. To further investigate the importance of durational aspects of diphthongs for forensic speaker recognition, an experiment is undertaken that evaluates the effect of time-normalisation as well as of modelling segment durations with an explicit parameter. The performance of the parametric representations compared with other methods, as well as the effects of calibration and fusion, is evaluated using standard evaluation tools such as detection error trade-off (DET) plots, applied probability of error (APE) plots and Tippett plots, together with numerical indices such as the equal error rate (EER) and the Cllr metric.
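
    The parameterization described above reduces each sampled formant contour to a handful of coefficients. The following Python sketch (not the thesis code; the synthetic contour, function names and degree/component choices merely mirror the description above) shows both variants:

```python
# A minimal sketch of the two parameterizations described above: fitting a
# cubic polynomial to a formant trajectory and taking the first few DCT
# components. The example trajectory is synthetic.
import numpy as np
from scipy.fft import dct

def polynomial_params(trajectory, degree=3):
    """Fit a polynomial to a formant contour sampled at equidistant points.

    Time is normalised to [0, 1], so the coefficients are comparable across
    tokens of different durations (the time-normalisation discussed above).
    """
    t = np.linspace(0.0, 1.0, num=len(trajectory))
    return np.polynomial.polynomial.polyfit(t, trajectory, deg=degree)

def dct_params(trajectory, n_components=4):
    """Return the first few DCT-II coefficients of a formant contour."""
    return dct(np.asarray(trajectory, dtype=float), norm="ortho")[:n_components]

# Synthetic F2 contour of a diphthong (Hz), rising from ~1100 Hz to ~1900 Hz.
rng = np.random.default_rng(0)
f2 = 1100 + 800 * np.linspace(0, 1, 20) ** 1.5 + rng.normal(0, 15, 20)
print(polynomial_params(f2))  # four cubic coefficients
print(dct_params(f2))         # four decorrelated DCT components
```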

    A Likelihood-Ratio Based Forensic Voice Comparison in Standard Thai

    This research uses a likelihood ratio (LR) framework to assess the discriminatory power of a range of acoustic parameters extracted from speech samples produced by male speakers of Standard Thai. The thesis aims to answer two main questions: 1) how well the tested linguistic-phonetic segments of Standard Thai perform in forensic voice comparison (FVC); and 2) how such linguistic-phonetic segments are profitably combined through logistic regression using the FoCal Toolkit (Brümmer, 2007). The segments focused on in this study are the four consonants /s, tɕʰ, n, m/ and the two diphthongs [ɔi, ai]. First, using the alveolar fricative /s/, two different sets of features were compared in terms of their FVC performance: the first comprised the spectrum-based distributional features of four spectral moments, namely mean, variance, skew and kurtosis; the second consisted of the coefficients of the discrete cosine transform (DCT) applied to the spectrum. As the DCT coefficients were found to perform better, they were subsequently used to model the spectra of the remaining consonants. The consonant spectrum was extracted at the center point of each of /s, tɕʰ, n, m/ with a Hamming window of 31.25 ms. For the diphthongs [ɔi] - [nɔi L] and [ai] - [mai HL], cubic polynomials fitted to the F2 and F1-F3 formant trajectories were tested separately, as were quadratic polynomials fitted to the tonal F0 contours of the same tokens. Long-term F0 distribution (LTF0) was also trialed. The results show the promising discriminatory power of the Standard Thai acoustic features and segments tested in this thesis. The main findings are as follows:
    1. The fricative /s/ performed better with the DCT coefficients (Cllr = 0.70) than with the spectral moments (Cllr = 0.92).
    2. The nasals /n, m/ (Cllr = 0.47) performed better than the affricate /tɕʰ/ (Cllr = 0.54) and the fricative /s/ (Cllr = 0.70) when their DCT coefficients were parameterized.
    3. F1-F3 trajectories (Cllr = 0.42 and Cllr = 0.49) outperformed the F2 trajectory alone (Cllr = 0.69 and Cllr = 0.67) for both diphthongs [ɔi] and [ai].
    4. F1-F3 trajectories of the diphthong [ɔi] (Cllr = 0.42) outperformed those of [ai] (Cllr = 0.49).
    5. Tonal F0 (Cllr = 0.52) outperformed LTF0 (Cllr = 0.74).
    6. Overall, the best results were obtained when the DCT coefficients of /n/ - [na: HL] and /n/ - [nɔi L] were fused (Cllr = 0.40, with the largest consistent-with-fact SSLog10LR = 2.53).
    In light of these findings, we can conclude that Standard Thai is generally amenable to FVC, especially when linguistic-phonetic segments are combined; it is recommended that this procedure be followed when dealing with forensically realistic casework.
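
    The Cllr values quoted above come from the standard log-likelihood-ratio cost used throughout LR-based validation. A minimal sketch, assuming the scores are expressed as log10 LRs (matching the SSLog10LR notation above); the function name and example values are illustrative:

```python
# A minimal sketch of the Cllr metric reported above. Inputs are log10
# likelihood ratios for same-speaker and different-speaker comparison pairs.
import numpy as np

def cllr(same_speaker_log10lr, diff_speaker_log10lr):
    ss = 10.0 ** np.asarray(same_speaker_log10lr)  # LRs for same-speaker pairs
    ds = 10.0 ** np.asarray(diff_speaker_log10lr)  # LRs for different-speaker pairs
    # Same-speaker pairs are penalised for small LRs, different-speaker pairs
    # for large LRs; both penalties are averaged and combined.
    return 0.5 * (np.mean(np.log2(1.0 + 1.0 / ss)) + np.mean(np.log2(1.0 + ds)))

# A useful, well-calibrated system drives Cllr below 1.0, the cost of an
# uninformative system that always outputs LR = 1.
print(cllr([1.2, 0.8, 2.5], [-1.0, -0.4, -2.2]))
```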

    On the speaker discriminatory power asymmetry regarding acoustic-phonetic parameters and the impact of speaking style

    This study aimed to assess what we refer to as the speaker discriminatory power asymmetry, and its forensic implications, in comparisons performed across different speaking styles: spontaneous dialogues vs. interviews. We also addressed the impact of data sampling on speaker discriminatory performance for different acoustic-phonetic estimates. The participants were 20 male speakers of Brazilian Portuguese from the same dialectal area. The speech material consisted of spontaneous telephone conversations between familiar individuals, and of interviews conducted between each individual participant and the researcher. Nine acoustic-phonetic parameters were chosen for the comparisons, spanning temporal, melodic and spectral estimates. An analysis based on the combination of different parameters was also conducted. Two speaker discriminatory metrics were examined: the log-likelihood-ratio cost (Cllr) and equal error rate (EER) values. A general speaker discriminatory trend emerged when the parameters were assessed individually. Parameters in the temporal acoustic-phonetic class showed the weakest speaker-contrasting power, as evidenced by relatively higher Cllr and EER values. Moreover, among the acoustic parameters assessed, spectral parameters, mainly the higher formant frequencies F3 and F4, performed best in terms of speaker discrimination, showing the lowest EER and Cllr scores. The results suggest a speaker discriminatory power asymmetry across parameters from different acoustic-phonetic classes, with temporal parameters tending to present lower discriminatory power. The speaking-style mismatch also seemed to impact the speaker comparison task considerably, undermining overall discriminatory performance. A statistical model based on the combination of different acoustic-phonetic estimates was found to perform best in this case. Finally, data sampling proved to be of crucial relevance for the reliability of discriminatory power assessment.
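
    An EER value such as those reported above can be estimated from same-speaker and different-speaker score sets by sweeping a decision threshold. A minimal sketch (illustrative function and synthetic scores, not the study's code):

```python
import numpy as np

def equal_error_rate(same_scores, diff_scores):
    """Estimate the EER by sweeping a threshold over the pooled scores.

    The miss rate (same-speaker pairs rejected) rises with the threshold while
    the false-alarm rate (different-speaker pairs accepted) falls; the EER is
    the error rate at their crossing point.
    """
    same = np.asarray(same_scores, dtype=float)
    diff = np.asarray(diff_scores, dtype=float)
    thresholds = np.sort(np.concatenate([same, diff]))
    miss = np.array([(same < t).mean() for t in thresholds])
    false_alarm = np.array([(diff >= t).mean() for t in thresholds])
    i = np.argmin(np.abs(miss - false_alarm))  # crossing of the two error curves
    return (miss[i] + false_alarm[i]) / 2.0

# Synthetic example: well-separated score distributions give a low EER.
rng = np.random.default_rng(0)
print(equal_error_rate(rng.normal(2, 1, 500), rng.normal(-2, 1, 500)))
```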

    Reconocimiento automático de locutor e idioma mediante caracterización acústica de unidades lingüísticas [Automatic speaker and language recognition through acoustic characterization of linguistic units]

    Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defence: 30-06-201

    The definition of the relevant population and the collection of data for likelihood ratio-based forensic voice comparison

    Within the field of forensic speech science there is increasing acceptance of the likelihood ratio (LR) as the logically and legally correct framework for evaluating forensic voice comparison (FVC) evidence. However, only a small proportion of experts currently use the numerical LR in casework. This is due primarily to the difficulties involved in accounting for the inherent, and arguably unique, complexity of speech in a fully data-driven, numerical LR analysis. This thesis addresses two such issues: the definition of the relevant population and the amount of data required for system testing. Firstly, experiments are presented which explore the extent to which LRs are affected by different definitions of the relevant population with regard to sources of systematic sociolinguistic between-speaker variation (regional background, socio-economic class and age), using both linguistic-phonetic and ASR variables. Results show that different definitions of the relevant population can have a substantial effect on the magnitude of LRs, depending on the input variable. However, system validity results suggest that narrow controls over sociolinguistic sources of variation should be preferred to general controls. Secondly, experiments are presented which evaluate the effects of development, test and reference sample size on LRs. Consistent with general principles in statistics, more precise results are found using more data across all experiments. There is also considerable evidence of a relationship between sample-size sensitivity and the dimensionality and speaker discriminatory power of the input variable. Further, there are potential trade-offs in the size of each set depending on which element of LR output the analyst is interested in. The results in this thesis will contribute towards improving the extent to which LR methods account for the linguistic-phonetic complexity of speech evidence. In accounting for this complexity, this work will also increase the practical viability of applying the numerical LR to FVC casework.
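
    The sample-size experiments described above can be pictured with a simple resampling loop: hold the system fixed, subsample the reference population at several sizes, and observe how the spread of a validity metric shrinks as the sample grows. A hedged sketch only; run_system is a stand-in for a real LR system, not the thesis code:

```python
# Illustrative resampling loop for probing sample-size sensitivity. The
# stand-in scoring function fakes the statistical behaviour (precision
# improving roughly with sqrt(n)); a real experiment would compute Cllr
# from calibrated LRs against each subsampled reference population.
import numpy as np

rng = np.random.default_rng(42)

def run_system(reference_speakers):
    """Placeholder for evaluating an LR system against a reference sample."""
    return 0.5 + rng.normal(0, 1.0 / np.sqrt(len(reference_speakers)))

all_speakers = np.arange(200)
for n in (20, 50, 100):
    cllrs = [run_system(rng.choice(all_speakers, size=n, replace=False))
             for _ in range(100)]
    print(n, round(np.std(cllrs), 3))  # spread shrinks as the sample grows
```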

    An application of an auditory periphery model in speaker identification

    The number of applications of automatic speaker identification (SID) is growing due to advanced technologies for secure access and authentication in services and devices. In a 2016 study, the Cascade of Asymmetric Resonators with Fast-Acting Compression (CAR-FAC) cochlear model achieved the best performance among seven recent cochlear models at fitting a set of human auditory physiological data. Motivated by this result, I apply the CAR-FAC cochlear model to an SID task for the first time, aiming to approach the performance of the human auditory system. This thesis investigates the potential of the CAR-FAC model in text-dependent and text-independent SID tasks, and examines the contributions of the different parameters, nonlinearities, and stages of the CAR-FAC that enhance SID accuracy. The performance of the CAR-FAC is compared with that of another recent cochlear model, the Auditory Nerve (AN) model. In addition, three FFT-based auditory features, Mel-frequency cepstral coefficients (MFCC), frequency-domain linear prediction (FDLP), and Gammatone frequency cepstral coefficients (GFCC), are included to compare their performance with the cochlear features. This comparison makes it possible to identify a better front-end for a noise-robust SID system. Three different statistical classifiers, a Gaussian mixture model with universal background model (GMM-UBM), a support vector machine (SVM), and an i-vector system, were used to evaluate performance; these classifiers allow the nonlinearities in the cochlear front-ends to be investigated. Performance is evaluated under clean and noisy conditions for a wide range of noise levels. Techniques to improve the performance of a cochlear algorithm are also investigated; it was found that applying a cube root and a DCT to the cochlear output enhances SID accuracy substantially.
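
    A simplified sketch of the GMM-UBM pipeline named above, here with MFCC features as the front-end; real systems MAP-adapt the speaker model from the UBM rather than training it independently as done here, and all file names are placeholders:

```python
# Simplified GMM-UBM speaker verification sketch with MFCC features.
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(path, n_mfcc=13):
    """Load audio and return a frames-by-coefficients MFCC matrix."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

# Universal background model trained on pooled background speech.
ubm = GaussianMixture(n_components=64, covariance_type="diag", max_iter=200)
ubm.fit(np.vstack([mfcc_features(f) for f in ["bg1.wav", "bg2.wav"]]))

# Speaker model (trained from scratch here; MAP adaptation in real systems).
speaker = GaussianMixture(n_components=64, covariance_type="diag")
speaker.fit(mfcc_features("enrol_speaker.wav"))

# Score a test utterance as the average per-frame log-likelihood ratio
# between the speaker model and the background model.
test = mfcc_features("test_utterance.wav")
score = speaker.score(test) - ubm.score(test)
print(score)  # higher -> more support for the enrolled speaker
```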

    Development and Properties of Kernel-based Methods for the Interpretation and Presentation of Forensic Evidence

    The inference of the source of forensic evidence is related to model selection. Many forms of evidence can only be represented by complex, high-dimensional random vectors and cannot be assigned a likelihood structure. A common approach to circumvent this is to measure the similarity between pairs of objects composing the evidence. Such methods are ad hoc and unstable approaches to the judicial inference process: while they address the dimensionality issue, they also engender dependencies between scores (whenever two scores have an object in common) that are not taken into account in these models. The model developed in this research captures the dependencies between pairwise scores from a hierarchical sample and models them in the kernel space using a linear model. Our model is flexible enough to accommodate any kernel satisfying basic conditions and is therefore applicable to any type of complex, high-dimensional data. An important result of this work is the asymptotic multivariate normality of the scores as the data dimension increases. As a result, we can: 1) model very high-dimensional data when other methods fail; and 2) determine the source of multiple samples from a single trace in one calculation. Our model can be used to address high-dimensional model selection problems in different situations, and we show how to use it to assign Bayes factors to forensic evidence. We provide examples of real-life problems using data from very small particles and dust analyzed by SEM/EDX, and from colors of fibers quantified by microspectrophotometry.
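
    The dependency problem described above is easy to illustrate: any kernel turns pairs of high-dimensional objects into scalar scores, and two scores that share an object are not independent. A minimal sketch with an RBF kernel (illustrative only, not the proposed model):

```python
# Pairwise kernel scores from high-dimensional objects, illustrating the
# shared-object dependency that score-based methods typically ignore.
import numpy as np

def rbf_kernel(x, y, gamma=0.1):
    """Radial basis function similarity between two feature vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
objects = rng.normal(size=(4, 50))  # four 50-dimensional trace objects
n = len(objects)
scores = {(i, j): rbf_kernel(objects[i], objects[j])
          for i in range(n) for j in range(i + 1, n)}
# scores[(0, 1)] and scores[(0, 2)] both depend on objects[0]; treating all
# pairwise scores as independent ignores exactly this dependency.
print(scores)
```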

    Elemental Characterization of Printing Inks and Strengthening the Evaluation of Forensic Glass Evidence

    Improvements in printing technology have exacerbated the problem of document counterfeiting, prompting the need for analytical techniques that better characterize inks for forensic analysis. In this study, 319 printing inks (toner, inkjet, offset, and intaglio) were analyzed directly on the paper substrate using Scanning Electron Microscopy-Energy Dispersive Spectroscopy (SEM-EDS) and Laser Ablation-Inductively Coupled Plasma-Mass Spectrometry (LA-ICP-MS). As anticipated, the high sensitivity of LA-ICP-MS resulted in excellent discrimination (> 99%) between ink samples originating from different sources. Moreover, LA-ICP-MS provided ≥ 90% correct association for ink samples originating from the same source. SEM-EDS resulted in good discrimination for toner and intaglio inks (> 97%) and excellent correct association (100%) for all four ink types; however, the technique showed limited utility for the discrimination of inkjet and offset inks. A searchable ink database, the Forensic Ink Analysis and Comparison System (FIACS), was developed to provide a tool that allows the analyst to compare a questioned ink sample to a reference population. The FIACS database provided a correct classification rate of 94-100% for LA-ICP-MS and 67-100% for SEM-EDS. An important consideration in forensic chemistry is the interpretation of the evidence. Typically, a match criterion is used to compare the known and questioned samples; however, this approach suffers from several disadvantages, which can be overcome with an alternative approach: the likelihood ratio (LR). Two LA-ICP-MS glass databases were used to evaluate the performance of the LR: a vehicle windshield database (420 samples) and a casework database (385 samples). Compared to the match criterion, the likelihood ratio led to improved false exclusion rates (< 1.5%) and similar false inclusion rates (< 1.0%). In addition, the LR limited the magnitude of misleading evidence, providing only weak support for the incorrect proposition. The likelihood ratio was also tested through an inter-laboratory study involving 10 LA-ICP-MS participants. Good correct association rates (94-100%) were obtained for same-source samples in all three inter-laboratory exercises, and the LR showed strong support for an association. Finally, all different-source samples were correctly excluded with the LR, resulting in no false inclusions.
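
    The contrast between a match criterion and a likelihood ratio can be pictured with a toy univariate example: the match criterion returns a hard yes/no, while the LR grades the strength of evidence by weighing similarity against typicality. A simplified sketch under assumed normal distributions (not the model used in the study; all parameter values are invented for illustration):

```python
# Toy one-variable comparison of a fixed match criterion vs. a likelihood
# ratio, using assumed normal within-source and between-source distributions.
import numpy as np
from scipy.stats import norm

within_sd = 0.05            # assumed within-source measurement spread
between_mean, between_sd = 1.0, 0.4  # assumed population (typicality) model

def match_criterion(known, questioned, k=4):
    """Hard yes/no: do the measurements agree within k within-source SDs?"""
    return abs(known - questioned) <= k * within_sd

def likelihood_ratio(known, questioned):
    """Graded strength of evidence: similarity weighed against typicality."""
    similarity = norm.pdf(questioned, loc=known, scale=np.sqrt(2) * within_sd)
    typicality = norm.pdf(questioned, loc=between_mean, scale=between_sd)
    return similarity / typicality

print(match_criterion(1.02, 1.10), likelihood_ratio(1.02, 1.10))
```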

    Voice Modeling Methods for Automatic Speaker Recognition

    Building a voice model means capturing the characteristics of a speaker's voice in a data structure. This data structure is then used by a computer for further processing, such as comparison with other voices. Voice modeling is a vital step in the process of automatic speaker recognition, which is itself the foundation of several applied technologies: (a) biometric authentication, (b) speech recognition and (c) multimedia indexing.
    Several challenges arise in the context of automatic speaker recognition. First, there is the problem of data shortage, i.e., the unavailability of sufficiently long utterances for speaker recognition. It stems from the fact that the speech signal conveys different aspects of the sound in a single, one-dimensional time series: linguistic (what is said?), prosodic (how is it said?), individual (who said it?), locational (where is the speaker?) and emotional features of the speech sound itself (to name a few) are contained in the speech signal, along with acoustic background information. To analyze a specific aspect of the sound regardless of the other aspects, analysis methods have to be applied to the specific time scale (length) of the signal at which this aspect stands out from the rest. For example, linguistic information (i.e., which phone or syllable has been uttered?) is found in very short time spans of only milliseconds in length. By contrast, speaker-specific information emerges more clearly the longer the analyzed sound is; long utterances, however, are not always available for analysis. Second, the speech signal is easily corrupted by background sound sources (noise, such as music or sound effects). Their characteristics, if present, tend to dominate a voice model, such that model comparison might then be driven mainly by background features instead of speaker characteristics.
    Current automatic speaker recognition works well under relatively constrained circumstances, such as studio recordings, or when prior knowledge of the number and identity of the occurring speakers is available. Under more adverse conditions, such as in feature films or amateur material on the web, the achieved speaker recognition scores drop below a rate that is acceptable for an end user or for further processing. For example, the typical speaker turn duration of only one second and the sound-effect background in cinematic movies render most current automatic analysis techniques useless.
    In this thesis, methods for voice modeling that are robust with respect to short utterances and background noise are presented. The aim is to facilitate movie analysis with respect to the occurring speakers. Therefore, algorithmic improvements are suggested that (a) improve the modeling of very short utterances, (b) facilitate voice model building even in the case of severe background noise and (c) allow for efficient voice model comparison to support the indexing of large multimedia archives. The proposed methods improve the state of the art in terms of recognition rate and computational efficiency.
    Going beyond selective algorithmic improvements, subsequent chapters also investigate the question of what is lacking in principle in current voice modeling methods. A study with human participants shows that the exclusion of time-coherence information from a voice model induces an artificial upper bound on the recognition accuracy of automatic analysis methods. A proof-of-concept implementation confirms the usefulness of exploiting this kind of information by halving the error rate. This result questions the general speaker modeling paradigm of the last two decades and points to a promising new direction.
    The approach taken to arrive at these results is based on a novel methodology of algorithm design and development called "eidetic design". It uses a human-in-the-loop technique that analyses existing algorithms in terms of their abstract intermediate results, with the aim of detecting flaws or failures in them intuitively and suggesting solutions. The intermediate results often consist of large matrices of numbers whose meaning is not clear to a human observer. The core of the approach is therefore to transform them into a suitable domain of perception (such as the auditory domain of speech sounds in the case of speech feature vectors) where their content, meaning and flaws are intuitively clear to the human designer. This methodology is formalized, and the corresponding workflow is explicated through several use cases.
    Finally, the use of the proposed methods in video analysis and retrieval is presented, demonstrating the applicability of the developed methods and the accompanying software library sclib by means of improved results from a multimodal analysis approach. The sclib's source code is available to the public upon request to the author. A summary of the contributions, together with an outlook on short- and long-term future work, concludes this thesis.

    The effect of sampling variability on overall performance and individual speakers’ behaviour in likelihood ratio-based forensic voice comparison

    In recent years there has been increasing awareness and acceptance among forensic speech scientists of using Bayesian reasoning and the likelihood ratio (LR) framework for forensic voice comparison (FVC) and for expressing expert conclusions. Numerous studies have explored overall performance using numerical LRs; however, given that the data used for validation is a sample drawn from an unknown distribution, little attention has been paid to the effect of sampling variability or to individual speakers' behaviour. This thesis investigates these issues using linguistic-phonetic variables. First, it investigates how different configurations of training, test and reference speakers affect overall performance. The results show that variability in overall performance is mostly caused by varying the test speakers, while less variability is caused by sampling variability in the reference and training speakers. Second, this thesis explores the effect of sampling variability on overall performance and individual speakers' behaviour in relation to the use of linguistic-phonetic features. Results show that sampling variability affects overall performance to different extents for different features, and that combining more features does not always improve overall performance. Sampling variability has limited effects on individuals in same-speaker comparisons, and most speakers are less affected by sampling variability in different-speaker comparisons when four or more features are used. Third, this thesis explores the effect of sampling variability on overall performance in relation to score distributions. Results reveal that system validity and reliability are more affected by different-speaker score skewness and less affected by same-speaker score skewness. Using different calibration methods reduces the effect of sampling variability to different extents. The results in this thesis have implications both for FVC using numerical LRs and for FVC in general, as experts need to make pragmatic decisions about whether a numerical LR is used or not, and every decision made has implications for the final evaluation results. Further, the results on score skewness and different calibration methods can potentially contribute to improving FVC performance using automatic systems.
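
    One common calibration method of the kind compared above is logistic-regression calibration, in which raw scores are mapped affinely to log LRs, in the same spirit as the FoCal toolkit mentioned earlier. A minimal sketch; the synthetic scores and the weak-regularization setting are illustrative, not any thesis's actual code:

```python
# Logistic-regression calibration: map raw comparison scores to log10 LRs.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_calibration(ss_scores, ds_scores):
    """Fit an affine score-to-log-LR mapping on labelled training scores."""
    x = np.concatenate([ss_scores, ds_scores]).reshape(-1, 1)
    y = np.concatenate([np.ones(len(ss_scores)), np.zeros(len(ds_scores))])
    model = LogisticRegression(C=1e6).fit(x, y)  # weak regularization, near-MLE
    # The fitted log-odds approximate log posterior odds under the empirical
    # training prior; subtracting the prior log-odds leaves the log-LR.
    prior_log_odds = np.log(len(ss_scores) / len(ds_scores))
    return model, prior_log_odds

def calibrated_log10_lr(model, prior_log_odds, scores):
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    return (model.decision_function(scores) - prior_log_odds) / np.log(10.0)

# Synthetic uncalibrated scores: same-speaker pairs score higher on average.
rng = np.random.default_rng(0)
model, prior = fit_calibration(rng.normal(2, 1, 200), rng.normal(-1, 1, 800))
print(calibrated_log10_lr(model, prior, [3.0, 0.5, -2.0]))
```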