    The Effect Of Acoustic Variability On Automatic Speaker Recognition Systems

    This thesis examines the influence of acoustic variability on automatic speaker recognition systems (ASRs), with three aims: (i) to measure ASR performance under five commonly encountered acoustic conditions; (ii) to contribute to ASR system development by providing new research data; (iii) to assess ASR suitability for forensic speaker comparison (FSC) and for investigative/pre-forensic use. The thesis begins with a literature review and an explanation of relevant technical terms. Five categories of research experiments then examine ASR performance, reflecting conditions that influence speech quantity (inhibitors) and speech quality (contaminants), acknowledging that quality often influences quantity. The experiments pertain to: net speech duration, signal-to-noise ratio (SNR), reverberation, frequency bandwidth and transcoding (codecs). The ASR system is placed under scrutiny with an examination of settings and optimum conditions (e.g. matched/unmatched test audio and speaker models). Output is examined in relation to baseline performance, and metrics help inform whether ASRs should be applied to suboptimal audio recordings. Results indicate that modern ASRs are relatively resilient to low and moderate levels of the acoustic contaminants and inhibitors examined, whilst remaining sensitive to higher levels. The thesis discusses issues such as the complexity and fragility of the speech signal path, speaker variability, the difficulty of measuring conditions, and mitigation (thresholds and settings). The application of ASRs to casework is discussed with recommendations, acknowledging the different modes of operation (e.g. investigative usage) and current UK limitations on presenting ASR output as evidence in criminal trials.
In summary, and in the context of acoustic variability, the thesis recommends that ASRs could be applied to pre-forensic cases, accepting that extraneous issues endure which require governance, such as validation of method (ASR standardisation) and population data selection. However, ASRs remain unsuitable for broad forensic application, with many acoustic conditions causing irrecoverable speech data loss that contributes to high error rates.
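Signal-to-noise ratio, one of the contaminants examined above, can be illustrated with a minimal sketch; the signals and values here are hypothetical, assuming time-aligned clean and noisy versions of the same recording are available:

```python
import numpy as np

def snr_db(clean, noisy):
    """Estimate the signal-to-noise ratio in dB from time-aligned signals."""
    noise = noisy - clean           # residual noise component
    p_signal = np.mean(clean ** 2)  # mean signal power
    p_noise = np.mean(noise ** 2)   # mean noise power
    return 10.0 * np.log10(p_signal / p_noise)

# Hypothetical example: a 1 kHz tone at 16 kHz with additive white noise
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 1000.0 * t)
noisy = clean + 0.1 * rng.standard_normal(t.size)
print(f"{snr_db(clean, noisy):.1f} dB")  # roughly 17 dB for this noise level
```

In practice the clean reference is unavailable and SNR must be estimated from the noisy signal alone, which is one reason measuring such conditions is difficult.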

    The individual and the system : Assessing the stability of the output of a semi-automatic forensic voice comparison system

    Semi-automatic systems based on traditional linguistic-phonetic features are increasingly being used for forensic voice comparison (FVC) casework. In this paper, we examine the stability of the output of a semi-automatic system, based on the long-term formant distributions (LTFDs) of F1, F2, and F3, as the channel quality of the input recordings decreases. Cross-validated, calibrated GMM-UBM log likelihood-ratios (LLRs) were computed for 97 Standard Southern British English speakers under four conditions. In each condition the same speech material was used, but the technical properties of the recordings changed (high quality studio recording, landline telephone recording, high bit-rate GSM mobile telephone recording and low bit-rate GSM mobile telephone recording). Equal error rate (EER) and the log LR cost function (Cllr) were compared across conditions. System validity was found to decrease with poorer technical quality, with the largest differences in EER (21.66%) and Cllr (0.46) found between the studio and the low bit-rate GSM conditions. However, importantly, performance for individual speakers was affected differently by channel quality. Speakers who produced stronger evidence overall were found to be more variable. Mean F3 was also found to be a predictor of LLR variability, although no effects were found for speakers' voice quality profiles.
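The Cllr metric compared above can be computed directly from calibrated LLRs; a minimal sketch (the trial scores are hypothetical), assuming base-10 log likelihood-ratios as in the paper:

```python
import numpy as np

def cllr(llr_ss, llr_ds):
    """Log-LR cost (Cllr) from base-10 log likelihood-ratios.

    llr_ss: LLRs for same-speaker trials; llr_ds: LLRs for different-speaker trials.
    """
    lr_ss = 10.0 ** np.asarray(llr_ss, dtype=float)
    lr_ds = 10.0 ** np.asarray(llr_ds, dtype=float)
    cost_ss = np.mean(np.log2(1.0 + 1.0 / lr_ss))  # penalises low same-speaker LRs
    cost_ds = np.mean(np.log2(1.0 + lr_ds))        # penalises high different-speaker LRs
    return 0.5 * (cost_ss + cost_ds)

# An uninformative system (LLR = 0 for every trial) scores exactly 1.0;
# well-calibrated, informative systems drive Cllr toward 0
print(cllr([0.0], [0.0]))  # -> 1.0
```

Unlike EER, Cllr penalises miscalibration as well as discrimination errors, which is why both metrics are reported.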

    Strength of forensic voice comparison evidence from the acoustics of filled pauses

    This study investigates the evidential value of filled pauses (FPs, i.e. um, uh) as variables in forensic voice comparison. FPs for 60 young male speakers of standard southern British English were analysed. The following acoustic properties were analysed: midpoint frequencies of the first three formants in the vocalic portion; ‘dynamic’ characterisations of formant trajectories (i.e. quadratic polynomial equations fitted to nine measurement points over the entire vowel); vowel duration; and nasal duration for um. Likelihood ratio (LR) scores were computed using the Multivariate Kernel Density formula (MVKD; Aitken and Lucy, 2004) and converted to calibrated log10 LRs (LLRs) using logistic regression (Brümmer et al., 2007). System validity was assessed using both equal error rate (EER) and the log LR cost function (Cllr; Brümmer and du Preez, 2006). The system with the best performance combines dynamic measurements of all three formants with vowel and nasal duration for um, achieving an EER of 4.08% and a Cllr of 0.12. In terms of general patterns, um consistently outperformed uh. For um, the formant dynamic systems generated better validity than those based on midpoints, presumably reflecting the additional degree of formant movement in um caused by the transition from vowel to nasal. By contrast, midpoints outperformed dynamics for the more monophthongal uh. Further, the addition of duration (vowel, or vowel and nasal) consistently improved system performance. The study supports the view that FPs have excellent potential as variables in forensic voice comparison cases.
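The ‘dynamic’ characterisation described above reduces a formant trajectory to a few polynomial coefficients; a sketch with a hypothetical F2 trajectory for an um token:

```python
import numpy as np

# Nine equidistant measurement points over the vowel (normalised time 0..1)
t = np.linspace(0.0, 1.0, 9)
# Hypothetical F2 trajectory (Hz): rises through the vowel, then falls into the nasal
f2 = np.array([1450, 1480, 1520, 1545, 1560, 1550, 1510, 1450, 1380], dtype=float)

# Quadratic polynomial fit: three coefficients summarise the trajectory shape
coeffs = np.polyfit(t, f2, deg=2)  # highest-order coefficient first
print(coeffs.shape)                # (3,) — these become the system's variables
```

The three coefficients, rather than the raw nine measurements, then feed the likelihood-ratio computation, which keeps the feature vector compact.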

    Formant dynamics and durations of um improve the performance of automatic speaker recognition systems

    We assess the potential improvement in the performance of MFCC-based automatic speaker recognition (ASR) systems with the inclusion of linguistic-phonetic information. Likelihood ratios were computed using MFCCs and the formant trajectories and durations of the hesitation marker um, extracted from recordings of male standard southern British English speakers. Testing was run over 20 replications using randomised sets of speakers. System validity (EER and Cllr) was found to improve with the inclusion of um relative to the baseline ASR across all 20 replications. These results offer support for the growing integration of automatic and linguistic-phonetic methods in forensic voice comparison.
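One common way to combine automatic and linguistic-phonetic scores, in the spirit of the integration described above, is logistic-regression fusion over development scores; a sketch with synthetic scores (all names and values hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic development scores: columns = (baseline ASR score, um-system score)
rng = np.random.default_rng(1)
same = np.column_stack([rng.normal(2.0, 1.0, 200), rng.normal(1.5, 1.0, 200)])
diff = np.column_stack([rng.normal(-2.0, 1.0, 200), rng.normal(-1.0, 1.0, 200)])
X = np.vstack([same, diff])
y = np.concatenate([np.ones(200), np.zeros(200)])  # 1 = same-speaker trial

# The fitted log-odds act as fused, calibrated scores (up to the prior offset)
fuser = LogisticRegression().fit(X, y)
fused_llr = X @ fuser.coef_[0] + fuser.intercept_[0]
print(fused_llr.shape)  # (400,)
```

The fusion weights learned on development data are then applied unchanged to test trials, so any gain over the baseline reflects complementary information in the um features.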

    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics on the conversion of normal speech to high vocal effort. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted speech vs. normal speech). The mismatch causes a degradation of the system's speaker identification performance. As a solution, we proposed an SSC system that included a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system, with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder. Next, a mapping technique, robust to data scarcity, maps the features. Finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoder cases, our proposed SSC system was able to convert normal speech to Lombard speech.
The naturalness test showed that the converted samples using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
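The statistical mapping step can be illustrated with the simplest possible stand-in, a least-squares affine map trained on parallel normal/high-effort features; everything here is synthetic and only sketches the idea, not the thesis's actual technique:

```python
import numpy as np

# Synthetic parallel data: normal-speech mel energies X and high-effort targets Y
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 20))                         # 500 frames, 20 mel bands
W_true = np.eye(20) + 0.1 * rng.standard_normal((20, 20))  # unknown "true" mapping
Y = X @ W_true + 0.05 * rng.standard_normal((500, 20))

# Fit an affine map (weights + bias) by least squares, then apply it
X1 = np.hstack([X, np.ones((500, 1))])                     # append bias column
W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
Y_hat = X1 @ W
print(Y_hat.shape)  # (500, 20) — mapped features, ready for synthesis
```

A practical system replaces this linear map with a mapping that is robust to data scarcity, as the abstract notes, but the train-on-parallel-frames structure is the same.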

    Using group delay functions from all-pole models for speaker recognition

    This paper was presented at the 14th Annual Conference of the International Speech Communication Association (Interspeech 2013), held in Lyon, France, on 25-29 August 2013. Popular features for speech processing, such as mel-frequency cepstral coefficients (MFCCs), are derived from the short-term magnitude spectrum, whereas the phase spectrum remains unused. While the common argument for using only the magnitude spectrum is that the human ear is phase-deaf, phase-based features have remained less explored due to the additional signal processing difficulties they introduce. A useful representation of the phase is the group delay function, but its robust computation remains difficult. This paper advocates the use of group delay functions derived from parametric all-pole models instead of their direct computation from the discrete Fourier transform. Using a subset of the vocal effort data in the NIST 2010 speaker recognition evaluation (SRE) corpus, we show that group delay features derived via parametric all-pole models improve recognition accuracy, especially under high vocal effort. Additionally, the group delay features provide comparable or improved accuracy over conventional magnitude-based MFCC features. Thus, the use of group delay functions derived from all-pole models provides an effective way to utilize information from the phase spectrum of speech signals. This work was funded by the Academy of Finland (grant 253120).
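Computing group delay from an all-pole model, rather than from the DFT phase, can be sketched as follows; the pole positions here are hypothetical stand-ins for LPC-estimated resonances:

```python
import numpy as np
from scipy.signal import group_delay

# Toy all-pole model H(z) = 1/A(z) with two resonances (formant-like pole pairs)
poles = np.array([0.97 * np.exp(1j * 0.3), 0.95 * np.exp(1j * 1.1)])
a = np.poly(np.concatenate([poles, poles.conj()])).real  # A(z) coefficients

# Group delay (-dphi/domega, in samples) on a 512-point frequency grid [0, pi)
w, gd = group_delay((np.array([1.0]), a), w=512)
peak = w[np.argmax(gd)]  # the peak sits near the strongest pole's angle (~0.3 rad)
print(gd.shape)          # (512,)
```

Because the all-pole model smooths the spectrum before the phase derivative is taken, the resulting group delay avoids the spikes that make direct DFT-based computation unstable.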

    Nonlinear feature based classification of speech under stress
