Detection of Disguised Voice Using Probabilistic Neural Network
Voice disguise is the deliberate concealment of a speaker's identity and is widely exploited for illegal purposes, which complicates audio forensics; it is therefore important to determine whether a voice has been disguised. Detection is especially difficult when the disguise is produced with electronic scrambling devices or audio-editing software. Because voice disguise modifies the frequency spectrum of the speech signal, mel-frequency cepstral coefficients (MFCCs) are well suited to detecting it. In this paper, we extract MFCC statistical moments, including means and correlation coefficients, as acoustic features, and use a probabilistic neural network (PNN) classifier built on these features to decide whether a voice is disguised.
DOI: 10.17762/ijritcc2321-8169.15080
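The feature-extraction step described above can be illustrated with a short sketch. This is not the paper's own code: the toolkit (librosa), the sample rate, and the number of coefficients are assumptions, and only the MFCC means and inter-coefficient correlations named in the abstract are computed; a PNN classifier would then be trained on the resulting vectors.

```python
# Sketch: per-utterance MFCC statistical-moment features (means and
# inter-coefficient correlations), assuming librosa for MFCC extraction.
# Frame settings and coefficient count are illustrative, not the paper's.
import numpy as np
import librosa

def mfcc_moment_features(wav_path, n_mfcc=13, sr=16000):
    """Return MFCC means plus the upper triangle of the correlation matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)

    means = mfcc.mean(axis=1)          # mean of each coefficient over time
    corr = np.corrcoef(mfcc)           # correlation between coefficient trajectories
    iu = np.triu_indices(n_mfcc, k=1)  # keep each coefficient pair once
    return np.concatenate([means, corr[iu]])
```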
Human and Machine Speaker Recognition Based on Short Trivial Events
Trivial events such as coughs, laughs, and sniffs are ubiquitous in human-to-human conversation. Compared with regular speech, these events are usually short and unclear, so they are generally regarded as not speaker-discriminative and are largely ignored by present speaker recognition research. However, trivial events are highly valuable in particular circumstances such as forensic examination, because they are less subject to intentional change and can therefore be used to discover the genuine speaker behind disguised speech. In this paper, we collect a trivial-event speech database involving 75 speakers and 6 types of events, and report preliminary speaker recognition results on this database from both human listeners and machines. In particular, the deep feature learning technique recently proposed by our group is used to analyze and recognize the trivial events, leading to acceptable equal error rates (EERs) despite the extremely short durations (0.2-0.5 seconds) of these events. Comparing the different event types, 'hmm' appears to be the most speaker-discriminative.
Comment: ICASSP 201
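Since the results are reported as equal error rates, a minimal sketch of how an EER is typically computed from target and non-target trial scores may be useful; this is the standard definition, not the authors' evaluation code.

```python
# Sketch: equal error rate (EER) from speaker-recognition trial scores.
# Standard definition, not the authors' evaluation tooling.
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the operating point where the miss rate equals the false-alarm rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])

    order = np.argsort(scores)          # sweep the decision threshold upwards
    labels = labels[order]

    fnr = np.cumsum(labels) / labels.sum()                   # targets rejected so far
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets still accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2.0

# Example: equal_error_rate([2.1, 1.7, 0.9], [0.3, -0.5, 1.0])
```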
Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings
In forensic voice comparison, speaker embeddings have become widely popular over the last 10 years. Most pretrained speaker embeddings are trained on English corpora, because these are the most easily accessible. Language dependency can therefore be an important factor in automatic forensic voice comparison, especially when the target language is linguistically very different. Numerous commercial systems are available, but their models are mainly trained on a language (mostly English) other than the target language. In the case of a low-resource language, developing a forensic corpus with enough speakers to train deep learning models is costly. This study investigates whether a model pre-trained on an English corpus can be used on a target low-resource language (here, Hungarian) different from the one the model was trained on. Moreover, multiple samples are often not available from the offender (unknown speaker), so samples are compared pairwise both with and without speaker enrollment for the suspect (known) speakers. Two corpora developed specifically for forensic purposes are used, together with a third intended for traditional speaker verification. Two deep-learning-based speaker embedding extraction methods are applied: the x-vector and ECAPA-TDNN. Speaker verification was evaluated in the likelihood-ratio framework, and the language combinations (modeling, LR calibration, evaluation) were compared. Results were evaluated with the minCllr and EER metrics. It was found that a model pre-trained on a different language, but on a corpus with a large number of speakers, performs well on samples with a language mismatch. The effects of sample duration and speaking style were also examined: the longer the sample in question, the better the performance, and there is no real difference when various speaking styles are applied.
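The study reports results with the minCllr and EER metrics in the likelihood-ratio framework. A minimal sketch of the Cllr cost is given below, assuming the system scores have already been mapped to natural-log likelihood ratios; minCllr would additionally require optimal calibration (e.g., via the PAV algorithm), which is not shown.

```python
# Sketch: the log-likelihood-ratio cost (Cllr) used to evaluate
# forensic-voice-comparison output. Assumes the scores are natural-log LRs;
# this is the standard definition, not code from the study.
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr = 0.5 * (mean log2(1 + e^-llr | targets) + mean log2(1 + e^llr | non-targets))."""
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-target_llrs)))    # cost of weak support for true hypotheses
    c_non = np.mean(np.log2(1.0 + np.exp(nontarget_llrs)))  # cost of support for false hypotheses
    return 0.5 * (c_tar + c_non)
```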
A Likelihood-Ratio Based Forensic Voice Comparison in Standard Thai
This research uses a likelihood ratio (LR) framework to assess
the discriminatory power of a range of acoustic parameters
extracted from speech samples produced by male speakers of
Standard Thai. The thesis aims to answer two main questions: 1) how well the tested linguistic-phonetic segments of Standard Thai perform in forensic voice comparison (FVC); and 2) how such linguistic-phonetic segments can be profitably combined through logistic regression using the FoCal Toolkit (Brümmer, 2007). The segments focused on in this study are the four
consonants /s, ʨh, n, m/ and the two diphthongs [ɔi, ai].
First of all, using the alveolar fricative /s/, two different
sets of features were compared in terms of their performance in
FVC. The first comprised the spectrum-based distributional
features of four spectral moments, namely mean, variance, skew
and kurtosis; the second consisted of the coefficients of the
Discrete Cosine Transform (DCTs) applied to a spectrum. As DCTs
were found to perform better, they were subsequently used to
model the consonant spectrum of the remaining consonants. The
consonant spectrum was extracted at the center point of the /s,
ʨh, n, m/ consonants with a Hamming window of 31.25 msec.
For the diphthongs [ɔi] - [nɔi L] and [ai] - [mai HL], the
cubic polynomials fitted to the F2 and F1-F3 formants were tested
separately. The quadratic polynomials fitted to the tonal F0
contours of [ɔi] - [nɔi L] and [ai] - [mai HL] were tested as
well. Long-term F0 distribution (LTF0) was also trialed.
The results show the promising discriminatory power of the
Standard Thai acoustic features and segments tested in this
thesis. The main findings are as follows.
1. The fricative /s/ performed better with the DCTs (Cllr = 0.70)
than with the spectral moments (Cllr = 0.92).
2. The nasals /n, m/ (Cllr = 0.47) performed better than the
affricate /tɕh/ (Cllr = 0.54) and the fricative /s/ (Cllr =
0.70) when their DCT coefficients were parameterized.
3. F1-F3 trajectories (Cllr = 0.42 and Cllr = 0.49) outperformed
F2 trajectory (Cllr = 0.69 and Cllr = 0.67) for both diphthongs
[ɔi] and [ai].
4. F1-F3 trajectories of the diphthong [ɔi] (Cllr = 0.42)
outperformed those of [ai] (Cllr = 0.49).
5. Tonal F0 (Cllr = 0.52) outperformed LTF0 (Cllr = 0.74).
6. Overall, better results were obtained when the DCTs of /n/ - [na: HL] and /n/ - [nɔi L] were fused (Cllr = 0.40, with the largest consistent-with-fact SSLog10LR = 2.53).
In light of the findings, we can conclude that Standard Thai is
generally amenable to FVC, especially when linguistic-phonetic
segments are being combined; it is recommended that the latter
procedure be followed when dealing with forensically realistic
casework.
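The DCT parameterization used for the consonants can be sketched as follows. The window length (31.25 ms, Hamming) at the consonant midpoint follows the description above; the sample rate, the use of the log-magnitude spectrum, and the number of coefficients retained are illustrative assumptions rather than the thesis's exact settings.

```python
# Sketch: DCT parameterization of a consonant spectrum from a 31.25 ms
# Hamming-windowed frame at the consonant midpoint. Sample rate, log-magnitude
# spectrum, and coefficient count are illustrative assumptions.
import numpy as np
from scipy.fft import dct

def consonant_dct_features(signal, sr, midpoint_s, n_coeffs=8, win_s=0.03125):
    """Return the first n_coeffs DCT coefficients of the windowed log spectrum."""
    half = int(win_s * sr / 2)
    centre = int(midpoint_s * sr)
    frame = signal[centre - half:centre + half] * np.hamming(2 * half)

    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum of the window
    log_spec = np.log(spectrum + 1e-10)     # log scale, small floor for stability
    return dct(log_spec, type=2, norm='ortho')[:n_coeffs]
```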
Forensic and Automatic Speaker Recognition System
Automatic Speaker Recognition (ASR) systems have emerged as an important means of confirming identity in many businesses, e-commerce applications, forensics, and law enforcement. Specialists trained in forensic speaker recognition can perform this task considerably better by examining a set of acoustic, prosodic, and semantic attributes, an approach referred to as structured listening. Algorithm-based systems for forensic speaker recognition have been developed by phonetic scientists and forensic linguists to reduce the probability of contextual bias, or of preconceptions about a reference model, when comparing an unknown audio sample with any suspected individual. Many researchers continue to develop automatic algorithms in signal processing and machine learning so that improved performance can effectively establish a speaker's identity, with the automatic system performing on a par with human listeners. In this paper, I review the literature on the identification of speakers by machines and humans, emphasizing the key technical approaches that have emerged for automatic systems over the last decade. I focus on many aspects of automatic speaker recognition (ASR) systems, including speaker-specific features, speaker models, standard assessment data sets, and performance metrics.
Can we have faith that jurors listen without prejudice? Likely sources of inaccuracy in voice-comparison exercises
Reviews the legal position governing the circumstances under which fact-finders in criminal proceedings, particularly jurors, are allowed to listen to audio recordings in the context of a voice-comparison exercise for the purposes of identifying the speaker. Examines the estimator and system variables that may affect the exercise's accuracy, and suggests potential safeguards that could be introduced to reduce misidentification.