Detection of Disguised Voice Using Probabilistic Neural Network
Voice disguise is the deliberate concealment of a speaker's identity and is widely exploited for illegal purposes, which complicates audio forensics; it is therefore important to determine whether a voice has been disguised. Detection is especially difficult when the disguise is produced with electronic scrambling devices or audio-editing software. Because voice disguise modifies the frequency spectrum of the speech signal, mel-frequency cepstral coefficients (MFCCs) are well suited to detecting it. In this paper, we extract MFCC statistical moments, including means and correlation coefficients, as acoustic features, and use a probabilistic neural network (PNN) classifier built on these features to decide whether a voice is disguised.
DOI: 10.17762/ijritcc2321-8169.15080
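The feature-extraction step described above can be illustrated with a short sketch. This is not the paper's own code: the toolkit (librosa), the sample rate, and the number of coefficients are assumptions, and only the MFCC means and inter-coefficient correlations named in the abstract are computed; a PNN classifier would then be trained on the resulting vectors.

```python
# Sketch: per-utterance MFCC statistical-moment features (means and
# inter-coefficient correlations), assuming librosa for MFCC extraction.
# Frame settings and coefficient count are illustrative, not the paper's.
import numpy as np
import librosa

def mfcc_moment_features(wav_path, n_mfcc=13, sr=16000):
    """Return MFCC means plus the upper triangle of the correlation matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)

    means = mfcc.mean(axis=1)          # mean of each coefficient over time
    corr = np.corrcoef(mfcc)           # correlation between coefficient trajectories
    iu = np.triu_indices(n_mfcc, k=1)  # keep each coefficient pair once
    return np.concatenate([means, corr[iu]])
```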
Human and Machine Speaker Recognition Based on Short Trivial Events
Trivial events such as coughs, laughs, and sniffs are ubiquitous in human-to-human conversation. Compared with regular speech, these events are usually short and unclear, so they are generally regarded as not speaker-discriminative and are largely ignored by present speaker recognition research. However, trivial events are highly valuable in particular circumstances such as forensic examination, because they are less subject to intentional change and can therefore be used to discover the genuine speaker behind disguised speech. In this paper, we collect a trivial-event speech database involving 75 speakers and 6 types of events, and report preliminary speaker recognition results on this database from both human listeners and machines. In particular, the deep feature learning technique recently proposed by our group is used to analyze and recognize the trivial events, leading to acceptable equal error rates (EERs) despite the extremely short durations (0.2-0.5 seconds) of these events. Comparing the different event types, 'hmm' appears to be the most speaker-discriminative.
Comment: ICASSP 201
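Since the results are reported as equal error rates, a minimal sketch of how an EER is typically computed from target and non-target trial scores may be useful; this is the standard definition, not the authors' evaluation code.

```python
# Sketch: equal error rate (EER) from speaker-recognition trial scores.
# Standard definition, not the authors' evaluation tooling.
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """EER: the operating point where the miss rate equals the false-alarm rate."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])

    order = np.argsort(scores)          # sweep the decision threshold upwards
    labels = labels[order]

    fnr = np.cumsum(labels) / labels.sum()                   # targets rejected so far
    fpr = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # non-targets still accepted
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2.0

# Example: equal_error_rate([2.1, 1.7, 0.9], [0.3, -0.5, 1.0])
```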
Effects of language mismatch in automatic forensic voice comparison using deep learning embeddings
In forensic voice comparison, speaker embeddings have become widely popular over the last 10 years. Most pretrained speaker embeddings are trained on English corpora, because these are the most easily accessible. Language dependency can therefore be an important factor in automatic forensic voice comparison, especially when the target language is linguistically very different. Numerous commercial systems are available, but their models are mainly trained on a language (mostly English) other than the target language. In the case of a low-resource language, developing a forensic corpus with enough speakers to train deep learning models is costly. This study investigates whether a model pre-trained on an English corpus can be used on a target low-resource language (here, Hungarian) different from the one the model was trained on. Moreover, multiple samples are often not available from the offender (unknown speaker), so samples are compared pairwise both with and without speaker enrollment for the suspect (known) speakers. Two corpora developed specifically for forensic purposes are used, together with a third intended for traditional speaker verification. Two deep-learning-based speaker embedding extraction methods are applied: the x-vector and ECAPA-TDNN. Speaker verification was evaluated in the likelihood-ratio framework, and the language combinations (modeling, LR calibration, evaluation) were compared. Results were evaluated with the minCllr and EER metrics. It was found that a model pre-trained on a different language, but on a corpus with a large number of speakers, performs well on samples with a language mismatch. The effects of sample duration and speaking style were also examined: the longer the sample in question, the better the performance, and there is no real difference when various speaking styles are applied.
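The study reports results with the minCllr and EER metrics in the likelihood-ratio framework. A minimal sketch of the Cllr cost is given below, assuming the system scores have already been mapped to natural-log likelihood ratios; minCllr would additionally require optimal calibration (e.g., via the PAV algorithm), which is not shown.

```python
# Sketch: the log-likelihood-ratio cost (Cllr) used to evaluate
# forensic-voice-comparison output. Assumes the scores are natural-log LRs;
# this is the standard definition, not code from the study.
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Cllr = 0.5 * (mean log2(1 + e^-llr | targets) + mean log2(1 + e^llr | non-targets))."""
    target_llrs = np.asarray(target_llrs, dtype=float)
    nontarget_llrs = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-target_llrs)))    # cost of weak support for true hypotheses
    c_non = np.mean(np.log2(1.0 + np.exp(nontarget_llrs)))  # cost of support for false hypotheses
    return 0.5 * (c_tar + c_non)
```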
A Likelihood-Ratio Based Forensic Voice Comparison in Standard Thai
This research uses a likelihood ratio (LR) framework to assess
the discriminatory power of a range of acoustic parameters
extracted from speech samples produced by male speakers of
Standard Thai. The thesis aims to answer two main questions: 1) how well the tested linguistic-phonetic segments of Standard Thai perform in forensic voice comparison (FVC); and 2) how such linguistic-phonetic segments can be profitably combined through logistic regression using the FoCal Toolkit (Brümmer, 2007). The segments focused on in this study are the four
consonants /s, ʨh, n, m/ and the two diphthongs [ɔi, ai].
First of all, using the alveolar fricative /s/, two different
sets of features were compared in terms of their performance in
FVC. The first comprised the spectrum-based distributional
features of four spectral moments, namely mean, variance, skew
and kurtosis; the second consisted of the coefficients of the
Discrete Cosine Transform (DCTs) applied to a spectrum. As DCTs
were found to perform better, they were subsequently used to
model the consonant spectrum of the remaining consonants. The
consonant spectrum was extracted at the center point of the /s,
ʨh, n, m/ consonants with a Hamming window of 31.25 msec.
For the diphthongs [ɔi] - [nɔi L] and [ai] - [mai HL], the
cubic polynomials fitted to the F2 and F1-F3 formants were tested
separately. The quadratic polynomials fitted to the tonal F0
contours of [ɔi] - [nɔi L] and [ai] - [mai HL] were tested as
well. Long-term F0 distribution (LTF0) was also trialed.
The results show the promising discriminatory power of the
Standard Thai acoustic features and segments tested in this
thesis. The main findings are as follows.
1. The fricative /s/ performed better with the DCTs (Cllr = 0.70)
than with the spectral moments (Cllr = 0.92).
2. The nasals /n, m/ (Cllr = 0.47) performed better than the
affricate /tɕh/ (Cllr = 0.54) and the fricative /s/ (Cllr =
0.70) when their DCT coefficients were parameterized.
3. F1-F3 trajectories (Cllr = 0.42 and Cllr = 0.49) outperformed
F2 trajectory (Cllr = 0.69 and Cllr = 0.67) for both diphthongs
[ɔi] and [ai].
4. F1-F3 trajectories of the diphthong [ɔi] (Cllr = 0.42)
outperformed those of [ai] (Cllr = 0.49).
5. Tonal F0 (Cllr = 0.52) outperformed LTF0 (Cllr = 0.74).
6. Overall, better results were obtained when the DCTs of /n/ - [na: HL] and /n/ - [nɔi L] were fused (Cllr = 0.40, with the largest consistent-with-fact SSLog10LR = 2.53).
In light of the findings, we can conclude that Standard Thai is
generally amenable to FVC, especially when linguistic-phonetic
segments are being combined; it is recommended that the latter
procedure be followed when dealing with forensically realistic
casework.
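The DCT parameterization used for the consonants can be sketched as follows. The window length (31.25 ms, Hamming) at the consonant midpoint follows the description above; the sample rate, the use of the log-magnitude spectrum, and the number of coefficients retained are illustrative assumptions rather than the thesis's exact settings.

```python
# Sketch: DCT parameterization of a consonant spectrum from a 31.25 ms
# Hamming-windowed frame at the consonant midpoint. Sample rate, log-magnitude
# spectrum, and coefficient count are illustrative assumptions.
import numpy as np
from scipy.fft import dct

def consonant_dct_features(signal, sr, midpoint_s, n_coeffs=8, win_s=0.03125):
    """Return the first n_coeffs DCT coefficients of the windowed log spectrum."""
    half = int(win_s * sr / 2)
    centre = int(midpoint_s * sr)
    frame = signal[centre - half:centre + half] * np.hamming(2 * half)

    spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum of the window
    log_spec = np.log(spectrum + 1e-10)     # log scale, small floor for stability
    return dct(log_spec, type=2, norm='ortho')[:n_coeffs]
```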
Forensic and Automatic Speaker Recognition System
Automatic Speaker Recognition (ASR) systems have emerged as an important means of confirming identity in many businesses, e-commerce applications, forensics, and law enforcement. Specialists trained in forensic speaker recognition can perform this task considerably better by examining a set of acoustic, prosodic, and semantic attributes, an approach referred to as structured listening. Algorithm-based systems for forensic speaker recognition have been developed by phonetic scientists and forensic linguists to reduce the probability of contextual bias, or of preconceptions about a reference model, when comparing an unknown audio sample with any suspected individual. Many researchers continue to develop automatic algorithms in signal processing and machine learning so that improved performance can effectively establish a speaker's identity, with the automatic system performing on a par with human listeners. In this paper, I review the literature on the identification of speakers by machines and humans, emphasizing the key technical approaches that have emerged for automatic systems over the last decade. I focus on many aspects of automatic speaker recognition (ASR) systems, including speaker-specific features, speaker models, standard assessment data sets, and performance metrics.
Can we have faith that jurors listen without prejudice? Likely sources of inaccuracy in voice-comparison exercises
Reviews the legal position governing the circumstances under which fact-finders in criminal proceedings, particularly jurors, are allowed to listen to audio recordings in the context of a voice-comparison exercise for the purposes of identifying the speaker. Examines the estimator and system variables that may affect the exercise's accuracy, and suggests potential safeguards that could be introduced to reduce misidentification.