2,480 research outputs found
Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification
There are a number of studies on extracting bottleneck (BN) features from deep
neural networks (DNNs) trained to discriminate speakers, pass-phrases, and
triphone states in order to improve the performance of text-dependent speaker
verification (TD-SV). However, only moderate success has been achieved. A recent
study [1] presented a time contrastive learning (TCL) concept to explore the
non-stationarity of brain signals for classification of brain states. Speech
signals have a similar non-stationarity property, and TCL has the further
advantage of requiring no labeled data. We therefore present a TCL-based
BN feature extraction method. The method uniformly partitions each speech
utterance in a training dataset into a predefined number of multi-frame
segments. Each segment in an utterance corresponds to one class, and class
labels are shared across utterances. DNNs are then trained to discriminate all
speech frames among the classes to exploit the temporal structure of speech. In
addition, we propose a segment-based unsupervised clustering algorithm to
re-assign class labels to the segments. TD-SV experiments were conducted on the
RedDots challenge database. The TCL-DNNs were trained using speech data of
fixed pass-phrases that were excluded from the TD-SV evaluation set, so the
learned features can be considered phrase-independent. We compare the
performance of the proposed TCL bottleneck (BN) feature with those of
short-time cepstral features and BN features extracted from DNNs discriminating
speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels
and boundaries are generated by three different automatic speech recognition
(ASR) systems. Experimental results show that the proposed TCL-BN outperforms
cepstral features and speaker+pass-phrase discriminant BN features, and its
performance is on par with those of ASR-derived BN features. Moreover, ...
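The uniform segment-labeling scheme described in the abstract can be sketched as follows; the function name and the use of NumPy are our own illustrative choices, not details taken from the paper.

```python
import numpy as np

def tcl_labels(num_frames: int, num_classes: int) -> np.ndarray:
    """Uniformly partition an utterance's frames into `num_classes`
    multi-frame segments; every frame in segment i receives label i.
    Labels are shared across utterances, so a DNN trained to
    discriminate these classes must exploit the temporal structure
    of speech rather than utterance identity."""
    # np.array_split handles utterances whose length is not an exact
    # multiple of the number of classes.
    segments = np.array_split(np.arange(num_frames), num_classes)
    labels = np.empty(num_frames, dtype=np.int64)
    for cls, idx in enumerate(segments):
        labels[idx] = cls
    return labels

# e.g. a 10-frame utterance split into 5 classes
print(tcl_labels(10, 5).tolist())  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
```

The paper's unsupervised clustering step would then re-assign these initial uniform labels based on segment similarity; that refinement is not shown here.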
Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, the algorithm advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
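As an illustration of the metric-based segmentation family mentioned above, the widely used delta-BIC criterion compares two adjacent feature windows under a one-Gaussian-per-window model versus a single pooled Gaussian. The exact formulation and penalty constant vary across systems, so treat this as a sketch rather than any particular surveyed algorithm.

```python
import numpy as np

def delta_bic(x: np.ndarray, y: np.ndarray, penalty: float = 1.0) -> float:
    """Delta-BIC between two adjacent feature windows (rows are
    feature vectors). Positive values suggest a speaker change point
    between the windows; `penalty` is the usual lambda tuning
    constant."""
    z = np.vstack([x, y])
    n_x, n_y, n_z = len(x), len(y), len(z)
    d = z.shape[1]

    def logdet(c: np.ndarray) -> float:
        # log-determinant of the full sample covariance of a window
        return np.linalg.slogdet(np.cov(c, rowvar=False))[1]

    # complexity term for one extra full-covariance Gaussian in d dims
    p = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n_z)
    return 0.5 * (n_z * logdet(z)
                  - n_x * logdet(x)
                  - n_y * logdet(y)) - penalty * p
```

A segmenter would slide a hypothesized boundary through the stream and declare a change point wherever the criterion is positive.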
A Review of the Fingerprint, Speaker Recognition, Face Recognition and Iris Recognition Based Biometric Identification Technologies
This paper reviews four biometric identification technologies (fingerprint,
speaker recognition, face recognition, and iris recognition). It discusses the
mode of operation of each technology and highlights their advantages and
disadvantages.
Adaptive Vocal Random Challenge Support for Biometric Authentication
The aim of this thesis was to develop a speech recognition application which could be used for vocal random challenges. The goal of the application was to provide a solution to the central security problem of voice-based biometric authentication: replay attacks. This piece of software is based on the PocketSphinx speech recognition toolkit and is written in the Python programming language.
The resulting application is composed of two parts: a demonstration application with a GUI, and a command line utility. The GUI application is suitable for demonstrating the capabilities of the speech recognition toolkit, whereas the command line utility can be used to add speech recognition capabilities to virtually any application. The user can interact with the door of the GUI application by using his or her voice: to be authenticated, the user must utter the word corresponding to the displayed picture in English or say the correct sequence of digits.
Both applications can be configured by generating custom language models or, for the demonstration application, by changing the length of the random digit challenges.
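The random-challenge idea above can be sketched independently of any particular recognizer: generate a fresh spoken challenge per attempt and compare it against the recognizer's transcript. The function names below are ours, not the thesis's, and the recognizer call itself is left out.

```python
import random

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def make_challenge(length=4, rng=None):
    """Generate a random digit challenge for the user to speak.
    A fresh challenge per attempt is what defeats replay attacks:
    a recording of a previous session will not match."""
    rng = rng or random.Random()
    return [rng.choice(DIGIT_WORDS) for _ in range(length)]

def verify(expected, recognized):
    """Compare a recognizer's transcript (e.g. from PocketSphinx)
    against the expected word sequence, ignoring case and spacing."""
    return recognized.lower().split() == expected

challenge = make_challenge(4, random.Random(7))
print(" ".join(challenge))
print(verify(challenge, " ".join(challenge).upper()))  # True
```

In the real application the `recognized` string would come from the speech recognition toolkit, constrained by a language model built from the challenge vocabulary.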
Verification of feature regions for stops and fricatives in natural speech
The presence of acoustic cues and their importance in speech perception have
long remained debatable topics. In spite of several studies that exist in this
field, very little is known about what exactly humans perceive in speech. This
research takes a novel approach towards understanding speech perception. A
new method, named three-dimensional deep search (3DDS), was developed
to explore the perceptual cues of 16 consonant-vowel (CV) syllables, namely
/pa/, /ta/, /ka/, /ba/, /da/, /ga/, /fa/, /Ta/, /sa/, /Sa/, /va/, /Da/, /za/,
/Za/, from naturally produced speech. A verification experiment was then
conducted to further verify the findings of the 3DDS method. For this purpose,
the time-frequency coordinate that defines each CV was filtered out
using the short-time Fourier transform (STFT), and perceptual tests were
then conducted. A comparison between unmodified speech sounds and those
without the acoustic cues was made. In most of the cases, the scores dropped
from 100% to chance levels even at 12 dB SNR. This clearly emphasizes the
importance of features in identifying each CV. The results confirm earlier
findings that stops are characterized by a short-duration burst preceding the
vowel by 10 cs in the unvoiced case, and appearing almost coincident
with the vowel in the voiced case. As has been previously hypothesized,
we confirmed that the F2 transition plays no significant role in consonant
identification. 3DDS analysis labels the /sa/ and /za/ perceptual features
as an intense frication noise around 4 kHz, preceding the vowel by 15-20
cs, with the /za/ feature being around 5 cs shorter in duration than that
of /sa/; the /Sa/ and /Za/ events are found to be frication energy near 2
kHz, preceding the vowel by 17-20 cs. /fa/ has a relatively weak burst and
frication energy over a wide band including 2-6 kHz, while /va/ has a cue
in the 1.5 kHz mid-frequency region preceding the vowel by 7-10 cs. New
information is established regarding /Da/ and /Ta/, especially with regard
to the nature of their significant confusions.
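The cue-removal filtering can be approximated with an STFT masking sketch like the following, assuming SciPy; the helper name and parameters are illustrative and not those used in the study.

```python
import numpy as np
from scipy.signal import stft, istft

def remove_tf_region(x, fs, t0, t1, f0, f1, nperseg=256):
    """Zero out the rectangular time-frequency region [t0, t1] (s) x
    [f0, f1] (Hz) of signal x, a crude stand-in for the cue-removal
    filtering used in the verification experiments, then resynthesize
    the waveform via the inverse STFT."""
    f, t, Z = stft(x, fs=fs, nperseg=nperseg)
    mask = np.ones_like(Z, dtype=bool)
    # select the bins inside the target rectangle and zero them
    mask[np.ix_((f >= f0) & (f <= f1), (t >= t0) & (t <= t1))] = False
    _, y = istft(Z * mask, fs=fs, nperseg=nperseg)
    return y[:len(x)]
```

With the default Hann window and 50% overlap, the STFT satisfies the COLA condition, so regions outside the mask are reconstructed essentially exactly.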
An Automatic Real-time Synchronization of Live speech with Its Transcription Approach
Most studies in automatic synchronization of speech and transcription focus on synchronization at the sentence or phrase level. Nevertheless, in some languages, such as Thai, boundaries at those levels are difficult to define linguistically, especially when synchronizing speech with its transcription. Consequently, synchronization at a finer level, such as the syllabic level, is promising. In this article, an approach to synchronize live speech with its corresponding transcription in real time at the syllabic level is proposed. Our approach employs the modified real-time syllable detection procedure from our previous work; a transcription verification procedure is then adopted to verify correctness and to recover errors caused by the real-time syllable detection procedure. In experiments, the acoustic features and the parameters are customized empirically. Results are compared with two baselines which have been applied to the Thai scenario. Experimental results indicate that our approach outperforms the two baselines with error rate reductions of 75.9% and 41.9%, respectively, and can also provide results in real time. In addition, our approach is applied to a practical application, namely ChulaDAISY. Practical experiments show that ChulaDAISY, applied with our approach, could reduce the time consumed in producing audio books.
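The verification-and-recovery idea can be sketched as a greedy alignment of a stream of detected syllable labels against the known transcript; this toy version (our own construction, not the paper's algorithm) tolerates insertion errors by dropping unmatched detections and deletion errors by looking ahead in the transcript.

```python
def synchronize(detected, transcript, lookahead=2):
    """Greedily align detected syllable labels (streamed, possibly
    containing insertion/deletion errors) to the known transcript.
    Returns (transcript_index, detection_index) pairs, i.e. which
    transcript syllable each accepted detection highlights."""
    sync, i = [], 0
    for j, syl in enumerate(detected):
        if i >= len(transcript):
            break
        if syl == transcript[i]:
            sync.append((i, j))
            i += 1
        else:
            # recovery: the detector may have missed syllables,
            # so look a few positions ahead in the transcript
            for k in range(1, lookahead + 1):
                if i + k < len(transcript) and syl == transcript[i + k]:
                    sync.append((i + k, j))
                    i += k + 1
                    break
            # otherwise treat `syl` as an insertion error and drop it
    return sync
```

A real-time system would run this incrementally as detections arrive, emitting a highlight event for each accepted pair.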