
    Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification

    There are a number of studies on the extraction of bottleneck (BN) features from deep neural networks (DNNs) trained to discriminate speakers, pass-phrases and triphone states for improving the performance of text-dependent speaker verification (TD-SV). However, only moderate success has been achieved. A recent study [1] presented a time contrastive learning (TCL) concept to explore the non-stationarity of brain signals for classification of brain states. Speech signals have a similar non-stationarity property, and TCL has the further advantage of requiring no labeled data. We therefore present a TCL-based BN feature extraction method. The method uniformly partitions each speech utterance in a training dataset into a predefined number of multi-frame segments. Each segment in an utterance corresponds to one class, and class labels are shared across utterances. DNNs are then trained to discriminate all speech frames among the classes to exploit the temporal structure of speech. In addition, we propose a segment-based unsupervised clustering algorithm to re-assign class labels to the segments. TD-SV experiments were conducted on the RedDots challenge database. The TCL-DNNs were trained using speech data of fixed pass-phrases that were excluded from the TD-SV evaluation set, so the learned features can be considered phrase-independent. We compare the performance of the proposed TCL-BN feature with those of short-time cepstral features and BN features extracted from DNNs discriminating speakers, pass-phrases, speaker+pass-phrase, as well as monophones whose labels and boundaries are generated by three different automatic speech recognition (ASR) systems. Experimental results show that the proposed TCL-BN outperforms cepstral features and speaker+pass-phrase discriminant BN features, and its performance is on par with that of ASR-derived BN features. Moreover, ...
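
    As an illustration of the segment-labeling scheme described above (a minimal sketch, not the authors' implementation), the snippet below assigns TCL class labels to the frames of a single utterance; the segment count of 10 is an arbitrary placeholder.

        import numpy as np

        def tcl_segment_labels(num_frames, num_segments=10):
            """Uniformly partition an utterance's frames into `num_segments`
            contiguous multi-frame segments and return one class label per frame.
            Labels are the segment indices 0..num_segments-1, so they are shared
            across utterances as described in the abstract."""
            # np.array_split handles utterances whose length is not divisible
            # by the number of segments.
            labels = np.empty(num_frames, dtype=np.int64)
            segments = np.array_split(np.arange(num_frames), num_segments)
            for seg_idx, frame_idx in enumerate(segments):
                labels[frame_idx] = seg_idx
            return labels

        # Example: a 103-frame utterance mapped to 10 TCL classes.
        frame_labels = tcl_segment_labels(103, num_segments=10)
        # A DNN would then be trained to classify each frame's acoustic features
        # into these classes; BN features are taken from a bottleneck layer.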

    Speaker segmentation and clustering

    This survey focuses on two challenging speech processing topics, namely speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
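
    The survey covers metric-based segmentation as a family of methods; as a concrete illustration (not any single reviewed algorithm), here is a minimal delta-BIC change-point sketch over frame-level features such as MFCCs. The window sizes and penalty weight are placeholder values.

        import numpy as np

        def delta_bic(x, y, penalty=1.0):
            """Delta-BIC between two adjacent feature windows x, y (frames x dims).
            A positive value favours modelling them as two separate Gaussians,
            i.e. a speaker change point between the windows."""
            z = np.vstack([x, y])
            n, d = z.shape
            # log-determinants of full covariance matrices
            ld_z = np.linalg.slogdet(np.cov(z, rowvar=False))[1]
            ld_x = np.linalg.slogdet(np.cov(x, rowvar=False))[1]
            ld_y = np.linalg.slogdet(np.cov(y, rowvar=False))[1]
            complexity = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
            return 0.5 * (n * ld_z - len(x) * ld_x - len(y) * ld_y) - penalty * complexity

        def candidate_change_points(features, win=150, hop=50, penalty=1.0):
            """Slide a pair of adjacent windows over frame-level features and
            return frame indices where delta-BIC is positive."""
            points = []
            for start in range(0, len(features) - 2 * win, hop):
                left = features[start:start + win]
                right = features[start + win:start + 2 * win]
                if delta_bic(left, right, penalty) > 0:
                    points.append(start + win)
            return points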

    A Review of the Fingerprint, Speaker Recognition, Face Recognition and Iris Recognition Based Biometric Identification Technologies

    This paper reviews four biometric identification technologies (fingerprint, speaker recognition, face recognition and iris recognition). It discusses the mode of operation of each of the technologies and highlights their advantages and disadvantages

    Adaptive Vocal Random Challenge Support for Biometric Authentication

    The aim of this thesis was to develop a speech recognition application that could be used for vocal random challenges. The goal of the application was to provide one possible solution to the central security problem of voice-based biometric authentication: replay attacks. The software is based on the open-source PocketSphinx speech recognition toolkit and is written in the Python programming language. The resulting application consists of two parts: a demonstration application with a graphical user interface (GUI) and a command line utility. The GUI application is suitable for demonstrating the capabilities of the speech recognition toolkit, whereas the command line utility can be used to add speech recognition capabilities to virtually any other application. In the GUI application, the user interacts with the program directly by voice to open the door of the demonstration scene. The user must say the correct sequence of digits, or the English word corresponding to the displayed picture, in order to be authenticated. Both applications can be configured by generating custom language models or, in the case of the demonstration application, by changing the length of the numeric random challenges.
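
    As a rough illustration of the random-challenge idea (not the thesis's actual code), the sketch below generates a digit challenge and checks a recognizer's transcript against it; recognize_speech() is a hypothetical placeholder for the PocketSphinx-based recognition step, not an API from the thesis.

        import random

        DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                       "five", "six", "seven", "eight", "nine"]

        def make_digit_challenge(length=4):
            """Generate a random digit challenge, e.g. ['three', 'seven', 'one', 'nine']."""
            return [random.choice(DIGIT_WORDS) for _ in range(length)]

        def verify_response(challenge, recognized_text):
            """Accept only if the recognized words exactly match the challenge,
            which is what makes pre-recorded replays useless."""
            return recognized_text.lower().split() == challenge

        # Usage sketch:
        challenge = make_digit_challenge(length=4)
        print("Please say:", " ".join(challenge))
        # recognized = recognize_speech()   # hypothetical PocketSphinx-based step
        # authenticated = verify_response(challenge, recognized)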

    Adaptation of reference patterns in word-based speech recognition

    Verification of feature regions for stops and fricatives in natural speech

    The presence of acoustic cues and their importance in speech perception have long remained debatable topics. In spite of several studies that exist in this field, very little is known about what exactly humans perceive in speech. This research takes a novel approach towards understanding speech perception. A new method, named three-dimensional deep search (3DDS), was developed to explore the perceptual cues of 16 consonant-vowel (CV) syllables, namely /pa/, /ta/, /ka/, /ba/, /da/, /ga/, /fa/, /Ta/, /sa/, /Sa/, /va/, /Da/, /za/, /Za/, from naturally produced speech. A verification experiment was then conducted to further verify the findings of the 3DDS method. For this purpose, the time-frequency coordinate that defines each CV was filtered out using the short-time Fourier transform (STFT), and perceptual tests were then conducted. A comparison between unmodified speech sounds and those without the acoustic cues was made. In most of the cases, the scores dropped from 100% to chance levels even at 12 dB SNR. This clearly emphasizes the importance of features in identifying each CV. The results confirm earlier findings that stops are characterized by a short-duration burst preceding the vowel by 10 cs in the unvoiced case, and appearing almost coincident with the vowel in the voiced case. As has been previously hypothesized, we confirmed that the F2 transition plays no significant role in consonant identification. 3DDS analysis labels the /sa/ and /za/ perceptual features as an intense frication noise around 4 kHz, preceding the vowel by 15-20 cs, with the /za/ feature being around 5 cs shorter in duration than that of /sa/; the /Sa/ and /Za/ events are found to be frication energy near 2 kHz, preceding the vowel by 17-20 cs. /fa/ has a relatively weak burst and frication energy over a wide band including 2-6 kHz, while /va/ has a cue in the 1.5 kHz mid-frequency region preceding the vowel by 7-10 cs. New information is established regarding /Da/ and /Ta/, especially with regard to the nature of their significant confusions
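
    The cue-removal step lends itself to a small illustration: the sketch below zeroes a time-frequency region of a signal and resynthesizes it with scipy's STFT/ISTFT. The band, time span, and file name are placeholder assumptions, not the regions identified by 3DDS.

        import numpy as np
        from scipy.signal import stft, istft

        def remove_tf_region(signal, fs, t_range, f_range, nperseg=256):
            """Zero out an illustrative time-frequency region of a speech signal
            via STFT analysis and overlap-add resynthesis. t_range is
            (start_s, end_s); f_range is (low_hz, high_hz)."""
            f, t, Z = stft(signal, fs=fs, nperseg=nperseg)
            t_mask = (t >= t_range[0]) & (t <= t_range[1])
            f_mask = (f >= f_range[0]) & (f <= f_range[1])
            Z[np.ix_(f_mask, t_mask)] = 0.0          # ablate the candidate cue region
            _, modified = istft(Z, fs=fs, nperseg=nperseg)
            return modified[:len(signal)]

        # Example: suppress 3.5-4.5 kHz frication energy shortly before the vowel.
        # fs, x = scipy.io.wavfile.read("sa.wav")   # hypothetical input file
        # y = remove_tf_region(x.astype(float), fs, t_range=(0.05, 0.20),
        #                      f_range=(3500, 4500))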

    An Automatic Real-time Synchronization of Live Speech with Its Transcription Approach

    Most studies on the automatic synchronization of speech and transcription focus on synchronization at the sentence or phrase level. Nevertheless, in some languages, such as Thai, boundaries at these levels are difficult to define linguistically, especially when synchronizing speech with its transcription. Consequently, synchronization at a finer level, such as the syllabic level, is promising. In this article, an approach to synchronize live speech with its corresponding transcription in real time at the syllabic level is proposed. Our approach employs a modified version of the real-time syllable detection procedure from our previous work, and a transcription verification procedure is then adopted to verify correctness and to recover errors caused by the real-time syllable detection. In the experiments, the acoustic features and parameters are tuned empirically. Results are compared with two baselines that have been applied to the Thai scenario. Experimental results indicate that our approach outperforms the two baselines with error rate reductions of 75.9% and 41.9%, respectively, and can also provide results in real time. In addition, our approach was applied to a practical application, namely ChulaDAISY. Practical experiments show that ChulaDAISY, used with our approach, could reduce the time required to produce audio books
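
    The paper's verification procedure is not detailed in the abstract; below is a minimal, illustrative sketch of one way to align detected syllables with the transcription's syllable sequence and flag spans for error recovery, assuming detected syllables are available as (label, time) pairs. It is not the authors' method.

        from difflib import SequenceMatcher

        def sync_syllables(transcript_syllables, detected):
            """Align the transcription's syllable sequence with syllables detected
            from live speech (list of (label, time_sec)). Returns (syllable, time)
            pairs for matched syllables plus a list of mismatched spans that a
            recovery step could re-estimate."""
            detected_labels = [label for label, _ in detected]
            matcher = SequenceMatcher(a=transcript_syllables, b=detected_labels)
            synced, to_recover = [], []
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op == "equal":
                    for i, j in zip(range(i1, i2), range(j1, j2)):
                        synced.append((transcript_syllables[i], detected[j][1]))
                else:
                    # Detector insertions/deletions/substitutions: mark the span
                    # so the timing can be recovered later.
                    to_recover.append((op, transcript_syllables[i1:i2], detected[j1:j2]))
            return synced, to_recover

        # synced, gaps = sync_syllables(["sa", "wat", "dee"],
        #                               [("sa", 0.12), ("dee", 0.71)])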