64 research outputs found

    Time and Frequency Pruning for Speaker Identification

    Get PDF
    This work is an attempt to refine decisions in speaker identification. A test utterance is divided into multiple time-frequency blocks, on each of which a normalized likelihood score is calculated. Instead of averaging the block-likelihoods over the whole test utterance, some of them are rejected (pruning) and the final score is computed from a limited number of time-frequency blocks. The results obtained in the special case of time pruning led the authors to experiment with a joint time and frequency pruning approach. The optimal percentage of blocks pruned is learned on a tuning data set with the minimum identification error criterion. Validation of the time-frequency pruning process on 567 speakers leads to a significant error rate reduction (up to 41% on TIMIT) for short training and test durations.

    Introduction. Mono-gaussian models for speaker recognition have been largely replaced by Gaussian Mixture Models (GMM), which are dedicated to modeling smaller clusters of speech. Gaussian mixture modeling can be seen as a cooperation of models, since the gaussian mixture density is a weighted linear combination of uni-modal gaussian densities. The work presented here is instead concerned with competition of models: different mono-gaussian models (corresponding to different frequency subbands) are applied to the test signal, and the decision is made with the best or the N-best model scores. More precisely, a test utterance is divided into time-frequency blocks, each corresponding to a particular frequency subband and a particular time segment. During the recognition phase, the block scores are accumulated over the whole test utterance to compute a global score and take a final decision. In this work, we investigate accumulation using a hard-threshold approach: some block scores are eliminated (pruning) and the final decision is taken with a subset of these scores. This approach should be robust to time-frequency localized noise, since the least reliable time-frequency blocks can be removed. Even in the case of clean speech, some blocks of a speaker's test utterance can simply be more similar to another speaker's model than to the target speaker model itself. Removing these error-prone blocks should lead to a more robust decision. In Section 2, a formalism is proposed to describe our block-based speaker recognition system. The potential of this approach is shown with a special case of the formalism: time pruning (Section 3). Experiments intended to find the optimal percentage of blocks pruned are described in Section 4. The optimal parameters (percentage of blocks pruned) are validated on the TIMIT and NTIMIT databases (Section 5). Finally, we summarize our main results and outline the potential advantages of the time-frequency pruning procedure in Section 6.

    Formalism: mono-gaussian 'segmental' modeling. Let $\{x_t\}_{1 \le t \le M}$ be a sequence of M vectors resulting from the p-dimensional acoustic analysis of a speech signal uttered by speaker X. These vectors are summarized by the mean vector $\bar{x}$ and the covariance matrix $X$:

    $\bar{x} = \frac{1}{M} \sum_{t=1}^{M} x_t, \qquad X = \frac{1}{M} \sum_{t=1}^{M} (x_t - \bar{x})(x_t - \bar{x})^{t}$
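    A minimal sketch of the pruning rule described above (not the authors' code: the function names and the 30% pruning rate are illustrative, and in the paper the rate is tuned on a separate data set):

        import numpy as np

        def pruned_score(block_scores, prune_fraction):
            # block_scores: 1-D array of normalized log-likelihoods, one per
            # time-frequency block of the test utterance.
            # Keep only the best-scoring blocks and average them.
            n_keep = max(1, int(round(len(block_scores) * (1.0 - prune_fraction))))
            kept = np.sort(block_scores)[-n_keep:]  # highest-scoring blocks survive
            return kept.mean()

        def identify(block_scores_per_speaker, prune_fraction=0.3):
            # One score array per speaker model, all for the same test utterance;
            # the identified speaker is the one with the best pruned average.
            scores = [pruned_score(s, prune_fraction) for s in block_scores_per_speaker]
            return int(np.argmax(scores))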

    A Very Low Resource Language Speech Corpus for Computational Language Documentation Experiments

    Full text link
    Most speech and language technologies are trained with massive amounts of speech and text data. However, most of the world's languages do not have such resources or a stable orthography. Systems constructed under these almost-zero-resource conditions are promising not only for speech technology but also for computational language documentation. The goal of computational language documentation is to help field linguists to (semi-)automatically analyze and annotate audio recordings of endangered and unwritten languages; example tasks are automatic phoneme discovery and lexicon discovery from the speech signal. This paper presents a speech corpus collected during a realistic language documentation process. It is made up of 5k speech utterances in Mboshi (Bantu C25) aligned to French text translations. Speech transcriptions are also made available: they correspond to a non-standard graphemic form close to the language's phonology. We present how the data was collected, cleaned and processed, and we illustrate its use through a zero-resource task: spoken term discovery. The dataset is made available to the community for reproducible computational language documentation experiments and their evaluation. (Accepted to LREC 2018.)
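    The corpus pairs each Mboshi utterance with a French translation and a graphemic transcription. A minimal sketch of how one such aligned record might be represented (field names and the example path are hypothetical; the released corpus defines its own layout):

        from dataclasses import dataclass

        @dataclass
        class AlignedUtterance:
            # One corpus item: Mboshi speech paired with its French translation.
            audio_path: str          # speech recording
            mboshi_graphemic: str    # non-standard graphemic transcription,
                                     # close to the language's phonology
            french_translation: str  # aligned French text

        utt = AlignedUtterance(
            audio_path="mboshi/wav/utt_0001.wav",  # hypothetical path
            mboshi_graphemic="...",
            french_translation="...",
        )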

    Unsupervised word segmentation from speech with attention

    Get PDF
    We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal of automatically identifying lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing between recordings in the UL and translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones, which is then segmented using neural soft-alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi; comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.
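    A minimal sketch of one way such soft-alignments can yield word boundaries (an assumption about the mechanics, not the paper's exact procedure; all names are illustrative): assign each pseudo-phone to the translation word that attends to it most, and place a boundary wherever that assignment changes.

        import numpy as np

        def segment_from_attention(pseudo_phones, attention):
            # pseudo_phones: non-empty list of N acoustic-unit labels (from AUD).
            # attention: (T, N) array of NMT attention weights
            # (T translation words, N source pseudo-phones).
            alignment = attention.argmax(axis=0)  # best-attending word per unit
            words, current = [], [pseudo_phones[0]]
            for i in range(1, len(pseudo_phones)):
                if alignment[i] != alignment[i - 1]:  # assignment changed: boundary
                    words.append(current)
                    current = []
                current.append(pseudo_phones[i])
            words.append(current)
            return words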

    A cross-lingual adaptation approach for rapid development of speech recognizers for learning disabled users

    Get PDF
    Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most available approaches are targeted at a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapting a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development: it takes the already built speech recognition engine and its modules, and reuses existing resources for standard speech in different languages to recognize the users' atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users.

    This research was supported by the project 'Favoreciendo la vida autónoma de discapacitados intelectuales con problemas de comunicación oral mediante interfaces personalizados de reconocimiento automático del habla' ('Supporting the autonomous life of intellectually disabled people with oral communication problems through personalized automatic speech recognition interfaces'), financed by the Centre of Initiatives for Development Cooperation (Centro de Iniciativas de Cooperación al Desarrollo, CICODE), University of Granada, Spain, and by the Student Grant Scheme 2014 (SGS) at the Technical University of Liberec.
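    The abstract does not detail the adaptation mechanism. A common generic technique for reusing acoustic resources across languages (offered here only as an illustration, not as this paper's method) is a phone-mapping table from each source language's phone inventory onto the target recognizer's phone set; all mappings below are illustrative:

        # Map phones from source-language acoustic resources onto the target
        # recognizer's inventory; unmapped phones are assumed to be shared.
        PHONE_MAP = {
            ("es", "rr"): "r",  # illustrative: Spanish trill -> target tap
            ("en", "th"): "t",  # illustrative: dental fricative approximation
            ("de", "x"): "h",   # illustrative: velar fricative approximation
        }

        def map_phones(phones, source_lang, phone_map=PHONE_MAP):
            return [phone_map.get((source_lang, p), p) for p in phones]

        print(map_phones(["rr", "o", "s", "a"], "es"))  # ['r', 'o', 's', 'a']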