
    VoxCeleb2: Deep Speaker Recognition

    The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2, which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin.
    Comment: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 . 1806.05622v2: minor fixes; 5 pages
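    The abstract does not spell out how verification decisions are made, but CNN-based systems of this kind typically compare fixed-dimensional utterance embeddings. The sketch below is a generic illustration of that scoring step, using a placeholder feature extractor rather than the paper's actual model.

```python
# Minimal sketch of embedding-based speaker verification scoring.
# The extractor here is a stand-in, NOT the VoxCeleb2 paper's CNN:
# any network mapping an utterance to a fixed-length embedding works.
import numpy as np

def extract_embedding(utterance_features: np.ndarray) -> np.ndarray:
    """Placeholder embedding: mean-pool spectral frames (time x freq).
    A real system would run a trained CNN here."""
    emb = utterance_features.mean(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-8)

def verification_score(utt_a: np.ndarray, utt_b: np.ndarray) -> float:
    """Cosine similarity between embeddings; higher means same speaker."""
    return float(np.dot(extract_embedding(utt_a), extract_embedding(utt_b)))

# Toy usage with random "spectrogram" features (frames x mel bins).
rng = np.random.default_rng(0)
score = verification_score(rng.normal(size=(200, 40)), rng.normal(size=(180, 40)))
print(f"verification score: {score:.3f}")  # threshold this to accept/reject
```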

    On the use of high-level information in speaker and language recognition

    Actas de las IV Jornadas de Tecnología del Habla (JTH 2006)
    Automatic speaker recognition systems have been largely dominated by acoustic-spectral systems, relying on proper modelling of speakers' short-term vocal tract characteristics. However, there is scientific and intuitive evidence that speaker-specific information is embedded in the speech signal in multiple short- and long-term characteristics. In this work, a multilevel speaker recognition system combining acoustic, phonotactic and prosodic subsystems is presented and assessed using NIST 2005 Speaker Recognition Evaluation data. For language recognition, the NIST 2005 Language Recognition Evaluation was selected to measure the performance of the high-level language recognition system.
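    As a rough illustration of how such a multilevel system can combine its subsystems, the snippet below performs weighted score-level fusion. The weights and scores are invented for illustration; the paper does not report this exact fusion rule.

```python
# Illustrative score-level fusion of speaker recognition subsystems.
# Subsystem scores and fusion weights are made up; treat this as a
# generic example rather than the paper's calibrated fusion.
from typing import Dict

def fuse_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted linear fusion of per-subsystem scores for one trial."""
    return sum(weights[name] * score for name, score in scores.items())

trial_scores = {"acoustic": 2.1, "phonotactic": 0.7, "prosodic": 0.3}
fusion_weights = {"acoustic": 0.6, "phonotactic": 0.25, "prosodic": 0.15}

fused = fuse_scores(trial_scores, fusion_weights)
print(f"fused score: {fused:.3f}")  # compare against a decision threshold
```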

    Adapting Prosody in a Text-to-Speech System


    Automatic prosodic variations modelling for language and dialect discrimination

    This paper addresses the problem of modelling prosody for language identification. The aim is to create a system that can be used prior to any linguistic work to show whether prosodic differences among languages or dialects can be determined automatically. In previous papers, we defined a prosodic unit, the pseudo-syllable. Rhythmic modelling has proven the relevance of the pseudo-syllable unit for automatic language identification. In this paper, we propose to model prosodic variations, that is, to model sequences of prosodic units. This is achieved by separating the phrase and accentual components of intonation. We propose an independent coding of those components on differentiated duration scales. Short-term and long-term language-dependent label sequences are modelled by n-gram models. The performance of the system is demonstrated by experiments on read speech and evaluated by experiments on spontaneous speech. Finally, an experiment is described on the discrimination of Arabic dialects, for which there is a lack of linguistic studies, notably on prosodic comparisons. We show that our system is able to clearly identify the dialectal areas, supporting the hypothesis that those dialects exhibit prosodic differences.
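    To make the n-gram modelling step concrete, the following toy example trains per-language bigram models over discrete prosodic labels (one label per pseudo-syllable) and classifies a test sequence by log-likelihood. The label inventory, smoothing, and data are illustrative assumptions, not the paper's.

```python
# Sketch of n-gram language identification over discrete prosodic labels.
# Labels, model order, smoothing, and data are invented for illustration.
from collections import Counter, defaultdict
from math import log

def train_bigram(sequences, vocab, alpha=1.0):
    """Return add-alpha smoothed bigram log-probabilities P(next | prev)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + alpha * len(vocab)
        for nxt in vocab:
            model[(prev, nxt)] = log((counts[prev][nxt] + alpha) / total)
    return model

def sequence_loglik(model, seq):
    """Log-likelihood of a label sequence under a bigram model."""
    return sum(model[(prev, nxt)] for prev, nxt in zip(seq, seq[1:]))

vocab = ["RISE", "FALL", "FLAT"]  # toy intonation labels per pseudo-syllable
train = {
    "lang_A": [["RISE", "FALL", "FLAT", "RISE", "FALL"]],
    "lang_B": [["FLAT", "FLAT", "RISE", "FLAT", "FLAT"]],
}
models = {lang: train_bigram(seqs, vocab) for lang, seqs in train.items()}

test = ["RISE", "FALL", "FLAT", "RISE"]
best = max(models, key=lambda lang: sequence_loglik(models[lang], test))
print("predicted language:", best)
```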

    Spot the conversation: speaker diarisation in the wild

    The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset contains overlapping speech, a large and diverse speaker pool, and challenging background conditions.
    Comment: The dataset will be available for download from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/voxconverse.html . The development set will be released in July 2020, and the test set will be released in October 2020.
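    A very rough sketch of the second stage, assigning speaker identities with self-enrolled models, is given below. The greedy matching rule, threshold, and embeddings are assumptions for illustration only and do not reproduce the paper's method.

```python
# Rough sketch of the identity-assignment idea described above: match each
# detected speech segment to an enrolled speaker, or enrol a new speaker.
# Threshold, embeddings, and the greedy rule are placeholders.
import numpy as np

ENROLL_THRESHOLD = 0.6  # illustrative cosine-similarity threshold

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def diarise(segment_embeddings):
    """Greedy online assignment: reuse the closest enrolled speaker model
    (here simply a stored embedding) or enrol the segment as a new speaker."""
    speakers, labels = [], []
    for emb in segment_embeddings:
        sims = [cosine(emb, spk) for spk in speakers]
        if sims and max(sims) >= ENROLL_THRESHOLD:
            labels.append(int(np.argmax(sims)))
        else:
            speakers.append(emb)
            labels.append(len(speakers) - 1)
    return labels

rng = np.random.default_rng(1)
segments = [rng.normal(size=64) for _ in range(5)]  # stand-in embeddings
print(diarise(segments))  # one integer speaker label per segment
```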
