135 research outputs found

    Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models

    This paper is about developing personalized speech synthesis systems from recordings of mildly impaired speech. In particular, we consider consonant and vowel alterations resulting from partial glossectomy, the surgical removal of part of the tongue. The aim is to restore articulation in the synthesized speech while maximally preserving the target speaker's individuality. We propose to tackle the problem with guided diffusion models. Specifically, a diffusion-based speech synthesis model is trained on the original recordings to capture and preserve the target speaker's articulation style. At inference time, a separately trained phone classifier guides the synthesis process towards proper articulation. Objective and subjective evaluation results show that the proposed method substantially improves articulation in the synthesized speech over the original recordings, and preserves more of the target speaker's individuality than a voice conversion baseline. Comment: submitted to INTERSPEECH 202
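
    The guidance mechanism described above can be pictured with a short, hypothetical sketch: at every reverse-diffusion step, the gradient of a frame-level phone classifier's log-likelihood for the target phone sequence is added to the model's noise prediction, nudging the sample towards proper articulation. All names here (score_model, phone_clf, ddpm_update, the noise schedule) are assumptions for illustration, not the authors' implementation.

        import math
        import torch

        def guided_sampling(score_model, phone_clf, target_phones, shape,
                            alphas_cumprod, ddpm_update, guidance_scale=1.0):
            """Classifier-guided reverse diffusion over acoustic features."""
            x = torch.randn(shape)                        # start from Gaussian noise
            for t in reversed(range(len(alphas_cumprod))):
                a_t = float(alphas_cumprod[t])            # cumulative signal level at step t
                with torch.enable_grad():
                    x_in = x.detach().requires_grad_(True)
                    eps = score_model(x_in, t)            # model's noise prediction
                    # rough estimate of the clean features, fed to the classifier
                    x0_hat = (x_in - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
                    log_probs = phone_clf(x0_hat).log_softmax(dim=-1)
                    target_ll = log_probs.gather(-1, target_phones.unsqueeze(-1)).sum()
                    grad = torch.autograd.grad(target_ll, x_in)[0]
                # shift the prediction towards higher phone-classifier likelihood
                eps = eps - guidance_scale * math.sqrt(1.0 - a_t) * grad
                x = ddpm_update(x, eps, t)                # unguided ancestral step, supplied by caller
            return x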

    Pronunciation modelling in end-to-end text-to-speech synthesis

    Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high naturalness scores without extensive processing of the text input. Since S2S models have been proposed for multiple stages of the TTS pipeline, the field has moved towards collapsing the pipeline into End-to-End (E2E) TTS, where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon lookup and/or G2P modelling) could be learnt implicitly by a text encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes the contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has shown similar naturalness scores with text or phone input (e.g. [4]), and successful modelling of phonetic context has led some to question the benefit of using phone input instead of text input altogether (see [5]).

    The use of text input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone input, an S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. Using common datasets for E2E-TTS in English, I simulated implicit G2P models and found increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names), since the knowledge needed to disambiguate their pronunciations may not be provided by the local grapheme context and may lie beyond what sentence-level text-audio sequences contain. When test stimuli were selected according to G2P difficulty, increased mispronunciations were observed in E2E-TTS with text input. Following the proposed benefits of subword decomposition in S2S modelling for other language tasks (e.g. neural machine translation), the effect of morphological decomposition on pronunciation modelling was investigated, and learning of the French post-lexical phenomenon liaison was also evaluated.

    With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility was investigated through a re-evaluation of six years of results from the Blizzard Challenge. In controlled conditions in English, ASR reliably found the same significant differences between systems as paid listeners. However, an analysis of transcriptions of words exhibiting difficult-to-predict G2P relations showed that the E2E-ASR Transformer model used was unreliable for such words, producing homophonic or otherwise incorrect transcriptions. A further evaluation of representation mixing in Tacotron finds that pronunciation correction is possible when mixing text and phone inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.
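
    The idea of keeping the lexicon as a pronunciation guide can be sketched as a simple front end in the spirit of representation mixing: words with a known pronunciation are replaced by phones from a lexicon, while out-of-lexicon words fall back to a learned G2P model or to raw graphemes. The lexicon format and the g2p_fallback callable are illustrative assumptions, not the thesis implementation.

        def mixed_input(sentence, lexicon, g2p_fallback=None):
            """Return a token sequence mixing phones (from the lexicon) and graphemes."""
            tokens = []
            for word in sentence.lower().split():
                if word in lexicon:                    # pronunciation guide available
                    tokens.extend(lexicon[word])       # e.g. ['HH', 'AH0', 'L', 'OW1']
                elif g2p_fallback is not None:
                    tokens.extend(g2p_fallback(word))  # learned G2P, may mispronounce
                else:
                    tokens.extend(list(word))          # leave difficult words as graphemes
                tokens.append('<wb>')                  # word-boundary marker
            return tokens

        # Toy usage with a CMUdict-style entry; 'nguyen' stays as graphemes.
        lexicon = {'hello': ['HH', 'AH0', 'L', 'OW1']}
        print(mixed_input('Hello Nguyen', lexicon))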

    Sequence labeling to detect stuttering events in read speech

    Stuttering is a speech disorder that, if treated during childhood, may be prevented from persisting into adolescence. A clinician must first determine the severity of stuttering by assessing a child during a conversational or reading task and recording each instance of disfluency, either in real time or after transcribing the recorded session and analysing the transcript. The current study evaluates the ability of two machine learning approaches, namely conditional random fields (CRF) and bi-directional long short-term memory (BLSTM), to detect stuttering events in transcriptions of stuttering speech. The two approaches are compared both on ideal hand-transcribed data and on the output of automatic speech recognition (ASR), and we also study the effect of data augmentation on performance. A corpus of 35 speakers' read speech (13K words) was supplemented with a corpus of 63 speakers' spontaneous speech (11K words) and an artificially generated corpus (50K words). Experimental results show that, without feature engineering, BLSTM classifiers outperform CRF classifiers by 33.6%. However, adding features to support the CRF classifier yields performance improvements of 45% and 18% over the CRF baseline and the BLSTM results, respectively. Moreover, adding more data to train the CRF and BLSTM classifiers consistently improves the results.
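
    As a rough illustration of the sequence-labelling setup, the sketch below tags each word of a transcript with a disfluency label using a bi-directional LSTM; the label inventory, embedding size, and hidden size are assumptions for illustration rather than the study's configuration.

        import torch
        import torch.nn as nn

        class BLSTMTagger(nn.Module):
            """Per-word disfluency tagger (e.g. fluent / repetition / prolongation)."""
            def __init__(self, vocab_size, num_labels, emb_dim=100, hidden=128):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)
                self.blstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                     bidirectional=True)
                self.out = nn.Linear(2 * hidden, num_labels)  # 2x: both directions

            def forward(self, word_ids):                      # (batch, seq_len)
                h, _ = self.blstm(self.embed(word_ids))       # (batch, seq_len, 2*hidden)
                return self.out(h)                            # per-word label scores

        # Training would minimise token-level cross-entropy against the stuttering
        # annotations; a CRF output layer could replace the final linear layer.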

    Automatic Framework to Aid Therapists to Diagnose Children who Stutter


    Advances in deep learning methods for speech recognition and understanding

    This work presents several studies in the areas of speech recognition and understanding. Semantic understanding of spoken language is an important sub-domain of the broader field of artificial intelligence. Speech processing has interested researchers for a long time, since language is one of the defining characteristics of a human being. With the development of neural networks, the domain has seen rapid progress both in terms of accuracy and of human perception. Another important milestone was achieved with the development of end-to-end approaches. Such approaches allow co-adaptation of all parts of the model, which increases performance and simplifies the training procedure. End-to-end models became feasible with the increasing amount of available data and computational resources and, most importantly, with many novel architectural developments. Nevertheless, traditional (non end-to-end) approaches remain relevant for speech processing because of challenging data in noisy environments, accented speech, and the high variety of dialects.

    In the first work, we explore hybrid speech recognition in noisy environments. We propose to treat recognition under unseen noise conditions as a domain adaptation task, using the then-novel technique of adversarial domain adaptation. In a nutshell, that prior work trains features so that they are discriminative for the primary task but non-discriminative for a secondary task, which is constructed to be domain recognition; the trained features are therefore invariant to the domain at hand. We adopt this technique and modify it for the task of noisy speech recognition.

    In the second work, we develop a general method for regularizing generative recurrent networks. Recurrent networks frequently have difficulty staying on the same track when generating long outputs. While bi-directional networks can be used for better sequence aggregation when learning features, they are not applicable to the generative case. We developed a way to improve the consistency of generating long sequences with recurrent networks by constructing a model similar to a bi-directional network: the key insight is to use a soft L2 loss between the forward and the backward generative recurrent networks. We provide an experimental evaluation on a multitude of tasks and datasets, including speech recognition, image captioning, and language modeling.

    In the third paper, we investigate the possibility of developing an end-to-end intent recognizer for spoken language understanding. Semantic spoken language understanding is an important step towards developing a human-like artificial intelligence. End-to-end approaches have shown high performance on tasks including machine translation and speech recognition, and we draw inspiration from these prior works to develop an end-to-end system for intent recognition.
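
    The adversarial domain-adaptation idea used in the first work is commonly implemented with a gradient reversal layer; the sketch below shows one possible form, in which the encoder's features feed a primary (senone) classifier normally and a domain classifier through a sign-flipped gradient. Module names and the loss weighting are assumptions for illustration, not the thesis code.

        import torch
        from torch.autograd import Function

        class GradReverse(Function):
            """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
            @staticmethod
            def forward(ctx, x, lam):
                ctx.lam = lam
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return -ctx.lam * grad_output, None     # flip the gradient sign

        def adversarial_loss(features, senone_head, domain_head,
                             senone_targets, domain_targets, lam=0.1):
            ce = torch.nn.functional.cross_entropy
            primary = ce(senone_head(features), senone_targets)
            # the reversed gradient pushes the encoder towards domain-invariant features
            domain = ce(domain_head(GradReverse.apply(features, lam)), domain_targets)
            return primary + domain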

    IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech

    IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies for Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentation of projects, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all contributions, each submitted paper was reviewed by three members of the scientific review committee, and all papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, an extension of selected papers will be published as a special issue of the journal Applied Sciences, “IberSPEECH 2020: Speech and Language Technologies for Iberian Languages”, published by MDPI with full open access. In addition to regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session. Red Española de Tecnologías del Habla. Universidad de Valladolid.

    Multi-dialect Arabic broadcast speech recognition

    Dialectal Arabic speech research suffers from a lack of labelled resources and standardised orthography. There are three main challenges in dialectal Arabic speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training robust dialectal speech recognition models from limited labelled data, and (iii) evaluating speech recognition for dialects with no orthographic rules. This thesis makes the following three contributions.

    Arabic Dialect Identification: We mainly deal with Arabic speech without prior knowledge of the spoken dialect. Arabic dialects can be sufficiently diverse that one can argue they are different languages rather than dialects of the same language. We make two contributions here. First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected from the Al Jazeera TV channel. We obtained utterance-level dialect labels for 57 hours of high-quality speech, selected from almost 1,000 hours and covering four major varieties of dialectal Arabic (DA): Egyptian, Levantine, Gulf (Arabian Peninsula), and North African (Moroccan). Second, we build an Arabic dialect identification (ADI) system. We explore two main groups of features, acoustic and linguistic. For the linguistic features, we look at a wide range of representations covering words, characters, and phonemes. For the acoustic features, we look at raw features such as mel-frequency cepstral coefficients combined with shifted delta cepstra (MFCC-SDC), bottleneck features, and the i-vector as a latent variable. We study both generative and discriminative classifiers, in addition to deep learning approaches, namely the deep neural network (DNN) and the convolutional neural network (CNN). We propose a five-class Arabic dialect challenge comprising the four dialects mentioned above as well as Modern Standard Arabic.

    Arabic Speech Recognition: We introduce our effort in building Arabic automatic speech recognition (ASR) and in creating an open research community to advance it. This part has two main goals. First, creating a framework for Arabic ASR that is publicly available for research: we describe our effort in building two multi-genre broadcast (MGB) challenges, where MGB-2 focuses on broadcast news using more than 1,200 hours of speech and 130M words of text collected from the broadcast domain, and MGB-3 focuses on dialectal multi-genre data with limited non-orthographic speech collected from YouTube, with special attention paid to transfer learning. Second, building a robust Arabic ASR system and reporting a competitive word error rate (WER) to serve as a benchmark for advancing the state of the art in Arabic ASR. Our overall system is a combination of five acoustic models (AM): unidirectional long short-term memory (LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN), TDNN layers followed by LSTM layers (TDNN-LSTM), and TDNN layers followed by BLSTM layers (TDNN-BLSTM). The AMs are purely sequence-trained neural networks using lattice-free maximum mutual information (LFMMI). The generated lattices are rescored with a four-gram language model (LM) and a recurrent neural network with maximum entropy (RNNME) LM. Our official WER is 13%, the lowest reported on this task.

    Evaluation: The third part of the thesis addresses our effort in evaluating dialectal speech with no orthographic rules. Our methods learn from multiple transcribers and align the recognition hypotheses to overcome the lack of a standard orthography. Our multi-reference WER (MR-WER) approach is similar to the BLEU score used in machine translation (MT). We also automated this process by learning spelling variants from Twitter data: we mine a huge collection of tweets in an unsupervised fashion to build more than 11M n-to-m lexical pairs, and we propose a new evaluation metric, dialectal WER (WERd). Finally, we estimate the word error rate (e-WER) with no reference transcription, using decoding and language features, and show that our estimation is robust in many scenarios with and without the decoding features.
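
    The WERd idea can be pictured with a small, self-contained sketch: hypothesis and reference words are first mapped to a canonical form through a table of dialectal spelling variants and then scored with the usual word-level edit distance. The toy variant table below stands in for the n-to-m pairs mined from Twitter; it is an illustration, not the thesis implementation.

        def edit_distance(ref, hyp):
            """Word-level Levenshtein distance via dynamic programming."""
            prev = list(range(len(hyp) + 1))
            for i, r in enumerate(ref, 1):
                cur = [i] + [0] * len(hyp)
                for j, h in enumerate(hyp, 1):
                    cur[j] = min(prev[j] + 1,             # deletion
                                 cur[j - 1] + 1,          # insertion
                                 prev[j - 1] + (r != h))  # substitution
                prev = cur
            return prev[-1]

        def werd(reference, hypothesis, variants):
            """WER after canonicalising dialectal spelling variants on both sides."""
            ref = [variants.get(w, w) for w in reference.split()]
            hyp = [variants.get(w, w) for w in hypothesis.split()]
            return edit_distance(ref, hyp) / max(len(ref), 1)

        # Toy example: two spellings of the same dialectal word count as a match.
        variants = {'hada': 'hatha'}                      # hypothetical variant pair
        print(werd('hatha ktir', 'hada ktir', variants))  # -> 0.0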

    Simulating vocal learning of spoken language: Beyond imitation

    Computational approaches have an important role to play in understanding the complex process of speech acquisition in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and phonological correspondence problems, can be addressed by linguistically grounded auditory perception. In particular, we show how the articulation of consonant-vowel syllables may be learnt from auditory percepts that can represent either individual utterances by speakers with different vocal tract characteristics or ideal phonetic realisations. The result is an optimisation-based implementation of vocal exploration – incorporating semantic, auditory, and articulatory signals – that can serve as a basis for simulating vocal learning beyond imitation.
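
    One way to picture optimisation-based vocal exploration is as a search over articulatory parameters that minimises the distance between the auditory percept of the learner's own production and a target percept, with a small penalty on articulatory effort. In the sketch below, synthesise and perceive stand in for an articulatory synthesiser and a linguistically grounded auditory model; the greedy random search and all parameter values are illustrative assumptions, not the authors' method.

        import numpy as np

        def explore(target_percept, synthesise, perceive,
                    dim=10, iters=2000, step=0.05, effort_weight=0.01, seed=0):
            """Greedy random search over articulatory parameters."""
            rng = np.random.default_rng(seed)
            art = np.zeros(dim)                       # neutral articulation
            best_cost = np.inf
            for _ in range(iters):
                candidate = art + step * rng.standard_normal(dim)
                percept = perceive(synthesise(candidate))
                cost = (np.linalg.norm(percept - target_percept)
                        + effort_weight * np.linalg.norm(candidate))
                if cost < best_cost:                  # keep the better articulation
                    art, best_cost = candidate, cost
            return art, best_cost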
    • 
