13 research outputs found

    Designing a speech corpus for instance-based spoken language generation

    In spoken language applications such as conversation systems, where not only the speech waveforms but also the content of the speech (the text) must be generated automatically, a Concept-to-Speech (CTS) system is needed. In this paper, we address several issues in designing a speech corpus to support an instance-based, integrated CTS framework. Neither the instance-based CTS generation approach nor the corpus design process has been addressed systematically in previous research.

    Corpus-based unit selection for natural-sounding speech synthesis

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 179-196). This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
    Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. Over the past decade or so, a trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs, which grade what unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge. The implementation is based on a finite-state transducer (FST) representation that has been successfully used in speech and language processing applications, including speech recognition. A proposed constraint kernel topology connects all units in the corpus with associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora. An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to record new utterances to enhance corpus coverage.
    This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability. Experimental subjects completing tasks in a given air travel planning scenario by interacting in real time with a spoken dialogue system over the telephone found the system "easiest to understand" out of eight competing systems. In more detailed listening evaluations, subjective opinions gathered from human participants are found to be correlated with objective measures calculable by machine. By Jon Rong-Wei Yi. Ph.D.
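
The search described in the abstract above can be illustrated with a minimal sketch. This is plain dynamic programming over a toy corpus, not the thesis's FST-based constraint kernel; the function names, the cost values, and the representation of units as single characters are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def toy_sub_cost(target: str, unit: str) -> float:
    """Hypothetical substitution cost: free for an exact match, 1.0 otherwise."""
    return 0.0 if target == unit else 1.0


def toy_concat_cost(prev_index: int, next_index: int) -> float:
    """Hypothetical concatenation cost: free when the two units were adjacent
    in the recorded corpus, a fixed 0.3 penalty for any other join."""
    return 0.0 if next_index == prev_index + 1 else 0.3


def select_units(
    targets: Sequence[str],
    corpus: Sequence[str],
    sub_cost: Callable[[str, str], float] = toy_sub_cost,
    concat_cost: Callable[[int, int], float] = toy_concat_cost,
) -> Tuple[List[int], float]:
    """Viterbi search for the cheapest sequence of corpus unit indices.

    Every corpus unit is a candidate at every target position; the path cost
    is the sum of substitution costs plus concatenation costs between
    consecutive choices.
    """
    if not targets or not corpus:
        raise ValueError("targets and corpus must be non-empty")
    n = len(corpus)
    # Cost of the best path ending in each corpus unit at the first position.
    prev = [sub_cost(targets[0], corpus[j]) for j in range(n)]
    backpointers: List[List[int]] = []
    for target in targets[1:]:
        cur: List[float] = []
        ptr: List[int] = []
        for j in range(n):
            # Cheapest predecessor for unit j, including the join cost.
            best_k = min(range(n), key=lambda k: prev[k] + concat_cost(k, j))
            cur.append(prev[best_k] + concat_cost(best_k, j) + sub_cost(target, corpus[j]))
            ptr.append(best_k)
        backpointers.append(ptr)
        prev = cur
    # Trace back from the cheapest final unit.
    j = min(range(n), key=prev.__getitem__)
    total = prev[j]
    path = [j]
    for ptr in reversed(backpointers):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return path, total


if __name__ == "__main__":
    # The corpus contains "c", "a", "b" contiguously at indices 2..4.
    print(select_units(list("cab"), list("abcab")))
```

With these toy costs, the search prefers a contiguous stretch of the corpus whenever one matches the target sequence, since such a stretch incurs neither substitution nor concatenation penalties. This naive version is quadratic in corpus size per target symbol, so it does not reflect the low-latency, large-corpus behaviour the abstract attributes to the FST implementation.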

    Conditioning Text-to-Speech synthesis on dialect accent: a case study

    Modern text-to-speech systems are modular in many different ways. In recent years, end-users have gained the ability to control speech attributes such as degree of emotion, rhythm and timbre, along with other suprasegmental features. More ambitious objectives involve modelling a combination of speakers and languages, e.g. to enable cross-speaker language transfer. However, no prior work has been done on the more fine-grained analysis of regional accents. To fill this gap, in this thesis we present practical end-to-end solutions to synthesise speech while controlling within-country variations of the same language, and we do so for 6 different dialects of the British Isles. In particular, we first conduct an extensive study of the speaker verification field and adapt state-of-the-art embedding models to work with dialect accents. Then, we adapt standard acoustic models and voice conversion systems by conditioning them on dialect accent representations, and finally compare our custom pipelines with a cutting-edge end-to-end architecture from the multi-lingual world. Results show that the adopted models are suitable and have enough capacity to accomplish the task of regional accent conversion. Indeed, we are able to produce speech closely resembling the selected speaker and dialect accent, where the most accurate synthesis is obtained via careful fine-tuning of the multi-lingual model to the multi-dialect case. Finally, we delineate limitations of our multi-stage approach and propose practical mitigations, to be explored in future work.

    Évaluation expérimentale d'un système statistique de synthèse de la parole, HTS, pour la langue française

    The work presented in this thesis concerns text-to-speech (TTS) synthesis and, more particularly, statistical parametric speech synthesis for French. We analyse how the linguistic contextual factors used to describe a speech signal affect the modelling performed by the HTS statistical speech synthesis system. To conduct the experiments, two objective evaluation protocols are proposed. The first uses Gaussian mixture models (GMM) to represent the acoustic space produced by HTS for a given contextual feature set. By using a constant reference set of natural speech stimuli, the GMMs can be compared with one another, and consequently so can the acoustic spaces generated by the different HTS configurations. The second protocol is based on pairwise distances between paired acoustic frames of natural speech and synthetic speech generated by HTS, which evaluates the modelling more locally and complements the other analyses, notably by controlling the data sets that are generated and evaluated. Results obtained by both protocols, and confirmed by subjective evaluations, show that using a large set of contextual factors does not necessarily improve the modelling and can be counter-productive for the quality of the synthesised speech.
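
As an illustration of the second protocol, a distance between paired frames can be sketched as below. The abstract does not state which distance is used or how frames are paired, so the Euclidean metric, the assumption that frames are already aligned one-to-one, and the function name `mean_paired_distance` are all hypothetical choices.

```python
import math
from typing import Sequence


def mean_paired_distance(
    natural: Sequence[Sequence[float]],
    synthetic: Sequence[Sequence[float]],
) -> float:
    """Average Euclidean distance between aligned natural and synthetic frames.

    Assumes the two sequences are already paired frame by frame (for example
    by a prior alignment step, which is not shown here).
    """
    if len(natural) != len(synthetic):
        raise ValueError("frame sequences must have the same length")
    if not natural:
        raise ValueError("at least one frame pair is required")
    total = 0.0
    for nat_frame, syn_frame in zip(natural, synthetic):
        if len(nat_frame) != len(syn_frame):
            raise ValueError("paired frames must have the same dimension")
        total += math.dist(nat_frame, syn_frame)
    return total / len(natural)


if __name__ == "__main__":
    natural_frames = [[0.0, 0.0], [1.0, 1.0]]
    synthetic_frames = [[3.0, 4.0], [1.0, 1.0]]
    print(mean_paired_distance(natural_frames, synthetic_frames))
```

A lower average would indicate synthetic frames closer to the natural reference under the chosen metric; comparing this value across HTS configurations is one way such a local measure could be used.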

    Conversión de texto en habla multidominio basada en selección de unidades con ajuste subjetivo de pesos y marcado robusto de pitch

    The final purpose of any Text-to-Speech (TTS) system is the generation of perfectly natural synthetic speech from any input text. Historically, two strategies have been followed in pursuit of this goal: general purpose TTS synthesis (GP-TTS), which favours the flexibility of the application at the expense of synthetic speech quality; and limited domain TTS synthesis (LD-TTS), which prioritizes high-quality synthesis by restricting the scope of the input text. At present, the most widely used strategy for developing TTS systems is corpus-based or unit selection TTS (US-TTS) synthesis. Although the quality of US-TTS systems is quite good in general, several open issues are still being investigated. This PhD thesis introduces several contributions for US-TTS systems in order to improve, on the one hand, the naturalness of GP-TTS systems and, on the other hand, the flexibility of LD-TTS systems. To address the former problem, a new technique is introduced for efficiently incorporating human perception into the unit selection process by subjectively tuning the weights of the cost function that guides unit selection, while controlling user fatigue and user consistency. In addition, a new method is described for improving the reliability of automatic speech corpus labelling, specifically a generic pitch mark filtering algorithm, which is an essential issue in corpus-based TTS systems. The latter problem is addressed by multi-domain TTS (MD-TTS) synthesis, which follows the LD-TTS approach and aims to achieve synthetic speech quality equivalent to that of LD-TTS systems while improving flexibility by considering different domains (speaking styles, emotions, topics, etc.) for synthesis. In this context, the MD-TTS system needs to know, at run time, which domain or domains are the most suitable for synthesizing the input text with the highest synthetic speech quality. To that end, the MD-TTS system adds a text classification module, adapted to the particular needs of MD-TTS, to the classic TTS architecture. Finally, all the proposals are evaluated through objective experiments (using classic measures together with new ones) and/or subjective perceptual tests, in order to validate the improvements achieved by the methods developed in the US-TTS framework, as a further step towards developing high-quality and flexible text-to-speech synthesis systems.

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

    Reports to the President

    A compilation of annual reports for the 1988-1989 academic year, including a report from the President of the Massachusetts Institute of Technology, as well as reports from the academic and administrative units of the Institute. The reports outline the year's goals, accomplishments, honors and awards, and future plans.

    Preface


    Reports to the President

    A compilation of annual reports for the 1999-2000 academic year, including a report from the President of the Massachusetts Institute of Technology, as well as reports from the academic and administrative units of the Institute. The reports outline the year's goals, accomplishments, honors and awards, and future plans.

    Exploring Written Artefacts

    This collection, presented to Michael Friedrich in honour of his academic career at the Centre for the Study of Manuscript Cultures, traces key concepts that scholars associated with the Centre have developed and refined for the systematic study of manuscript cultures. At the same time, the contributions showcase the possibilities of expanding the traditional subject of ‘manuscripts’ to the larger perspective of ‘written artefacts’.