11 research outputs found

    On the Computation of the Kullback-Leibler Measure for Spectral Distances

    Get PDF
    Efficient algorithms for the exact and approximate computation of the symmetrical Kullback-Leibler measure for spectral distances are presented for linear predictive coding (LPC) spectra. An interpretation of this measure is given in terms of the poles of the spectra. The accuracy and computational complexity of the algorithms are assessed for the task of computing concatenation costs in unit-selection-based speech synthesis. At equal complexity and storage requirements, the exact method is superior in terms of accuracy.
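
    As a worked reference, one common form of the symmetrized Kullback-Leibler measure between two power spectra S1 and S2 is J(S1, S2) = (1/pi) * integral over [0, pi) of (S1(w) - S2(w)) log(S1(w)/S2(w)) dw. The sketch below evaluates it numerically for all-pole LPC spectra; it is a minimal illustration, not the paper's exact or approximate pole-based algorithms, and the function names are assumptions.

```python
import numpy as np
from scipy.signal import freqz

def lpc_power_spectrum(a, gain=1.0, n_freq=512):
    # All-pole LPC model H(z) = gain / A(z), with a = [1, a1, ..., ap].
    _, h = freqz([gain], a, worN=n_freq)
    return np.abs(h) ** 2

def symmetric_kl(a1, a2, gain1=1.0, gain2=1.0, n_freq=512):
    # J(S1, S2) = (1/pi) * integral over [0, pi) of (S1 - S2) * log(S1 / S2),
    # approximated by the mean over a uniform frequency grid.
    s1 = lpc_power_spectrum(a1, gain1, n_freq)
    s2 = lpc_power_spectrum(a2, gain2, n_freq)
    return float(np.mean((s1 - s2) * np.log(s1 / s2)))
```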

    Statistical prediction of spectral discontinuities of speech in concatenative synthesis

    Get PDF
    The estimation of spectral discontinuities is one of the most common problems in concatenative speech synthesis. This paper introduces a methodology based on analyzing the statistical behaviour of objective measures over natural concatenations. The main goal is to define an automatic process for selecting the most appropriate measures as concatenation costs, so as to generate high-quality synthetic speech. The paper describes the objective and subjective results that validate the proposal.
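
    A minimal sketch of the underlying idea, assuming the selection is driven by z-scoring candidate joins against the statistics of natural joins (the paper's actual measures and selection procedure are not reproduced; names are illustrative):

```python
import numpy as np

def natural_join_stats(distances):
    # Empirical mean/std of an objective measure computed across natural
    # (never-concatenated) unit boundaries in the corpus.
    d = np.asarray(distances, dtype=float)
    return d.mean(), d.std()

def join_cost(distance, mean, std):
    # Z-score of a candidate join against the natural-join distribution;
    # large values suggest an audible spectral discontinuity.
    return (distance - mean) / std
```

    A measure whose distribution over natural joins is tight, and whose z-scores track perceived discontinuity, is a good candidate concatenation cost.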

    Towards Signal-Based Instrumental Quality Diagnosis for Text-to-Speech Systems

    Full text link

    On the Information Geometry of Audio Streams with Applications to Similarity Computing

    Get PDF
    This paper proposes methods for processing audio streams using tools from information geometry. We lay the theoretical groundwork for a framework that treats signal information as information entities suitable for similarity and symbolic computing on audio signals. The theory builds on the information geometry of the statistical structures representing audio spectrum features, and specifically on the bijection between Bregman divergences and exponential families of distributions. The proposed framework, called Music Information Geometry, allows online segmentation of audio streams into metric balls, where each ball represents a quasi-stationary continuous chunk of audio, and provides methods to qualify and quantify information between entities for similarity computing. We define an information geometry that approximates a similarity metric space, redefine general notions in music information retrieval such as similarity between entities, and address methods for dealing with the non-stationarity of audio signals. We demonstrate the framework on two sample applications: online audio structure discovery and audio matching.
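
    For reference, the Bregman divergence generated by a strictly convex, differentiable function F, the object on one side of the bijection the abstract invokes, is

```latex
D_F(x \,\|\, y) = F(x) - F(y) - \langle x - y,\ \nabla F(y) \rangle .
```

    Taking F to be the negative Shannon entropy, F(x) = \sum_i x_i \log x_i, yields the Kullback-Leibler divergence, which the bijection pairs with the multinomial member of the exponential family.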

    Sided and Symmetrized Bregman Centroids

    Full text link

    Parametric synthesis of expressive speech

    Get PDF
    In this thesis, methods for expressive speech synthesis using parametric approaches are presented. It is shown that deep neural networks achieve better results than synthesis based on hidden Markov models. Three new methods for synthesizing expressive speech with deep neural networks are presented: style codes, model re-training, and a shared-hidden-layer architecture. The best results are achieved with the style-code method. A new method for style transplantation based on the shared-hidden-layer architecture is also proposed, and is shown to outperform the reference method from the literature.
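
    A minimal sketch of the style-code idea as described: a one-hot style/emotion code is appended to the linguistic input features, so a single acoustic network serves all styles. Layer sizes, names, and the one-hot encoding are illustrative assumptions, not the thesis's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCodeAcousticModel(nn.Module):
    def __init__(self, n_linguistic, n_styles, n_acoustic, hidden=256):
        super().__init__()
        self.n_styles = n_styles
        self.net = nn.Sequential(
            nn.Linear(n_linguistic + n_styles, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_acoustic),
        )

    def forward(self, linguistic, style_id):
        # Append the one-hot style code to every input frame.
        code = F.one_hot(style_id, self.n_styles).float()
        return self.net(torch.cat([linguistic, code], dim=-1))
```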

    Unit selection and waveform concatenation strategies in Cantonese text-to-speech.

    Get PDF
    Oey Sai Lok. Thesis (M.Phil.), Chinese University of Hong Kong, 2005. Includes bibliographical references. Abstracts in English and Chinese.

    Chapter 1. Introduction
    1.1 An overview of Text-to-Speech technology
    1.1.1 Text processing
    1.1.2 Acoustic synthesis
    1.1.3 Prosody modification
    1.2 Trends in Text-to-Speech technologies
    1.3 Objectives of this thesis
    1.4 Outline of the thesis

    Chapter 2. Cantonese Speech
    2.1 The Cantonese dialect
    2.2 Phonology of Cantonese
    2.2.1 Initials
    2.2.2 Finals
    2.2.3 Tones
    2.3 Acoustic-phonetic properties of Cantonese syllables

    Chapter 3. Cantonese Text-to-Speech
    3.1 General overview
    3.1.1 Text processing
    3.1.2 Corpus-based acoustic synthesis
    3.1.3 Prosodic control
    3.2 Syllable-based Cantonese Text-to-Speech system
    3.3 Sub-syllable-based Cantonese Text-to-Speech system
    3.3.1 Definition of sub-syllable units
    3.3.2 Acoustic inventory
    3.3.3 Determination of the concatenation points
    3.4 Problems

    Chapter 4. Waveform Concatenation for Sub-syllable Units
    4.1 Previous work in concatenation methods
    4.1.1 Determination of concatenation point
    4.1.2 Waveform concatenation
    4.2 Problems and difficulties in concatenating sub-syllable units
    4.2.1 Mismatch of acoustic properties
    4.2.2 Allophone problem of Initials /z/, /c/ and /s/
    4.3 General procedures in concatenation strategies
    4.3.1 Concatenation of unvoiced segments
    4.3.2 Concatenation of voiced segments
    4.3.3 Measurement of spectral distance
    4.4 Detailed procedures in concatenation point determination
    4.4.1 Unvoiced segments
    4.4.2 Voiced segments
    4.5 Selected examples in concatenation strategies
    4.5.1 Concatenation at Initial segments
    4.5.1.1 Plosives
    4.5.1.2 Fricatives
    4.5.2 Concatenation at Final segments
    4.5.2.1 V group (long vowel)
    4.5.2.2 D group (diphthong)

    Chapter 5. Unit Selection for Sub-syllable Units
    5.1 Basic requirements in the unit selection process
    5.1.1 Availability of multiple copies of sub-syllable units
    5.1.1.1 Levels of "identical"
    5.1.1.2 Statistics on the availability
    5.1.2 Variations in acoustic parameters
    5.1.2.1 Pitch level
    5.1.2.2 Duration
    5.1.2.3 Intensity level
    5.2 Selection process: availability check on sub-syllable units
    5.2.1 Multiple copies found
    5.2.2 Unique copy found
    5.2.3 No matched copy found
    5.2.4 Illustrative examples
    5.3 Selection process: acoustic analysis on candidate units

    Chapter 6. Performance Evaluation
    6.1 General information
    6.1.1 Objective test
    6.1.2 Subjective test
    6.1.3 Test materials
    6.2 Details of the objective test
    6.2.1 Testing method
    6.2.2 Results
    6.2.3 Analysis
    6.3 Details of the subjective test
    6.3.1 Testing method
    6.3.2 Results
    6.3.3 Analysis
    6.4 Summary

    Chapter 7. Conclusions and Future Works
    7.1 Conclusions
    7.2 Suggested future works

    Appendix 1: Mean pitch level of Initials and Finals stored in the inventory
    Appendix 2: Mean durations of Initials and Finals stored in the inventory
    Appendix 3: Mean intensity level of Initials and Finals stored in the inventory
    Appendix 4: Test words used in performance evaluation
    Appendix 5: Test paragraph used in performance evaluation
    Appendix 6: Pitch profile used in the Text-to-Speech system
    Appendix 7: Duration model used in the Text-to-Speech system
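
    Chapters 4 and 5 of the outline above revolve around choosing concatenation points by spectral distance. A minimal sketch of that generic idea, assuming Euclidean distance between cepstral frames (the thesis's exact distance measure and search procedure are not reproduced; names are illustrative):

```python
import numpy as np

def best_join(cep_left, cep_right, search=5):
    # cep_left: (n, d) cepstral frames at the tail of the left unit;
    # cep_right: (m, d) cepstral frames at the head of the right unit.
    # Compare the last `search` left frames with the first `search`
    # right frames and join where the spectral distance is smallest.
    best_i, best_j, best_d = -1, -1, np.inf
    for i in range(max(len(cep_left) - search, 0), len(cep_left)):
        for j in range(min(search, len(cep_right))):
            d = np.linalg.norm(cep_left[i] - cep_right[j])
            if d < best_d:
                best_i, best_j, best_d = i, j, d
    return best_i, best_j, best_d
```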

    Adaptive audio-visual synthesis: automatic training methods for unit-selection-based audio-visual speech synthesis

    Get PDF
    This thesis develops and applies algorithms and methods that enable video-realistic audio-visual synthesis. The generated audio-visual signal shows a talking head constructed from previously recorded video data and an underlying TTS system. The work is divided into three parts: statistical learning methods, concatenative speech synthesis, and video-realistic audio-visual synthesis. The speech synthesis system developed here concatenates natural speech units, an approach commonly known as unit-selection-based text-to-speech. The same selection-and-concatenation procedure is used for the visual synthesis, drawing on natural video sequences. The statistical learning methods developed and applied are mainly graph-based, most notably hidden Markov models and conditional random fields, which serve to select suitable speech-representation units. The visual synthesis uses a prototype-based learning method widely known as the k-nearest-neighbour algorithm. Training the system requires an annotated speech corpus and an annotated video corpus. To evaluate these methods, video-realistic audio-visual synthesis software was developed that fully automatically turns text input into the desired video sequence; no step up to signal output requires manual intervention.
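
    The unit selection described above is conventionally posed as a dynamic-programming search minimizing target plus concatenation costs; the HMM/CRF models mentioned refine how those costs are derived. A minimal generic sketch of the search itself (all names are illustrative; this is not the thesis's implementation):

```python
def select_units(targets, candidates, target_cost, join_cost):
    # Viterbi search: for each target position and candidate unit, keep the
    # cheapest path ending in that unit, then backtrack from the best end.
    n = len(targets)
    cost = [[target_cost(targets[0], u) for u in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for t in range(1, n):
        row, brow = [], []
        for u in candidates[t]:
            best_c, best_i = float("inf"), -1
            for i, v in enumerate(candidates[t - 1]):
                c = cost[t - 1][i] + join_cost(v, u)
                if c < best_c:
                    best_c, best_i = c, i
            row.append(best_c + target_cost(targets[t], u))
            brow.append(best_i)
        cost.append(row)
        back.append(brow)
    # Backtrack from the cheapest final candidate.
    i = min(range(len(candidates[-1])), key=lambda k: cost[-1][k])
    path = [i]
    for t in range(n - 1, 0, -1):
        i = back[t][i]
        path.append(i)
    path.reverse()
    return [candidates[t][path[t]] for t in range(n)]
```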