
    A copy synthesis method to pilot the Klatt synthesiser

    This paper presents a copy synthesis method for controlling the Klatt synthesizer. Our method allows speech stimuli to be constructed very easily. We adopted the parallel branch of the Klatt synthesizer. After formants have been tracked, the amplitudes of the resonators are measured on a spectrum obtained by an algorithm derived from cepstral smoothing called the "true envelope". This algorithm has the advantage of approximating harmonics very accurately. The analysis strategy for a speech signal is straightforward: the fundamental frequency is calculated so that voiced regions are known, and the frication energy is set to the value of the spectral energy above 4000 Hz. Stimuli created by means of this method have a timbre close to that of natural speech. This copy synthesis method is incorporated in our software for speech research called "Snorri". The user therefore has at his disposal a versatile tool for creating stimuli in the context of the Klatt synthesizer.
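
    The "true envelope" named above is, as described, a cepstral-smoothing variant that hugs the harmonic peaks. Below is a minimal sketch of iterative true-envelope estimation in that spirit; the windowing, the F0-dependent lifter order and the convergence threshold are illustrative assumptions, not values from the paper.

        import numpy as np

        def true_envelope(frame, sr, f0, n_iter=50, tol_db=0.1):
            """Estimate a log-magnitude envelope that passes through harmonic peaks."""
            spec = np.fft.rfft(frame * np.hanning(len(frame)))
            log_mag = 20 * np.log10(np.abs(spec) + 1e-12)
            order = int(sr / (2 * f0))          # lifter cutoff tied to the F0 period
            env = log_mag.copy()
            target = log_mag
            for _ in range(n_iter):
                # replace the target by the max of itself and the current envelope,
                # then low-pass lifter its cepstrum to smooth it
                target = np.maximum(target, env)
                ceps = np.fft.irfft(target)
                ceps[order:len(ceps) - order] = 0.0  # keep low quefrencies only
                env = np.fft.rfft(ceps).real
                if np.max(target - env) < tol_db:    # envelope now covers all peaks
                    break
            return env  # log-magnitude envelope, dB

    The returned envelope can then be sampled at the tracked formant frequencies to set the resonator amplitudes of the parallel branch.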

    Snorri, a software for speech sciences

    Using tools for investigating speech signals is an invaluable help in teaching phonetics and, more generally, the speech sciences. For several years we have undertaken the development of the software WinSnorri, which serves both speech scientists as a research tool and phonetics teachers as an illustration tool. It consists of five types of tools:
    * to edit speech signals,
    * to annotate speech signals phonetically or orthographically; WinSnorri offers tools to explore annotated corpora automatically,
    * to analyse speech with several spectral analyses and monitor spectral peaks over time,
    * to study prosody; besides pitch calculation, it is possible to synthesise new signals by modifying the F0 curve and/or the speech rate,
    * to generate parameters for the Klatt synthesiser; a user-friendly graphical interface and copy synthesis tools allow the user to generate files for the Klatt synthesiser easily.
    In the context of the speech sciences, Snorri can therefore be exploited for many purposes, among them illustrating speech phenomena and investigating acoustic cues of speech sounds and prosody.
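
    As a rough illustration of the prosody-editing idea behind these resynthesis tools, the sketch below shifts the pitch and changes the rate of a recorded signal. librosa is used here only as a stand-in engine, and a constant pitch shift stands in for full F0-curve editing; the abstract does not specify Snorri's own method.

        import librosa

        y, sr = librosa.load(librosa.ex("libri1"))                 # placeholder speech
        higher = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)  # raise F0 ~2 semitones
        faster = librosa.effects.time_stretch(y, rate=1.25)        # 25% faster rate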

    Time-domain concatenative text-to-speech synthesis.

    A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures. For text-to-speech purposes, a segmental timing system is described which predicts segmental durations at the word level, using a timing database and a pattern-matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using 'default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a 'judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data are linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general. Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-to-speech synthesisers, and were ranked fifth most natural out of six.
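
    The timing look-up with 'default duration' fallback described above can be pictured with a minimal sketch. The database layout, manner classes and duration values below are hypothetical illustrations, not the thesis's actual data.

        # timing database: word -> list of (segment, duration_ms)
        TIMING_DB = {
            "cat": [("k", 80), ("a", 120), ("t", 90)],
        }

        # hypothetical manner-class defaults, built by re-categorising existing data
        MANNER_OF = {"k": "stop", "t": "stop", "p": "stop", "a": "vowel", "i": "vowel"}
        DEFAULT_MS = {"stop": 85, "vowel": 115}

        def segment_durations(word, segments):
            """Return a duration for each segment, falling back to manner defaults."""
            known = dict(TIMING_DB.get(word, []))
            durations = []
            for seg in segments:
                if seg in known:
                    durations.append(known[seg])          # direct timing-database hit
                else:
                    manner = MANNER_OF.get(seg, "vowel")  # re-categorise by manner
                    durations.append(DEFAULT_MS[manner])  # 'default duration' entry
            return durations

        print(segment_durations("cap", ["k", "a", "p"]))  # -> [85, 115, 85] via defaults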

    A study on reusing resources of speech synthesis for closely-related languages

    This thesis describes research on building a text-to-speech (TTS) framework that can accommodate the lack of linguistic information for under-resourced languages by using existing resources from another language. It describes the adaptation process required when such limited resources are used. The main natural languages involved in this research are Malay and Iban. The thesis includes a study on grapheme-to-phoneme mapping and the substitution of phonemes. A set of substitution matrices is presented which shows the phoneme confusion, in terms of perception, among respondents. The experiments conducted study intelligibility as well as perception based on the context of utterances. A study of phonetic prosody is then presented and compared to the Klatt duration model, to establish whether a cross-language duration model exists. A comparative study of an Iban native speaker with an Iban polyglot TTS using Malay resources is then presented, to confirm that the prosody of Malay can be used to generate Iban synthesised speech. The central hypothesis of this thesis is that by using the resources of a closely-related language, natural-sounding speech can be produced. The aim of this research was to show that, by adhering to the characteristics of the indigenous language, it is possible to build a polyglot synthesised speech system even with insufficient speech resources.
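
    A minimal sketch of the cross-language phoneme substitution idea studied here: Iban phonemes missing from a Malay inventory are replaced by their closest Malay counterparts. The inventory and substitution table below are hypothetical examples, not the thesis's measured substitution matrices.

        # hypothetical inventory of the Malay voice
        MALAY_PHONEMES = {"a", "i", "u", "e", "o", "k", "g", "s", "m", "n", "ng"}

        # hypothetical best-substitute table, e.g. the row maximum of a
        # perceptual substitution matrix
        SUBSTITUTE = {"ə": "e", "ʔ": "k"}

        def map_to_malay(iban_phonemes):
            """Replace out-of-inventory Iban phonemes with Malay substitutes."""
            mapped = []
            for ph in iban_phonemes:
                if ph in MALAY_PHONEMES:
                    mapped.append(ph)                      # shared phoneme, keep it
                else:
                    mapped.append(SUBSTITUTE.get(ph, ph))  # fall back to substitute
            return mapped

        print(map_to_malay(["m", "a", "ʔ"]))  # -> ['m', 'a', 'k']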

    Construction of perception stimuli with copy synthesis

    A number of experiments in perception require the construction of speech-like stimuli whose acoustic content can be manipulated easily. Formant synthesis offers the possibility of editing all the parameters of speech. However, the construction of stimuli by hand is a very laborious task, and automatic tools are therefore necessary. This paper describes two main extensions of a previously proposed copy synthesis algorithm. The first concerns formant tracking, which relies on a concurrent curve strategy. The second is a pitch-synchronous amplitude adjustment algorithm that enables the capture of fast-varying amplitude transitions in consonants. In addition, the automatic determination of the source parameters, through the computation of F0 and of the frication-to-voicing ratio, enables speech signals to be copied automatically. This copy synthesis is evaluated on sentences and V-Stop-V stimuli.
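
    One of the source parameters mentioned above, the frication-to-voicing ratio, can be sketched as a per-frame band-energy ratio. The 4000 Hz cutoff follows the companion copy-synthesis paper above; the windowing is an illustrative choice.

        import numpy as np

        def frication_to_voicing_ratio(frame, sr, cutoff_hz=4000.0):
            """Energy above the cutoff over energy below it, for one frame."""
            spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
            freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
            high = spec[freqs >= cutoff_hz].sum()   # frication-dominated band
            low = spec[freqs < cutoff_hz].sum()     # voicing-dominated band
            return high / (low + 1e-12)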

    Synthetic voice design and implementation.

    The limitations of speech output technology emphasise the need for exploratory psychological research to maximise the effectiveness of speech as a display medium in human-computer interaction. Stage 1 of this study reviewed speech implementation research, focusing on general issues for tasks, users and environments. An analysis of design issues was conducted, related to the differing methodologies for synthesised and digitised message production. A selection of ergonomic guidelines was developed to enhance effective speech interface design. Stage 2 addressed the negative reactions of users to synthetic speech in spite of elegant dialogue structure and appropriate functional assignment. Synthetic speech interfaces have been consistently rejected by their users in a wide variety of application domains because of their poor quality. Indeed, the literature repeatedly emphasises quality as being the most important contributor to implementation acceptance. In order to investigate this, a converging operations approach was adopted. This consisted of a series of five experiments (and associated pilot studies) which homed in on the specific characteristics of synthetic speech that determine listeners' varying perceptions of its qualities, and how these might be manipulated to improve its aesthetics. A flexible and reliable ratings interface was designed to display DECtalk speech variations and record listeners' perceptions. In experiment one, 40 participants used this to evaluate synthetic speech variations on a wide range of perceptual scales. Factor analysis revealed two main factors: "listenability", accounting for 44.7% of the variance and correlating with the DECtalk "smoothness" parameter at .57 (p<0.005) and with "richness" at .53 (p<0.005); and "assurance", accounting for 12.6% of the variance and correlating with "average pitch" at .42 (p<0.005) and "head size" at .42 (p<0.005). Complementary experiments were then required in order to address appropriate voice design for enhanced listenability and assurance perceptions. With a standard male voice set, 20 participants rated enhanced smoothness and attenuated richness as contributing significantly to speech listenability (p<0.001). Experiment three, using a female voice set, yielded comparable results, suggesting that further refinements of the technique were necessary in order to develop an effective methodology for speech quality optimisation. At this stage it became essential to focus directly on the parameter modifications that are associated with the aesthetically pleasing characteristics of synthetic speech. If a reliable technique could be developed to enhance perceived speech quality, then synthesis systems based on the commonly used DECtalk model might realise some of their considerable yet unfulfilled potential. In experiment four, 20 subjects rated a wide range of voices modified across the two main parameters associated with perceived listenability: smoothness and richness. The results clearly revealed a linear relationship between enhanced smoothness and attenuated richness and significant improvements in perceived listenability (p<0.001 in both cases). Planned comparisons between the different levels of the parameters revealed significant listenability enhancements as smoothness was increased, and a similar pattern as richness decreased. Statistical analysis also revealed a significant interaction between the two parameters (p<0.001), allowing a more comprehensive picture to be constructed.
In order to expand the focus and enhance the generality of the research, it was now necessary to assess the effects of synthetic speech modifications whilst subjects were undertaking a more realistic task. Passively rating the voices independent of processing for meaning is arguably an artificial task which rarely, if ever, would occur in 'real-world' settings. In order to investigate perceived quality in a more realistic task scenario, experiment five introduced two levels of information processing load. The purpose of this experiment was firstly to see if a comprehension load modified the pattern of listenability enhancements, and secondly to see if that pattern differed between high and low load. Techniques for introducing cognitive load were investigated, and comprehension load was selected as the most appropriate method in this case. A pilot study distinguished two levels of comprehension load from a set of 150 true/false sentences, and these were recorded across the full range of parameter modifications. Twenty subjects then rated the voices using the established listenability scales as before, but also performed the additional task of processing each spoken stimulus for meaning and determining the authenticity of the statements. Results indicated that listenability enhancements did indeed occur at both levels of processing, although at the higher level variations in the pattern occurred. A significant difference was revealed between optimal parameter modifications for conditions of high and low cognitive load (p<0.05). The results showed that subjects perceived the synthetic voices in the high cognitive load condition to be significantly less listenable than those same voices in the low cognitive load condition. The analysis also revealed that this effect was independent of the number of errors made. This result may be of general value because conclusions drawn from these findings are independent of any particular parameter modifications that may be exclusively available to DECtalk users. Overall, the study presents a detailed analysis of the research domain combined with a systematic experimental program of synthetic speech quality assessment. The experiments reported establish a reliable and replicable procedure for optimising the aesthetically pleasing characteristics of DECtalk speech, but the implications of the research extend beyond the boundaries of a particular synthesiser. Results from the experimental program lead to a number of conclusions, the most salient being that not only does the synthetic speech designer have to overcome the general rejection of synthetic voices based on their poor quality by sophisticated customisation of synthetic voice parameters, but that he or she needs to take into account the cognitive load of the task being undertaken. The interaction between cognitive load and optimal settings for synthesis requires direct consideration if synthetic speech systems are going to realise and maximise their potential in human-computer interaction.
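
    The factor-analytic step of experiment one can be pictured with a minimal sketch: ratings across perceptual scales are decomposed into two factors, standing in for "listenability" and "assurance". The random data below are placeholders for the 40 participants' actual ratings.

        import numpy as np
        from sklearn.decomposition import FactorAnalysis

        rng = np.random.default_rng(0)
        # rows: rated voice stimuli; columns: perceptual rating scales
        ratings = rng.normal(size=(40, 12))

        fa = FactorAnalysis(n_components=2, random_state=0)
        scores = fa.fit_transform(ratings)   # per-stimulus factor scores
        loadings = fa.components_            # how each scale loads on each factor

        # rough variance-explained share per factor (sum of squared loadings)
        explained = (loadings ** 2).sum(axis=1)
        print(explained / ratings.var(axis=0).sum())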

    HMM-based speech synthesis using an acoustic glottal source model

    Parametric speech synthesis has received increased attention in recent years following the development of statistical HMM-based speech synthesis. However, the speech produced using this method still does not sound as natural as human speech, and there is limited parametric flexibility to replicate voice quality aspects, such as breathiness. The hypothesis of this thesis is that speech naturalness and voice quality can be more accurately replicated by an HMM-based speech synthesiser using an acoustic glottal source model, the Liljencrants-Fant (LF) model, to represent the source component of speech instead of the traditional impulse train. Two different analysis-synthesis methods were developed during this thesis in order to integrate the LF-model into a baseline HMM-based speech synthesiser, which is based on the popular HTS system and uses the STRAIGHT vocoder. The first method, called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model signal through a glottal post-filter to obtain the source signal and then generating speech by passing this source signal through the spectral envelope filter. The system which uses the GPF method (HTS-GPF system) is similar to the baseline system, but it uses a different source signal instead of the impulse train used by STRAIGHT. The second method, called Glottal Spectral Separation (GSS), generates speech by passing the LF-model signal through the vocal tract filter. The major advantage of the synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic properties of the LF-model parameters are automatically learnt by the HMMs. In this thesis, an initial perceptual experiment was conducted to compare the LF-model to the impulse train. The results showed that the LF-model was significantly better, both in terms of speech naturalness and replication of two basic voice qualities (breathy and tense). In a second perceptual evaluation, the HTS-LF system was better than the baseline system, although the difference between the two had been expected to be more significant. A third experiment was conducted to evaluate the HTS-GPF system and an improved HTS-LF system, in terms of speech naturalness, voice similarity and intelligibility. The results showed that the HTS-GPF system performed similarly to the baseline. However, the HTS-LF system was significantly outperformed by the baseline. Finally, acoustic measurements were performed on the synthetic speech to investigate the speech distortion in the HTS-LF system. The results indicated that a problem in replicating the rapid variations of the vocal tract filter parameters at transitions between voiced and unvoiced sounds is the most significant cause of speech distortion. This problem encourages future work to further improve the system.
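
    The LF model at the heart of both methods can be sketched directly from its published two-segment equations: an exponentially growing sinusoid up to the main excitation instant te, followed by an exponential return phase. In the sketch below the shape constants epsilon and alpha are solved with simple iteration and bisection; the timing values are illustrative, not those used in the thesis.

        import numpy as np

        def lf_pulse(fs=16000, f0=110.0, tp=0.45, te=0.55, ta=0.03, ee=1.0):
            """One LF glottal-flow-derivative period. tp, te, ta are fractions of T0."""
            t0 = 1.0 / f0
            tp, te, ta = tp * t0, te * t0, ta * t0
            tc = t0
            wg = np.pi / tp

            # epsilon from: epsilon*ta = 1 - exp(-epsilon*(tc - te)), by fixed point
            eps = 1.0 / ta
            for _ in range(50):
                eps = (1.0 - np.exp(-eps * (tc - te))) / ta

            def waveform(alpha):
                t_open = np.arange(0, te, 1 / fs)
                t_ret = np.arange(te, tc, 1 / fs)
                open_part = (-ee * np.exp(alpha * (t_open - te))
                             * np.sin(wg * t_open) / np.sin(wg * te))
                ret_part = (-ee / (eps * ta)
                            * (np.exp(-eps * (t_ret - te)) - np.exp(-eps * (tc - te))))
                return open_part, ret_part

            # alpha from the zero-net-flow (area balance) condition, by bisection;
            # the net flow decreases monotonically with alpha over this bracket
            lo, hi = -1e4, 1e5
            for _ in range(80):
                mid = 0.5 * (lo + hi)
                o, r = waveform(mid)
                if o.sum() + r.sum() > 0:
                    lo = mid
                else:
                    hi = mid

            open_part, ret_part = waveform(0.5 * (lo + hi))
            return np.concatenate([open_part, ret_part])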

    Adding expressiveness to unit selection speech synthesis and to numerical voice production

    Speech is one of the most natural and direct forms of communication between human beings, as it codifies both a message and paralinguistic cues about the emotional state of the speaker, their mood, or their intention, thus becoming instrumental in pursuing a more natural Human Computer Interaction (HCI). In this context, the generation of expressive speech for the HCI output channel is a key element in the development of assistive technologies or personal assistants, among other applications. Synthetic speech can be generated from recorded speech using corpus-based methods such as Unit Selection (US), which can achieve high-quality results but whose expressiveness is restricted to that available in the speech corpus. In order to improve the quality of the synthesis output, the current trend is to build ever larger speech databases, especially following the so-called End-to-End synthesis approach based on deep learning techniques. However, recording ad-hoc corpora for each and every desired expressive style can be extremely costly, or even unfeasible if the speaker is unable to properly perform the styles required for a given application (e.g., singing in the storytelling domain). Alternatively, new methods based on the physics of voice production have been developed in the last decade thanks to the increase in computing power. For instance, vowels or diphthongs can be obtained using the Finite Element Method (FEM) to simulate the propagation of acoustic waves through a realistic 3D vocal tract geometry obtained from Magnetic Resonance Imaging (MRI). However, since the main efforts in these numerical voice production methods have been focused on improving the modelling of the voice generation process, little attention has been paid to their expressiveness up to now. Furthermore, the collection of data for such simulations is very costly, besides requiring time-consuming manual postprocessing like that needed to extract 3D vocal tract geometries from MRI. The aim of the thesis is to add expressiveness to a system that generates neutral voice, without having to acquire expressive data from the original speaker. On the one hand, expressive capabilities are added to a Unit-Selection Text-to-Speech (US-TTS) system fed with a neutral speech corpus, to address specific and timely needs in the storytelling domain, such as for singing or in suspenseful situations. To this end, speech is parameterised using a harmonic-based model and subsequently transformed to the target expressive style according to an expert system. A first approach dealing with the synthesis of increasing suspense in storytelling shows the viability of the proposal in terms of naturalness and storytelling quality. Singing capabilities are also added to the US-TTS system through the integration of Speech-to-Singing (STS) transformation modules into the TTS pipeline, and by incorporating an expressive prosody generation module that allows the US to select units closer to the target singing prosody obtained from the input score.
This results in a Unit-Selection-based Text-to-Speech-and-Singing (US-TTS&S) synthesis framework that can generate both speech and singing from the same small neutral speech corpus (~2.6 h). According to the objective results, the score-driven US strategy can reduce the pitch-scaling factors required to produce singing from the selected spoken units, but its effectiveness is limited regarding the time-scaling requirements due to the short duration of the spoken vowels. Results from the perceptual tests show that although the obtained naturalness is obviously far from that given by a professional singing synthesiser, the framework can address occasional singing needs for synthetic storytelling with a reasonable quality. The incorporation of expressiveness is also investigated in the 3D FEM-based numerical simulation of vowels, through modifications of the glottal flow signals following a source-filter approach to voice production. These signals are generated using a Liljencrants-Fant (LF) model controlled with the glottal shape parameter Rd, which allows exploring the tense-lax continuum of phonation as well as the spoken vocal range of fundamental frequency values, F0. The contribution of the glottal source to higher-order modes in the FEM synthesis of the cardinal vowels [a], [i] and [u] is analysed through the comparison of the High Frequency Energy (HFE) values obtained with realistic and simplified 3D geometries of the vocal tract. The simulations indicate that higher-order modes are expected to be perceptually relevant according to reference values stated in the literature, particularly for tense phonations and/or high F0s. Conversely, vowels with a lax phonation and/or low F0s can result in inaudible HFE levels, especially if aspiration noise is not present in the glottal source. After this preliminary study, the excitation characteristics of happy and aggressive vowels from a Spanish parallel speech corpus are analysed with the aim of incorporating these tense-voice expressive styles into the numerical production of voice. To that effect, the GlottDNN vocoder is used to analyse F0 and spectral tilt variations associated with the glottal excitation in [a] vowels. These variations are mapped, through comparison with synthetic vowels, onto F0 and Rd values to simulate vowels resembling the happy and aggressive styles. Results show that it is necessary to increase F0 and decrease Rd with respect to neutral speech, with larger variations for the happy than for the aggressive style, especially for the stressed [a] vowels. The results achieved in the conducted investigations validate the possibility of adding expressiveness to both corpus-based US-TTS synthesis and FEM-based numerical simulation of voice. Nevertheless, there is still room for improvement. For instance, the strategy applied to numerical voice production could be improved by studying and developing inverse filtering approaches as well as by incorporating modifications of the vocal tract, whereas the developed US-TTS&S framework could benefit from advances in voice transformation techniques, including voice quality modifications, taking advantage of the experience gained in the numerical simulation of expressive vowels.
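
    The Rd control described here is commonly mapped to LF timing parameters through Fant's prediction formulas; a sketch of that mapping, which could drive an LF pulse generator like the one sketched earlier in this listing, is given below. The quoted validity range is an assumption of this illustration.

        def rd_to_lf_timing(rd, f0):
            """Map Rd (tense ~0.3 ... lax ~2.7) and F0 to LF timings tp, te, ta (s)."""
            t0 = 1.0 / f0
            ra = (-1.0 + 4.8 * rd) / 100.0   # return-phase parameter
            rk = (22.4 + 11.8 * rd) / 100.0  # pulse-asymmetry parameter
            # Rg recovered from the definition of Rd:
            # Rd = (1/0.11) * (0.5 + 1.2*Rk) * (Rk/(4*Rg) + Ra)
            rg = rk / (4.0 * ((0.11 * rd / (0.5 + 1.2 * rk)) - ra))
            tp = t0 / (2.0 * rg)             # instant of peak glottal flow
            te = tp * (1.0 + rk)             # instant of main excitation
            ta = ra * t0                     # effective return-phase duration
            return tp, te, ta

        # tenser phonation (lower Rd) gives a shorter return phase ta
        print(rd_to_lf_timing(0.8, f0=110.0))
        print(rd_to_lf_timing(2.0, f0=110.0))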

    Production and Perception of Fast Speech

    This thesis reports on a series of experiments investigating how speakers produce and listeners perceive fast speech. The main research question is how the perception of naturally produced fast speech compares to the perception of artificially time-compressed speech. Research has shown that listeners can understand speech at much faster rates than they can produce themselves. The current study attempts to account for this discrepancy and addresses the following questions: Why is speech intelligibility relatively unaffected by time compression? How do segmental intelligibility, prosodic patterns and other sources of information contribute? Does the intelligibility of synthetic speech suffer more from time compression than that of natural speech, and if so, why? Several intelligibility experiments were set up to answer these questions. Whereas artificial time compression of speech is normally conducted in a linear way, production studies on normal-rate and fast-rate speech have shown that speakers compress some parts more than others. When speakers speed up, unstressed syllables are shortened more, relatively, than stressed syllables. Thus, the prosodic pattern of fast-rate speech is even more pronounced than that at a normal speech rate. This raises the question whether this natural non-linear way of speeding up might reflect a communicative strategy to preserve the stressed syllables, which are the most informative ones. Speakers are claimed to tailor their speech to the needs of the listener. Furthermore, prosodic patterns are known to be an important source of information under adverse listening conditions. Therefore, this study investigates whether modelling the temporal pattern of artificially time-compressed speech in accordance with the temporal pattern of natural fast speech improves intelligibility and ease of processing over linear compression. Secondly, it is investigated whether listeners find artificially time-compressed speech more difficult to process than naturally produced fast speech. It turns out that both the changed temporal pattern of naturally produced fast speech and its increased slurring, or reduced articulation, make naturally produced fast speech more difficult to process than artificially time-compressed speech. This means that both the temporal and the segmental changes that speakers apply when speeding up their speech rate do not make perception easier for the listener, but are due to speakers' inability to speed up otherwise. The findings are considered in relation to current models of speech production and perception. This study is of interest to phoneticians, phonologists, and psycholinguists, as well as researchers working in the domain of speech technology.
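
    The linear versus non-linear compression contrast investigated here can be pictured with a minimal sketch in which stressed segments are compressed less than unstressed ones, mimicking natural fast speech. The segment boundaries, stress marks and compression factors below are illustrative assumptions; librosa's phase-vocoder time stretching is used as the compression engine.

        import numpy as np
        import librosa

        def nonlinear_compress(y, sr, segments, stressed_rate=1.3, unstressed_rate=1.7):
            """segments: list of (start_s, end_s, is_stressed) covering the signal."""
            out = []
            for start, end, is_stressed in segments:
                chunk = y[int(start * sr):int(end * sr)]
                rate = stressed_rate if is_stressed else unstressed_rate
                out.append(librosa.effects.time_stretch(chunk, rate=rate))
            return np.concatenate(out)

        y, sr = librosa.load(librosa.ex("trumpet"))        # placeholder signal
        segments = [(0.0, 0.5, True), (0.5, 1.0, False)]   # hypothetical stress marks
        fast = nonlinear_compress(y, sr, segments)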

    Diphthong Synthesis using the Three-Dimensional Dynamic Digital Waveguide Mesh

    The human voice is a complex and nuanced instrument, and despite many years of research, no system is yet capable of producing natural-sounding synthetic speech. This affects intelligibility for some groups of listeners, in applications such as automated announcements and screen readers. Furthermore, those who require a computer to speak - due to surgery or a degenerative disease - are limited to unnatural-sounding voices that lack expressive control and may not match the user's gender, age or accent. It is evident that natural, personalised and controllable synthetic speech systems are required. A three-dimensional digital waveguide model of the vocal tract, based on magnetic resonance imaging data, is proposed here in order to address these issues. The model uses a heterogeneous digital waveguide mesh method to represent the vocal tract airway and surrounding tissues, facilitating dynamic movement and hence speech output. The accuracy of the method is validated by comparison with audio recordings of natural speech, and perceptual tests are performed which confirm that the proposed model sounds significantly more natural than simpler digital waveguide mesh vocal tract models. Control of such a model is also considered, and a proof-of-concept study is presented using a deep neural network to control the parameters of a two-dimensional vocal tract model, resulting in intelligible speech output and paving the way for extension of the control system to the proposed three-dimensional vocal tract model. Future improvements to the system are also discussed in detail. This project considers both the naturalness and control issues associated with synthetic speech and therefore represents a significant step towards improved synthetic speech for use across society.
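
    The scattering behaviour of a digital waveguide mesh can be pictured with a minimal 2D sketch, the simpler relative of the dynamic 3D mesh proposed here: each junction pressure is updated from the average of its four neighbours minus its own value two steps back. The grid size, impulse excitation and periodic boundaries (via np.roll) are simplifications; a real vocal tract mesh models boundary impedances and tissue heterogeneity.

        import numpy as np

        NX, NY, STEPS = 40, 20, 200
        p = np.zeros((STEPS, NX, NY))    # junction pressures over time
        p[1, NX // 2, NY // 2] = 1.0     # impulse excitation at the centre

        for n in range(1, STEPS - 1):
            # average of the four rectilinear neighbours
            avg = 0.25 * (np.roll(p[n], 1, 0) + np.roll(p[n], -1, 0) +
                          np.roll(p[n], 1, 1) + np.roll(p[n], -1, 1))
            # lossless DWM junction update: p(n+1) = 2*avg - p(n-1)
            p[n + 1] = 2 * avg - p[n - 1]

        # pressure received at an "output" junction over time
        output = p[:, NX - 2, NY // 2]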