10 research outputs found

    Adaptation of respiratory patterns in collaborative reading

    Speech and variations in the respiratory chest circumference of eight French dyads were monitored while they read texts under increasing constraints on mutual synchrony. In line with previous research, we find that speakers mutually adapt their respiratory patterns. However, significant alignment is observed only when speakers need to perform together, i.e. when reading in alternation or synchronously. Across quiet breathing, listening and read speech, we did not find the gradual asymmetric shaping of respiratory cycles generally assumed in the literature (e.g. from symmetric inhalation and exhalation phases towards short inhalation and long exhalation). Instead, the control of breathing seems to switch abruptly between two systems: vital versus speech production. We also find that the syllabic and respiratory cycles are strongly phased at speech onsets. This phenomenon is in agreement with the quantal nature of speech rhythm beyond the utterance, previously observed via pause durations.

    A corpus study of durational rhythm measures in the Kalhori dialect of Kurdish

    One of the methods phoneticians use to identify between-sentence and between-speaker variability is the study of durational rhythmic features. In the present research, durational speech rhythm measures were analyzed in order to classify the speech rhythm of Kalhori, a variety of Kurdish, and to determine the most appropriate measures of between-sentence and between-speaker rhythmic variability in Kalhori. Two speaking styles (read and spontaneous) were explored. The analysis of the read corpus revealed that the rhythm of Kalhori Kurdish falls between stress-timed and syllable-timed. The results indicated that %V (the proportion of speech that is vocalic) was the most significant measure for distinguishing between-sentence rhythmic variability in the read corpus, while %V and rateSyl (syllable rate) were the most efficient measures for identifying between-speaker rhythmic variability in both the read and spontaneous corpora.
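
    The two measures named above can be computed directly from a phone-level segmentation. The sketch below is a minimal illustration, not the authors' tooling: the input format, the vowel inventory, and the use of vocalic intervals as a proxy for syllable counts are assumptions made for the example.

```python
# Minimal sketch (not from the paper): computing %V and rateSyl from a
# phone-level segmentation. Input format and vowel inventory are assumed.

VOWELS = {"a", "e", "i", "o", "u"}  # assumed vowel inventory

def rhythm_measures(segments):
    """segments: list of (label, duration_in_seconds) phone intervals."""
    total = sum(d for _, d in segments)
    vocalic = sum(d for label, d in segments if label in VOWELS)
    percent_v = 100.0 * vocalic / total          # %V: vocalic proportion
    # Approximate the syllable count by the number of vocalic intervals,
    # a common simplification when no syllable tier is available.
    n_syllables = sum(1 for label, _ in segments if label in VOWELS)
    rate_syl = n_syllables / total               # rateSyl: syllables per second
    return percent_v, rate_syl

if __name__ == "__main__":
    demo = [("k", 0.08), ("a", 0.12), ("l", 0.06), ("o", 0.10), ("r", 0.07), ("i", 0.11)]
    print(rhythm_measures(demo))
```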

    At least two macrorhythmic units are necessary for modeling Brazilian Portuguese duration: emphasis on automatic segmental duration generation

    Modeling acoustic duration in Brazilian Portuguese (BP) has made it possible to derive an accentual typology revealing the existence of at least two macrorhythmic programming units: the syllable and the GIPC. The z-score maxima of syllable durations coincide with the position of lexical stress, while the z-score maxima of GIPC durations (coinciding with lexical stress position) mark the prosodic boundaries of the utterance. A rhythmic model enabling simplified automatic generation of segmental durations is proposed for integration into a BP speech synthesis system.
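
    As a rough illustration of the kind of quantity the abstract refers to, the sketch below computes duration z-scores over a sequence of rhythmic units and locates their local maxima. It is a simplification under an assumed data layout (a plain list of durations), not the paper's model, which relates such maxima to lexical stress and prosodic boundaries.

```python
# Minimal sketch (assumed data layout, not the paper's implementation):
# duration z-scores for a sequence of rhythmic units (syllables or GIPCs),
# whose local maxima are the points the abstract relates to lexical stress
# and prosodic boundaries.

import statistics

def duration_zscores(durations):
    """durations: observed durations (seconds) of consecutive units."""
    mean = statistics.mean(durations)
    sd = statistics.stdev(durations)
    return [(d - mean) / sd for d in durations]

def local_maxima(z):
    """Indices where the z-score is higher than both neighbours."""
    return [i for i in range(1, len(z) - 1) if z[i] > z[i - 1] and z[i] > z[i + 1]]

if __name__ == "__main__":
    syllable_durations = [0.14, 0.22, 0.12, 0.11, 0.31, 0.13]
    z = duration_zscores(syllable_durations)
    print(local_maxima(z))  # candidate stressed / phrase-final positions
```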

    Automatic reading of numbers

    This work presents the development of a system that performs automatic reading of numbers. The prosody given to a number in final position in a sequence is different from the prosody used for numbers in other positions. The system reads integers in European Portuguese from 0 (zero) to 999,999,999 (nine hundred and ninety-nine million, nine hundred and ninety-nine thousand, nine hundred and ninety-nine), dates in the dd-mm-yyyy format, landline phone numbers, mobile phone numbers and social security identification numbers. The system first identifies the type of number and then reads it using the algorithm developed for each case; these algorithms were programmed in Matlab. The audio signals were recorded at "Radio Brigantia" with the voice of a professional male announcer and edited using the Praat software. Finally, a listening test was carried out to assess the quality of the sounds produced by each algorithm. Each algorithm was rated on a MOS (Mean Opinion Score) scale from 1 to 5; the overall MOS score of the developed system was 4.46.
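
    The first step described above, identifying the type of number before choosing a reading algorithm, can be sketched with simple pattern matching. The example below is not the thesis code (which was written in Matlab); the digit patterns for dates, phone numbers and social security identifiers are illustrative assumptions.

```python
# Minimal sketch: a rule-based classifier for the input types the system
# distinguishes. The patterns are assumptions about formatting (dd-mm-yyyy
# dates, 9-digit phone numbers, 11-digit social security identifiers), not
# the exact rules used in the work.

import re

PATTERNS = [
    ("date",            re.compile(r"^\d{2}-\d{2}-\d{4}$")),
    ("mobile_phone",    re.compile(r"^9\d{8}$")),      # assumption: mobiles start with 9
    ("landline_phone",  re.compile(r"^2\d{8}$")),      # assumption: landlines start with 2
    ("social_security", re.compile(r"^\d{11}$")),
    ("integer",         re.compile(r"^\d{1,9}$")),     # 0 .. 999 999 999
]

def classify_number(text):
    # Patterns are ordered: ambiguous strings (e.g. a 9-digit integer
    # starting with 9) take the first matching label.
    token = text.strip().replace(" ", "")
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "unknown"

if __name__ == "__main__":
    for sample in ["25-04-1974", "912345678", "273456789", "12345678901", "1500"]:
        print(sample, "->", classify_number(sample))
```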

    Bayesian networks for predicting duration of phones

    In a concatenative text-to-speech (TTS) system, the duration of a phonetic segment (phone) is predicted by a duration model, usually trained on a database of feature vectors, each consisting of the values of a set of linguistic factors (attributes) describing a phone in a particular context. In general, the databases used to train phone duration models are unbalanced. However, it has been shown that the probability of a rare feature vector occurring even in a small sample of text is quite high. Furthermore, factors affecting a phone's duration interact: a set of two or more factors may amplify or attenuate the effect of other factors. A robust model for predicting phone duration must generalise well in order to successfully predict the durations of phones with these rare feature vectors. Since the linguistic factors affecting segment duration interact, we would expect that modelling these factor interactions will give a better model. A number of models have been developed for predicting a phone's duration, ranging from rule-based models to neural networks, classification and regression trees (CART) and sums-of-products (SoP) models. In the CART model, a phone's duration is predicted by a decision tree. The tree is built by recursively clustering the training data into subsets that share common values for certain attributes of the feature vectors. The duration of a phone is then predicted by using the tree to find the data cluster that matches as many of the feature vector's attributes as possible. The CART model is easy to build and robust to errors in the data, but performs poorly when the proportion of missing data is too high. In the SoP model, the log of a phone's duration is predicted as a sum of factor product terms. The SoP model predicts phone duration with high accuracy, even in cases of hidden or missing data; however, this comes at the cost of substantial data pre-processing. In addition, the number of different sums-of-products models grows hyper-exponentially with the number of factors, so heuristic search techniques must be used to find the model that best fits the data. In our work, we use a Bayesian belief network (BN) consisting of discrete nodes for the linguistic factors and a single continuous node for the phone's duration. Interactions between factors are represented as conditional dependency relations in this graphical model. During training, the parameters of the belief network are learned via the Expectation Maximisation (EM) algorithm. The duration of each phone in the test set is then predicted via Bayesian inference: given the parameters of the belief network, we calculate the probability of the phone taking on a particular duration given the observations of the linguistic variables, and the duration value with the maximum probability is chosen as the phone's duration. We contrasted the results of the belief network model with those of the sums-of-products and CART models, training and testing all three models on the same data. In terms of RMS error, our BN model performs better than both the CART and SoP models; in terms of the correlation coefficient, it performs better than the SoP model and no worse than the CART model. We believe our Bayesian model has many advantages over the CART and SoP models. For instance, it captures factor interactions concisely through causal relationships among the variables in the graphical model, and it makes robust predictions of phone duration in cases of missing or hidden data.
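
    A heavily simplified sketch of the underlying idea follows: discrete linguistic factors condition a continuous duration node, modelled here as one Gaussian per observed factor configuration, with the conditional mean (the maximum-probability duration under a Gaussian) returned as the prediction. This is not the thesis's EM-trained belief network; it is a fully observed illustration with an assumed global-mean back-off for unseen contexts.

```python
# Minimal sketch of the core idea (not the thesis implementation): with fully
# observed training data, the maximum-likelihood parameters are just
# per-configuration means; the predicted duration is the mean of the matching
# Gaussian, which is also its maximum-probability value.

from collections import defaultdict
import statistics

class FactoredDurationModel:
    def fit(self, factor_vectors, durations):
        groups = defaultdict(list)
        for factors, dur in zip(factor_vectors, durations):
            groups[tuple(factors)].append(dur)
        self.means = {k: statistics.mean(v) for k, v in groups.items()}
        self.global_mean = statistics.mean(durations)  # back-off for unseen contexts
        return self

    def predict(self, factors):
        return self.means.get(tuple(factors), self.global_mean)

if __name__ == "__main__":
    # Hypothetical factors: (phone identity, stressed?, phrase-final?)
    X = [("a", 1, 0), ("a", 1, 1), ("a", 0, 0), ("t", 0, 0)]
    y = [0.11, 0.19, 0.08, 0.05]
    model = FactoredDurationModel().fit(X, y)
    print(model.predict(("a", 1, 1)))   # seen context -> its mean
    print(model.predict(("t", 1, 1)))   # unseen context -> global-mean back-off
```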

    Automatic dictation of texts read aloud: corpus analysis and modeling of enunciation strategies

    The work presented here is situated in the context of text-to-speech synthesis (TTS). It aims to develop an automatic dictation system that reads written text aloud for CM1 and CM2 students (4th and 5th grade in French primary school). The system comprises four modules: 1) a morpho-syntactic analysis module, which builds a tree data structure from the orthographic string; 2) a prosodic segmentation module, which divides the utterance into prosodic groups and relates them to their propositional content; 3) a prosody generation module, which automatically computes the prosody (fundamental frequency contour and speech rhythm) to be applied to those groups according to the prosodic markers; 4) a synthesis module, which converts this segmental and supra-segmental information into an acoustic signal. Syntactic parsing is performed first, and the syntactic tree is then projected onto the syntagmatic axis: sentence-internal prosodic markers are assumed to cue the dependency relations between adjacent constituents. Prosody generation is then performed by associating these markers with multiparametric contours via so-called contour generators. These contour generators are implemented as feed-forward neural networks and trained through an iterative analysis-by-synthesis process (see the description of the SFC "Superposition of Functional Contours" model in Bailly and Holm (2003)) using a corpus of dictations. We finally assessed the quality of the modeling with a synthesis system available at GIPSA-Lab. We show that strategy-specific contour generators are able to capture the slowing of the speech rate, the increased segmentation of words and phrases, and the changes in melodic contours across the four versions of each phrase.
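
    As a hedged illustration of a contour generator, the sketch below trains a small feed-forward network to map a prosodic-group description to per-syllable prosodic targets. The feature and output choices (syllable position, group length, marker id; F0 offset and lengthening coefficient) and all numeric values are invented for the example; this is not the SFC implementation used at GIPSA-Lab.

```python
# Minimal sketch (assumed features and outputs): a small feed-forward network
# acting as a "contour generator", mapping a prosodic-group description to
# prosodic targets for each syllable position.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Toy training data. Inputs: (relative position of the syllable in the group,
# group length in syllables, marker id); outputs: (f0 offset in semitones,
# duration lengthening coefficient). All values are invented for illustration.
X = np.array([
    [0.0, 4, 1], [0.33, 4, 1], [0.66, 4, 1], [1.0, 4, 1],
    [0.0, 3, 2], [0.5, 3, 2], [1.0, 3, 2],
])
y = np.array([
    [0.0, 1.0], [0.5, 1.0], [1.0, 1.05], [2.5, 1.30],
    [0.0, 1.0], [-0.5, 1.0], [-2.0, 1.25],
])

generator = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000, random_state=0)
generator.fit(X, y)

# Generate a contour for a new 5-syllable group carrying marker 1.
positions = np.linspace(0.0, 1.0, 5)
query = np.column_stack([positions, np.full(5, 5), np.full(5, 1)])
print(generator.predict(query))  # one (f0 offset, lengthening) pair per syllable
```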

    Speech synthesis applied to speech-to-speech translation

    In the field of speech technologies, text-to-speech conversion is the automatic generation of an artificial voice that sounds like a human voice reading a text aloud. The conversion process is divided into three basic modules: text processing (normalization of abbreviations, numbers and dates, optionally followed by morphosyntactic tagging, grapheme-to-phoneme conversion and syllabification), prosody generation, and waveform generation. The prosody module produces the prosodic information needed to generate a natural voice: intonational phrases, sentence intonation, phoneme durations and energies, etc. The correct generation of this information has a direct impact on the naturalness and expressiveness of the system. The main goal of this thesis is the development of new algorithms for training prosody generation models for use in text-to-speech systems, and their application in the framework of speech-to-speech translation. For intonation modeling, approaches in the literature generally include a stylization step prior to parameterization; in this thesis, alternatives were studied that avoid this stylization by combining parameterization and intonation model generation in a single integrated process. This approach was judged successful in both objective evaluations (root mean squared error and Pearson correlation coefficient) and subjective ones: listeners rated it above other algorithms from the literature, with a naturalness MOS of 3.55 (4.63 for the original voice) and a quality MOS of 3.78 (4.78 for the original voice). The influence of segmental and suprasegmental factors on duration modeling was also studied, and algorithms were proposed that combine segmental and suprasegmental information, in line with other work in this field. Finally, several phrase break models were analyzed, over both words and accent groups: classification trees (CART), language models (LM) and finite state transducers (FST); using the same data set in all experiments allowed relevant conclusions to be drawn about the differences between these models. A further goal of this thesis was to improve the naturalness, expressiveness and consistency with the source speaker's style in text-to-speech systems by using the prosody of the source speaker as an additional information source in the framework of speech-to-speech translation. Prosody generation algorithms were developed that integrate this additional information for the prediction of intonation, phoneme durations and phrase breaks. Several approaches to transferring intonation from one language to another were studied; the chosen approach is an automatic clustering algorithm that finds tonal movements related across languages, without any limitation on their number, so that this coding can be used for intonation modeling in the target language. Experimental results show an improvement that is larger for closely related languages such as Spanish and Catalan. Although segmental durations were not transferred between languages, the thesis proposes transferring rhythm from one language to the other with a method that combines rhythm transfer and audio synchronization; synchronization is included because of its importance for speech-to-speech translation when video is also involved. Lastly, the thesis proposes a pause transfer technique for speech-to-speech translation based on alignment information; studies on the training data showed the advantage of tuples for this task. Pauses that cannot be transferred with this method are predicted with conventional pause prediction algorithms (CART, CART+LM, FST), taking the already transferred pauses into account.
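
    The conventional break prediction step mentioned last can be sketched with a classification tree over word-junction features, in the spirit of the CART baseline. The features below (punctuation flag, distance since the last break, part-of-speech codes) and the toy data are illustrative assumptions, not the thesis's feature set.

```python
# Minimal sketch (not the thesis system): predicting phrase breaks / pauses at
# word junctions with a classification tree. Features and data are invented.

from sklearn.tree import DecisionTreeClassifier

# Each row describes one word junction:
# [punctuation follows (0/1), words since last break, POS id of left word, POS id of right word]
X = [
    [1, 5, 2, 3],
    [0, 1, 1, 1],
    [0, 7, 3, 2],
    [1, 3, 2, 1],
    [0, 2, 1, 3],
    [0, 9, 3, 3],
]
y = [1, 0, 1, 1, 0, 1]  # 1 = insert a pause at this junction

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Junctions already fixed by pause transfer are kept as-is; the tree is only
# consulted for the remaining junctions.
print(tree.predict([[0, 8, 3, 2], [0, 1, 1, 2]]))
```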

    Les musiques du français parlé

    This volume presents a series of critical essays on the accentuation, rhythm and intonation of contemporary French, offering new insights into the formal and functional characteristics of French prosody from three perspectives: historical, epistemological and descriptive. These properties are interpreted in the light of the latest research on the prosody of languages.