9 research outputs found

    Vowel reduction in conversational speech in French: the role of lexical factors

    No full text
    International audienceIn this study we investigate vowel reduction and the role of some lexical factors in the production of vowels extracted from a corpus of French conversations. Vowel durations and spectral quality are examined with respect to 1. their interaction in the corpus, 2. the position of vowels in words, and 3. word frequency and word category. The analyses are conducted on vowels produced by 16 speakers. Our study provides strong evidence that vowel reduction (decrease in durations and more centralized spectral values) affects most of the vowels in conversational speech. The results show that vowels in final syllables of words were less often reduced while the preceding ones show reduced durations and centralized formant values. Moreover vowels are more reduced in monosyllabic function words than in monosyllabic content words. Nevertheless, we did not find a clear effect of word frequency on vowel durations. Finally, our study shows that vowel reduction depends on several factors related to lexical properties (word category) and to prosodic properties (stress and final lengthening)

    Time-domain concatenative text-to-speech synthesis.

    Get PDF
    A concatenation framework for time-domain concatenative speech synthesis (TDCSS) is presented and evaluated. In this framework, speech segments are extracted from CV, VC, CVC and CC waveforms, and abutted. Speech rhythm is controlled via a single duration parameter, which specifies the initial portion of each stored waveform to be output. An appropriate choice of segmental durations reduces spectral discontinuity problems at points of concatenation, thus reducing reliance upon smoothing procedures. For text-to-speech considerations, a segmental timing system is described, which predicts segmental durations at the word level, using a timing database and a pattern matching look-up algorithm. The timing database contains segmented words with associated duration values, and is specific to an actual inventory of concatenative units. Segmental duration prediction accuracy improves as the timing database size increases. The problem of incomplete timing data has been addressed by using `default duration' entries in the database, which are created by re-categorising existing timing data according to articulation manner. If segmental duration data are incomplete, a default duration procedure automatically categorises the missing speech segments according to segment class. The look-up algorithm then searches the timing database for duration data corresponding to these re-categorised segments. The timing database is constructed using an iterative synthesis/adjustment technique, in which a `judge' listens to synthetic speech and adjusts segmental durations to improve naturalness. This manual technique for constructing the timing database has been evaluated. Since the timing data is linked to an expert judge's perception, an investigation examined whether the expert judge's perception of speech naturalness is representative of people in general. Listening experiments revealed marked similarities between an expert judge's perception of naturalness and that of the experimental subjects. It was also found that the expert judge's perception remains stable over time. A synthesis/adjustment experiment found a positive linear correlation between segmental durations chosen by an experienced expert judge and duration values chosen by subjects acting as expert judges. A listening test confirmed that between 70% and 100% intelligibility can be achieved with words synthesised using TDCSS. In a further test, a TDCSS synthesiser was compared with five well-known text-to-speech synthesisers, and was ranked fifth most natural out of six. An alternative concatenation framework (TDCSS2) was also evaluated, in which duration parameters specify both the start point and the end point of the speech to be extracted from a stored waveform and concatenated. In a similar listening experiment, TDCSS2 stimuli were compared with five well-known text-tospeech synthesisers, and were ranked fifth most natural out of six

    Automatic prosodic analysis for computer aided pronunciation teaching

    Get PDF
    Correct pronunciation of spoken language requires the appropriate modulation of acoustic characteristics of speech to convey linguistic information at a suprasegmental level. Such prosodic modulation is a key aspect of spoken language and is an important component of foreign language learning, for purposes of both comprehension and intelligibility. Computer aided pronunciation teaching involves automatic analysis of the speech of a non-native talker in order to provide a diagnosis of the learner's performance in comparison with the speech of a native talker. This thesis describes research undertaken to automatically analyse the prosodic aspects of speech for computer aided pronunciation teaching. It is necessary to describe the suprasegmental composition of a learner's speech in order to characterise significant deviations from a native-like prosody, and to offer some kind of corrective diagnosis. Phonological theories of prosody aim to describe the suprasegmental composition of speech..

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    Get PDF
    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Titne-Domain Pitch-Synchronous OverLap-Add (I'D-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the `best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of `buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co- occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion. The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processingdistortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes,trained using the experimental data collected during the listening tests.The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes. The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggeststhat the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community

    Időzítési mintázatok a magyar beszédben

    Get PDF
    A Beszéd – Kutatás – Alkalmazás sorozat nyolcadik köteteként megjelenő mű egy olyan vizsgálatsorozatot mutat be, amely hiánypótló a magyar beszéddel foglalkozó szakirodalomban. Bár általánosságban a magyar beszéd időzítésével kapcsolatban sok leírás megjelent már, a beszédritmus eddigi megközelítései, az ezzel kapcsolatos korábbi állítások a ritmus sajátosságainak megragadhatatlanságát, a sok változóból adódó bizonytalanságot sugallták, nem véletlenül. A jelenség valóban soktényezős, az adatok gyakran ellentmondásosak, nehéz fogódzókat találni a megfelelő módszertan kialakításához. Kohári Anna azonban vette a bátorságot, hogy erre az ingoványos talajra lépjen, és útját siker koronázta. A nemzetközi szakirodalom széles körének ismeretében új, korábban a (magyar) beszédre még nem alkalmazott módszertanokat használva, különféle metódusokat ötvözve, szorgalmas, aprólékos és szisztematikus elemző munkával jutott el azon eredményekig és megállapításokig, amelyek ebben a kötetben napvilágot látnak, és amelyek a legkorszerűbb ismereteinket foglalják össze a (magyar) beszéd ritmusának vonatkozásában. A kötet bevezetése tananyagként is használható, mivel áttekinti és értelmezi a vonatkozó tudományos fogalomkészletet és terminológiát, továbbá számot ad a nemzetközi és a magyar kutatási eredményekről a legutóbbi időkig bezárólag. A második fejezettől kezdődően a szerző saját kutatásának lépéseit ismerjük meg, az elemzett anyag, az alkalmazott módszerek és az eredmények részletes, jól illusztrált áttekintését kapja az olvasó. Kohári Anna arra is rámutat, hogy a kapott eredmények mely területeken és milyen módon hasznosulhatnak, valamint kijelöli a további kutatások lehetséges irányait is. Mindezek alapján a kötet nemcsak a szűkebb, fonetikusokból álló olvasótábor érdeklődésére tarthat számot, hanem olyan területek művelői is építhetnek a benne foglalt ismeretekre, amelyek a beszéd időzítéséhez bármilyen módon kapcsolódnak, a logopédiától a beszédtechnológiáig. Az a kutatásmódszertani innováció, amelyre a kötet példát ad, azonban még távolabbi, a beszédhez nem vagy kevésbé kapcsolódó, de az időzítés mintázatait magában rejtő jelenségek leírásában is haszonnal kecsegtet. Így a könyv bátran ajánlható a szélesebb érdeklődő közönség számára is

    Concatenative speech synthesis : a framework for reducing perceived distortion when using the TD-PSOLA algorithm

    Get PDF
    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Titne-Domain Pitch-Synchronous OverLap-Add (I'D-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the `best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of `buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co- occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion. The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processingdistortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes,trained using the experimental data collected during the listening tests.The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes. The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggeststhat the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Vokalická délka: srovnání českého a francouzského vokalického systému z hlediska fonetického a fonologického. Kontrastivní studie s pedagogickým zaměřením

    Get PDF
    NÁZEV: Vokalická délka: srovnání českého a francouzského vokalického systému z hlediska fonetického a fonologického. Kontrastivní studie s pedagogickým zaměřením AUTOR: Kateřina Vychopňová KATEDRA: Katedra francouzského jazyka a literatury ŠKOLITEL: prof. PhDr. Marie DOHALSKÁ, DrSc. / prof. Philippe MARTIN Dizertační práce, jedna z francouzsko-českých kontrastivních studií, se zabývá problematikou vokalické délky v českém a francouzském jazyce. První část práce, jež byla vytvořena se záměrem možného pedagogického využití, představuje srovnání hlavních fonetických a fonologických charakteristik obou jazyků, jejich typologii z hlediska vokalické délky a její vývoj napříč historií obou jazyků. Druhá část, experimentální, přináší popis a výsledky několika pokusů, jež sledují, zda francouzští mluvčí, učící se češtině, respektují délku českých samohlásek, dále pak vliv následujících konsonantů, finálního a iniciálního přízvuku, jakožto i typu slabiky na délku samohlásek francouzských, a to jak v produkci francouzských rodilých mluvčích, tak českých mluvčích učících se francouzštině. Výsledky potvrzují, že artikulační návyky, jež používají obě skupiny mluvčích v jazyce mateřském, mají značný vliv na produkci jazyka osvojovaného. Ve třetí části jsou představeny výsledky analýz učebních souborů...TITLE: Vowel Lenght: Comparison of Czech and French Vowel Systems from the Phonetic and Phonological Point of View. A Pedagogically Focused Contrastive Study. AUTHOR: Kateřina Vychopňová DEPARTMENT: Department of French Language and Literature SUPERVISOR: prof. PhDr. Marie DOHALSKÁ, DrSc./ prof. Philippe MARTIN The thesis, one of the French-Czech contrastive studies, deals with the issue of vowel length in Czech and French language. The first part of the thesis, which was intentionally written for a potential pedagogic use, introduces the comparison of the main phonetic and phonological characteristics of both languages, their typology considering vowels length and its development through the history of these languages. The second part, experimental, brings the description and results of several attempts which observe whether the French native speakers learning Czech respect the length of Czech vowels, further the influence of following consonants, final and initial stress and also the type of a syllable on the length of French vowels, both in the production of French native speakers and Czech native speakers learning French. The results prove that articulatory habits which both groups of speakers use in their mother tongue have a significant influence on the production of the acquired language. In...Katedra francouzského jazyka a literaturyFaculty of EducationPedagogická fakult

    Do grafema ao gesto : contributos linguísticos para um sistema de síntese de base articulatória

    Get PDF
    Doutoramento em LinguísticaMotivados pelo propósito central de contribuir para a construção, a longo prazo, de um sistema completo de conversão de texto para fala, baseado em síntese articulatória, desenvolvemos um modelo linguístico para o português europeu (PE), com base no sistema TADA (TAsk Dynamic Application), que visou a obtenção automática da trajectória dos articuladores a partir do texto de entrada. A concretização deste objectivo ditou o desenvolvimento de um conjunto de tarefas, nomeadamente 1) a implementação e avaliação de dois sistemas de silabificação automática e de transcrição fonética, tendo em vista a transformação do texto de entrada num formato adequado ao TADA; 2) a criação de um dicionário gestual para os sons do PE, de modo a que cada fone obtido à saída do conversor grafema-fone pudesse ter correspondência com um conjunto de gestos articulatórios adaptados para o PE; 3) a análise do fenómeno da nasalidade à luz dos princípios dinâmicos da Fonologia Articulatória (FA), com base num estudo articulatório e perceptivo. Os dois algoritmos de silabificação automática implementados e testados fizeram apelo a conhecimentos de natureza fonológica sobre a estrutura da sílaba, sendo o primeiro baseado em transdutores de estados finitos e o segundo uma implementação fiel das propostas de Mateus & d'Andrade (2000). O desempenho destes algoritmos – sobretudo do segundo – mostrou-se similar ao de outros sistemas com as mesmas potencialidades. Quanto à conversão grafema-fone, seguimos uma metodologia baseada em regras de reescrita combinada com uma técnica de aprendizagem automática. Os resultados da avaliação deste sistema motivaram a exploração posterior de outros métodos automáticos, procurando também avaliar o impacto da integração de informação silábica nos sistemas. A descrição dinâmica dos sons do PE, ancorada nos princípios teóricos e metodológicos da FA, baseou-se essencialmente na análise de dados de ressonância magnética, a partir dos quais foram realizadas todas as medições, com vista à obtenção de parâmetros articulatórios quantitativos. Foi tentada uma primeira validação das várias configurações gestuais propostas, através de um pequeno teste perceptual, que permitiu identificar os principais problemas subjacentes à proposta gestual. Este trabalho propiciou, pela primeira vez para o PE, o desenvolvimento de um primeiro sistema de conversão de texto para fala, de base articulatória. A descrição dinâmica das vogais nasais contou, quer com os dados de ressonância magnética, para caracterização dos gestos orais, quer com os dados obtidos através de articulografia electromagnética (EMA), para estudo da dinâmica do velo e da sua relação com os restantes articuladores. Para além disso, foi efectuado um teste perceptivo, usando o TADA e o SAPWindows, para avaliar a sensibilidade dos ouvintes portugueses às variações na altura do velo e alterações na coordenação intergestual. Este estudo serviu de base a uma interpretação abstracta (em termos gestuais) das vogais nasais do PE e permitiu também esclarecer aspectos cruciais relacionados com a sua produção e percepção.Motivated by the central purpose of contributing for the construction, in the long term, of a complete text-to-speech system based in articulatory synthesis, we develop a linguistic model for European Portuguese (EP), based on TADA system (TAsk Dynamic Application), that aimed at the automatic attainment of the articulators trajectory from the input text. The specification of this purpose determined the development of a set of tasks, namely the 1) implementation and evaluation of two automatic syllabification systems and two grapheme-to-phoneme (G2P) conversion systems, in view of the transformation of the input in an appropriate format to the TADA; 2) the creation of a gestural database for the EP sounds, in so that each phone obtained at the output of the g2p system could have correspondence with a set of articulatory gestures adapted for EP; 3) the dynamic analysis of nasality, on the basis of an articulatory and perceptive study. The two automatic syllabification algorithms implemented and tested make appeal to phonological knowledge on the structure of the syllable, being the first one based in finite state transducers and the second one a faithful implementation of Mateus & d'Andrade (2000) proposals. The performance of these algorithms – especially the second - was similar to the one of other systems with the same potentialities. Regarding grapheme-to-phone conversion, we follow a methodology based on manual rules combined with an automatic learning technique. The evaluation results of this system motivated the exploitation of others automatic approaches, finding also to evaluate the impact of the syllabic information integration in the systems. The gestural description of the European Portuguese sounds, anchored on the theoretical and methodological tenets of the Articulatory Phonology, was based essentially on the analysis of magnetic resonance data (MRI), from which all the measurements were carried out, aiming to obtain the quantitative articulatory parameters. The several gestural configurations proposed have been validated, through a small perceptual test, which allowed identifying the main underlying problems of the gestural proposal. This work provided, for the first time to PE, the development of a first articulatory based text-to-speech system. The dynamic description of nasal vowels relied either on the magnetic resonance data, for characterization of the oral gestures, either on the data obtained through electromagnetic articulography (EMA), for the study of the velum dynamic and of its relation with the remaining articulators. Besides that, a perceptive test was performed, using TADA and SAPWindows, to evaluate the sensibility of the Portuguese listeners to the variations in the height of velum and alterations in the intergestural coordination. This study supported an abstract interpretation (in gestural terms) of the EP nasal vowels and allowed also to clarify crucial aspects related with its production and perception
    corecore