
    Pronunciation modelling in end-to-end text-to-speech synthesis

    Sequence-to-sequence (S2S) models in text-to-speech synthesis (TTS) can achieve high naturalness scores without extensive processing of the text input. Since S2S models have been proposed for multiple aspects of the TTS pipeline, the field has moved toward folding the pipeline into end-to-end (E2E) TTS, where a waveform is predicted directly from a sequence of text or phone characters. Early work on E2E-TTS in English, such as Char2Wav [1] and Tacotron [2], suggested that phonetisation (lexicon lookup and/or G2P modelling) could be learnt implicitly in a text encoder during training. The benefits of a learned text encoding include improved modelling of phonetic context, which makes the contextual linguistic features traditionally used in TTS pipelines redundant [3]. Subsequent work on E2E-TTS has since shown similar naturalness scores with text or phone input (e.g. as in [4]). Successful modelling of phonetic context has led some to question the benefit of using phone input instead of text input altogether (see [5]). The use of text input brings into question the value of the pronunciation lexicon in E2E-TTS. Without phone input, an S2S encoder learns an implicit grapheme-to-phoneme (G2P) model from text-audio pairs during training. With common datasets for E2E-TTS in English, I simulated implicit G2P models, finding increased error rates compared to a traditional, lexicon-based G2P model. Ultimately, successful G2P generalisation is difficult for some words (e.g. foreign words and proper names), since the knowledge needed to disambiguate their pronunciations may not be provided by the local grapheme context and may go beyond what is contained in sentence-level text-audio sequences. When test stimuli were selected according to G2P difficulty, increased mispronunciations in E2E-TTS with text input were observed. Following the proposed benefits of subword decomposition in S2S modelling in other language tasks (e.g. neural machine translation), the effects of morphological decomposition on pronunciation modelling were investigated. Learning of the French post-lexical phenomenon liaison was also evaluated. With the goal of an inexpensive, large-scale evaluation of pronunciation modelling, the reliability of automatic speech recognition (ASR) as a measure of TTS intelligibility was investigated. A re-evaluation of 6 years of results from the Blizzard Challenge was conducted. In English, ASR reliably found significant differences between systems similar to those found by paid listeners in controlled conditions. An analysis of transcriptions for words exhibiting difficult-to-predict G2P relations was also conducted. The E2E-ASR Transformer model used was found to be unreliable for these words, producing homophonic or otherwise incorrect transcriptions. A further evaluation of representation mixing in Tacotron finds that pronunciation correction is possible when mixing text and phone inputs. The thesis concludes that there is still a place for the pronunciation lexicon in E2E-TTS as a pronunciation guide, since it can provide assurances that G2P generalisation cannot.
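
    One common way to score ASR-based intelligibility is the word error rate (WER) between the text a TTS system was asked to read and the ASR transcript of its output. The following is a minimal sketch under that assumption; the helper name `word_error_rate`, the whitespace tokenisation and the example sentences are illustrative and are not taken from the thesis, whose actual scoring setup is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative homophone case: the synthesised audio may be pronounced
# correctly, yet the transcript still differs from the reference text.
wer_homophone = word_error_rate("the colonel read the book",
                                "the kernel read the book")
```

    The homophone case shows why such a score can be unreliable for words with difficult G2P relations: a word-level comparison cannot tell a mispronunciation from a correct pronunciation transcribed as a different word.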

    Rapid Generation of Pronunciation Dictionaries for new Domains and Languages

    This dissertation presents innovative strategies and methods for the rapid generation of pronunciation dictionaries for new domains and languages. Solutions are proposed and developed for a range of conditions, from the straightforward scenario in which the target language is present in written form on the Internet and the mapping between speech and written language is close, up to the difficult scenario in which no written form for the target language exists.

    Uncovering the myth of learning to read Chinese characters: phonetic, semantic, and orthographic strategies used by Chinese as foreign language learners

    Oral Session - 6A: Lexical modeling: no. 6A.3. Chinese is considered one of the most challenging orthographies for non-native speakers to learn, particularly its characters. The Chinese character is the basic reading unit, bringing together sound, form and meaning. The predominant type of Chinese character is the semantic-phonetic compound, composed of phonetic and semantic radicals that give clues to the sound and meaning, respectively. Over the last two decades, psycholinguistic research has made significant progress in specifying the roles of phonetic and semantic radicals in character processing among native Chinese speakers …

    (Dis)connections between specific language impairment and dyslexia in Chinese

    Poster Session: no. 26P.40. Specific language impairment (SLI) and dyslexia describe language-learning impairments that occur in the absence of a sensory, cognitive, or psychosocial impairment. SLI is primarily defined by an impairment in oral language, and dyslexia by a deficit in the reading of written words. SLI and dyslexia co-occur in school-age children learning English, with rates ranging from 17% to 75%. For children learning Chinese, SLI and dyslexia also co-occur. Wong et al. (2010) first reported on the presence of dyslexia in a clinical sample of 6- to 11-year-old school-age children with SLI. The study compared the reading-related cognitive skills of children with SLI and dyslexia (SLI-D) with 2 groups of children …

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing


    Grapheme-based Automatic Speech Recognition using Probabilistic Lexical Modeling

    Automatic speech recognition (ASR) systems incorporate expert linguistic knowledge through a phone pronunciation lexicon (or dictionary), in which each word is associated with a sequence of phones. Creating a phone pronunciation lexicon for a new language or domain is costly, as it requires linguistic expertise, time and money. In this thesis, we focus on building ASR systems effectively in the absence of linguistic expertise for a new domain or language. In particular, we consider graphemes as alternative subword units for speech recognition. In a grapheme lexicon, the pronunciation of a word is derived from its orthography. However, modeling graphemes for speech recognition is a challenging task for two reasons. Firstly, the grapheme-to-phoneme (G2P) relationship can be ambiguous, as languages continue to evolve after their spelling has been standardized. Secondly, as elucidated in this thesis, ASR systems typically model the relationship between graphemes and acoustic features directly, and the acoustic features depict the envelope of speech, which is related to phones. In this thesis, a grapheme-based ASR approach is proposed in which the modeling of the relationship between graphemes and acoustic features is factored through a latent variable into two models, namely an acoustic model and a lexical model. The acoustic model captures the relationship between latent variables and acoustic features, while the lexical model captures a probabilistic relationship between latent variables and graphemes. We refer to the proposed approach as probabilistic lexical modeling based ASR. In the thesis we show that the latent variables can be phones, multilingual phones or clustered context-dependent subword units, and that the acoustic model can be trained on domain-independent or language-independent resources. The lexical model is trained on transcribed speech data from the target domain or language. In doing so, the parameters of the lexical model capture a probabilistic relationship between graphemes and phones. In the proposed grapheme-based ASR approach, lexicon learning is implicitly integrated as a phase of ASR system training, as opposed to the conventional approach in which a phone pronunciation lexicon is developed first and a phone-based ASR system is then trained. The potential and efficacy of the proposed approach are demonstrated through experiments and comparisons with other standard approaches on ASR for resource-rich languages, non-native and accented speech, under-resourced languages, and minority languages. The studies revealed that the proposed framework is particularly suitable when the task is challenged by the lack of both linguistic expertise and transcribed data. Furthermore, our investigations showed that standard ASR approaches in which the lexical model is deterministic are more suitable for phones than for graphemes, while the probabilistic lexical model based ASR approach is suitable for both. Finally, we show that the captured grapheme-to-phoneme relationship can be exploited to perform acoustic data-driven G2P conversion.
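
    The factoring through a latent variable can be illustrated with a toy example. The sketch below assumes the latent variables are phones, that the lexical model stores one probability distribution over phones per grapheme, and that a grapheme is scored against a frame by marginalising over phones; the phone set, the probabilities and the function names are all hypothetical, and the local score and training procedure actually used in the thesis are not given in the abstract.

```python
# Hypothetical latent units (phones) shared by both models.
PHONES = ["k", "s", "ae", "t"]

# Toy lexical model: P(phone | grapheme), one distribution per grapheme.
# In the approach described above these parameters would be learned from
# transcribed speech; here they are made-up numbers.
LEXICAL_MODEL = {
    "c": [0.7, 0.3, 0.0, 0.0],
    "a": [0.0, 0.0, 1.0, 0.0],
    "t": [0.0, 0.0, 0.0, 1.0],
}


def grapheme_frame_score(grapheme: str, phone_posterior: list[float]) -> float:
    """Score a grapheme against one frame by summing over latent phones.

    phone_posterior is the acoustic model output P(phone | frame).
    The sum-of-products rule is an assumption made for this sketch.
    """
    lexical = LEXICAL_MODEL[grapheme]
    return sum(y * z for y, z in zip(lexical, phone_posterior))


def most_likely_phone(grapheme: str) -> str:
    """Read a crude G2P mapping off the lexical model parameters."""
    probs = LEXICAL_MODEL[grapheme]
    best = max(range(len(probs)), key=probs.__getitem__)
    return PHONES[best]


# A frame the acoustic model believes is mostly /k/.
frame_posterior = [0.9, 0.1, 0.0, 0.0]
score_c = grapheme_frame_score("c", frame_posterior)
g2p_cat = [most_likely_phone(g) for g in "cat"]
```

    Because the acoustic model only ever sees phones, it can be trained on data from other domains or languages, while the small lexical table is the only part tied to the target orthography. Reading the most probable phone per grapheme off that table is a simplified picture of how the learned relationship could feed G2P conversion.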

    How users read translated web pages: occupational and purpose-based differences

    As part of process-oriented research efforts, this study tested occupational and purpose-based differences in the reading patterns of a translated web page. The research used two reading groups with ten participants each: a heavy-reading group, whose participants were reading professionals such as translators, editors and proofreaders, and a light-reading group, whose professions did not involve intensive reading, such as chefs, engineers and military personnel. The participants in both groups completed four tasks designed to provoke four different reading purposes: 1) reading without a specific task, 2) reading to study the subject matter, 3) reading to retrieve information, and 4) reading to share information. The English source text introduces iOS 7, and the test used the official Korean web version. To learn more about how readers perceive translation errors and to test the relations between reading patterns and reading purposes, five types of translation error, following the LISA evaluation model, were planted in the web page. The participants' screen activity was recorded with screen-recording software, and think-aloud protocols (TAP) were used for their verbal reports. In-depth analysis of the results suggests that occupation and reading purpose have a meaningful impact on the reading patterns of the translated web page, and that translation errors are perceived differently as they vary. The heavy-reading group displayed very strict bottom-up approaches, with linear and thorough reading, whereas the light-reading group showed relaxed top-down approaches, with circular and less thorough reading. The heavy-reading group was critical in detecting errors and found a large number of them, while the light-reading group showed extremely low error-detection rates and a relaxed attitude. Surprisingly, in spite of this critical error-detection pattern, tolerance of translation errors during comprehension was much higher in the heavy-reading group, and frustration at failing to understand was much higher in the light-reading group. The authority of the company producing the web page also heavily influenced the trust level of the light-reading group.