1,835 research outputs found

    Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones

    Get PDF
    Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster.Comment: EMNLP 201

    Morphological Analysis as Classification: an Inductive-Learning Approach

    Full text link
    Morphological analysis is an important subtask in text-to-speech conversion, hyphenation, and other language engineering tasks. The traditional approach to performing morphological analysis is to combine a morpheme lexicon, sets of (linguistic) rules, and heuristics to find a most probable analysis. In contrast we present an inductive learning approach in which morphological analysis is reformulated as a segmentation task. We report on a number of experiments in which five inductive learning algorithms are applied to three variations of the task of morphological analysis. Results show (i) that the generalisation performance of the algorithms is good, and (ii) that the lazy learning algorithm IB1-IG performs best on all three tasks. We conclude that lazy learning of morphological analysis as a classification task is indeed a viable approach; moreover, it has the strong advantages over the traditional approach of avoiding the knowledge-acquisition bottleneck, being fast and deterministic in learning and processing, and being language-independent.Comment: 11 pages, 5 encapsulated postscript figures, uses non-standard NeMLaP proceedings style nemlap.sty; inputs ipamacs (international phonetic alphabet) and epsf macro

    Automatic syllabification using segmental conditional random fields

    Get PDF
    In this paper we present a statistical approach for the automatic syllabification of phonetic word transcriptions. A syllable bigram language model forms the core of the system. Given the large number of syllables in non-syllabic languages, sparsity is the main issue, especially since the available syllabified corpora tend to be small. Traditional back-off mechanisms only give a partial solution to the sparsity problem. In this work we use a set of features for back-off purposes: on the one hand probabilities such as consonant cluster probabilities, and on the other hand a set of rules based on generic syllabification principles such as legality, sonority and maximal onset. For the combination of these highly correlated features with the baseline bigram feature we employ segmental conditional random fields (SCRFs) as statistical framework. The resulting method is very versatile and can be used for any amount of data of any language. The method was tested on various datasets in English and Dutch with dictionary sizes varying between 1 and 60 thousand words. We obtained a 97.96% word accuracy for supervised syllabification and a 91.22% word accuracy for unsupervised syllabification for English. When including the top-2 generated syllabifications for a small fraction of the words, virtual perfect syllabification is obtained in supervised mode

    Comparison between rule-based and data-driven natural language processing algorithms for Brazilian Portuguese speech synthesis

    Get PDF
    Due to the exponential growth in the use of computers, personal digital assistants and smartphones, the development of Text-to-Speech (TTS) systems have become highly demanded during the last years. An important part of these systems is the Text Analysis block, that converts the input text into linguistic specifications that are going to be used to generate the final speech waveform. The Natural Language Processing algorithms presented in this block are crucial to the quality of the speech generated by synthesizers. These algorithms are responsible for important tasks such as Grapheme-to-Phoneme Conversion, Syllabification and Stress Determination. For Brazilian Portuguese (BP), solutions for the algorithms presented in the Text Analysis block have been focused in rule-based approaches. These algorithms perform well for BP but have many disadvantages. On the other hand, there is still no research to evaluate and analyze the performance of data-driven approaches that reach state-of-the-art results for complex languages, such as English. So, in this work, we compare different data-driven approaches and rule-based approaches for NLP algorithms presented in a TTS system. Moreover, we propose, as a novel application, the use of Sequence-to-Sequence models as solution for the Syllabification and Stress Determination problems. As a brief summary of the results obtained, we show that data-driven algorithms can achieve state-of-the-art performance for the NLP algorithms presented in the Text Analysis block of a BP TTS system.Nos últimos anos, devido ao grande crescimento no uso de computadores, assistentes pessoais e smartphones, o desenvolvimento de sistemas capazes de converter texto em fala tem sido bastante demandado. O bloco de análise de texto, onde o texto de entrada é convertido em especificações linguísticas usadas para gerar a onda sonora final é uma parte importante destes sistemas. O desempenho dos algoritmos de Processamento de Linguagem Natural (NLP) presentes neste bloco é crucial para a qualidade dos sintetizadores de voz. Conversão Grafema-Fonema, separação silábica e determinação da sílaba tônica são algumas das tarefas executadas por estes algoritmos. Para o Português Brasileiro (BP), os algoritmos baseados em regras têm sido o foco na solução destes problemas. Estes algoritmos atingem bom desempenho para o BP, contudo apresentam diversas desvantagens. Por outro lado, ainda não há pesquisa no intuito de avaliar o desempenho de algoritmos data-driven, largamente utilizados para línguas complexas, como o inglês. Desta forma, expõe-se neste trabalho uma comparação entre diferentes técnicas data-driven e baseadas em regras para algoritmos de NLP utilizados em um sintetizador de voz. Além disso, propõe o uso de Sequence-to-Sequence models para a separação silábica e a determinação da tonicidade. Em suma, o presente trabalho demonstra que o uso de algoritmos data-driven atinge o estado-da-arte na performance dos algoritmos de Processamento de Linguagem Natural de um sintetizador de voz para o Português Brasileiro

    Current approaches to orthography instruction for spanish heritage learners: an analysis of intermediate and advanced textbooks

    Get PDF
    As it is well known, heritage learners of Spanish have an advantage on oral production and aural comprehension over L2 learners. However, due to their lack of formal instruction in Spanish, their linguistic weaknesses lie on their literacy skills (reading and writing). In terms of writing skills, students mainly struggle with orthography issues (accentuation and spelling) and larger literacy skills (e.g. developing a thesis or organizing ideas). However, there is an important gap in the literature regarding orthographic acquisition since most of the research in the field on focus on form instruction has predominantly been on grammar acquisition (Anderson, 2008; Montrul and Bowles, 2008; Potowski, 2005, among others). In fact, despite the increasing amount of textbooks addressed to this student population with an emphasis on the writing process, to my knowledge there has not been a study on the current approaches of Spanish heritage learners’ textbooks for orthography instruction. After analyzing four popular textbooks for Spanish heritage, it can be deduced that the lack of research on this area perpetuates the maintenance of traditional non-communicative explicit instruction of orthography through drills after an explicit explanation of the rules of both accentuation and spelling but new textbooks shed some light towards the implementation of focus on form teaching techniques commonly used in the L2 classroom such as input-output activities

    MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning

    Full text link
    In this paper, we present a methodology for linguistic feature extraction, focusing particularly on automatically syllabifying words in multiple languages, with a design to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). In both the textual and phonetic domains, our method focuses on the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification (in text and phonetic domains). The system was built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words from several languages (English, French and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating valuable annotations available online\footnote{\url{https://github.com/noetits/MUST_P-SRL}} that are ideal for speech representation learning, speech unit discovery, and disentanglement of speech factors in several speech-related fields.Comment: Accepted for publication at EMNLP 202
    corecore