2,823 research outputs found

    Book Reviews


    Reordering of Source Side for a Factored English to Manipuri SMT System

    Language pairs with massive parallel corpora are readily handled by large-scale systems using either Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Translation involving low-resource language pairs with linguistic divergence has always been a challenge. We consider one such pair, English-Manipuri, which shows linguistic divergence and belongs to the low-resource category. For such language pairs, SMT tends to perform better than NMT. However, SMT's most prominent model, the phrase-based model, uses groupings of surface word forms treated as phrases for translation. Without any linguistic knowledge, it therefore fails to learn a proper mapping between the source and target language symbols. Our model adopts a factored model of SMT (FSMT3*) with a part-of-speech (POS) tag as a factor to incorporate linguistic information about the languages, followed by hand-coded reordering. Reordering the source sentences makes them structurally similar to the target language, allowing a better mapping between source and target symbols. The reordering also converts long-distance reordering problems into monotone reordering, which SMT models handle better, thereby reducing the load at decoding time. Additionally, we find that adding POS feature data enhances the system's precision. Experimental results using automatic evaluation metrics show that our model improves over phrase-based and other factored models using the lexicalised Moses reordering options. Our FSMT3* model improves the automatic scores over the factored model with lexicalised phrase reordering (FSMT2) by 11.05% in BLEU (Bilingual Evaluation Understudy), 5.46% in F1, 9.35% in precision, and 2.56% in recall.
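    The abstract does not reproduce the paper's hand-coded reordering rules, but the general idea can be illustrated with a minimal, hypothetical Python sketch: since Manipuri is verb-final (SOV) while English is SVO, one plausible POS-driven rule moves an English verb group after the noun group that follows it, so the source order becomes monotone with the target. The tag sets and the rule itself are illustrative assumptions, not the authors' actual rules.

    # Hypothetical sketch of POS-driven source-side reordering.
    # A toy rule moves each English verb group after the noun group
    # that follows it, approximating Manipuri's verb-final order.

    def reorder_svo_to_sov(tagged):
        """Reorder a list of (token, POS) pairs from SVO toward SOV.

        `tagged` uses Penn Treebank-style tags; this is an illustrative
        rule only, not the paper's actual rule set.
        """
        verbs = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
        nominals = {"NN", "NNS", "NNP", "NNPS", "PRP", "DT", "JJ"}

        out, i = [], 0
        while i < len(tagged):
            if tagged[i][1] in verbs:
                # Collect the contiguous verb group.
                vgroup = []
                while i < len(tagged) and tagged[i][1] in verbs:
                    vgroup.append(tagged[i])
                    i += 1
                # Collect the following noun group (the object, if any).
                ngroup = []
                while i < len(tagged) and tagged[i][1] in nominals:
                    ngroup.append(tagged[i])
                    i += 1
                out.extend(ngroup + vgroup)  # object before verb
            else:
                out.append(tagged[i])
                i += 1
        return out

    # Example: "John ate rice" -> "John rice ate" (closer to Manipuri order).
    print(reorder_svo_to_sov([("John", "NNP"), ("ate", "VBD"), ("rice", "NN")]))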

    What do we learn about language from Spoken Corpus Linguistics?

    Over the last few decades, Spoken Corpus Linguistics (SCL) has achieved a great deal in terms of both the quantity and the quality of its works (O'Keeffe, McCarthy 2010). Enormous progress has been made in the last thirty years, and the growth of multimodal corpora has stimulated sophisticated investigations of the relationship between the verbal and non-verbal components of spoken communication (Knight 2011). SCL is a vital field of research that provides essential data and tools for the advancement of language knowledge. In this article I focus on the contribution that SCL and the resulting data make to general linguistics. In § 2, I discuss the contribution SCL makes to a better understanding of linguistic variation; in § 3, I show how SCL can improve the descriptive adequacy of grammars; finally, § 4 is dedicated to the contribution that speech data can make to a better knowledge of the grammaticality of languages. Throughout the article I mainly use data from Italian corpora, validated by comparison with data from corpora of other languages.

    Is literary language a development of ordinary language?

    Contemporary literary linguistics is guided by the 'Development Hypothesis', which holds that literary language is formed and regulated by developing only the elements, rules, and constraints of ordinary language. Six ways of differentiating literary language from ordinary language are tested against the Development Hypothesis, as are various kinds of superadded constraint, including metre, rhyme, alliteration, and parallelism. Literary language differs formally, but is unlikely to differ semantically, from ordinary language. The article concludes by asking why the Development Hypothesis might hold.

    From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings

    A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, which is in part evidenced by the fact that only 100 of the over 7,000 languages spoken in the world are fully covered in WALS. We learn distributed language representations that can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction and phoneme reconstruction), morphology (morphological inflection), and syntax (part-of-speech tagging). We consider more than 800 languages and find significant differences in the language representations encoded, depending on the target task. For instance, although Norwegian Bokmål and Danish are typologically close to one another, they are phonologically distant, which is reflected in their language embeddings growing relatively distant in a phonological task. We are also able to predict typological features in WALS with high accuracy, even for unseen language families. Accepted to NAACL 2018 (long paper).
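    As a rough illustration of the two analyses described (not the authors' code), the following Python sketch compares per-task language embeddings with cosine distance and predicts a WALS-style feature with a nearest-neighbour classifier; the embedding vectors and the feature labels are invented for the example.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def cosine_distance(u, v):
        """1 minus the cosine similarity between two vectors."""
        return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    # Hypothetical per-task embeddings for Danish (dan), Norwegian Bokmål (nob),
    # and Japanese (jpn); real ones would be learned alongside the auxiliary tasks.
    emb_syntax = {"dan": np.array([0.90, 0.10]),
                  "nob": np.array([0.88, 0.12]),
                  "jpn": np.array([0.10, 0.90])}
    emb_phon   = {"dan": np.array([0.20, 0.80]),
                  "nob": np.array([0.75, 0.25]),
                  "jpn": np.array([0.60, 0.40])}

    # Danish and Bokmål are close in the syntactic space...
    print(cosine_distance(emb_syntax["dan"], emb_syntax["nob"]))  # small
    # ...but farther apart in the phonological one, mirroring the paper's example.
    print(cosine_distance(emb_phon["dan"], emb_phon["nob"]))      # larger

    # Predicting a WALS-style feature (basic word order) from embeddings with a
    # nearest-neighbour classifier trained on languages whose value is known.
    train_langs, labels = ["dan", "jpn"], ["SVO", "SOV"]
    X = np.stack([emb_syntax[lang] for lang in train_langs])
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, labels)
    print(clf.predict(emb_syntax["nob"].reshape(1, -1)))  # -> ['SVO']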