Reordering of Source Side for a Factored English to Manipuri SMT System
Language pairs with massive parallel corpora are readily handled by large-scale systems using either Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Translation involving low-resource language pairs with linguistic divergence, however, has always been a challenge. We consider one such pair, English-Manipuri, which shows linguistic divergence and belongs to the low-resource category. For such language pairs, SMT is generally favoured over NMT. However, SMT's most prominent model, the phrase-based model, treats groupings of surface word forms as phrases for translation; without any linguistic knowledge, it therefore fails to learn a proper mapping between source and target language symbols. Our model adopts a factored SMT model (FSMT3*) with a part-of-speech (POS) tag as a factor to incorporate linguistic information about the languages, followed by hand-coded reordering. Reordering the source sentences makes them structurally similar to the target language, allowing a better mapping between source and target symbols. It also converts long-distance reordering problems into monotone reordering, which SMT models handle better, thereby reducing the load at decoding time. Additionally, we find that adding POS feature data enhances the system's precision. Experimental results using automatic evaluation metrics show that our model improves over phrase-based and other factored models that use the lexicalised Moses reordering options. Our FSMT3* model improves the automatic translation scores over the factored model with lexicalised phrase reordering (FSMT2) by 11.05% (Bilingual Evaluation Understudy), 5.46% (F1), 9.35% (Precision), and 2.56% (Recall), respectively.
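The abstract does not spell out the hand-coded rules themselves; the Python sketch below is a rough illustration only, with assumed Penn-style POS tags and an invented helper name (reorder_svo_to_sov). It shows the kind of POS-driven source-side reordering the abstract describes: moving the English verb group to clause-final position so that the source order approximates Manipuri's verb-final (SOV) order, turning long-distance reordering into near-monotone alignment.

# Minimal illustrative sketch, not the paper's actual rules. Input is a
# POS-tagged sentence, i.e. the same (word, tag) factors a factored SMT
# system such as Moses would consume.

def reorder_svo_to_sov(tagged):
    """Move the verb group to the end of the clause.

    tagged: list of (word, pos) pairs from any POS tagger.
    Returns a reordered list whose word order more closely matches
    Manipuri's verb-final order.
    """
    verbs = [wp for wp in tagged if wp[1].startswith("VB")]
    rest = [wp for wp in tagged if not wp[1].startswith("VB")]
    return rest + verbs

sent = [("John", "NNP"), ("ate", "VBD"), ("rice", "NN")]
print(reorder_svo_to_sov(sent))
# [('John', 'NNP'), ('rice', 'NN'), ('ate', 'VBD')]  -- SOV-like order

A real rule set would of course be clause-aware rather than sentence-global, but the principle is the same: reorder before training and decoding, so the SMT model only has to learn near-monotone phrase mappings.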
What do we learn about language from Spoken Corpus Linguistics?
Over the last few decades, Spoken Corpus Linguistics (SCL) has achieved a great deal in terms of the quantity and quality of its work (O'Keeffe, McCarthy 2010). Enormous progress has been made in the last thirty years, and the growth of multimodal corpora has stimulated sophisticated investigations of the relationship between the verbal and non-verbal components of spoken communication (Knight 2011). SCL is a vital field of research, able to provide essential data and tools for the advancement of our knowledge of language. In this article I focus on the contribution that SCL and the resulting data make to general linguistics. In § 2, I discuss the contribution that SCL makes to a better understanding of linguistic variation; in § 3, I show how SCL can improve the descriptive adequacy of grammars; finally, § 4 is dedicated to the contribution that speech data can make to a better knowledge of the grammaticality of languages. Throughout the article I mainly use data from Italian corpora, validated by comparison with data from corpora of other languages.
Is literary language a development of ordinary language?
Contemporary literary linguistics is guided by the 'Development Hypothesis', which holds that literary language is formed and regulated by developing only the elements, rules and constraints of ordinary language. Six ways of differentiating literary language from ordinary language are tested against the Development Hypothesis, as are various kinds of superadded constraint, including metre, rhyme, alliteration and parallelism. Literary language differs formally, but is unlikely to differ semantically, from ordinary language. The article concludes by asking why the Development Hypothesis might hold.
From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings
A core part of linguistic typology is the classification of languages according to linguistic properties, such as those detailed in the World Atlas of Language Structures (WALS). Doing this manually is prohibitively time-consuming, which is in part evidenced by the fact that only 100 out of over 7,000 languages spoken in the world are fully covered in WALS. We learn distributed language representations, which can be used to predict typological properties on a massively multilingual scale. Additionally, quantitative and qualitative analyses of these language embeddings can tell us how language similarities are encoded in NLP models for tasks at different typological levels. The representations are learned in an unsupervised manner alongside tasks at three typological levels: phonology (grapheme-to-phoneme prediction and phoneme reconstruction), morphology (morphological inflection), and syntax (part-of-speech tagging). We consider more than 800 languages and find significant differences in the language representations, depending on the target task. For instance, although Norwegian Bokmål and Danish are typologically close to one another, they are phonologically distant, which is reflected in their language embeddings growing relatively distant in a phonological task. We are also able to predict typological features in WALS with high accuracy, even for unseen language families.
Comment: Accepted to NAACL 2018 (long paper).
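As an illustration only (none of this code is from the paper), the Python/PyTorch sketch below shows one common way such language embeddings can be learned alongside a POS-tagging task: a per-language vector is concatenated to every token embedding, the tagger shares all other parameters across languages, and after training the per-language vectors can be compared, or fed to a simple classifier, to predict WALS features. All sizes, names, and the toy data are assumptions for demonstration.

# Illustrative sketch, not the authors' code.
import torch
import torch.nn as nn

N_LANGS, N_TOKENS, N_TAGS = 800, 100, 17   # assumed vocabulary sizes
LANG_DIM, TOK_DIM, HID = 64, 32, 128       # assumed dimensions

class LangAwareTagger(nn.Module):
    def __init__(self):
        super().__init__()
        self.lang_emb = nn.Embedding(N_LANGS, LANG_DIM)  # one vector per language
        self.tok_emb = nn.Embedding(N_TOKENS, TOK_DIM)
        self.rnn = nn.LSTM(TOK_DIM + LANG_DIM, HID, batch_first=True)
        self.out = nn.Linear(HID, N_TAGS)

    def forward(self, tokens, lang_id):
        # Concatenate the language vector to every token embedding, so the
        # language embedding can absorb typological differences while the
        # rest of the tagger is shared across languages.
        t = self.tok_emb(tokens)                            # (B, T, TOK_DIM)
        l = self.lang_emb(lang_id)[:, None, :].expand(-1, t.size(1), -1)
        h, _ = self.rnn(torch.cat([t, l], dim=-1))
        return self.out(h)                                  # (B, T, N_TAGS)

model = LangAwareTagger()
tokens = torch.randint(0, N_TOKENS, (2, 5))   # toy batch: 2 sentences, 5 tokens
langs = torch.tensor([3, 41])                 # toy language ids
logits = model(tokens, langs)                 # train with cross-entropy on tags

# After training, rows of model.lang_emb.weight can be compared with cosine
# similarity to study task-dependent language distance, or used as features
# for predicting WALS properties.
print(torch.cosine_similarity(model.lang_emb.weight[3],
                              model.lang_emb.weight[41], dim=0))

The same language-embedding table can be reused across the phonological and morphological tasks mentioned in the abstract; what changes is the prediction head and training data, which is what makes the resulting embeddings task-dependent.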