
    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends largely on its parallel training data: phrases that are not present in that data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system using external morphological resources instead of additional parallel data. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested the approach on En-Fr and Fr-En translation; results showed improved automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe the knowledge expansion framework is generic and could be used to add other types of information to the model.
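    The expansion idea can be illustrated with a minimal sketch. This is a toy reconstruction, not the paper's code: the lexicon format, the similarity score, and the weighting are illustrative assumptions, and a real system would read and write Moses-style phrase and reordering tables.

```python
# Toy sketch of phrase-table expansion with morphological variants.
from itertools import product

# Hypothetical analyser: surface form -> (lemma, number feature)
LEXICON = {
    "maison": ("maison", "Sing"), "maisons": ("maison", "Plur"),
    "house":  ("house",  "Sing"), "houses":  ("house",  "Plur"),
    "la":     ("la",     "Sing"), "les":     ("la",     "Plur"),
    "the":    ("the",    None),
}

def variants(word):
    """All surface forms sharing the lemma of `word` (the word itself included)."""
    lemma = LEXICON.get(word, (word, None))[0]
    forms = [w for w, (l, _) in LEXICON.items() if l == lemma]
    return forms or [word]

def number_profile(tokens):
    """Number features of the tokens, ignoring words that carry none."""
    feats = [LEXICON.get(w, (w, None))[1] for w in tokens]
    return [f for f in feats if f is not None]

def similarity(src_toks, tgt_toks):
    """Toy stand-in for a morphosyntactic string similarity: 1.0 when the
    number features on both sides agree, 0.5 otherwise."""
    s, t = set(number_profile(src_toks)), set(number_profile(tgt_toks))
    return 1.0 if (not s or not t or s == t) else 0.5

def expand(phrase_table):
    """Generate new phrase pairs whose source/target/both sides are
    morphological variations of an existing pair, weighted by similarity."""
    new_entries = {}
    for (src, tgt), score in phrase_table.items():
        src_toks, tgt_toks = src.split(), tgt.split()
        for s in product(*(variants(w) for w in src_toks)):
            for t in product(*(variants(w) for w in tgt_toks)):
                pair = (" ".join(s), " ".join(t))
                if pair not in phrase_table:
                    new_entries[pair] = score * similarity(list(s), list(t))
    return new_entries

table = {("la maison", "the house"): 0.8}
for pair, score in expand(table).items():
    print(pair, round(score, 2))  # e.g. ('les maisons', 'the houses') keeps 0.8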

    Automatic processing of code-mixed social media content

    Code-mixing or language-mixing is a linguistic phenomenon in which multiple languages are mixed within a conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) taggers and parsers perform poorly on such content because they are generally trained on monolingual data, so there is a need for code-mixed NLP. This research focuses on creating a code-mixed English-Hindi-Bengali corpus and using it to develop a word-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised, code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learning techniques. Comparing a dictionary-based system, a character-n-gram-based linear model, a character-n-gram-based first-order Conditional Random Field (CRF) and a recurrent neural network in the form of a Long Short-Term Memory (LSTM) network that considers words as well as characters, the LSTM outperformed the other methods. We also took part in the First Workshop on Computational Approaches to Code-Switching (EMNLP 2014), where we achieved the highest token-level accuracy in the Nepali-English word-level language identification task. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code-mixed data (pipeline and stacked systems and LSTM-based neural models) were implemented; among them, the neural approach outperformed the others. Further, we investigated joint models that perform language identification and POS tagging together, comparing a factorial CRF (FCRF) joint model with three LSTM-based multi-task models. The neural models achieved good accuracy on both tasks, outperforming the FCRF approach. Furthermore, we found that a multi-task learning approach is better than training individual neural models for each task. Comparison of the three neural approaches revealed that, even without task-specific recurrent layers, good accuracy on both LID and POS tagging can be achieved through careful handling of the output layers.
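    For concreteness, the following is a minimal sketch of one of the compared baselines, a character-n-gram linear model for word-level LID, using scikit-learn. The toy training words and labels are illustrative stand-ins, not the thesis corpus, and this is not the thesis implementation.

```python
# Word-level language identification for code-mixed text via character n-grams.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy romanised tokens with word-level language labels (en / hi / bn).
train_words  = ["hello", "friend", "kya", "hai", "kemon", "acho", "good", "bhalo"]
train_labels = ["en",    "en",     "hi",  "hi",  "bn",    "bn",   "en",   "bn"]

# Character 1-3 grams of each word feed a simple logistic-regression classifier.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_words, train_labels)

# Predict a language tag for every token of a code-mixed sentence.
sentence = "hello kya acho friend".split()
print(list(zip(sentence, model.predict(sentence))))
```

    A CRF or LSTM variant would replace the per-word classifier with a sequence model so that neighbouring tags and full-word context inform each decision.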

    Automated Adaptation Between Kiranti Languages

    McCloy, Daniel, M.A., December 2006, Linguistics. Chairperson: Dr. Anthony Mattina.
    Minority language communities that are seeking to develop their language may be hampered by a lack of vernacular materials. Large volumes of such materials may be available in a related language. Automated adaptation holds potential to enable these large volumes of materials to be efficiently translated into the resource-scarce language. I describe a project to assess the feasibility of automatically adapting text between Limbu and Yamphu, two languages in Nepal’s Kiranti grouping. The approaches taken, essentially a transfer-based system partially hybridized with a Kiranti-specific interlingua, are placed in the context of machine translation efforts world-wide. A key principle embodied in this strategy is that adaptation can transcend structural obstacles by taking advantage of functional commonalities. That is, what matters most for successful adaptation is that the languages “care about the same kinds of things.” I examine various typological phenomena of these languages to assess this degree of functional commonality: the types of features marked on the finite verb, case-marking systems, the encoding of vertical deixis, object-incorporated verbs, and nominalization issues. As this Kiranti adaptation goal involves adaptation into multiple target languages, I also present a disambiguation strategy that ensures the manual disambiguation performed for one target language is fed back into the system, so that the same disambiguation need not be performed again for other target languages.
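    The disambiguation-reuse strategy can be pictured with a small sketch. Everything here (the cache file, the function names, the keying on the bare source form) is a hypothetical illustration of the idea, not the thesis implementation.

```python
# Reuse manual disambiguation decisions across target languages.
import json
from pathlib import Path

CACHE_FILE = Path("disambiguation_cache.json")  # hypothetical storage location

def load_cache():
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

def save_cache(cache):
    CACHE_FILE.write_text(json.dumps(cache, ensure_ascii=False, indent=2))

def resolve(source_form, senses, cache, ask_user):
    """Return the intended analysis of an ambiguous source-language form.

    `senses` lists the candidate analyses; `ask_user` is only called the first
    time an ambiguity is met, regardless of which target language (e.g. Yamphu
    now, another Kiranti language later) the text is currently being adapted into.
    """
    if len(senses) == 1:
        return senses[0]
    if source_form in cache:
        return cache[source_form]           # reuse the earlier manual decision
    choice = ask_user(source_form, senses)  # one-time manual disambiguation
    cache[source_form] = choice
    save_cache(cache)
    return choice
```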