Findings of the 2019 Conference on Machine Translation (WMT19)
This paper presents the results of the premier shared task organized alongside the Conference on Machine Translation (WMT) 2019.
Participants were asked to build machine translation systems for any of 18 language pairs, to be evaluated on a test set of news stories. The main metric for this task is human judgment of translation quality. The task was also opened up to additional test suites to probe specific aspects of translation.
Initial Experiments on Russian to Kazakh SMT
We present our initial experiments on Russian to Kazakh phrase-based statistical machine translation. Following a common approach to SMT between morphologically rich languages, we employ morphological processing techniques. Namely, for our initial experiments, we perform source-side lemmatization. Given the rather small parallel corpus at hand, we also put some effort into data cleaning and investigate the impact of the data quality vs. quantity trade-off on overall performance. Although our experiments mostly focus on source-side preprocessing, we achieve a substantial, statistically significant improvement over the baseline that operates on raw, unprocessed data.
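The source-side lemmatization step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the lemma table `TOY_LEMMAS` and the helpers `lemmatize_source` and `vocabulary_size` are hypothetical names introduced here, and a real system would obtain lemmas from a morphological analyzer rather than a hand-written dictionary. The sketch only shows the general idea that replacing inflected source tokens with their lemmas shrinks the source vocabulary, which is one way to reduce data sparseness on a small parallel corpus.

```python
from typing import Dict, List

# Hypothetical toy lemma table (assumption for illustration only);
# a real setup would query a morphological analyzer instead.
TOY_LEMMAS: Dict[str, str] = {
    "книги": "книга",
    "книгу": "книга",
    "читал": "читать",
    "читала": "читать",
}


def lemmatize_source(sentence: str, lemmas: Dict[str, str]) -> str:
    """Lowercase a whitespace-tokenized sentence and map each token to its lemma.

    Tokens missing from the table are left unchanged.
    """
    tokens = sentence.lower().split()
    return " ".join(lemmas.get(tok, tok) for tok in tokens)


def vocabulary_size(sentences: List[str]) -> int:
    """Count distinct lowercase whitespace-separated tokens across sentences."""
    return len({tok for s in sentences for tok in s.lower().split()})


if __name__ == "__main__":
    raw = ["он читал книги", "она читала книгу"]
    lemmatized = [lemmatize_source(s, TOY_LEMMAS) for s in raw]
    print(lemmatized)
    print(vocabulary_size(raw), vocabulary_size(lemmatized))
```

Only the source side is rewritten in this sketch; the target side of the parallel corpus would stay as surface forms, so the translation model still has to generate fully inflected output.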
A free/open-source hybrid morphological disambiguation tool for Kazakh
This paper presents the results of developing a morphological disambiguation tool for Kazakh. Starting with a previously developed rule-based approach, we tried to cope with the complex morphology of Kazakh by breaking up lexical forms across their derivational boundaries into inflectional groups and modeling their behavior with statistical methods. A hybrid rule-based/statistical approach appears to benefit morphological disambiguation, demonstrating a per-token accuracy of 91% in running text.
Initial Normalization of User Generated Content: Case Study in a Multilingual Setting
We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravate the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.
Development of an intelligent information resource model based on modern natural language processing methods
Currently, the need for automatic text processing is growing rapidly, and new effective methods and tools for processing natural language texts are emerging accordingly. Although these methods, tools and resources are mostly available on the internet, many of them remain inaccessible to developers, since they are not systematized and are scattered across various directories or separate sites of both humanities and technical orientation. All this greatly complicates finding them and using them in computational linguistics research and in developing applied natural language processing systems. This paper addresses that need. Its goal is to develop a model of an intelligent information resource based on modern methods of natural language processing (IIR NLP). The main purpose of IIR NLP is to give specialists in the field of computational linguistics convenient and valuable access to these resources. The originality of our proposed approach is that the developed ontology of the subject area “NLP” will be used to systematize all of the above knowledge, data and information resources and to organize meaningful access to them, with semantic web standards and technology tools serving as the software basis.