9 research outputs found

    Initial Experiments on Russian to Kazakh SMT

    Get PDF
    We present our initial experiments on Russian-to-Kazakh phrase-based statistical machine translation (SMT). Following a common approach to SMT between morphologically rich languages, we employ morphological processing techniques; namely, for these initial experiments, we perform source-side lemmatization. Given the rather modest size of the parallel corpus at hand, we also put some effort into data cleaning and investigate the impact of the data quality-versus-quantity trade-off on overall performance. Although our experiments focus mostly on source-side preprocessing, we achieve a substantial, statistically significant improvement over a baseline that operates on raw, unprocessed data.
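The abstract does not include the pipeline itself; as a rough illustration of what source-side lemmatization does before SMT training, here is a minimal sketch in which the lemma table and the example sentence are purely hypothetical (a real system would use a morphological analyzer such as pymorphy2 or Mystem for Russian):

```python
# Toy illustration of source-side lemmatization as an SMT preprocessing
# step. The tiny lookup table is hypothetical; a real pipeline would use
# a morphological analyzer, not a hand-written dictionary.
LEMMAS = {
    "читаю": "читать",  # "(I) read" -> "to read"
    "книги": "книга",   # "books"    -> "book"
}

def lemmatize_source(sentence: str) -> str:
    """Replace each source-side token with its lemma when known."""
    return " ".join(LEMMAS.get(tok, tok) for tok in sentence.split())

print(lemmatize_source("я читаю книги"))  # -> я читать книга
```

Collapsing inflected forms to their lemmas shrinks the source-side vocabulary, which is the main lever against the data sparseness that morphologically rich languages cause in phrase-based SMT.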

    Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones

    Get PDF
    Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, which achieves performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and trains 1.2-2.2 times faster. Comment: EMNLP 2017.
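For intuition, a syllable-aware model segments each word into syllables before feeding it to the network, instead of into individual characters. The rule below is a deliberately naive vowel-nucleus splitter for English-like spelling, not the syllabifier used in the paper:

```python
import re

def naive_syllables(word: str) -> list:
    """Naive syllabification: each syllable is built around one vowel
    group, with any trailing consonants attached to the final syllable.
    Purely illustrative -- real syllabifiers handle many more cases."""
    return re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]*$)?", word)

print(naive_syllables("banana"))    # -> ['ba', 'na', 'na']
print(naive_syllables("language"))  # -> ['la', 'ngua', 'ge']
```

Because a word has far fewer syllables than characters, the sequence the model must process per word is shorter, which is where the training-speed advantage reported above comes from.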

    Experiments with Russian to Kazakh sentence alignment

    Get PDF
    Sentence alignment is the final step in building parallel corpora, and it arguably has the greatest impact on the quality of the resulting corpus and on the accuracy of machine translation systems that use it for training. However, the quality of sentence alignment itself depends on a number of factors. In this paper we investigate the impact of several data processing techniques on the quality of sentence alignment. We develop and use a number of automatic evaluation metrics, and provide empirical evidence that applying all of the considered data processing techniques yields bitexts with the lowest ratio of noise and the highest ratio of parallel sentences.
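The paper's own metrics are not reproduced here; one simple automatic check of the kind described, with thresholds that are assumptions rather than the paper's values, is a length-ratio filter that estimates the share of implausible sentence pairs in a bitext:

```python
def length_ratio_noise(bitext, low=0.4, high=2.5):
    """Hypothetical screening metric: the fraction of sentence pairs
    whose character-length ratio falls outside a plausible band.
    The [low, high] bounds are illustrative assumptions."""
    bad = sum(
        1 for src, tgt in bitext
        if not (low <= len(tgt) / max(len(src), 1) <= high)
    )
    return bad / len(bitext)

pairs = [
    ("hello world", "привет мир"),                      # plausible pair
    ("a", "this is far too long to be a translation"),  # length mismatch
]
print(length_ratio_noise(pairs))  # -> 0.5
```

A falling noise ratio after each preprocessing step gives exactly the kind of empirical, reference-free evidence the abstract describes.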

    Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

    Get PDF
    We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target the comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravate the data sparseness problem. We therefore propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, and show that in both cases normalization improves overall accuracy.
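The abstract does not detail the method itself; a minimal, language-agnostic normalization step of the kind that helps with intentional misspellings, assuming nothing beyond Unicode-aware lowercasing and run collapsing, might look like:

```python
import re

def normalize(comment: str) -> str:
    """Hypothetical normalizer sketch: lowercase the text and collapse
    runs of three or more identical characters to a single one. Both
    operations work identically for Cyrillic (Kazakh/Russian) and
    Latin input, so the method needs no language identification."""
    text = comment.lower()
    # "крутоооо!!!" -> "круто!", "Niiice" -> "nice"
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(normalize("Крутоооо!!!"))  # -> круто!
print(normalize("Niiice"))      # -> nice
```

Mapping such elongated variants back onto a canonical form is what reduces the data sparseness mentioned above, since downstream models see one token instead of dozens of spellings.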

    Explorations on chaotic behaviors of Recurrent Neural Networks

    No full text
    Submitted to the Department of Mathematics on Apr 29, 2019, in partial fulfillment of the requirements for the degree of Master of Science in Applied Mathematics. In this thesis we analyzed the dynamics of recurrent neural network architectures. We explored the chaotic nature of three state-of-the-art recurrent networks: the vanilla recurrent network, Recurrent Highway Networks, and the Structurally Constrained Recurrent Network. Our experiments showed that they exhibit chaotic behavior in the absence of input data. We also proposed a way of removing chaos from recurrent neural networks. Our findings show that initialization of the weight matrices plays an important role: initializing with matrices whose norm is smaller than one leads to non-chaotic behavior. The advantage of non-chaotic cells is their stable dynamics. Finally, we tested our chaos-free version of the Recurrent Highway Network (RHN) in a real-world application: in sequence-to-sequence modeling experiments, specifically language modeling, the chaos-free RHN performed on par with the original version using the same hyperparameters.
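The key finding, that recurrent weights with norm below one yield non-chaotic (contracting) dynamics, can be checked on a toy input-free recurrence; the 2x2 matrix below is illustrative, not taken from the thesis:

```python
import math

# A 2x2 recurrent weight whose norm is below one (Frobenius norm is
# sqrt(0.54) ~ 0.73) -- the thesis's condition for non-chaotic dynamics.
W = [[0.4, -0.3],
     [0.2,  0.5]]

def step(h):
    """One input-free step of a vanilla RNN: h <- tanh(W h)."""
    return [math.tanh(sum(w * x for w, x in zip(row, h))) for row in W]

h = [1.0, -1.0]
for _ in range(100):
    h = step(h)

# Since tanh is a contraction and ||W|| < 1, the state is driven to the
# zero fixed point instead of wandering on a chaotic attractor.
print(h)
```

With a weight matrix of norm greater than one, the same iteration can instead settle onto a strange attractor, which is the chaotic regime the thesis analyzes.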

    Manual vs Automatic Bitext Extraction

    No full text
    We compare manual and automatic approaches to the problem of extracting bitexts from the Web in the framework of a case study on building a Russian-Kazakh parallel corpus. Our findings suggest that targeted, site-specific crawling results in cleaner bitexts with a higher ratio of parallel sentences. We also find that general crawlers combined with boilerplate removal tools tend to retrieve shorter texts, as some content gets cleaned out along with the markup. When it comes to sentence splitting and alignment, we show that investing some effort in data pre- and post-processing, as well as in tuning off-the-shelf solutions, pays a noticeable dividend. Overall we observe that, depending on the source, automatic bitext extraction methods may severely lack coverage (retrieve fewer sentence pairs) and are on average less precise (retrieve fewer parallel sentence pairs). We conclude that if one aims to extract high-quality bitexts for a small number of language pairs, automatic methods are best avoided, or at least used with caution.

    On Various Approaches to Machine Translation from Russian to Kazakh

    No full text
    In this work we compare a number of approaches to machine translation (MT) from Russian to Kazakh. We focus on this pair of languages for a number of reasons. First, these languages are relatively understudied in terms of MT research, as well as natural language processing (NLP) research in general; Kazakh, in particular, has been actively studied with modern methods for less than a decade. Second, this pair of languages poses several processing challenges rooted in their nature: both languages are morphologically complex and tend to have free constituent order, which makes long-distance dependencies rather frequent. From the perspective of data-driven approaches to NLP, this means increased data sparseness and high OOV rates. Lastly, apart from scientific curiosity, there is a strong practical demand for high-quality MT between the languages in question. Kazakh is the state language of Kazakhstan, while Russian, due to a strong Soviet heritage, largely remains the language of professional communication and conduct. This frequently results in paperwork being initially prepared in Russian and then translated into Kazakh; thus, high-quality MT systems are in demand, as they would greatly reduce the manual labor of professional translators. We categorize the approaches that we compare into data-driven, linguistically motivated, and hybrid ones. In the first category we compare phrase-based statistical MT (SMT) and neural MT (NMT) approaches; for the latter we experiment with three different neural architectures. As a result of this comparison, we conclude that while NMT is a promising research direction, it requires far more computational resources and, perhaps, even more data to achieve the level of accuracy offered by SMT.
As for the linguistically motivated and hybrid approaches, we compare a rule-based approach with a so-called factored model, which is essentially an SMT model that takes into account various linguistic factors, such as parts of speech, lemmata, morphology, etc. Although this comparison shows that factored models should be strongly favored, we must note that the Russian-Kazakh pair for the rule-based system used in the experiment is still a work in progress. Lastly, a final comparison between the best-performing models from each category, i.e. the purely data-driven SMT model and the hybrid factored model, favored the former. While we acknowledge that the present work makes no significant contribution to NLP research in general, we want to point out that, to the best of our knowledge, experiments on NMT and factored SMT have never before been performed for the particular language pair considered herein. We speculate that one possible reason for this is the absence of an accessible Russian-Kazakh parallel corpus that is suitable for such experiments in terms of both size and quality. With this in mind, we also provide a detailed description of the parallel data set that we used for our experiments, which we plan to make publicly available in the future.
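For readers unfamiliar with factored SMT, systems such as Moses consume input in which each surface token is annotated with additional pipe-separated factors; the analyses below are hard-coded toy examples, not the output of a real tagger:

```python
def to_factored(tokens):
    """Render (surface, lemma, POS) triples in the pipe-separated
    surface|lemma|POS format conventionally used for factored SMT input."""
    return " ".join("|".join(t) for t in tokens)

# Hypothetical analysis of the Russian sentence "я читаю книги" ("I read books").
analyzed = [("я", "я", "PRON"),
            ("читаю", "читать", "VERB"),
            ("книги", "книга", "NOUN")]
print(to_factored(analyzed))  # -> я|я|PRON читаю|читать|VERB книги|книга|NOUN
```

Training on such factors lets the model back off from sparse surface forms to lemmas and part-of-speech tags, which is precisely the kind of linguistic information the factored model above exploits.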
