
    A high-speed electronic dictionary based on collision-free hash addressing

    The aim of the research presented in this diploma project is to increase the speed of the electronic dictionaries used in intelligent computer translation systems by employing the fastest type of search, hash addressing. To raise search speed, collision-free hash addressing is proposed. A hash transformation that produces no collisions can be found quickly because the memory address space is deliberately sparsified. Contextual information for the keywords is stored in the free gaps between the occupied hash addresses. The project develops a procedure for selecting a collision-free hash transformation for a given array of keywords, a scheme for placing words and their accompanying information at hash addresses, and a scheme for looking up contextual information by key. The results can be used to increase the efficiency of intelligent computer translation systems.
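The selection procedure the abstract describes can be sketched as a randomized search: with a sparse table (low load factor), a simple parameterized transform that maps every key to a distinct address is easy to find. The multiplicative form `(a*x + b) % m`, the byte-wise key encoding, and the default load factor below are illustrative assumptions, not the project's actual transform.

```python
import random

def find_collision_free_hash(keys, load_factor=0.5, max_tries=10_000):
    """Search for parameters (a, b, m) of a multiplicative hash
    h(k) = (a*encode(k) + b) % m that maps every key to a distinct
    address. Sparsifying the table (m well above len(keys)) makes
    such a collision-free transform easy to find by random search."""
    m = int(len(keys) / load_factor)  # sparse address space
    # encode each key string as an integer (little-endian bytes)
    encoded = [sum(byte << (8 * i) for i, byte in enumerate(k.encode()))
               for k in keys]
    for _ in range(max_tries):
        a, b = random.randrange(1, m), random.randrange(m)
        addresses = {(a * x + b) % m for x in encoded}
        if len(addresses) == len(keys):  # all addresses distinct
            return a, b, m
    raise RuntimeError("no collision-free transform found; lower the load factor")
```

With half the addresses left unused, the free slots between occupied hash addresses are available for the contextual information attached to each keyword, as the abstract describes.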

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
    * We release 'language packs' for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman-script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.
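The compose-then-prune idea in the last contribution can be sketched as follows. This is a toy stand-in, not the thesis's algorithm: the `unigram_table` and `cooccurrence` dictionary are hypothetical inputs, and a simple bigram co-occurrence count substitutes for the rich comparable-corpora signals the thesis actually estimates.

```python
from itertools import product

def compose_phrase_translations(phrase, unigram_table, cooccurrence):
    """Compose candidate translations of a multiword phrase from the
    translations of its component unigrams (Cartesian product), then
    prune the hypothesis space with a target-side co-occurrence signal,
    a toy proxy for statistics estimated over comparable corpora."""
    words = phrase.split()
    candidate_sets = [unigram_table.get(w, []) for w in words]
    hypotheses = [" ".join(combo) for combo in product(*candidate_sets)]

    def score(hypothesis):
        # sum of adjacent-pair co-occurrence counts in the target corpus
        toks = hypothesis.split()
        return sum(cooccurrence.get((x, y), 0) for x, y in zip(toks, toks[1:]))

    # discard hypotheses whose words never co-occur; rank the rest
    return sorted((h for h in hypotheses if score(h) > 0),
                  key=score, reverse=True)
```

The Cartesian product grows quickly with phrase length, which is exactly why the thesis needs a pruning step over the hypothesis space.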

    Unsupervised machine translation

    192 p. Modern machine translation relies on strong supervision in the form of parallel corpora. Such a requirement greatly departs from the way in which humans acquire language, and poses a major practical problem for low-resource language pairs. In this thesis, we develop a new paradigm that removes the dependency on parallel data altogether, relying on nothing but monolingual corpora to train unsupervised machine translation systems. For that purpose, our approach first aligns separately trained word representations in different languages based on their structural similarity, and uses them to initialize either a neural or a statistical machine translation system, which is further trained through iterative back-translation. While previous attempts at learning machine translation systems from monolingual corpora had strong limitations, our work, along with other contemporaneous developments, is the first to report positive results in standard, large-scale settings, establishing the foundations of unsupervised machine translation and opening exciting opportunities for future research.
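The iterative back-translation loop at the core of this approach can be sketched with toy components. Everything below is a hypothetical stand-in: word-substitution lexicons replace the neural/statistical models, positional alignment replaces real training, and the initial cross-lingual embedding alignment step is omitted entirely.

```python
def translate(lexicon, sentence):
    """Word-by-word translation with a toy lexicon (stand-in for the
    neural or statistical model in the thesis)."""
    return " ".join(lexicon.get(w, w) for w in sentence.split())

def learn_lexicon(pairs):
    """'Train' a toy model from sentence pairs via a positional
    word-alignment heuristic (stand-in for real MT training)."""
    lexicon = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            lexicon.setdefault(s, t)
    return lexicon

def iterative_back_translation(seed_s2t, seed_t2s, mono_src, mono_tgt, rounds=2):
    """Alternate: back-translate monolingual text with one direction's
    model to create synthetic pairs, then retrain the other direction."""
    s2t, t2s = dict(seed_s2t), dict(seed_t2s)
    for _ in range(rounds):
        # synthetic (source', target) pairs from target monolingual data
        synthetic = [(translate(t2s, t), t) for t in mono_tgt]
        s2t.update(learn_lexicon(synthetic))
        # symmetric direction: synthetic (target', source) pairs
        synthetic = [(translate(s2t, s), s) for s in mono_src]
        t2s.update(learn_lexicon(synthetic))
    return s2t, t2s
```

Even in this toy, a small seed lexicon grows as each direction's output supervises the other, which is the mechanism the thesis exploits at scale after initializing the models from aligned word representations.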