32 research outputs found

    Collocation segmentation for text chunking

    Methods for splitting text into segments of various types are widely used in text processing. Both statistical and formal methods are used for segmentation. The dissertation presents a new type and method of segmentation, collocation segmentation, and describes its applications in several areas of text processing. Applied to lexicography, collocation segmentation shows how very large text archives can be analyzed objectively and quickly to detect the terminology in use, and to track the importance of these automatically identified terms and how it changes over time. This analysis makes it possible to quickly identify important methodological shifts in the history of research and to identify research areas of current interest. The text classification application shows how collocation segmentation can improve classification results. Collocation segmentation is also shown to slightly improve the quality of statistical machine translation, and the influence of various word combinability measures on collocation segmentation is examined. The new collocation segmentation method opens new possibilities for improving text processing results in various applications and languages.
    Segmentation is a widely used paradigm in text processing. Rule-based, statistical and hybrid methods are employed to perform the segmentation. This dissertation introduces a new type of segmentation, collocation segmentation, and a new method to perform it, and applies them to three different text processing tasks. In lexicography, collocation segmentation makes it possible to use large corpora to evaluate the usage and importance of terminology over time. Text categorization results can be improved using collocation segmentation. The study shows that collocation segmentation, without any other language resources, achieves better results than the widely used n-gram techniques together with POS (Part-of-Speech) processing tools. Also, preprocessing data with collocation segmentation and subsequently integrating these segments into a Statistical Machine Translation system improves the translation results. Different word combinability measures influence the final collocation segmentation in different ways and, thus, the translation results. The new collocation segmentation method is simple, efficient and applicable to language processing for diverse applications.
    Vytauto Didžiojo universitetas

    Opening to infinity: machine translation and Lithuanian language

    Machine translation is a complex task that draws on theoretical and practical methods from several fields: information technology, linguistics, psychology and philosophy. Translating automatically greatly reduces the financial and time costs of translation. Five areas of machine translation use can be distinguished: 1) individual translation; 2) occasional translation; 3) individual professional translation; 4) industrial professional translation; 5) translation of document search queries. Machine translation can be divided into formal and statistical. Formal machine translation is divided into three types: direct, transfer-based and interlingual. Its stages are morphological analysis and disambiguation; syntactic analysis; and transformation. Statistical translation can be phrase-based or word-based. Statistical machine translation can convey the general content fairly accurately, although not everything will be translated logically. The main problem of this kind of translation is the size of parallel corpora, which limits vocabulary and grammar. The article also discusses the Vytauto Didžiojo universitetas project „Internetinė informacijos vertimo priemonė“ and the evaluation of machine translation quality. For machine translation, the quality indicator should be the percentage of sentences rated three points or higher.
    The article dwells on the most important landmarks in the history of machine translation (MT). In the middle of the twentieth century, when discussions about machine translation started, it was thought that within several years every person would have a small device allowing people to communicate in any language. Today the view of machine translation has become more pragmatic. For example, everyone agrees that machine translation of fiction will never be of satisfactory quality. The article discusses three key methods of machine translation, namely statistical, rule-based and logical. For practical purposes, several methods of machine translation are often integrated. In rule-based machine translation the most important role is played by dictionaries and sets of rules. The article demonstrates that morphological, syntactic and semantic ambiguities are the main obstacles to creating a high-quality formal machine translation product. Statistical machine translation is based on data from parallel and monolingual corpora. While statistical machine translation is good at capturing the key words of a sentence, the overall quality of a translation remains low. The limited size of parallel corpora is the main obstacle to statistical machine translation. The article also discusses evaluation criteria for assessing MT quality, which is often very problematic.
    Vytauto Didžiojo universitetas

    Applying Collocation Segmentation to the ACL Anthology Reference Corpus

    Collocation is a well-known linguistic phenomenon which has a long history of research and use. In this study I employ collocation segmentation to extract terms from the large and complex ACL Anthology Reference Corpus, and also briefly research and describe the history of the ACL. The results of the study show that until 1986, the most significant terms were related to formal/rule-based methods. Starting in 1987, terms related to statistical methods became more important, for instance language model, similarity measure, and text classification. In 1990, the terms Penn Treebank, Mutual Information, statistical parsing, bilingual corpus, and dependency tree became the most important, showing that newly released language resources appeared together with many new research areas in computational linguistics. Although Penn Treebank was a significant term only temporarily in the early nineties, the corpus is still used by researchers today. The most recent significant terms are Bleu score and semantic role labeling. While machine translation as a term is significant throughout the ACL ARC corpus, it is not significant for any particular time period. This shows that some terms can be significant globally while remaining insignificant at a local level.

    The influence of collocation segmentation and top 10 items to keyword assignment performance

    Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method, comparing them against other language independent methods based on single words and bigrams. Testing our new method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 for 5 keywords assigned to a document. Further filtering out the top 10 frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent on average over the 21 languages tested.
    Sistemų analizės katedra, Vytauto Didžiojo universitetas

    Automatic identification of lexical units

    Sistemų analizės katedra, Vytauto Didžiojo universitetas

    Automatic multilingual annotation of EU legislation with Eurovoc descriptors

    Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language independent document annotation system based on features derived from a collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate the method, comparing it against other language independent methods based on single words and bigrams. Testing the method against the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC) using all descriptors found there, we attain improvements in keyword assignment precision from 50.7 to 57.6 percent over the three diverse languages tested (English, Lithuanian and Finnish). We found a high correlation of automatic assignment precision with document length and with language features such as inflectiveness and compounding.
    Sistemų analizės katedra, Vytauto Didžiojo universitetas

    Teksto skaidymas pastoviųjų junginių segmentais [Splitting text into collocation segments]

    The dissertation was prepared in 2008-2012 at Vytauto Didžiojo universitetas. Bibliography: p. 63-72.
    Sistemų analizės katedra, Vytauto Didžiojo universitetas

    Pradžia į begalybę : mašininis vertimas ir lietuvių kalba [A beginning toward infinity: machine translation and the Lithuanian language]

    The article dwells on the most important landmarks in the history of machine translation (MT). When discussions about machine translation started in the middle of the 20th century, many thought that in a few years every person would carry a small device allowing people to communicate in any language. Today the view of machine translation has become more pragmatic. For example, everyone agrees that machine translation of fiction will never be of satisfactory quality. The article discusses three key methods of machine translation: statistical, rule-based and logical. For practical purposes, several methods of machine translation are often integrated. In rule-based machine translation, the most important role is played by dictionaries and sets of rules. The article demonstrates that morphological, syntactic and semantic ambiguities are the main obstacles to creating a high-quality formal machine translation product. Statistical machine translation is based on data from parallel and monolingual corpora. While statistical machine translation is suitable for capturing the key words of a sentence, the overall quality of such translation remains low. The limited size of parallel corpora is the main obstacle to machine translation. The article also discusses evaluation criteria for assessing MT quality, which is often very problematic.