24 research outputs found

    Doctoral thesis

    The electronic version does not include the appendices. This doctoral thesis describes the creation of a hybrid grammar model for the Latvian language and its subsequent conversion to the Universal Dependencies (UD) model. The thesis also opens a new direction in Latvian language research: studies based on syntactically annotated texts.
In this work, a fundamental language resource previously unavailable for Latvian was developed and evaluated: a machine-readable treebank of 17 thousand syntactically annotated sentences. The sentences are annotated according to two syntactic annotation models: the hybrid of phrase structure and dependency grammar developed in the thesis, and the internationally recognised UD model. Both annotated versions of the treebank are publicly available for download and for online querying. Over the course of the study, a set of tools and the infrastructure necessary for treebank creation were developed. Extensions to the IMCS UL experimental hybrid grammar model needed for broad language coverage were defined, and the options for converting data annotated according to the hybrid model to a dependency model were analysed. Based on this work, a derived UD treebank was created. The resulting treebank has served as the basis for developing high-accuracy (91%) parsers for Latvian. Furthermore, participation in the UD initiative has promoted the international recognition of resources for Latvian and other inflective languages, and the development of tools better suited to inflective languages in computational linguistics, a field that has historically been oriented towards analytic languages. Keywords: treebank, Universal Dependencies, language technologies

    Development of an English-Latvian statistical machine translation system: methods, resources and first results

    Summary: This paper presents the research and development of English-Latvian Statistical Machine Translation (SMT) prototypes for the legal domain. Several methods were investigated, namely phrase-based models and factored models. Translation quality was evaluated using an automated metric (BLEU score) and human evaluation. In automatic evaluation the best score (46.44 BLEU points) was achieved by the factored model trained on the JRC Acquis corpus (version 3.0), which was also rated best by human evaluators. In addition, an error analysis of the SMT output was performed. It showed that although the output of the best prototype was of reasonable quality, it contained several frequent errors: incorrect word forms, missing words and wrong word order. Future work on tree-based SMT and hybrid systems is proposed.

    A Prague Markup Language profile for the SemTi-Kamols grammar model

    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 303-306. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/1695

    Internship report in community pharmacy

    Internship report produced as part of the Integrated Master's in Pharmaceutical Sciences, presented to the Faculty of Pharmacy of the University of Coimbra

    An integrated system for the development of Latvian Treebank

    This work addresses the problem of developing a syntactically annotated text corpus (treebank), with the aim of establishing a sound technological base for developing a Latvian Treebank. It describes the general approaches to syntactic analysis, the phrase structure approach and the dependency approach, as well as the SemTi-Kamols hybrid dependency-based grammar model for languages with relatively free word order.
The experience of the world's largest treebanks is analysed, particularly the Prague Dependency Treebank (PDT) and its multi-level annotation structure. An extension of the SemTi-Kamols model has been developed to cover syntactically unrestricted Latvian sentences. A PML (Prague Markup Language) profile has been developed for representing SemTi-Kamols data in an internationally acknowledged machine-readable form. This XML-based format is integrated with the SemTi-Kamols parser and with the visual tree editor TrEd, which was used in building PDT. The main result is a technological and methodological base for creating a Latvian Treebank: a framework of integrated tools and formats that allows texts to be annotated according to the SemTi-Kamols model without requiring deep technical knowledge from the end user (a linguist). The framework has been applied in practice: approximately 200 sentences have been annotated with it.

    Universally trainable structural markup tool for explanatory dictionaries

    This work considers how to develop adaptable software for the structural markup of Latvian explanatory dictionaries, deriving from the visual formatting as accurate a representation of entry structure as possible in XML format, conforming to a given XML schema. The purpose of the tool is to allow a lexicographer with no knowledge of computer science or programming, including regular expressions, to obtain an automatic markup tool customised for a specific existing dictionary using the simplest, most intuitive means possible. The work describes both the operating principles of the markup tool and the implemented prototype.

    Dictionary and Thesaurus of Latvian - Tezaurs.lv (ELEXIS)

    Tēzaurs.lv: An extensive dictionary and thesaurus of Latvian, comprising more than 320,000 lexical entries, including multi-word units. Compiled and edited from more than 300 sources. Provides detailed morphological information and is being extended into a Latvian WordNet.

    Universal Dependencies 2.0 alpha (obsolete)

    This release contains errors in several files. Please use http://hdl.handle.net/11234/1-1983 instead

    Universal Dependencies 2.0

    Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008). This release is special in that the treebanks will be used as training/development data in the CoNLL 2017 shared task (http://universaldependencies.org/conll17/). Test data are not released, except for the few treebanks that do not take part in the shared task. 64 treebanks will be in the shared task, and they correspond to the following 45 languages: Ancient Greek, Arabic, Basque, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Gothic, Greek, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Kazakh, Korean, Latin, Latvian, Norwegian, Old Church Slavonic, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Turkish, Ukrainian, Urdu, Uyghur and Vietnamese. This release fixes a bug in http://hdl.handle.net/11234/1-1976. Changed files: ud-tools-v2.0.tgz (conllu_to_text.pl, conllu_to_conllx.pl; added text_without_spaces.pl), ud-treebanks-conll2017.tgz (fi_ftb-ud-train.txt, he-ud-train.txt, it-ud-train.txt, pt_br-ud-train.txt, es-ud-train.txt) and ud-treebanks-v2.0.tgz (fi_ftb-ud-train.txt, he-ud-train.txt, it-ud-train.txt, pt_br-ud-train.txt, es-ud-train.txt, ar_nyuad-ud-dev.txt, ar_nyuad-ud-test.txt, ar_nyuad-ud-train.txt, cop-ud-dev.txt, cop-ud-test.txt, cop-ud-train.txt, sa-ud-dev.txt, sa-ud-test.txt, sa-ud-train.txt)

    Universal Dependencies 1.4

    Universal Dependencies is a project that seeks to develop cross-linguistically consistent treebank annotation for many languages, with the goal of facilitating multilingual parser development, cross-lingual learning, and parsing research from a language typology perspective. The annotation scheme is based on (universal) Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal part-of-speech tags (Petrov et al., 2012), and the Interset interlingua for morphosyntactic tagsets (Zeman, 2008)