16 research outputs found

    The strategic impact of META-NET on the regional, national and international level

    This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for language technology (LT) topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field. Peer Reviewed. Postprint (author's final draft).

    Corpus-based automatic detection of example sentences for dictionaries for Estonian learners

    The electronic version of the dissertation does not contain the publications. The function of an example sentence in a dictionary is to help the reader understand the meaning of the headword and to illustrate its contexts of use. The main source of example sentences is a large text corpus, in which suitable sentences are very hard to find by hand. With the development of e-lexicography, tools have become available that automatically detect various kinds of information needed for dictionaries, including example sentences. The dissertation examines which parameters characterise the example sentences in the Dictionary of Estonian (2019), the Basic Estonian Dictionary (2014) and the Estonian Collocations Dictionary (2019), all compiled at the Institute of the Estonian Language, as well as the sentences of the Estonian Coursebook Corpus (A1–C1, 2018). The aim of the study is to develop a method that uses these parameters to automatically detect, in a corpus, sentences suitable for learners of Estonian. The work centres on a rule-based approach, applied using the Good Dictionary Examples (GDEX) tool integrated in the Sketch Engine corpus query system. Machine learning elements were also used in part to fine-tune the parameters. The analysis of the dictionary example sentences and the coursebook sentences showed that a good Estonian example sentence should be a full sentence that meets, inter alia, the following parameters: it is 4–20 tokens long; it contains no tokens longer than 20 characters; it does not begin with certain parts of speech (e.g., a conjunction), with an anaphoric word (e.g., sellepärast 'this is why') or with an anaphoric word pair (e.g., sellisel puhul 'in such a case'); and it contains no vulgar or disparaging words, low-frequency words, etc. The study resulted in the compilation of the Estonian Corpus for Learners 2018 (etSkELL), which contains only sentences that meet the developed parameters. The corpus, in turn, serves as the basis for the web tool Sketch Engine for Estonian Language Learning (etSkELL) and for the web sentences in Sõnaveeb, the language portal of the Institute of the Estonian Language.
    https://www.ester.ee/record=b530293

    24th Nordic Conference on Computational Linguistics (NoDaLiDa)


    CLARIN. The infrastructure for language resources

    CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors, representing various fields from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future. The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU).

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embedding