
    D8.6 Dissemination, training and exploitation results

    Mauerhofer, C., Rajagopal, K., & Greller, W. (2011). D8.6 Dissemination, training and exploitation results. LTfLL-project. Report on sustainability, dissemination and exploitation of the LTfLL project. The work on this publication has been sponsored by the LTfLL STREP that is funded by the European Commission's 7th Framework Programme. Contract 212578. http://www.ltfll-project.org

    Creating a German–Basque electronic dictionary for German learners

    In this paper, we introduce the new electronic dictionary project EuDeLex, which is currently being developed at UPV/EHU (University of the Basque Country). The introduction addresses the need for and functions of a new electronic dictionary for that language pair, as well as general considerations about bilingual lexicography and German as a foreign language (GFL). The language pair German–Basque, which can be called less-resourced or medium-density, has no lexicographical antecedents that could be updated or adapted. Nevertheless, existing monolingual lexicographical databases and a newly created German–Basque parallel corpus support the editing process of the new dictionary. We explain our workflow in macrostructure and microstructure design and editing, and propose a first iteration of the online user interface and publishing process.
    Keywords: bilingual lexicography, electronic dictionaries, Basque language, German as a foreign language, parallel corpora, user interface, Wiktionary
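
    The abstract does not spell out the entry schema; as a rough, purely illustrative sketch of the kind of microstructure a German–Basque learner's dictionary might encode (all class and field names below are hypothetical, not taken from EuDeLex), consider:

        from dataclasses import dataclass, field

        @dataclass
        class Translation:
            """One Basque equivalent of a German headword sense (hypothetical schema)."""
            basque: str
            usage_note: str = ""   # register, domain, or collocational restriction

        @dataclass
        class Sense:
            """A single sense in the microstructure, with equivalents and an example."""
            gloss: str             # short German disambiguator for the learner
            translations: list[Translation] = field(default_factory=list)
            example_de: str = ""   # German example sentence
            example_eu: str = ""   # its Basque rendering (e.g. from the parallel corpus)

        @dataclass
        class Entry:
            """A macrostructure item: one German headword with grammar and senses."""
            headword: str          # German lemma
            pos: str               # part of speech
            gender: str = ""       # for nouns: m/f/n
            senses: list[Sense] = field(default_factory=list)

        # Example entry, loosely in the spirit of a GFL learner's dictionary:
        haus = Entry(
            headword="Haus", pos="noun", gender="n",
            senses=[Sense(gloss="Gebäude",
                          translations=[Translation(basque="etxe")],
                          example_de="Wir kaufen ein Haus.",
                          example_eu="Etxe bat erosten dugu.")],
        )
        print(haus.headword, "->", haus.senses[0].translations[0].basque)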

    Wiktionary Matcher

    In this paper, we introduce Wiktionary Matcher, an ontology matching tool that exploits Wiktionary as an external background knowledge source. Wiktionary is a large lexical knowledge resource that is collaboratively built online. Multiple current language versions of Wiktionary are merged and used for monolingual ontology matching by exploiting synonymy relations, and for multilingual matching by exploiting the translations given in the resource. We show that Wiktionary can be used as an external background knowledge source for the task of ontology matching with reasonable matching and runtime performance.
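
    The matcher itself is not reproduced in the abstract; a minimal sketch of the underlying idea, aligning two ontologies' labels through a synonym table, might look as follows (the toy SYNONYMS dictionary stands in for the merged Wiktionary editions):

        # Minimal sketch of synonymy-based label matching, assuming a precomputed
        # synonym table; in Wiktionary Matcher this knowledge comes from merged
        # Wiktionary language editions, not from a hand-written dictionary.

        # Toy background knowledge: each label maps to its set of synonyms.
        SYNONYMS = {
            "car": {"automobile", "motorcar"},
            "illness": {"disease", "sickness"},
        }

        def synsets(label: str) -> set[str]:
            """Return the label plus its known synonyms, lower-cased."""
            label = label.lower()
            return {label} | SYNONYMS.get(label, set())

        def match(onto_a: dict[str, str], onto_b: dict[str, str]) -> list[tuple[str, str]]:
            """Align two {entity_id: label} ontologies: entities whose synonym
            sets overlap are proposed as an equivalence correspondence."""
            alignment = []
            for id_a, label_a in onto_a.items():
                for id_b, label_b in onto_b.items():
                    if synsets(label_a) & synsets(label_b):
                        alignment.append((id_a, id_b))
            return alignment

        onto_a = {"a:Car": "car", "a:Illness": "illness"}
        onto_b = {"b:Automobile": "automobile", "b:Disease": "disease"}
        print(match(onto_a, onto_b))  # [('a:Car', 'b:Automobile'), ('a:Illness', 'b:Disease')]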

    Wiktionary Matcher results for OAEI 2020

    This paper presents the results of the Wiktionary Matcher in the Ontology Alignment Evaluation Initiative (OAEI) 2020. Wiktionary Matcher is an ontology matching tool that exploits Wiktionary as an external background knowledge source. Wiktionary is a large lexical knowledge resource that is collaboratively built online. Multiple current language versions of Wiktionary are merged and used for monolingual ontology matching by exploiting synonymy relations, and for multilingual matching by exploiting the translations given in the resource. This is the second OAEI participation of the matching system. Wiktionary Matcher has been improved and is the best performing system on the knowledge graph track this year.

    Syntax-based Transfer Learning for the Task of Biomedical Relation Extraction

    Transfer learning (TL) proposes to enhance machine learning performance on a problem by reusing labeled data originally designed for a related problem. In particular, domain adaptation consists in reusing, for a given task, training data developed for the same task but in a distinct domain. This is particularly relevant to applications of deep learning in Natural Language Processing, which usually require large annotated corpora that may not exist for the targeted domain but do exist for related domains. In this paper, we experiment with TL for the task of Relation Extraction (RE) from biomedical texts, using the TreeLSTM model. We empirically show the impact of TreeLSTM alone and with domain adaptation, obtaining better performance than the state of the art on two biomedical RE tasks and equal performance on two others, for which few annotated data are available. Furthermore, we propose an analysis of the role that syntactic features may play in TL for RE.
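
    The paper's TreeLSTM architecture is not reproduced here; the sketch below only illustrates the domain-adaptation recipe it builds on (pretrain on plentiful source-domain data, then fine-tune the same weights on the small target-domain set), with a tiny stand-in classifier and random tensors instead of real biomedical corpora:

        # Sketch of transfer learning by domain adaptation: pretrain on plentiful
        # source-domain data, then fine-tune the same weights on scarce
        # target-domain data. A small feed-forward classifier over fixed-size
        # "sentence vectors" stands in for the paper's TreeLSTM; all data here
        # is random and purely illustrative.
        import torch
        from torch import nn

        torch.manual_seed(0)
        DIM, CLASSES = 32, 2
        model = nn.Sequential(nn.Linear(DIM, 64), nn.ReLU(), nn.Linear(64, CLASSES))
        loss_fn = nn.CrossEntropyLoss()

        def train(x, y, epochs, lr):
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            for _ in range(epochs):
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
            return loss.item()

        # Step 1: pretrain on the large source-domain corpus (e.g. general-domain RE).
        x_src, y_src = torch.randn(1000, DIM), torch.randint(0, CLASSES, (1000,))
        print("source loss:", train(x_src, y_src, epochs=50, lr=1e-3))

        # Step 2: fine-tune the same model on the small target-domain corpus
        # (e.g. a biomedical RE task with few annotated examples), at a lower
        # learning rate so pretrained weights are adapted rather than overwritten.
        x_tgt, y_tgt = torch.randn(50, DIM), torch.randint(0, CLASSES, (50,))
        print("target loss:", train(x_tgt, y_tgt, epochs=20, lr=1e-4))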

    Size of corpora and collocations: The case of Russian

    With the arrival of information technologies in linguistics, compiling a large corpus of data, and of web texts in particular, has become a mere technical matter. These new opportunities have revived the question of corpus volume, which can be formulated as follows: are larger corpora better for linguistic research or, more precisely, do lexicographers need to analyze larger numbers of collocations? The paper deals with experiments on collocation identification in low-frequency lexis using corpora of different volumes (1 million, 10 million, 100 million and 1.2 billion words). We selected low-frequency adjectives, nouns and verbs in the Russian Frequency Dictionary and tested the following hypotheses: 1) collocations in low-frequency lexis are better represented by larger corpora; 2) frequent collocations presented in dictionaries have low occurrence counts in small corpora; 3) statistical measures for collocation extraction behave differently in corpora of different volumes. The results show that corpora of under 100 million words are not representative enough to study collocations, especially those with nouns and verbs. MI and Dice tend to extract less reliable collocations as corpus volume grows, whereas t-score and Fisher's exact test demonstrate better results for larger corpora.
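
    The association measures compared in the paper are standard scores over co-occurrence counts; a minimal sketch of three of them, computed from raw frequencies (Fisher's exact test omitted; scipy.stats.fisher_exact would serve there), could be:

        # f_xy = co-occurrence frequency of the word pair, f_x and f_y = the
        # words' corpus frequencies, n = corpus size in tokens.
        import math

        def mi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
            """Pointwise mutual information (base 2), as standard in collocation work."""
            return math.log2(f_xy * n / (f_x * f_y))

        def dice(f_xy: int, f_x: int, f_y: int) -> float:
            """Dice coefficient: harmonic association of the pair's frequencies."""
            return 2 * f_xy / (f_x + f_y)

        def t_score(f_xy: int, f_x: int, f_y: int, n: int) -> float:
            """t-score: observed minus expected co-occurrence, scaled by sqrt(f_xy)."""
            return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)

        # Toy pair in a 1-million-token corpus: MI rewards exclusivity of the
        # pair, while t-score rewards sheer frequency of co-occurrence, which
        # is one reason the measures diverge as corpus size changes.
        print(mi(30, 100, 200, 1_000_000))       # ~10.55: strongly associated
        print(dice(30, 100, 200))                # 0.2
        print(t_score(30, 100, 200, 1_000_000))  # ~5.47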

    Laughing one's head off in Spanish subtitles: a corpus-based study on diatopic variation and its consequences for translation

    Looking for phraseological information is common practice among translators. When rendering idioms, information is mostly needed to find the appropriate equivalent, but also to check usage and diasystemic restrictions. One of the most complex issues in this respect is diatopic variation. English and Spanish are transnational languages spoken in several countries around the globe. Cross-variety differences as regards idiomaticity range from the actual choice of phraseological units to different lexical or grammatical variants, usage preferences and differential distribution. In this respect, translators are severely underequipped as regards the information found in dictionaries. While some diatopic marks are generally used to indicate geographical restrictions, not all idioms are clearly identified, and very little information is provided about preferences and/or crucial differences that occur when the same idiom is used in various national varieties. In translation, source language textemes usually turn into target language repertoremes, i.e. established units within the target system. Toury's law of growing standardisation helps explain why translated texts tend to be simpler, more conventional and more prototypical than non-translated texts, among other characteristic features. Provided that a substantial part of translational Spanish is composed of textual repertoremes, any source textemes are bound to be 'dissolved' into typical ways of expression in 'standard' Spanish. This means filtering source idiomatic diatopy through the 'neutral, standard sieve'. This paper delves into the rendering into Spanish of the English idiom to laugh one's head off. After a cursory look at the notions of transnational and translational Spanish(es) in Section 2, Section 3 analyses the translation strategies deployed in a giga-token parallel subcorpus of Spanish–English subtitles. In Section 4, dictionary and textual equivalents retrieved from the parallel corpus are studied against the background of two sets of synonymous idioms for 'laughing out loud' in 19 giga-token comparable subcorpora of Spanish national varieties. Corpas Pastor's (2015) corpus-based research protocol is adopted in order to uncover varietal differences, detect diatopic configurations and derive consequences for contrastive studies and translation, as summarised in Section 5. This is, to the best of our knowledge, the first study investigating the translation of to laugh one's head off and analysing the Spanish equivalent idioms in national and transnational corpora.

    INEX Tweet Contextualization Task: Evaluation, Results and Lesson Learned

    Microblogging platforms such as Twitter are increasingly used for online client and market analysis. This motivated the proposal of a new Tweet Contextualization track at the CLEF INEX lab. The objective of this task was to help a user understand a tweet by providing a short explanatory summary (500 words). This summary was to be built automatically, using resources like Wikipedia, by extracting relevant passages and aggregating them into a coherent summary. Over the task's four-year run, results show that the best systems combine NLP techniques with more traditional methods. More precisely, the best performing systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, part-of-speech (POS) analysis, anaphora detection, a diversity content measure, and sentence reordering. This paper provides a full summary report on the four-year-long task. While the yearly overviews focused on system results, in this paper we report in detail on the approaches proposed by the participants, which can be considered the state of the art for this task. As an important outcome of the four-year competition, we also describe the open-access resources that have been built and collected. The evaluation measures for automatic summarization designed in DUC or MUC were not appropriate for evaluating tweet contextualization; we explain why and describe in detail the LogSim measure used to evaluate the informativeness of the produced contexts or summaries. Finally, we also mention the lessons we learned, which are worth considering when designing such a task.
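
    The exact LogSim definition is given in the paper and is not reproduced here; purely as an illustration of the kind of log-weighted lexical overlap such an informativeness measure relies on, a simplified stand-in might be:

        # Rough illustration only, NOT the official LogSim measure: score a
        # produced summary by log-scaled term overlap with a pooled reference
        # text, capturing the intuition that informativeness rewards shared
        # vocabulary with diminishing returns on repeated terms.
        import math
        from collections import Counter

        def log_overlap(summary: str, reference: str) -> float:
            s = Counter(summary.lower().split())
            r = Counter(reference.lower().split())
            score = sum(math.log(1 + min(s[t], r[t])) for t in s.keys() & r.keys())
            norm = sum(math.log(1 + c) for c in r.values())
            return score / norm if norm else 0.0

        ref = "the eruption of the volcano forced thousands of residents to evacuate"
        print(log_overlap("volcano eruption forced residents to evacuate", ref))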

    Multiword expressions at length and in depth

    The annual workshop on multiword expressions has taken place since 2001 in conjunction with major computational linguistics conferences and attracts the attention of an ever-growing community working on a variety of languages, linguistic phenomena and related computational processing issues. MWE 2017 took place in Valencia, Spain, and represented a vibrant panorama of the current research landscape on the computational treatment of multiword expressions, featuring many high-quality submissions. Furthermore, MWE 2017 included the first shared task on multilingual identification of verbal multiword expressions. The shared task, with its extended communal work, has developed important multilingual resources and mobilised several research groups in computational linguistics worldwide. This book contains extended versions of selected papers from the workshop. Authors worked hard to include detailed explanations, broader and deeper analyses, and new exciting results, which were thoroughly reviewed by an internationally renowned committee. We hope that this distinctly joint effort will provide a meaningful and useful snapshot of the multilingual state of the art in multiword expression modelling and processing, and will be a point of reference for future work.