826 research outputs found

    MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora

    Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research, including topics such as MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as Machine Translation. However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. Our collections are 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performance on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora for their own models or use them in a knowledge base as model features.

    Mixed up with machine Translation: Multi-word Units Disambiguation Challenge.

    With the rapid evolution of the Internet, translation has become part of the daily life of ordinary users, not only of professional translators. Machine translation has evolved along with different types of computer-assisted translation tools. Qualitative progress has been made in the field of machine translation, but not all problems have been solved. The current times are auspicious for the development of more sophisticated evaluation tools that measure the performance of specific linguistic phenomena. One problem in particular, namely the poor analysis and translation of multi-word units, is an arena where investment in linguistic knowledge systems with the goal of improving machine translation would be beneficial. This paper addresses the difficulties multi-word units present to machine translation by comparing translations performed by systems adopting different approaches to machine translation. It proposes a solution for improving the quality of the translation of multi-word units by adopting a methodology that combines Lexicon Grammar resources with OpenLogos lexical resources and semantico-syntactic rules. Finally, it discusses how an ideal machine translation evaluation tool might look in order to correctly evaluate the performance of machine translation engines with regard to multi-word units and thus to contribute to the improvement of translation quality.

    Using parallel text for the extraction of German multiword expressions

    A procedure for the identification of semantically opaque (i.e. idiomatic) German multiwords is presented. We focus on verb + PP combinations that are lexicographically relevant (extracted via dependency parsing [Schiehlen 2003]) of the kind ins Leben rufen – “to initiate”, lit.: “to call into life”. Starting from [Villada Moirón and Tiedemann 2006], the method exploits the fact that opaque combinations are translated as a whole, whereas compositional uses would show regular, individual translations of the words involved. The translations into other languages are obtained by applying GIZA++ [Och and Ney 2003] word alignment to the EUROPARL corpus [Koehn 2005]. Numerous experiments are performed to further optimise the original method: several parameters are analysed individually as well as in combination with each other. This leads to the following results: depending on the actual parameter settings, values between 0.800 and 0.936 (in terms of uninterpolated average precision) are reached amongst the highest scoring 200 multiword candidates, as opposed to a baseline of 0.584, using the 200 most frequent multiwords in decreasing order of their occurrence frequency
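
    The scores quoted in the abstract above are uninterpolated average precision over the 200 highest-ranked candidates. As a minimal sketch of that metric (not the authors' evaluation code), the function below averages precision at each rank where a true multiword appears. The function name and the boolean-list input format are assumptions made for illustration.

```python
from typing import Sequence


def uninterpolated_average_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant item.

    `relevance` is a ranked list (best candidate first); each entry says
    whether the candidate at that rank is a true positive. The input
    format is a hypothetical choice for this sketch.
    """
    hits = 0          # relevant items seen so far
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # With no relevant item in the list, define the score as 0.0.
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Toy ranking: true multiwords at ranks 1 and 3.
    print(uninterpolated_average_precision([True, False, True]))
```

    Because each hit is weighted by the precision at its own rank, a ranking that places true multiwords near the top scores higher than one with the same number of hits further down the list.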

    Implementing universal dependency, morphology, and multiword expression annotation standards for Turkish language processing

    Released only a year ago as the outputs of a research project (“Parsing Web 2.0 Sentences”, supported in part by a TUBİTAK 1001 grant (No. 112E276) and a part of the ICT COST Action PARSEME (IC1207)), IMST and IWT are currently the most comprehensive Turkish dependency treebanks in the literature. This article introduces the final states of our treebanks, as well as a newly integrated hierarchical categorization of the multiheaded dependencies and their organization in an exclusive deep dependency layer in the treebanks. It also presents the adaptation of recent studies on standardizing multiword expression and named entity annotation schemes for the Turkish language and integration of benchmark annotations into the dependency layers of our treebanks and the mapping of the treebanks to the latest Universal Dependencies (v2.0) standard, ensuring further compliance with rising universal annotation trends. In addition to significantly boosting the universal recognition of Turkish treebanks, our recent efforts have shown an improvement in their syntactic parsing performance (up to 77.8%/82.8% LAS and 84.0%/87.9% UAS for IMST/IWT, respectively). The final states of the treebanks are expected to be more suited to different natural language processing tasks, such as named entity recognition, multiword expression detection, transfer-based machine translation, semantic parsing, and semantic role labeling.
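
    For readers unfamiliar with the LAS and UAS figures in the abstract above, the sketch below shows how the two attachment scores are conventionally computed: UAS counts tokens whose predicted head is correct, while LAS additionally requires the dependency label to match. This is an illustration under assumptions, not the treebank authors' evaluation script; the `(head, label)` tuple format and the function name are hypothetical.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions in [0, 1] for aligned token lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # UAS: only the head has to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: head and label both have to match.
    correct_arcs = sum(g == p for g, p in zip(gold, pred))
    return correct_heads / total, correct_arcs / total


if __name__ == "__main__":
    gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
    pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]  # right head, wrong label
    uas, las = attachment_scores(gold, pred)
    print(f"UAS={uas:.3f} LAS={las:.3f}")
```

    LAS can never exceed UAS, since every arc counted for LAS also counts for UAS; this matches the pattern of the scores reported for IMST and IWT.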

    Towards Universal Semantic Tagging

    The paper proposes the task of universal semantic tagging---tagging word tokens with language-neutral, semantically informative tags. We argue that the task, with its independent nature, contributes to better semantic analysis for wide-coverage multilingual text. We present the initial version of the semantic tagset and show that (a) the tags provide semantically fine-grained information, and (b) they are suitable for cross-lingual semantic parsing. An application of the semantic tagging in the Parallel Meaning Bank supports both of these points as the tags contribute to formal lexical semantics and their cross-lingual projection. As a part of the application, we annotate a small corpus with the semantic tags and present a new baseline result for universal semantic tagging. Comment: 9 pages, International Conference on Computational Semantics (IWCS

    Non-Compositional Term Dependence for Information Retrieval

    Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly identified using lexical frequency statistics, assuming that (a) if terms co-occur often enough in some corpus, they are semantically dependent; (b) the more often they co-occur, the more semantically dependent they are. This assumption is not always correct: the frequency of co-occurring terms can be separate from the strength of their semantic dependence. E.g. "red tape" might be overall less frequent than "tape measure" in some corpus, but this does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is especially the case for non-compositional phrases, i.e. phrases whose meaning cannot be composed from the individual meanings of their terms (such as the phrase "red tape" meaning bureaucracy). Motivated by this lack of distinction between the frequency and strength of term dependence in IR, we present a principled approach for handling term dependence in queries, using both lexical frequency and semantic evidence. We focus on non-compositional phrases, extending a recent unsupervised model for their detection [21] to IR. Our approach, integrated into ranking using Markov Random Fields [31], yields effectiveness gains over competitive TREC baselines, showing that there is still room for improvement in the very well-studied area of term dependence in IR

    Multiword expression aware neural machine translation

    Multiword Expressions (MWEs) are a frequently occurring phenomenon found in all natural languages that is of great importance to linguistic theory, natural language processing applications, and machine translation systems. Neural Machine Translation (NMT) architectures do not handle these expressions well, and previous studies have not explicitly addressed MWEs in this framework. In this work, we show that by using external linguistic resources and data augmentation we can improve both the translation of MWEs that occur in the source and the generation of MWEs on the target, and improve performance by up to 5.09 BLEU points on MWE test sets. We also devise an MWE score to specifically assess the quality of MWE translation, which agrees with human evaluation. We make available the MWEscore implementation – along with MWE-annotated training sets and corpus-based lists of MWEs – for reproduction and extension.

    Image Semantics in the Description and Categorization of Journalistic Photographs

    This paper reports a study on the description and categorization of images. The aim of the study was to evaluate existing indexing frameworks in the context of reportage photographs and to find out how the use of this particular image genre influences the results. The effect of different tasks on image description and categorization was also studied. Subjects performed keywording and free description tasks and the elicited terms were classified using the most extensive one of the reviewed frameworks. Differences were found in the terms used in constrained and unconstrained descriptions. Summarizing terms such as abstract concepts, themes, settings and emotions were used more frequently in keywording than in free description. Free descriptions included more terms referring to locations within the images, people and descriptive terms due to the narrative form the subjects used without prompting. The evaluated framework was found to lack some syntactic and semantic classes present in the data and modifications were suggested. According to the results of this study image categorization is based on high-level interpretive concepts, including affective and abstract themes. The results indicate that image genre influences categorization and keywording modifies and truncates natural image description