    MultiMWE: building a multi-lingual multi-word expression (MWE) parallel corpora

    Multi-word expressions (MWEs) are a hot topic in natural language processing (NLP) research, including topics such as MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as Machine Translation (MT). However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU project. This is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. Our collections contain 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively after filtering. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performance on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora in their own models or use them in a knowledge base as model features.

    A BERT-based dual embedding model for Chinese idiom prediction

    Chinese idioms are special fixed phrases usually derived from ancient stories, whose meanings are often highly idiomatic and non-compositional. The Chinese idiom prediction task is to select the correct idiom from a set of candidate idioms given a context with a blank. We propose a BERT-based dual embedding model to encode the contextual words as well as to learn dual embeddings of the idioms. Specifically, we first match the embedding of each candidate idiom with the hidden representation corresponding to the blank in the context. We then match the embedding of each candidate idiom with the hidden representations of all the tokens in the context through context pooling. We further propose to use two separate idiom embeddings for the two kinds of matching. Experiments on a recently released Chinese idiom cloze test dataset show that our proposed method performs better than the existing state of the art. Ablation experiments also show that both context pooling and dual embedding contribute to the improvement of performance. Comment: COLING 202

    InDEX: Indonesian Idiom and Expression Dataset for Cloze Test

    We propose InDEX, an Indonesian Idiom and Expression dataset for cloze test. The dataset contains 10,438 unique sentences for 289 idioms and expressions, for which we generate 15 different types of distractors, resulting in a large cloze-style corpus. Many baseline models of cloze test reading comprehension apply BERT with random initialization to learn embedding representations. However, idioms and fixed expressions are different in that the literal meaning of a phrase may or may not be consistent with its contextual meaning. Therefore, we explore different ways to combine static and contextual representations for a stronger baseline model. Experiments show that combining definition and random initialization better supports cloze test model performance for idioms, whether independently or mixed with fixed expressions. For fixed expressions with no special meaning, by contrast, static embedding with random initialization is sufficient for the cloze test model. Comment: Accepted to "2022 International Conference on Asian Language Processing (IALP)"

    Phraseology in Corpus-Based Translation Studies: A Stylistic Study of Two Contemporary Chinese Translations of Cervantes's Don Quijote

    The present work sets out to investigate the stylistic profiles of two modern Chinese versions of Cervantes’s Don Quijote (I): by Yang Jiang (1978), the first direct translation from Castilian to Chinese, and by Liu Jingsheng (1995), which is one of the most commercially successful versions of the Castilian literary classic. This thesis focuses on a detailed linguistic analysis carried out with the help of the latest textual analytical tools, natural language processing applications and statistical packages. The type of linguistic phenomenon singled out for study is four-character expressions (FCEXs), which are a very typical category of Chinese phraseology. The work opens with the creation of a descriptive framework for the annotation of linguistic data extracted from the parallel corpus of Don Quijote. Subsequently, the classified and extracted data are put through several statistical tests. The results of these tests prove to be very revealing regarding the different use of FCEXs in the two Chinese translations. The computational modelling of the linguistic data would seem to indicate that among other findings, while Liu’s use of archaic idioms has followed the general patterns of the original and also of Yang’s work in the first half of Don Quijote I, noticeable variations begin to emerge in the second half of Liu’s more recent version. Such an idiosyncratic use of archaisms by Liu, which may be defined as style shifting or style variation, is then analyzed in quantitative terms through the application of the proposed context-motivated theory (CMT). The results of applying the CMT-derived statistical models show that the detected stylistic variation may well point to the internal consistency of the translator in rendering the second half of Part I of the novel, which reflects his freer, more creative and experimental style of translation. 
Through the introduction and testing of quantitative research methods adapted from corpus linguistics and textual statistics, this thesis has made a major contribution to methodological innovation in the study of style within the context of corpus-based translation studies.