    Vietnamese to Chinese Machine Translation via Chinese Character as Pivot

    Multilingual Schema Matching for Wikipedia Infoboxes

    Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, and in particular, to enable answers that straddle different languages. As a step towards supporting such queries, in this paper, we propose a method for identifying mappings between attributes from infoboxes that come from pages in different languages. Our approach finds mappings in a completely automated fashion. Because it does not require training data, it is scalable: not only can it be used to find mappings between many language pairs, but it is also effective for languages that are under-represented and lack sufficient training samples. Another important benefit of our approach is that it does not depend on syntactic similarity between attribute names, and thus, it can be applied to language pairs that have distinct morphologies. We have performed an extensive experimental evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and English. The results show that not only does our approach obtain high precision and recall, but it also outperforms state-of-the-art techniques. We also present a case study which demonstrates that the multilingual mappings we derive lead to substantial improvements in answer quality and coverage for structured queries over Wikipedia content. Comment: VLDB201

    Building lexical resources: towards programmable contributive platforms

    Lexical resources are very important in today's society, given globalization and the increase in worldwide communication and exchanges. There are clearly identified needs, both for humans and machines. Nevertheless, very little effort is actually made in this domain. Consequently, there is a serious lack of freely available, good-quality resources, especially for under-resourced languages. Furthermore, the majority of existing bilingual dictionaries have English as one of their two languages. Therefore, to translate from one non-English language to another, one must use English as a pivot. Even for native English speakers, this creates many misunderstandings that can be critical in many situations. In order to create and extend freely available, good-quality, rich lexical resources for under-resourced languages online with a community of voluntary contributors, Jibiki, an online generic platform for managing (lookup, editing, import, export) any kind of lexical resource encoded in XML, has been developed. This platform is successfully used in several dictionary construction projects. Concerning the data, a serious game has been launched in order to collect valuable lexical information, such as collocations, that will later be integrated into dictionary entries. Work is now under way to extend our platform in order to reuse the resulting resources and enrich them by synchronization with other systems (language learners' and translators' environments, machine translation systems, etc.).

    Identifying Semantic Divergences in Parallel Text without Annotations

    Recognizing that even correct translations are not always semantically equivalent, we automatically detect meaning divergences in parallel sentence pairs with a deep neural model of bilingual semantic similarity which can be trained for any parallel corpus without any manual annotation. We show that our semantic model detects divergences more accurately than models based on surface features derived from word alignments, and that these divergences matter for neural machine translation. Comment: Accepted as a full paper to NAACL 201

    Crosslingual Document Embedding as Reduced-Rank Ridge Regression

    There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-word features in order to predict the concept that a given document is about. We show that, when constraining the learned weight matrix to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks. Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19
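The low-rank factoring idea in the Cr5 abstract above can be illustrated with a minimal sketch of generic reduced-rank ridge regression. This is not the authors' implementation: the function name `reduced_rank_ridge`, the regularization value, and the random toy data are assumptions made for illustration, and NumPy is assumed to be available. The sketch uses the textbook closed form, in which the full ridge solution is projected onto the leading eigenvectors of the regularized fitted-value covariance, so the weight matrix factors into a feature-to-embedding map `A` and an embedding-to-concept map `B`.

```python
import numpy as np

def reduced_rank_ridge(X, Y, lam, rank):
    """Rank-constrained ridge regression (illustrative sketch).

    Returns (A, B) with A of shape (d, rank) and B of shape (rank, c),
    so that the rank-constrained weight matrix is W_r = A @ B.
    """
    d = X.shape[1]
    gram = X.T @ X + lam * np.eye(d)           # regularized Gram matrix
    W = np.linalg.solve(gram, X.T @ Y)         # full-rank ridge solution (d x c)
    # Eigenvectors of W^T (X^T X + lam I) W are the right singular vectors
    # of the fitted values of the augmented (regularized) problem.
    M = W.T @ gram @ W
    _, eigvecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :rank]             # keep the top `rank` directions
    return W @ V, V.T

# Toy data (hypothetical): 12 documents, 8 bag-of-words features, 4 concepts.
rng = np.random.default_rng(0)
n_docs, n_feats, n_concepts = 12, 8, 4
X = rng.poisson(1.0, size=(n_docs, n_feats)).astype(float)
Y = np.eye(n_concepts)[np.arange(n_docs) % n_concepts]   # one-hot concept labels

A, B = reduced_rank_ridge(X, Y, lam=1.0, rank=2)
embeddings = X @ A    # each document mapped to a 2-dimensional vector
```

In this sketch, `A` plays the role of the mapping from bag-of-words features to low-dimensional embeddings. In a crosslingual setting one would presumably stack or block the language-specific vocabularies in `X`, which this toy example does not attempt.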