
Transitive probabilistic CLIR models.

Author: Jong F.M.G. de
Kraaij W.
Publication venue: Centre de hautes etudes internationales (CID)
Publication date: 01/01/2004

Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness of up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator.
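As an illustration of the general idea of transitive translation with probabilistic models, the sketch below composes two word-translation tables through a pivot language by marginalising over pivot words, P(t|s) = sum over p of P(p|s) * P(t|p). This is only one common way to chain such models and is not confirmed to be any of the setups evaluated in the paper; the function name `compose` and the toy Dutch, English, and French probabilities are hypothetical.

```python
from collections import defaultdict


def compose(src_to_pivot, pivot_to_tgt):
    """Chain two translation tables through a pivot language.

    Both arguments map a word to a dict {translation: probability}.
    Returns a table estimating P(target | source) by summing over
    pivot words. This is an assumed, simplified formulation.
    """
    src_to_tgt = {}
    for src, pivots in src_to_pivot.items():
        scores = defaultdict(float)
        for pivot, p_pivot in pivots.items():
            # Pivot words with no known target translation contribute nothing.
            for tgt, p_tgt in pivot_to_tgt.get(pivot, {}).items():
                scores[tgt] += p_pivot * p_tgt
        total = sum(scores.values())
        # Renormalise so the surviving probability mass sums to 1.
        src_to_tgt[src] = (
            {tgt: s / total for tgt, s in scores.items()} if total > 0 else {}
        )
    return src_to_tgt


# Hypothetical toy tables: Dutch -> English and English -> French.
nl_en = {"bank": {"bank": 0.6, "bench": 0.4}}
en_fr = {"bank": {"banque": 0.9, "rive": 0.1}, "bench": {"banc": 1.0}}

nl_fr = compose(nl_en, en_fr)
for word, prob in sorted(nl_fr["bank"].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {prob:.2f}")
```

The composed table can then weight query terms in the target language; ambiguity in the pivot language (here the two senses of "bank") spreads probability mass across several target words, which is one reason transitive retrieval tends to fall short of monolingual performance.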

CiteSeerX

University of Twente Research Information

A Pattern Matching method for finding Noun and Proper Noun Translations from Noisy Parallel Corpora

Author: Fung Pascale
Publication date: 01/01/1995

We present a pattern matching method for compiling a bilingual lexicon of nouns and proper nouns from unaligned, noisy parallel texts of Asian/Indo-European language pairs. Tagging information of one language is used. Word frequency and position information for high and low frequency words are represented in two different vector forms for pattern matching. New anchor point finding and noise elimination techniques are introduced. We obtained 73.1% precision. We also show how the results can be used in the compilation of domain-specific noun phrases.

Comment: 8 pages, uuencoded compressed postscript file. To appear in the Proceedings of the 33rd AC

arXiv.org e-Print Archive

CiteSeerX

Crossref

Columbia University Academic Commons

Automatic Construction of Clean Broad-Coverage Translation Lexicons

Author: Melamed I. Dan
Publication date: 01/01/1996

Word-level translational equivalences can be extracted from parallel texts by surprisingly simple statistical techniques. However, these techniques are easily fooled by indirect associations: pairs of unrelated words whose statistical properties resemble those of mutual translations. Indirect associations pollute the resulting translation lexicons, drastically reducing their precision. This paper presents an iterative lexicon cleaning method. On each iteration, most of the remaining incorrect lexicon entries are filtered out, without significant degradation in recall. This lexicon cleaning technique can produce translation lexicons with recall and precision both exceeding 90%, as well as dictionary-sized translation lexicons that are over 99% correct.

Comment: PostScript file, 10 pages. To appear in Proceedings of AMTA-9

arXiv.org e-Print Archive

CiteSeerX

Trans-gram, Fast Cross-lingual Word-embeddings

Author: Benhalloum Amine
Coulmance Jocelyn
Marty Jean-Marc
Wenzek Guillaume
Publication date: 01/01/2015

We introduce Trans-gram, a simple and computationally efficient method to simultaneously learn and align word embeddings for a variety of languages, using only monolingual data and a smaller set of sentence-aligned data. We use our new method to compute aligned word embeddings for twenty-one languages using English as a pivot language. We show that some linguistic features are aligned across languages for which we do not have aligned data, even though those properties do not exist in the pivot language. We also achieve state-of-the-art results on standard cross-lingual text classification and word translation tasks.

Comment: EMNLP 201

arXiv.org e-Print Archive

Crossref

Examining the validity of cross-lingual word sense disambiguation

Author: Hoste Veronique
Lefever Els
Publication date: 01/01/2011

Ghent University Academic Bibliography

Investigating cross-lingual alignment methods for contextualized embeddings with Token-level evaluation

Author: Korhonen A
Liu Q
McCarthy D
Vulić I
Publication venue: CoNLL 2019 - 23rd Conference on Computational Natural Language Learning, Proceedings of the Conference
Publication date: 01/01/2019

In this paper, we present a thorough investigation of methods that align pre-trained contextualized embeddings into a shared cross-lingual context-aware embedding space, providing strong reference benchmarks for future context-aware cross-lingual models. We propose a novel and challenging task, Bilingual Token-level Sense Retrieval (BTSR). It specifically evaluates the accurate alignment of words with the same meaning in cross-lingual non-parallel contexts, which is currently not evaluated by existing tasks such as Bilingual Contextual Word Similarity and Sentence Retrieval. We show how the proposed BTSR task highlights the merits of different alignment methods. In particular, we find that using context average type-level alignment is effective in transferring monolingual contextualized embeddings cross-lingually, especially in non-parallel contexts, and at the same time improves the monolingual space. Furthermore, aligning independently trained models yields better performance than aligning multilingual embeddings with shared vocabulary.

Peterhouse College Studentship; ERC Consolidator Grant LEXICA
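One plausible reading of "context average type-level alignment" is that the contextualized vectors of every occurrence of a word are averaged into a single type-level vector, and those averages then serve as anchors for learning a cross-lingual mapping. The sketch below shows only the averaging step, under that assumption; the function name `type_level_anchors`, the input layout, and the toy vectors are hypothetical, and the mapping step itself is not shown.

```python
from collections import defaultdict


def type_level_anchors(token_vectors):
    """Average per-token contextual vectors into one vector per word type.

    token_vectors is an iterable of (word, vector) pairs, one pair per
    token occurrence, where vector is a list of floats of fixed length.
    """
    sums = {}
    counts = defaultdict(int)
    for word, vec in token_vectors:
        if word not in sums:
            sums[word] = [0.0] * len(vec)
        # Accumulate the component-wise sum for this word type.
        for i, x in enumerate(vec):
            sums[word][i] += x
        counts[word] += 1
    return {w: [x / counts[w] for x in total] for w, total in sums.items()}


# Hypothetical toy input: two occurrences of "bank", one of "river".
occurrences = [
    ("bank", [1.0, 0.0]),
    ("bank", [0.0, 1.0]),
    ("river", [2.0, 2.0]),
]
anchors = type_level_anchors(occurrences)
print(anchors)
```

Given anchors for both languages and a bilingual dictionary pairing their word types, a linear map between the two spaces could then be fitted on the paired anchors and applied to individual token vectors.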

Crossref

Apollo (Cambridge)