1,066 research outputs found
Transitive probabilistic CLIR models.
Transitive translation could be a useful technique to enlarge the number of supported language pairs for a cross-language information retrieval (CLIR) system in a cost-effective manner. The paper describes several setups for transitive translation based on probabilistic translation models. The transitive CLIR models were evaluated on the CLEF test collection and yielded a retrieval effectiveness\ud
up to 83% of monolingual performance, which is significantly better than a baseline using the synonym operator
A Pattern Matching method for finding Noun and Proper Noun Translations from Noisy Parallel Corpora
We present a pattern matching method for compiling a bilingual lexicon of
nouns and proper nouns from unaligned, noisy parallel texts of
Asian/Indo-European language pairs. Tagging information of one language is
used. Word frequency and position information for high and low frequency words
are represented in two different vector forms for pattern matching. New anchor
point finding and noise elimination techniques are introduced. We obtained a
73.1\% precision. We also show how the results can be used in the compilation
of domain-specific noun phrases.Comment: 8 pages, uuencoded compressed postscript file. To appear in the
Proceedings of the 33rd AC
Automatic Construction of Clean Broad-Coverage Translation Lexicons
Word-level translational equivalences can be extracted from parallel texts by
surprisingly simple statistical techniques. However, these techniques are
easily fooled by {\em indirect associations} --- pairs of unrelated words whose
statistical properties resemble those of mutual translations. Indirect
associations pollute the resulting translation lexicons, drastically reducing
their precision. This paper presents an iterative lexicon cleaning method. On
each iteration, most of the remaining incorrect lexicon entries are filtered
out, without significant degradation in recall. This lexicon cleaning technique
can produce translation lexicons with recall and precision both exceeding 90\%,
as well as dictionary-sized translation lexicons that are over 99\% correct.Comment: PostScript file, 10 pages. To appear in Proceedings of AMTA-9
Trans-gram, Fast Cross-lingual Word-embeddings
We introduce Trans-gram, a simple and computationally-efficient method to
simultaneously learn and align wordembeddings for a variety of languages, using
only monolingual data and a smaller set of sentence-aligned data. We use our
new method to compute aligned wordembeddings for twenty-one languages using
English as a pivot language. We show that some linguistic features are aligned
across languages for which we do not have aligned data, even though those
properties do not exist in the pivot language. We also achieve state of the art
results on standard cross-lingual text classification and word translation
tasks.Comment: EMNLP 201
Investigating cross-lingual alignment methods for contextualized embeddings with Token-level evaluation
In this paper, we present a thorough investigation on methods that align pre-trained contextualized embeddings into shared cross-lingual context-aware embedding space, providing strong reference benchmarks for future context-aware crosslingual models. We propose a novel and challenging task, Bilingual Token-level Sense Retrieval (BTSR). It specifically evaluates the accurate alignment of words with the same meaning in cross-lingual non-parallel contexts, currently not evaluated by existing tasks such as Bilingual Contextual Word Similarity and Sentence Retrieval. We show how the proposed BTSR task highlights the merits of different alignment methods. In particular, we find that using context average type-level alignment is effective in transferring monolingual contextualized embeddings cross-lingually especially in non-parallel contexts, and at the same time improves the monolingual space. Furthermore, aligning independently trained models yields better performance than aligning multilingual embeddings with shared vocabulary.Peterhouse College Studentship; ERC Consolidator Grant LEXICA
- …