19 research outputs found
A Pattern Matching method for finding Noun and Proper Noun Translations from Noisy Parallel Corpora
We present a pattern matching method for compiling a bilingual lexicon of
nouns and proper nouns from unaligned, noisy parallel texts of
Asian/Indo-European language pairs. Tagging information of one language is
used. Word frequency and position information for high and low frequency words
are represented in two different vector forms for pattern matching. New anchor
point finding and noise elimination techniques are introduced. We obtained a
73.1\% precision. We also show how the results can be used in the compilation
of domain-specific noun phrases.Comment: 8 pages, uuencoded compressed postscript file. To appear in the
Proceedings of the 33rd AC
Models of Co-occurrence
A model of co-occurrence in bitext is a boolean predicate that indicates
whether a given pair of word tokens co-occur in corresponding regions of the
bitext space. Co-occurrence is a precondition for the possibility that two
tokens might be mutual translations. Models of co-occurrence are the glue that
binds methods for mapping bitext correspondence with methods for estimating
translation models into an integrated system for exploiting parallel texts.
Different models of co-occurrence are possible, depending on the kind of bitext
map that is available, the language-specific information that is available, and
the assumptions made about the nature of translational equivalence. Although
most statistical translation models are based on models of co-occurrence,
modeling co-occurrence correctly is more difficult than it may at first appear
Avtomatska razširitev in čiščenje sloWNeta
International audienceIn this paper we present a language-independent and automatic approach to extend a wordnet by recycling different types of already existing language resources, such as machine-readable dictionaries, parallel corpora and Wikipedia. The approach, applied to Slovene, takes into account monosemous and polysemous words, general and specialized vocabulary as well as simple and multi-word lexemes. The extracted words are assigned one or several synset ids based on a classifier that relies on several features including distributional similarity. In the next step we also identify and remove highly dubious (literal, synset) pairs, based on simple distributional information extracted from a large corpus in an unsupervised way. Automatic and manual evaluation show that the proposed approach yields very promising results.V prispevku predstavljamo jezikovno neodvisno in avtomatsko razširitev wordneta z uporabo heterogenih že obstoječih jezikovnih virov, kot so strojno berljivi slovarji, vzporedni korpusi in Wikipedija. Pristop, ki ga preizkusimo na slovenščini, upošteva tako eno- kot večpomenske besede, splošno in specializirano besedišče, pa tudi eno- in večbesedne lekseme. Izluščenim besedam enega ali več pomenov pripišemo s pomočjo klasifikatorja, ki temelji na naboru različnih značilk, predvsem pa na distribucijski podobnosti. V naslednjem koraku s pomočjo distribucijskih informacij, izluščenih iz velikega korpusa, identificiramo in odstranimo zelo dvomljive kandidate. Avtomatska in ročna evalvacija rezultatov pokaže, da uporabljeni pristop daje zelo spodbudne rezultate
Automatic Extension of WOLF
International audienceIn this paper we present the extension of WOLF, a freely available, automatically creat- ed wordnet for French, the biggest drawback of which has until now been the lack of general concepts that are typically expressed with highly polysemous vocabulary that is on the one hand the most valuable for applications in human language technologies but also the most difficult to add to wordnet accurately with automatic methods on the other. Using a set of features, we train a Maximum Entropy classifier on the existing core wordnet to be able to assign appropriate synset ids to new words, extracted from multiple, multilingual sources of lexical knowledge, such as Wik- tionaries, Wikipedias and corpora. Automatic and manual evaluation shows high coverage as well as high quality of the resulting lexico-semantic repository of. Another important ad- vantage of the approach is that it is fully au- tomatic and language-independent and could therefore be applied to any other language still lacking a wordnet
Improving machine translation performance using comparable corpora
Abstract The overwhelming majority of the languages in the world are spoken by less than 50 million native speakers, and automatic translation of many of these languages is less investigated due to the lack of linguistic resources such as parallel corpora. In the ACCURAT project we will work on novel methods how comparable corpora can compensate for this shortage and improve machine translation systems of under-resourced languages. Translation systems on eighteen European language pairs will be investigated and methodologies in corpus linguistics will be greatly advanced. We will explore the use of preliminary SMT models to identify the parallel parts within comparable corpora, which will allow us to derive better SMT models via a bootstrapping loop
METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development
International audienceResearch on comparable corpora has grown in recent years bringing about the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatically creation of comparable corpora and describes one of the crawlers developed for comparable corpora building, and then discusses the power of collocational networks for multilingual corpus-driven dictionary development
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR
system, where we combine a query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, the technical term translation is still
problematic in that technical terms are often compound words, and thus new
terms are progressively created by combining existing base words. In addition,
Japanese often represents loanwords based on its special phonogram.
Consequently, existing dictionaries find it difficult to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which corresponds words
unlisted in the base word dictionary to their phonetic equivalents in the
target language. We evaluate our system using a test collection for CLIR, and
show that both the compound word translation and transliteration methods
improve the system performance