Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons
This paper shows how to induce an N-best translation lexicon from a bilingual
text corpus using statistical properties of the corpus together with four
external knowledge sources. The knowledge sources are cast as filters, so that
any subset of them can be cascaded in a uniform framework. A new objective
evaluation measure is used to compare the quality of lexicons induced with
different filter cascades. The best filter cascades improve lexicon quality by
up to 137% over the plain vanilla statistical method, and approach human
performance. Drastically reducing the size of the training corpus has a much
smaller impact on lexicon quality when these knowledge sources are used. This
makes it practical to train on small hand-built corpora for language pairs
where large bilingual corpora are unavailable. Moreover, three of the four
filters prove useful even when used with large training corpora.
Comment: To appear in Proceedings of the Third Workshop on Very Large Corpora, 15 pages, uuencoded compressed PostScript
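The filter-cascade framework can be sketched as follows; this is a minimal illustration, assuming each knowledge source can be reduced to a boolean predicate over candidate (source, target) word pairs. The two example filters are invented for illustration and are not the paper's four knowledge sources.

```python
# Minimal sketch of a uniform filter cascade over a candidate lexicon.
# Any subset of filters can be composed in sequence.

def cascade(candidates, filters):
    """Apply any subset of filters in sequence to a candidate lexicon."""
    for keep in filters:
        candidates = [pair for pair in candidates if keep(pair)]
    return candidates

def similar_length(pair):
    # Illustrative filter: mutual translations tend to have similar lengths.
    src, tgt = pair
    return abs(len(src) - len(tgt)) <= 4

def shared_prefix(pair):
    # Illustrative cognate-style filter: same first three letters.
    src, tgt = pair
    return src[:3].lower() == tgt[:3].lower()

candidates = [("nation", "nation"), ("the", "gouvernement")]
print(cascade(candidates, [similar_length, shared_prefix]))
# -> [('nation', 'nation')]
```

Because each filter is just a predicate, cascading any subset is a one-line change to the `filters` argument, which is the uniformity the abstract describes.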
Models of Co-occurrence
A model of co-occurrence in bitext is a boolean predicate that indicates
whether a given pair of word tokens co-occur in corresponding regions of the
bitext space. Co-occurrence is a precondition for the possibility that two
tokens might be mutual translations. Models of co-occurrence are the glue that
binds methods for mapping bitext correspondence with methods for estimating
translation models into an integrated system for exploiting parallel texts.
Different models of co-occurrence are possible, depending on the kind of bitext
map that is available, the language-specific information that is available, and
the assumptions made about the nature of translational equivalence. Although
most statistical translation models are based on models of co-occurrence,
modeling co-occurrence correctly is more difficult than it may at first appear.
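In the simplest case, where the bitext map is given as aligned segment pairs, the boolean predicate described above can be sketched like this (a minimal illustration, not any particular model from the text):

```python
# Sketch of the simplest model of co-occurrence: two tokens co-occur if
# they appear in a corresponding (aligned) pair of segments.

def cooccur(src_token, tgt_token, aligned_segments):
    """Boolean predicate over a token pair and a segment-aligned bitext."""
    return any(src_token in src_seg and tgt_token in tgt_seg
               for src_seg, tgt_seg in aligned_segments)

bitext = [(["the", "house"], ["la", "maison"]),
          (["a", "dog"], ["un", "chien"])]
print(cooccur("house", "maison", bitext))  # True
print(cooccur("house", "chien", bitext))   # False
```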
A Scalable Architecture for Bilingual Lexicography
SABLE is a Scalable Architecture for Bilingual LExicography. It is designed to produce clean broad-coverage translation lexicons from raw, unaligned parallel texts. Its black-box functionality makes it suitable for naive users. The architecture has been implemented for different language pairs, and has been tested on very large and noisy input. SABLE does not rely on language-specific resources such as part-of-speech taggers, but it can take advantage of them when they are available.
Automatic Construction of Clean Broad-Coverage Translation Lexicons
Word-level translational equivalences can be extracted from parallel texts by
surprisingly simple statistical techniques. However, these techniques are
easily fooled by indirect associations -- pairs of unrelated words whose
statistical properties resemble those of mutual translations. Indirect
associations pollute the resulting translation lexicons, drastically reducing
their precision. This paper presents an iterative lexicon cleaning method. On
each iteration, most of the remaining incorrect lexicon entries are filtered
out, without significant degradation in recall. This lexicon cleaning technique
can produce translation lexicons with recall and precision both exceeding 90%,
as well as dictionary-sized translation lexicons that are over 99% correct.
Comment: PostScript file, 10 pages. To appear in Proceedings of AMTA-9
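One pass of cleaning against indirect associations can be sketched in the style of greedy competitive linking: a candidate pair survives only if neither of its words already belongs to a higher-scoring pair. The scores below are invented for illustration, not the paper's association measure.

```python
# Sketch of one cleaning pass: sort candidates by association score and
# keep a pair only if both of its words are still unlinked. The indirect
# association (house, chien) loses to the stronger direct pairs.

def clean(lexicon):
    """lexicon: dict mapping (src, tgt) -> association score."""
    kept, linked_src, linked_tgt = {}, set(), set()
    for (src, tgt), score in sorted(lexicon.items(), key=lambda kv: -kv[1]):
        if src not in linked_src and tgt not in linked_tgt:
            kept[(src, tgt)] = score
            linked_src.add(src)
            linked_tgt.add(tgt)
    return kept

lexicon = {("house", "maison"): 9.0,
           ("house", "chien"): 2.5,   # indirect association
           ("dog", "chien"): 8.0}
print(clean(lexicon))
# -> {('house', 'maison'): 9.0, ('dog', 'chien'): 8.0}
```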
Automatic Discovery of Non-Compositional Compounds in Parallel Data
Automatic segmentation of text into minimal content-bearing units is an
unsolved problem even for languages like English. Spaces between words offer an
easy first approximation, but this approximation is not good enough for machine
translation (MT), where many word sequences are not translated word-for-word.
This paper presents an efficient automatic method for discovering sequences of
words that are translated as a unit. The method proceeds by comparing pairs of
statistical translation models induced from parallel texts in two languages. It
can discover hundreds of non-compositional compounds on each iteration, and
constructs longer compounds out of shorter ones. Objective evaluation on a
simple machine translation task has shown the method's potential to improve the
quality of MT output. The method makes few assumptions about the data, so it
can be applied to parallel data other than parallel texts, such as word
spellings and pronunciations.
Comment: 12 pages; uses natbib.sty, here.sty
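The iterative fuse-and-retrain loop can be sketched as below. The hard part -- scoring a candidate by comparing two induced translation models -- is stubbed out as a caller-supplied gain function, so this shows only the control flow, including how longer compounds are built from shorter ones across iterations.

```python
# Illustrative sketch of compound discovery: on each iteration, fuse every
# word bigram whose fused model scores better than the baseline, then
# repeat on the rewritten corpus.

def discover_compounds(corpus, gain, max_iters=3):
    """Iteratively fuse word bigrams for which gain(bigram, corpus) > 0."""
    for _ in range(max_iters):
        bigrams = {(a, b) for sent in corpus for a, b in zip(sent, sent[1:])}
        winners = {bg for bg in bigrams if gain(bg, corpus) > 0}
        if not winners:
            break
        corpus = [fuse(sent, winners) for sent in corpus]
    return corpus

def fuse(sent, winners):
    # Rewrite a sentence, joining winning bigrams into single tokens.
    out, i = [], 0
    while i < len(sent):
        if i + 1 < len(sent) and (sent[i], sent[i + 1]) in winners:
            out.append(sent[i] + "_" + sent[i + 1])
            i += 2
        else:
            out.append(sent[i])
            i += 1
    return out

corpus = [["hot", "dog", "stand"]]
print(discover_compounds(corpus, lambda bg, c: 1 if bg == ("hot", "dog") else -1))
# -> [['hot_dog', 'stand']]
```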
A Geometric Approach to Mapping Bitext Correspondence
The first step in most corpus-based multilingual NLP work is to construct a
detailed map of the correspondence between a text and its translation. Several
automatic methods for this task have been proposed in recent years. Yet even
the best of these methods can err by several typeset pages. The Smooth
Injective Map Recognizer (SIMR) is a new bitext mapping algorithm. SIMR's
errors are smaller than those of the previous front-runner by more than a
factor of 4. Its robustness has enabled new commercial-quality applications.
The greedy nature of the algorithm makes it independent of memory resources.
Unlike other bitext mapping algorithms, SIMR allows crossing correspondences to
account for word order differences. Its output can be converted quickly and
easily into a sentence alignment. SIMR's output has been used to align over 200
megabytes of the Canadian Hansards for publication by the Linguistic Data
Consortium.
Comment: 15 pages, minor revisions on Sept. 30, 199
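The conversion the abstract mentions -- from a bitext map to a sentence alignment -- can be illustrated with a simplified sketch: given token-position correspondence points and the sentence-final token positions on each side, bucket each point into its sentence pair. This shows the conversion step only; it is not SIMR itself.

```python
import bisect

def to_sentence_alignment(points, src_bounds, tgt_bounds):
    """points: (src_pos, tgt_pos) pairs; bounds: sorted sentence-end positions.

    Each point is mapped to the index of the sentence containing it on each
    side; the set of resulting index pairs is a coarse sentence alignment.
    """
    pairs = {(bisect.bisect_right(src_bounds, x),
              bisect.bisect_right(tgt_bounds, y)) for x, y in points}
    return sorted(pairs)

# Two sentences per side, with boundaries after token positions 5 and 6.
print(to_sentence_alignment([(2, 3), (8, 9)], src_bounds=[5], tgt_bounds=[6]))
# -> [(0, 0), (1, 1)]
```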
Dependency-based translation equivalents for factored machine translation
One of the major concerns of machine translation practitioners is to create good translation models: correctly extracted translation equivalents and a reduced size of the translation table are the most important evaluation criteria. This paper presents a method for extracting translation examples using the dependency linkage of both the source and the target sentence. To decompose the source/target sentence into fragments, we identified two types of dependency link-structures -- super-links and chains -- and used these structures to set the borders of the translation examples. The option for the dependency-linked n-grams approach is based on the assumption that decomposing the sentence into coherent segments, each with a complete syntactic structure and accounting for extra-phrasal syntactic dependency, would guarantee "better" translation examples and would make better use of the storage space. The performance of the dependency-based approach is measured with the BLEU-NIST score, in comparison with a baseline system.
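The idea of coherent, syntactically complete fragments can be illustrated with a simple sketch: every dependency subtree that covers a contiguous token span is taken as a candidate translation-example border. The head array below is invented for illustration; the paper's super-links and chains refine this basic notion.

```python
# Sketch: extract contiguous dependency subtrees as candidate fragment
# borders for translation examples.

def subtree_spans(heads):
    """heads[i] = index of token i's head, or -1 for the root."""
    n = len(heads)
    children = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h >= 0:
            children[h].append(i)

    def descendants(i):
        nodes = {i}
        for c in children[i]:
            nodes |= descendants(c)
        return nodes

    spans = []
    for i in range(n):
        d = descendants(i)
        if max(d) - min(d) + 1 == len(d):  # subtree covers a contiguous span
            spans.append((min(d), max(d)))
    return spans

# "the big dog barked": "the" and "big" attach to "dog", "dog" to "barked".
print(subtree_spans([2, 2, 3, -1]))
# -> [(0, 0), (1, 1), (0, 2), (0, 3)]
```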
Word-to-Word Models of Translational Equivalence
Parallel texts (bitexts) have properties that distinguish them from other
kinds of parallel data. First, most words translate to only one other word.
Second, bitext correspondence is noisy. This article presents methods for
biasing statistical translation models to reflect these properties. Analysis of
the expected behavior of these biases in the presence of sparse data predicts
that they will result in more accurate models. The prediction is confirmed by
evaluation with respect to a gold standard -- translation models that are
biased in this fashion are significantly more accurate than a baseline
knowledge-poor model. This article also shows how a statistical translation
model can take advantage of various kinds of pre-existing knowledge that might
be available about particular language pairs. Even the simplest kinds of
language-specific knowledge, such as the distinction between content words and
function words, are shown to reliably boost translation model performance on
some tasks. Statistical models that are informed by pre-existing knowledge
about the model domain combine the best of both the rationalist and empiricist
traditions.
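One way the content/function-word distinction can inform a statistical model is sketched below: function words are excluded before co-occurrence counting, so they cannot generate spurious candidate pairs. The stoplists are invented for illustration and are not from the article.

```python
from collections import Counter

# Illustrative stoplists standing in for language-specific knowledge.
SRC_FUNCTION_WORDS = {"the", "a", "of"}
TGT_FUNCTION_WORDS = {"le", "la", "un", "de"}

def cooccurrence_counts(bitext):
    """Count co-occurrences over aligned segments, skipping function words."""
    counts = Counter()
    for src_seg, tgt_seg in bitext:
        for s in src_seg:
            if s in SRC_FUNCTION_WORDS:
                continue
            for t in tgt_seg:
                if t not in TGT_FUNCTION_WORDS:
                    counts[(s, t)] += 1
    return counts

bitext = [(["the", "house"], ["la", "maison"])]
print(cooccurrence_counts(bitext))
# -> Counter({('house', 'maison'): 1})
```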