58 research outputs found
A Geometric Approach to Mapping Bitext Correspondence
The first step in most corpus-based multilingual NLP work is to construct a
detailed map of the correspondence between a text and its translation. Several
automatic methods for this task have been proposed in recent years. Yet even
the best of these methods can err by several typeset pages. The Smooth
Injective Map Recognizer (SIMR) is a new bitext mapping algorithm. SIMR's
errors are smaller than those of the previous front-runner by more than a
factor of 4. Its robustness has enabled new commercial-quality applications.
The greedy nature of the algorithm makes it independent of memory resources.
Unlike other bitext mapping algorithms, SIMR allows crossing correspondences to
account for word order differences. Its output can be converted quickly and
easily into a sentence alignment. SIMR's output has been used to align over 200
megabytes of the Canadian Hansards for publication by the Linguistic Data
Consortium.
Comment: 15 pages, minor revisions on Sept. 30, 199
Models of Co-occurrence
A model of co-occurrence in bitext is a boolean predicate that indicates
whether a given pair of word tokens co-occur in corresponding regions of the
bitext space. Co-occurrence is a precondition for the possibility that two
tokens might be mutual translations. Models of co-occurrence are the glue that
binds methods for mapping bitext correspondence with methods for estimating
translation models into an integrated system for exploiting parallel texts.
Different models of co-occurrence are possible, depending on the kind of bitext
map that is available, the language-specific information that is available, and
the assumptions made about the nature of translational equivalence. Although
most statistical translation models are based on models of co-occurrence,
modeling co-occurrence correctly is more difficult than it may at first appear.
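In the simplest case (a segment-aligned bitext with a boundary-based model), such a predicate can be sketched as follows; the function name and data layout are illustrative, not from the paper:

```python
# A minimal sketch of a boundary-based co-occurrence predicate, assuming
# the bitext map is given as pairs of corresponding segments (token lists).
# All names here are illustrative, not taken from the paper.

def co_occur(word1, word2, aligned_segments):
    """Return True if word1 (source side) and word2 (target side) appear
    together in at least one pair of corresponding bitext segments."""
    return any(word1 in src and word2 in tgt for src, tgt in aligned_segments)

bitext = [
    (["le", "chat", "dort"], ["the", "cat", "sleeps"]),
    (["un", "chien"], ["a", "dog"]),
]
print(co_occur("chat", "cat", bitext))  # True
print(co_occur("chat", "dog", bitext))  # False
```

Co-occurrence in this sense is only a precondition: "chat"/"sleeps" also co-occur here, which is why co-occurrence counts feed a translation model rather than decide equivalence directly.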
Automatic Detection of Omissions in Translations
ADOMIT is an algorithm for Automatic Detection of OMIssions in Translations. The algorithm relies solely on geometric analysis of bitext maps and uses no linguistic information. This property allows it to deal equally well with omissions that do not correspond to linguistic units, such as might result from word-processing mishaps. ADOMIT has proven itself by discovering many errors in a hand-constructed gold standard for evaluating bitext mapping algorithms. Quantitative evaluation on simulated omissions showed that, even with today's poor bitext mapping technology, ADOMIT is a valuable quality control tool for translators and translation bureaus.
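The geometric intuition is that an omitted passage shows up as a near-horizontal run in the bitext map: the source position advances while the target position barely moves. A minimal sketch of that idea, with hypothetical thresholds and names (not ADOMIT's actual parameters):

```python
def find_omission_candidates(bitext_map, min_gap=20, max_rise=5):
    """Flag stretches of a bitext map (a list of (source_pos, target_pos)
    points) where the source position jumps by at least min_gap while the
    target position advances by at most max_rise -- geometrically, a
    near-horizontal run, the signature of an omitted passage.
    Thresholds and interface are illustrative only."""
    candidates = []
    for (x1, y1), (x2, y2) in zip(bitext_map, bitext_map[1:]):
        if (x2 - x1) >= min_gap and (y2 - y1) <= max_rise:
            candidates.append(((x1, y1), (x2, y2)))
    return candidates

trace = [(0, 0), (10, 12), (55, 14), (70, 30)]
print(find_omission_candidates(trace))  # [((10, 12), (55, 14))]
```

Because the test is purely geometric, it is indifferent to whether the skipped span is a sentence, half a paragraph, or an accidental deletion.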
A Scalable Architecture for Bilingual Lexicography
SABLE is a Scalable Architecture for Bilingual LExicography. It is designed to produce clean broad-coverage translation lexicons from raw, unaligned parallel texts. Its black-box functionality makes it suitable for naive users. The architecture has been implemented for different language pairs, and has been tested on very large and noisy input. SABLE does not rely on language-specific resources such as part-of-speech taggers, but it can take advantage of them when they are available.
Manual Annotation of Translational Equivalence: The Blinker Project
Bilingual annotators were paid to link roughly sixteen thousand corresponding
words between on-line versions of the Bible in modern French and modern
English. These annotations are freely available to the research community from
http://www.cis.upenn.edu/~melamed . The annotations can be used for several
purposes. First, they can be used as a standard data set for developing and
testing translation lexicons and statistical translation models. Second,
researchers in lexical semantics will be able to mine the annotations for
insights about cross-linguistic lexicalization patterns. Third, the annotations
can be used in research into certain recently proposed methods for monolingual
word-sense disambiguation. This paper describes the annotated texts, the
specially-designed annotation tool, and the strategies employed to increase the
consistency of the annotations. The annotation process was repeated five times
by different annotators. Inter-annotator agreement rates indicate that the
annotations are reasonably reliable and that the method is easy to replicate.
Towards a user-friendly webservice architecture for statistical machine translation in the PANACEA project
This paper presents a webservice architecture for Statistical Machine Translation aimed at non-technical users. A workflow editor allows a user to combine different webservices using a graphical user interface. In the current state of the project, webservices have been implemented for a range of sentential and sub-sentential aligners. A common interface and a common data format allow the user to build workflows that exchange different aligners.
Automatic Construction of Chinese-English Translation Lexicons
The process of constructing translation lexicons from parallel texts (bitexts) can be broken down into three stages: mapping bitext correspondence, counting co-occurrences, and estimating a translation model. State-of-the-art techniques for accomplishing each stage of the process had already been developed, but only for bitexts involving fairly similar languages. Correct and efficient implementation of each stage poses special challenges when the parallel texts involve two very different languages. This report describes our theoretical and empirical investigations into how existing techniques might be extended and applied to Chinese/English bitexts
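The three-stage decomposition described above can be expressed as a simple pipeline; the function names and injected stage implementations below are purely illustrative, not the report's actual code:

```python
def build_lexicon(bitext, map_correspondence, count_cooccurrences, estimate_model):
    """Three-stage lexicon construction, with each stage supplied as a
    function so that language-pair-specific implementations (e.g. for
    Chinese/English) can be swapped in. All names are hypothetical."""
    bitext_map = map_correspondence(bitext)          # stage 1: bitext mapping
    counts = count_cooccurrences(bitext, bitext_map) # stage 2: co-occurrence counts
    return estimate_model(counts)                    # stage 3: translation model

# Toy usage with dummy stage implementations:
demo = build_lexicon(
    [("le chat", "the cat")],
    map_correspondence=lambda bt: [(0, 0)],
    count_cooccurrences=lambda bt, m: {("chat", "cat"): 1},
    estimate_model=lambda counts: {pair: 1.0 for pair in counts},
)
print(demo)  # {('chat', 'cat'): 1.0}
```

The point of the decomposition is that each stage can fail or be adapted independently, which is exactly where dissimilar language pairs raise special challenges.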
Canvas: A fast and accurate geometric sentence alignment system using lexical cues within complex misalignment settings
In this paper, we present a new sentence alignment system (Canvas), a Python implementation of a geometric approach to sentence alignment based on lexical cues. The Canvas system is designed mainly to handle parallel texts exhibiting complex misalignment patterns, namely English-Arabic pairs from United Nations documents. The system relies heavily on pre-indexing words/tokens in the source and target texts, and it creates correspondences between the token indexes. From this point onward, the alignment problem is reduced to a geometric problem of finding the path that runs through the True Correspondence Points (TCPs). The likelihood of a point being a TCP depends on the clustering of other points nearby; so we collect the most likely points and identify the shortest path containing the maximum number of these points using a modified form of Dijkstra's algorithm. The results of the Canvas system are very promising: it handles intricate misalignment patterns with much better speed than other alignment approaches using lexical cues, and with good accuracy in general, in a completely automated fashion. The only drawback is that the system does not cover all the alignment segments; its coverage is generally lower than that of other systems, which can be a subject of future research.
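The abstract does not give the details of the modified Dijkstra's algorithm, but the underlying idea of selecting a monotonic path through as many candidate points as possible can be illustrated with a simple dynamic program (all names are hypothetical, and this is a crude stand-in, not Canvas itself):

```python
def best_monotonic_chain(points):
    """Longest chain of candidate points (x, y) that is strictly
    increasing in both coordinates -- a simplified stand-in for the
    path through the True Correspondence Points (TCPs)."""
    pts = sorted(points)
    best = [1] * len(pts)   # best[i]: longest chain ending at pts[i]
    prev = [-1] * len(pts)  # back-pointers for path reconstruction
    for i, (xi, yi) in enumerate(pts):
        for j in range(i):
            xj, yj = pts[j]
            if xj < xi and yj < yi and best[j] + 1 > best[i]:
                best[i], prev[i] = best[j] + 1, j
    # Reconstruct the chain ending at the best index.
    i = max(range(len(pts)), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(pts[i])
        i = prev[i]
    return chain[::-1]

print(best_monotonic_chain([(1, 1), (2, 5), (3, 2), (4, 3), (5, 4)]))
# [(1, 1), (3, 2), (4, 3), (5, 4)]  -- the outlier (2, 5) is skipped
```

The attraction of the path formulation is visible even in this toy: isolated noise points fall off the chain, while clustered points reinforce one another.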
Automatic Construction of Clean Broad-Coverage Translation Lexicons
Word-level translational equivalences can be extracted from parallel texts by
surprisingly simple statistical techniques. However, these techniques are
easily fooled by "indirect associations": pairs of unrelated words whose
statistical properties resemble those of mutual translations. Indirect
associations pollute the resulting translation lexicons, drastically reducing
their precision. This paper presents an iterative lexicon cleaning method. On
each iteration, most of the remaining incorrect lexicon entries are filtered
out, without significant degradation in recall. This lexicon cleaning technique
can produce translation lexicons with recall and precision both exceeding 90%,
as well as dictionary-sized translation lexicons that are over 99% correct.
Comment: PostScript file, 10 pages. To appear in Proceedings of AMTA-9
Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text
Parallel corpora are a valuable resource for machine translation, but at
present their availability and utility is limited by genre- and
domain-specificity, licensing restrictions, and the basic difficulty of
locating parallel texts in all but the most dominant of the world's languages.
A parallel corpus resource not yet explored is the World Wide Web, which hosts
an abundance of pages in parallel translation, offering a potential solution to
some of these problems and unique opportunities of its own. This paper presents
the necessary first step in that exploration: a method for automatically
finding parallel translated documents on the Web. The technique is conceptually
simple, fully language independent, and scalable, and preliminary evaluation
results indicate that the method may be accurate enough to apply without human
intervention.
Comment: LaTeX2e, 11 pages, 7 eps figures; uses psfig, llncs.cls, theapa.sty.
An Appendix at http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html
contains test data