This paper shows how to induce an N-best translation lexicon from a bilingual
text corpus using statistical properties of the corpus together with four
external knowledge sources. The knowledge sources are cast as filters, so that
any subset of them can be cascaded in a uniform framework. A new objective
evaluation measure is used to compare the quality of lexicons induced with
different filter cascades. The best filter cascades improve lexicon quality by
up to 137% over the plain vanilla statistical method, and approach human
performance. Drastically reducing the size of the training corpus has a much
smaller impact on lexicon quality when these knowledge sources are used. This
makes it practical to train on small hand-built corpora for language pairs
where large bilingual corpora are unavailable. Moreover, three of the four
filters prove useful even when used with large training corpora.Comment: To appear in Proceedings of the Third Workshop on Very Large Corpora,
15 pages, uuencoded compressed PostScrip