Using Global Constraints and Reranking to Improve Cognates Detection
Global constraints and reranking have not been used in cognates detection
research to date. We propose methods for using global constraints by performing
rescoring of the score matrices produced by state of the art cognates detection
systems. Using global constraints to perform rescoring is complementary to
state of the art methods for performing cognates detection and results in
significant performance improvements beyond current state of the art
performance on publicly available datasets with different language pairs and
various conditions such as different levels of baseline state of the art
performance and different data size conditions, including with more realistic
large data size conditions than have been evaluated in the past.
Comment: 10 pages, 6 figures, 6 tables; published in the Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics, pages
1983-1992, Vancouver, Canada, July 201
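The abstract does not say which global constraints the authors apply, so the following is only a minimal sketch of the general idea of rescoring a score matrix. It assumes a hypothetical one-to-one intuition (a word rarely has more than one cognate in the other list) and subtracts from each cell the best competing score in its row or column. The function name `rescore` and the toy scores are illustrative, not taken from the paper.

```python
from typing import List


def rescore(scores: List[List[float]]) -> List[List[float]]:
    """Subtract from each cell the best rival score in its row or column.

    Illustrative global constraint only; not the method of the paper.
    """
    n_rows = len(scores)
    n_cols = len(scores[0]) if scores else 0
    rescored = []
    for i in range(n_rows):
        new_row = []
        for j in range(n_cols):
            # Rivals: other candidates for the same source word (row i)
            # and other source words competing for the same target (column j).
            row_rivals = [scores[i][k] for k in range(n_cols) if k != j]
            col_rivals = [scores[k][j] for k in range(n_rows) if k != i]
            rivals = row_rivals + col_rivals
            best_rival = max(rivals) if rivals else 0.0
            new_row.append(scores[i][j] - best_rival)
        rescored.append(new_row)
    return rescored


# Made-up baseline scores for two source words and two target words.
toy_scores = [[0.9, 0.8],
              [0.7, 0.1]]
rescored = rescore(toy_scores)
```

In this toy matrix only the pair at row 0, column 0 stays positive after rescoring, because every other cell is beaten by a rival in its own row or column.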
An Algorithm For Building Language Superfamilies Using Swadesh Lists
The main contributions of this thesis are the following: i. Developing an algorithm to generate language families and superfamilies given for each input language a Swadesh list represented using the International Phonetic Alphabet (IPA) notation. ii. The algorithm is novel in using the Levenshtein distance metric on the IPA representation and in the way it measures overall distance between pairs of Swadesh lists. iii. Building a Swadesh list for the author's native Kinyarwanda language because a Swadesh list could not be found even after an extensive search for it.
Adviser: Peter Reves
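As a rough illustration of the distance computation described above, here is a minimal sketch of the Levenshtein distance and one possible way to aggregate it over two Swadesh lists. The abstract does not specify how the thesis measures overall list distance, so the length-normalized mean in `list_distance` is an assumption, and both function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Classic edit distance (insert, delete, substitute, each of cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def list_distance(list_a: Sequence[Sequence[str]],
                  list_b: Sequence[Sequence[str]]) -> float:
    """Mean length-normalized edit distance over aligned list entries.

    Assumed aggregation, not necessarily the one used in the thesis.
    """
    if len(list_a) != len(list_b) or not list_a:
        raise ValueError("lists must be non-empty and of equal length")
    total = 0.0
    for word_a, word_b in zip(list_a, list_b):
        longest = max(len(word_a), len(word_b))
        total += levenshtein(word_a, word_b) / longest if longest else 0.0
    return total / len(list_a)
```

Plain Python strings are compared one code point at a time, so an IPA diacritic counts as its own symbol. Passing each word as a list of pre-segmented phones avoids that.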
An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification
End-to-end neural machine translation has overtaken statistical machine
translation in terms of translation quality for some language pairs, especially
those with large amounts of parallel data. Besides this palpable improvement,
neural networks provide several new properties. A single system can be trained
to translate between many languages at almost no additional cost other than
training time. Furthermore, internal representations learned by the network
serve as a new semantic representation of words (or sentences) which, unlike
standard word embeddings, are learned in an essentially bilingual or even
multilingual context. In view of these properties, the contribution of the
present work is two-fold. First, we systematically study the NMT context
vectors, i.e. output of the encoder, and their power as an interlingua
representation of a sentence. We assess their quality and effectiveness by
measuring similarities across translations, as well as semantically related and
semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the
first point, we identify parallel sentences in comparable corpora, obtaining
F1=98.2% on data from a shared task when using only NMT context vectors. Using
context vectors jointly with similarity measures, F1 reaches 98.9%.
Comment: 11 pages, 4 figure
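To make the idea of comparing sentences through encoder outputs concrete, here is a minimal sketch that collapses per-token context vectors into one sentence vector and compares two sentences by cosine similarity. The abstract does not state how the authors pool the vectors or which similarity they use, so mean pooling, the cosine measure, and the function names are all assumptions.

```python
import math
from typing import List, Sequence


def mean_pool(context_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token encoder outputs into one fixed-size sentence vector."""
    if not context_vectors:
        raise ValueError("need at least one context vector")
    n = len(context_vectors)
    dim = len(context_vectors[0])
    return [sum(vec[d] for vec in context_vectors) / n for d in range(dim)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A parallel-sentence detector built on this would score each candidate pair from the two corpora and keep the pairs above a threshold tuned on held-out data.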
Learning cross-lingual phonological and orthographic adaptations: a case study in improving neural machine translation between low-resource languages
Out-of-vocabulary (OOV) words can pose serious challenges for machine
translation (MT) tasks, and in particular, for low-resource language (LRL)
pairs, i.e., language pairs for which few or no parallel corpora exist. Our
work adapts variants of seq2seq models to perform transduction of such words
from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs
built from a bilingual dictionary of Hindi--Bhojpuri words. We demonstrate that
our models can be effectively used for language pairs that have limited
parallel corpora; our models work at the character level to grasp phonetic and
orthographic similarities across multiple types of word adaptations, whether
synchronic or diachronic, loan words or cognates. We describe the training
aspects of several character level NMT systems that we adapted to this task and
characterize their typical errors. Our method improves BLEU score by 6.3 on the
Hindi-to-Bhojpuri translation task. Further, we show that such transductions
can generalize well to other languages by applying them successfully to
Hindi--Bangla cognate pairs. Our work can be seen as an important step in the process
of: (i) resolving the OOV words problem arising in MT tasks, (ii) creating
effective parallel corpora for resource-constrained languages, and (iii)
leveraging the enhanced semantic knowledge captured by word-level embeddings to
perform character-level tasks.
Comment: 47 pages, 4 figures, 21 tables (including Appendices
Automatic Evaluation and Uniform Filter Cascades for Inducing N-Best Translation Lexicons
This paper shows how to induce an N-best translation lexicon from a bilingual
text corpus using statistical properties of the corpus together with four
external knowledge sources. The knowledge sources are cast as filters, so that
any subset of them can be cascaded in a uniform framework. A new objective
evaluation measure is used to compare the quality of lexicons induced with
different filter cascades. The best filter cascades improve lexicon quality by
up to 137% over the plain vanilla statistical method, and approach human
performance. Drastically reducing the size of the training corpus has a much
smaller impact on lexicon quality when these knowledge sources are used. This
makes it practical to train on small hand-built corpora for language pairs
where large bilingual corpora are unavailable. Moreover, three of the four
filters prove useful even when used with large training corpora.
Comment: To appear in Proceedings of the Third Workshop on Very Large Corpora,
15 pages, uuencoded compressed PostScrip
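The uniform filter framework described above can be sketched as plain function composition: every filter takes a list of candidate translation pairs and returns a subset, so any subset of filters can be chained in any order. The abstract here does not name the four knowledge sources, so the two toy filters below are hypothetical stand-ins, as are all function names.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]                      # (source word, target word)
Filter = Callable[[List[Pair]], List[Pair]]


def cascade(filters: Iterable[Filter]) -> Filter:
    """Compose filters so that each one sees the survivors of the previous one."""
    stages = tuple(filters)  # freeze, so the cascade can be reused safely

    def run(candidates: List[Pair]) -> List[Pair]:
        for stage in stages:
            candidates = stage(candidates)
        return candidates

    return run


def drop_non_alphabetic(candidates: List[Pair]) -> List[Pair]:
    """Hypothetical filter: discard pairs where either side is not a word."""
    return [(s, t) for s, t in candidates if s.isalpha() and t.isalpha()]


def drop_length_mismatch(candidates: List[Pair]) -> List[Pair]:
    """Hypothetical filter: discard pairs whose lengths differ by more than 3."""
    return [(s, t) for s, t in candidates if abs(len(s) - len(t)) <= 3]


lexicon_filter = cascade([drop_non_alphabetic, drop_length_mismatch])
```

Because every stage shares the same signature, comparing different cascades only requires changing the list passed to `cascade`.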