8,890 research outputs found
Sentence similarity-based source context modelling in PBSMT
Target phrase selection, a crucial component of the state-of-the-art phrase-based statistical machine translation (PBSMT) model, plays a key role in generating accurate translation hypotheses. Inspired by context-rich word-sense disambiguation techniques, machine translation (MT) researchers have successfully integrated various types of source language context into the PBSMT model to improve target phrase selection. Among the various types of lexical and syntactic features, lexical syntactic descriptions in the form of supertags that preserve long-range word-to-word dependencies in a sentence have proven to be effective. These rich contextual features are able to disambiguate a source phrase, on the basis of the local syntactic behaviour of that phrase. In addition to local contextual information, global contextual information such as the grammatical structure of a sentence, sentence length and n-gram word sequences could provide additional important information to enhance this phrase-sense disambiguation. In this work, we explore various sentence similarity features by measuring similarity between a source sentence to be translated with the source-side of the bilingual training sentences and integrate them directly into the PBSMT model. We performed experiments on an English-to-Chinese translation task by applying sentence-similarity features both individually, and collaboratively with supertag-based features. We evaluate the performance of our approach and report a statistically significant relative improvement of 5.25% BLEU score when adding a sentence-similarity feature together with a supertag-based feature
An algorithm for cross-lingual sense-clustering tested in a MT evaluation setting
Unsupervised sense induction methods offer a solution to the
problem of scarcity of semantic resources. These methods
automatically extract semantic information from textual data
and create resources adapted to specific applications and domains of interest. In this paper, we present a clustering algorithm for cross-lingual sense induction which generates
bilingual semantic inventories from parallel corpora. We describe the clustering procedure and the obtained resources. We then proceed to a large-scale evaluation by integrating the resources into a Machine Translation (MT) metric (METEOR). We show that the use of the data-driven sense-cluster inventories leads to better correlation with human judgments of translation quality, compared to precision-based metrics, and to improvements similar to those obtained when a handcrafted semantic resource is used
A Survey of Paraphrasing and Textual Entailment Methods
Paraphrasing methods recognize, generate, or extract phrases, sentences, or
longer natural language expressions that convey almost the same information.
Textual entailment methods, on the other hand, recognize, generate, or extract
pairs of natural language expressions, such that a human who reads (and trusts)
the first element of a pair would most likely infer that the other element is
also true. Paraphrasing can be seen as bidirectional textual entailment and
methods from the two areas are often similar. Both kinds of methods are useful,
at least in principle, in a wide range of natural language processing
applications, including question answering, summarization, text generation, and
machine translation. We summarize key ideas from the two areas by considering
in turn recognition, generation, and extraction methods, also pointing to
prominent articles and resources.Comment: Technical Report, Natural Language Processing Group, Department of
Informatics, Athens University of Economics and Business, Greece, 201
BIKE: Bilingual Keyphrase Experiments
This paper presents a novel strategy for translating lists
of keyphrases. Typical keyphrase lists appear in
scientific articles, information retrieval systems and
web page meta-data. Our system combines a statistical
translation model trained on a bilingual corpus of
scientific papers with sense-focused look-up in a large
bilingual terminological resource. For the latter,
we developed a novel technique that benefits from viewing
the keyphrase list as contextual help for sense
disambiguation. The optimal combination of modules was
discovered by a genetic algorithm. Our work applies to
the French / English language pair
Automatic Construction of Clean Broad-Coverage Translation Lexicons
Word-level translational equivalences can be extracted from parallel texts by
surprisingly simple statistical techniques. However, these techniques are
easily fooled by {\em indirect associations} --- pairs of unrelated words whose
statistical properties resemble those of mutual translations. Indirect
associations pollute the resulting translation lexicons, drastically reducing
their precision. This paper presents an iterative lexicon cleaning method. On
each iteration, most of the remaining incorrect lexicon entries are filtered
out, without significant degradation in recall. This lexicon cleaning technique
can produce translation lexicons with recall and precision both exceeding 90\%,
as well as dictionary-sized translation lexicons that are over 99\% correct.Comment: PostScript file, 10 pages. To appear in Proceedings of AMTA-9
Language classification from bilingual word embedding graphs
We study the role of the second language in bilingual word embeddings in
monolingual semantic evaluation tasks. We find strongly and weakly positive
correlations between down-stream task performance and second language
similarity to the target language. Additionally, we show how bilingual word
embeddings can be employed for the task of semantic language classification and
that joint semantic spaces vary in meaningful ways across second languages. Our
results support the hypothesis that semantic language similarity is influenced
by both structural similarity as well as geography/contact.Comment: To be published at Coling 201
- …