71,598 research outputs found
Approximating solution structure of the Weighted Sentence Alignment problem
We study the complexity of approximating solution structure of the bijective
weighted sentence alignment problem of DeNero and Klein (2008). In particular,
we consider the complexity of finding an alignment that has a significant
overlap with an optimal alignment. We discuss ways of representing the solution
for the general weighted sentence alignment as well as phrases-to-words
alignment problem, and show that computing a string which agrees with the
optimal sentence partition on more than half (plus an arbitrarily small
polynomial fraction) positions for the phrases-to-words alignment is NP-hard.
For the general weighted sentence alignment we obtain such bound from the
agreement on a little over 2/3 of the bits. Additionally, we generalize the
Hamming distance approximation of a solution structure to approximating it with
respect to the edit distance metric, obtaining similar lower bounds
The impact of morphological errors in phrase-based statistical machine translation from German and English into Swedish
We have investigated the potential for improvement in target language morphology when translating into Swedish from English and German, by measuring the errors made by a state of the art phrase-based statistical machine translation system. Our results show that there is indeed a performance gap to be filled by better modelling of inflectional morphology and compounding; and that the gap is not filled by
simply feeding the translation system with more training data
Induction of Word and Phrase Alignments for Automatic Document Summarization
Current research in automatic single document summarization is dominated by
two effective, yet naive approaches: summarization by sentence extraction, and
headline generation via bag-of-words models. While successful in some tasks,
neither of these models is able to adequately capture the large set of
linguistic devices utilized by humans when they produce summaries. One possible
explanation for the widespread use of these models is that good techniques have
been developed to extract appropriate training data for them from existing
document/abstract and document/headline corpora. We believe that future
progress in automatic summarization will be driven both by the development of
more sophisticated, linguistically informed models, as well as a more effective
leveraging of document/abstract corpora. In order to open the doors to
simultaneously achieving both of these goals, we have developed techniques for
automatically producing word-to-word and phrase-to-phrase alignments between
documents and their human-written abstracts. These alignments make explicit the
correspondences that exist in such document/abstract pairs, and create a
potentially rich data source from which complex summarization algorithms may
learn. This paper describes experiments we have carried out to analyze the
ability of humans to perform such alignments, and based on these analyses, we
describe experiments for creating them automatically. Our model for the
alignment task is based on an extension of the standard hidden Markov model,
and learns to create alignments in a completely unsupervised fashion. We
describe our model in detail and present experimental results that show that
our model is able to learn to reliably identify word- and phrase-level
alignments in a corpus of pairs
Biomedical ontology alignment: An approach based on representation learning
While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results
A Survey of Word Reordering in Statistical Machine Translation: Computational Models and Language Phenomena
Word reordering is one of the most difficult aspects of statistical machine
translation (SMT), and an important factor of its quality and efficiency.
Despite the vast amount of research published to date, the interest of the
community in this problem has not decreased, and no single method appears to be
strongly dominant across language pairs. Instead, the choice of the optimal
approach for a new translation task still seems to be mostly driven by
empirical trials. To orientate the reader in this vast and complex research
area, we present a comprehensive survey of word reordering viewed as a
statistical modeling challenge and as a natural language phenomenon. The survey
describes in detail how word reordering is modeled within different
string-based and tree-based SMT frameworks and as a stand-alone task, including
systematic overviews of the literature in advanced reordering modeling. We then
question why some approaches are more successful than others in different
language pairs. We argue that, besides measuring the amount of reordering, it
is important to understand which kinds of reordering occur in a given language
pair. To this end, we conduct a qualitative analysis of word reordering
phenomena in a diverse sample of language pairs, based on a large collection of
linguistic knowledge. Empirical results in the SMT literature are shown to
support the hypothesis that a few linguistic facts can be very useful to
anticipate the reordering characteristics of a language pair and to select the
SMT framework that best suits them.Comment: 44 pages, to appear in Computational Linguistic
- …