260 research outputs found
Using distributional similarity to organise biomedical terminology
We investigate an application of distributional similarity techniques to the problem of structural organisation of biomedical terminology. Our application domain is the relatively small GENIA corpus. Using terms that have been accurately marked-up by hand within the corpus, we consider the problem of automatically determining semantic proximity. Terminological units are dened for our purposes as normalised classes of individual terms. Syntactic analysis of the corpus data is carried out using the Pro3Gres parser and provides the data required to calculate distributional similarity using a variety of dierent measures. Evaluation is performed against a hand-crafted gold standard for this domain in the form of the GENIA ontology. We show that distributional similarity can be used to predict semantic type with a good degree of accuracy
From Paraphrase Database to Compositional Paraphrase Model and Back
The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive
semantic resource, consisting of a list of phrase pairs with (heuristic)
confidence estimates. However, it is still unclear how it can best be used, due
to the heuristic nature of the confidences and its necessarily incomplete
coverage. We propose models to leverage the phrase pairs from the PPDB to build
parametric paraphrase models that score paraphrase pairs more accurately than
the PPDB's internal scores while simultaneously improving its coverage. They
allow for learning phrase embeddings as well as improved word embeddings.
Moreover, we introduce two new, manually annotated datasets to evaluate
short-phrase paraphrasing models. Using our paraphrase model trained using
PPDB, we achieve state-of-the-art results on standard word and bigram
similarity tasks and beat strong baselines on our new short phrase paraphrase
tasks.Comment: 2015 TACL paper updated with an appendix describing new 300
dimensional embeddings. Submitted 1/2015. Accepted 2/2015. Published 6/201
Extended Parallel Corpus for Amharic-English Machine Translation
This paper describes the acquisition, preprocessing, segmentation, and
alignment of an Amharic-English parallel corpus. It will be useful for machine
translation of an under-resourced language, Amharic. The corpus is larger than
previously compiled corpora; it is released for research purposes. We trained
neural machine translation and phrase-based statistical machine translation
models using the corpus. In the automatic evaluation, neural machine
translation models outperform phrase-based statistical machine translation
models.Comment: Accepted to 2nd AfricanNLP workshop at EACL 202
- …