Using the Outlier Detection Task to Evaluate Distributional Semantic Models
In this article, we define the outlier detection task and use it to compare neural-based word embeddings with transparent count-based distributional representations. Using the English Wikipedia as a text source to train the models, we observed that embeddings outperform count-based representations when their contexts are made up of bag-of-words. However, there are no sharp differences between the two models if the word contexts are defined as syntactic dependencies. In general, syntax-based models tend to perform better than those based on bag-of-words for this specific task. Similar experiments were carried out for Portuguese with similar results. The test datasets we have created for the outlier detection task in English and Portuguese are freely available. This work was supported by a 2016 BBVA Foundation Grant for Researchers and Cultural Creators and by Project TELEPARES, Ministry of Economy and Competitiveness (FFI2014-51978-C2-1-R). It has received financial support from the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016–2019, ED431G/08) and the European Regional Development Fund (ERDF).
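The task this abstract describes can be illustrated with a minimal sketch: given a set of words, the outlier is the one with the lowest compactness score, i.e. the lowest average cosine similarity to the other members of the set. The two-dimensional toy vectors below are invented for illustration; they are not taken from the paper's trained models.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def find_outlier(words, vectors):
    """Return the word least similar, on average, to the rest of the set."""
    def compactness(w):
        others = [x for x in words if x != w]
        return sum(cosine(vectors[w], vectors[x]) for x in others) / len(others)
    return min(words, key=compactness)

# Invented toy vectors: three fruits clustered together, one vehicle apart.
vectors = {
    "apple": [1.0, 0.1],
    "pear":  [0.9, 0.2],
    "plum":  [1.0, 0.0],
    "car":   [0.0, 1.0],
}
find_outlier(["apple", "pear", "plum", "car"], vectors)  # → 'car'
```

With real embeddings the same scoring applies unchanged; only the vector lookup differs.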
Biomedical ontology alignment: An approach based on representation learning
While representation learning techniques have shown great promise in application to a number of different NLP tasks, they have had little impact on the problem of ontology matching. Unlike past work that has focused on feature engineering, we present a novel representation learning approach that is tailored to the ontology matching task. Our approach is based on embedding ontological terms in a high-dimensional Euclidean space. This embedding is derived on the basis of a novel phrase retrofitting strategy through which semantic similarity information becomes inscribed onto fields of pre-trained word vectors. The resulting framework also incorporates a novel outlier detection mechanism based on a denoising autoencoder that is shown to improve performance. An ontology matching system derived using the proposed framework achieved an F-score of 94% on an alignment scenario involving the Adult Mouse Anatomical Dictionary and the Foundational Model of Anatomy ontology (FMA) as targets. This compares favorably with the best-performing systems on the Ontology Alignment Evaluation Initiative anatomy challenge. We performed additional experiments on aligning FMA to NCI Thesaurus and to SNOMED CT based on a reference alignment extracted from the UMLS Metathesaurus. Our system obtained overall F-scores of 93.2% and 89.2% for these experiments, thus achieving state-of-the-art results.
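The autoencoder-based outlier detection mentioned above rests on a general principle: inputs far from the distribution the model was trained on reconstruct poorly, so a high reconstruction error flags an outlier. A minimal illustration of that principle follows, with a hypothetical fixed "autoencoder" (a projection onto the x-axis) standing in for the paper's trained denoising network.

```python
def reconstruction_error(x, reconstruct):
    """Squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, reconstruct(x)))

# Hypothetical stand-in for a trained autoencoder: it assumes inliers
# lie near the x-axis and simply projects every point onto it.
encode_decode = lambda v: [v[0], 0.0]

inlier = [1.0, 0.05]   # close to the modeled distribution → small error
outlier = [0.2, 0.9]   # far from it → large error
reconstruction_error(inlier, encode_decode)   # 0.0025
reconstruction_error(outlier, encode_decode)  # 0.81
```

A real denoising autoencoder would replace `encode_decode` with a learned encoder/decoder pair; the thresholding on reconstruction error stays the same.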
Creation and Evaluation of Datasets for Distributional Semantics Tasks in the Digital Humanities Domain
Word embeddings are already well studied in the general domain, usually trained on large text corpora, and have been evaluated, for example, on word similarity and analogy tasks, as well as an input to downstream NLP processes. In contrast, in this work we explore the suitability of word embedding technologies in the specialized digital humanities domain. After training embedding models of various types on two popular fantasy novel book series, we evaluate their performance on two task types: term analogies and word intrusion. To this end, we manually construct test datasets with domain experts. Among the contributions is the evaluation of various word embedding techniques on the different task types, with the finding that even embeddings trained on small corpora perform well, for example, on the word intrusion task. Furthermore, we provide extensive, high-quality datasets in the digital humanities domain for further investigation, as well as an implementation to easily reproduce or extend the experiments.
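The term-analogy evaluation mentioned in this abstract is commonly run with the vector-offset (3CosAdd) method: solve a : b :: c : ? by finding the vocabulary word closest to b − a + c. A minimal sketch, using invented two-dimensional toy vectors rather than the paper's trained embeddings:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Solve a : b :: c : ? with the vector-offset (3CosAdd) method."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidate answers.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Invented toy vectors; the second dimension loosely encodes "gender".
toy_vectors = {
    "man":    [1.0, 0.0], "woman": [1.0, 1.0],
    "king":   [2.0, 0.0], "queen": [2.0, 1.0],
    "prince": [3.0, 0.0],
}
analogy("man", "woman", "king", toy_vectors)  # → 'queen'
```

Word intrusion, the other task type evaluated here, can reuse the compactness-based scoring of the outlier detection task: the intruder is the word least similar, on average, to the rest of the set.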