93 research outputs found

Exploring Semantic Textual Similarity

Author: González Agirre Aitor
Publication date: 01/01/2012

Measuring semantic similarity and relatedness between textual items (words, sentences, paragraphs or even documents) is an important research area in Natural Language Processing (NLP). It has many practical applications in other NLP tasks, for instance Word Sense Disambiguation, Textual Entailment, Paraphrase Detection, Machine Translation and Summarization, as well as related tasks such as Information Retrieval and Question Answering. In this master's thesis we study different approaches to computing the semantic similarity between textual items. In the framework of the European PATHS project, we also evaluate a knowledge-based method on a dataset of cultural item descriptions. Additionally, we describe the work carried out for the Semantic Textual Similarity (STS) shared task of SemEval-2012. This work involved supporting the creation of datasets for similarity tasks, as well as the organization of the task itself.

Archivo Digital para la Docencia y la Investigación

Exploiting domain information for Word Sense Disambiguation of medical documents

Author: Agirre Eneko
Soroa Aitor
Stevenson Mark
Publication venue: BMJ Group
Publication date: 01/03/2012

OBJECTIVE: Current techniques for knowledge-based Word Sense Disambiguation (WSD) of ambiguous biomedical terms rely on relations in the Unified Medical Language System Metathesaurus but do not take into account the domain of the target documents. The authors' goal is to improve these methods by using information about the topic of the document in which the ambiguous term appears. DESIGN: The authors proposed and implemented several methods to extract lists of key terms associated with Medical Subject Heading terms. These key terms are used to represent the document topic in a knowledge-based WSD system. They are applied both alone and in combination with local context. MEASUREMENTS: A standard measure of accuracy was calculated over the set of target words in the widely used National Library of Medicine WSD dataset. RESULTS AND DISCUSSION: The authors report a significant improvement when combining those key terms with local context, showing that domain information improves the results of a WSD system based on the Unified Medical Language System Metathesaurus alone. The best results were obtained using key terms obtained by relevance feedback and weighted by inverse document frequency.

PubMed Central

University of Melbourne Institutional Repository
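
The abstract above mentions key terms weighted by inverse document frequency. As a rough illustration only, the sketch below computes one common IDF variant, log(N / df), over a tiny invented corpus. The function names, the toy documents and the choice of IDF formula are assumptions made for this example and are not taken from the paper.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / df)} for a list of tokenised documents.

    Assumption: the plain log(N / df) variant is used here; the paper
    may use a different formulation.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def weight_key_terms(key_terms, idf):
    """Attach an IDF weight to each key term (0.0 for unseen terms)."""
    return {term: idf.get(term, 0.0) for term in key_terms}


# Hypothetical toy corpus, invented for illustration.
TOY_DOCUMENTS = [
    ["cold", "virus", "patient"],
    ["cold", "temperature", "patient"],
    ["patient", "fracture"],
]

IDF = inverse_document_frequency(TOY_DOCUMENTS)
WEIGHTS = weight_key_terms(["virus", "cold", "patient"], IDF)
```

In this toy setting a term that occurs in every document ("patient") receives weight zero, while rarer terms such as "virus" receive higher weights, which is the effect that makes IDF useful for picking out topic-specific key terms.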

Analyzing the Limitations of Cross-lingual Word Embedding Mappings

Author: Agirre Eneko
Artetxe Mikel
Labaka Gorka
Ormazabal Aitor
Soroa Aitor
Publication date: 01/01/2019

Recent research in cross-lingual word embeddings has almost exclusively focused on offline methods, which independently train word embeddings in different languages and map them to a shared space through linear transformations. While several authors have questioned the underlying isomorphism assumption, which states that word embeddings in different languages have approximately the same structure, it is not clear whether this is an inherent limitation of mapping approaches or a more general issue when learning cross-lingual embeddings. To answer this question, we experiment with parallel corpora, which allows us to compare offline mapping to an extension of skip-gram that jointly learns both embedding spaces. We observe that, under these ideal conditions, joint learning yields more isomorphic embeddings, is less sensitive to hubness, and obtains stronger results in bilingual lexicon induction. We thus conclude that current mapping methods do have strong limitations, calling for further research to jointly learn cross-lingual embeddings with a weaker cross-lingual signal. Comment: ACL 201

arXiv.org e-Print Archive

Crossref
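
The offline mapping idea in the abstract above (train two embedding spaces independently, then align them with a linear transformation) can be sketched on a toy case. The example below is a minimal illustration and not the authors' method: it uses invented 2-D vectors, restricts the linear map to a rotation, and builds the "Spanish" space as an exact rotation of the "English" one, so the isomorphism assumption holds perfectly by construction. All names and data are hypothetical. Real systems work with high-dimensional embeddings and typically solve for the map with a matrix decomposition.

```python
import math


def fit_rotation(source_vecs, target_vecs):
    """Return the angle of the 2-D rotation that best maps source onto target.

    Closed-form least-squares solution for the 2-D case: the angle that
    maximises the summed dot products of rotated source and target vectors.
    """
    dot = sum(x[0] * y[0] + x[1] * y[1] for x, y in zip(source_vecs, target_vecs))
    cross = sum(x[0] * y[1] - x[1] * y[0] for x, y in zip(source_vecs, target_vecs))
    return math.atan2(cross, dot)


def rotate(vec, theta):
    """Rotate a 2-D vector counter-clockwise by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def cosine(a, b):
    """Cosine similarity between two non-zero 2-D vectors."""
    return (a[0] * b[0] + a[1] * b[1]) / (math.hypot(*a) * math.hypot(*b))


def translate(word, theta, source_emb, target_emb):
    """Map a source word into the target space and return its nearest neighbour."""
    mapped = rotate(source_emb[word], theta)
    return max(target_emb, key=lambda w: cosine(mapped, target_emb[w]))


# Hypothetical toy embeddings, invented for illustration.
TRUE_ANGLE = math.radians(40)
EN = {"dog": (1.0, 0.1), "cat": (0.9, 0.3), "house": (0.1, 1.0), "car": (-0.8, 0.5)}
PAIRS = {"dog": "perro", "cat": "gato", "house": "casa", "car": "coche"}
# The target space is an exact rotation of the source space (ideal isomorphism).
ES = {PAIRS[w]: rotate(v, TRUE_ANGLE) for w, v in EN.items()}

# Learn the mapping from a two-word seed dictionary only.
SEED = [("dog", "perro"), ("house", "casa")]
THETA = fit_rotation([EN[s] for s, _ in SEED], [ES[t] for _, t in SEED])
```

Calling `translate("cat", THETA, EN, ES)` then induces a translation for a word that was not in the seed dictionary. When the two spaces are not related by such a clean transformation, no single map recovers all pairs, which is the limitation the abstract examines.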

Semantic Services in FreeLing 2.1: WordNet and UKB

Author: Agirre Eneko
Padró Lluís
Reese Samuel
Soroa Aitor
Publication date: 01/01/2010

FreeLing is an open-source multilingual language processing library providing a wide range of language analyzers for several languages. It offers text processing and language annotation facilities to natural language processing application developers, simplifying the task of building those applications. FreeLing is customizable and extensible. Developers can use the default linguistic resources (dictionaries, lexicons, grammars, etc.) directly, or extend them, adapt them to specific domains, or even develop new ones for specific languages. This paper presents the semantic services included in FreeLing, which are based on WordNet and EuroWordNet databases. The recent release of the UKB program under a GPL license made it possible to integrate a long-awaited word sense disambiguation module into FreeLing. UKB provides state-of-the-art all-words sense disambiguation for any language with an available WordNet. Postprint (published version

UPCommons. Portal del coneixement obert de la UPC

Improving search over Electronic Health Records using UMLS-based query expansion through random walks

Author: Agirre Eneko
Martinez David
Otegi Arantxa
Soroa Aitor
Publication venue: Elsevier Inc.
Publication date: 31/10/2014

Objective: Most of the information in Electronic Health Records (EHRs) is represented in free textual form. Practitioners searching EHRs need to phrase their queries carefully, as the record might use synonyms or other related words. In this paper we show that an automatic query expansion method based on the Unified Medical Language System (UMLS) Metathesaurus improves the results of a robust baseline when searching EHRs. Materials and methods: The method uses a graph representation of the lexical units, concepts and relations in the UMLS Metathesaurus. It is based on random walks over the graph, which start on the query terms. Random walks are a well-studied discipline in both Web and Knowledge Base datasets. Results: Our experiments over the TREC Medical Record track show improvements in both the 2011 and 2012 datasets over a strong baseline. Discussion: Our analysis shows that the success of our method is due to the automatic expansion of the query with extra terms, even when they are not directly related in the UMLS Metathesaurus. The terms added in the expansion go beyond simple synonyms, and also add other kinds of topically related terms. Conclusions: Expansion of queries using related terms in the UMLS Metathesaurus beyond synonymy is an effective way to overcome the gap between query and document vocabularies when searching for patient cohorts.

Elsevier - Publisher Connector
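
The abstract above describes query expansion through random walks that start on the query terms. One common way to realise such walks is personalized PageRank, and the sketch below uses it on a tiny hand-made graph. This is an assumption for illustration, not a description of the authors' system: the graph, its terms, the damping value and the function names are all invented, and no real UMLS data is involved.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power-iteration personalized PageRank.

    graph: dict mapping each node to a list of neighbour nodes
    seeds: set of nodes where the walk restarts (here, the query terms)
    Assumes every node has at least one neighbour and seeds are in the graph.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        # Restart mass goes back to the seed nodes.
        new_rank = {n: (1.0 - damping) * restart[n] for n in nodes}
        # The remaining mass is spread evenly over each node's neighbours.
        for node in nodes:
            share = damping * rank[node] / len(graph[node])
            for neighbour in graph[node]:
                new_rank[neighbour] += share
        rank = new_rank
    return rank


def expand_query(graph, query_terms, top_k=2):
    """Return the top_k highest-ranked terms that are not already in the query."""
    rank = personalized_pagerank(graph, set(query_terms))
    candidates = [(n, score) for n, score in rank.items() if n not in query_terms]
    candidates.sort(key=lambda item: item[1], reverse=True)
    return [n for n, _ in candidates[:top_k]]


# Hypothetical toy graph, invented for illustration (not real UMLS relations).
TOY_GRAPH = {
    "heart attack": ["myocardial infarction", "chest pain"],
    "myocardial infarction": ["heart attack", "troponin", "chest pain"],
    "chest pain": ["heart attack", "myocardial infarction"],
    "troponin": ["myocardial infarction"],
    "fracture": ["cast"],
    "cast": ["fracture"],
}

EXPANSION = expand_query(TOY_GRAPH, ["heart attack"], top_k=2)
```

Because the walk restarts only on the query terms, nodes in the unrelated "fracture" component never accumulate any score, while terms reachable from the query, including ones that are not direct synonyms, are ranked as expansion candidates.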

Towards zero-shot cross-lingual named entity disambiguation

Author: Agirre Bengoa Eneko
Barrena Madinabeitia Ander
Soroa Echave Aitor
Publication venue: Elsevier BV
Publication date: 01/12/2021

In cross-lingual Named Entity Disambiguation (XNED) the task is to link Named Entity mentions in text in some native language to English entities in a knowledge graph. XNED systems usually require training data for each native language, limiting their application for low-resource languages with small amounts of training data. Prior work has proposed so-called zero-shot transfer systems which are trained only on English training data, but which required native prior probabilities of entities with respect to mentions. These had to be estimated from native training examples, limiting their practical interest. In this work we present a zero-shot XNED architecture where, instead of a single disambiguation model, we have a model for each possible mention string, thus eliminating the need for native prior probabilities. Our system improves over prior work in XNED datasets in Spanish and Chinese by 32 and 27 points respectively, and matches the systems which do require native prior information. We experiment with different multilingual transfer strategies, showing that better results are obtained with a purpose-built multilingual pre-training method compared to state-of-the-art generic multilingual models such as XLM-R. We also discovered, surprisingly, that English is not necessarily the most effective zero-shot training language for XNED into English. For instance, Spanish is more effective when training a zero-shot XNED system that disambiguates Basque mentions with respect to an English knowledge graph.
This work has been partially funded by the Basque Government (IXA excellence research group (IT1343-19) and DeepText project), Project BigKnowledge (Ayudas Fundacion BBVA a equipos de investigacion cientifica 2018) and via the IARPA BETTER Program contract 2019-19051600006 (ODNI, IARPA activity). Ander Barrena enjoys a post-doctoral grant ESPDOC18/101 from the UPV/EHU and also acknowledges the support of the NVIDIA Corporation with the donation of a Titan V GPU used for this research. The author thankfully acknowledges the computer resources at CTE-Power9 + V100 and technical support provided by Barcelona Supercomputing Center (RES-IM-2020-1-0020).

Archivo Digital para la Docencia y la Investigación

Youth interaction with television and online video content in the digital age

Author: Astigarraga Agirre Idoia
Pavon Arrizabalaga Amaia
Zuberogoitia Espilla Aitor
Publication venue: Universidade de Aveiro
Publication date: 01/01/2018

This article examines the relationship of university students with television and online video content. Convergence processes in many areas during the digital age have significantly changed both audiovisual content consumption patterns and the content on offer itself. In addition, Web 2.0 has made it possible for interaction to go beyond mere consumption. The purpose of this research study was to ascertain what kind of interaction takes place between young people and audiovisual content. The categories analyzed are watch, share and create, with a focus on students’ everyday life. A mixed-method approach was used across a sample of 475 students from Mondragon University. Our main finding is that, although young people have the resources necessary to interact with media, this condition is not sufficient to favor behaviors that are more active. Young people show different practices and attitudes depending on the individual, the content, and the context but, in general, the interactive patterns that they have with television and online video content have more links with the mass communication paradigm than with the new communicative paradigm that arose in the Web 2.0 era.

Directory of Open Access Journals

eBiltegia