289 research outputs found
Sometimes less is more : Romanian word sense disambiguation revisited
Recent approaches to Word Sense Disambiguation (WSD) generally fall into two classes: (1) information-intensive approaches and (2) information-poor approaches. Our hypothesis is that for memory-based learning (MBL), a reduced amount of data is more beneficial than the full range of features used in the past. Our experiments show that MBL combined with a restricted set of features and a feature selection method that minimizes the feature set leads to competitive results, outperforming all systems that participated in the SENSEVAL-3 competition on the Romanian data. Thus, with this specific method, a tightly controlled feature set improves the accuracy of the classifier, reaching 74.0% in the fine-grained and 78.7% in the coarse-grained evaluation
Improving search over Electronic Health Records using UMLS-based query expansion through random walks
Objective: Most of the information in Electronic Health Records (EHRs) is represented in free textual form. Practitioners searching EHRs need to phrase their queries carefully, as the record might use synonyms or other related words. In this paper we show that an automatic query expansion method based on the Unified Medical Language System (UMLS) Metathesaurus improves the results of a robust baseline when searching EHRs.
Materials and methods: The method uses a graph representation of the lexical units, concepts and relations in the UMLS Metathesaurus. It is based on random walks over the graph, which start on the query terms. Random walks have been well studied on both Web and knowledge-base graphs.
Results: Our experiments over the TREC Medical Record track show improvements in both the 2011 and 2012 datasets over a strong baseline.
Discussion: Our analysis shows that the success of our method is due to the automatic expansion of the query with extra terms, even when they are not directly related in the UMLS Metathesaurus. The terms added in the expansion go beyond simple synonyms, and also add other kinds of topically related terms.
Conclusions: Expansion of queries using related terms in the UMLS Metathesaurus beyond synonymy is an effective way to overcome the gap between query and document vocabularies when searching for patient cohorts.
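As a rough illustration of query expansion by random walks that start on the query terms, the sketch below runs a personalized PageRank by power iteration over a small undirected graph. It is not the paper's implementation: the function names, the damping value, the iteration count and the toy edges are all assumptions made for the example, and no real UMLS data is used. It also assumes every query term appears in the graph.

```python
from collections import defaultdict

def random_walk_scores(edges, seeds, damping=0.85, iterations=50):
    """Personalized PageRank by power iteration over an undirected graph."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = sorted(neighbours)
    # With probability 1 - damping the walk restarts on one of the query terms.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            # Each node passes its damped mass equally to its neighbours.
            share = damping * scores[n] / len(neighbours[n])
            for m in neighbours[n]:
                nxt[m] += share
        scores = nxt
    return scores

def expand_query(edges, query_terms, k=3):
    """Return the k highest-scoring graph nodes that are not query terms."""
    scores = random_walk_scores(edges, set(query_terms))
    candidates = [n for n in scores if n not in query_terms]
    return sorted(candidates, key=lambda n: -scores[n])[:k]

# Hypothetical toy graph; the edges are invented for illustration only.
toy_edges = [
    ("heart attack", "myocardial infarction"),
    ("myocardial infarction", "chest pain"),
    ("chest pain", "angina"),
]
expanded = expand_query(toy_edges, ["heart attack"], k=2)
```

Terms reached through several hops can still receive a nonzero score, which matches the abstract's point that useful expansion terms need not be directly related to the query terms.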
Perspective Identification in Informal Text
This dissertation studies the problem of identifying the ideological perspective of people as expressed in their written text. One's perspective is often expressed in his/her stance towards polarizing topics. We are interested in studying how nuanced linguistic cues can be used to identify the perspective of a person in informal genres. Moreover, we are interested in exploring the problem from a multilingual perspective, comparing and contrasting the linguistic devices used in English informal-genre datasets discussing American ideological issues and in Arabic discussion fora posts related to Egyptian politics.
Our first and foremost goal is building computational systems that can successfully identify the perspective from which a given informal text is written, while studying which linguistic cues work best for each language and drawing insights into the similarities and differences between the notion of perspective in the two studied languages. We build computational systems that can successfully identify the stance of a person in English informal texts that deal with different topics determined by one's perspective, such as legalization of abortion, the feminist movement, and gay and gun rights. Additionally, we are able to identify a more general notion of perspective, namely the 2012 choice of presidential candidate, as well as build systems for automatically identifying different elements of a person's perspective given an Egyptian discussion forum comment. The systems utilize several lexical and semantic features for both languages. Specifically, for English we explore the use of word sense disambiguation, opinion features, and latent and frame semantics, as well as Linguistic Inquiry and Word Count features; in Arabic, however, in addition to using sentiment and latent semantics, we study whether linguistic code-switching (LCS) between the standard and dialectal forms of the language can help as a cue for uncovering the perspective from which a comment was written.
This leads us to the challenge of devising computational systems that can handle LCS in Arabic. The Arabic language has a diglossic nature where the standard form of the language (MSA) coexists with the regional dialects (DA) corresponding to the native mother tongue of Arabic speakers in different parts of the Arab world. DA is ubiquitously prevalent in written informal genres and in most cases it is code-switched with MSA. The presence of code-switching degrades the performance of almost any MSA-only trained Natural Language Processing tool when applied to DA or to code-switched MSA-DA content. In order to solve this challenge, we build a state-of-the-art system, AIDA, to computationally handle token and sentence-level code-switching.
On a conceptual level, for handling and processing Egyptian ideological perspectives, we note the lack of a taxonomy for the most common perspectives among Egyptians and the lack of corresponding annotated corpora. In solving this challenge, we develop a taxonomy for the most common community perspectives among Egyptians and use an iterative feedback-loop process to devise guidelines on how to successfully annotate a given online discussion forum post with different elements of a person's perspective. Using the proposed taxonomy and annotation guidelines, we annotate a large set of Egyptian discussion fora posts to identify a comment's perspective as conveyed in the priority expressed by the comment, as well as the stance on major political entities
Closing the gap in WSD: supervised results with unsupervised methods
Word-Sense Disambiguation (WSD) holds promise for many NLP applications requiring
broad-coverage language understanding, such as summarization (Barzilay and
Elhadad, 1997) and question answering (Ramakrishnan et al., 2003). Recent studies
have also shown that WSD can benefit machine translation (Vickrey et al., 2005) and
information retrieval (Stokoe, 2005). Much work has focused on the computational
treatment of sense ambiguity, primarily using data-driven methods. The most accurate
WSD systems to date are supervised and rely on the availability of sense-labeled
training data. This restriction poses a significant barrier to widespread use of WSD
in practice, since such data is extremely expensive to acquire for new languages and
domains.
Unsupervised WSD holds the key to enabling such applications, as it does not require
sense-labeled data. However, unsupervised methods fall far behind supervised ones
in terms of accuracy and ease of use. In this thesis we explore the reasons for this,
and present solutions to remedy this situation. We hypothesize that one of the main
problems with unsupervised WSD is its lack of a standard formulation and general
purpose tools common to supervised methods. As a first step, we examine existing approaches
to unsupervised WSD, with the aim of detecting independent principles that
can be utilized in a general framework. We investigate ways of leveraging the diversity
of existing methods, using ensembles, a common tool in the supervised learning
framework. This approach allows us to achieve accuracy beyond that of the individual
methods, without need for extensive modification of the underlying systems.
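One simple way to combine several existing disambiguation methods without modifying them is majority voting over their outputs. The sketch below is only an assumed illustration of that idea, not the ensemble used in the thesis: the function name, the use of `None` for an abstaining system, and the fallback to a supplied predominant sense on ties are all choices made for the example.

```python
from collections import Counter

def ensemble_sense(predictions, predominant_sense):
    """Majority vote over per-system sense labels.

    predictions: one label per system, or None where a system abstains.
    predominant_sense: fallback label used when the vote is empty or tied.
    """
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return predominant_sense
    ranked = votes.most_common(2)
    # A tie between the two leading labels counts as uncertainty.
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return predominant_sense
    return ranked[0][0]

# Hypothetical sense labels from three systems for one occurrence of "bank".
chosen = ensemble_sense(["bank/finance", "bank/river", "bank/finance"], "bank/finance")
```

Because the function only consumes labels, each underlying system can be treated as a black box.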
Our examination of existing unsupervised approaches highlights the importance of
using the predominant sense in case of uncertainty, and the effectiveness of statistical
similarity methods as a tool for WSD. However, it also serves to emphasize the need for
a way to merge and combine learning elements, and the potential of a supervised-style
approach to the problem. Relying on existing methods does not take full advantage of
the insights gained from the supervised framework.
We therefore present an unsupervised WSD system which circumvents the question
of actual disambiguation method, which is the main source of discrepancy in unsupervised
WSD, and deals directly with the data. Our method uses statistical and semantic
similarity measures to produce labeled training data in a completely unsupervised fashion.
This allows the training and use of any standard supervised classifier for the actual
disambiguation. Classifiers trained with our method significantly outperform those using
other methods of data generation, and represent a big step in bridging the accuracy
gap between supervised and unsupervised methods.
Finally, we address a major drawback of classical unsupervised systems: their reliance
on a fixed sense inventory and lexical resources. This dependence represents
a substantial setback for unsupervised methods in cases where such resources are unavailable.
Unfortunately, these are exactly the areas in which unsupervised methods are
most needed. Unsupervised sense-discrimination, which does not share those restrictions,
presents a promising solution to the problem. We therefore develop an unsupervised
sense discrimination system. We base our system on a well-studied probabilistic
generative model, Latent Dirichlet Allocation (Blei et al., 2003), which has many of
the advantages of supervised frameworks. The model's probabilistic nature lends itself
to easy combination and extension, and its generative aspect is well suited to linguistic
tasks. Our model achieves state-of-the-art performance on the unsupervised sense
induction task, while remaining independent of any fixed sense inventory, and thus
represents a fully unsupervised, general purpose, WSD tool
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
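To make the inductive process concrete, here is a minimal sketch of one possible learner: a multinomial naive Bayes classifier with add-one smoothing, built from a handful of preclassified documents. It is only one of many approaches such a survey could cover, and the bag-of-words tokenization, the function names and the toy training documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (text, category) pairs. Returns the counts the classifier needs."""
    word_counts = defaultdict(Counter)  # category -> token frequencies
    doc_counts = Counter()              # category -> number of training documents
    vocab = set()
    for text, cat in docs:
        tokens = text.lower().split()   # document representation: bag of words
        word_counts[cat].update(tokens)
        doc_counts[cat] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, -math.inf
    for cat in sorted(doc_counts):
        score = math.log(doc_counts[cat] / total_docs)
        denom = sum(word_counts[cat].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[cat][tok] + 1) / denom)
        if score > best_score:
            best, best_score = cat, score
    return best

# Hypothetical preclassified documents.
training = [
    ("goal match team", "sport"),
    ("team wins match", "sport"),
    ("election vote party", "politics"),
    ("party wins vote", "politics"),
]
model = train(training)
label = classify(model, "the team scored a goal")
```

The two functions correspond to classifier construction and classifier use; evaluating the classifier would additionally need a held-out set of labeled documents.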