28 research outputs found

    Distinguishing Word Senses in Untagged Text

    Full text link
    This paper describes an experimental comparison of three unsupervised learning algorithms that distinguish the sense of an ambiguous word in untagged text. The methods described in this paper, McQuitty's similarity analysis, Ward's minimum-variance method, and the EM algorithm, assign each instance of an ambiguous word to a known sense definition based solely on the values of automatically identifiable features in text. These methods and feature sets are found to be more successful in disambiguating nouns rather than adjectives or verbs. Overall, the most accurate of these procedures is McQuitty's similarity analysis in combination with a high dimensional feature set.Comment: 11 pages, latex, uses aclap.st

    Computational Approaches to Measuring the Similarity of Short Contexts : A Review of Applications and Methods

    Full text link
    Measuring the similarity of short written contexts is a fundamental problem in Natural Language Processing. This article provides a unifying framework by which short context problems can be categorized both by their intended application and proposed solution. The goal is to show that various problems and methodologies that appear quite different on the surface are in fact very closely related. The axes by which these categorizations are made include the format of the contexts (headed versus headless), the way in which the contexts are to be measured (first-order versus second-order similarity), and the information used to represent the features in the contexts (micro versus macro views). The unifying thread that binds together many short context applications and methods is the fact that similarity decisions must be made between contexts that share few (if any) words in common.Comment: 23 page

    A new clustering method for detecting rare senses of abbreviations in clinical notes

    Get PDF
    AbstractAbbreviations are widely used in clinical documents and they are often ambiguous. Building a list of possible senses (also called sense inventory) for each ambiguous abbreviation is the first step to automatically identify correct meanings of abbreviations in given contexts. Clustering based methods have been used to detect senses of abbreviations from a clinical corpus [1]. However, rare senses remain challenging and existing algorithms are not good enough to detect them. In this study, we developed a new two-phase clustering algorithm called Tight Clustering for Rare Senses (TCRS) and applied it to sense generation of abbreviations in clinical text. Using manually annotated sense inventories from a set of 13 ambiguous clinical abbreviations, we evaluated and compared TCRS with the existing Expectation Maximization (EM) clustering algorithm for sense generation, at two different levels of annotation cost (10 vs. 20 instances for each abbreviation). Our results showed that the TCRS-based method could detect 85% senses on average; while the EM-based method found only 75% senses, when similar annotation effort (about 20 instances) was used. Further analysis demonstrated that the improvement by the TCRS method was mainly from additionally detected rare senses, thus indicating its usefulness for building more complete sense inventories of clinical abbreviations

    A Quantitative Evaluation of Global Word Sense Induction

    Get PDF
    International audienceWord sense induction (WSI) is the task aimed at automatically identifying the senses of words in texts, without the need for handcrafted resources or annotated data. Up till now, most WSI algorithms extract the different senses of a word 'locally' on a per-word basis, i.e. the different senses for each word are determined separately. In this paper, we compare the performance of such algorithms to an algorithm that uses a 'global' approach, i.e. the different senses of a particular word are determined by comparing them to, and demarcating them from, the senses of other words in a full-blown word space model. We adopt the evaluation framework proposed in the SemEval-2010 Word Sense Induction \& Disambiguation task. All systems that participated in this task use a local scheme for determining the different senses of a word. We compare their results to the ones obtained by the global approach, and discuss the advantages and weaknesses of both approaches

    An Intelligent Information Retrieval System Using Automatic Word Sense Disambiguation

    Get PDF
    This is the published version. Copyright De GruyterThis paper aims to establish that an intelligent contextual infonnation retrieval (IR) system can improve the quality of search results by retrieving more relevant results than those obtained with traditional search engines. Search engines capable of implicit, explicit, and no contextual retrieval were designed and implemented and their performances studied. Experimental results showed that search engines with contextual IR produce results that are more relevant, and the outcomes further indicate that there is no perceived gain in choosing specifically any one of the two approaches of implicit or explicit. The performance of the indexing mechanism, as it classifies document tokens with their appropriate contexts/word sense, was evaluated. The effectiveness of the word sense disambiguation process was found to depend to a great extent on the process (implementation) as well as the raw data (thesaurus)

    Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge to automated methods, which identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation are metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively.</p> <p>Results</p> <p>The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate.</p> <p>Conclusion</p> <p>Metadata is valuable for disambiguation, but requires high quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater 90% success given a consistently modelled ontology. Overall, the results show that well structured ontologies can play a very important role to improve disambiguation.</p> <p>Availability</p> <p>The three benchmark datasets created for the purpose of disambiguation are available in Additional file <supplr sid="S1">1</supplr>.</p> <suppl id="S1"> <title> <p>Additional file 1</p> </title> <text> <p><b>Benchmark datasets used in the experiments.</b> The three corpora (High quality/Low quantity corpus; Medium quality/Medium quantity corpus; Low quality/High quantity corpus) are given in the form of PubMed identifiers (PMID) for True/False cases for the 7 ambiguous terms examined (GO/MeSH/UMLS identifiers are also given).</p> </text> <file name="1471-2105-10-28-S1.txt"> <p>Click here for file</p> </file> </suppl
    corecore