7 research outputs found

    Mistake-Driven Learning in Text Categorization

    Learning problems in the text processing domain often map the text to a space whose dimensions are the measured features of the text, e.g., its words. Three characteristic properties of this domain are (a) very high dimensionality, (b) both the learned concepts and the instances reside very sparsely in the feature space, and (c) a high variation in the number of active features in an instance. In this work we study three mistake-driven learning algorithms for a typical task of this nature -- text categorization. We argue that these algorithms -- which categorize documents by learning a linear separator in the feature space -- have a few properties that make them ideal for this domain. We then show that a quantum leap in performance is achieved when we further modify the algorithms to better address some of the specific characteristics of the domain. In particular, we demonstrate (1) how variation in document length can be tolerated by either normalizing feature weights or by using negative weights, (2) the positive effect of applying a threshold range in training, (3) alternatives in considering feature frequency, and (4) the benefits of discarding features while training. Overall, we present an algorithm, a variation of Littlestone's Winnow, which performs significantly better than any other algorithm tested on this task using a similar feature set. Comment: 9 pages, uses aclap.st
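For readers unfamiliar with mistake-driven learning, the following is a minimal sketch of the basic Winnow update that the abstract above builds on: weights change only when the classifier makes a mistake, and they change multiplicatively for the features active in that document. It does not reproduce the paper's modified variant (threshold range, negative weights, feature discarding). The function names, the parameters `alpha`, `threshold` and `epochs`, and the toy documents in `DOCS` are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical toy corpus: (set of active word features, is_positive_category).
DOCS = [
    ({"wheat", "grain", "export"}, True),
    ({"grain", "harvest"}, True),
    ({"stock", "market", "export"}, False),
    ({"market", "shares"}, False),
]


def train_winnow(
    examples: Iterable[Tuple[Set[str], bool]],
    alpha: float = 2.0,
    threshold: float = 2.5,
    epochs: int = 5,
) -> Dict[str, float]:
    """Train basic Winnow over sparse binary features (illustrative only)."""
    data = list(examples)
    weights: Dict[str, float] = {}
    for _ in range(epochs):
        for active, label in data:
            # Every feature starts at weight 1.0 the first time it is seen.
            score = sum(weights.setdefault(f, 1.0) for f in active)
            predicted = score > threshold
            if predicted == label:
                continue  # mistake-driven: correct predictions change nothing
            # Promote active features on a missed positive, demote on a false alarm.
            factor = alpha if label else 1.0 / alpha
            for f in active:
                weights[f] *= factor
    return weights


def predict(weights: Dict[str, float], active: Set[str], threshold: float = 2.5) -> bool:
    """Classify a document; unseen features keep the initial weight 1.0."""
    return sum(weights.get(f, 1.0) for f in active) > threshold
```

Calling `train_winnow(DOCS)` and then `predict(weights, {"grain", "wheat"})` exercises the sketch. Only the features present in a misclassified document are touched on each update, which is why updates of this kind stay cheap in the sparse, high-dimensional setting the abstract describes.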

    Similarity-based word sense disambiguation

    We describe a method for automatic word sense disambiguation using a text corpus and a machine-readable dictionary (MRD). The method is based on word similarity and context similarity measures. Words are considered similar if they appear in similar contexts; contexts are similar if they contain similar words. The circularity of this definition is resolved by an iterative, converging process, in which the system learns from the corpus a set of typical usages for each of the senses of the polysemous word listed in the MRD. A new instance of a polysemous word is assigned the sense associated with the typical usage most similar to its context. Experiments show that this method can learn even from very sparse training data, achieving over 92% correct disambiguation performance.

    Similarity-based Word Sense Disambiguation

    We describe a method for automatic word sense disambiguation using a text corpus and a machine-readable dictionary (MRD). The method is based on word similarity and context similarity measures. Words are considered similar if they appear in similar contexts; contexts are similar if they contain similar words. The circularity of this definition is resolved by an iterative, converging process, in which the system learns from the corpus a set of typical usages for each of the senses of the polysemous word listed in the MRD. A new instance of a polysemous word is assigned the sense associated with the typical usage most similar to its context. Experiments show that this method can learn even from very sparse training data, achieving over 92% correct disambiguation performance.

    Learning Similarity-Based Word Sense Disambiguation from Sparse Data

    We describe a method for automatic word sense disambiguation using a text corpus and a machine-readable dictionary (MRD). The method is based on word similarity and context similarity measures. Words are considered similar if they appear in similar contexts; contexts are similar if they contain similar words. The circularity of this definition is resolved by an iterative, converging process, in which the system learns from the corpus a set of typical usages for each of the senses of the polysemous word listed in the MRD. A new instance of a polysemous word is assigned the sense associated with the typical usage most similar to its context. Experiments show that this method performs well and can learn even from very sparse training data.

    MicroRNA expression detected by oligonucleotide microarrays: System establishment and expression profiling in human tissues

    MicroRNAs (MIRs) are a novel group of conserved short ∼22 nucleotide-long RNAs with important roles in regulating gene expression. We have established a MIR-specific oligonucleotide microarray system that enables efficient analysis of the expression of the human MIRs identified so far. We show that the 60-mer oligonucleotide probes on the microarrays hybridize with labeled cRNA of MIRs, but not with their precursor hairpin RNAs, derived from amplified, size-fractionated, total RNA of human origin. Signal intensity is related to the location of the MIR sequences within the 60-mer probes, with location at the 5′ region giving the highest signals and location at the 3′ end giving the lowest. Accordingly, 60-mer probes harboring one MIR copy at the 5′ end gave signals of similar intensity to probes containing two or three MIR copies. Mismatch analysis shows that mutations within the MIR sequence significantly reduce or eliminate the signal, suggesting that the observed signals faithfully reflect the abundance of matching MIRs in the labeled cRNA. Expression profiling of 150 MIRs in five human tissues and in HeLa cells revealed a good overall concordance with previously published results, but also some differences. We present novel data on MIR expression in thymus, testes, and placenta, and have identified MIRs highly enriched in these tissues. Taken together, these results highlight the increased sensitivity of the DNA microarray over other methods for the detection and study of MIRs, and the immense potential in applying such microarrays for the study of MIRs in health and disease.

    Metaphor: A Computational Perspective
