29 research outputs found

    A Supervised Learning Approach to Acronym Identification

    This paper addresses the task of finding acronym-definition pairs in text. Most previous work on the topic involves systems built from manually generated rules or regular expressions. In this paper, we present a supervised learning approach to the acronym identification task. Our approach reduces the search space of the supervised learning system by placing some weak constraints on the kinds of acronym-definition pairs that can be identified. We obtain results comparable to hand-crafted systems that use stronger constraints. We describe our method for reducing the search space, the features used by our supervised learning system, and our experiments with various learning schemes.
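    A minimal sketch of how weak constraints can shrink the candidate space before a learner sees it: the parenthesis pattern, window size, and feature set below are illustrative assumptions, not the constraints or features used in the paper.

```python
# Minimal sketch of constrained candidate generation and feature extraction
# for acronym-definition classification. The constraints and features here
# are illustrative assumptions, not the paper's exact design.
import re

def candidate_pairs(text, max_window=2):
    """Yield (definition_window, acronym) candidates.

    Weak constraint (assumed): the acronym appears in parentheses and the
    candidate definition is at most max_window * len(acronym) preceding words.
    """
    for m in re.finditer(r"\(([A-Z][A-Za-z0-9&-]{1,9})\)", text):
        acronym = m.group(1)
        preceding = text[:m.start()].split()
        window = preceding[-max_window * len(acronym):]
        # Emit every suffix of the window as a candidate definition.
        for i in range(len(window)):
            yield " ".join(window[i:]), acronym

def features(definition, acronym):
    """Simple features a supervised learner could use (assumed set)."""
    def_words = definition.split()
    return {
        "len_ratio": len(def_words) / max(len(acronym), 1),
        "first_letters_match": sum(
            w[0].lower() == c.lower() for w, c in zip(def_words, acronym)
        ) / max(len(acronym), 1),
        "acronym_in_definition": acronym.lower() in definition.lower(),
    }

if __name__ == "__main__":
    text = "The report cites the International Monetary Fund (IMF) twice."
    for definition, acronym in candidate_pairs(text):
        print(acronym, "<-", definition, features(definition, acronym))
```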

    Using compression to identify acronyms in text

    Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens---names, dates, locations, etc.---can be identified and located in running text, using compression models to provide the leverage necessary to distinguish different token types (Witten et al., 1999).
    Comment: 10 pages. A short form published in DCC200
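    A minimal sketch of the compression idea: train one character-level model per token type and assign a new token to whichever model encodes it in the fewest bits. A simple bigram model with add-one smoothing stands in here for the PPM-style compression models; the token types and training examples are made up for illustration.

```python
# Compression-based token classification sketch: each token type gets a
# character-level model, and a new token is assigned to the type whose
# model "compresses" it best (lowest cross-entropy in bits).
import math
from collections import defaultdict

class CharBigramModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, examples):
        for text in examples:
            padded = "^" + text.lower() + "$"
            for prev, cur in zip(padded, padded[1:]):
                self.counts[prev][cur] += 1

    def bits(self, text):
        """Approximate code length of text in bits under this model."""
        padded = "^" + text.lower() + "$"
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            ctx = self.counts[prev]
            # Add-one smoothing over a small assumed alphabet of 64 symbols.
            p = (ctx[cur] + 1) / (sum(ctx.values()) + 64)
            total += -math.log2(p)
        return total

def classify(token, models):
    """Pick the token type whose model needs the fewest bits."""
    return min(models, key=lambda label: models[label].bits(token))

if __name__ == "__main__":
    models = {"name": CharBigramModel(), "date": CharBigramModel()}
    models["name"].train(["Alice Smith", "John Brown", "Mary Jones"])
    models["date"].train(["12 May 1999", "3 June 2001", "28 Feb 1987"])
    print(classify("Peter White", models))    # expected: name
    print(classify("14 April 1995", models))  # expected: date
```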

    Acronym recognition and processing in 22 languages

    We present work on recognising acronyms of the form Long-Form (Short-Form), such as "International Monetary Fund (IMF)", in millions of news articles in twenty-two languages, as part of our more general effort to recognise entities and their variants in news text and to use them for the automatic analysis of the news, including the linking of related news across languages. We show how the acronym recognition patterns, initially developed for medical terms, needed to be adapted to the more general news domain, and we present evaluation results. We describe our effort to automatically merge the numerous long-form variants referring to the same short-form, while keeping non-related long-forms separate. Finally, we provide extensive statistics on the frequency and the distribution of short-form/long-form pairs across languages.
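    A minimal sketch of the Long-Form (Short-Form) pattern for English: a parenthesised upper-case token whose letters match the initials of the immediately preceding words. The regular expression and the initial-letter rule are simplified assumptions; the multilingual patterns described in the paper are considerably richer.

```python
# Sketch of "Long-Form (Short-Form)" matching, e.g.
# "International Monetary Fund (IMF)". The initial-letter rule is a
# simplified assumption, not the paper's actual pattern set.
import re

def find_pairs(sentence):
    pairs = []
    for m in re.finditer(r"\(([A-Z]{2,10})\)", sentence):
        short = m.group(1)
        words = sentence[:m.start()].split()
        # Weak rule: take the last len(short) words and check their initials.
        candidate = words[-len(short):]
        initials = "".join(w[0].upper() for w in candidate if w)
        if initials == short:
            pairs.append((" ".join(candidate), short))
    return pairs

if __name__ == "__main__":
    print(find_pairs("Loans from the International Monetary Fund (IMF) rose."))
    # [('International Monetary Fund', 'IMF')]
```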

    Identification of headers and footers in noisy documents

    Optical Character Recognition (OCR) technology is typically used to convert hard-copy printed material into its electronic form. Many presentational artifacts, such as end-of-line hyphenations and running headers and footers, are carried over literally in the conversion. These artifacts can hinder proximity and exact-match searching. This thesis develops an algorithm to extract running headers and footers from electronic documents generated by OCR. The method associates each page of the document with its neighboring pages and detects the headers and footers by comparing the page with its neighboring pages. Experiments are also conducted to test the effectiveness of these algorithms.
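    A minimal sketch of the neighbor-comparison idea: a line near the top or bottom of a page is flagged as a running header or footer when a sufficiently similar line occurs in the same position on an adjacent page. The similarity threshold, window size, and page representation are assumptions for illustration, not the thesis's exact algorithm.

```python
# Header/footer detection by comparing a page with its neighboring pages.
# Lines in the same top/bottom position that look alike across adjacent
# pages are flagged. Threshold and window are illustrative assumptions.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a, b).ratio() >= threshold

def detect_headers_footers(pages, window=2):
    """pages: list of pages, each a list of OCR text lines."""
    flags = []
    for i, page in enumerate(pages):
        neighbors = [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]
        headers, footers = set(), set()
        for k in range(min(window, len(page))):
            top, bottom = page[k], page[-1 - k]
            for other in neighbors:
                if k < len(other) and similar(top, other[k]):
                    headers.add(top)
                if k < len(other) and similar(bottom, other[-1 - k]):
                    footers.add(bottom)
        flags.append({"headers": headers, "footers": footers})
    return flags

if __name__ == "__main__":
    pages = [
        ["ACME Annual Report", "Intro text ...", "Page 1"],
        ["ACME Annual Report", "More text ...", "Page 2"],
    ]
    print(detect_headers_footers(pages))
```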

    Finding acronyms and their definitions using HMM

    In this thesis, we report on the design and implementation of a Hidden Markov Model (HMM) to extract acronyms and their expansions. We also report on the training of this HMM with the Maximum Likelihood Estimation (MLE) algorithm using a set of examples. Finally, we report on our testing using standard recall and precision. The HMM achieves a recall and precision of 98% and 92%, respectively.
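    A minimal sketch of the HMM framework in this setting: words are mapped to coarse observation symbols, transition and emission probabilities are estimated by Maximum Likelihood counting over tagged examples, and Viterbi decoding assigns each word a tag. The tag set (ACRO, DEF, OTHER), the featurization, and the smoothing below are hypothetical; the abstract does not specify the thesis's actual states or observations.

```python
# HMM tagging sketch with MLE-estimated parameters and Viterbi decoding.
# States and word features are hypothetical stand-ins for the thesis's model.
import math
from collections import defaultdict

STATES = ["OTHER", "DEF", "ACRO"]

def feature(word):
    """Map a word to a coarse observation symbol (assumed featurization)."""
    if word.isupper() and 2 <= len(word) <= 10:
        return "ALLCAPS"
    if word[:1].isupper():
        return "CAP"
    return "LOWER"

def mle_train(tagged_sentences):
    """Maximum Likelihood Estimation: count transitions and emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        prev = "START"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][feature(word)] += 1
            prev = tag
    return trans, emit

def log_prob(counts, key, smooth=1.0, vocab=4):
    """Add-one smoothed log probability over a small assumed vocabulary."""
    total = sum(counts.values())
    return math.log((counts[key] + smooth) / (total + smooth * vocab))

def viterbi(words, trans, emit):
    obs = [feature(w) for w in words]
    best = [{s: (log_prob(trans["START"], s) + log_prob(emit[s], obs[0]), [s])
             for s in STATES}]
    for o in obs[1:]:
        layer = {}
        for s in STATES:
            score, path = max(
                (best[-1][p][0] + log_prob(trans[p], s) + log_prob(emit[s], o),
                 best[-1][p][1]) for p in STATES)
            layer[s] = (score, path + [s])
        best.append(layer)
    return max(best[-1].values())[1]

if __name__ == "__main__":
    train = [[("The", "OTHER"), ("World", "DEF"), ("Health", "DEF"),
              ("Organization", "DEF"), ("WHO", "ACRO"), ("said", "OTHER")]]
    trans, emit = mle_train(train)
    words = "The Food Safety Authority FSA replied".split()
    print(list(zip(words, viterbi(words, trans, emit))))
```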

    Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications

    An elastic-degenerate (ED) string T is a sequence of n sets T[1], . . ., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as L(T) = {S1 · · · Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following:
    - There is no O((N1 N2)^(1−ε))-time algorithm, thus no O((N1 m2 + N2 m1)^(1−ε))-time algorithm and no O((N1 n2 + N2 n1)^(1−ε))-time algorithm, for any constant ε > 0, for EDSI, even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
    - There is no combinatorial O((N1 + N2)^(1.2−ε) · f(n1, n2))-time algorithm, for any constant ε > 0 and any function f, for EDSI, even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
    - An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete.
    - An O(N1 m2 + N2 m1)-time algorithm for EDSI.
    - An Õ(N1^(ω−1) n2 + N2^(ω−1) n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size.
    We also show that the techniques we develop have applications outside of ED string comparison.
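    A minimal sketch of the definitions only: an ED string as a list of string sets, L(T) as every concatenation that picks one string per set, and EDSI as a nonempty-intersection test. The brute-force enumeration below is exponential and serves purely to illustrate the problem statement; it is not one of the paper's algorithms.

```python
# Illustration of ED strings, their language L(T), and the EDSI question.
# Brute-force enumeration: exponential, for tiny examples only.
from itertools import product

def language(ed_string):
    """Enumerate L(T) = { S1...Sn : Si in T[i] } for a small ED string."""
    return {"".join(choice) for choice in product(*ed_string)}

def edsi(t1, t2):
    """Return True iff L(T1) and L(T2) have a nonempty intersection."""
    return bool(language(t1) & language(t2))

if __name__ == "__main__":
    # Each T[i] is a set of strings; the empty string is allowed as a choice.
    T1 = [{"A", "AC"}, {"G", ""}, {"T"}]
    T2 = [{"ACT", "AG"}, {"", "T"}]
    print(sorted(language(T1)))  # ['ACGT', 'ACT', 'AGT', 'AT']
    print(edsi(T1, T2))          # True: 'ACT' is in both languages
```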