29 research outputs found
A Supervised Learning Approach to Acronym Identification
This paper addresses the task of finding acronym-definition pairs in text. Most of the previous work on the topic is about systems that involve manually generated rules or regular expressions. In this paper, we present a
supervised learning approach to the acronym identification task. Our approach reduces the search space of the supervised learning system by putting some weak constraints on the kinds of acronym-definition pairs that can be identified. We obtain results comparable to hand-crafted systems that use stronger constraints. We describe our method for reducing the search space, the features
used by our supervised learning system, and our experiments with various learning schemes.
Using compression to identify acronyms in text
Text mining is about looking for patterns in natural language text, and may
be defined as the process of analyzing text to extract information from it for
particular purposes. In previous work, we claimed that compression is a key
technology for text mining, and backed this up with a study that showed how
particular kinds of lexical tokens---names, dates, locations, etc.---can be
identified and located in running text, using compression models to provide the
leverage necessary to distinguish different token types (Witten et al., 1999).
Comment: 10 pages. A short form published in DCC200
Acronym recognition and processing in 22 languages
We present work on recognising acronyms of the form Long-Form
(Short-Form) such as "International Monetary Fund (IMF)" in millions of news
articles in twenty-two languages, as part of our more general effort to
recognise entities and their variants in news text and to use them for the
automatic analysis of the news, including the linking of related news across
languages. We show how the acronym recognition patterns, initially developed
for medical terms, needed to be adapted to the more general news domain and we
present evaluation results. We describe our effort to automatically merge the
numerous long-form variants referring to the same short-form, while keeping
non-related long-forms separate. Finally, we provide extensive statistics on
the frequency and the distribution of short-form/long-form pairs across
languages.
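The Long-Form (Short-Form) pattern mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the system the authors describe: the function name `find_acronym_pairs`, the restriction to upper-case short forms of 2 to 10 letters, and the "initial letters must spell the short form" rule are all assumptions made for illustration, and they would miss many real cases (stop words inside long forms, non-Latin scripts, mixed-case acronyms).

```python
import re

# Hypothetical pattern: an upper-case short form of 2-10 letters in parentheses.
PAREN_SHORT_FORM = re.compile(r"\(([A-Z]{2,10})\)")

def find_acronym_pairs(text):
    """Return (long_form, short_form) pairs where the initials of the words
    immediately preceding '(SF)' spell out the short form SF."""
    pairs = []
    for match in PAREN_SHORT_FORM.finditer(text):
        short_form = match.group(1)
        # Words that appear before the opening parenthesis.
        words = text[:match.start()].split()
        # Take as many preceding words as the short form has letters.
        candidate = words[-len(short_form):]
        if len(candidate) < len(short_form):
            continue
        initials = "".join(word[0].upper() for word in candidate)
        if initials == short_form:
            pairs.append((" ".join(candidate), short_form))
    return pairs
```

Called on the abstract's own example, `find_acronym_pairs("the International Monetary Fund (IMF) said")` yields the pair `("International Monetary Fund", "IMF")`.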
Identification of headers and footers in noisy documents
Optical Recognition Technology is typically used to convert hard copy printed material into its electronic form. Many presentational artifacts, such as end-of-line hyphenations and running headers and footers, are converted literally. These artifacts can hinder proximity and exact-match searching. This thesis develops an algorithm to extract running headers and footers from electronic documents generated by OCR. The method associates each page of the document with its neighboring pages and detects headers and footers by comparing the page with those neighbors. Experiments are also conducted to test the effectiveness of the algorithm.
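The idea of comparing each page with its neighbors can be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's algorithm: it assumes each page is available as a list of text lines, looks only at the first and last line of a page, and treats two lines as equal once digit runs (such as page numbers) are masked. The names `normalize` and `detect_running_lines` are hypothetical.

```python
import re

def normalize(line):
    # Mask digit runs so that "Page 3" and "Page 4" compare as equal.
    return re.sub(r"\d+", "#", line.strip().lower())

def detect_running_lines(pages):
    """pages: list of pages, each a list of text lines.
    Returns (headers, footers): two lists of booleans, one entry per page,
    telling whether the page's first/last line also opens/closes a neighbor."""
    headers, footers = [], []
    for i, page in enumerate(pages):
        if not page:
            headers.append(False)
            footers.append(False)
            continue
        # Neighbors are the previous and next page, when present and non-empty.
        neighbors = [pages[j] for j in (i - 1, i + 1)
                     if 0 <= j < len(pages) and pages[j]]
        top, bottom = normalize(page[0]), normalize(page[-1])
        headers.append(any(normalize(n[0]) == top for n in neighbors))
        footers.append(any(normalize(n[-1]) == bottom for n in neighbors))
    return headers, footers
```

A real OCR pipeline would likely need fuzzy matching rather than exact equality, since recognition errors can change a character or two in an otherwise identical header.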
Increasing United States College Access for Native Arabic Speakers: Applying a Simplification Intervention and Evaluating Machine and Human Translations
Across many language backgrounds, a consistent hurdle to accessing United States higher education is understanding the basic information necessary to apply for admission and financial aid and complete the many enrollment management processes necessary to begin one's college career (apply for housing, receive and submit vaccinations, register for classes, etc.). However, to date, no studies have explored how this type of higher education information can be simplified and translated into Arabic, one of the most widely spoken languages in the world and a linguistic background shared by tens of thousands of prospective international students (and their families) seeking higher education in the United States. This case study reports on research-to-practice work conducted with the University of Iowa, specifically how the university simplified their enrollment management information and how that information was translated into Arabic for native Arabic speakers seeking access to the University of Iowa. Findings reveal that the institution simplified text to speak more directly to prospective student audiences by using second person pronouns and simpler sentence structure and diction to engage this audience. Moreover, analyses of machine and human translations of English to Arabic suggest that human translation should be the preferred mechanism of translating higher education information, as Google Translate and ChatGPT provided adequate but not perfect translations of Iowa's information. Implications for practice and college access are addressed.
Finding acronyms and their definitions using HMM
In this thesis, we report on the design and implementation of a Hidden Markov Model (HMM) to extract acronyms and their expansions. We also report on the training of this HMM with the Maximum Likelihood Estimation (MLE) algorithm using a set of examples.
Finally, we report on our testing using standard recall and precision. The HMM achieves a recall and precision of 98% and 92%, respectively.
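For reference, the standard recall and precision measures can be computed over extracted acronym-expansion pairs as in the sketch below. The function name `precision_recall` and the representation of pairs as tuples are assumptions for illustration; the thesis's evaluation setup is not described in the abstract.

```python
def precision_recall(predicted, gold):
    """predicted, gold: iterables of hashable items, e.g. (acronym, expansion)
    tuples. Returns (precision, recall) as floats in [0, 1]."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted pairs that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of correct pairs that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```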
Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications
An elastic-degenerate (ED) string T is a sequence of n sets T[1], ..., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality, and the size of T, respectively. The language of T is defined as L(T) = {S1 ··· Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely related DNA sequences, also known as a pangenome. The basic question we investigate here is: given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following:
- There is no O((N1 N2)^(1−ε))-time algorithm, thus no O((N1 m2 + N2 m1)^(1−ε))-time algorithm and no O((N1 n2 + N2 n1)^(1−ε))-time algorithm, for any constant ε > 0, for EDSI even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial O((N1 + N2)^(1.2−ε) f(n1, n2))-time algorithm, for any constant ε > 0 and any function f, for EDSI even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete.
- An O(N1 m2 + N2 m1)-time algorithm for EDSI.
- An Õ(N1^(ω−1) n2 + N2^(ω−1) n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size.
We also show that the techniques we develop have applications outside of ED string comparison.
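To make the definitions of L(T) and EDSI concrete, here is a brute-force Python sketch. It is not one of the algorithms from the abstract: it enumerates both languages explicitly, which takes time exponential in the length of the ED strings, and is meant only to illustrate what the problem asks. Representing an ED string as a list of sets of Python strings, and the names `language` and `languages_intersect`, are assumptions for illustration.

```python
from itertools import product

def language(ed_string):
    """ed_string: list of sets of strings, T[1], ..., T[n].
    Returns L(T): every concatenation S1...Sn with Si taken from T[i]."""
    return {"".join(parts) for parts in product(*ed_string)}

def languages_intersect(t1, t2):
    """Brute-force EDSI check: do L(t1) and L(t2) share at least one string?"""
    return not language(t1).isdisjoint(language(t2))
```

For example, with `t1 = [{"a", "ab"}, {"b", ""}]` the language is `{"a", "ab", "abb"}`, and with `t2 = [{"ab"}, {"b", "c"}]` it is `{"abb", "abc"}`, so the two languages intersect in `"abb"`.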