29 research outputs found

A Supervised Learning Approach to Acronym Identification

Authors: Nadeau, David; Turney, Peter
Publication venue: Springer
Publication date: 01/01/2005

This paper addresses the task of finding acronym-definition pairs in text. Most of the previous work on the topic is about systems that involve manually generated rules or regular expressions. In this paper, we present a supervised learning approach to the acronym identification task. Our approach reduces the search space of the supervised learning system by putting some weak constraints on the kinds of acronym-definition pairs that can be identified. We obtain results comparable to hand-crafted systems that use stronger constraints. We describe our method for reducing the search space, the features used by our supervised learning system, and our experiments with various learning schemes.

CiteSeerX

NRC Publications Archive

CogPrints Cognitive Sciences Eprint Archive

Using compression to identify acronyms in text

Authors: Bainbridge, David; Witten, Ian H.; Yeates, Stuart
Publication date: 01/01/2000

Text mining is about looking for patterns in natural language text, and may be defined as the process of analyzing text to extract information from it for particular purposes. In previous work, we claimed that compression is a key technology for text mining, and backed this up with a study that showed how particular kinds of lexical tokens (names, dates, locations, etc.) can be identified and located in running text, using compression models to provide the leverage necessary to distinguish different token types (Witten et al., 1999). Comment: 10 pages. A short form published in DCC200

arXiv.org e-Print Archive

CiteSeerX

Research Commons@Waikato

Acronym recognition and processing in 22 languages

Authors: della Rocca, Leonida; Ehrmann, Maud; Steinberger, Ralf; Tanev, Hristo
Publication date: 24/09/2013

We are presenting work on recognising acronyms of the form Long-Form (Short-Form) such as "International Monetary Fund (IMF)" in millions of news articles in twenty-two languages, as part of our more general effort to recognise entities and their variants in news text and to use them for the automatic analysis of the news, including the linking of related news across languages. We show how the acronym recognition patterns, initially developed for medical terms, needed to be adapted to the more general news domain and we present evaluation results. We describe our effort to automatically merge the numerous long-form variants referring to the same short-form, while keeping non-related long-forms separate. Finally, we provide extensive statistics on the frequency and the distribution of short-form/long-form pairs across languages.
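
As an illustration of the Long-Form (Short-Form) pattern mentioned in this abstract, here is a minimal Python sketch. It is not the authors' system: the function name `find_acronym_pairs`, the restriction to uppercase short forms of 2 to 10 letters, and the "one initial per preceding word" rule are all assumptions made for this example.

```python
import re

def find_acronym_pairs(text):
    """Return (long_form, short_form) pairs for 'Long Form (LF)' patterns.

    Assumption of this sketch: a short form is accepted only when each of
    its letters equals the initial of one of the words directly before
    the opening parenthesis, in order.
    """
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,10})\)", text):
        short = match.group(1)
        # Words preceding the parenthesis; take as many as the short form has letters.
        words = text[:match.start()].split()
        candidate = words[-len(short):]
        if len(candidate) == len(short) and all(
            word[0].upper() == letter for word, letter in zip(candidate, short)
        ):
            pairs.append((" ".join(candidate), short))
    return pairs

print(find_acronym_pairs("Loans from the International Monetary Fund (IMF) rose."))
```

A rule this strict misses long forms containing function words ("Bank of England (BoE)") and would need language-specific adaptation, which is the kind of adjustment the abstract describes.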

arXiv.org e-Print Archive

CiteSeerX

Identification of headers and footers in noisy documents

Author: Liu Qin
Publication venue: Digital Scholarship@UNLV
Publication date: 01/01/2003

Optical Recognition Technology is typically used to convert hard copy printed material into its electronic form. Many presentational artifacts such as end-of-line hyphenations, running headers and footers are literally converted. These artifacts can hinder proximity and exact match searching. This thesis develops an algorithm to extract running headers and footers from electronic documents generated by OCR. This method associates each page of the document with its neighboring pages and detects the headers and footers by comparing the page with its neighboring pages. Experiments are also conducted to test the effectiveness of the algorithm.
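
The neighbor-comparison idea in this abstract can be sketched as follows. This is a hypothetical simplification, not the thesis algorithm: it looks only at the first and last line of each page, and the digit-collapsing `normalize` helper (so that "Page 3" matches "Page 4") is an assumption of this example.

```python
import re

def normalize(line):
    # Assumption: collapse digit runs so page numbers do not block a match.
    return re.sub(r"\d+", "#", line.strip().lower())

def detect_running_lines(pages):
    """pages: list of pages, each a list of text lines.

    Returns one (header, footer) tuple per page; an entry is None when the
    page's first or last line matches no neighboring page.
    """
    results = []
    for i, page in enumerate(pages):
        # Neighbors are the previous and next page, where they exist.
        neighbors = [pages[j] for j in (i - 1, i + 1) if 0 <= j < len(pages)]
        header = footer = None
        if page:
            if any(n and normalize(n[0]) == normalize(page[0]) for n in neighbors):
                header = page[0]
            if any(n and normalize(n[-1]) == normalize(page[-1]) for n in neighbors):
                footer = page[-1]
        results.append((header, footer))
    return results

pages = [
    ["Annual Report", "body text a", "Page 1"],
    ["Annual Report", "body text b", "Page 2"],
]
print(detect_running_lines(pages))
```

Real OCR output is noisy, so an exact match after normalization would likely need to be replaced by an approximate string comparison.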

University of Nevada, Las Vegas Repository

Increasing United States College Access for Native Arabic Speakers: Applying a Simplification Intervention and Evaluating Machine and Human Translations

Authors: Babekir, Tahagod; McCartt, Brett; Taylor, Zachary W.
Publication venue: Texas Education Review
Publication date: 01/01/2024

Across many language backgrounds, a consistent hurdle to accessing United States higher education is understanding the basic information necessary to apply for admission and financial aid and complete the many enrollment management processes necessary to begin one’s college career (apply for housing, receive and submit vaccinations, register for classes, etc.). However, to date, no studies have explored how this type of higher education information can be simplified and translated into Arabic, one of the most widely spoken languages in the world and a linguistic background shared by tens of thousands of prospective international students (and their families) seeking higher education in the United States. This case study reports on research-to-practice work conducted with the University of Iowa, specifically how the university simplified their enrollment management information and how that information was translated into Arabic for native Arabic speakers seeking access to the University of Iowa. Findings reveal that the institution simplified text to speak more directly to prospective student audiences by using second person pronouns and simpler sentence structure and diction to engage this audience. Moreover, analyses of machine and human translations of English to Arabic suggest that human translation should be the preferred mechanism of translating higher education information, as Google Translate and ChatGPT provided adequate but not perfect translations of Iowa’s information. Implications for practice and college access are addressed.

Texas ScholarWorks

Finding acronyms and their definitions using HMM

Author: Vyas Lakshmi
Publication venue: Digital Scholarship@UNLV
Publication date: 01/05/2011

In this thesis, we report on the design and implementation of a Hidden Markov Model (HMM) to extract acronyms and their expansions. We also report on the training of this HMM with the Maximum Likelihood Estimation (MLE) algorithm using a set of examples. Finally, we report on our testing using standard recall and precision. The HMM achieves a recall and precision of 98% and 92%, respectively.
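
For readers unfamiliar with the two measures reported here, the following Python sketch computes standard precision and recall over sets of extracted items. The function name and the sample pairs are hypothetical and are not taken from the thesis.

```python
def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical acronym-expansion pairs, for illustration only.
gold = {("IMF", "International Monetary Fund"), ("UN", "United Nations")}
predicted = gold | {("EU", "Extra Unit"), ("IT", "Inner Text")}
print(precision_recall(predicted, gold))
```

In this made-up example all gold pairs are found, but half of the predictions are wrong, so recall is higher than precision.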

University of Nevada, Las Vegas Repository

Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications

Authors: Gabory, Esteban; Mwaniki, Moses Njagi; Pisanti, Nadia; Pissis, Solon P.; Radoszewski, Jakub; Sweering, Michelle; Zuba, Wiktor
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Publication date: 01/01/2023

An elastic-degenerate (ED) string $T$ is a sequence of $n$ sets $T[1], \ldots, T[n]$ containing $m$ strings in total whose cumulative length is $N$. We call $n$, $m$, and $N$ the length, the cardinality and the size of $T$, respectively. The language of $T$ is defined as $L(T) = \{S_1 \cdots S_n : S_i \in T[i] \text{ for all } i \in [1, n]\}$. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings $T_1$ and $T_2$ of lengths $n_1$ and $n_2$, cardinalities $m_1$ and $m_2$, and sizes $N_1$ and $N_2$, respectively, we show the following:

- There is no $O((N_1 N_2)^{1-\epsilon})$-time algorithm, thus no $O((N_1 m_2 + N_2 m_1)^{1-\epsilon})$-time algorithm and no $O((N_1 n_2 + N_2 n_1)^{1-\epsilon})$-time algorithm, for any constant $\epsilon > 0$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false.
- There is no combinatorial $O((N_1 + N_2)^{1.2-\epsilon} f(n_1, n_2))$-time algorithm, for any constant $\epsilon > 0$ and any function $f$, for EDSI even when $T_1$ and $T_2$ are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false.
- An $O(N_1 \log N_1 \log n_1 + N_2 \log N_2 \log n_2)$-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when $T_1$ and $T_2$ are given in a compact representation, we show that the problem is NP-complete.
- An $O(N_1 m_2 + N_2 m_1)$-time algorithm for EDSI.
- An $\tilde{O}(N_1^{\omega-1} n_2 + N_2^{\omega-1} n_1)$-time algorithm for EDSI, where $\omega$ is the exponent of matrix multiplication; the $\tilde{O}$ notation suppresses factors that are polylogarithmic in the input size.

We also show that the techniques we develop have applications outside of ED string comparison.
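
To make the definition of the language L(T) and the EDSI question concrete, here is a brute-force Python sketch. It is not one of the algorithms from the paper: it enumerates every string in each language, which takes exponential time in general, and the function names and the representation of an ED string as a list of Python sets are assumptions of this example.

```python
from itertools import product

def language(ed_string):
    """All strings formed by picking one string from each set, in order."""
    return {"".join(choice) for choice in product(*ed_string)}

def languages_intersect(t1, t2):
    """Brute-force EDSI check: do the two languages share any string?"""
    return not language(t1).isdisjoint(language(t2))

# Two small ED strings; the empty string models an optional segment.
t1 = [{"a", "ab"}, {"c", "bc"}]
t2 = [{"ab"}, {"b", ""}, {"c"}]
print(sorted(language(t1)))
print(languages_intersect(t1, t2))
```

The enumeration is only practical for tiny inputs, which is why the running-time bounds in the abstract matter.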

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server