9,890 research outputs found
Using compression to identify acronyms in text
Text mining is about looking for patterns in natural language text, and may
be defined as the process of analyzing text to extract information from it for
particular purposes. In previous work, we claimed that compression is a key
technology for text mining, and backed this up with a study that showed how
particular kinds of lexical tokens---names, dates, locations, etc.---can be
identified and located in running text, using compression models to provide the
leverage necessary to distinguish different token types (Witten et al., 1999).
Comment: 10 pages. A short form published in DCC200
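The token-typing idea in this abstract, scoring a token under per-type adaptive models and assigning it to the model that encodes it most cheaply, can be illustrated with simple character bigram models standing in for the paper's compression models. The `train_bigrams` and `classify` helpers and the toy training data below are assumptions for illustration, not the paper's implementation:

```python
import math
from collections import defaultdict

def train_bigrams(examples):
    """Character-bigram counts over ^-padded, $-terminated examples."""
    counts, totals = defaultdict(int), defaultdict(int)
    for s in examples:
        padded = "^" + s + "$"
        for a, b in zip(padded, padded[1:]):
            counts[(a, b)] += 1
            totals[a] += 1
    return counts, totals

def code_length(s, counts, totals, vocab_size):
    """Bits needed to encode s under an add-one-smoothed bigram model;
    a shorter code means the model 'compresses' the token better."""
    padded = "^" + s + "$"
    bits = 0.0
    for a, b in zip(padded, padded[1:]):
        p = (counts[(a, b)] + 1) / (totals[a] + vocab_size)
        bits -= math.log2(p)
    return bits

def classify(token, models, vocab_size):
    """Assign the token to the type whose model gives the shortest code."""
    return min(models, key=lambda t: code_length(token, *models[t], vocab_size))

# Hypothetical toy training data; the paper trains compression models on real corpora.
train = {"date": ["1999", "2001", "1987", "2010"],
         "name": ["Witten", "Smith", "Jones", "Brown"]}
vocab = {c for xs in train.values() for s in xs for c in "^" + s + "$"}
models = {t: train_bigrams(xs) for t, xs in train.items()}

print(classify("1995", models, len(vocab) + 1))   # date
print(classify("Davis", models, len(vocab) + 1))  # name
```

Sharing one vocabulary size across models keeps the smoothing penalty comparable, so the comparison reflects model fit rather than alphabet size.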
Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
Acronym-Meaning Extraction from Corpora Using Multi-Tape Weighted Finite-State Machines
The automatic extraction of acronyms and their meaning from corpora is an
important sub-task of text mining. It can be seen as a special case of string
alignment, where a text chunk is aligned with an acronym. Alternative
alignments have different cost, and ideally the least costly one should give
the correct meaning of the acronym. We show how this approach can be
implemented by means of a 3-tape weighted finite-state machine (3-WFSM) which
reads a text chunk on tape 1 and an acronym on tape 2, and generates all
alternative alignments on tape 3. The 3-WFSM can be automatically generated
from a simple regular expression. No additional algorithms are required at any
stage. Our 3-WFSM has a size of 27 states and 64 transitions, and finds the
best analysis of an acronym in a few milliseconds.
Comment: 6 pages, LaTeX
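The least-cost alignment idea can also be illustrated without finite-state machinery. The dynamic-programming sketch below is a stand-in for the paper's 3-WFSM, which is compiled from a regular expression; the `acronym_meaning` helper and its unit costs are illustrative assumptions. Each acronym letter is aligned with a word-initial character, and words skipped inside the expansion are penalized:

```python
def acronym_meaning(chunk, acronym, skip_cost=1):
    """Least-cost alignment of an acronym against a text chunk (a DP
    stand-in for the 3-tape WFSM; costs here are illustrative)."""
    words, acro = chunk.split(), acronym.lower()
    INF = float("inf")
    n, m = len(words), len(acro)
    # cost[i][j]: cheapest alignment of the first j letters to the first i words
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    prev = {}
    for i in range(n):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            # skip word i: free outside the expansion, penalized inside it
            c = cost[i][j] + (skip_cost if 0 < j < m else 0)
            if c < cost[i + 1][j]:
                cost[i + 1][j], prev[(i + 1, j)] = c, (i, j)
            # align acronym letter j with the first letter of word i
            if j < m and words[i].lower().startswith(acro[j]):
                if cost[i][j] < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = cost[i][j]
                    prev[(i + 1, j + 1)] = (i, j)
    if cost[n][m] == INF:
        return None
    # trace back and collect the words that absorbed acronym letters
    i, j, used = n, m, []
    while (i, j) in prev:
        pi, pj = prev[(i, j)]
        if pj < j:
            used.append(words[pi])
        i, j = pi, pj
    return " ".join(reversed(used))

print(acronym_meaning("we train a hidden Markov model on text", "HMM"))
# hidden Markov model
```

Unlike the 3-WFSM, this sketch returns only the single best alignment rather than enumerating all alternatives on a third tape, but the cost-minimization principle is the same.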
Power to the people: end-user building of digital library collections
Naturally, digital library systems focus principally on the reader: the consumer of the material that constitutes the library. In contrast, this paper describes an interface that makes it easy for people to build their own library collections. Collections may be built and served locally from the user's own web server, or (given appropriate permissions) remotely on a shared digital library host. End users can easily build new collections styled after existing ones from material on the Web, from their local files, or both, and collections can be updated and new ones brought on-line at any time. The interface, which is intended for non-professional end users, is modeled after widely used commercial software installation packages. Lest one quail at the prospect of end users building their own collections on a shared system, we also describe an interface for the administrative user who is responsible for maintaining a digital library installation.
Acronyms as an integral part of multi-word term recognition - A token of appreciation
Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain-specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognizing multi-word terms in a domain-specific corpus. It uses a range of methods to normalize three types of term variation: orthographic, morphological, and syntactic. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval, one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas the index compression factor increased by 7 percentage points. The evidence therefore suggests that integrating acronyms provides a non-trivial improvement in term conflation.
Named Entity Recognition and Text Compression
In recent years, social networks have become very popular, and it is easy for users to share their data on them. Since social network data are idiomatic, irregular, and brief, and include acronyms and spelling errors, they are more challenging to process than news or other formal texts. Given the huge volume of posts each day, effective extraction and processing of these data will bring great benefit to information extraction applications.
This thesis proposes a method to normalize Vietnamese informal text in social
networks. This method has the ability to identify and normalize informal text
based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram
model. After normalization, the data will be processed by a named entity
recognition (NER) model to identify and classify the named entities in these data.
In our NER model, we use six different types of features to recognize named entities
categorized in three predefined classes: Person (PER), Location (LOC), and
Organization (ORG).
When examining social network data, we found that these data are very large and grow daily, which raises the challenge of how to reduce their size. Because the data to be normalized are large, the trigram dictionary is also quite big, so its size needs to be reduced as well. To deal with this challenge, this thesis proposes three methods for compressing text files, especially Vietnamese text. The first method is a syllable-based method relying on the structure of Vietnamese morphosyllables, consonants, syllables, and vowels. The second method is trigram-based Vietnamese text compression using a trigram dictionary. The
last method is based on an n-gram sliding window, in which we use five dictionaries for unigrams, bigrams, trigrams, four-grams, and five-grams. This method achieves
a promising compression ratio of around 90% and can be used for text files of any size.
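The five-dictionary sliding-window scheme can be illustrated as greedy longest-match substitution. The `compress`/`decompress` helpers and toy dictionaries below are assumptions for illustration; the thesis's actual code assignment and bit-level encoding are not specified in this abstract:

```python
def compress(text, dictionaries):
    """Greedy longest-match n-gram substitution over a five-level
    dictionary (unigrams .. five-grams). A minimal sketch only."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        for n in range(5, 0, -1):          # prefer the longest match
            gram = " ".join(words[i:i + n])
            if i + n <= len(words) and gram in dictionaries[n]:
                out.append((n, dictionaries[n][gram]))   # (dictionary, code)
                i += n
                break
        else:
            out.append((0, words[i]))      # literal fallback for unknown words
            i += 1
    return out

def decompress(codes, dictionaries):
    """Invert the dictionaries and rebuild the original text."""
    inverse = {n: {v: k for k, v in d.items()} for n, d in dictionaries.items()}
    return " ".join(tok if n == 0 else inverse[n][tok] for n, tok in codes)

# Hypothetical toy dictionaries; real ones would be mined from a corpus.
dicts = {n: {} for n in range(1, 6)}
dicts[3]["thank you very"] = 0
dicts[1]["much"] = 0
msg = "thank you very much indeed"
codes = compress(msg, dicts)
print(codes)                      # [(3, 0), (1, 0), (0, 'indeed')]
print(decompress(codes, dicts))   # thank you very much indeed
```

The compression gain comes from replacing multi-word grams with short dictionary codes; the reported ratio of around 90% would depend on dictionary coverage and on how compactly the codes are serialized.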
Neural networks application to divergence-based passive ranging
The purpose of this report is to summarize the state of knowledge and outline the planned work on a divergence-based, neural-network approach to the problem of passive ranging derived from optical flow. Work in this and closely related areas is reviewed to provide the necessary background for further developments. New ideas about devising a monocular passive-ranging system are then introduced. It is shown that image-plane divergence is independent of image-plane location with respect to the focus of expansion and of camera maneuvers, because it directly measures the object's expansion, which, in turn, is related to the time-to-collision. Thus, a divergence-based method has the potential to provide a reliable range estimate, complementing other monocular passive-ranging methods, which encounter difficulties in image areas close to the focus of expansion. Image-plane divergence can be thought of as a spatial/temporal pattern. A neural network realization was chosen for this task because neural networks have generally performed well in various other pattern recognition applications. The main goal of this work is to teach a neural network to derive the divergence from the imagery.
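The stated relation between image-plane divergence and time-to-collision can be made concrete. For pure translational approach along the optical axis toward a point at depth \(Z(t)\) (a standard optical-flow result, not derived in this abstract), the image velocity field is \(\mathbf{v} = -(\dot{Z}/Z)\,(x, y)\), so

```latex
\operatorname{div}\mathbf{v}
  \;=\; \frac{\partial v_x}{\partial x} + \frac{\partial v_y}{\partial y}
  \;=\; -\frac{2\dot{Z}}{Z}
  \;=\; \frac{2}{\tau},
\qquad \tau \;=\; -\frac{Z}{\dot{Z}},
```

where \(\tau\) is the time-to-collision. The right-hand side contains no image coordinates, which, for this translational case, is the independence from image-plane location that the report exploits.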
MedTxting: learning based and knowledge rich SMS-style medical text contraction
In mobile health (M-health), Short Message Service (SMS) has shown to improve disease related self-management and health service outcomes, leading to enhanced patient care. However, the hard limit on character size for each message limits the full value of exploring SMS communication in health care practices. To overcome this problem and improve the efficiency of clinical workflow, we developed an innovative system, MedTxting (available at http://medtxting.askhermes.org), which is a learning-based but knowledge-rich system that compresses medical texts in a SMS style. Evaluations on clinical questions and discharge summary narratives show that MedTxting can effectively compress medical texts with reasonable readability and noticeable size reduction. Findings in this work reveal potentials of MedTxting to the clinical settings, allowing for real-time and cost-effective communication, such as patient condition reporting, medication consulting, physicians connecting to share expertise to improve point of care