Acronym-Meaning Extraction from Corpora Using Multi-Tape Weighted Finite-State Machines
The automatic extraction of acronyms and their meaning from corpora is an
important sub-task of text mining. It can be seen as a special case of string
alignment, where a text chunk is aligned with an acronym. Alternative
alignments have different costs, and ideally the least costly one should give
the correct meaning of the acronym. We show how this approach can be
implemented by means of a 3-tape weighted finite-state machine (3-WFSM) which
reads a text chunk on tape 1 and an acronym on tape 2, and generates all
alternative alignments on tape 3. The 3-WFSM can be automatically generated
from a simple regular expression. No additional algorithms are required at any
stage. Our 3-WFSM has a size of 27 states and 64 transitions, and finds the
best analysis of an acronym in a few milliseconds.
Comment: 6 pages, LaTe
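The alignment view described in this abstract can be illustrated with a small sketch. This is not the 3-WFSM from the paper: it is a plain dynamic-programming stand-in, and the function name `align`, the constant `SKIP_COST`, and the cost scheme (a word's initial letter matches for free, a skipped word costs 1) are assumptions made purely for illustration.

```python
from functools import lru_cache

SKIP_COST = 1.0        # hypothetical cost of a word that contributes no letter
INF = float("inf")


def align(words, acronym):
    """Return (cost, tags) for the cheapest alignment, or (inf, None) if none exists.

    tags holds one entry per word: the acronym letter the word supplies,
    or "-" when the word is skipped.
    """
    acro = acronym.lower()

    @lru_cache(maxsize=None)
    def best(i, j):
        # i: index of the next word, j: index of the next acronym letter
        if i == len(words):
            return (0.0, ()) if j == len(acro) else (INF, None)
        options = []
        # Option 1: skip word i and pay the skip cost.
        cost, tags = best(i + 1, j)
        if tags is not None:
            options.append((cost + SKIP_COST, ("-",) + tags))
        # Option 2: word i supplies acronym letter j via its initial letter.
        if j < len(acro) and words[i][:1].lower() == acro[j]:
            cost, tags = best(i + 1, j + 1)
            if tags is not None:
                options.append((cost, (acro[j],) + tags))
        if not options:
            return (INF, None)
        return min(options, key=lambda option: option[0])

    return best(0, 0)


cost, tags = align("the weighted finite state machine".split(), "WFSM")
print(cost, tags)
```

As in the abstract, every complete alignment has a cost and the cheapest one is taken as the analysis. A realistic cost model would also need to handle letters drawn from inside a word, which this sketch leaves out.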
Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future
In this paper, we present a meta-analysis of several Web content extraction
algorithms, and make recommendations for the future of content extraction on
the Web. First, we find that nearly all Web content extractors fail to consider
a very large and growing portion of modern Web pages. Second, it is well
understood that wrapper induction extractors tend to break as the Web changes;
heuristic/feature engineering extractors were thought to be immune to a Web
site's evolution, but we find that this is not the case: heuristic content
extractor performance also tends to degrade over time due to the evolution of
Web site forms and practices. We conclude with recommendations for future work
that address these and other findings.
Comment: Accepted for publication in SIGKDD Exploration
Comparing Fifty Natural Languages and Twelve Genetic Languages Using Word Embedding Language Divergence (WELD) as a Quantitative Measure of Language Distance
We introduce a new measure of distance between languages based on word
embedding, called word embedding language divergence (WELD). WELD is defined as
the divergence between the unified similarity distributions of words across languages.
Using such a measure, we perform language comparison for fifty natural
languages and twelve genetic languages. Our natural language dataset is a
collection of sentence-aligned parallel corpora from Bible translations for
fifty languages spanning a variety of language families. Although we use
parallel corpora, which guarantee the same content in all languages, in many
cases languages within the same family still cluster together.
In addition to natural languages, we perform language comparison for the coding
regions in the genomes of 12 different organisms (4 plants, 6 animals, and 2
human subjects). Our result confirms a significant high-level difference in the
genetic language model of humans/animals versus plants. The proposed method is
a step toward defining a quantitative measure of similarity between languages,
with applications in language classification, genre identification, dialect
identification, and evaluation of translations.
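To make the idea of a divergence between two similarity distributions concrete, here is a minimal sketch. The abstract does not say which divergence WELD uses, so Jensen-Shannon divergence is assumed here purely for illustration. The function name `js_divergence` and the toy distributions are hypothetical and are not taken from the paper.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    p and q are equal-length sequences of probabilities that each sum to 1.
    The result lies in [0, 1]: 0 for identical distributions, 1 for disjoint ones.
    """
    # Mixture distribution halfway between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; terms with zero probability contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy stand-ins for the similarity distributions of two languages (made-up numbers).
lang_a = [0.5, 0.3, 0.2]
lang_b = [0.4, 0.4, 0.2]
print(js_divergence(lang_a, lang_b))
```

A symmetric, bounded measure like this is convenient for clustering languages, since the value for a pair does not depend on the order of the two languages.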