Learning Parse and Translation Decisions From Examples With Rich Context
We present a knowledge and context-based system for parsing and translating
natural language and evaluate it on sentences from the Wall Street Journal.
Applying machine learning techniques, the system uses parse action examples
acquired under supervision to generate a deterministic shift-reduce parser in
the form of a decision structure. It relies heavily on context, as encoded in
features which describe the morphological, syntactic, semantic and other
aspects of a given parse state. Comment: 8 pages, LaTeX, 3 postscript figures, uses aclap.st
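To make the idea of a deterministic shift-reduce parser concrete, here is a minimal sketch in Python. It is not the system from the abstract: the `decide` function is a hand-written stand-in for the learned decision structure, it looks only at the labels of the top two stack items rather than at rich morphological, syntactic and semantic features, and the toy grammar in `RULES` is a hypothetical illustration.

```python
# Hypothetical toy grammar: labels of the top two stack items -> new label.
RULES = {
    ("Det", "N"): "NP",
    ("V", "NP"): "VP",
    ("NP", "VP"): "S",
}


def decide(stack, buffer):
    """Stand-in for a learned decision structure.

    Maps the current parse state (stack, buffer) to exactly one action,
    which is what makes the parser deterministic.
    """
    top_two = tuple(node[0] for node in stack[-2:])
    if len(stack) >= 2 and top_two in RULES:
        return "reduce"
    if buffer:
        return "shift"
    return "done"


def parse(tagged_words):
    """Parse a list of (tag, word) pairs; returns the final stack of trees."""
    stack = []
    buffer = list(tagged_words)
    while True:
        action = decide(stack, buffer)
        if action == "shift":
            tag, word = buffer.pop(0)
            stack.append((tag, [word]))
        elif action == "reduce":
            children = stack[-2:]
            del stack[-2:]
            label = RULES[tuple(child[0] for child in children)]
            stack.append((label, children))
        else:
            return stack


sentence = [("Det", "the"), ("N", "dog"), ("V", "chased"),
            ("Det", "a"), ("N", "cat")]
trees = parse(sentence)
print(trees[0][0])  # label of the single remaining tree
```

In a learned variant, `decide` would be replaced by a classifier trained on supervised parse-action examples, with the parse state encoded as a feature vector.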
Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
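One of the listed applications, text categorization, can be sketched with an off-the-shelf compressor: a document is assigned to the category whose training text helps compress it most. The sketch below uses Python's `zlib` (DEFLATE) as a stand-in for the adaptive compression models the abstract refers to, and the two tiny corpora are hypothetical illustrations, so it shows the principle only.

```python
import zlib


def compressed_size(text):
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus, document):
    """Additional compressed bytes needed to encode document after corpus."""
    return compressed_size(corpus + " " + document) - compressed_size(corpus)


def categorize(document, corpora):
    """Return the category whose corpus yields the smallest extra cost."""
    return min(corpora, key=lambda name: extra_bytes(corpora[name], document))


# Hypothetical miniature training corpora, one string per category.
CORPORA = {
    "finance": ("shares fell on the stock market as investors sold "
                "bank stocks and bond yields rose sharply"),
    "sport": ("the striker scored a late goal and the home team won "
              "the football match in extra time"),
}

print(categorize("investors sold bank stocks", CORPORA))
print(categorize("the home team won the football match", CORPORA))
```

With realistic corpora the same comparison applies, but the quality of the result depends on the compressor's model; a general-purpose compressor with a limited window is only a rough proxy for a model trained per category.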
Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
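As a check on the reported verification figures, the recall follows directly from the counts in the abstract: 35 of 37 wrapper changes were detected, so 2 were missed. The sketch below uses the standard definitions of precision and recall. The abstract does not break the 16 mistakes down into false positives and false negatives, so precision is defined but not recomputed here.

```python
def recall(true_positives, false_negatives):
    """Fraction of actual changes that were detected."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Fraction of reported changes that were real."""
    return true_positives / (true_positives + false_positives)


detected_changes = 35   # from the abstract
total_changes = 37      # from the abstract
verification_recall = recall(detected_changes,
                             total_changes - detected_changes)
print(round(verification_recall, 2))  # 0.95
```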
PonyGE2: Grammatical Evolution in Python
Grammatical Evolution (GE) is a population-based evolutionary algorithm,
where a formal grammar is used in the genotype to phenotype mapping process.
PonyGE2 is an open source implementation of GE in Python, developed at UCD's
Natural Computing Research and Applications group. It is intended as an
advertisement and a starting-point for those new to GE, a reference for
students and researchers, a rapid-prototyping medium for our own experiments,
and a Python workout. As well as providing the characteristic genotype to
phenotype mapping of GE, a search algorithm engine is also provided. A number
of sample problems and tutorials on how to use and adapt PonyGE2 have been
developed. Comment: 8 pages, 4 figures, submitted to the 2017 GECCO Workshop on
Evolutionary Computation Software Systems (EvoSoft
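The genotype to phenotype mapping mentioned in the abstract can be illustrated with the textbook GE rule: each integer codon selects a production for the leftmost non-terminal by taking the codon modulo the number of available productions. The sketch below is a standalone illustration and does not use or reproduce PonyGE2's own API. The grammar, the function name `map_genome` and the `max_wraps` parameter are hypothetical choices made for this example.

```python
# Hypothetical toy grammar: each non-terminal maps to a list of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}


def map_genome(genome, grammar, start="<expr>", max_wraps=1):
    """Map a list of integer codons to a phenotype string.

    Returns None if the codons (including wrapped reuse of the genome)
    run out before the derivation is complete.
    """
    symbols = [start]          # pending symbols, leftmost first
    output = []
    used = 0                   # number of codons consumed so far
    limit = len(genome) * (max_wraps + 1)
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in grammar:
            output.append(symbol)      # terminal: emit it
            continue
        choices = grammar[symbol]
        if len(choices) == 1:
            chosen = choices[0]        # no decision, no codon consumed
        else:
            if used >= limit:
                return None            # invalid individual
            codon = genome[used % len(genome)]
            chosen = choices[codon % len(choices)]
            used += 1
        symbols = list(chosen) + symbols
    return " ".join(output)


print(map_genome([10, 7, 4, 3, 9, 5], GRAMMAR))  # x * y
```

Here the codons 10, 7, 4, 3, 9, 5 reduce modulo 2 to the production choices 0, 1, 0, 1, 1, 1, which derive `x * y`. A genome that keeps choosing the recursive `<expr>` production exhausts its codons and maps to `None`.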