186 research outputs found
Using Decision Trees for Coreference Resolution
This paper describes RESOLVE, a system that uses decision trees to learn how
to classify coreferent phrases in the domain of business joint ventures. An
experiment is presented in which the performance of RESOLVE is compared to the
performance of a manually engineered set of rules for the same task. The
results show that decision trees achieve higher performance than the rules in
two of three evaluation metrics developed for the coreference task. In addition
to achieving better performance than the rules, RESOLVE provides a framework
that facilitates the exploration of the types of knowledge that are useful for
solving the coreference problem.

Comment: 6 pages; LaTeX source; 1 uuencoded compressed EPS file (separate);
uses ijcai95.sty, named.bst, epsf.tex; to appear in Proc. IJCAI '95
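A minimal sketch of the mention-pair idea behind this approach: encode each pair of phrases as boolean features and induce a decision tree that predicts whether the pair is coreferent. The feature names, the greedy ID3-style learner, and the joint-venture training pairs below are illustrative assumptions, not RESOLVE's actual feature set or algorithm.

```python
import math

def features(a, b):
    al, bl = a.lower(), b.lower()
    return {
        "same_first": al.split()[0] == bl.split()[0],   # same first token
        "same_head": al.split()[-1] == bl.split()[-1],  # same head noun
        "substring": al in bl or bl in al,              # one phrase inside the other
    }

def entropy(labels):
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def grow(rows, labels, feats):
    if len(set(labels)) <= 1 or not feats:
        return sum(labels) >= len(labels) / 2           # majority-class leaf
    def gain(f):
        yes = [l for r, l in zip(rows, labels) if r[f]]
        no = [l for r, l in zip(rows, labels) if not r[f]]
        g = entropy(labels)
        if yes:
            g -= len(yes) / len(labels) * entropy(yes)
        if no:
            g -= len(no) / len(labels) * entropy(no)
        return g
    best = max(feats, key=gain)
    yes = [(r, l) for r, l in zip(rows, labels) if r[best]]
    no = [(r, l) for r, l in zip(rows, labels) if not r[best]]
    if not yes or not no:
        return sum(labels) >= len(labels) / 2
    rest = [f for f in feats if f != best]
    return (best,
            grow([r for r, _ in yes], [l for _, l in yes], rest),
            grow([r for r, _ in no], [l for _, l in no], rest))

def classify(tree, row):
    while isinstance(tree, tuple):
        feat, yes_branch, no_branch = tree
        tree = yes_branch if row[feat] else no_branch
    return tree

# Hypothetical coreference-labeled phrase pairs from the joint-venture domain.
TRAIN = [
    ("the joint venture", "the venture", True),
    ("IBM Corp.", "IBM", True),
    ("the company", "the agreement", False),
    ("Toyota Motor Corp.", "the carmaker", False),
]
rows = [features(a, b) for a, b, _ in TRAIN]
labels = [y for _, _, y in TRAIN]
tree = grow(rows, labels, sorted(rows[0]))
```

One appeal of the decision-tree framing that the abstract mentions is visible even here: swapping a feature in or out of `features` is all it takes to explore which knowledge sources help.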
Information extraction
In this paper we present a new approach to extracting relevant information from natural language text using knowledge graphs. We give a multi-level model based on knowledge graphs for describing template information, and investigate the concept of partial structural parsing. Moreover, we point out that expansion of concepts plays an important role in thinking, so we study the expansion of knowledge graphs to use context information for reasoning and for merging templates.
Building a Generation Knowledge Source using Internet-Accessible Newswire
In this paper, we describe a method for automatic creation of a knowledge
source for text generation using information extraction over the Internet. We
present a prototype system called PROFILE which uses a client-server
architecture to extract noun-phrase descriptions of entities such as people,
places, and organizations. The system serves two purposes: as an information
extraction tool, it allows users to search for textual descriptions of
entities; as a utility to generate functional descriptions (FD), it is used in
a functional-unification based generation system. We present an evaluation of
the approach and its applications to natural language generation and
summarization.

Comment: 8 pages, uses eps
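The noun-phrase description extraction that PROFILE performs can be caricatured with a single appositive pattern: find a proper name followed by a comma-delimited lowercase descriptor and store the pairing as a knowledge source. The regular expression and the example sentence are illustrative assumptions, far cruder than the system's actual extraction machinery.

```python
import re

APPOSITION = re.compile(
    r"(?P<entity>(?:[A-Z][a-z]+ )+[A-Z][a-z]+), "        # multi-word proper name
    r"(?P<description>(?:the |a |an )?[a-z][^,.]+)[,.]"  # lowercase descriptor
)

def harvest(texts):
    """Map each entity to the set of descriptions found for it."""
    profile = {}
    for text in texts:
        for m in APPOSITION.finditer(text):
            profile.setdefault(m.group("entity"), set()).add(
                m.group("description").strip())
    return profile

profile = harvest(["Yasser Arafat, chairman of the PLO, met reporters."])
```

The resulting entity-to-descriptions map is the kind of resource that could serve both uses the abstract names: lookup by users, and reuse of the descriptions when generating text about the same entity.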
Corpus-Driven Knowledge Acquisition for Discourse Analysis
The availability of large on-line text corpora provides a natural and
promising bridge between the worlds of natural language processing (NLP) and
machine learning (ML). In recent years, the NLP community has been aggressively
investigating statistical techniques to drive part-of-speech taggers, but
application-specific text corpora can be used to drive knowledge acquisition at
much higher levels as well. In this paper we will show how ML techniques can be
used to support knowledge acquisition for information extraction systems. It is
often very difficult to specify an explicit domain model for many information
extraction applications, and it is always labor intensive to implement
hand-coded heuristics for each new domain. We have discovered that it is
nevertheless possible to use ML algorithms in order to capture knowledge that
is only implicitly present in a representative text corpus. Our work addresses
issues traditionally associated with discourse analysis and intersentential
inference generation, and demonstrates the utility of ML algorithms at this
higher level of language analysis. The benefits of our work address the
portability and scalability of information extraction (IE) technologies. When
hand-coded heuristics are used to manage discourse analysis in an information
extraction system, months of programming effort are easily needed to port a
successful IE system to a new domain. We will show how ML algorithms can reduce
this effort.

Comment: 6 pages, AAAI-9
New Resources and Perspectives for Biomedical Event Extraction
Event extraction is a major focus of recent work in biomedical information extraction. Despite substantial advances, many challenges still remain for reliable automatic extraction of events from text. We introduce a new biomedical event extraction resource consisting of analyses automatically created by systems participating in the recent BioNLP Shared Task (ST) 2011. In providing for the first time the outputs of a broad set of state-of-the-art event extraction systems, this resource opens many new opportunities for studying aspects of event extraction, from the identification of common errors to the study of effective approaches to combining the strengths of systems. We demonstrate these opportunities through a multi-system analysis on three BioNLP ST 2011 main tasks, focusing on events that none of the systems can successfully extract. We further argue for new perspectives on the performance evaluation of domain event extraction systems, considering a document-level, “off-the-page” representation and evaluation to complement the mention-level evaluations pursued in most recent work.
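The multi-system analysis described above boils down to set operations over system outputs: which gold events does no system extract, and what would a union "oracle" combiner recover? The event tuples and system names below are invented for illustration, not drawn from the actual BioNLP ST 2011 data.

```python
# Gold-standard events and per-system outputs for one hypothetical document,
# each event reduced to a (type, trigger-argument) tuple for simplicity.
gold = {("Phosphorylation", "TRAF2"), ("Regulation", "TRAF2"), ("Binding", "IL-4")}
outputs = {
    "sysA": {("Phosphorylation", "TRAF2")},
    "sysB": {("Phosphorylation", "TRAF2"), ("Binding", "IL-4")},
}

union = set().union(*outputs.values())        # everything any system found
missed_by_all = gold - union                  # the hard cases: no system extracts these
oracle_recall = len(gold & union) / len(gold) # ceiling for a union combiner
```

Events in `missed_by_all` are exactly the ones the abstract proposes studying; `oracle_recall` bounds what system combination could achieve.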
Pattern Matching and Discourse Processing in Information Extraction from Japanese Text
Information extraction is the task of automatically picking up information of
interest from an unconstrained text. Information of interest is usually
extracted in two steps. First, sentence level processing locates relevant
pieces of information scattered throughout the text; second, discourse
processing merges coreferential information to generate the output. In the
first step, pieces of information are locally identified without recognizing
any relationships among them. A key word search or simple pattern search can
achieve this purpose. The second step requires deeper knowledge in order to
understand relationships among separately identified pieces of information.
Previous information extraction systems focused on the first step, partly
because they were not required to link up each piece of information with other
pieces. To link the extracted pieces of information and map them onto a
structured output format, complex discourse processing is essential. This paper
reports on a Japanese information extraction system that merges information
using a pattern matcher and discourse processor. Evaluation results show a high
level of system performance which approaches human performance.

Comment: See http://www.jair.org/ for any accompanying file
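The two-step pipeline in this abstract can be sketched as follows: sentence-level pattern matching fills partial templates, and a discourse step merges them into one structured output. The patterns, slot names, and example text are illustrative assumptions (and in English rather than Japanese); a real discourse processor would also verify that the merged pieces are coreferential rather than folding everything together.

```python
import re

PATTERNS = [
    (re.compile(r"(\w[\w ]*?) will form a joint venture with (\w[\w ]*?)\."),
     ("partner1", "partner2")),
    (re.compile(r"capitalized at (\$[\d,]+ million)"),
     ("capital",)),
]

def sentence_level(sentences):
    """Step 1: locate local pieces of information, one partial template per match."""
    templates = []
    for s in sentences:
        for pattern, slots in PATTERNS:
            m = pattern.search(s)
            if m:
                templates.append(dict(zip(slots, m.groups())))
    return templates

def discourse_merge(templates):
    """Step 2: merge the partial templates into a single event description."""
    merged = {}
    for t in templates:
        merged.update(t)
    return merged

event = discourse_merge(sentence_level([
    "Toyota will form a joint venture with GM.",
    "The venture will be capitalized at $100 million.",
]))
```

The division of labor mirrors the abstract: step 1 needs only keyword or pattern search, while step 2 is where the deeper knowledge about relationships between pieces must live.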
University of Sheffield TREC-8 Q & A System
The system entered by the University of Sheffield in the question answering track of TREC-8 is the result of coupling two existing technologies - information retrieval (IR) and information extraction (IE). In essence the approach is this: the IR system treats the question as a query and returns a set of top ranked documents or passages; the IE system uses NLP techniques to parse the question, analyse the top ranked documents or passages returned by the IR system, and instantiate a query variable in the semantic representation of the question against the semantic representation of the analysed documents or passages. Thus, while the IE system by no means attempts “full text understanding”, this approach is a relatively deep approach which attempts to work with meaning representations.
Since the information retrieval systems we used were not our own (AT&T and UMass) and were used more or less “off the shelf”, this paper concentrates on describing the modifications made to our existing information extraction system to allow it to participate in the Q & A task.
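The IR + IE coupling described above can be caricatured in a few lines: rank passages by bag-of-words overlap with the question (the IR step), then run a shallow extraction pattern over the top-ranked passage (the IE step). Both the overlap scoring and the "Who ..." pattern are illustrative assumptions, much simpler than matching full semantic representations.

```python
import re

def rank(question, passages):
    # IR step: score each passage by word overlap with the question.
    q = set(re.findall(r"\w+", question.lower()))
    return sorted(passages,
                  key=lambda p: len(q & set(re.findall(r"\w+", p.lower()))),
                  reverse=True)

def extract_who(question, passage):
    # IE step: for "Who <predicate>?", look for "<Capitalized Name> <predicate>".
    predicate = question.rstrip("?").split(" ", 1)[1]
    m = re.search(r"([A-Z][\w.]*(?: [A-Z][\w.]*)*) " + re.escape(predicate),
                  passage)
    return m.group(1) if m else None

question = "Who invented the telephone?"
passages = ["The telephone network grew quickly after 1880.",
            "Alexander Graham Bell invented the telephone in 1876."]
answer = extract_who(question, rank(question, passages)[0])
```

The pipeline shape is the point: the IR stage narrows the search space so the more expensive analysis only runs on a handful of candidate passages.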
A Corpus-Based Approach for Building Semantic Lexicons
Semantic knowledge can be a great asset to natural language processing
systems, but it is usually hand-coded for each application. Although some
semantic information is available in general-purpose knowledge bases such as
WordNet and Cyc, many applications require domain-specific lexicons that
represent words and categories for a particular topic. In this paper, we
present a corpus-based method that can be used to build semantic lexicons for
specific categories. The input to the system is a small set of seed words for a
category and a representative text corpus. The output is a ranked list of words
that are associated with the category. A user then reviews the top-ranked words
and decides which ones should be entered in the semantic lexicon. In
experiments with five categories, users typically found about 60 words per
category in 10-15 minutes to build a core semantic lexicon.

Comment: 8 pages - to appear in Proceedings of EMNLP-
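The seed-word method above can be sketched as a simple co-occurrence scorer: count how often each word appears in a small window around the seed words and return a ranked candidate list for a user to review. The window statistic, corpus, and seeds below are illustrative assumptions; the paper's actual scoring function differs.

```python
from collections import Counter

def rank_candidates(sentences, seeds, window=3):
    seeds = set(seeds)
    scores = Counter()
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w in seeds:
                # credit every non-seed word within `window` positions of a seed
                lo, hi = max(0, i - window), min(len(words), i + window + 1)
                for j in range(lo, hi):
                    if j != i and words[j] not in seeds:
                        scores[words[j]] += 1
    return [w for w, _ in scores.most_common()]

ranked = rank_candidates(
    ["the gun and the rifle were seized",
     "a knife and a gun were found"],
    seeds=["gun", "rifle"])
```

The human-in-the-loop step the abstract describes happens after this: the user scans the top of `ranked` and keeps only genuine category members, which is why a noisy statistic still yields a clean lexicon.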
CRYSTAL: Inducing a Conceptual Dictionary
One of the central knowledge sources of an information extraction system is a
dictionary of linguistic patterns that can be used to identify the conceptual
content of a text. This paper describes CRYSTAL, a system which automatically
induces a dictionary of "concept-node definitions" sufficient to identify
relevant information from a training corpus. Each of these concept-node
definitions is generalized as far as possible without producing errors, so that
a minimum number of dictionary entries cover the positive training instances.
Because it tests the accuracy of each proposed definition, CRYSTAL can often
surpass human intuitions in creating reliable extraction rules.

Comment: 6 pages, Postscript, IJCAI-95
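The generalize-as-far-as-possible-without-errors idea can be sketched as follows: start from a maximally specific rule (a set of required features) and drop one constraint at a time, keeping each relaxation only if the rule still matches no negative training instance. The feature encoding and examples are illustrative assumptions, not CRYSTAL's concept-node formalism.

```python
def matches(rule, instance):
    # a rule fires on any instance that satisfies all of its constraints
    return rule.issubset(instance)

def generalize(rule, negatives):
    rule = set(rule)
    for feat in sorted(rule):            # fixed order keeps the sketch deterministic
        relaxed = rule - {feat}
        if not any(matches(relaxed, n) for n in negatives):
            rule = relaxed               # still excludes every negative: keep it
    return rule

rule = generalize(
    {"subj=company", "verb=form", "obj=venture"},
    negatives=[{"subj=company", "verb=dissolve", "obj=venture"}])
```

Because each relaxation is tested against the training data before being accepted, the surviving rule is as general as the negatives allow, which is the property the abstract credits for minimizing dictionary size.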