
65 research outputs found

Corpus-Driven Knowledge Acquisition for Discourse Analysis

Author: Lehnert Wendy
Soderland Stephen
Publication date: 01/01/1994

The availability of large on-line text corpora provides a natural and promising bridge between the worlds of natural language processing (NLP) and machine learning (ML). In recent years, the NLP community has been aggressively investigating statistical techniques to drive part-of-speech taggers, but application-specific text corpora can be used to drive knowledge acquisition at much higher levels as well. In this paper we will show how ML techniques can be used to support knowledge acquisition for information extraction systems. It is often very difficult to specify an explicit domain model for many information extraction applications, and it is always labor intensive to implement hand-coded heuristics for each new domain. We have discovered that it is nevertheless possible to use ML algorithms in order to capture knowledge that is only implicitly present in a representative text corpus. Our work addresses issues traditionally associated with discourse analysis and intersentential inference generation, and demonstrates the utility of ML algorithms at this higher level of language analysis. The benefits of our work address the portability and scalability of information extraction (IE) technologies. When hand-coded heuristics are used to manage discourse analysis in an information extraction system, months of programming effort are easily needed to port a successful IE system to a new domain. We will show how ML algorithms can reduce this
Comment: 6 pages, AAAI-9

arXiv.org e-Print Archive

CiteSeerX

CRYSTAL: Inducing a Conceptual Dictionary

Author: Aseltine Jonathan
Fisher David
Lehnert Wendy
Soderland Stephen
Publication date: 01/01/1995

One of the central knowledge sources of an information extraction system is a dictionary of linguistic patterns that can be used to identify the conceptual content of a text. This paper describes CRYSTAL, a system which automatically induces a dictionary of "concept-node definitions" sufficient to identify relevant information from a training corpus. Each of these concept-node definitions is generalized as far as possible without producing errors, so that a minimum number of dictionary entries cover the positive training instances. Because it tests the accuracy of each proposed definition, CRYSTAL can often surpass human intuitions in creating reliable extraction rules.
Comment: 6 pages, Postscript, IJCAI-95 http://ciir.cs.umass.edu/info/psfiles/tepubs/tepubs.htm

arXiv.org e-Print Archive

CiteSeerX


CRYSTAL: Inducing a Conceptual Dictionary

Author: Soderland Stephen
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/1995

One of the central knowledge sources of an information extraction (IE) system is a dictionary of linguistic patterns that can be used to identify references to relevant information in a text. Automatic creation of conceptual dictionaries is important for portability and scalability of an IE system. This paper describes CRYSTAL, a system which automatically induces a dictionary of "concept-node definitions" sufficient to identify relevant information from a training corpus. Each of these concept-node definitions is generalized as far as possible without producing errors, so that a minimum number of dictionary entries cover the positive training instances. Because it tests the accuracy of each proposed definition, CRYSTAL can often surpass human intuitions in creating reliable extraction rules.

ScholarWorks@UMass Amherst

Lemmatic machine translation

Author: Bo Qin
Jonathan Pool
Mausam
Christopher Lim
Oren Etzioni
Stephen Soderland
Publication date: 01/01/2009

Statistical MT is limited by reliance on large parallel corpora. We propose Lemmatic MT, a new paradigm that extends MT to a far broader set of languages, but requires substantial manual encoding effort. We present PANLINGUAL TRANSLATOR, a prototype Lemmatic MT system with high translation adequacy on 59% to 99% of sentences (average 84%) on a sample of 6 language pairs that Google Translate (GT) handles. GT ranged from 34% to 93%, average 65%. PANLINGUAL TRANSLATOR also had high translation adequacy on 27% to 82% of sentences (average 62%) from a sample of 5 language pairs not handled by GT.

CiteSeerX

Open information extraction from the web

Author: Michele Banko
Oren Etzioni
Stephen Soderland
Daniel S. Weld
Publication venue: 'Association for Computing Machinery (ACM)'

Crossref

Learning to Extract Text-based Information from the World Wide Web

Author: Stephen Soderland

There is a wealth of information to be mined from narrative text on the World Wide Web. Unfortunately, standard natural language processing (NLP) extraction techniques expect full, grammatical sentences, and perform poorly on the choppy sentence fragments that are often found on web pages. This paper introduces Webfoot, a preprocessor that parses web pages into logically coherent segments based on page layout cues. Output from Webfoot is then passed on to CRYSTAL, an NLP system that learns text extraction rules from example. Webfoot and CRYSTAL transform the text into a formal representation that is equivalent to relational database entries. This is a necessary first step for knowledge discovery and other automated analysis of free text.

Information Extraction from the Web

The World Wide Web contains a wealth of text information in the form of free text. Until a text extraction system transforms it into an unambiguous format, much of this information remains inaccessible to autom..

CiteSeerX

Learning text analysis rules for domain-specific natural language processing

Author: Soderland Stephen Glenn
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/1997

An enormous amount of knowledge is needed to infer the meaning of unrestricted natural language. The problem can be reduced to a manageable size by restricting attention to a specific domain, which is a corpus of texts together with a predefined set of concepts that are of interest to that domain. Two widely different domains are used to illustrate this domain-specific approach. One domain is a collection of Wall Street Journal articles in which the target concept is management succession events: identifying persons moving into corporate management positions or moving out. A second domain is a collection of hospital discharge summaries in which the target concepts are various classes of diagnosis or symptom. The goal of an information extraction system is to identify references to the concept of interest for a particular domain. A key knowledge source for this purpose is a set of text analysis rules based on the vocabulary, semantic classes, and writing style peculiar to the domain. This thesis presents CRYSTAL, an implemented system that automatically induces domain-specific text analysis rules from training examples. CRYSTAL learns rules that approach the performance of hand-coded rules, are robust in the face of noise and inadequate features, and require only a modest amount of training data. CRYSTAL belongs to a class of machine learning algorithms called covering algorithms, and presents a novel control strategy with time and space complexities that are independent of the number of features. CRYSTAL navigates efficiently through an extremely large space of possible rules. CRYSTAL also demonstrates that expressive rule representation is essential for high-performance, robust text analysis rules. While simple rules are adequate to capture the most salient regularities in the training data, high performance can only be achieved when rules are expressive enough to reflect the subtlety and variability of unrestricted natural language.

CiteSeerX

ScholarWorks@UMass Amherst