Corpus-Driven Knowledge Acquisition for Discourse Analysis
The availability of large on-line text corpora provides a natural and
promising bridge between the worlds of natural language processing (NLP) and
machine learning (ML). In recent years, the NLP community has been aggressively
investigating statistical techniques to drive part-of-speech taggers, but
application-specific text corpora can be used to drive knowledge acquisition at
much higher levels as well. In this paper we will show how ML techniques can be
used to support knowledge acquisition for information extraction systems. It is
often very difficult to specify an explicit domain model for many information
extraction applications, and it is always labor intensive to implement
hand-coded heuristics for each new domain. We have discovered that it is
nevertheless possible to use ML algorithms in order to capture knowledge that
is only implicitly present in a representative text corpus. Our work addresses
issues traditionally associated with discourse analysis and intersentential
inference generation, and demonstrates the utility of ML algorithms at this
higher level of language analysis. The benefits of our work address the
portability and scalability of information extraction (IE) technologies. When
hand-coded heuristics are used to manage discourse analysis in an information
extraction system, months of programming effort are easily needed to port a
successful IE system to a new domain. We will show how ML algorithms can reduce
this
Comment: 6 pages, AAAI-9
CRYSTAL: Inducing a Conceptual Dictionary
One of the central knowledge sources of an information extraction (IE) system is a dictionary of linguistic patterns that can be used to identify references to relevant information in a text. Automatic creation of conceptual dictionaries is important for portability and scalability of an IE system. This paper describes CRYSTAL, a system which automatically induces a dictionary of "concept-node definitions" sufficient to identify relevant information from a training corpus. Each of these concept-node definitions is generalized as far as possible without producing errors, so that a minimum number of dictionary entries cover the positive training instances. Because it tests the accuracy of each proposed definition, CRYSTAL can often surpass human intuitions in creating reliable extraction rules.
CRYSTAL: Inducing a Conceptual Dictionary
One of the central knowledge sources of an information extraction system is a
dictionary of linguistic patterns that can be used to identify the conceptual
content of a text. This paper describes CRYSTAL, a system which automatically
induces a dictionary of "concept-node definitions" sufficient to identify
relevant information from a training corpus. Each of these concept-node
definitions is generalized as far as possible without producing errors, so that
a minimum number of dictionary entries cover the positive training instances.
Because it tests the accuracy of each proposed definition, CRYSTAL can often
surpass human intuitions in creating reliable extraction rules.
Comment: 6 pages, Postscript, IJCAI-95
http://ciir.cs.umass.edu/info/psfiles/tepubs/tepubs.htm
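The idea in the CRYSTAL abstract of generalizing a definition as far as possible without producing errors, so that few definitions cover the positive instances, can be illustrated with a toy bottom-up covering loop. This is only a sketch under strong assumptions and not CRYSTAL's actual representation or algorithm: instances are modelled as flat sets of hypothetical feature strings, a rule covers an instance when all of the rule's features are present, and "without producing errors" is taken to mean "covers no negative training instance". The names `covers` and `induce_rules` are invented for illustration.

```python
def covers(rule, instance):
    """A rule covers an instance if every feature the rule requires is present."""
    return rule <= instance


def induce_rules(positives, negatives):
    """Greedy bottom-up covering over sets of features (toy illustration).

    Each rule starts as a copy of one uncovered positive instance and is
    relaxed by intersecting it with the most similar positive it does not
    yet cover. A relaxation is kept only if the result is non-empty and
    covers no negative instance.
    """
    rules = []
    uncovered = list(positives)
    while uncovered:
        rule = uncovered[0]  # most specific rule: the seed instance itself
        while True:
            candidates = [p for p in uncovered if not covers(rule, p)]
            if not candidates:
                break
            nearest = max(candidates, key=lambda p: len(rule & p))
            relaxed = rule & nearest  # drop the constraints the two disagree on
            if not relaxed or any(covers(relaxed, n) for n in negatives):
                break  # further generalization would produce an error
            rule = relaxed
        rules.append(rule)
        uncovered = [p for p in uncovered if not covers(rule, p)]
    return rules


# Hypothetical training data: each instance is a set of feature strings.
positives = [
    frozenset({"subj:patient", "verb:denies", "obj:pain"}),
    frozenset({"subj:patient", "verb:denies", "obj:nausea"}),
    frozenset({"subj:nurse", "verb:reports", "obj:fever"}),
]
negatives = [
    frozenset({"subj:patient", "verb:reports", "obj:pain"}),
]

rules = induce_rules(positives, negatives)
for rule in rules:
    print(sorted(rule))
```

In this toy data the first two positives are merged into a single rule, while the third keeps its own rule because it shares no features with the others. Because each rule is checked against the negatives before a relaxation is accepted, no induced rule covers a negative instance.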
A High Pressure Distorted α-Uranium (Pnma) Structure in Plutonium
Under pressure many rare earths and actinide metals transform to the α-U
structure or its lower symmetry distorted forms. We have reinterpreted the
diffraction data of Dabos et al. for Pu (reference 4) and find that an Am IV type
distorted α-U structure in the Pnma space group can explain the data for its high
pressure phase. The structures of this phase and α-Pu are both shown to have a
distorted hcp topology. The upturn in the atomic volume of Pu at 0.1 MPa can
also be rationalized on the basis of this proposal.
Comment: 10 pages, 3 figures
Growing a list
It is easy to find expert knowledge on the Internet on almost any topic, but obtaining a complete overview of a given topic is not always easy: Information can be scattered across many sources and must be aggregated to be useful. We introduce a method for intelligently growing a list of relevant items, starting from a small seed of examples. Our algorithm takes advantage of the wisdom of the crowd, in the sense that there are many experts who post lists of things on the Internet. We use a collection of simple machine learning components to find these experts and aggregate their lists to produce a single complete and meaningful list. We use experiments with gold standards and open-ended experiments without gold standards to show that our method significantly outperforms the state of the art. Our method uses the clustering algorithm Bayesian Sets even when its underlying independence assumption is violated, and we provide a theoretical generalization bound to motivate its use.
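The abstract above names Bayesian Sets as the component that ranks candidate items against a seed. As a rough illustration of that scoring step only (not the paper's full pipeline for finding and aggregating lists), here is a minimal sketch of the standard Bayesian Sets log-score for binary features with independent Beta-Bernoulli priors. The function name, the toy feature matrix, the prior scaling `c=2.0`, and the clamping constant `eps` are all assumptions made for this example.

```python
import math


def bayesian_sets_scores(items, seed_indices, c=2.0, eps=1e-6):
    """Log-score of every item against a seed set (binary features).

    items: list of equal-length 0/1 feature vectors.
    seed_indices: indices into `items` forming the seed set.
    Priors are set from the data: alpha_j = c * mean_j, beta_j = c * (1 - mean_j),
    with means clamped away from 0 and 1 so the logarithms stay finite.
    """
    n, d = len(items), len(items[0])
    means = [min(max(sum(row[j] for row in items) / n, eps), 1 - eps)
             for j in range(d)]
    alpha = [c * m for m in means]
    beta = [c * (1 - m) for m in means]

    n_seed = len(seed_indices)
    seed_sum = [sum(items[i][j] for i in seed_indices) for j in range(d)]
    alpha_post = [alpha[j] + seed_sum[j] for j in range(d)]
    beta_post = [beta[j] + n_seed - seed_sum[j] for j in range(d)]

    # The log-score is linear in the item's features: const + sum_j q[j] * x[j].
    const = sum(
        math.log(alpha[j] + beta[j]) - math.log(alpha[j] + beta[j] + n_seed)
        + math.log(beta_post[j]) - math.log(beta[j])
        for j in range(d)
    )
    q = [
        math.log(alpha_post[j]) - math.log(alpha[j])
        - math.log(beta_post[j]) + math.log(beta[j])
        for j in range(d)
    ]
    return [const + sum(q[j] * row[j] for j in range(d)) for row in items]


# Hypothetical data: rows are candidate items, columns are binary features
# (for example, "appears on source list k").
items = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
scores = bayesian_sets_scores(items, seed_indices=[0, 1])
ranking = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Items that share the feature common to both seed items (the first column here) rank above items that lack it. Because the score is a linear function of the feature vector, scoring a large candidate pool reduces to a single matrix-vector product.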
Lemmatic machine translation
Statistical MT is limited by reliance on large parallel corpora. We propose Lemmatic MT, a new paradigm that extends MT to a far broader set of languages, but requires substantial manual encoding effort. We present PANLINGUAL TRANSLATOR, a prototype Lemmatic MT system with high translation adequacy on 59% to 99% of sentences (average 84%) on a sample of 6 language pairs that Google Translate (GT) handles. GT ranged from 34% to 93%, average 65%. PANLINGUAL TRANSLATOR also had high translation adequacy on 27% to 82% of sentences (average 62%) from a sample of 5 language pairs not handled by GT.
Automatising the learning of lexical patterns: An application to the enrichment of WordNet by extracting semantic relationships from Wikipedia
This is the author’s version of a work that was accepted for publication in Journal Data & Knowledge Engineering. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal Data & Knowledge Engineering, 61, 3, (2007) DOI: 10.1016/j.datak.2006.06.011
This paper describes an automatic approach to identify lexical patterns that represent semantic relationships between concepts in an on-line encyclopedia. Next, these patterns can be applied to extend existing ontologies or semantic networks with new relations. The experiments have been performed with the Simple English Wikipedia and WordNet 1.7. A new algorithm has been devised for automatically generalising the lexical patterns found in the encyclopedia entries. We have found general patterns for the hyperonymy, hyponymy, holonymy and meronymy relations and, using them, we have extracted more than 2600 new relationships that did not appear in WordNet originally. The precision of these relationships depends on the degree of generality chosen for the patterns and the type of relation, being around 60-70% for the best combinations proposed.
This work has been sponsored by MEC, project number TIN-2005-0688
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make
it possible to gather a large amount of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed to
work in a given domain, in other domains.
Comment: Knowledge-based System