312 research outputs found

Corpus-Driven Knowledge Acquisition for Discourse Analysis

Author: Lehnert Wendy
Soderland Stephen
Publication date: 01/01/1994

The availability of large on-line text corpora provides a natural and promising bridge between the worlds of natural language processing (NLP) and machine learning (ML). In recent years, the NLP community has been aggressively investigating statistical techniques to drive part-of-speech taggers, but application-specific text corpora can be used to drive knowledge acquisition at much higher levels as well. In this paper we will show how ML techniques can be used to support knowledge acquisition for information extraction systems. It is often very difficult to specify an explicit domain model for many information extraction applications, and it is always labor intensive to implement hand-coded heuristics for each new domain. We have discovered that it is nevertheless possible to use ML algorithms in order to capture knowledge that is only implicitly present in a representative text corpus. Our work addresses issues traditionally associated with discourse analysis and intersentential inference generation, and demonstrates the utility of ML algorithms at this higher level of language analysis. The benefits of our work address the portability and scalability of information extraction (IE) technologies. When hand-coded heuristics are used to manage discourse analysis in an information extraction system, months of programming effort are easily needed to port a successful IE system to a new domain. We will show how ML algorithms can reduce this

Comment: 6 pages, AAAI-9

arXiv.org e-Print Archive

CiteSeerX

CRYSTAL: Inducing a Conceptual Dictionary

Author: Soderland Stephen
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/1995

One of the central knowledge sources of an information extraction (IE) system is a dictionary of linguistic patterns that can be used to identify references to relevant information in a text. Automatic creation of conceptual dictionaries is important for portability and scalability of an IE system. This paper describes CRYSTAL, a system which automatically induces a dictionary of "concept-node definitions" sufficient to identify relevant information from a training corpus. Each of these concept-node definitions is generalized as far as possible without producing errors, so that a minimum number of dictionary entries cover the positive training instances. Because it tests the accuracy of each proposed definition, CRYSTAL can often surpass human intuitions in creating reliable extraction rules.

ScholarWorks@UMass Amherst

CRYSTAL: Inducing a Conceptual Dictionary

Author: Aseltine Jonathan
Fisher David
Lehnert Wendy
Soderland Stephen
Publication date: 01/01/1995

One of the central knowledge sources of an information extraction system is a dictionary of linguistic patterns that can be used to identify the conceptual content of a text. This paper describes CRYSTAL, a system which automatically induces a dictionary of "concept-node definitions" sufficient to identify relevant information from a training corpus. Each of these concept-node definitions is generalized as far as possible without producing errors, so that a minimum number of dictionary entries cover the positive training instances. Because it tests the accuracy of each proposed definition, CRYSTAL can often surpass human intuitions in creating reliable extraction rules.

Comment: 6 pages, Postscript, IJCAI-95 http://ciir.cs.umass.edu/info/psfiles/tepubs/tepubs.htm

arXiv.org e-Print Archive

CiteSeerX

A High Pressure Distorted α-Uranium (Pnma) Structure in Plutonium

Author: Akella
Bouchet
Dabos
Grosshans
Haire
Lander
Lashley
Le Bihan
Lindbaum
Roof
S.K. Sikka
Smith
Soderland
Velisavlgoevic
Wallace
Wick
Zachariasen
Publication venue: 'Elsevier BV'
Publication date: 03/09/2004

Under pressure, many rare earth and actinide metals transform to the α-U structure or its lower-symmetry distorted forms. We have reinterpreted the diffraction data of Dabos et al. for Pu (reference 4) and find that an Am IV type distorted α-U structure in the Pnma space group can explain its high-pressure phase. The structures of this phase and α-Pu are both shown to have a distorted hcp topology. The upturn in the atomic volume of Pu at 0.1 MPa can also be rationalized on the basis of this proposal.

Comment: 10 pages, 3 figures

arXiv.org e-Print Archive

Crossref

Growing a list

Author: Benjamin Letham
Cynthia Rudin
DF Hsu
Katherine A. Heller
M Lalmas
MMS Beg
O Bousquet
O Etzioni
R Gupta
S Soderland
Publication venue: Massachusetts Institute of Technology, Operations Research Center
Publication date: 21/08/2012

It is easy to find expert knowledge on the Internet on almost any topic, but obtaining a complete overview of a given topic is not always easy: Information can be scattered across many sources and must be aggregated to be useful. We introduce a method for intelligently growing a list of relevant items, starting from a small seed of examples. Our algorithm takes advantage of the wisdom of the crowd, in the sense that there are many experts who post lists of things on the Internet. We use a collection of simple machine learning components to find these experts and aggregate their lists to produce a single complete and meaningful list. We use experiments with gold standards and open-ended experiments without gold standards to show that our method significantly outperforms the state of the art. Our method uses the clustering algorithm Bayesian Sets even when its underlying independence assumption is violated, and we provide a theoretical generalization bound to motivate its use.

CiteSeerX

DSpace@MIT

Crossref

Lemmatic machine translation

Author: Bo Qin
Jonathan Pool
Mausam
Christopher Lim
Oren Etzioni
Stephen Soderland
Publication date: 01/01/2009

Statistical MT is limited by reliance on large parallel corpora. We propose Lemmatic MT, a new paradigm that extends MT to a far broader set of languages, but requires substantial manual encoding effort. We present PANLINGUAL TRANSLATOR, a prototype Lemmatic MT system with high translation adequacy on 59% to 99% of sentences (average 84%) on a sample of 6 language pairs that Google Translate (GT) handles. GT ranged from 34% to 93%, average 65%. PANLINGUAL TRANSLATOR also had high translation adequacy on 27% to 82% of sentences (average 62%) from a sample of 5 language pairs not handled by GT.

CiteSeerX

Using Support Vector Machines for Terrorism Information Extraction

Author: E. Riloff
J.-T. Kim
S. Baluja
S. Soderland
V. N. Vapnik
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/06/2003

Crossref

Institutional Knowledge at Singapore Management University

Automatising the learning of lexical patterns: An application to the enrichment of WordNet by extracting semantic relationships from Wikipedia

Author: Alfonseca
Arevalo
Baeza-Yates
Batali
Berners-Lee
Bonzi
Church
Ding
Enrique Alfonseca
Etzioni
Gruber
Harabagiu
Hearst
Maedche
Marcus
Maria Ruiz-Casado
Miller
Navigli
Pablo Castells
Ruiz-Casado
Soderland
Wagner
Wilks
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007

This is the author’s version of a work that was accepted for publication in Journal Data & Knowledge Engineering. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Journal Data & Knowledge Engineering, 61, 3, (2007) DOI: 10.1016/j.datak.2006.06.011

This paper describes an automatic approach to identify lexical patterns that represent semantic relationships between concepts in an on-line encyclopedia. Next, these patterns can be applied to extend existing ontologies or semantic networks with new relations. The experiments have been performed with the Simple English Wikipedia and WordNet 1.7. A new algorithm has been devised for automatically generalising the lexical patterns found in the encyclopedia entries. We have found general patterns for the hyperonymy, hyponymy, holonymy and meronymy relations and, using them, we have extracted more than 2600 new relationships that did not appear in WordNet originally. The precision of these relationships depends on the degree of generality chosen for the patterns and the type of relation, being around 60-70% for the best combinations proposed.

This work has been sponsored by MEC, project number TIN-2005-0688

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Biblos-e Archivo

Web Data Extraction, Applications and Techniques: A Survey

Author: Abel
Amalfitano
Balduzzi
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Berger
Berthold
Bettencourt
Califf
Catanese
Chang
Chen
Chen
Chen
Collins
Conover
Crandall
Crescenzi
Crescenzi
Dalvi
Dalvi
De Meo
De Meo
Doan
Emilio Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Flesca
Freitag
Furche
Gatterbauer
Gatterbauer
Giacomo Fiumara
Gjoka
Gkotsis
Gottlob
Gottlob
Hammersley
Han
Hecht
Hsu
Irmak
Khare
Kim
Kinsella
Kleinberg
Kleinberg
Kohlschütter
Kokkoras
Kokkoras
Kokkoras
Krüpl
Kushmerick
Kwak
Laender
Liu
Manning
Masanès
Mathes
Meng
Mislove
Monge
Muslea
Oro
Pan
Pasquale De Meo
Perito
Phan
Plake
Rahm
Rahm
Reis
Robert Baumgartner
Sahuguet
Sarawagi
Schifanella
Selkow
Shi
Soderland
Szomszor
Turmo
Vosecky
Wang
Wang
Weikum
Wilson
Winograd
Yang
Ye
Zafarani
Zanasi
Zhai
Zhang
Zhang
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, and this offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.

Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref