Search CORE

36 research outputs found

Empirical study of automated dictionary construction for information extraction in three domains

Author: Riloff Ellen M.
Publication venue: 'Elsevier BV'
Publication date: 01/01/1996
Field of study

ManuscriptA primary goal of natural language processing researchers is to develop a knowledge-based natural language processing (NLP) system that is portable across domains. However, most knowledge-based NLP systems rely on a domain-specific dictionary of concepts, which represents a substantial knowledge-engineering bottleneck. We have developed a system called AutoSlog that addresses the knowledge-engineering bottleneck for a task called information extraction. AutoSlog automatically creates domain-specific dictionaries for information extraction, given an appropriate training corpus. We have used AutoSlog to create a dictionary of extraction patterns for terrorism, which achieved 98% of the performance of a handcrafted dictionary that required approximately 1500 person-hours to build. In this paper, we describe experiments with AutoSlog in two additional domains: joint ventures and microelectronics. We compare the performance of AutoSlog across the three domains, discuss the lessons learned about the generality of this approach, and present results from two experiments which demonstrate that novice users can generate effective dictionaries using AutoSlog

The University of Utah: J. Willard Marriott Digital Library

Learning subjective nouns using extraction pattern bootstrapping

Author: Riloff Ellen M.
Wiebe Janyce
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2003
Field of study

Journal ArticleWe explore the idea of creating a subjectivity classifier that uses lists of subjective nouns learned by bootstrapping algorithms. The goal of our research is to develop a system that can distinguish subjective sentences from objective sentences. First, we use two bootstrapping algorithms that exploit extraction patterns to learn sets of subjective nouns. Then we train a Naive Bayes classifier using the subjective nouns, discourse features, and subjectivity clues identified in prior research. The bootstrapping algorithms learned over 1000 subjective nouns, and the subjectivity classifier performed well, achieving 77% recall with 81% precision

The University of Utah: J. Willard Marriott Digital Library

Corpus-based approach for building semantic lexicons

Author: Riloff Ellen M.
Shepherd Jessica
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/1997
Field of study

Journal ArticleSemantic knowledge can be a great asset to natural language processing systems, but it is usually hand-coded for each application. Although some semantic information is available in general-purpose knowledge bases such as Word Net and Cyc, many applications require domain-specific lexicons that represent words and categories for a particular topic. In this paper, we present a corpus-based method that can be used t o build semantic lexicons for specific categories. The input t o the system is a small set of seed words for a category and a representative text corpus. The output is a ranked list of words that are associated with the category. A user then reviews the top-ranked words and decides which ones should be entered in the semantic lexicon. Tn experiments with five categories, users typically found about 60 words per category in 10-15 minutes to build a core semantic lexicon

The University of Utah: J. Willard Marriott Digital Library

Learning and evaluating the content and structure of a term taxonomy

Author: Kozareva Zornitsa
Riloff Ellen M.
Publication venue: 'Association for the Advancement of Artificial Intelligence (AAAI)'
Publication date: 01/01/2009
Field of study

Journal ArticleIn this paper, we describe a weakly supervised bootstrapping algorithm that reads Web texts and learns taxonomy terms. The bootstrapping algorithm starts with two seed words (a seed hypernym (Root concept) and a seed hyponym) that are inserted into a doubly anchored hyponym pattern. In alternating rounds, the algorithm learns new hyponym terms and new hypernym terms that are subordinate to the Root concept. We conducted an extensive evaluation with human annotators to evaluate the learned hyponym and hypernym terms for two categories: animals and people

The University of Utah: J. Willard Marriott Digital Library

Empirical approach to conceptual case frame acquisition

Author: Riloff Ellen M.
Schmelzenbach Mark
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/1998
Field of study

Journal ArticleConceptual natural language processing systems usually rely on case frame instantiation to recognize events and role objects in text. But generating a good set of case frames for a domain is time-consuming, tedious, and prone to errors of omission. We have developed a corpus-based algorithm for acquiring conceptual case frames empirically from unannotated text. Our algorithm builds on previous research on corpus-based methods for acquiring extraction patterns and semantic lexicons. Given extraction patterns and a semantic lexicon for a domain, our algorithm learns semantic preferences for each extraction pattern and merges the syntactically compatible patterns to produce multi-slot case frames with selectional restrictions. The case frames generate more cohesive output and produce fewer false hits than the original extraction patterns. Our system requires only preclassified training texts and a few hours of manual review to filter the dictionaries, demonstrating that conceptual case frames can be acquired from unannotated text without special training resources

The University of Utah: J. Willard Marriott Digital Library

Bootstrapping for text learning tasks

Author: Jones Rosie
Riloff Ellen M.
Publication venue: 'Association for the Advancement of Artificial Intelligence (AAAI)'
Publication date: 01/01/1999
Field of study

Journal ArticleWhen applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large quantity of labeled data, this paper advocates using a small amount of seed information and a large collection of easily-obtained unlabeled data. Bootstrapping initializes a learner with the seed information; it then iterates, applying the learner to calculate labels for the unlabeled data, and incorporating some of these labels into the training input for the learner. Two case studies of this approach are presented. Bootstrapping for information extraction provides 76% precision for a 250-word dictionary for extracting locations from web pages, when starting with just a few seed locations. Bootstrapping a text classifier from a few keywords per class and a class hierarchy provides accuracy of 66%, a level close to human agreement, when placing computer science research papers into a topic hierarchy. The success of these two examples argues for the strength of the general boot¬ strapping approach for text learning tasks

The University of Utah: J. Willard Marriott Digital Library

Learning dictionaries for information extraction by multi-level bootstrapping

Author: Jones Rosie
Riloff Ellen M.
Publication venue: 'Association for the Advancement of Artificial Intelligence (AAAI)'
Publication date: 01/01/1999
Field of study

Journal ArticleInformation extraction systems usually require two dictionaries: a semantic lexicon and a dictionary of extraction patterns for the domain. We present a multilevel bootstrapping algorithm that generates both the semantic lexicon and extraction patterns simultaneously. As input, our technique requires only unannotated training texts and a handful of seed words for a category. We use a mutual bootstrapping technique to alternately select the best extraction pattern for the category and bootstrap its extractions into the semantic lexicon, which is the basis for selecting the next extraction pattern. To make this approach more robust, we add a second level of bootstrapping (metabootstrapping) that retains only the most reliable lexicon entries produced by mutual bootstrapping and then restarts the process. We evaluated this multilevel bootstrapping technique on a collection of corporate web pages and a corpus of terrorism news articles. The algorithm produced high-quality dictionaries for several semantic categories

The University of Utah: J. Willard Marriott Digital Library

Identifying sources of opinions with conditional random fields and extraction patterns

Author: Choi Yejin
Riloff Ellen M.
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2005
Field of study

Journal ArticleRecent systems have been developed for sentiment classification, opinion recognition, and opinion analysis (e.g., detecting polarity and strength). We pursue another aspect of opinion analysis: identifying the sources of opinions, emotions, and sentiments. We view this problem as an information extraction task and adopt a hybrid approach that combines Conditional Random Fields (Lafferty et al., 2001) and a variation of AutoSlog (Riloff, 1996a). While CRFs model source identification as a sequence tagging task, AutoSlog learns extraction patterns. Our results show that the combination of these two methods performs better than either one alone. The resulting system identifies opinion sources with 79:3% precision and 59:5% recall using a head noun matching measure, and 81:2% precision and 60:6% recall using an overlap measure

The University of Utah: J. Willard Marriott Digital Library

Exploiting strong syntactic heuristics and co-training to learn semantic lexicons

Author: Phillips William
Riloff Ellen M.
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2002
Field of study

Journal ArticleWe present a bootstrapping method that uses strong syntactic heuristics to learn semantic lexicons. The three sources of information are appositives, compound nouns, and ISA clauses. We apply heuristics to these syntactic structures, embed them in a bootstrapping architecture, and combine them with co-training. Results on WSJ articles and a pharmaceutical corpus show that this method obtains high precision and finds a large number of terms

Crossref

The University of Utah: J. Willard Marriott Digital Library

Corpus-based bootstrapping algorithm for semi-automated semantic lexicon construction

Author: Riloff Ellen M.
Shepherd Jessica
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/1999
Field of study

Journal ArticleMany applications need a lexicon that represents semantic information but acquiring lexical information is time consuming. We present a corpus-based bootstrapping algorithm that assists users in creating domain-specifi c semantic lexicons quickly. Our algorithm uses a representative text corpus for the domain and a small set of 'seed words' that belong to a semantic class of interest. The algorithm hypothesizes new words that are also likely to belong to the semantic class because they occur in the same contexts as the seed words. The best hypotheses are added to the seed word list dynamically, and the process iterates in a bootstrapping fashion. When the bootstrapping process halts, a ranked list of hypothesized category words is presented to a user for review. We used this algorithm to generate a semantic lexicon for eleven semantic classes associated with the MUC-4 terrorism domain

CiteSeerX

The University of Utah: J. Willard Marriott Digital Library