Compacting the Penn Treebank Grammar
Treebanks, such as the Penn Treebank (PTB), offer a simple approach to
obtaining a broad coverage grammar: one can simply read the grammar off the
parse trees in the treebank. While such a grammar is easy to obtain, a
square-root rate of growth of the rule set with corpus size suggests that the
derived grammar is far from complete and that much more treebanked text would
be required to obtain a complete grammar, if one exists at some limit. However,
we offer an alternative explanation in terms of the underspecification of
structures within the treebank. This hypothesis is explored by applying an
algorithm to compact the derived grammar by eliminating redundant rules --
rules whose right hand sides can be parsed by other rules. The size of the
resulting compacted grammar, which is significantly less than that of the full
treebank grammar, is shown to approach a limit. However, such a compacted
grammar does not yield very good performance figures. A version of the
compaction algorithm taking rule probabilities into account is proposed, which
is argued to be more linguistically motivated. Combined with simple
thresholding, this method can be used to give a 58% reduction in grammar size
without significant change in parsing performance, and can produce a 69%
reduction with some gain in recall, but a loss in precision.
Comment: 5 pages, 2 figures
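The redundancy test described in this abstract (a rule is redundant when its right-hand side can be parsed by the other rules) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the rule encoding as `(lhs, rhs)` tuples, the function names `matches`, `derives` and `compact`, the order in which rules are examined, and the toy grammar are all assumptions made for illustration, and the probabilistic variant with thresholding is not covered.

```python
def matches(chart, rhs, i, j):
    """True if the symbols in rhs can tile the span i..j using chart entries."""
    if not rhs:
        return i == j
    first, rest = rhs[0], rhs[1:]
    # Each remaining symbol needs at least one position of the span.
    for k in range(i + 1, j - len(rest) + 1):
        if first in chart.get((i, k), ()) and matches(chart, rest, k, j):
            return True
    return False


def derives(rules, target, seq):
    """Bottom-up chart recognition: can `target` be derived over `seq`?

    `seq` is a sequence of grammar symbols; every symbol matches itself.
    Assumes no rule has an empty right-hand side.
    """
    n = len(seq)
    chart = {(i, i + 1): {seq[i]} for i in range(n)}
    for width in range(1, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = chart.setdefault((i, j), set())
            changed = True
            while changed:  # repeat so that unary rules are closed over
                changed = False
                for lhs, rhs in rules:
                    if lhs not in cell and matches(chart, rhs, i, j):
                        cell.add(lhs)
                        changed = True
    return target in chart[(0, n)]


def compact(rules):
    """Drop each rule whose right-hand side the remaining rules can parse."""
    kept = list(rules)
    for rule in list(rules):  # examination order is an assumption
        others = [r for r in kept if r != rule]
        lhs, rhs = rule
        if derives(others, lhs, rhs):
            kept = others
    return kept


# Hypothetical toy grammar, not taken from the Penn Treebank.
toy_rules = [
    ("NP", ("DT", "NN")),
    ("NP", ("NP", "PP")),
    ("NP", ("DT", "NN", "PP")),  # parsable via the two rules above
    ("PP", ("IN", "NP")),
]
compacted = compact(toy_rules)
```

In this toy grammar the flat rule `NP -> DT NN PP` is dropped, because `DT NN` already parses as `NP` and `NP PP` parses as `NP`. The result of such a greedy pass can depend on the order in which rules are examined.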
A Data-Oriented Approach to Semantic Interpretation
In Data-Oriented Parsing (DOP), an annotated language corpus is used as a
stochastic grammar. The most probable analysis of a new input sentence is
constructed by combining sub-analyses from the corpus in the most probable way.
This approach has been successfully used for syntactic analysis, using corpora
with syntactic annotations such as the Penn Treebank. If a corpus with
semantically annotated sentences is used, the same approach can also generate
the most probable semantic interpretation of an input sentence. The present
paper explains this semantic interpretation method, and summarizes the results
of a preliminary experiment. Semantic annotations were added to the syntactic
annotations of most of the sentences of the ATIS corpus. A data-oriented
semantic interpretation algorithm was successfully tested on this semantically
enriched corpus.
Comment: 10 pages, Postscript; to appear in Proceedings Workshop on
Corpus-Oriented Semantic Analysis, ECAI-96, Budapest
Automatic extraction of knowledge from web documents
A large amount of the digital information available is written as text documents in the form of web pages, reports, papers, emails, etc. Extracting the knowledge of interest from such documents, drawn from multiple sources, in a timely fashion is therefore crucial. This paper provides an update on the Artequakt system, which uses natural language tools to automatically extract knowledge about artists from multiple documents based on a predefined ontology. The ontology represents the type and form of knowledge to extract. This knowledge is then used to generate tailored biographies. The information extraction process of Artequakt is detailed and evaluated in this paper.
Artequakt: Generating tailored biographies from automatically annotated fragments from the web
The Artequakt project seeks to automatically generate narrative biographies of artists from knowledge that has been extracted from the Web and maintained in a knowledge base. An overview of the system architecture is presented here and the three key components of that architecture are explained in detail, namely knowledge extraction, information management and biography construction. Conclusions are drawn from the initial experiences of the project and future progress is detailed.
Constituent Structure for Filipino: Induction through Probabilistic Approaches
PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200
Data-Oriented Language Processing. An Overview
During the last few years, a new approach to language processing has started
to emerge, which has become known under various labels such as "data-oriented
parsing", "corpus-based interpretation", and "tree-bank grammar" (cf. van den
Berg et al. 1994; Bod 1992-96; Bod et al. 1996a/b; Bonnema 1996; Charniak
1996a/b; Goodman 1996; Kaplan 1996; Rajman 1995a/b; Scha 1990-92; Sekine &
Grishman 1995; Sima'an et al. 1994; Sima'an 1995-96; Tugwell 1995). This
approach, which we will call "data-oriented processing" or "DOP", embodies the
assumption that human language perception and production works with
representations of concrete past language experiences, rather than with
abstract linguistic rules. The models that instantiate this approach therefore
maintain large corpora of linguistic representations of previously occurring
utterances. When processing a new input utterance, analyses of this utterance
are constructed by combining fragments from the corpus; the
occurrence-frequencies of the fragments are used to estimate which analysis is
the most probable one.
In this paper we give an in-depth discussion of a data-oriented processing
model which employs a corpus of labelled phrase-structure trees. Then we review
some other models that instantiate the DOP approach. Many of these models also
employ labelled phrase-structure trees, but use different criteria for
extracting fragments from the corpus or employ different disambiguation
strategies (Bod 1996b; Charniak 1996a/b; Goodman 1996; Rajman 1995a/b; Sekine &
Grishman 1995; Sima'an 1995-96); other models use richer formalisms for their
corpus annotations (van den Berg et al. 1994; Bod et al., 1996a/b; Bonnema
1996; Kaplan 1996; Tugwell 1995).
Comment: 34 pages, Postscript
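The idea that fragment occurrence frequencies determine the most probable analysis can be illustrated with a small Python sketch. It assumes one common estimator: a fragment's probability is its count divided by the total count of fragments with the same root label, a derivation's probability is the product of its fragment probabilities, and an analysis sums over its derivations. The abstract does not specify the estimator, so this choice, the function names, and the toy counts are all assumptions for illustration.

```python
from collections import defaultdict
from math import prod


def fragment_probs(counts):
    """Map {(root, fragment_id): count} to relative frequencies per root label."""
    totals = defaultdict(int)
    for (root, _), count in counts.items():
        totals[root] += count
    return {key: count / totals[key[0]] for key, count in counts.items()}


def analysis_prob(derivations, probs):
    """Sum, over all derivations of one analysis, the product of fragment probabilities."""
    return sum(prod(probs[frag] for frag in deriv) for deriv in derivations)


# Hypothetical fragment counts, not taken from any real corpus.
counts = {
    ("S", "f1"): 1,
    ("S", "f2"): 3,
    ("NP", "f3"): 2,
    ("NP", "f4"): 2,
}
probs = fragment_probs(counts)

# Each analysis lists its derivations; each derivation lists the fragments combined.
analyses = {
    "A": [[("S", "f1"), ("NP", "f3")], [("S", "f2")]],
    "B": [[("S", "f1"), ("NP", "f4")]],
}
scores = {name: analysis_prob(derivs, probs) for name, derivs in analyses.items()}
best = max(scores, key=scores.get)
```

Enumerating derivations explicitly, as done here, is only feasible for toy inputs; the sketch shows the scoring, not how derivations are found.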
Financial news analysis using a semantic web approach
In this paper we present StockWatcher, an OWL-based web application that enables the extraction of relevant news items from RSS feeds concerning the NASDAQ-100 listed companies. The application's goal is to present a customized, aggregated view of the news categorized by different topics. We distinguish between four relevant news categories: i) news regarding the company itself, ii) news regarding direct competitors of the company, iii) news regarding important people of the company, and iv) news regarding the industry in which the company is active. At the same time, the system presented in this chapter is able to rate these news items based on their relevance. We identify three possible effects that a news message can have on the company, and thus on the stock price of that company: i) positive, ii) negative, and iii) neutral. Currently, StockWatcher provides support for the NASDAQ-100 companies. The selection of the relevant news items is based on a customizable user portfolio that may consist of one or more of these companies.