3 research outputs found

Structured Queries Over Web Text

Authors: Michael J. Cafarella, Oren Etzioni
Publication date: 04/12/2008

Abstract

The Web contains a vast amount of text that can only be queried using simple keywords-in, documents-out search queries. But Web text often contains structured elements, such as hotel location and price pairs embedded in a set of hotel reviews. Queries that process these structural text elements would be much more powerful than our current document-centric queries. Of course, text does not contain metadata or a schema, making it unclear what a structured text query means precisely. In this paper we describe three possible models for structured queries over text, each of which implies different query semantics and user interaction.

1 Introduction

The Web contains a vast amount of data, most of it in the form of unstructured text. Search engines are the standard tools for querying this text, and generally perform just one type of query. In response to a few keywords, a search engine will return a relevance-ranked list of documents. Unfortunately, treating Web text as nothing but a collection of standalone documents ignores a substantial amount of embedded structure that is obvious to every human reader, even if current search engines are blind to it. Web text often has a rich, though implicit, structure that deserves a correspondingly-rich set of query tools. For example, consider a website that contains hotel reviews. Although each review is a self-contained text that is authored by a single person, the set of all such reviews shows remarkable regularity in the type of information that is contained. Most reviews will contain standard values such as the hotel's price and general quality. Reviews of urban hotels might discuss how central the hotel's location is, and reviews of resorts will mention the quality of the beach and pool. In other words, the hotel reviews have a messy and implicit "schema" that is not designed by any database administrator but is nonetheless present

CiteSeerX

Structured Queries Over Web Text

Authors: Dan Suciu, Michael J. Cafarella, Oren Etzioni
Publication date: 02/04/2008

The Web contains a vast amount of text that can only be queried using simple keywords-in, documents-out search queries. But Web text often contains structured elements, such as hotel location and price pairs embedded in a set of hotel reviews. Queries that process these structural text elements would be much more powerful than our current document-centric queries. Of course, text does not contain metadata or a schema, making it unclear what a structured text query means precisely. In this paper we describe three possible models for structured queries over text, each of which implies different query semantics and user interaction.

CiteSeerX

Structured querying of web text: A technical challenge

Authors: Christopher Ré, Dan Suciu, Michael J. Cafarella, Michele Banko, Oren Etzioni
Publication date: 01/01/2007

The Web contains a huge amount of text that is currently beyond the reach of structured access tools. This unstructured data often contains a substantial amount of implicit structure, much of which can be captured using information extraction (IE) algorithms. By combining an IE system with an appropriate data model and query language, we could enable structured access to all of the Web’s unstructured data. We propose a general-purpose query system called the extraction database, or ExDB, which supports SQL-like structured queries over Web text. We also describe the technical challenges involved, motivated in part by our experiences with an early 90M-page prototype.

CiteSeerX