Search CORE

149 research outputs found

Querying websites using compact skeletons

Author: Rajaraman Anand
Ullman Jeffrey D.
Publication venue: Elsevier Science (USA).
Publication date: 30/06/2003

Abstract: Several commercial applications, such as online comparison shopping and process automation, require integrating information that is scattered across multiple websites or XML documents. Much research has been devoted to this problem, resulting in several research prototypes and commercial implementations. Such systems rely on wrappers that provide relational or other structured interfaces to websites. Traditionally, wrappers have been constructed by hand on a per-website basis, constraining the scalability of the system. We introduce a website structure inference mechanism called compact skeletons that is a step in the direction of automated wrapper generation. Compact skeletons provide a transformation from websites or other hierarchical data, such as XML documents, to relational tables. We study several classes of compact skeletons and provide polynomial-time algorithms and heuristics for automated construction of compact skeletons from websites. Experimental results show that our heuristics work well in practice. We also argue that compact skeletons are a natural extension of commercially deployed techniques for wrapper construction.

Elsevier - Publisher Connector

Query-driven document partitioning and collection selection

Author: Domenico Laforenza
Fabrizio Silvestri
Publication venue
Publication date: 01/01/2006

Abstract: We present a novel strategy to partition a document collection onto several servers and to perform effective collection selection. The method is based on the analysis of query logs. We propose a novel document representation called the query-vectors model. Each document is represented as a list recording the queries for which the document itself is a match, along with their ranks. To both partition the collection and build the collection selection function, we co-cluster queries and documents. The document clusters are then assigned to the underlying IR servers, while the query clusters represent queries that return similar results, and are used for collection selection. We show that this document partition strategy greatly boosts the performance of standard collection selection algorithms, including CORI, w.r.t. a round-robin assignment. Secondly, we show that by performing collection selection by matching the query to the existing query clusters and successively choosing only one server, we reach an average precision-at-5 of up to 1.74 and consistently improve CORI precision by a factor between 11% and 15%. As a side result, we show a way to select rarely asked-for documents. Separating these documents from the rest of the collection allows the indexer to produce a more compact index containing only relevant documents that are likely to be requested in the future. In our tests, around 52% of the documents (3,128,366) are not returned among the first 100 top-ranked results of any query.

CiteSeerX

8th SC@RUG 2011 proceedings:Student Colloquium 2010-2011

Author
Publication venue: Rijksuniversiteit Groningen. Universiteitsbibliotheek
Publication date: 01/01/2011

Dissertations of the University of Groningen

8th SC@RUG 2011 proceedings:Student Colloquium 2010-2011

Author
Publication venue: Rijksuniversiteit Groningen. Universiteitsbibliotheek
Publication date: 01/01/2011

ARTS repository - University of Groningen

8th SC@RUG 2011 proceedings:Student Colloquium 2010-2011

Author
Publication venue: Rijksuniversiteit Groningen. Universiteitsbibliotheek
Publication date: 01/01/2011

Proceedings - University of Groningen

Query-driven document partitioning and collection selection

Author: Diego Puppin
Domenico Laforenza
Fabrizio Silvestri
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2006

Crossref

Archivio della ricerca - Università di Roma La Sapienza

Rank-aware, Approximate Query Processing on the Semantic Web

Author: Wagner Andreas Josef
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2014

Search over the Semantic Web corpus frequently leads to queries with large result sets. To discover relevant data elements, users must therefore rely on ranking techniques that sort results according to their relevance. At the same time, applications often deal with information needs that do not require complete and exact results. In this thesis, we address the problem of how to process queries over Web data in an approximate and rank-aware fashion.

KITopen