Search CORE

15 research outputs found

Exploiting links and text structure on the Web : a quantitative approach to improving search quality

Author: Kohlschütter Christian
Publication venue: Hannover : Gottfried Wilhelm Leibniz Universität Hannover
Publication date: 01/01/2011

[no abstract]

Institutionelles Repositorium der Leibniz Universität Hannover

Utility analysis for topically biased PageRank

Author: Christian Kohlschütter
Publication venue: ACM Press
Publication date: 01/01/2007

PageRank is known to be an efficient metric for computing general document importance in the Web. While commonly used as a one-size-fits-all measure, the ability to produce topically biased ranks has not yet been fully explored in detail. In particular, it was still unclear to what granularity of “topic” the computation of biased page ranks makes sense. In this paper we present the results of a thorough quantitative and qualitative analysis of biasing PageRank on Open Directory categories. We show that the MAP quality of Biased PageRank generally increases with the ODP level up to a certain point, thus sustaining the usage of more specialized categories to bias PageRank on, in order to improve topic-specific search.
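As a rough illustration of what "biasing" PageRank on a topic means, the sketch below runs a standard power iteration in which teleportation jumps only to pages in a chosen topic set. It is a minimal sketch, not the paper's implementation: the function name `biased_pagerank`, the toy graph, the damping factor of 0.85, and the handling of dangling pages are all assumptions made for the example.

```python
def biased_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Power iteration where teleportation jumps only to topic_pages."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    topic = [n for n in nodes if n in topic_pages]
    if not topic:
        raise ValueError("topic_pages must contain at least one page of the graph")
    # Teleport vector: uniform over the topic set, zero elsewhere.
    teleport = {n: (1.0 / len(topic) if n in topic_pages else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                # Split this page's rank evenly over its out-links.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (assumption): redistribute along the teleport vector.
                for m in nodes:
                    new_rank[m] += damping * rank[n] * teleport[m]
        rank = new_rank
    return rank


# Hypothetical toy graph: page "d" links out but has no in-links.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = biased_pagerank(links, topic_pages={"a"})
uniform = biased_pagerank(links, topic_pages={"a", "b", "c", "d"})
```

With the bias set restricted to page `a`, a page that is reachable neither by links nor by teleportation (here `d`) ends up with zero rank, whereas the unbiased variant still assigns it a small share.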

CiteSeerX

Federated Search: Integration of FAST DataSearch and Lucene. Presentation slides from the Lucene Workshop on 24 January 2006 in Stuttgart

Author: Kohlschütter Christian
Publication date: 10/11/2008

Contents: - Motivation - Search systems used - Meta vs. federated search - Vector space model (VSM) - TFxIDF, "term frequency", "document frequency" - TFxIDF: distributed search - FAST Data Search - Lucene - A + B = ? - A + B = A ∪ B - Search engine plugin - Prototype - Demo - Next step

Südwestdeutscher Online Publikationsserver

Boilerplate Detection using Shallow Text Features

Author: Christian Kohlschütter
Peter Fankhauser
Wolfgang Nejdl
Publication date: 01/01/2010

In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text typically is not related to the main content, may deteriorate search precision and thus needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved, at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach by straightforward heuristics, achieving a remarkable accuracy.

CiteSeerX

Crossref

Enabling Federated Search with Heterogeneous Search Engines: Combining FAST Data Search and Lucene

Author: Chernov Sergey
Fehling Bernd
Kohlschütter Christian
Nejdl Wolfgang
Pieper Dirk
Summann Friedrich
Publication date: 01/01/2006

This report analyses Federated Search in the VASCODA context, specifically focusing on the existing TIB Hannover and UB Bielefeld search infrastructures. We first describe general requirements for a seamless integration of the two full-text search systems FAST (Bielefeld) and Lucene (Hannover), and evaluate possible scenarios, types of queries, and different ranking procedures. We then proceed to describe a Federated Search infrastructure to be implemented on top of these existing systems. An important feature of the proposed federation infrastructure is that participants do not have to change their existing search and cataloging systems. Communication within the federation is performed via additional plugins, which can be implemented by the participants, provided by search engine vendors or by a third party. When participating in the federation, all documents (both full-text and metadata) stay at the provider side; no library document or metadata exchange is necessary. The integration of collections is based on a common protocol, SDARTS, to be supported by every member of the search federation. SDARTS is a hybrid of the two protocols SDLIP and STARTS. SDLIP was developed by Stanford University, the University of California at Berkeley, the California Digital Library, and others. The STARTS protocol was designed in the Digital Library project at Stanford University and based on feedback from several search engine vendors. Additional advantages can be gained by agreeing on a common document schema, as proposed by the Vascoda initiative, though this is not a precondition for Federated Search.

Publications at Bielefeld University

A Plugin Architecture Enabling Federated Search for Digital Libraries

Author: Christian Kohlschütter
Sergey Chernov
Wolfgang Nejdl

Today, users expect a variety of digital libraries to be searchable from a single Web page. The German Vascoda project provides this service for dozens of information sources. Its ultimate goal is to provide search quality close to the ranking of a central database containing documents from all participating libraries. Currently, however, the Vascoda portal is based on a non-cooperative metasearch approach, where results from sources are merged randomly and ranking quality is sub-optimal. In this paper, we describe a Lucene-based plugin which replaces this method by a truly federated search across different search engines, where the exchange of document statistics improves document ranking. Preliminary evaluation results show ranking results equal to a centralized setup.
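The idea that exchanging document statistics makes rankings from different engines comparable can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the plugin described in the abstract: the engine names, the statistics, the smoothed IDF formula, and the helpers `idf`, `global_idf`, and `merge` are all hypothetical.

```python
import math


def idf(num_docs, doc_freq):
    """Smoothed inverse document frequency (one common variant, chosen here as an assumption)."""
    return math.log((1 + num_docs) / (1 + doc_freq)) + 1.0


def global_idf(term, stats):
    """IDF computed from collection statistics pooled across all engines."""
    total_docs = sum(s["num_docs"] for s in stats.values())
    total_df = sum(s["df"].get(term, 0) for s in stats.values())
    return idf(total_docs, total_df)


def merge(term, hits, stats):
    """Rescore hits from all engines with one shared IDF and sort by score."""
    weight = global_idf(term, stats)
    scored = [(doc_id, tf * weight) for doc_id, tf in hits]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical per-engine statistics: collection size and document frequencies.
stats = {
    "engine_a": {"num_docs": 1000, "df": {"lucene": 10}},
    "engine_b": {"num_docs": 9000, "df": {"lucene": 900}},
}
# Hypothetical hits as (document id, term frequency) pairs from both engines.
hits = [("a:1", 3), ("b:7", 5), ("b:2", 1)]

ranking = merge("lucene", hits, stats)
local_a = idf(1000, 10)    # weight engine_a would use on its own
local_b = idf(9000, 900)   # weight engine_b would use on its own
shared = global_idf("lucene", stats)
```

Because each engine alone would weight the same term differently (`local_a` versus `local_b`), locally computed scores are not directly comparable; the pooled weight `shared` is what a single central index over both collections would use under this formula.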

CiteSeerX