23 research outputs found
Improving Contextual Suggestions using Open Web Domain Knowledge
Also published online by CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073)
Contextual suggestion aims at recommending items to users given their current context, such as location-based tourist recommendations.
Our contextual suggestion ranking model consists of two main components: selecting candidate suggestions and providing a ranked list of personalized suggestions. We focus on selecting appropriate suggestions from the ClueWeb12 collection using tourist domain knowledge inferred from social sites and resources available on the public Web (Open Web). Specifically, we generate two candidate subsets retrieved from the ClueWeb12 collection: one by filtering the content on mentions of the location context, and one by integrating domain knowledge derived from the Open Web. The impact of these candidate selection methods on contextual suggestion effectiveness is analyzed using the test collection constructed for the TREC Contextual Suggestion Track in 2014. Our main findings are, first, that contextual suggestion performance on the subset created using Open Web domain knowledge is significantly better than using only geographical information; and second, that using a prior probability estimated from domain knowledge leads to better suggestions and improves performance.
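The idea of combining a content-based relevance score with a prior probability estimated from domain knowledge can be sketched as follows. This is an illustrative log-linear combination with made-up names and numbers, not the paper's exact model:

```python
import math

def score(content_score: float, domain_prior: float) -> float:
    """Combine a query-dependent score with a document prior in log space.

    The prior is assumed to come from tourist domain knowledge, e.g. how
    strongly the page's host is associated with tourist sites (hypothetical).
    """
    return math.log(domain_prior) + content_score

# (content score, domain prior) per candidate -- toy values
candidates = {
    "attraction-page": (2.1, 0.30),
    "generic-page":    (2.3, 0.05),
}
ranked = sorted(candidates, key=lambda d: score(*candidates[d]), reverse=True)
print(ranked)
```

Even though the generic page has the slightly higher content score, the stronger domain prior ranks the attraction page first.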
Better contextual suggestions in ClueWeb12 using domain knowledge inferred from the open web
Proceedings of the 23rd Text Retrieval Conference (TREC 2014), held in Gaithersburg, Maryland, USA, in 2014
This paper provides an overview of our participation in the Contextual Suggestion Track. The TREC 2014 Contextual Suggestion Track allowed participants to submit personalized rankings using documents either from the Open Web or from an archived, static Web collection, the ClueWeb12 dataset. In this paper, we focus on filtering the entire ClueWeb12 collection to exploit domain knowledge from touristic sites available on the Open Web. We show that the recommendations generated for the provided user profiles and contexts improve significantly when using this inferred domain knowledge.
This research was supported by the Netherlands Organization for Scientific Research (NWO project #640.005.001).
Uncovering the unarchived web
Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling results in more information being harvested than just the websites intended for preservation, which could be used to reconstruct impressions of pages that existed on the live web of the crawl date, but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's "aura": the web documents that were not included in the archived collection, but are known to have existed, due to their mentions on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text, and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
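The core step of uncovering a collection's "aura" can be sketched as follows: walk the outlinks of crawled pages, keep targets that were never crawled, and collect their anchor texts as a minimal representation. The data and structure here are invented for illustration, not the archive's actual pipeline:

```python
# crawled page -> list of (link target, anchor text) pairs (toy data)
crawled = {
    "http://example.org/home": [
        ("http://example.org/about", "about us"),
        ("http://lost.example.net/page", "city museum opening hours"),
    ],
    "http://example.org/about": [
        ("http://lost.example.net/page", "museum"),
    ],
}

archived = set(crawled)           # URLs we actually hold content for
aura: dict[str, list[str]] = {}   # unarchived URL -> collected anchor texts

for page, outlinks in crawled.items():
    for target, anchor in outlinks:
        if target not in archived:
            aura.setdefault(target, []).append(anchor)

print(aura)
```

In a real setting, the crawl date distribution and link structure would be combined with these anchors to enrich each representation.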
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
This paper provides an overview of our participation in the Contextual Suggestion Track. The TREC 2014 Contextual Suggestion Track allowed participants to submit personalized rankings using documents either from the Open Web or from an archived, static Web collection, ClueWeb12. One of the main steps in recommending attractions for a particular user in a given context is the selection of candidate documents. This task is more challenging when relying on the ClueWeb12 collection rather than public tourist APIs for finding suggestions. In this paper, we present our approach for selecting candidate suggestions from the entire ClueWeb12 collection using the tourist domain knowledge available on the Open Web. We show that the recommendations generated for the provided user profiles and contexts improve significantly when using this inferred domain knowledge.
Column Stores as an IR Prototyping Tool
We suggest that, instead of implementing custom index structures and query evaluation algorithms, IR researchers simply store document representations in a column-oriented relational database and write ranking models in SQL. For rapid prototyping this is particularly advantageous, since researchers can explore new ranking functions and features by issuing SQL queries, without needing to write imperative code. We demonstrate the feasibility of this approach with an implementation of conjunctive BM25 using MonetDB on part of the ClueWeb12 collection.
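A minimal sketch of this "ranking models in SQL" idea, using SQLite instead of MonetDB for portability. The schema, constants, and toy data are assumptions for illustration, and the per-term idf factor is omitted for brevity; the HAVING clause enforces the conjunctive semantics:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE terms(term TEXT, docid INTEGER, tf INTEGER);
    CREATE TABLE docs(docid INTEGER PRIMARY KEY, len INTEGER);
    INSERT INTO docs VALUES (1, 100), (2, 150), (3, 90);
    INSERT INTO terms VALUES
        ('web', 1, 3), ('archive', 1, 2),
        ('web', 2, 1),
        ('web', 3, 2), ('archive', 3, 5);
""")

k1, b, avg_len = 1.2, 0.75, 113.33
query = ("web", "archive")          # conjunctive: doc must contain both

rows = con.execute("""
    SELECT t.docid,
           SUM(t.tf * (? + 1) /
               (t.tf + ? * (1 - ? + ? * d.len / ?))) AS score
    FROM terms t JOIN docs d ON t.docid = d.docid
    WHERE t.term IN (?, ?)
    GROUP BY t.docid
    HAVING COUNT(DISTINCT t.term) = 2  -- conjunctive semantics
    ORDER BY score DESC
""", (k1, b, b, b, avg_len, *query)).fetchall()

for docid, score in rows:
    print(docid, round(score, 3))
```

Document 2 is filtered out by the HAVING clause because it lacks one of the query terms; the others are ranked by their (idf-free) BM25 term-frequency component.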
Lost but not forgotten: finding pages on the unarchived web
Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page- and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
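The host-level aggregation step described above can be sketched as grouping page-level anchor texts by hostname. The data here is invented for the example; in practice the aggregation runs over millions of uncovered URLs:

```python
from urllib.parse import urlparse

# page-level anchor-text representations of unarchived pages (toy data)
page_anchors = {
    "http://lost.example.net/museum": ["city museum", "opening hours"],
    "http://lost.example.net/park": ["central park"],
    "http://other.example.com/": ["other site"],
}

# aggregate evidence to the host level for a richer representation
host_repr: dict[str, list[str]] = {}
for url, anchors in page_anchors.items():
    host = urlparse(url).netloc
    host_repr.setdefault(host, []).extend(anchors)

print(host_repr["lost.example.net"])
```

The host-level entry pools anchors from all of its pages, which is why the paper finds these representations significantly richer than page-level ones.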
Analyzing the influence of bigrams on retrieval bias and effectiveness
Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid out the foundations for investigating the relationship between retrieval effectiveness and retrieval bias. While various factors influencing bias have been examined, there has been no work examining the impact of using bigrams in the index on retrieval bias. Intuitively, how the documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the bias of a system changes depending on whether the documents are represented using unigrams, bigrams, or both. Our analysis of three different retrieval models on three TREC collections shows that a bigram-only representation results in the lowest bias compared to a unigram-only representation, but at the expense of retrieval effectiveness. However, combining both representations reduces the overall bias while also increasing effectiveness. These findings suggest that, when configuring and indexing the collection, the bag-of-words (unigram) representation should be augmented with bigrams to create better and fairer retrieval systems.
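The three document representations being compared can be sketched in a few lines; the tokenization and the `_`-joined bigram form are illustrative choices, not the paper's exact indexing configuration:

```python
def unigrams(text: str) -> list[str]:
    """Bag-of-words representation: individual lowercase tokens."""
    return text.lower().split()

def bigrams(text: str) -> list[str]:
    """Adjacent token pairs, joined into single index terms."""
    toks = unigrams(text)
    return [f"{a}_{b}" for a, b in zip(toks, toks[1:])]

doc = "web archives preserve the web"
combined = unigrams(doc) + bigrams(doc)  # the augmented representation
print(combined)
```

Indexing `combined` rather than `unigrams(doc)` alone is the configuration the paper finds both less biased and more effective.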
Querylog-based assessment of retrievability bias in a large newspaper corpus
Bias in the retrieval of documents can directly influence the information access of a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability of all documents in a collection. Previous evaluations have been performed on TREC collections using simulated query sets. The question remains, however, how representative this approach is of more realistic settings. To address this question, we investigate the effectiveness of the retrievability measure using a large digitized newspaper corpus, featuring two characteristics that distinguish our experiments from previous studies: (1) compared to TREC collections, our collection contains noise originating from OCR processing, historical spelling, and use of language; and (2) instead of simula