5 research outputs found
An Evaluation of Link Neighborhood Lexical Signatures to Rediscover Missing Web Pages
For discovering the new URI of a missing web page, lexical signatures, which
consist of a small number of words chosen to represent the "aboutness" of a
page, have been previously proposed. However, prior methods relied on computing
the lexical signature before the page was lost, or using cached or archived
versions of the page to calculate a lexical signature. We demonstrate a system
of constructing a lexical signature for a page from its link neighborhood, that
is, the "backlinks", or pages that link to the missing page. After testing
various methods, we show that one can construct a lexical signature for a
missing web page using only ten backlink pages. Further, we show that only the
first level of backlinks is useful in this effort. The text that the backlinks
use to point to the missing page is used as input for the creation of a
four-word lexical signature. That lexical signature is shown to successfully
find the target URI in over half of the test cases.
Comment: 24 pages, 13 figures, 8 tables, technical report
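The idea of deriving a four-word signature from backlink anchor text can be sketched roughly as follows. This is a minimal illustration, not the authors' method: it ranks terms by how many backlinks use them rather than by any weighting scheme described in the report, and the function name `lexical_signature`, the `STOPWORDS` list, and the sample anchor texts are all hypothetical.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not taken from the report).
STOPWORDS = {"a", "an", "and", "at", "the", "of", "to", "in", "for", "on",
             "is", "this", "here", "click"}


def lexical_signature(anchor_texts, size=4, max_backlinks=10):
    """Return the `size` terms used by the most backlinks.

    anchor_texts: one string per backlink page, holding the text that page
    uses to point to the missing page. Only the first `max_backlinks`
    entries are considered.
    """
    counts = Counter()
    for text in anchor_texts[:max_backlinks]:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Count each term once per backlink so one verbose page cannot dominate.
        counts.update({t for t in tokens if t not in STOPWORDS})
    # Sort by descending count, then alphabetically for deterministic output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:size]]


# Hypothetical anchor texts collected from four backlink pages.
anchors = [
    "Old Dominion University digital library",
    "ODU digital library home",
    "the digital library at Old Dominion",
    "digital library research group",
]
print(lexical_signature(anchors))
```

The resulting terms would then be submitted as a search engine query in the hope that the page's new URI appears among the top results.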
Using the Web Infrastructure for Real Time Recovery of Missing Web Pages
Given the dynamic nature of the World Wide Web, missing web pages, or 404 Page Not Found responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence it just needs to be (re-)discovered. We evaluate several methods for a just-in-time approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow including a set of parameters that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
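To make the notion of a half-life concrete, the sketch below converts a single observation of retained retrieval performance into a half-life. It assumes simple exponential decay, which the abstract does not state; the function name `half_life` and the example numbers are hypothetical.

```python
import math


def half_life(elapsed_years, retained_fraction):
    """Estimate a half-life, assuming exponential decay (an assumption).

    elapsed_years: age of the signature or title, in years (must be > 0).
    retained_fraction: share of the original retrieval performance still
    observed at that age, strictly between 0 and 1.
    """
    if elapsed_years <= 0:
        raise ValueError("elapsed_years must be positive")
    if not 0.0 < retained_fraction < 1.0:
        raise ValueError("retained_fraction must be strictly between 0 and 1")
    # From r = 0.5 ** (t / h), solve for h: h = t * ln(2) / ln(1 / r).
    return elapsed_years * math.log(2) / math.log(1.0 / retained_fraction)


# Hypothetical observation: performance falls to a quarter after two years.
print(half_life(2.0, 0.25))
```

Under this model, a method that keeps half of its performance after a given age has, by definition, a half-life equal to that age.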