2,117 research outputs found

    Moved But Not Gone: An Evaluation of Real-Time Methods for Discovering Replacement Web Pages

    Inaccessible Web pages and 404 “Page Not Found” responses are a common Web phenomenon and a detriment to the user’s browsing experience. The rediscovery of missing Web pages is, therefore, a relevant research topic in both digital preservation and Information Retrieval. In this article, we bring these two areas together by analyzing four content- and link-based methods to rediscover missing Web pages. We investigate the retrieval performance of the methods individually as well as in combination and give an insight into how effective these methods are over time. As the main result of this work, we are able to recommend not only the best performing methods but also the sequence in which they should be applied, based on their performance, the complexity required to generate them, and their evolution over time. Our least complex single method results in a rediscovery rate of almost 70% of Web pages in our sample dataset, which is based on URIs sampled from the Open Directory Project (DMOZ). By increasing the complexity level and combining three different methods, our results show an increase of the success rate up to 77%. The results, based on our sample dataset, indicate that Web pages are often not completely lost but have moved to a different location and just need to be rediscovered.

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Given the dynamic nature of the World Wide Web, missing web pages, or 404 "Page Not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence it just needs to be (re-)discovered. We evaluate several methods for a "just-in-time" approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow including a set of parameters that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
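
As a rough illustration of the lexical-signature idea mentioned in the abstract above, the sketch below ranks a page's terms by TF-IDF weight and keeps the top few as a search query. It is not the authors' implementation: the function names, the smoothing, the tokenizer, and the toy background corpus are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF weight.

    `corpus` is a list of background documents used only to estimate
    document frequency (an assumption of this sketch).
    """
    docs = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(page))
    n = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)
        # Add-one smoothing keeps the IDF finite for unseen terms.
        idf = math.log((n + 1) / (df + 1))
        scores[term] = count * idf
    # Sort by descending score, then alphabetically for a stable order.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


if __name__ == "__main__":
    background = [
        "the web is large",
        "the page is missing",
        "the archive holds the page",
    ]
    cached_copy = "the memento archive stores the memento of the page"
    print(" ".join(lexical_signature(cached_copy, background, k=3)))
```

In a recovery setting, the resulting terms would be submitted to a search engine as a query; terms that appear in every background document (such as "the" here) receive zero weight and fall out of the signature.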

    A Perspective on Resource Synchronization

    Web applications frequently leverage resources made available by remote web servers. As resources are created, updated, deleted, or moved, these applications face challenges to remain in lockstep with changes on the server. Several approaches exist to help meet this challenge for use cases where "good enough" synchronization is acceptable. But when strict resource coverage or low synchronization latency is required, commonly accepted Web-based solutions remain elusive. This paper provides a perspective on the resource synchronization problem that results from inspiration gained from prior work, and initial insights resulting from the recently launched NISO/OAI ResourceSync effort.

    Identifying the Bounds of an Internet Resource

    Systems for retrieving or archiving Internet resources often assume a URL acts as a delimiter for the resource. But there are many situations where Internet resources do not have a one-to-one mapping with URLs. For URLs that point to the first page of a document that has been broken up over multiple pages, users are likely to consider the whole article as the resource, even though it is spread across multiple URLs. Comments, tags, ratings, and advertising might or might not be perceived as part of the resource, whether they are retrieved as part of the primary URL or accessed via a link. Understanding what people perceive as part of a resource is necessary prior to developing algorithms to detect and make use of resource boundaries. A pilot study examined how content similarity, URL similarity, and the combination of the two matched human expectations. This pilot study showed that more nuanced techniques were needed that took into account the particular content and context of the resource and related content. Based on the lessons from the pilot study, a study was performed focused on two research questions: (1) how particular relationships between the content of pages affect expectations and (2) how encountered implementations of saving and perceptions of content value relate to the notion of internet resource bounds. Results showed that human expectations are affected by expected relationships, such as two web pages showing parts of the same news article. They are also affected when two content elements are part of the same set of content, as is the case when two photos are presented as members of the same collection or presentation. Expectations were also affected by the role of the content: advertisements presented alongside articles or photos were less likely to be considered as part of a resource.
The exploration of web resource boundaries found that people’s assessments of resource bounds rely on understanding relationships between content fragments on the same web page and between content fragments on different web pages. These results were in the context of personal archiving scenarios. Would institutional archives have different expectations? A follow-on study gathered perceptions in the context of institutional archiving questions to explore whether such perceptions change based on whether the archive is for personal use or is institutional in nature. Results show that there are similar expectations for preserving continuations of the main content in personal and institutional archiving scenarios. Institutional archives are more likely to be expected to preserve the context of the main content, such as additional linked content, advertisements, and author information. This implies alternative resource bounds based on the type of content, relationships between content elements, and the type of archive in consideration. Based on the predictive features that were gathered, an automatic classifier for determining whether two pieces of content should be considered part of the same resource was designed. This classifier is an example of taking into account the features identified as important in the studies of human perceptions when developing techniques that bound materials captured during the archiving of online resources.

    Mining activity clusters from low-level event logs


    Detecting, Modeling, and Predicting User Temporal Intention

    The content of social media has grown exponentially in recent years, and its role has evolved from narrating life events to actually shaping them. Unfortunately, content posted and shared in social networks is vulnerable and prone to loss or change, rendering the context associated with it (a tweet, post, status, or others) meaningless. There is an inherent value in maintaining the consistency of such social records, as in some cases they take over the task of being the first draft of history: collections of these social posts narrate the pulse of the street during historic events such as protests, riots, elections, wars, and disasters, as shown in this work. The user sharing the resource has an implicit temporal intent: either the state of the resource at the time of sharing, or the current state of the resource at the time of the reader "clicking". In this research, we propose a model to detect and predict the temporal intention of the author upon sharing content in the social network and of the reader upon resolving this content. To build this model, we first examine the three aspects of the problem: the resource, time, and the user. For the resource, we start by analyzing the content on the live web and its persistence. We noticed that a portion of the resources shared in social media disappear, and with further analysis we unraveled a relationship between this disappearance and time. We lose around 11% of the resources after one year of sharing and a steady 7% every following year. With this, we turn to the public archives, and our analysis reveals that not all posted resources are archived, and even if they were, an average of 8% per year disappears from the archives, and in some cases the archived content is heavily damaged. These observations show that the archives are not populated well enough to consistently and reliably reconstruct the missing resource as it existed at the time of sharing.
To analyze the concept of time, we devised several experiments to estimate the creation date of the shared resources. We developed Carbon Date, a tool which successfully estimated the correct creation dates for 76% of the test sets. We also wanted to measure whether and how resources change over time after their creation. We conducted a longitudinal study on a data set of very recently published tweet-resource pairs and recorded observations hourly. We found that after just one hour, ~4% of the resources had changed by ≥30%, while after a day the rate of change slowed, with ~12% of the resources having changed by ≥40%. For the third and final component of the problem, we conducted user behavioral analysis experiments and built a data set of 1,124 instances manually assigned by test subjects. Temporal intention proved to be a difficult concept for average users to understand. We developed our Temporal Intention Relevancy Model (TIRM) to transform the highly subjective temporal intention problem into the more easily understood idea of relevancy between a tweet and the resource it links to, and change of the resource through time. On our collected data set, TIRM produced a significant 90.27% success rate. Furthermore, we extended TIRM and used it to build a time-based model to predict temporal intention change or steadiness at the time of posting with 77% accuracy. We built a service API around this model to provide predictions, along with a few prototypes. Future tools could implement TIRM to assist users in pushing copies of shared resources into public web archives to ensure the integrity of the historical record. Additional tools could be used to assist the mining of the existing social media corpus by dereferencing the intended version of the shared resource based on the intention strength and the time between the tweeting and mining.
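
The loss figures quoted in the abstract above (around 11% after the first year and a steady 7% in each following year) can be turned into a small back-of-the-envelope calculation. The sketch below assumes a linear reading of those figures, with the 7% counted in percentage points of the originally shared set; that reading, the interpolation within the first year, and the function name are assumptions of this example and are not taken from the work itself.

```python
def fraction_lost(years, first_year=0.11, per_year=0.07):
    """Estimate the fraction of shared resources missing after `years` years.

    Assumes a linear model: `first_year` of the resources are lost during
    the first year (interpolated linearly within it) and a further
    `per_year` in each later year. The result is capped at 1.0.
    """
    if years <= 0:
        return 0.0
    if years <= 1:
        return first_year * years
    return min(1.0, first_year + per_year * (years - 1))


if __name__ == "__main__":
    for t in (1, 2, 3, 5):
        print(f"after {t} year(s): {fraction_lost(t):.0%} estimated lost")
```

Under these assumptions, roughly a quarter of the shared resources would be missing three years after sharing (0.11 + 2 × 0.07 = 0.25).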

    Lincoln Upper Elementary School Washington, Iowa I-WALK Report Spring 2012

    In the past three decades, the number of obese and overweight individuals in Iowa and across the nation has skyrocketed. With obesity comes a greater risk of health complications and reduced life expectancy. As a result, the current generation of youth faces a new and growing threat to their overall quality of life. In Iowa alone, 37.1% of 3rd grade students are identified as either overweight or obese.