16 research outputs found

    WARCreate: Create Wayback-Consumable WARC Files From Any Webpage

    The Internet Archive's Wayback Machine is the most common way that typical users interact with web archives. The Internet Archive uses the Heritrix web crawler to transform pages on the publicly available web into Web ARChive (WARC) files, which can then be accessed using the Wayback Machine. Because Heritrix can only access the publicly available web, many personal pages (e.g. password-protected pages, social media pages) cannot be easily archived into the standard WARC format. We have created a Google Chrome extension, WARCreate, that allows a user to create a WARC file from any webpage. Using this tool, content that might have been otherwise lost in time can be archived in a standard format by any user. This tool provides a way for casual users to easily create archives of personal online content. This is one of the first steps in resolving issues of long-term storage, maintenance, and access of personal digital assets that have emotional, intellectual, and historical value to individuals.
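For readers unfamiliar with the WARC format mentioned above, the following is a minimal sketch of how a single WARC/1.0 `response` record could be serialized. It is not WARCreate's implementation: the function name `build_warc_response_record` and the sample captured response are hypothetical, and a production-quality record would normally carry further headers (for example payload digests) that are omitted here.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record for a captured HTTP response.

    Hypothetical helper for illustration only.
    """
    warc_date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {warc_date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        # Content-Length counts only the bytes of the content block.
        f"Content-Length: {len(http_response)}",
    ]
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Every record is terminated by two CRLFs after the content block.
    return header_block + http_response + b"\r\n\r\n"


# Assumed sample data: the raw HTTP response as it was captured.
captured = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html><body>Hello, archive</body></html>"
)
record = build_warc_response_record("http://example.com/", captured)
```

Concatenating such records (usually preceded by a `warcinfo` record) yields a file that replay tools can index by `WARC-Target-URI` and `WARC-Date`.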

    Brass: A Queueing Manager for Warrick

    When an individual loses their website and a backup cannot be found, they can download and run Warrick, a web-repository crawler which will recover their lost website by crawling the holdings of the Internet Archive and several search engine caches. Running Warrick locally requires some technical know-how, so we have created an online queueing system called Brass which simplifies the task of recovering lost websites. We discuss the technical aspects of reconstructing websites and the implementation of Brass. Our newly developed system allows anyone to recover a lost website with a few mouse clicks and allows us to track which websites the public is most interested in saving.

    Personal Curation in a Museum

    An established body of work in CSCW and related communities studies social and cooperative interaction in museums and cultural heritage sites. A separate and growing body of research in these same communities is developing ways to understand the design and use of social media from a curating perspective. A curating perspective focuses on how social media is designed and used by people to develop and manage their own digital archives. This paper uses a cultural heritage museum as the empirical basis and setting, along with new information visualization methods we have developed, to better integrate these bodies of work and introduce the concept of personal curation: a socio-technical practice in which people collect, edit, and share information using personal information devices and social media as they move through physical environments rich with meaning potential. In doing so, this paper makes three contributions. First, it illustrates how to combine a spatial focus on people’s movement and interaction through the physical environment with an analysis of social media use in order to gain a deeper understanding of practices such as personal curation. Second, it shows in greater detail how visitors to museums and cultural heritage sites use and link digital information with physical information to shape others’ understandings of cultural heritage. Third, it suggests how museums and cultural heritage sites may leverage personal curation to support more expansive learning opportunities for visitors.

    Unruly Records: Personal Archives, Sociotechnical Infrastructure, and Archival Practice

    Personal records have long occupied a complicated space within archival theory and practice. The archival profession, as it is practiced in the United States today, developed with organizational records, such as those created by governments and businesses, in mind. Personal records were considered to fall beyond the bounds of archival work and were primarily cared for by libraries and other cultural heritage institutions. Since the mid-20th century, this divide has become less pronounced, and it has become common to find personal records within archival institutions. As a result of these conditions in the development of the profession, the archivists who work with personal records have had to reconcile the specific characteristics of personal materials with theoretical and practical approaches that were designed not only to accommodate organizational records but to explicitly exclude personal records. These conditions have been further complicated by the continually changing technological landscape in which personal records are now created. As ownership of personal computers, access to the World Wide Web, and the use of networked social platforms have grown, personal records have increasingly come to be created, stored, and accessed within complex socio-technical systems. The infrastructures that support personal digital record creation today precipitate new methods and strategies, and an abundance of new questions, for the archivists who are responsible for collecting and preserving digital cultural heritage. This dissertation considers how both the history of excluding personal records in the archival profession and the socio-technical systems that support contemporary personal record creation impact archival practice today. This research considers archival approaches to working with personal records created within three environments: personal computers, the open web, and networked social platforms. 
Ultimately, this dissertation seeks to reevaluate the role that personal records have previously occupied, and to center the personal in archival practice today.

    Identifying the Bounds of an Internet Resource

    Systems for retrieving or archiving Internet resources often assume a URL acts as a delimiter for the resource. But there are many situations where Internet resources do not have a one-to-one mapping with URLs. For URLs that point to the first page of a document that has been broken up over multiple pages, users are likely to consider the whole article as the resource, even though it is spread across multiple URLs. Comments, tags, ratings, and advertising might or might not be perceived as part of the resource whether they are retrieved as part of the primary URL or accessed via a link. Understanding what people perceive as part of a resource is necessary prior to developing algorithms to detect and make use of resource boundaries. A pilot study examined how content similarity, URL similarity, and the combination of the two matched human expectations. This pilot study showed that more nuanced techniques were needed that took into account the particular content and context of the resource and related content. Based on the lessons from the pilot study, a study was performed focused on two research questions: (1) how particular relationships between the content of pages affect expectations and (2) how encountered implementations of saving and perceptions of content value relate to the notion of Internet resource bounds. Results showed that human expectations are affected by expected relationships, such as two web pages showing parts of the same news article. They are also affected when two content elements are part of the same set of content, as is the case when two photos are presented as members of the same collection or presentation. Expectations were also affected by the role of the content – advertisements presented alongside articles or photos were less likely to be considered as part of a resource.
The exploration of web resource boundaries found that people’s assessments of resource bounds rely on understanding relationships between content fragments on the same web page and between content fragments on different web pages. These results were in the context of personal archiving scenarios. Would institutional archives have different expectations? A follow-on study gathered perceptions in the context of institutional archiving questions to explore whether such perceptions change based on whether the archive is for personal use or is institutional in nature. Results show that there are similar expectations for preserving continuations of the main content in personal and institutional archiving scenarios. Institutional archives are more likely to be expected to preserve the context of the main content, such as additional linked content, advertisements, and author information. This implies alternative resource bounds based on the type of content, relationships between content elements, and the type of archive in consideration. Based on the predictive features that were gathered, an automatic classifier for determining whether two pieces of content should be considered part of the same resource was designed. This classifier is an example of taking into account the features identified as important in the studies of human perceptions when developing techniques that bound materials captured during the archiving of online resources.
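The pilot study's baseline idea, combining content similarity with URL similarity, can be sketched as follows. This is an illustrative assumption rather than the study's actual method: the abstract does not say which similarity measures were used, so the Jaccard word overlap, the `difflib` ratio, the equal weighting, the 0.5 threshold, and all function names are hypothetical. The abstract also reports that such simple measures were not sufficient on their own.

```python
from difflib import SequenceMatcher


def content_similarity(text_a: str, text_b: str) -> float:
    """Jaccard overlap of the lowercase word sets of two texts."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def url_similarity(url_a: str, url_b: str) -> float:
    """Character-level similarity of two URLs, in the range [0, 1]."""
    return SequenceMatcher(None, url_a, url_b).ratio()


def same_resource(page_a: dict, page_b: dict, threshold: float = 0.5) -> bool:
    """Guess whether two pages belong to one resource.

    Each page is a dict with 'url' and 'text' keys (an assumed shape).
    The equal weights and the threshold are arbitrary illustration values.
    """
    score = (
        0.5 * content_similarity(page_a["text"], page_b["text"])
        + 0.5 * url_similarity(page_a["url"], page_b["url"])
    )
    return score >= threshold


# Hypothetical pages: two parts of one article, and an unrelated advertisement.
part_one = {"url": "http://example.com/story?page=1",
            "text": "the council voted on the budget"}
part_two = {"url": "http://example.com/story?page=2",
            "text": "the council budget vote continues"}
advert = {"url": "http://ads.example.net/banner/42",
          "text": "buy new shoes today"}
```

A score like this ignores the role of the content (article continuation versus advertisement), which is one of the features the later studies found to matter.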

    HTTP Mailbox - Asynchronous Restful Communication

    Traditionally, general web services used only the GET and POST methods of HTTP while several other HTTP methods like PUT, PATCH, and DELETE were rarely utilized. Additionally, the Web was mainly navigated by humans using web browsers and clicking on hyperlinks or submitting HTML forms. Clicking on a link always issues a GET request, while HTML forms only allow the GET and POST methods. Recently, several web frameworks/libraries have started supporting RESTful web services through APIs. To support HTTP methods other than GET and POST in browsers, these frameworks have used hidden HTML form fields as a workaround to convey the desired HTTP method to the server application. In such cases, the web server is unaware of the intended HTTP method because it receives the request as POST. Middleware between the web server and the application may override the HTTP method based on special hidden form field values. Unavailability of the servers is another factor that affects the communication. Because of the stateless and synchronous nature of HTTP, a client must wait for the server to be available to perform the task and respond to the request. Browser-based communication also suffers from cross-origin restrictions for security reasons. We describe HTTP Mailbox, a mechanism to enable RESTful HTTP communication in an asynchronous mode with a full range of HTTP methods otherwise unavailable to standard clients and servers. HTTP Mailbox also allows for multicast semantics via HTTP. We evaluate a reference implementation using ApacheBench (a server stress testing tool) demonstrating high throughput (on 1,000 concurrent requests) and a systemic error rate of 0.01%. Finally, we demonstrate our HTTP Mailbox implementation in a human-assisted Web preservation application called “Preserve Me!” and a visualization application called “Preserve Me! Viz”.
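The hidden-form-field workaround described above (not HTTP Mailbox itself) can be sketched as a small function that middleware might call before dispatching a request. The field name `_method` follows a convention used by some web frameworks and is an assumption here, as are the function name and the set of permitted overrides.

```python
# Methods a hidden form field is allowed to request (an assumed policy).
ALLOWED_OVERRIDES = {"PUT", "PATCH", "DELETE"}


def effective_method(request_method: str, form_fields: dict,
                     field_name: str = "_method") -> str:
    """Return the HTTP method the application should act on.

    The server sees a plain POST; the intended method travels in a hidden
    form field, and only POST requests are eligible for an override.
    """
    method = request_method.upper()
    if method != "POST":
        return method
    override = str(form_fields.get(field_name, "")).upper()
    return override if override in ALLOWED_OVERRIDES else "POST"


# Hypothetical form submission from a browser that can only send POST.
submitted = {"_method": "delete", "id": "42"}
intended = effective_method("POST", submitted)
```

Restricting overrides to POST requests and to an explicit allow-list keeps a link click (always GET) from being turned into a state-changing method.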

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most of the web archiving research has focused on crawling and preservation activities, with little focus on the delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users' needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services: ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in the web archives. The third level is the URI level, which focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. At this level, we define the web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.