
    CLEAR: a credible method to evaluate website archivability

    Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and influence web design professionals to consider the implications of their design decisions on the likelihood that their websites can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
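
    To make the facet-based idea behind CLEAR concrete, the sketch below scores a site from a few coarse, crawler-visible signals. It is only illustrative: the checks, their names, and the equal weighting are assumptions, not the published CLEAR facets or metrics.

        # Illustrative facet-style scoring; the checks and equal weighting are
        # assumptions, not the published CLEAR method.
        from urllib.parse import urljoin
        from urllib.request import Request, urlopen


        def head_ok(url: str) -> bool:
            """Return True if the URL answers a HEAD request with a 2xx status."""
            try:
                req = Request(url, method="HEAD", headers={"User-Agent": "wa-check"})
                with urlopen(req, timeout=10) as resp:
                    return 200 <= resp.status < 300
            except Exception:
                return False


        def archivability_score(site: str) -> float:
            """Average a few coarse archivability signals into a 0..1 score."""
            checks = {
                "reachable": head_ok(site),                           # crawler can fetch the page
                "robots_txt": head_ok(urljoin(site, "/robots.txt")),  # crawl guidance is present
                "sitemap": head_ok(urljoin(site, "/sitemap.xml")),    # link discovery aid is present
            }
            return sum(checks.values()) / len(checks)


        if __name__ == "__main__":
            print(archivability_score("https://example.com/"))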

    On the Change in Archivability of Websites Over Time

    As web technologies evolve, web archivists work to keep up so that our digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts that load data without a referential identifier or that require user interaction (e.g., content loading when the page has scrolled). These advances have made automated methods for capturing web pages more difficult. Because of the evolving schemes of publishing web pages, along with the progressive capability of web preservation tools, the archivability of pages on the web has varied over time. In this paper we show that the archivability of a web page can be deduced from the type of page being archived, which aligns with that page's accessibility with respect to dynamic content. We show concrete examples of when these technologies were introduced by referencing mementos of pages that have persisted through a long evolution of available technologies. Identifying the reasons these web pages could not be archived in the past, with respect to accessibility, serves as a guide for ensuring that content with longevity is published using good practice methods that make it available for preservation. Comment: 12 pages, 8 figures, Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta.
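
    The problem described above can be seen with a few lines of code: a crawler that does not execute scripts only sees the served markup, so a resource loaded by client-side script without a referential identifier never appears to it. The page URL and resource path below are hypothetical.

        # What a non-JavaScript crawler sees: only the served HTML. The page URL
        # and the Ajax-only resource path are hypothetical.
        from urllib.request import urlopen

        page_url = "https://example.com/gallery"     # hypothetical page
        ajax_resource = "/api/photos?page=1"         # hypothetical script-loaded resource

        with urlopen(page_url, timeout=10) as resp:
            raw_html = resp.read().decode("utf-8", errors="replace")

        # If the identifier never appears in the served markup, a crawler that does
        # not execute scripts has nothing to dereference, so the content is
        # unlikely to be captured.
        print("resource referenced in HTML:", ajax_resource in raw_html)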

    Scripts in a Frame: A Framework for Archiving Deferred Representations

    Web archives provide a view of the Web as seen by Web crawlers. Because of rapid advancements and adoption of client-side technologies like JavaScript and Ajax, coupled with the inability of crawlers to execute these technologies effectively, Web resources become harder to archive as they become more interactive. At Web scale, we cannot capture client-side representations using the current state-of-the-art toolsets because of the migration from Web pages to Web applications. Web applications increasingly rely on JavaScript and other client-side programming languages to load embedded resources and change client-side state. We demonstrate that Web crawlers and other automatic archival tools are unable to archive the resulting JavaScript-dependent representations (what we term deferred representations), resulting in missing or incorrect content in the archives and the general inability to replay the archived resource as it existed at the time of capture. Building on prior studies on Web archiving, client-side monitoring of events and embedded resources, and studies of the Web, we establish an understanding of the trends contributing to the increasing unarchivability of deferred representations. We show that JavaScript leads to lower-quality mementos (archived Web resources) due to the archival difficulties it introduces. We measure the historical impact of JavaScript on mementos, demonstrating that the increased adoption of JavaScript and Ajax correlates with the increase in missing embedded resources. To measure memento and archive quality, we propose and evaluate a metric to assess memento quality closer to Web users’ perception. We propose a two-tiered crawling approach that enables crawlers to capture embedded resources dependent upon JavaScript. Measuring the performance benefits between crawl approaches, we propose a classification method that mitigates the performance impacts of the two-tiered crawling approach, and we measure the frontier size improvements observed with the two-tiered approach. Using the two-tiered crawling approach, we measure the number of client-side states associated with each URI-R and propose a mechanism for storing the mementos of deferred representations. In short, this dissertation details a body of work that explores the following: why JavaScript and deferred representations are difficult to archive (establishing the term deferred representation to describe JavaScript-dependent representations); the extent to which JavaScript impacts archivability along with its impact on current archival tools; a metric for measuring the quality of mementos, which we use to describe the impact of JavaScript on archival quality; the performance trade-offs between traditional archival tools and technologies that better archive JavaScript; and a two-tiered crawling approach for discovering and archiving currently unarchivable descendants (representations generated by client-side user events) of deferred representations to mitigate the impact of JavaScript on our archives. In summary, what we archive is increasingly different from what we as interactive users experience. Using the approaches detailed in this dissertation, archives can create mementos closer to what users experience rather than archiving the crawlers’ experiences on the Web.
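
    The two-tiered idea can be sketched as follows: tier one records what is reachable from the served markup alone, and tier two executes the page in a headless browser and records every embedded resource it requests. Playwright is used below purely as a stand-in for headless tooling; it is not the toolchain evaluated in the dissertation, and the crawl target is hypothetical.

        # Sketch of a two-tiered capture: tier one fetches the raw markup, tier two
        # executes the page and collects the URIs of all requested resources.
        from urllib.request import urlopen
        from playwright.sync_api import sync_playwright


        def tier_one(url: str) -> str:
            """Fetch the raw markup as a non-JavaScript crawler would."""
            with urlopen(url, timeout=10) as resp:
                return resp.read().decode("utf-8", errors="replace")


        def tier_two(url: str) -> set[str]:
            """Execute the page in a headless browser and collect requested URIs."""
            requested: set[str] = set()
            with sync_playwright() as p:
                browser = p.chromium.launch()
                page = browser.new_page()
                page.on("request", lambda req: requested.add(req.url))
                page.goto(url, wait_until="networkidle")
                browser.close()
            return requested


        if __name__ == "__main__":
            url = "https://example.com/"   # hypothetical crawl target
            html = tier_one(url)
            deferred = {u for u in tier_two(url) if u not in html}
            print(f"{len(deferred)} resources only discoverable after script execution")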

    Aggregating Private and Public Web Archives Using the Mementity Framework

    Web archives preserve the live Web for posterity, but the content on the Web one cares about may not be preserved. The ability to access this content in the future requires the assurance that those sites will continue to exist on the Web until the content is requested and that the content will remain accessible. It is ultimately the responsibility of the individual to preserve this content, but attempting to replay personally preserved pages segregates archived pages by individuals and organizations of personal, private, and public Web content. This is misrepresentative of the Web as it was. While the Memento Framework may be used for inter-archive aggregation, no dynamics exist for the special consideration needed for the contents of these personal and private captures. In this work we introduce a framework for aggregating private and public Web archives. We introduce three mementities that serve the roles of the aforementioned aggregation, access control to personal Web archives, and negotiation of Web archives in dimensions beyond time, inclusive of the dimension of privacy. These three mementities serve as the foundation of the Mementity Framework. We investigate the difficulties and dynamics of preserving, replaying, aggregating, propagating, and collaborating with live Web captures of personal and private content. We offer a systematic solution to these outstanding issues through the application of the framework. We ensure the framework's applicability beyond the use cases we describe as well as the extensibility of reusing the mementities for currently unforeseen access patterns. We evaluate the framework by justifying the mementity design decisions, formulaically abstracting the anticipated temporal and spatial costs, and providing reference implementations, usage, and examples for the framework.
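
    The aggregation the mementities build on is Memento datetime negotiation (RFC 7089): a client asks a TimeGate for the capture of a resource closest to a desired datetime. The sketch below shows that negotiation against a public aggregator; the endpoint is an assumption for illustration, and private archives would additionally require the access-control mementity described above.

        # Sketch of Memento datetime negotiation (RFC 7089). The aggregator
        # endpoint and the target resource are assumptions for illustration.
        from urllib.request import Request, urlopen

        timegate = "http://timetravel.mementoweb.org/timegate/"   # assumed public aggregator
        target = "https://example.com/"                            # hypothetical original resource

        req = Request(
            timegate + target,
            headers={"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"},
        )
        with urlopen(req, timeout=15) as resp:
            # The TimeGate redirects to the selected memento; Memento-Datetime
            # states when that capture was made.
            print("memento URI:", resp.geturl())
            print("captured at:", resp.headers.get("Memento-Datetime"))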

    Recognizing Co-Creators in Four Configurations: Critical Questions for Web Archiving

    Four categories of co-creator shape web archivists' practice and influence the development of web archives: social forces, users and uses, subjects of web archives, and technical agents. This paper illustrates how these categories of co-creator overlap and interact in four specific web archiving contexts. It recommends that web archivists acknowledge this complex array of contributors as a way to imagine web archives differently. A critical approach to web archiving recognizes relationships and blended roles among stakeholders; seeks opportunities for non-extractive archival activity; and acknowledges the value of creative reuse as an important aspect of preservation.

    Patron Driven Acquisition of Publisher-hosted Content: Bypassing DRM

    Academic library patron driven acquisition (PDA) of ebooks on aggregator platforms is gaining steam. Although there are many advantages to this model, aggregated ebook content is still hampered by digital rights management (DRM). Publisher-hosted ebooks are often DRM-free, providing user-friendly access to ebook chapters that emulates ejournal article access. Librarians and libraries should build win-win-win partnerships with aggregators and publishers that facilitate centralized PDA on aggregator platforms and result in library ownership of purchased books on DRM-free publisher platforms. Ultimately, a simpler solution would be to reduce the restrictiveness of digital rights management on aggregator-hosted content, which might eventually happen. But can we afford to wait?

    COVID-19 Dashboard Functionality and Design: Assessing Dashboard Design Service Providers for Health Disaster Response

    When disaster strikes, data visualizations are used as quick ways to concisely distill timely information to civilians. Amidst the COVID-19 pandemic, data-driven dashboards played a disproportionately large role in quickly collecting, processing, and conveying preliminary data to citizens. After the Johns Hopkins COVID-19 dashboard went viral, individual public health departments across the world realized the importance of distilling and delivering real-time data to citizens and decision makers. The wide-scale proliferation of dashboards across emergency response groups has only recently been made possible by a software industry business model known as Platform as a Service (PaaS): PaaS providers supply the data hosting, application development, and graphical interfaces that let non-technical experts deploy dashboards without an extensive background in web development. What the PaaS providers offer in ease of use, however, is traded against their limitations in functionality and accessibility. In this thesis, I used content analysis to perform a systematic review of 24 international COVID-19 data dashboards to understand international variation in COVID-19 dashboard design and to offer feature recommendations for software companies to incorporate into their PaaS platforms.

    Archivability of Websites for Digital Preservation: A Study of the Health Field

    The concept of web archivability emerged in the search for a solution to understand why some resources present on websites cannot be archived by capture tools. This study proposes to initiate a discussion about the evaluation of website archivability, using the CLEAR+ method and the ArchiveReady tool, and to verify their applicability by identifying websites in the health field and running digital preservation tests through web archiving. The research was configured as a case study, with procedures involving bibliographic and documentary research as well as the use of software to identify the archivability of the sites. It is concluded that both the archivability tests and the web archiving tests point to few capture difficulties, and of small degree; it is therefore suggested that, to achieve better capture quality, conformance standards established by the World Wide Web Consortium be adopted in the production of websites.
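
    One concrete conformance check in the spirit of that recommendation is markup validation, since standards-compliant HTML is generally easier for capture tools to parse. The sketch below posts a page to the W3C Nu validator; the endpoint and the shape of its JSON response are assumptions here, and the site URL is hypothetical.

        # Sketch: validate a page's markup as one archivability signal. The
        # validator endpoint and JSON response shape are assumptions.
        import json
        from urllib.request import Request, urlopen

        page_url = "https://example.com/"     # hypothetical health-sector site

        with urlopen(page_url, timeout=10) as resp:
            html = resp.read()

        req = Request(
            "https://validator.w3.org/nu/?out=json",   # assumed validator endpoint
            data=html,
            headers={"Content-Type": "text/html; charset=utf-8",
                     "User-Agent": "archivability-check"},
        )
        with urlopen(req, timeout=30) as resp:
            report = json.load(resp)

        errors = [m for m in report.get("messages", []) if m.get("type") == "error"]
        print(f"{len(errors)} markup errors that could hinder capture")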

    Linked Research on the Decentralised Web

    This thesis is about research communication in the context of the Web. I analyse literature which reveals how researchers are making use of Web technologies for knowledge dissemination, as well as how individuals are disempowered by the centralisation of certain systems, such as academic publishing platforms and social media. I share my findings on the feasibility of a decentralised and interoperable information space where researchers can control their identifiers whilst fulfilling the core functions of scientific communication: registration, awareness, certification, and archiving. The contemporary research communication paradigm operates under a diverse set of sociotechnical constraints, which influence how units of research information and personal data are created and exchanged. Economic forces and non-interoperable system designs mean that researcher identifiers and research contributions are largely shaped and controlled by third-party entities; participation requires the use of proprietary systems. From a technical standpoint, this thesis takes a deep look at the semantic structure of research artifacts, and how they can be stored, linked and shared in a way that is controlled by individual researchers, or delegated to trusted parties. Further, I find that the ecosystem was lacking a technical Web standard able to fulfill the awareness function of research communication. Thus, I contribute a new communication protocol, Linked Data Notifications (published as a W3C Recommendation), which enables decentralised notifications on the Web, and provide implementations pertinent to the academic publishing use case. So far we have seen decentralised notifications applied in research dissemination and collaboration scenarios, as well as for archival activities and scientific experiments. Another core contribution of this work is a Web standards-based implementation of a client-side tool, dokieli, for decentralised article publishing, annotations and social interactions. dokieli can be used to fulfill the scholarly functions of registration, awareness, certification, and archiving, all in a decentralised manner, returning control of research contributions and discourse to individual researchers. The overarching conclusion of the thesis is that Web technologies can be used to create a fully functioning ecosystem for research communication. Using the framework of Web architecture, and loosely coupling the four functions, an accessible and inclusive ecosystem can be realised whereby users are able to use and switch between interoperable applications without interfering with existing data. Technical solutions alone do not suffice, of course, so this thesis also takes into account the need for a change in the traditional mode of thinking amongst scholars, and presents the Linked Research initiative as an ongoing effort toward researcher autonomy in a social system, and universal access to human- and machine-readable information. Outcomes of this outreach work so far include an increase in the number of individuals self-hosting their research artifacts, workshops publishing accessible proceedings on the Web, in-the-wild experiments with open and public peer review, and semantic graphs of contributions to conference proceedings and journals (the Linked Open Research Cloud).
    Some of the future challenges include: addressing the social implications of decentralised Web publishing, as well as the design of ethically grounded interoperable mechanisms; cultivating privacy-aware information spaces; personal or community-controlled on-demand archiving services; and further design of decentralised applications that are aware of the core functions of scientific communication.
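
    For readers unfamiliar with Linked Data Notifications, the flow it standardises is small: discover a resource's inbox from a Link header with rel http://www.w3.org/ns/ldp#inbox, then POST a JSON-LD notification to that inbox. The sketch below follows that flow; the target URL, payload, and naive Link-header parsing are illustrative assumptions.

        # Sketch of the LDN flow: inbox discovery, then notification delivery.
        # The target URL and notification payload are hypothetical.
        import json
        from urllib.request import Request, urlopen

        target = "https://example.org/article"        # hypothetical annotated article

        # 1. Inbox discovery: look for rel="http://www.w3.org/ns/ldp#inbox".
        with urlopen(Request(target, method="HEAD"), timeout=10) as resp:
            link_header = resp.headers.get("Link", "")

        inbox = None
        for part in link_header.split(","):
            if "http://www.w3.org/ns/ldp#inbox" in part:
                inbox = part.split(";")[0].strip().strip("<>")

        # 2. Delivery: POST the notification as JSON-LD to the discovered inbox.
        notification = {
            "@context": "https://www.w3.org/ns/activitystreams",
            "type": "Announce",
            "actor": "https://example.org/profile#me",     # hypothetical researcher WebID
            "object": "https://example.org/annotation/1",  # hypothetical annotation
            "target": target,
        }
        if inbox:
            req = Request(inbox, data=json.dumps(notification).encode("utf-8"),
                          headers={"Content-Type": "application/ld+json"})
            with urlopen(req, timeout=10) as resp:
                print("inbox accepted notification:", resp.status)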

    Learning from FITS: Limitations in use in modern astronomical research

    The Flexible Image Transport System (FITS) standard has been a great boon to astronomy, allowing observatories, scientists and the public to exchange astronomical information easily. The FITS standard, however, is showing its age. Developed in the late 1970s, the FITS authors made a number of implementation choices that, while common at the time, are now seen to limit its utility with modern data. The authors of the FITS standard could not anticipate the challenges we face today in astronomical computing. Difficulties we now face include, but are not limited to, addressing the need to handle an expanded range of specialized data product types (data models), being more conducive to the networked exchange and storage of data, handling very large datasets, and capturing significantly more complex metadata and data relationships. There are members of the community today who find some or all of these limitations unworkable, and have decided to move ahead with storing data in other formats. If this fragmentation continues, we risk abandoning the advantages of broad interoperability, and ready archivability, that the FITS format provides for astronomy. In this paper we detail selected important problems that exist within the FITS standard today. These problems may provide insight into deeper underlying issues which reside in the format, and we provide a discussion of some lessons learned. It is not our intention here to prescribe specific remedies to these issues; rather, it is to call the attention of the FITS and greater astronomical computing communities to these problems in the hope that it will spur action to address them.
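
    Some of the limitations described are visible simply by opening a file with a standard reader: FITS headers remain flat sequences of fixed-width 80-character cards with keywords capped at eight characters. The sketch below reads a header with astropy; the filename is hypothetical.

        # Sketch: inspect a FITS primary header with astropy (hypothetical file).
        from astropy.io import fits

        with fits.open("observation.fits") as hdul:
            header = hdul[0].header
            # FITS headers are flat lists of fixed-width 80-character "cards" with
            # keywords capped at 8 ASCII characters, which is one reason complex,
            # hierarchical metadata and rich data relationships are hard to express.
            for card in list(header.cards)[:5]:
                print(repr(card))
            data = hdul[0].data   # ndarray of the primary HDU, or None if absent
            print("primary HDU shape:", None if data is None else data.shape)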