56 research outputs found

    Models and methods for web archive crawling

    Web archives offer a rich and plentiful source of information to researchers, analysts, and legal experts. For this purpose, they gather Web sites as the sites change over time. To maintain high standards of data quality, a Web archive would have to collect every version of every Web site, which is impossible due to limited resources and technical constraints. Web archives therefore consist of versions archived at various points in time, with no guarantee of mutual consistency. This thesis presents a model for assessing the data quality of Web archives as well as a family of crawling strategies that yield high-quality captures. We distinguish between single-visit crawling strategies for exploratory purposes and visit-revisit crawling strategies for evidentiary purposes. Single-visit strategies download every page exactly once, aiming for an “undistorted” capture of the ever-changing Web; the quality of the resulting capture is expressed by the “blur” measure. In contrast, visit-revisit strategies download every page twice: the initial downloads of all pages form the visit phase of the crawling strategy, and the second downloads are grouped together in the revisit phase. These two phases enable us to check which pages changed during the crawling process and thus to identify the pages that are consistent with each other. The quality of visit-revisit captures is expressed by the “coherence” measure. The quality-conscious strategies are based on predictions of the change behaviour of individual pages: we model Web site dynamics by Poisson processes with page-specific change rates, and we show that these rates can be predicted statistically. Finally, we propose visualization techniques for exploring the quality of the resulting Web archives. A fully functional prototype demonstrates the practical viability of our approach.
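    The change model referred to in the abstract admits a compact statement. The following is a minimal sketch, assuming the standard memoryless Poisson formulation with notation introduced here purely for illustration (a page-specific change rate \lambda_p, an interval of length \Delta, and n_p changes fully observed over a span T_p); the symbols are not taken from the thesis itself:

    P\bigl[\text{page } p \text{ changes in } (t, t+\Delta]\bigr] \;=\; 1 - e^{-\lambda_p \Delta},
    \qquad
    \hat{\lambda}_p \;=\; \frac{n_p}{T_p}

    Under this sketch, a quality-conscious crawler would schedule downloads using \hat{\lambda}_p, giving priority to pages whose predicted probability of having changed since the last visit is highest.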

    CLEAR: a credible method to evaluate website archivability

    Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website that are crucial in diagnosing whether it can be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and influence web design professionals to consider the implications of their design decisions on the likelihood that their sites can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
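    To illustrate the kind of checks such an evaluation can automate, here is a small, hypothetical Python sketch that probes a site for a few archivability signals (robots.txt, a sitemap, cache-validation headers). The specific checks and the equal weighting are assumptions made for illustration; they are not the CLEAR facets or its scoring.

    # Hypothetical archivability probe; checks and weighting are illustrative
    # assumptions, not the CLEAR method's actual facets or scores.
    from urllib.parse import urljoin
    import requests

    def archivability_signals(site_url: str) -> dict:
        signals = {}
        # Crawler access: a reachable robots.txt tells archival crawlers the rules.
        robots = requests.get(urljoin(site_url, "/robots.txt"), timeout=10)
        signals["has_robots_txt"] = robots.status_code == 200
        # Discoverability: a sitemap helps a crawler find all pages.
        sitemap = requests.get(urljoin(site_url, "/sitemap.xml"), timeout=10)
        signals["has_sitemap"] = sitemap.status_code == 200
        # Transparency: does the server expose caching/validation metadata?
        home = requests.get(site_url, timeout=10)
        signals["has_validators"] = "Last-Modified" in home.headers or "ETag" in home.headers
        signals["reachable"] = home.status_code == 200
        return signals

    def naive_score(signals: dict) -> float:
        # Equal weighting of all signals -- purely an illustrative choice.
        return sum(signals.values()) / len(signals)

    if __name__ == "__main__":
        s = archivability_signals("https://example.org")
        print(s, naive_score(s))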

    Archiving the Relaxed Consistency Web

    The historical, cultural, and intellectual importance of archiving the web has been widely recognized. Today, all countries with high Internet penetration rates have established high-profile archiving initiatives to crawl and archive the fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of relaxed-consistency web design on crawler-driven web archiving. Relaxed-consistency websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in web archives as historical records, such information will degrade the overall archival quality. To assess the extent of such quality degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed-consistency web archive may contain observable inconsistency, and the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of such quality degradation and propose a few possible remedies. Comment: 10 pages, 6 figures, CIKM 201
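    A minimal sketch of the feed-following idea behind the paper's experiment is given below; all parameters (posting rate, propagation delay, crawl interval) are invented for illustration and are not the paper's synthetic workload. A producer posts items, follower feeds converge only after a propagation delay, and a crawler that captures a follower page during that window records stale content.

    # Illustrative toy simulation of a relaxed-consistency feed and a crawler.
    # All parameters are invented; this is not the paper's workload or model.
    import random

    PROPAGATION_DELAY = 5.0   # seconds until a post is visible in follower feeds
    SIM_DURATION = 10_000.0   # simulated seconds
    POST_RATE = 0.05          # posts per second (exponential inter-post gaps)
    CRAWL_INTERVAL = 60.0     # crawler captures the follower page every 60 s

    def simulate(seed: int = 0) -> float:
        rng = random.Random(seed)
        # Times at which the producer posts new items.
        posts, t = [], 0.0
        while t < SIM_DURATION:
            t += rng.expovariate(POST_RATE)
            posts.append(t)
        # A capture is "stale" if some post exists but has not yet propagated
        # to the follower feed at capture time.
        captures = int(SIM_DURATION // CRAWL_INTERVAL)
        stale = 0
        for i in range(1, captures + 1):
            capture_time = i * CRAWL_INTERVAL
            if any(p <= capture_time < p + PROPAGATION_DELAY for p in posts):
                stale += 1
        return stale / captures

    if __name__ == "__main__":
        print(f"fraction of stale captures: {simulate():.3f}")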

    Delivering computationally-intensive digital patient applications to the clinic: An exemplar solution to predict femoral bone strength from CT data

    Background and objective: Whilst fragility hip fractures commonly affect elderly people, often causing permanent disability or death, they are rarely addressed in advance through preventive techniques. Quantification of bone strength can help to identify subjects at risk, thus reducing the incidence of fractures in the population. In recent years, researchers have shown that finite element models (FEMs) of the hip joint, derived from computed tomography (CT) images, can predict bone strength more accurately than other techniques currently used in the clinic. The specialised hardware and trained personnel required to perform such analyses, however, limit the widespread adoption of FEMs in clinical contexts. In this manuscript we present CT2S (Computed Tomography To Strength), a system developed in collaboration between The University of Sheffield and Sheffield Teaching Hospitals, designed to streamline access to this complex workflow for clinical end-users. Methods: The system relies on XNAT and makes use of custom apps based on open source software. Available through a website, it allows doctors in the healthcare environment to benefit from FE-based bone strength estimation without being exposed to the technical aspects, which are concealed behind a user-friendly interface. Clinicians request the analysis of a patient's CT scans through the website. Using XNAT functionality, the anonymised images are automatically transferred to the University research facility, where an operator processes them and estimates the bone strength through FEM using a combination of open source and commercial software. Following the analysis, the doctor is provided with the results in a structured report. Results: The platform, currently available for research purposes, has been deployed and fully tested in Sheffield, UK. The entire analysis requires processing times ranging from 3.5 to 8 h, depending on the available computational power. Conclusions: The short processing time makes the system compatible with current clinical workflows. The use of open source software and the accurate description of the workflow given here facilitate deployment in other centres.
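    As a rough illustration of the anonymise-and-transfer step described above (not the actual CT2S implementation), the following Python sketch anonymises a CT series with pydicom and pushes it to an XNAT-style archive over REST. The server URL, project name, credentials and import endpoint are placeholders and assumptions.

    # Illustrative sketch only: anonymise a CT series and push it to an XNAT-like
    # archive over REST. Endpoint path, project name and auth are assumed.
    from pathlib import Path
    import pydicom
    import requests

    XNAT_URL = "https://xnat.example.org"      # placeholder server
    PROJECT = "CT2S_DEMO"                      # placeholder project

    def anonymise(dicom_path: Path, out_dir: Path) -> Path:
        ds = pydicom.dcmread(dicom_path)
        # Strip direct identifiers before the images leave the hospital network.
        ds.PatientName = "ANONYMOUS"
        ds.PatientID = "ANON-0001"
        ds.PatientBirthDate = ""
        out = out_dir / dicom_path.name
        ds.save_as(out)
        return out

    def upload(file_path: Path, session: requests.Session) -> None:
        # Assumed import endpoint; a real XNAT deployment exposes its own import service.
        with open(file_path, "rb") as fh:
            resp = session.post(
                f"{XNAT_URL}/data/services/import",
                params={"project": PROJECT, "inbody": "true"},
                data=fh,
            )
        resp.raise_for_status()

    if __name__ == "__main__":
        out_dir = Path("anonymised"); out_dir.mkdir(exist_ok=True)
        with requests.Session() as s:
            s.auth = ("username", "password")  # placeholder credentials
            for dcm in Path("ct_series").glob("*.dcm"):
                upload(anonymise(dcm, out_dir), s)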

    Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

    Many web sites are transitioning how they construct their pages. In the conventional model, the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format) and then builds out the DOM client-side, which makes it easier to refresh the content of a page periodically and to modify it dynamically. This client-side rendering, now predominant in social media platforms such as Twitter and Instagram, is also being adopted by news outlets, such as CNN.com. When conventional web archiving techniques, such as crawling with Heritrix, are applied to pages that render their content client-side, the JSON responses can become out of sync with the HTML page in which they are to be embedded, resulting in temporal violations on replay. Because the violative JSON is not directly observable in the page (i.e., in the same manner that a violative embedded image is), these temporal violations can be difficult to detect. We describe how the top-level CNN.com page has used client-side rendering since April 2015 and the impact this has had on web archives. Between April 24, 2015 and July 21, 2016, we found almost 15,000 mementos with a temporal violation of more than 2 days between the base CNN.com HTML and the JSON responses used to deliver the content under the main story. One way to mitigate this problem is to use browser-based crawling instead of conventional crawlers like Heritrix, but browser-based crawling is currently much slower than non-browser-based tools such as Heritrix. Comment: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL
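    To make the notion of a temporal violation concrete, here is a small, hypothetical Python sketch that compares the Memento-Datetime (RFC 7089) of an archived HTML page with the Memento-Datetimes of the archived JSON resources it loads. The URI-Ms below are placeholders; a real analysis would discover the JSON URI-Ms by replaying the page in a browser and logging its API calls, which is not shown here.

    # Hypothetical check of HTML-vs-JSON temporal drift in an archive replay.
    # The memento URLs are placeholders, not mementos from the study.
    from datetime import datetime
    from email.utils import parsedate_to_datetime
    import requests

    HTML_MEMENTO = "https://web.archive.org/web/20150801000000/https://www.cnn.com/"
    JSON_MEMENTOS = [
        # placeholder URI-Ms of the JSON responses the page would fetch
        "https://web.archive.org/web/20150805120000/https://data.example.com/top-story.json",
    ]

    def memento_datetime(uri_m: str) -> datetime:
        # Memento-Datetime is an RFC 1123 date header defined by RFC 7089.
        resp = requests.head(uri_m, allow_redirects=True, timeout=30)
        return parsedate_to_datetime(resp.headers["Memento-Datetime"])

    def max_drift_days(html_uri: str, json_uris: list[str]) -> float:
        html_dt = memento_datetime(html_uri)
        drifts = [abs((memento_datetime(u) - html_dt).total_seconds()) for u in json_uris]
        return max(drifts) / 86400.0

    if __name__ == "__main__":
        print(f"max HTML/JSON drift: {max_drift_days(HTML_MEMENTO, JSON_MEMENTOS):.1f} days")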
