Models and methods for web archive crawling
Web archives offer a rich and plentiful source of information to researchers, analysts, and legal experts. To this end, they gather Web sites as the sites change over time. To maintain high standards of data quality, Web archives would have to collect every version of every Web site, but limited resources and technical constraints make this impossible. Web archives therefore consist of versions archived at various points in time, with no guarantee of mutual consistency.
This thesis presents a model for assessing the data quality in Web archives, as well as a family of crawling strategies that yield high-quality captures. We distinguish between single-visit crawling strategies for exploratory purposes and visit-revisit crawling strategies for evidentiary purposes. Single-visit strategies download every page exactly once, aiming for an "undistorted" capture of the ever-changing Web. We express the quality of the resulting capture with the "blur" quality measure. In contrast, visit-revisit strategies download every page twice. The initial downloads of all pages form the visit phase of the crawling strategy; the second downloads are grouped together in the revisit phase. These two phases enable us to check which pages changed during the crawling process, so we can identify the pages that are consistent with each other. The quality of visit-revisit captures is expressed by the "coherence" measure. Quality-conscious strategies are based on predictions of the change behaviour of individual pages. We model Web site dynamics by Poisson processes with page-specific change rates, and we show that these rates can be statistically predicted. Finally, we propose visualization techniques for exploring the quality of the resulting Web archives.
A fully functional prototype demonstrates the practical viability of our approach.
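The Poisson model above can be made concrete with a small sketch: the maximum-likelihood estimate of a page's change rate is simply observed changes per unit of observation time, and the probability that a page changes during a crawl interval follows from the Poisson no-event probability. The function names and the risk-first scheduling heuristic below are illustrative assumptions, not the thesis's actual algorithms.

```python
import math

def estimate_change_rate(num_changes, observation_time):
    """MLE of the Poisson change rate: observed changes per unit time."""
    return num_changes / observation_time

def prob_changed(rate, interval):
    """Probability a page with change rate `rate` changes at least once
    during `interval` (Poisson: 1 - P(no events) = 1 - e^(-rate*interval))."""
    return 1.0 - math.exp(-rate * interval)

def schedule_by_risk(pages, crawl_interval):
    """Order pages so those most likely to change mid-crawl are
    downloaded first, reducing expected capture inconsistency."""
    return sorted(pages,
                  key=lambda p: prob_changed(p["rate"], crawl_interval),
                  reverse=True)

pages = [
    {"url": "a", "rate": estimate_change_rate(10, 5.0)},   # 2.0 changes/day
    {"url": "b", "rate": estimate_change_rate(1, 10.0)},   # 0.1 changes/day
]
order = schedule_by_risk(pages, crawl_interval=1.0)
print([p["url"] for p in order])  # volatile page "a" first
```

A real crawler would refine these rate estimates after every visit-revisit cycle, since the revisit phase reveals exactly which pages changed during the crawl.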
CLEAR: a credible method to evaluate website archivability
Web archiving is crucial to ensure that cultural, scientific, and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies, and ultimately the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a web site should provide archivists with a valuable tool when assessing the possibilities of archiving material, and should influence web design professionals to consider the implications of their design decisions on the likelihood that their sites can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
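In spirit, an archivability score of this kind aggregates per-facet evaluations of a site into a single number. The facet names, example checks, and equal weighting below are illustrative assumptions for the sake of a sketch, not the published CLEAR formula.

```python
def website_archivability(facet_scores, weights=None):
    """Combine per-facet scores (each in 0.0-1.0) into a single
    archivability score via a weighted average."""
    if weights is None:
        weights = {f: 1.0 for f in facet_scores}  # equal weights by default
    total = sum(weights[f] for f in facet_scores)
    return sum(facet_scores[f] * weights[f] for f in facet_scores) / total

# Hypothetical facet scores for one site (names are illustrative):
scores = {
    "accessibility": 0.9,          # e.g. resolvable links, sitemap present
    "standards_compliance": 0.7,   # e.g. valid HTML, correct MIME types
    "cohesion": 0.8,               # e.g. few external dependencies
    "metadata": 0.5,               # e.g. caching/content-type headers
}
print(round(website_archivability(scores), 3))  # 0.725
```

A live evaluator like the archiveready.com prototype would derive such facet scores by fetching the site and running automated checks, then report both the aggregate and the per-facet breakdown.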
Archiving the Relaxed Consistency Web
The historical, cultural, and intellectual importance of archiving the web has been widely recognized. Today, all countries with a high Internet penetration rate have established high-profile archiving initiatives to crawl and archive the fast-disappearing web content for long-term use. As web technologies evolve, established web archiving techniques face challenges. This paper focuses on the potential impact of relaxed-consistency web design on crawler-driven web archiving. Relaxed-consistency websites may disseminate, albeit ephemerally, inaccurate and even contradictory information. If captured and preserved in web archives as historical records, such information will degrade the overall archival quality. To assess the extent of such quality degradation, we build a simplified feed-following application and simulate its operation with synthetic workloads. The results indicate that a non-trivial portion of a relaxed-consistency web archive may contain observable inconsistency, and the inconsistency window may extend significantly longer than that observed at the data store. We discuss the nature of such quality degradation and propose a few possible remedies.
Comment: 10 pages, 6 figures, CIKM 201
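The feed-following simulation idea can be sketched in miniature: posts are written to a data store but only become visible on the serving replica after a replication lag, and any crawl that lands inside that lag window preserves stale state in the archive. Everything below (the uniform write model, the parameters) is an illustrative assumption, not the paper's actual workload.

```python
import random

def count_stale_captures(num_posts, replication_lag, crawl_times, seed=0):
    """Toy relaxed-consistency model: each post is written at the data
    store at some time but only becomes visible on the serving replica
    `replication_lag` later. A crawl captures whatever the replica
    shows, so any crawl falling between a write and its replication
    archives stale (inconsistent) content."""
    random.seed(seed)
    writes = sorted(random.uniform(0, 100) for _ in range(num_posts))
    stale = 0
    for t in crawl_times:
        # posts already written but not yet replicated at crawl time t
        if any(w <= t < w + replication_lag for w in writes):
            stale += 1
    return stale

# With a 2-unit lag, some of the 20 evenly spaced crawls land inside an
# inconsistency window; with zero lag, none can.
crawls = [i * 5 for i in range(20)]
print(count_stale_captures(num_posts=30, replication_lag=2.0, crawl_times=crawls))
```

Growing `replication_lag` widens the window in which a crawler can observe and preserve inconsistency, which mirrors the paper's finding that the archive's inconsistency window can significantly exceed the one observed at the data store.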
Delivering computationally-intensive digital patient applications to the clinic: An exemplar solution to predict femoral bone strength from CT data
Background and objective: Whilst fragility hip fractures commonly affect elderly people, often causing permanent disability or death, they are rarely addressed in advance through preventive techniques. Quantification of bone strength can help to identify subjects at risk, thus reducing the incidence of fractures in the population. In recent years, researchers have shown that finite element models (FEMs) of the hip joint, derived from computed tomography (CT) images, can predict bone strength more accurately than other techniques currently used in the clinic. The specialised hardware and trained personnel required to perform such analyses, however, limit the widespread adoption of FEMs in clinical contexts. In this manuscript we present CT2S (Computed Tomography To Strength), a system developed in collaboration between The University of Sheffield and Sheffield Teaching Hospitals, designed to streamline access to this complex workflow for clinical end-users.
Methods: The system relies on XNAT and makes use of custom apps based on open source software. Available through a website, it allows doctors in the healthcare environment to benefit from FE-based bone strength estimation without being exposed to the technical aspects, which are concealed behind a user-friendly interface. Clinicians request the analysis of a patient's CT scans through the website. Using XNAT functionality, the anonymised images are automatically transferred to the University research facility, where an operator processes them and estimates the bone strength through FEM using a combination of open source and commercial software. Following the analysis, the doctor is provided with the results in a structured report.
Results: The platform, currently available for research purposes, has been deployed and fully tested in Sheffield, UK. The entire analysis requires processing times ranging from 3.5 to 8 h, depending on the available computational power.
Conclusions: The short processing time makes the system compatible with current clinical workflows. The use of open source software and the accurate description of the workflow given here facilitate deployment in other centres.
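The workflow described in the Methods (request, anonymise, transfer, FE analysis, structured report) can be read as a simple pipeline. The sketch below only illustrates that pipeline shape; every function name is hypothetical, and none of it reflects the actual CT2S or XNAT APIs.

```python
def ct2s_style_pipeline(ct_scan, anonymize, transfer, run_fem, build_report):
    """Orchestrate a CT2S-style workflow: each step is injected as a
    callable so the sketch stays independent of any real system."""
    anon = anonymize(ct_scan)       # strip patient identifiers before transfer
    handle = transfer(anon)         # move images to the research facility
    strength = run_fem(handle)      # operator-driven FE strength estimation
    return build_report(strength)   # structured report back to the clinician

# Toy stand-ins for each stage (all hypothetical):
report = ct2s_style_pipeline(
    ct_scan={"patient_id": "P001", "voxels": "..."},
    anonymize=lambda s: {k: v for k, v in s.items() if k != "patient_id"},
    transfer=lambda s: s,
    run_fem=lambda s: {"femoral_strength_N": 3500},
    build_report=lambda r: f"Estimated femoral strength: {r['femoral_strength_N']} N",
)
print(report)  # Estimated femoral strength: 3500 N
```

The key design choice the abstract describes is the same one the sketch encodes: the clinician only touches the first and last steps, while anonymisation, transfer, and the computationally intensive FE analysis happen behind the interface.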
Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering
Many web sites are transitioning how they construct their pages. In the conventional model, the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format) and then builds out the DOM client-side, more easily allowing for periodically refreshing the content in a page and for dynamic modification of the content. This client-side rendering, now predominant in social media platforms such as Twitter and Instagram, is also being adopted by news outlets, such as CNN.com. When conventional web archiving techniques, such as crawling with Heritrix, are applied to pages that render their content client-side, the JSON responses can become out of sync with the HTML page in which they are to be embedded, resulting in temporal violations on replay. Because the violative JSON is not directly observable in the page (i.e., in the same manner a violative embedded image is), the temporal violations can be difficult to detect. We describe how the top-level CNN.com page has used client-side rendering since April 2015 and the impact this has had on web archives. Between April 24, 2015 and July 21, 2016, we found almost 15,000 mementos with a temporal violation of more than 2 days between the base CNN.com HTML and the JSON responses used to deliver the content under the main story. One way to mitigate this problem is to use browser-based crawling instead of conventional crawlers like Heritrix, but browser-based crawling is currently much slower than non-browser-based tools such as Heritrix.
Comment: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
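The detection idea behind the "more than 2 days" statistic can be sketched by comparing capture datetimes: given the archival datetime of the base HTML memento and the datetimes of the JSON resources embedded into it at replay, flag any JSON capture whose distance from the HTML exceeds a threshold. This is a minimal sketch of that comparison, assuming datetimes have already been extracted (e.g. from Memento-Datetime headers); it is not the paper's actual tooling.

```python
from datetime import datetime, timedelta

def temporal_violations(html_datetime, json_datetimes,
                        threshold=timedelta(days=2)):
    """Return the JSON capture datetimes that differ from the base
    HTML memento's datetime by more than `threshold`."""
    return [dt for dt in json_datetimes if abs(dt - html_datetime) > threshold]

# Hypothetical example: one consistent and one violative JSON capture.
html_dt = datetime(2015, 6, 1, 12, 0)
json_dts = [
    datetime(2015, 6, 1, 12, 5),   # minutes apart: consistent
    datetime(2015, 6, 10, 9, 0),   # ~9 days apart: temporal violation
]
print(len(temporal_violations(html_dt, json_dts)))  # 1
```

Because the violative JSON never appears in the page source itself, a detector like this has to first discover which archived JSON resources the replayed JavaScript will actually fetch, which is the hard part the paper highlights.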
Web Archiving Bibliography 2013
The following document is a bibliography of the field of web archiving. It includes a preface as well as a list of bibliographic resources.
- …