As web technologies evolve, web archivists work to keep up so that our
digital history is preserved. Recent advances in web technologies have
introduced client-side executed scripts that load data without a referential
identifier or that require user interaction (e.g., content loading when the
page has scrolled). These advances have made automating methods for capturing
web pages more difficult. Because of the evolving schemes of publishing web
pages along with the progressive capability of web preservation tools, the
archivability of pages on the web has varied over time. In this paper we show
that the archivability of a web page can be deduced from the type of page being
archived, which aligns with that page's accessibility with respect to dynamic
content. We show concrete examples of when these technologies were introduced
by referencing mementos of pages that have persisted through a long evolution
of available technologies. Identifying why these web pages could not be
archived in the past, with respect to accessibility, serves as a guide for
ensuring that long-lived content is published using good-practice methods
that make it available for preservation.

Comment: 12 pages, 8 figures, Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta

English

arXiv

PDF of a PowerPoint presentation from TPDL 2013: 17th International Conference on Theory and Practice of Digital Libraries, Valletta, Malta, September 22-26, 2013. Also available on Slideshare.

Kelly, Mat

Brunelle, Justin F.

Weigle, Michele C.

Nelson, Michael L.

Old Dominion University

Old Dominion University
ODU Digital Commons
Computer Science Presentations, Computer Science
9-23-2013

On the Change in Archivability of Websites Over Time
Mat Kelly, Old Dominion University
Justin F. Brunelle
Michele C. Weigle, Old Dominion University, mweigle@odu.edu
Michael L. Nelson, Old Dominion University, mnelson@odu.edu

Part of the Archival Science Commons

Recommended Citation
Kelly, Mat; Brunelle, Justin F.; Weigle, Michele C.; and Nelson, Michael L., "On the Change in Archivability of Websites Over Time" (2013). Computer Science Presentations. 14.
https://digitalcommons.odu.edu/computerscience_presentations/14

On the Change in Archivability of Websites Over Time
Mat Kelly, Justin F. Brunelle, Michele C. Weigle, and Michael L. Nelson
Old Dominion University
{mkelly,jbrunelle,mweigle,mln}@cs.odu.edu
TPDL 2013, Valletta, Malta, September 23, 2013

Preserving Web Pages
• Identify page content needed for re-render
• Save page contents
– HTML, Images, CSS, JavaScript
• Rewrite inter-resource references for replay
From M. Klein, J. F. Brunelle, WADL 2013

Ease of Archiving
• HTML – text-based, references other resources
EXAMPLE
    <html><body><img src="meatball-logo.png" /></body></html>
• Images – binary data, no embedded URIs
• CSS – text-based, references other resources
EXAMPLE
    body {margin: 5px; color: black; background-image: url('outerSpace.png');}
• JavaScript – text-based, references other resources, URIs (not necessarily known until runtime)
EXAMPLE
    $.ajax({ url: "meatball-logo" + "." + "png" }); // build logo URI at runtime
EXAMPLE
    $.ajax({ url: "wormlogo" + "." + "png" }); // build logo URI at runtime

BIG Problem
• JavaScript meant for browser
– Capable of JS execution
• Crawlers only recently became capable of executing JavaScript
• Crawlers getting smarter at extracting URIs
– Nowhere near perfect

Identifying Missing Resources
• Sometimes missing resources are subtle
• Archived version not interactive & resources are missing
[Figure: maps.google.com from the live Web and from the Internet Archive, Apr. 30, 2012]
http://web.archive.org/web/20120430235712/http://maps.google.com
• Other times, a failed Ajax call prevents other resources from loading
[Figure: YouTube (2011) in archives with failed Ajax call]
http://web.archive.org/web/20110420002216/http://www.youtube.com/

Why JavaScript makes it difficult
• Archival crawlers don’t interpret the DOM; they are made for capture and are thus faster
[Figure: Same JS code, different agents (functionality1(), functionality2(), functionality3()): different functionality]
(Recent versions)
• Interprets JavaScript
• Enabled pages w/ JS to be archived better!

JavaScript and Accessibility
[Figure: "guidelines" and "section" logos]
• Content displaying should not be dependent on script execution
• U.S. Government
– required to comply with accessibility standards
– Content should be available to crawler w/o JS
– Therefore, government sites are better preserved
• Right?

If not Accessible then not Archivable
[Figure: NASA.gov in Internet Archive, 2004]
http://web.archive.org/web/20041014205942/http://www.nasa.gov

Finding out where NASA went wrong
Step 1: Query Archives
– nasa.gov in 1996, nasa.gov in 1997, … nasa.gov in 2012
Step 2: Capture
– DOM
– HTTP requests for resources
– Screenshot of Web page (.jpg)
Webkit
• Modern rendering
• Browser-like JS behavior

NASA.gov over time
[Figure: NASA.gov over time, yearly captures 1996 to 2012]

Does this apply to popular sites?
• Alexa’s Top 10 websites
• No Mementos: robots.txt exclusion prevents crawl

Alexa Rank   Web Site Name    Sampled Mementos
1            Facebook.com     No memento (robots.txt exclusion)
2            Google.com       15 mementos, 1998 to 2012
3            YouTube.com      7 mementos, 2006 to 2012
4            Yahoo.com        16 mementos, 1997 to 2012
5            Baidu.com        No memento (robots.txt exclusion)
6            Wikipedia.org    12 mementos, 2001 to 2012
7            Live.org         15 mementos, 1999 to 2012
8            Amazon.com       14 mementos, 1999 to 2012
9            QQ.com           15 mementos, 1998 to 2012
10           Twitter.com      No memento (robots.txt exclusion)

all thumbnails at: http://www.cs.odu.edu/~mkelly/semester/2013_spring/20130127alexatop10/

Case Study with Ajax emphasis: YouTube

Ajax’s Effects on the Archivability of YouTube
• Recently Viewed content missing
• How much of this is because of JavaScript?
[Figure: 2006 YouTube.com from Internet Archive]
http://web.archive.org/web/20060427213420/http://youtube.com

YouTube as captured without JavaScript capability
• Titles fetched with JavaScript
• Primary content (preview) still missing
[Figure: 2006 YouTube.com from Internet Archive]
http://web.archive.org/web/20060427213420/http://youtube.com

Dependent Loading of Resources
    GET http://web.archive.org/web/20121208145112cs_/http://s.ytimg.com/yt/cssbin/www-core-vfl_OJqFG.css 404 (Not Found) www.youtube.com:15
    GET http://web.archive.org/web/20121208145115js_/http://s.ytimg.com/yt/jsbin/www-core-vfl8PDcRe.js 404 (Not Found) www.youtube.com:45
    Uncaught TypeError: Object #<Object> has no method 'setConfig' www.youtube.com:56
    Uncaught TypeError: Cannot read property 'home' of undefined www.youtube.com:76
    Uncaught TypeError: Cannot read property 'ajax' of undefined www.youtube.com:86
    Uncaught TypeError: Object #<Object> has no method 'setConfig' www.youtube.com:101
    Uncaught ReferenceError: _gel is not defined www.youtube.com:1784
    Uncaught TypeError: Object #<Object> has no method 'setConfig' www.youtube.com:1929
    Uncaught TypeError: Cannot read property 'home' of undefined www.youtube.com:524
    GET http://web.archive.org/web/20130101024721im_/http://i2.ytimg.com/vi/1f7neSzDqvc/default.jpg 404 (Not Found)
Annotations: Missing CSS, Missing JS, Missing JS-dependent resources
YouTube 2011
http://web.archive.org/web/20110420002216/http://www.youtube.com/

Live Web Leaks Into Archive Via JavaScript
see: http://ws-dl.blogspot.com/2012/10/2012-10-10-zombies-in-archives.html
[Figure labels: Sept 3, 2008; 2012]

Stepping Back, NASA 1999
• Few embedded resources
• Little to no JavaScript
– If some, not Ajax
• No resources dependent on JavaScript
http://web.archive.org/web/19991109105105/http://www.nasa.gov

Stepping Back, NASA 2003
• Content requires JS
• Accessibility goes awry
• Archivability decreases
    var fstr = '';
    if(hasFlash(6)) {
      fstr+='<object id="screenreader.swf">...</object>';
      window.status = 'Flash 6 Detected...';
    } else {
      fstr+='...To view the enhanced version of NASA.gov, you must have Flash 6 installed....';
    }
    with(document) { open('text/html'); write(fstr); close(); }
http://web.archive.org/web/20041014205942/http://www.nasa.gov

Stepping Back, NASA 2007
• JS check removed
• Content is Accessible
• Archivability Goes Up
• Accessibility, Archivability
http://web.archive.org/web/20071011055607/http://www.nasa.gov

Reinforcing Case: Wikipedia
[Figure: Wikipedia over time, yearly captures 2001 to 2012]

Summary
• JavaScript: good for interaction, bad for archiving
– crawlers miss URIs at crawl time
– rendering yesterday's pages causes them to reach into today's web
• Different trends of archivability over time:
– YouTube: bad-->worse
– NASA: good-->bad-->good
– Wikipedia: good
– see all: http://www.cs.odu.edu/~mkelly/semester/2013_spring/20130127alexatop10/
• Overall, archivability is getting worse
– 24% increase in missing embedded resources from 2006-2010 due to JavaScript
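
The "Ease of Archiving" slides above contrast URIs that appear literally in HTML and CSS with URIs that JavaScript assembles at runtime. The following is a minimal sketch of that contrast in Python, assuming a crawler that only scans text and never executes scripts. The names `SrcCollector`, `static_uris`, and `CSS_URL` are hypothetical and chosen for illustration; this is not the extraction logic of any particular archival crawler.

```python
import re
from html.parser import HTMLParser

# Matches url(...) in CSS, with optional single or double quotes.
CSS_URL = re.compile(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)")


class SrcCollector(HTMLParser):
    """Collect src/href attribute values without executing any script."""

    def __init__(self):
        super().__init__()
        self.uris = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value:
                self.uris.append(value)


def static_uris(html="", css=""):
    """Return URIs that appear literally in the HTML and CSS text."""
    collector = SrcCollector()
    collector.feed(html)
    collector.close()
    return collector.uris + CSS_URL.findall(css)


# Inputs adapted from the slide examples.
html = '<html><body><img src="meatball-logo.png" /></body></html>'
css = "body {margin: 5px; background-image: url('outerSpace.png');}"
js = '$.ajax({ url: "wormlogo" + "." + "png" });'

found = static_uris(html, css)
print(found)                  # URIs visible to a text-only scan
print("wormlogo.png" in js)   # the runtime-built URI never appears literally
```

The image and stylesheet references are recoverable from the text alone, while the string `wormlogo.png` exists only after the script's concatenation runs, which is why a crawler that does not execute JavaScript would miss it.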

