
    CLEAR: a credible method to evaluate website archivability

    Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and influence web design professionals to consider the implications of their design decisions on the likelihood that their site can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
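    To make the idea of archive readiness concrete, the following is a minimal Python sketch of a toy readiness probe. The three checks (robots.txt reachable, sitemap.xml reachable, a declared Content-Type on the home page) and the target URL are illustrative assumptions; they are not the facets or scoring used by the CLEAR method or archiveready.com.

        import urllib.request
        import urllib.error

        SITE = "https://example.org"   # placeholder target site

        def reachable(url: str) -> bool:
            """Return True if the URL answers with a non-error HTTP status."""
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.status < 400
            except (urllib.error.HTTPError, urllib.error.URLError):
                return False

        def content_type(url: str) -> str:
            """Return the Content-Type header of the response, or '' on failure."""
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    return resp.headers.get("Content-Type", "")
            except (urllib.error.HTTPError, urllib.error.URLError):
                return ""

        checks = {
            "robots.txt reachable": reachable(f"{SITE}/robots.txt"),
            "sitemap.xml reachable": reachable(f"{SITE}/sitemap.xml"),
            "Content-Type declared": bool(content_type(SITE)),
        }

        # Crude score: fraction of crawler-friendliness checks the site passes.
        print(checks, sum(checks.values()) / len(checks))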

    Practice of Robots Exclusion Protocol in Bhutan

    Most search engines rely on web robots to collect information from the web. The web is open and unregulated, which makes it easy for robots to crawl and index the contents of websites. But not all site owners wish to have their websites and web pages indexed by web crawlers. Crawling activity can be regulated and managed by deploying the Robots Exclusion Protocol (REP) in a file called robots.txt on the server. The method is a de facto standard, and most ethical robots will follow the rules specified in the file. In Bhutan there are many websites, but the extent to which they use robots.txt to regulate these bots is not known, since no study has been carried out to date. The main aim of this paper is to investigate the use of robots.txt files on the websites of various organizations in Bhutan and, further, to analyze the content of the file where it exists. A total of 50 websites from sectors such as colleges, government ministries, autonomous agencies, corporations and newspaper agencies were selected to check the usage of the file. The files were further analyzed for file size, types of robots specified, and correct use of the file. The results showed that almost 70% of the websites investigated use the default robots.txt file generated by the Joomla and WordPress Content Management Systems (CMS), which indicates that the file is in use. On the other hand, the file is not taken seriously: almost 70% of the files lack the major directives and best practices that define which resources are allowed or denied to the various types of robots on the web. Approximately 30% of the URLs included in the study have no REP file on their web server, thus providing unregulated access to resources for all types of web robots.
    Keywords: Crawler, robots.txt, search engines, robots exclusion protocol, indexing. DOI: 10.7176/JEP/11-35-01. Publication date: December 31st 202
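    As a concrete illustration of how an ethical crawler honours the REP, here is a minimal Python sketch using the standard library's urllib.robotparser. The rules shown are only indicative of a default CMS-generated robots.txt and are not taken from any of the surveyed Bhutanese sites.

        from urllib import robotparser

        # Illustrative rules in the style of a default Joomla robots.txt.
        rules = [
            "User-agent: *",
            "Disallow: /administrator/",
            "Disallow: /cache/",
            "Disallow: /tmp/",
        ]

        rp = robotparser.RobotFileParser()
        rp.parse(rules)   # parse the rules without a network fetch

        # A well-behaved robot checks the rules before requesting each path.
        print(rp.can_fetch("*", "/administrator/"))   # False: explicitly disallowed
        print(rp.can_fetch("*", "/index.php"))        # True: no rule blocks it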

    Web Site Metadata

    The currently established formats by which a Web site can publish metadata about its pages, the robots.txt file and sitemaps, focus on telling crawlers where not to go and where to go on a site. This is sufficient as input for crawlers, but it does not allow Web sites to publish richer metadata about their site's structure, such as the navigational structure. This paper looks at the availability of Web site metadata on today's Web in terms of available information resources and quantitative aspects of their contents. Such an analysis of the available Web site metadata not only makes it easier to understand what data is available today; it also serves as the foundation for investigating what kind of information retrieval processes could be driven by that data, and what additional data could be provided by Web sites if they had richer data formats to publish metadata.
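    As a concrete illustration of how little structural information a plain sitemap carries, here is a short Python sketch that parses a minimal Sitemap document; the URLs are placeholders, and the namespace is the standard sitemaps.org one.

        import xml.etree.ElementTree as ET

        # A minimal Sitemap document: a plain sitemap exposes only a flat list
        # of URLs (optionally with lastmod/changefreq/priority), not the
        # navigational structure of the site.
        sitemap = """\
        <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
          <url><loc>https://example.org/</loc><lastmod>2011-01-01</lastmod></url>
          <url><loc>https://example.org/about</loc></url>
        </urlset>"""

        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        root = ET.fromstring(sitemap)

        for url in root.findall("sm:url", ns):
            loc = url.findtext("sm:loc", namespaces=ns)
            lastmod = url.findtext("sm:lastmod", default="-", namespaces=ns)
            print(loc, lastmod)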

    Extending Sitemaps for ResourceSync

    The documents used in the ResourceSync synchronization framework are based on the widely adopted document format defined by the Sitemap protocol. In order to address requirements of the framework, extensions to the Sitemap format were necessary. This short paper describes the concerns we had about introducing such extensions, the tests we did to evaluate their validity, and aspects of the framework that address them. Comment: 4 pages, 6 listings, accepted at JCDL 201
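    The general extension pattern, adding elements from a separate XML namespace alongside the standard Sitemap structure, can be sketched in a few lines of Python. The rs: namespace URI, element names and attribute values below are assumptions based on the ResourceSync specification, not listings reproduced from the paper.

        import xml.etree.ElementTree as ET

        SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
        RS = "http://www.openarchives.org/rs/terms/"   # assumed ResourceSync namespace
        ET.register_namespace("", SM)
        ET.register_namespace("rs", RS)

        # A Sitemap <urlset> carrying ResourceSync-style extension elements.
        urlset = ET.Element(f"{{{SM}}}urlset")
        ET.SubElement(urlset, f"{{{RS}}}md", {"capability": "resourcelist"})

        url = ET.SubElement(urlset, f"{{{SM}}}url")
        ET.SubElement(url, f"{{{SM}}}loc").text = "https://example.org/resource1"
        ET.SubElement(url, f"{{{RS}}}md", {"hash": "md5:0000placeholder"})

        # The result is still a valid Sitemap document to ordinary consumers,
        # while synchronization clients can read the namespaced extensions.
        print(ET.tostring(urlset, encoding="unicode"))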

    Sharing Means Renting?: An Entire-marketplace Analysis of Airbnb

    Airbnb, an online marketplace for accommodations, has experienced staggering growth accompanied by intense debates and scattered regulations around the world. Current discourses, however, are largely focused on opinions rather than empirical evidence. Here, we aim to bridge this gap by presenting the first large-scale measurement study on Airbnb, using a crawled data set containing 2.3 million listings, 1.3 million hosts, and 19.3 million reviews. We measure several key characteristics at the heart of the ongoing debate and the sharing economy. Among others, we find that Airbnb has reached a global yet heterogeneous coverage. The majority of its listings across many countries are entire homes, suggesting that Airbnb is actually more like a rental marketplace than a spare-room sharing platform. Analysis of star ratings reveals that there is a bias toward positive ratings, amplified by a bias toward using positive words in reviews. The extent of such bias is greater than in Yelp reviews, which were already shown to exhibit a positive bias. We investigate a key issue repeatedly discussed in the current debate: commercial hosts who own multiple listings on Airbnb. We find that their existence is prevalent, they are early movers towards joining Airbnb, and their listings are disproportionately entire homes and located in the US. Our work advances the current understanding of how Airbnb is being used and may serve as an independent and empirical reference to inform the debate. Comment: WebSci '1

    Web crawlers on a health related portal: Detection, characterisation and implications

    Web crawlers are automated computer programs that visit websites in order to download their content. They are employed for non-malicious purposes (search engine crawlers indexing websites) and malicious ones (breaching privacy by harvesting email addresses for unsolicited email promotion and spam databases). Whatever their usage, web crawlers need to be accurately identified in an analysis of the overall traffic to a website. Visits from web crawlers as well as from genuine users are recorded in the web server logs. In this paper, we analyse the web server logs of NRIC, a health related portal. We present the techniques used to identify malicious and non-malicious web crawlers from these logs, using a blacklist database and analysis of the characteristics of the online behaviour of malicious crawlers. We use visualisation to carry out sanity checks along the crawler removal process. We illustrate the use of these techniques using 3 months of web server logs from NRIC. We use a combination of visualisation and baseline measures from Google Analytics to demonstrate the efficacy of our techniques. Finally, we discuss the implications of our work on the analysis of the web traffic to a website using web server logs and on the interpretation of the results from such analysis. © 2011 IEEE
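    In the same spirit (though not the paper's exact procedure), the sketch below shows one simple way to flag crawler traffic in web server logs: match the user-agent string against a small blacklist and treat requests for robots.txt as a crawler signal. The log lines, blacklist entries and regular expression are illustrative.

        import re
        from collections import Counter

        # Toy blacklist and synthetic Apache combined-format log lines.
        BLACKLISTED_UA = ("libwww-perl", "EmailCollector", "python-requests")
        LOG_RE = re.compile(
            r'(?P<ip>\S+) \S+ \S+ \[.*?\] "(?P<req>[^"]*)" \d+ \S+ "[^"]*" "(?P<ua>[^"]*)"'
        )

        lines = [
            '203.0.113.9 - - [01/Jan/2011:10:00:00 +0800] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1"',
            '198.51.100.7 - - [01/Jan/2011:10:00:05 +0800] "GET /page HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
            '192.0.2.4 - - [01/Jan/2011:10:00:07 +0800] "GET /contact HTTP/1.1" 200 900 "-" "EmailCollector/1.0"',
        ]

        crawler_hits = Counter()
        for line in lines:
            m = LOG_RE.match(line)
            if not m:
                continue
            path = m["req"].split()[1]
            if path == "/robots.txt" or any(b in m["ua"] for b in BLACKLISTED_UA):
                crawler_hits[m["ip"]] += 1

        # IP addresses whose hits should be separated from genuine user traffic.
        print(crawler_hits)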