A Brief History of Web Crawlers
Web crawlers visit internet applications, collect data, and learn about new
web pages from visited pages. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. The rapid expansion of the web and the added
complexity of web applications have made crawling very challenging.
Throughout the history of web crawling many researchers and industrial groups
addressed different issues and challenges that web crawlers face. Different
solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl remains an open question. Additionally,
capturing the model of a modern web application and extracting data from it
automatically is another open question. What follows is a brief history of the
different techniques and algorithms used from the early days of crawling up to
the present. We introduce criteria to evaluate the relative performance of
web crawlers. Based on these criteria we plot the evolution of web crawlers and
compare their performance
On the Change in Archivability of Websites Over Time
As web technologies evolve, web archivists work to keep up so that our
digital history is preserved. Recent advances in web technologies have
introduced client-side executed scripts that load data without a referential
identifier or that require user interaction (e.g., content loading when the
page has scrolled). These advances have made automating methods for capturing
web pages more difficult. Because of the evolving schemes of publishing web
pages along with the progressive capability of web preservation tools, the
archivability of pages on the web has varied over time. In this paper we show
that the archivability of a web page can be deduced from the type of page being
archived, which aligns with that page's accessibility with respect to dynamic
content. We show concrete examples of when these technologies were introduced
by referencing mementos of pages that have persisted through a long evolution
of available technologies. Identifying the reasons these web pages could not
be archived in the past with respect to accessibility serves as a
guide for ensuring that long-lived content is published using good-practice
methods that make it available for preservation.
Comment: 12 pages, 8 figures, Theory and Practice of Digital Libraries (TPDL)
2013, Valletta, Malta
I Know Why You Went to the Clinic: Risks and Realization of HTTPS Traffic Analysis
Revelations of large scale electronic surveillance and data mining by
governments and corporations have fueled increased adoption of HTTPS. We
present a traffic analysis attack against over 6000 webpages spanning the HTTPS
deployments of 10 widely used, industry-leading websites in areas such as
healthcare, finance, legal services and streaming video. Our attack identifies
individual pages in the same website with 89% accuracy, exposing personal
details including medical conditions, financial and legal affairs and sexual
orientation. We examine evaluation methodology and reveal accuracy variations
as large as 18% caused by assumptions affecting caching and cookies. We present
a novel defense reducing attack accuracy to 27% with a 9% traffic increase, and
demonstrate significantly increased effectiveness of prior defenses in our
evaluation context, inclusive of enabled caching, user-specific cookies and
pages within the same website
INCORPORATING PRIVACY AND SECURITY FEATURES IN AN OPEN SOURCE SEARCH ENGINE A Project Report Presented to
The aim of this project was to explore and implement various privacy and security features in an open-source search engine and thereby enhance the security and privacy capabilities of Yioop. Yioop, an open-source PHP search engine licensed under GPLv3, is designed and developed by Dr. Chris Pollett. We have enabled a crawl, search and index mechanism for hidden services through executed code, which gives Yioop access to the Tor network. We have extended the text CAPTCHA functionality previously supported in Yioop by implementing a hash CAPTCHA and making it possible to toggle between the text CAPTCHA and the hash CAPTCHA. To enable users to log in to their Yioop accounts without sharing the password over the network, we have incorporated zero-knowledge authentication, in which Yioop does not store the user’s real password but instead stores a numerical password derived from the user’s original password