A Brief History of Web Crawlers
Web crawlers visit internet applications, collect data, and learn about new
web pages from visited pages. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. The rapid expansion of the web and the growing complexity of
web applications have made crawling a very challenging process.
Throughout the history of web crawling, many researchers and industrial groups
have addressed the various issues and challenges that web crawlers face, and
different solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl remains a challenging task, and automatically
capturing the model of a modern web application and extracting data from it is
another open question. What follows is a brief history of the different
techniques and algorithms used from the early days of crawling to the present.
We introduce criteria to evaluate the relative performance of web crawlers and,
based on these criteria, plot the evolution of web crawlers and compare their
performance.
Methodologies for the Automatic Location of Academic and Educational Texts on the Internet
Traditionally, online databases of web resources have been compiled by a human editor, or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis.
This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, this paper looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data are presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
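As a rough illustration of the kind of 'appropriateness' filter such an automated harvesting system requires, the following sketch trains a simple text classifier to separate academic or pedagogic material from other pages. The scikit-learn pipeline and the tiny labelled examples are assumptions chosen for illustration, not the methodologies evaluated in the paper.

```python
# Minimal sketch of a text-content filter for harvested resources:
# decide whether a fetched document looks like academic research or
# teaching material. The pipeline and training set are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labelled examples: 1 = appropriate (academic/pedagogic), 0 = not.
docs = [
    "Abstract. We present a study of glacial sediment transport rates ...",
    "Lecture notes for an introductory physical geography module ...",
    "Buy cheap holiday packages to the Mediterranean today",
    "Celebrity gossip and entertainment news roundup",
]
labels = [1, 1, 0, 0]

# TF-IDF features feeding a linear classifier is a common baseline for
# this kind of relevance filtering; the real system would need far more data.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(docs, labels)

candidate = "Syllabus and reading list for an earth sciences seminar"
print(clf.predict([candidate]))  # -> [1] if judged appropriate for the database
```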
Hybrid focused crawling on the Surface and the Dark Web
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating
through the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic of
interest. This work proposes a generic focused crawling framework for discovering resources on any given topic
that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the
Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by
automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the
destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates
11 hyperlink selection methods, among which is a novel strategy based on the dynamic linear combination
of a link-based and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of
Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the
effectiveness of the proposed focused crawler for both the Surface and the Dark Web.
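To make the dynamic linear combination concrete, here is a minimal sketch of how such a classifier-guided hyperlink score might be computed. The Hyperlink fields, the toy keyword relevance function, the example topic vocabulary, and the evidence-based weighting heuristic are all illustrative assumptions rather than the crawler described above, which additionally adapts its behavior to the destination network type.

```python
# Sketch: score hyperlinks with a linear combination of a link-based and a
# parent-page relevance estimate, weighted by the strength of the local
# evidence around the link. All names and heuristics here are assumptions.

from dataclasses import dataclass

# Arbitrary example topic vocabulary standing in for a trained classifier.
TOPIC_TERMS = {"recipe", "ingredients", "dough", "oven"}

@dataclass
class Hyperlink:
    anchor_text: str   # anchor text plus words in the vicinity of the link
    parent_text: str   # text of the page containing the link
    network: str       # "surface", "tor", "i2p", or "freenet"

def relevance(text: str) -> float:
    """Toy stand-in for a text classifier: fraction of topic terms present."""
    words = set(text.lower().split())
    return len(words & TOPIC_TERMS) / len(TOPIC_TERMS)

def combined_score(link: Hyperlink) -> float:
    """Give the link-based score more weight when there is plenty of local
    evidence; with sparse anchor text, fall back on the parent-page score."""
    evidence = min(len(link.anchor_text.split()) / 10.0, 1.0)
    return evidence * relevance(link.anchor_text) + (1.0 - evidence) * relevance(link.parent_text)

def select_links(frontier: list[Hyperlink], k: int) -> list[Hyperlink]:
    """Keep the k highest-scoring hyperlinks for the next crawl iteration."""
    return sorted(frontier, key=combined_score, reverse=True)[:k]

if __name__ == "__main__":
    links = [
        Hyperlink("sourdough recipe with step by step instructions", "community forum index", "tor"),
        Hyperlink("click here", "a long article about dough hydration, oven temperature and ingredients", "surface"),
    ]
    for link in select_links(links, k=2):
        print(f"{link.network:8s} {combined_score(link):.2f} {link.anchor_text}")
```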