A Brief History of Web Crawlers
Web crawlers visit web applications, collect data, and discover new web pages
from the pages they visit. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. Quick expansion of the web, and the complexity added
to web applications have made the process of crawling a very challenging one.
Throughout the history of web crawling many researchers and industrial groups
addressed different issues and challenges that web crawlers face. Different
solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl remains a challenging problem. Additionally,
automatically capturing the model of a modern web application and extracting
data from it is another open question. What follows is a brief history of the
different techniques and algorithms used from the early days of crawling to
the present. We introduce criteria to evaluate the relative performance of
web crawlers. Based on these criteria we plot the evolution of web crawlers and
compare their performance.
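As a concrete illustration (not taken from the paper), the core crawling loop the abstract describes, namely visiting pages, collecting data, and learning about new pages from visited ones, can be sketched as a breadth-first traversal over a link frontier. The `fetch` function and the toy page store here are hypothetical stand-ins for a real HTTP client:

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: fetch a page, extract its links, and enqueue
    unseen ones. `fetch` maps a URL to its HTML (or None on failure);
    in practice it would wrap an HTTP client. Returns visit order."""
    frontier = deque([start_url])
    visited = set()
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in visited:
                frontier.append(link)
    return order

# Toy in-memory "web" standing in for real HTTP fetches.
pages = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a>',
    "/c": '<a href="/d">d</a>',
    "/d": "no links here",
}
print(crawl("/a", pages.get))
```

The frontier/visited split is the part most of the surveyed work refines: real crawlers replace the FIFO queue with prioritized frontiers and bound the crawl by politeness and cost budgets rather than a simple page count.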
Malware distributions and graph structure of the Web
Knowledge about the graph structure of the Web is important for understanding
this complex socio-technical system and for devising proper policies supporting
its future development. Knowledge about the differences between clean and
malicious parts of the Web is important for understanding potential threats to
its users and for devising protection mechanisms. In this study, we apply
data science methods to a large crawl of surface and deep Web pages with the
aim to increase such knowledge. To accomplish this, we answer the following
questions. Which theoretical distributions explain important local
characteristics and network properties of websites? How are these
characteristics and properties different between clean and malicious
(malware-affected) websites? What is the predictive power of local
characteristics and network properties for classifying malware websites? To the
best of our knowledge, this is the first large-scale study describing the
differences in global properties between malicious and clean parts of the Web.
In other words, our work is building on and bridging the gap between
\textit{Web science} that tackles large-scale graph representations and
\textit{Web cyber security} that is concerned with malicious activities on the
Web. The results presented herein can also help antivirus vendors in devising
approaches to improve their detection algorithms.
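A minimal sketch (not the paper's method) of the kind of "local characteristics and network properties" the abstract refers to: from a hyperlink edge list, compute per-site in-degree and out-degree as classifier features, and the empirical in-degree distribution that one would then compare against theoretical distributions. The edge list here is hypothetical toy data:

```python
from collections import Counter

# Hypothetical hyperlink edges (source site -> target site); the study's
# actual crawl is of course far larger.
edges = [("a", "b"), ("a", "c"), ("b", "c"),
         ("d", "c"), ("e", "c"), ("c", "a")]

def degree_features(edges):
    """Per-site local characteristics: (in-degree, out-degree) pairs,
    two of the simplest network properties to feed a classifier."""
    indeg, outdeg = Counter(), Counter()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    sites = set(indeg) | set(outdeg)
    return {s: (indeg[s], outdeg[s]) for s in sites}

feats = degree_features(edges)

# Empirical in-degree distribution: P(k) = fraction of sites with in-degree k.
n = len(feats)
dist = Counter(ind for ind, _ in feats.values())
pk = {k: count / n for k, count in sorted(dist.items())}
```

Fitting `pk` against candidate theoretical distributions (e.g. a power law for the heavy tail typical of Web graphs) and comparing the fits between clean and malicious subsets is the kind of analysis the study's questions call for.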
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool to perform data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather large amounts of structured data continuously generated and
disseminated by Web 2.0, Social Media and Online Social Network users, and this
offers unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential for cross-fertilization, i.e., the
possibility of re-using Web Data Extraction techniques originally designed
for a given domain in other domains.
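To make the extraction task concrete, here is a minimal hand-written "wrapper" (a classic Web Data Extraction technique, though not one this survey attributes to any specific system) that pulls structured records out of repetitively marked-up HTML. The class names `item` and `price` and the sample markup are hypothetical:

```python
from html.parser import HTMLParser

class PriceWrapper(HTMLParser):
    """A hand-written wrapper: extracts (name, price) pairs from listings
    marked up with the hypothetical class names 'item' and 'price'."""
    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None      # which field the next text node fills
        self._current = {}      # partially assembled record

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in ("item", "price"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if "item" in self._current and "price" in self._current:
                self.records.append(
                    (self._current["item"], self._current["price"]))
                self._current = {}

html = ('<div><span class="item">Widget</span><span class="price">$4</span>'
        '<span class="item">Gadget</span><span class="price">$7</span></div>')
wrapper = PriceWrapper()
wrapper.feed(html)
print(wrapper.records)  # [('Widget', '$4'), ('Gadget', '$7')]
```

The brittleness of such hand-written wrappers against page redesigns is precisely what motivates the automatic wrapper-induction and Information Extraction approaches the survey covers.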
Distributed Information Retrieval using Keyword Auctions
This report motivates the need for large-scale distributed approaches to information retrieval, and proposes solutions based on keyword auctions.