A Brief History of Web Crawlers
Web crawlers visit web applications, collect data, and learn about new web
pages from the pages they visit. Web crawlers have a long and interesting
history. Early web crawlers collected statistics about the web. In addition
to collecting statistics and indexing applications for search engines,
modern crawlers can be used to perform accessibility and vulnerability
checks on an application. The rapid expansion of the web and the growing
complexity of web applications have made crawling a very challenging
process. Throughout the history of web crawling, many researchers and
industrial groups have addressed the issues and challenges that web
crawlers face, and different solutions have been proposed to reduce the
time and cost of crawling. Performing an exhaustive crawl remains a
challenging problem; additionally, automatically capturing the model of a
modern web application and extracting data from it is another open
question. What follows is a brief history of the techniques and algorithms
used from the early days of crawling to the present day. We introduce
criteria to evaluate the relative performance of web crawlers, and based on
these criteria we plot the evolution of web crawlers and compare their
performance.
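The frontier-based crawl loop described in the opening sentence can be sketched in a few lines. This is a minimal illustration, not any particular crawler from the survey; `fetch_links` and the tiny in-memory web are stand-ins for real HTTP fetching and link extraction:

```python
from collections import deque

def crawl(start, fetch_links, max_pages=100):
    """Breadth-first crawl: visit a page, collect its out-links,
    and queue any pages not yet seen. `fetch_links` is a
    hypothetical callback returning the links found on a page."""
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)           # "visit" the page
        for link in fetch_links(page):
            if link not in seen:     # learn about new pages
                seen.add(link)
                frontier.append(link)
    return order

# A tiny in-memory "web" standing in for real HTTP fetches.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
print(crawl("a", lambda p: web.get(p, [])))  # ['a', 'b', 'c', 'd']
```

Most of the historical differences between crawlers lie in how the frontier is ordered and bounded, which is exactly what the evaluation criteria below compare.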
Algorithms For Discovering Communities In Complex Networks
It has been observed that real-world random networks like the WWW, Internet, social networks, citation networks, etc., organize themselves into closely-knit groups that are locally dense and globally sparse. These closely-knit groups are termed communities. Nodes within a community are similar in some aspect. For example, in a WWW network, communities might consist of web pages that share similar contents. Mining these communities facilitates better understanding of their evolution and topology, and is of great theoretical and commercial significance. Community-related research has focused on two main problems: community discovery and community identification. Community discovery is the problem of extracting all the communities in a given network, whereas community identification is the problem of identifying the community to which a given set of nodes belongs. We make a comparative study of various existing community-discovery algorithms. We then propose a new algorithm based on bibliographic metrics, which addresses the drawbacks in existing approaches. Bibliographic metrics are used to study similarities between publications in a citation network. Our algorithm classifies nodes in the network based on the similarity of their neighborhoods. One of the drawbacks of the current community-discovery algorithms is their computational complexity: these algorithms do not scale up to the enormous size of real-world networks. We propose a hash-table-based technique that helps us compute the bibliometric similarity between nodes in O(mΔ) time, where m is the number of edges in the graph and Δ is the largest degree. Next, we investigate different centrality metrics. Centrality metrics are used to portray the importance of a node in the network. We propose an algorithm that utilizes centrality metrics of the nodes to compute the importance of the edges in the network.
Removal of the edges in ascending order of their importance breaks the network into components, each of which represents a community. We compare the performance of the algorithm on synthetic networks with a known community structure using several centrality metrics. Performance was measured as the percentage of nodes that were correctly classified. As an illustration, we model the ucf.edu domain as a web graph and analyze the changes in its properties, such as densification power law, edge density, degree distribution, and diameter, over a five-year period. Our results show super-linear growth in the number of edges with time. We observe (and explain) that despite the increase in average degree of the nodes, the edge density decreases with time.
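The neighborhood-similarity idea can be sketched with bibliographic coupling (counting shared neighbors); the exact metric and hash-table layout the authors use are not specified in the abstract. Bucketing nodes by each neighbor they point to means only pairs that actually share a neighbor are ever compared, so each bucket of size k contributes k(k-1)/2 updates, in line with the O(mΔ) bound quoted above:

```python
from collections import defaultdict

def bibliographic_coupling(adj):
    """For every pair of nodes that share at least one neighbour,
    count the common neighbours. adj maps node -> set of neighbours."""
    buckets = defaultdict(list)        # neighbour -> nodes linking to it
    for node, nbrs in adj.items():
        for n in nbrs:
            buckets[n].append(node)
    coupling = defaultdict(int)        # (u, v) -> shared-neighbour count
    for nodes in buckets.values():
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                coupling[tuple(sorted((u, v)))] += 1
    return dict(coupling)

# p1 and p2 both cite a and b; p2 and p3 both cite c.
adj = {"p1": {"a", "b"}, "p2": {"a", "b", "c"}, "p3": {"c"}}
print(bibliographic_coupling(adj))
# {('p1', 'p2'): 2, ('p2', 'p3'): 1}
```

The pair (p1, p3) is never touched because the two nodes share no bucket, which is where the savings over an all-pairs comparison come from.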
Lumbricus webis: a parallel and distributed crawling architecture for the Italian web
Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (short L.webis), a modular crawling infrastructure built to mine data from the web domain ccTLD .it and portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
A novel defense mechanism against web crawler intrusion
Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides the pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without the proposed identification process. An experiment was performed with repeated measures for two groups, with each group containing ninety web pages. The outputs of the logistic regression analysis of treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. This study concluded that the proposed five-factor identification process is a very effective technique, as demonstrated by a successful outcome.
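A multi-factor detection scheme of this kind can be sketched as a threshold vote over per-request signals. The five signals below are hypothetical stand-ins; the actual factors used in the study are not detailed in the abstract:

```python
def looks_like_crawler(request, threshold=3):
    """Flag a request when at least `threshold` of five signals fire.
    The signals are illustrative placeholders, not the paper's factors."""
    signals = [
        "bot" in request.get("user_agent", "").lower(),  # robot-style UA
        request.get("hit_robots_txt", False),            # fetched robots.txt
        request.get("requests_per_minute", 0) > 60,      # inhuman rate
        not request.get("loads_images", True),           # skips page assets
        not request.get("accepts_cookies", True),        # stateless client
    ]
    return sum(signals) >= threshold

bot = {"user_agent": "ExampleBot/1.0", "hit_robots_txt": True,
       "requests_per_minute": 120, "loads_images": False,
       "accepts_cookies": False}
human = {"user_agent": "Mozilla/5.0", "requests_per_minute": 4}
print(looks_like_crawler(bot), looks_like_crawler(human))  # True False
```

In the study itself the factor weights are fitted by logistic regression rather than by a fixed threshold.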
A Framework to Evaluate Information Quality in Public Administration Website
This paper presents a framework aimed at assessing the capacity of Public Administration bodies (PA) to offer a good quality of information and service on their web portals. Our framework is based on the extraction of ".it" domain names registered by Italian public institutions and the subsequent analysis of their relative websites. The analysis foresees an automatic gathering of the web pages of PA portals by means of web crawling and an assessment of the quality of their online information services. This assessment is carried out by verifying their compliance with current legislation on the basis of the criteria established in government guidelines [1]. This approach provides an ongoing monitoring process of the PA websites that can contribute to the improvement of their overall quality. Moreover, our approach can also hopefully be of benefit to local governments in other countries.
Available at: https://aisel.aisnet.org/pajais/vol5/iss3/3
SEMO: a framework for customer social networks analysis based on semantics
The increasing importance of the Internet in most domains has brought about a paradigm change in consumer relations. The influence of Social Networks has entered the Customer Relationship Management domain under the coined term CRM 2.0. In this context, the need to understand and classify the interactions of customers by means of new platforms has emerged as a challenge for both researchers and professionals world-wide. This is the perfect scenario for the use of SEMO, a platform for Customer Social Networks Analysis based on Semantics and emotion mining. The platform benefits from both semantic annotation and classification and text analysis, relying on techniques from the Natural Language Processing domain. The results of the evaluation of the experimental implementation of SEMO reveal a promising and viable platform from a technical perspective. This work is supported by the Spanish Ministry of Industry, Tourism, and Commerce under the EUREKA projects SITIO (TSI-020400-2009-148), SONAR2 (TSI-020100-2008-665) and GO2 (TSI-020400-2009-127).
A Tool for Web Links Prototyping
Crawlers for Virtual Integration processes must be
efficient, given that VI process is online, which means that while
the system is looking for the required information, the user
is waiting for a response. Therefore, downloading a minimum
number of irrelevant pages is mandatory in order to improve
the crawler efficiency. Most crawlers need to download a page
in order to determine its relevance, which results in a high
number of irrelevant pages downloaded. We propose a tool
that builds a set of prototype links for a given site, where
each prototype represents links leading to pages containing a
certain concept. These prototypes can then be used to classify
pages before downloading them, just by analysing their URL.
Therefore, they are the support for crawlers to navigate through
sites downloading a minimum number of irrelevant pages while
reducing bandwidth, making them suitable for VI systems.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
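The prototype idea can be sketched by collapsing the variable segments of a URL into wildcards, so that links leading to pages with the same concept share one prototype. This is a simplification under assumed rules (only purely numeric segments are treated as variable); the tool's actual prototype construction is not detailed in the abstract:

```python
import re

def url_prototype(url):
    """Collapse purely numeric path segments into '*' so that links
    to pages of the same kind map to a single prototype."""
    path = url.split("?")[0]
    return "/".join(re.sub(r"^\d+$", "*", seg) for seg in path.split("/"))

def build_prototype_table(labelled_links):
    """Map each prototype to the concept its pages contain,
    from (url, concept) pairs gathered in a labelled sample crawl."""
    return {url_prototype(u): c for u, c in labelled_links}

sample = [("http://example.org/product/1234/review", "review"),
          ("http://example.org/product/87/review", "review"),
          ("http://example.org/user/42", "profile")]
print(build_prototype_table(sample))
# {'http://example.org/product/*/review': 'review',
#  'http://example.org/user/*': 'profile'}
```

Once the table is built from a small labelled sample, a crawler can look up a newly seen link's prototype and skip the download when the associated concept is irrelevant.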
A Tool for Link-Based Web Page Classification
Virtual integration systems require a crawler to navigate
through web sites automatically, looking for relevant information. This
process is online, so whilst the system is looking for the required information,
the user is waiting for a response. Therefore, downloading a
minimum number of irrelevant pages is mandatory to improve the crawler
efficiency. Most crawlers need to download a page to determine its relevance,
which results in a high number of irrelevant pages downloaded. In
this paper, we propose a classifier that helps crawlers to efficiently navigate
through web sites. This classifier is able to determine if a web page
is relevant by analysing exclusively its URL, minimising the number of
irrelevant pages downloaded, improving crawling efficiency and reducing
used bandwidth, making it suitable for virtual integration systems.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
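Classifying a page's relevance from its URL alone can be illustrated with a simple token-vote scheme: tokenize URLs, count token frequencies per class on a labelled sample, and let each token of a new URL vote for the class where it was seen more often. The paper's actual classifier is not specified in the abstract; this is a hedged stand-in:

```python
import re
from collections import Counter

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())

def train(labelled_urls):
    """Count token occurrences per class from (url, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    for url, relevant in labelled_urls:
        counts[relevant].update(url_tokens(url))
    return counts

def is_relevant(url, counts):
    """Each token votes for the class in which it appeared more often."""
    score = 0
    for tok in url_tokens(url):
        score += (counts[True][tok] > counts[False][tok])
        score -= (counts[True][tok] < counts[False][tok])
    return score > 0

training = [("http://example.org/product/101", True),
            ("http://example.org/product/102", True),
            ("http://example.org/about/contact", False)]
model = train(training)
print(is_relevant("http://example.org/product/999", model))  # True
```

Because the decision uses only the URL string, the crawler can discard a link before issuing any request, which is where the bandwidth saving comes from.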