A Brief History of Web Crawlers
Web crawlers visit web applications, collect data, and learn about new web
pages from the pages they visit. Web crawlers have a long and interesting
history. Early web crawlers collected statistics about the web. In addition
to collecting statistics and indexing applications for search engines,
modern crawlers can be used to perform accessibility and vulnerability
checks on an application. The rapid expansion of the web and the growing
complexity of web applications have made crawling a very challenging
process. Throughout the history of web crawling, many researchers and
industrial groups have addressed the issues and challenges that web
crawlers face, and different solutions have been proposed to reduce the
time and cost of crawling. Performing an exhaustive crawl remains a
challenging problem; additionally, automatically capturing the model of a
modern web application and extracting data from it is another open
question. What follows is a brief history of the techniques and algorithms
used from the early days of crawling to the present day. We introduce
criteria to evaluate the relative performance of web crawlers, and based on
these criteria we plot the evolution of web crawlers and compare their
performance.
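The frontier-based crawl loop described in the opening sentence can be sketched in a few lines. This is a minimal illustration, not any particular crawler from the survey; `fetch_links` and the tiny in-memory web are stand-ins for real HTTP fetching and link extraction:

```python
from collections import deque

def crawl(start, fetch_links, max_pages=100):
    """Breadth-first crawl: visit a page, collect its out-links,
    and queue any pages not yet seen. `fetch_links` is a
    hypothetical callback returning the links found on a page."""
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)           # "visit" the page
        for link in fetch_links(page):
            if link not in seen:     # learn about new pages
                seen.add(link)
                frontier.append(link)
    return order

# A tiny in-memory "web" standing in for real HTTP fetches.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": []}
print(crawl("a", lambda p: web.get(p, [])))  # ['a', 'b', 'c', 'd']
```

Most of the historical differences between crawlers lie in how the frontier is ordered and bounded, which is exactly what the evaluation criteria below compare.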
Algorithms For Discovering Communities In Complex Networks
It has been observed that real-world random networks like the WWW, Internet, social networks, citation networks, etc., organize themselves into closely-knit groups that are locally dense and globally sparse. These closely-knit groups are termed communities. Nodes within a community are similar in some aspect. For example, in a WWW network, communities might consist of web pages that share similar contents. Mining these communities facilitates better understanding of their evolution and topology, and is of great theoretical and commercial significance. Community-related research has focused on two main problems: community discovery and community identification. Community discovery is the problem of extracting all the communities in a given network, whereas community identification is the problem of identifying the community to which a given set of nodes belongs. We make a comparative study of various existing community-discovery algorithms. We then propose a new algorithm based on bibliographic metrics, which addresses the drawbacks in existing approaches. Bibliographic metrics are used to study similarities between publications in a citation network. Our algorithm classifies nodes in the network based on the similarity of their neighborhoods. One of the drawbacks of the current community-discovery algorithms is their computational complexity: these algorithms do not scale up to the enormous size of real-world networks. We propose a hash-table-based technique that helps us compute the bibliometric similarity between nodes in O(mΔ) time, where m is the number of edges in the graph and Δ is the largest degree. Next, we investigate different centrality metrics. Centrality metrics are used to portray the importance of a node in the network. We propose an algorithm that utilizes centrality metrics of the nodes to compute the importance of the edges in the network.
Removal of the edges in ascending order of their importance breaks the network into components, each of which represents a community. We compare the performance of the algorithm on synthetic networks with a known community structure using several centrality metrics. Performance was measured as the percentage of nodes that were correctly classified. As an illustration, we model the ucf.edu domain as a web graph and analyze the changes in its properties, such as densification power law, edge density, degree distribution, and diameter, over a five-year period. Our results show super-linear growth in the number of edges with time. We observe (and explain) that despite the increase in average degree of the nodes, the edge density decreases with time.
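The neighborhood-similarity idea can be sketched with bibliographic coupling (counting shared neighbors); the exact metric and hash-table layout the authors use are not specified in the abstract. Bucketing nodes by each neighbor they point to means only pairs that actually share a neighbor are ever compared, so each bucket of size k contributes k(k-1)/2 updates, in line with the O(mΔ) bound quoted above:

```python
from collections import defaultdict

def bibliographic_coupling(adj):
    """For every pair of nodes that share at least one neighbour,
    count the common neighbours. adj maps node -> set of neighbours."""
    buckets = defaultdict(list)        # neighbour -> nodes linking to it
    for node, nbrs in adj.items():
        for n in nbrs:
            buckets[n].append(node)
    coupling = defaultdict(int)        # (u, v) -> shared-neighbour count
    for nodes in buckets.values():
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                coupling[tuple(sorted((u, v)))] += 1
    return dict(coupling)

# p1 and p2 both cite a and b; p2 and p3 both cite c.
adj = {"p1": {"a", "b"}, "p2": {"a", "b", "c"}, "p3": {"c"}}
print(bibliographic_coupling(adj))
# {('p1', 'p2'): 2, ('p2', 'p3'): 1}
```

The pair (p1, p3) is never touched because the two nodes share no bucket, which is where the savings over an all-pairs comparison come from.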
Lumbricus webis: a parallel and distributed crawling architecture for the Italian web
Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (short L.webis), a modular crawling infrastructure built to mine data from the web domain ccTLD .it and portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
A novel defense mechanism against web crawler intrusion
Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides the pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without the proposed identification process. An experiment was performed with repeated measures for two groups, with each group containing ninety web pages. The outputs of the logistic regression analysis of treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. This study concluded that the proposed five-factor identification process is a very effective technique, as demonstrated by a successful outcome.
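A multi-factor detection scheme of this kind can be sketched as a threshold vote over per-request signals. The five signals below are hypothetical stand-ins; the actual factors used in the study are not detailed in the abstract:

```python
def looks_like_crawler(request, threshold=3):
    """Flag a request when at least `threshold` of five signals fire.
    The signals are illustrative placeholders, not the paper's factors."""
    signals = [
        "bot" in request.get("user_agent", "").lower(),  # robot-style UA
        request.get("hit_robots_txt", False),            # fetched robots.txt
        request.get("requests_per_minute", 0) > 60,      # inhuman rate
        not request.get("loads_images", True),           # skips page assets
        not request.get("accepts_cookies", True),        # stateless client
    ]
    return sum(signals) >= threshold

bot = {"user_agent": "ExampleBot/1.0", "hit_robots_txt": True,
       "requests_per_minute": 120, "loads_images": False,
       "accepts_cookies": False}
human = {"user_agent": "Mozilla/5.0", "requests_per_minute": 4}
print(looks_like_crawler(bot), looks_like_crawler(human))  # True False
```

In the study itself the factor weights are fitted by logistic regression rather than by a fixed threshold.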
A Framework to Evaluate Information Quality in Public Administration Website
This paper presents a framework aimed at assessing the capacity of Public Administration bodies (PA) to offer a good quality of information and service on their web portals. Our framework is based on the extraction of ".it" domain names registered by Italian public institutions and the subsequent analysis of their relative websites. The analysis foresees an automatic gathering of the web pages of PA portals by means of web crawling and an assessment of the quality of their online information services. This assessment is carried out by verifying their compliance with current legislation on the basis of the criteria established in government guidelines [1]. This approach provides an ongoing monitoring process of the PA websites that can contribute to the improvement of their overall quality. Moreover, our approach can also hopefully be of benefit to local governments in other countries.
Available at: https://aisel.aisnet.org/pajais/vol5/iss3/3
SEMO: a framework for customer social networks analysis based on semantics
The increasing importance of the Internet in most domains has brought about a paradigm change in consumer relations. The influence of Social Networks has entered the Customer Relationship Management domain under the coined term CRM 2.0. In this context, the need to understand and classify the interactions of customers by means of new platforms has emerged as a challenge for both researchers and professionals world-wide. This is the perfect scenario for the use of SEMO, a platform for Customer Social Networks Analysis based on Semantics and emotion mining. The platform benefits from both semantic annotation and classification and text analysis, relying on techniques from the Natural Language Processing domain. The results of the evaluation of the experimental implementation of SEMO reveal a promising and viable platform from a technical perspective. This work is supported by the Spanish Ministry of Industry, Tourism, and Commerce under the EUREKA projects SITIO (TSI-020400-2009-148), SONAR2 (TSI-020100-2008-665) and GO2 (TSI-020400-2009-127).
A Tool for Web Links Prototyping
Crawlers for Virtual Integration processes must be
efficient, given that VI process is online, which means that while
the system is looking for the required information, the user
is waiting for a response. Therefore, downloading a minimum
number of irrelevant pages is mandatory in order to improve
the crawler efficiency. Most crawlers need to download a page
in order to determine its relevance, which results in a high
number of irrelevant pages downloaded. We propose a tool
that builds a set of prototype links for a given site, where
each prototype represents links leading to pages containing a
certain concept. These prototypes can then be used to classify
pages before downloading them, just by analysing their URL.
Therefore, they are the support for crawlers to navigate through
sites downloading a minimum number of irrelevant pages while
reducing bandwidth, making them suitable for VI systems.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
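The prototype idea can be sketched by collapsing the variable segments of a URL into wildcards, so that links leading to pages with the same concept share one prototype. This is a simplification under assumed rules (only purely numeric segments are treated as variable); the tool's actual prototype construction is not detailed in the abstract:

```python
import re

def url_prototype(url):
    """Collapse purely numeric path segments into '*' so that links
    to pages of the same kind map to a single prototype."""
    path = url.split("?")[0]
    return "/".join(re.sub(r"^\d+$", "*", seg) for seg in path.split("/"))

def build_prototype_table(labelled_links):
    """Map each prototype to the concept its pages contain,
    from (url, concept) pairs gathered in a labelled sample crawl."""
    return {url_prototype(u): c for u, c in labelled_links}

sample = [("http://example.org/product/1234/review", "review"),
          ("http://example.org/product/87/review", "review"),
          ("http://example.org/user/42", "profile")]
print(build_prototype_table(sample))
# {'http://example.org/product/*/review': 'review',
#  'http://example.org/user/*': 'profile'}
```

Once the table is built from a small labelled sample, a crawler can look up a newly seen link's prototype and skip the download when the associated concept is irrelevant.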
A Tool for Link-Based Web Page Classification
Virtual integration systems require a crawler to navigate
through web sites automatically, looking for relevant information. This
process is online, so whilst the system is looking for the required information,
the user is waiting for a response. Therefore, downloading a
minimum number of irrelevant pages is mandatory to improve the crawler
efficiency. Most crawlers need to download a page to determine its relevance,
which results in a high number of irrelevant pages downloaded. In
this paper, we propose a classifier that helps crawlers to efficiently navigate
through web sites. This classifier is able to determine if a web page
is relevant by analysing exclusively its URL, minimising the number of
irrelevant pages downloaded, improving crawling efficiency and reducing
used bandwidth, making it suitable for virtual integration systems.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
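Classifying a page's relevance from its URL alone can be illustrated with a simple token-vote scheme: tokenize URLs, count token frequencies per class on a labelled sample, and let each token of a new URL vote for the class where it was seen more often. The paper's actual classifier is not specified in the abstract; this is a hedged stand-in:

```python
import re
from collections import Counter

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())

def train(labelled_urls):
    """Count token occurrences per class from (url, is_relevant) pairs."""
    counts = {True: Counter(), False: Counter()}
    for url, relevant in labelled_urls:
        counts[relevant].update(url_tokens(url))
    return counts

def is_relevant(url, counts):
    """Each token votes for the class in which it appeared more often."""
    score = 0
    for tok in url_tokens(url):
        score += (counts[True][tok] > counts[False][tok])
        score -= (counts[True][tok] < counts[False][tok])
    return score > 0

training = [("http://example.org/product/101", True),
            ("http://example.org/product/102", True),
            ("http://example.org/about/contact", False)]
model = train(training)
print(is_relevant("http://example.org/product/999", model))  # True
```

Because the decision uses only the URL string, the crawler can discard a link before issuing any request, which is where the bandwidth saving comes from.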