A Brief History of Web Crawlers

Author: Bochmann Gregor V.
Dinçktürk Mustafa Emre
Hooshmand Salman
Jourdan Guy-Vincent
Mirtaheri Seyed M.
Onut Iosif Viorel
Publication date: 04/05/2014

Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing the applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on the application. The quick expansion of the web and the complexity added to web applications have made the process of crawling a very challenging one. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face. Different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem. Additionally, capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers. Based on these criteria, we plot the evolution of web crawlers and compare their performance.

arXiv.org e-Print Archive

CiteSeerX

Digital Creativity Support for Original Journalism

Author: Apostolou D.
Brown A.
Holm B.
Maiden N.
Nyre L.
Tonheim A.
van der Beld A.
Zachos K.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/07/2020

The decline in circulations and revenues resulting from the digitalization of news production and consumption has led to a crisis in journalism. Journalists have less time to research, investigate and write original stories, leading to problems for our democratic processes and for holding the powerful to account. This paper reports the architecture, features and rationale for new digital creativity support designed to help journalists discover more original angles on stories. It also summarises the evaluation of the tool’s use in 3 newsrooms.

City Research Online

Global-Scale Resource Survey and Performance Monitoring of Public OGC Web Map Services

Author: Cao Jun
Cheng Xiaoqiang
Gui Zhipeng
Liu Xiaojing
Wu Huayi
Publication venue: MDPI AG
Publication date: 01/06/2016

One of the most widely implemented service standards provided by the Open Geospatial Consortium (OGC) to the user community is the Web Map Service (WMS). WMS is widely employed globally, but there is limited knowledge of the global distribution, adoption status or service quality of these online WMS resources. To fill this void, we investigated global WMS resources and performed distributed performance monitoring of these services. This paper explicates a distributed monitoring framework that was used to monitor 46,296 WMSs continuously for over one year and a crawling method to discover these WMSs. We analyzed server locations, provider types, themes, the spatiotemporal coverage of map layers and the service versions for 41,703 valid WMSs. Furthermore, we appraised the stability and performance of basic operations (i.e., GetCapabilities and GetMap) for 1210 selected WMSs. We discuss the major reasons for request errors and performance issues, as well as the relationship between service response times and the spatiotemporal distribution of client monitoring sites. This paper will help service providers, end users and developers of standards to grasp the status of global WMS resources, as well as to understand the adoption status of OGC standards. The conclusions drawn in this paper can benefit geospatial resource discovery and service performance evaluation, and can guide service performance improvements. Comment: 24 pages; 15 figures

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

How to Ask for Technical Help? Evidence-based Guidelines for Writing Questions on Stack Overflow

Author: Calefato Fabio
Lanubile Filippo
Novielli Nicole
Publication venue: Elsevier BV
Publication date: 24/11/2017

Context: The success of Stack Overflow and other community-based question-and-answer (Q&A) sites depends mainly on the willingness of their members to answer others' questions. In fact, when formulating requests on Q&A sites, we are not simply seeking information. Instead, we are also asking for other people's help and feedback. Understanding the dynamics of participation in Q&A communities is essential to improve the value of crowdsourced knowledge. Objective: In this paper, we investigate how information seekers can increase the chance of eliciting a successful answer to their questions on Stack Overflow by focusing on the following actionable factors: affect, presentation quality, and time. Method: We develop a conceptual framework of factors potentially influencing the success of questions on Stack Overflow. We quantitatively analyze a set of over 87K questions from the official Stack Overflow dump to assess the impact of actionable factors on the success of technical requests. The information seeker's reputation is included as a control factor. Furthermore, to understand the role played by affective states in the success of questions, we qualitatively analyze questions containing positive and negative emotions. Finally, a survey is conducted to understand how Stack Overflow users perceive the guideline suggestions for writing questions. Results: We found that, regardless of user reputation, successful questions are short, contain code snippets, and do not overuse uppercase characters. As regards affect, successful questions adopt a neutral emotional style. Conclusion: We provide evidence-based guidelines for writing effective questions on Stack Overflow that software engineers can follow to increase the chance of getting technical help. As for the role of affect, we empirically confirmed community guidelines that suggest avoiding rudeness in question writing. Comment: Preprint, to appear in Information and Software Technology

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Bari

Recommendation System for News Reader

Author: Athalye Shweta
Publication venue: SJSU ScholarWorks
Publication date: 01/04/2013

Recommendation systems help users find information and make decisions where they lack the knowledge required to judge a particular product. The available information dataset can also be huge, and recommendation systems help filter this data according to users' needs. Recommendation systems can be used in various ways to help their users sort information effectively. For a person who loves reading, this paper presents the research and implementation of a Recommendation System for a NewsReader Application using the Android Platform. The NewsReader Application proactively recommends news articles according to the reading habits of the user, recorded over a period of time, and also recommends currently trending articles. Recommendation systems and their implementations using various algorithms are the primary area of study for this project. This research paper compares and details popular recommendation algorithms, viz. content-based recommendation systems, collaborative recommendation systems, etc. Moreover, it also presents a more efficient hybrid approach that absorbs the best aspects of both the algorithms mentioned above, while trying to eliminate all the potential drawbacks observed.

SJSU ScholarWorks

Topic Detection and Tracking in Personal Search History

Author: Gupta Kamal Kant
Publication date: 01/07/2008

This thesis describes a system for tracking and detecting topics in personal search history. In particular, we developed a time-tracking tool that helps users analyze their time and discover their activity patterns. The system allows a user to specify interesting topics to monitor with a keyword description. The system then keeps track of the log and the time spent on each document and produces a time graph showing how much time has been spent on each monitored topic. The system can also detect new topics and potentially recommend relevant information about them to the user. This work has been integrated with the UCAIR Toolbar, a client-side agent. Considering the limited resources on the client side, we designed an efficient incremental algorithm for topic tracking and detection. Various unsupervised learning approaches have been considered to improve the accuracy of categorizing the user log into appropriate categories. Experiments show that our tool is effective in categorizing the documents into existing categories and detecting new useful categories. Moreover, the quality of categorization improves over time as more log data becomes available.

Illinois Digital Environment for Access to Learning and Scholarship Repository