Search CORE

1,221 research outputs found

A Brief History of Web Crawlers

Author: Bochmann Gregor V.
Dinçktürk Mustafa Emre
Hooshmand Salman
Jourdan Guy-Vincent
Mirtaheri Seyed M.
Onut Iosif Viorel
Publication venue
Publication date: 04/05/2014
Field of study

Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing the applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on the application. Quick expansion of the web, and the complexity added to web applications have made the process of crawling a very challenging one. Throughout the history of web crawling many researchers and industrial groups addressed different issues and challenges that web crawlers face. Different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl is a challenging question. Additionally capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of different technique and algorithms used from the early days of crawling up to the recent days. We introduce criteria to evaluate the relative performance of web crawlers. Based on these criteria we plot the evolution of web crawlers and compare their performanc

arXiv.org e-Print Archive

CiteSeerX

Global-Scale Resource Survey and Performance Monitoring of Public OGC Web Map Services

Author: Cao Jun
Cheng Xiaoqiang
Gui Zhipeng
Liu Xiaojing
Wu Huayi
Publication venue: 'MDPI AG'
Publication date: 01/06/2016
Field of study

One of the most widely-implemented service standards provided by the Open Geospatial Consortium (OGC) to the user community is the Web Map Service (WMS). WMS is widely employed globally, but there is limited knowledge of the global distribution, adoption status or the service quality of these online WMS resources. To fill this void, we investigated global WMSs resources and performed distributed performance monitoring of these services. This paper explicates a distributed monitoring framework that was used to monitor 46,296 WMSs continuously for over one year and a crawling method to discover these WMSs. We analyzed server locations, provider types, themes, the spatiotemporal coverage of map layers and the service versions for 41,703 valid WMSs. Furthermore, we appraised the stability and performance of basic operations for 1210 selected WMSs (i.e., GetCapabilities and GetMap). We discuss the major reasons for request errors and performance issues, as well as the relationship between service response times and the spatiotemporal distribution of client monitoring sites. This paper will help service providers, end users and developers of standards to grasp the status of global WMS resources, as well as to understand the adoption status of OGC standards. The conclusions drawn in this paper can benefit geospatial resource discovery, service performance evaluation and guide service performance improvements.Comment: 24 pages; 15 figure

arXiv.org e-Print Archive

Multidisciplinary Digital Publishing Institute

Directory of Open Access Journals

Developing an in house vulnerability scanner for detecting Template Injection, XSS, and DOM-XSS vulnerabilities

Author: Hauger Marius
Jensen Stig
Publication venue: 'University of Agder'
Publication date: 01/01/2023
Field of study

Web applications are becoming an essential part of today's digital world. However, with the increase in the usage of web applications, security threats have also become more prevalent. Cyber attackers can exploit vulnerabilities in web applications to steal sensitive information or take control of the system. To prevent these attacks, web application security must be given due consideration. Existing vulnerability scanners fail to detect Template Injection, XSS, and DOM-XSS vulnerabilities effectively. To bridge this gap in web application security, a customized in-house scanner is needed to quickly and accurately identify these vulnerabilities, enhancing manual security assessments of web applications. This thesis focused on developing a modular and extensible vulnerability scanner to detect Template Injection, XSS, and DOM-based XSS vulnerabilities in web applications. Testing the scanner against other free and open-source solutions on the market showed that it outperformed them on Template injection vulnerabilities and nearly all on XSS-type vulnerabilities. While the scanner has limitations, focusing on specific injection vulnerabilities can result in better performance

Agder University Research Archive

Personalization of Search Engine by Using Cache based Approach

Author: Krupali Bhaware, Shubham Narkhede, Prof. Neeranjan Chitare
Publication venue: Auricle Global Society of Education and Research
Publication date: 31/03/2018
Field of study

As profound web develops at a quick pace, there has been expanded enthusiasm for strategies that assistance effectively find profound web interfaces. Be that as it may, because of the extensive volume of web assets and the dynamic idea of profound web, accomplishing wide scope and high productivity is a testing issue. In this venture propose a three-organize structure, for productive reaping profound web interfaces. In the principal organize, web crawler performs website based hunting down focus pages with the assistance of web indexes, abstaining from going to countless. To accomplish more precise outcomes for an engaged creep, Web Crawler positions sites to organize exceedingly pertinent ones for a given theme. In the second stage the proposed framework opens the pages inside in application with the assistance of Jsoup API and preprocess it. At that point it plays out the word include of inquiry website pages. In the third stage the proposed framework performs recurrence examination in view of TF and IDF. It additionally utilizes a mix of TF*IDF for positioning website pages. To wipe out inclination on going to some very applicable connections in shrouded web registries, In this undertaking propose plan a connection tree information structure to accomplish more extensive scope for a site. Undertaking trial comes about on an arrangement of delegate areas demonstrate the nimbleness and precision of our proposed crawler structure, which productively recovers profound web interfaces from extensive scale locales and accomplishes higher reap rates than different crawlers utilizing Na�ve Bayes algorithms

International Journal on Future Revolution in Computer Science & Communication Engineering

Hyp3rArmor: reducing web application exposure to automated attacks

Author: Bestavros Azer
Koch William
Publication venue: Computer Science Department, Boston University
Publication date: 11/11/2016
Field of study

Web applications (webapps) are subjected constantly to automated, opportunistic attacks from autonomous robots (bots) engaged in reconnaissance to discover victims that may be vulnerable to specific exploits. This is a typical behavior found in botnet recruitment, worm propagation, largescale fingerprinting and vulnerability scanners. Most anti-bot techniques are deployed at the application layer, thus leaving the network stack of the webapp’s server exposed. In this paper we present a mechanism called Hyp3rArmor, that addresses this vulnerability by minimizing the webapp’s attack surface exposed to automated opportunistic attackers, for JavaScriptenabled web browser clients. Our solution uses port knocking to eliminate the webapp’s visible network footprint. Clients of the webapp are directed to a visible static web server to obtain JavaScript that authenticates the client to the webapp server (using port knocking) before making any requests to the webapp. Our implementation of Hyp3rArmor, which is compatible with all webapp architectures, has been deployed and used to defend single and multi-page websites on the Internet for 114 days. During this time period the static web server observed 964 attempted attacks that were deflected from the webapp, which was only accessed by authenticated clients. Our evaluation shows that in most cases client-side overheads were negligible and that server-side overheads were minimal. Hyp3rArmor is ideal for critical systems and legacy applications that must be accessible on the Internet. Additionally Hyp3rArmor is composable with other security tools, adding an additional layer to a defense in depth approach.This work has been supported by the National Science Foundation (NSF) awards #1430145, #1414119, and #1012798

Boston University Institutional Repository (OpenBU)

Web crawlers on a health related portal: Detection, characterisation and implications

Author: Jawaheer G
Kostkova P
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/12/2011
Field of study

Web crawlers are automated computer programs that visit websites in order to download their content. They are employed for non-malicious (search engine crawlers indexing websites) and malicious purposes (those breaching privacy by harvesting email addresses for unsolicited email promotion and spam databases). Whatever their usage, web crawlers need to be accurately identified in an analysis of the overall traffic to a website. Visits from web crawlers as well as from genuine users are recorded in the web server logs. In this paper, we analyse the web server logs of NRIC, a health related portal. We present the techniques used to identify malicious and non-malicious web crawlers from these logs, using a blacklist database and analysis of the characteristics of the online behaviour of malicious crawlers. We use visualisation to carry out sanity checks along the crawler removal process. We illustrate the use of these techniques using 3 months of web server logs from NRIC. We use a combination of visualisation and baseline measures from Google Analytics to demonstrate the efficacy of our techniques. Finally, we discuss the implications of our work on the analysis of the web traffic to a website using web server logs and on the interpretation of the results from such analysis. © 2011 IEEE

UCL Discovery

Herding Vulnerable Cats: A Statistical Approach to Disentangle Joint Responsibility for Web Security in Shared Hosting

Author: Böhme Rainer
Joosen Wouter
Korczyński Maciej
Moore Tyler
Noroozian Arman
Tajalizadehkhoob Samaneh
van Eeten Michel
van Goethem Tom
Publication venue
Publication date: 01/01/2017
Field of study

Hosting providers play a key role in fighting web compromise, but their ability to prevent abuse is constrained by the security practices of their own customers. {\em Shared} hosting, offers a unique perspective since customers operate under restricted privileges and providers retain more control over configurations. We present the first empirical analysis of the distribution of web security features and software patching practices in shared hosting providers, the influence of providers on these security practices, and their impact on web compromise rates. We construct provider-level features on the global market for shared hosting -- containing 1,259 providers -- by gathering indicators from 442,684 domains. Exploratory factor analysis of 15 indicators identifies four main latent factors that capture security efforts: content security, webmaster security, web infrastructure security and web application security. We confirm, via a fixed-effect regression model, that providers exert significant influence over the latter two factors, which are both related to the software stack in their hosting environment. Finally, by means of GLM regression analysis of these factors on phishing and malware abuse, we show that the four security and software patching factors explain between 10\% and 19\% of the variance in abuse at providers, after controlling for size. For web-application security for instance, we found that when a provider moves from the bottom 10\% to the best-performing 10\%, it would experience 4 times fewer phishing incidents. We show that providers have influence over patch levels--even higher in the stack, where CMSes can run as client-side software--and that this influence is tied to a substantial reduction in abuse levels

arXiv.org e-Print Archive

Search Interfaces on the Web: Querying and Characterizing

Author: Shestakov Denis
Publication venue: Turku Centre for Computer Science
Publication date: 12/06/2008
Field of study

Current-day web search engines (e.g., Google) do not crawl and index a significant portion of theWeb and, hence, web users relying on search engines only are unable to discover and access a large amount of information from the non-indexable part of the Web. Specifically, dynamic pages generated based on parameters provided by a user via web search forms (or search interfaces) are not indexed by search engines and cannot be found in searchers’ results. Such search interfaces provide web users with an online access to myriads of databases on the Web. In order to obtain some information from a web database of interest, a user issues his/her query by specifying query terms in a search form and receives the query results, a set of dynamic pages that embed required information from a database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agents including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale. In this thesis, our primary and key object of study is a huge portion of the Web (hereafter referred as the deep Web) hidden behind web search interfaces. We concentrate on three classes of problems around the deep Web: characterization of deep Web, finding and classifying deep web resources, and querying web databases. Characterizing deep Web: Though the term deep Web was coined in 2000, which is sufficiently long ago for any web-related concept/technology, we still do not know many important characteristics of the deep Web. Another matter of concern is that surveys of the deep Web existing so far are predominantly based on study of deep web sites in English. One can then expect that findings from these surveys may be biased, especially owing to a steady increase in non-English web content. In this way, surveying of national segments of the deep Web is of interest not only to national communities but to the whole web community as well. In this thesis, we propose two new methods for estimating the main parameters of deep Web. We use the suggested methods to estimate the scale of one specific national segment of the Web and report our findings. We also build and make publicly available a dataset describing more than 200 web databases from the national segment of the Web. Finding deep web resources: The deep Web has been growing at a very fast pace. It has been estimated that there are hundred thousands of deep web sites. Due to the huge volume of information in the deep Web, there has been a significant interest to approaches that allow users and computer applications to leverage this information. Most approaches assumed that search interfaces to web databases of interest are already discovered and known to query systems. However, such assumptions do not hold true mostly because of the large scale of the deep Web – indeed, for any given domain of interest there are too many web databases with relevant content. Thus, the ability to locate search interfaces to web databases becomes a key requirement for any application accessing the deep Web. In this thesis, we describe the architecture of the I-Crawler, a system for finding and classifying search interfaces. Specifically, the I-Crawler is intentionally designed to be used in deepWeb characterization studies and for constructing directories of deep web resources. Unlike almost all other approaches to the deep Web existing so far, the I-Crawler is able to recognize and analyze JavaScript-rich and non-HTML searchable forms. Querying web databases: Retrieving information by filling out web search forms is a typical task for a web user. This is all the more so as interfaces of conventional search engines are also web forms. At present, a user needs to manually provide input values to search interfaces and then extract required data from the pages with results. The manual filling out forms is not feasible and cumbersome in cases of complex queries but such kind of queries are essential for many web searches especially in the area of e-commerce. In this way, the automation of querying and retrieving data behind search interfaces is desirable and essential for such tasks as building domain-independent deep web crawlers and automated web agents, searching for domain-specific information (vertical search engines), and for extraction and integration of information from various deep web resources. We present a data model for representing search interfaces and discuss techniques for extracting field labels, client-side scripts and structured data from HTML pages. We also describe a representation of result pages and discuss how to extract and store results of form queries. Besides, we present a user-friendly and expressive form query language that allows one to retrieve information behind search interfaces and extract useful data from the result pages based on specified conditions. We implement a prototype system for querying web databases and describe its architecture and components design.Siirretty Doriast

UTUPub

A novel defense mechanism against web crawler intrusion

Author: Aghamohammadi Alireza
Publication venue: DigitalCommons@EMU
Publication date: 05/11/2013
Field of study

Web robots also known as crawlers or spiders are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawler to access websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides the pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without the proposed identification process. An experiment was performed with repeated measures for two groups with each group containing ninety web pages. The outputs of the logistic regression analysis of treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. This study concluded that the proposed five distinct identifier process is a very effective technique as demonstrated by a successful outcome

Eastern Michigan University: Digital Commons@EMU