313 research outputs found

A Brief History of Web Crawlers

Authors: Bochmann Gregor V.; Dinçktürk Mustafa Emre; Hooshmand Salman; Jourdan Guy-Vincent; Mirtaheri Seyed M.; Onut Iosif Viorel
Publication date: 04/05/2014

Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face. Different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem. Additionally, capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to recent days. We introduce criteria to evaluate the relative performance of web crawlers. Based on these criteria, we plot the evolution of web crawlers and compare their performance.

arXiv.org e-Print Archive

CiteSeerX

Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF)

Authors: Hiemstra Djoerd; Keulen Maurice van; Khelghati Mohammadreza
Publication venue: University of Twente, Centre for Telematics and Information Technology (CTIT)
Publication date: 01/01/2014

The growing need to access more and more information draws attention to the huge amount of data hidden behind web forms, known as the deep web. To make this data accessible, harvesters have a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester that can be applied to different settings and situations. To develop such a harvester, a number of issues should be considered. Among these issues, business domain features, targeted websites' features, and the harvesting goals are the most influential ones. To consider all these elements in one big picture, a new concept, called the harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_w) or a harvester (HF_h) representing the extent to which the website can be harvested or the harvester can harvest. The elements comprising these factors are different features of websites (for HF_w) or of harvesters (for HF_h). These features are presented in this paper by gathering a number of them from the literature and introducing new ones through the authors' experiments. In addition to enabling designers of websites or harvesters to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing general-purpose deep web harvesters. This framework helps fill the gap in designing general-purpose harvesters by focusing on detailed features of deep websites that affect harvesting processes. The features presented in this paper provide a thorough list of requirements for designing deep web harvesters, which, to the best of our knowledge, has not been done to this extent in the literature. To validate the effectiveness of the HF in practice, it is shown how the HF's elements can be applied in categorizing deep websites and how this is useful in designing a harvester. The harvester developed by the authors to run the experiments is also discussed in this paper.

Radboud Repository

University of Twente Research Information

Reverse Engineering and Testing of Rich Internet Applications

Author: Amalfitano Domenico
Publication date: 30/11/2011

The World Wide Web experiences a continuous and constant evolution, where new initiatives, standards, approaches and technologies are continuously proposed for developing more effective and higher quality Web applications. To satisfy the growing market demand for Web applications, new technologies, frameworks, tools and environments that make it possible to develop Web and mobile applications with the least effort and in a very short time have been introduced in recent years. These new technologies have made possible the dawn of a new generation of Web applications, named Rich Internet Applications (RIAs), that offer greater usability and interactivity than traditional ones. This evolution has been accompanied by some drawbacks that are mostly due to the failure to apply well-known software engineering practices and approaches. As a consequence, new research questions and challenges have emerged in the field of web and mobile application maintenance and testing. The research activity described in this thesis has addressed some of these topics with the specific aim of proposing new and effective solutions to the problems of modelling, reverse engineering, comprehending, re-documenting and testing existing RIAs. Due to the growing relevance of mobile applications in the renewed Web scenarios, the problem of testing mobile applications developed for the Android operating system has been addressed too, in an attempt to explore and propose new techniques of testing automation for this type of application.

Università degli Studi di Napoli Federico II Open Archive

Towards the detection and analysis of performance regression introducing code changes

Author: ALShoaibi Deema Adeeb
Publication venue: RIT Scholar Works
Publication date: 01/11/2022

In contemporary software development, developers commonly conduct regression testing to ensure that code changes do not affect software quality. Performance regression testing is an emerging research area from the regression testing domain in software engineering. Performance regression testing aims to maintain the system's performance. Conducting performance regression testing is known to be expensive. It is also complex, considering the increase in committed code and in development team members working simultaneously. Many automated regression testing techniques have been proposed in prior research. However, challenges in the practice of locating and resolving performance regression still exist. Directing regression testing to the commit level provides solutions to locate the root cause, yet it hinders the development process. This thesis outlines motivations and solutions to address locating performance regression root causes. First, we challenge a deterministic state-of-the-art approach by expanding the testing data to find improvement areas. The deterministic approach was found to be limited in searching for the best regression-locating rule. Thus, we present two stochastic approaches to develop models that can learn from historical commits. The first stochastic approach views the research problem as a search-based optimization problem seeking to reach the highest detection rate. We apply different multi-objective evolutionary algorithms and compare them. This thesis also investigates whether simplifying the search space by combining objectives would achieve comparable results. The second stochastic approach addresses the severity of the class imbalance any system could have, since code changes introducing regression are rare but costly. We formulate the identification of problematic commits that introduce performance regression as a binary classification problem that handles class imbalance.
Further, the thesis provides an exploratory study on the challenges developers face in resolving performance regression. The study is based on the questions posted on a technical forum directed to performance regression. We collected around 2k questions discussing the regression of software execution time, and all were manually analyzed. The study resulted in a categorization of the challenges. We also discuss the difficulty level of performance regression issues within the development community. This study provides insights to help developers avoid regression causes during software design and implementation.

RIT Scholar Works

Augmenting Wiring Diagrams of Neural Circuits with Activity in Larval Drosophila

Author: Champion Andrew
Publication venue: University of Cambridge
Publication date: 01/11/2020

Neural circuit models explain an animal's behavior as evoked activity of different circuit elements of its nervous system. Synaptic wiring diagrams mapped from structural imaging of nervous systems guide modeling of neural circuits on the basis of connectivity. However, connectivity alone may not sufficiently constrain the set of plausible circuit hypotheses for empirical study. Combining structural imaging of synaptic connectivity with functional information from activity imaging can further constrain these hypotheses to create unequivocal neural circuit models. This thesis develops computational methods and tools to cross-reference structural and activity imaging of explant larval Drosophila central nervous systems at cellular resolution. Augmenting synaptic wiring diagrams with activity maps via these methods relates circuit structure and function at the neuronal level on a per-behavior basis. Neuronal activity of larval central nervous systems expressing pan-neuronal calcium indicators is imaged in a light sheet microscope; the same nervous systems are then structurally imaged with high-throughput electron microscopy. Methods and tools are provided for the assembly of these image volumes, spatial registration between imaging modalities, automated detection of relevant tissue and cellular structures in each, extraction of activity time series, and morphological identification of neurons in structural imaging using reference wiring diagrams mapped from other larvae. Using these methods, existing wiring diagrams mapped from a reference first instar larva were identified with neurons in a larva augmented with activity information for a neural circuit involved in peristaltic motor control. This demonstrates the feasibility of the contributed methods to associate the wiring diagrams of arbitrary circuits of interest with activity time series across multiple individuals, behaviors, and behavioral bouts.
To demonstrate the capability to augment wiring diagrams with information besides activity, these methods are also applied to multiple larvae, each expressing specific neurotransmitter labels rather than calcium indicators in the light sheet microscopy. This work scaffolds future modeling of circuits underlying behavior that can only be mechanistically understood in the context of multi-modal observation of synaptic connectivity, functional activity and molecular markers. The methods developed also enable comparative connectomics between multiple individuals, which is necessary to study inter-individual variability in circuits and to observe experimental intervention in the development, structure, and function of neural circuits. Howard Hughes Medical Institute Janelia Research Campus

Apollo (Cambridge)

The Viuva Negra crawler

Authors: Gomes Daniel; Silva Mário J.
Publication venue: Department of Informatics, University of Lisbon
Publication date: 01/11/2006

This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe situations found on the web that are hazardous to crawling and the solutions adopted to mitigate their effects. The gathered information was integrated in a web warehouse that provides support for its automatic processing by text mining applications.

Universidade de Lisboa: Repositório.UL