
    Ranking Bias in Deep Web Size Estimation Using Capture Recapture Method

    Many deep web data sources are ranked data sources, i.e., they rank the matched documents and return at most the top k results even when more than k documents match the query. When estimating the size of such a ranked deep web data source, it is well known that there is a ranking bias: traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, for example by avoiding overflowing queries during the sampling process, or by adjusting the initial estimate with a fixed function. We observe that the overflow rate has a direct impact on the accuracy of the estimation: under certain conditions, the actual size is close to the estimate obtained by the unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.
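    The core idea lends itself to a short sketch. Below is a minimal Python illustration of the classic unranked capture-recapture (Lincoln-Petersen) estimator together with an overflow-rate correction. The abstract does not give the paper's exact definition of the overflow rate; the version used here (the average ratio of documents matched to documents returned per query) is an assumption for illustration, not the paper's method.

```python
# Minimal sketch, not the paper's implementation. The overflow-rate
# definition below (average matched/returned ratio per query) is an
# assumption; the abstract does not spell it out.

def lincoln_petersen(sample_a, sample_b):
    """Unranked capture-recapture (Lincoln-Petersen) size estimate.

    sample_a, sample_b -- sets of document ids collected by two
    independent batches of random queries against the source.
    """
    recaptured = len(sample_a & sample_b)
    if recaptured == 0:
        raise ValueError("no recaptured documents; draw larger samples")
    return len(sample_a) * len(sample_b) / recaptured

def overflow_adjusted_estimate(sample_a, sample_b, matched, returned):
    """Scale the unranked estimate by the overflow rate.

    matched, returned -- per-query counts: documents the source reports
    as matching vs. documents actually returned (capped at top-k).
    """
    base = lincoln_petersen(sample_a, sample_b)
    overflow_rate = sum(m / r for m, r in zip(matched, returned)) / len(matched)
    return base * overflow_rate
```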

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed for one domain in other domains.

    A Distributed Approach to Crawl Domain Specific Hidden Web

    A large amount of online information resides on the invisible web: web pages generated dynamically from databases and other data sources that are hidden from current crawlers, which retrieve content only from the publicly indexable Web. In particular, such crawlers ignore the tremendous amount of high-quality content hidden behind search forms, as well as pages that require authorization or prior registration, in large searchable electronic databases. To extract data from the hidden web, it is necessary to find the search forms and fill them with appropriate values so as to retrieve the maximum amount of relevant information. To meet the complex challenges that arise when searching the hidden web, namely the extensive analysis of search forms as well as of the retrieved information, it becomes essential to design and implement a distributed web crawler that runs on a network of workstations. We describe the software architecture of this distributed and scalable system and present a number of novel techniques that went into its design and implementation to extract the maximum amount of relevant data from the hidden web while achieving high performance.
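    The form-centric step of such a crawler can be sketched briefly. The fragment below, assuming the requests and BeautifulSoup libraries, shows how a worker might locate search forms on a page and probe one with a query term; the distributed coordination layer and domain filtering are omitted, and the helper names (find_search_forms, fill_and_submit) are illustrative, not taken from the paper.

```python
# A minimal sketch of hidden-web form discovery and probing, assuming
# the requests and beautifulsoup4 packages. Not the paper's system.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def find_search_forms(url):
    """Fetch a page and return its <form> elements (hidden-web entry points)."""
    html = requests.get(url, timeout=10).text
    return BeautifulSoup(html, "html.parser").find_all("form")

def fill_and_submit(base_url, form, query_term):
    """Fill the text inputs of a form with a query term and submit it."""
    action = urljoin(base_url, form.get("action") or base_url)
    data = {}
    for inp in form.find_all("input"):
        name = inp.get("name")
        if not name:
            continue
        if inp.get("type", "text") == "text":
            data[name] = query_term           # probe the hidden database
        else:
            data[name] = inp.get("value", "")  # keep hidden/default values
    if (form.get("method") or "get").lower() == "post":
        return requests.post(action, data=data, timeout=10)
    return requests.get(action, params=data, timeout=10)
```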

    Designing A General Deep Web Access Approach Based On A Newly Introduced Factor; Harvestability Factor (HF)

    The growing need to access more and more information draws attention to the huge amount of data hidden behind web forms, known as the deep web. To make this data accessible, harvesters play a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester that can be applied to different settings and situations. To develop such a harvester, a number of issues must be considered; among them, the features of the business domain, the features of the targeted websites, and the harvesting goals are the most influential. To capture all these elements in one big picture, a new concept, called the harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_W) or a harvester (HF_H) representing the extent to which the website can be harvested or the harvester can harvest. The elements comprising these factors are features of websites (for HF_W) or of harvesters (for HF_H). These features are presented in this paper by gathering a number of them from the literature and introducing new ones through the authors' experiments. In addition to enabling the designers of websites or harvesters to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing general-purpose deep web harvesters. This framework fills a gap in the design of general-purpose harvesters by focusing on the detailed features of deep websites that affect the harvesting process. The features presented in this paper provide a thorough list of requirements for designing deep web harvesters, which, to the best of our knowledge, has not been done in the literature to this extent. To validate the effectiveness of the HF in practice, it is shown how the HF's elements can be applied in categorizing deep websites and how this is useful in designing a harvester. The harvester developed by the authors to run the experiments is also discussed in this paper.

    Designing a general deep web harvester by harvestability factor

    To make deep web data accessible, harvesters play a crucial role. Targeting different domains and websites increases the need for a general-purpose harvester that can be applied to different settings and situations. To develop such a harvester, a large number of issues must be addressed. To capture all influential elements in one big picture, a new concept, called the harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_W) or a harvester (HF_H) representing the extent to which the website can be harvested or the harvester can harvest. The elements comprising these factors are features of websites or harvesters, gathered from the literature or introduced through the authors' experiments. In addition to enabling designers to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing harvesters: designers can define the list of features and prioritize their implementation. To validate the effectiveness of the HF in practice, it is shown how the elements of HF_W can be applied in categorizing deep websites and how this is useful in designing a harvester. To validate HF_H as an evaluation metric, it is shown how it can be calculated for the harvester implemented by the authors. The results show that the developed harvester performs well on the targeted test set, with a score of 14.783 out of 15.
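    As a rough illustration of how such a score might be computed, the sketch below treats HF_H as a weighted checklist of harvester features, consistent with the "14.783 out of 15" style of score reported above. The feature names, weights, and support degrees are placeholder assumptions; the actual lists come from the paper itself.

```python
# Hedged sketch of an HF_H-style score as a weighted feature checklist.
# Feature names, weights, and degrees below are illustrative placeholders.
HARVESTER_FEATURES = {
    # feature: (weight, degree of support in [0, 1])
    "handles_javascript_forms": (3.0, 1.0),
    "handles_pagination":       (3.0, 1.0),
    "handles_authentication":   (3.0, 0.9),
    "rate_limit_compliance":    (3.0, 1.0),
    "result_deduplication":     (3.0, 1.0),
}

def harvestability_factor(features):
    """HF_H = sum of weight * degree; the maximum is the sum of weights."""
    score = sum(w * d for w, d in features.values())
    max_score = sum(w for w, _ in features.values())
    return score, max_score

score, max_score = harvestability_factor(HARVESTER_FEATURES)
print(f"HF_H = {score:.3f} of {max_score:.0f}")
```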