817 research outputs found

    A Distributed Approach to Crawl Domain Specific Hidden Web

    Get PDF
    A large amount of online information resides on the invisible web: web pages generated dynamically from databases and other data sources that are hidden from current crawlers, which retrieve content only from the publicly indexable Web. In particular, such crawlers ignore the tremendous amount of high-quality content hidden behind search forms and behind pages that require authorization or prior registration in large searchable electronic databases. To extract data from the hidden web, it is necessary to find the search forms and fill them with appropriate values so that the maximum amount of relevant information can be retrieved. Because searching the hidden web involves extensive analysis of both search forms and the retrieved results, it becomes essential to design and implement a distributed web crawler that runs on a network of workstations to extract data from the hidden web. We describe the software architecture of this distributed and scalable system and present a number of novel techniques that went into its design and implementation to extract the maximum amount of relevant data from the hidden web while achieving high performance.
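    The abstract does not include code; the following is a minimal sketch of the form-discovery step it describes, assuming Python with the requests and beautifulsoup4 packages. The URL and the "looks like a search form" heuristic are illustrative, not taken from the paper.

```python
# Minimal sketch of discovering candidate search forms on a page.
# Assumption: a form with at least one free-text input may front a database.
import requests
from bs4 import BeautifulSoup

def find_search_forms(url):
    """Return forms on `url` that contain at least one free-text input."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    search_forms = []
    for form in soup.find_all("form"):
        text_inputs = [
            inp for inp in form.find_all("input")
            if inp.get("type", "text").lower() in ("text", "search")
        ]
        if text_inputs:  # crude signal that this form queries a hidden database
            search_forms.append({
                "action": form.get("action"),
                "method": (form.get("method") or "get").lower(),
                "fields": [inp.get("name") for inp in text_inputs],
            })
    return search_forms

if __name__ == "__main__":
    for f in find_search_forms("https://example.org"):
        print(f)
```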

    A Brief History of Web Crawlers

    Full text link
    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, and capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers and, based on these criteria, plot the evolution of web crawlers and compare their performance.

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains.

    WAQS: a web-based approximate query system

    Get PDF
    The Web is often viewed as a gigantic database holding vast stores of information and providing ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and in the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language, the Structured Query Language (SQL), is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow more detailed retrieval and hence to reduce the number of matches returned by typical search engines. Its main objective is to allow queries to be based not just on keywords but also on the location of the keywords within the logical structure of a document. In addition, the technique provides approximate search capabilities based on the notions of Distance and Variable Length Don't Cares. The proposed techniques have been implemented in a system called the Web-Based Approximate Query System, which contains an SQL-like query language called the Web-Based Approximate Query Language. The Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine, giving EnviroDaemon more detailed search capabilities than keyword-based search alone. Implementation details, technical results and future work are presented in this dissertation.
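    The two approximate-matching notions mentioned above can be illustrated with a short sketch: variable-length don't-cares (written here as '*') and an edit-distance bound on keywords. The syntax and function names are hypothetical illustrations, not the actual WAQL language.

```python
# Illustrative sketch: VLDC patterns via regex, plus distance-bounded matching.
import re

def vldc_to_regex(pattern):
    """Turn a pattern like 'environ*protect*agency' into a regex where
    '*' matches any (possibly empty) run of characters."""
    parts = [re.escape(p) for p in pattern.split("*")]
    return re.compile(".*".join(parts), re.IGNORECASE)

def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def approx_match(keyword, text, max_dist=1):
    """True if some word in `text` is within `max_dist` edits of `keyword`."""
    return any(edit_distance(keyword.lower(), w.lower()) <= max_dist
               for w in re.findall(r"\w+", text))

doc = "The Environmental Protection Agency publishes air quality data."
print(bool(vldc_to_regex("environ*protect*agency").search(doc)))  # True
print(approx_match("qualty", doc))  # True: one edit away from 'quality'
```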

    Knowledge Classification Agent

    Get PDF
    This paper is about a knowledge classification agent. The system is a knowledge-management application that integrates with the usual way of searching for information on the internet via a search engine. Search engines, particularly in Malaysia, are not as effective as they could be: with a normal search engine, the results are too general and often do not suit the user's request. This system uses an agent that groups the results into related groups or categories. Before producing the final results, the system categorizes them by the common senses of the keyword; for example, for the keyword 'beetles' the agent forms groups such as beetles as cars, beetles as insects and beetles as a music band. The target users are people who rely on the internet as an information resource, such as academicians and researchers. The main problem motivating the project is that users need classified information in order to choose the best results. The objectives of the project are to group results into related categories so that users can find the requested information efficiently and save time. The author plans to use the spiral model as the development methodology and believes that this system can solve the problems stated above.
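    As a rough illustration of the grouping idea (the paper gives no implementation details), result snippets for an ambiguous keyword can be bucketed by the context words they contain. The category lexicon and snippets below are invented for illustration.

```python
# Minimal sketch: assign each snippet to the category whose lexicon
# overlaps it most, then group snippets by category.
CATEGORIES = {
    "cars": {"volkswagen", "engine", "car", "vehicle"},
    "insects": {"insect", "species", "larvae", "wings"},
    "music band": {"band", "album", "song", "concert"},
}

def classify(snippet):
    words = set(snippet.lower().split())
    best = max(CATEGORIES, key=lambda c: len(CATEGORIES[c] & words))
    return best if CATEGORIES[best] & words else "other"

results = [
    "Volkswagen revives the classic Beetle car with a new engine",
    "Dung beetles are insect species that recycle nutrients",
    "The band released a new album of beetle-themed songs",
]
groups = {}
for r in results:
    groups.setdefault(classify(r), []).append(r)
print(groups)
```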

    A Domain Based Approach to Crawl the Hidden Web

    Get PDF
    There is a lot of research work being performed on indexing the Web, and more and more sophisticated Web crawlers are being designed to search and index the Web faster. However, all of these traditional crawlers crawl only the part of the Web we call the "Surface Web"; they are unable to crawl the hidden portion of the Web. Traditional crawlers retrieve content only from surface Web pages, which are simply a set of Web pages linked by hyperlinks, and thereby ignore the tremendous amount of information hidden behind the search forms in Web pages. Much of the published research has focused on detecting such searchable forms and making a systematic search over them. Our approach is based on a Web crawler that analyzes search forms and fills them with appropriate content to retrieve the maximum amount of relevant information from the underlying database.
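    The form-filling step the abstract describes can be sketched as follows, building on a discovered-form record like the one in the earlier form-discovery sketch. This assumes Python with requests; the domain vocabulary, field choice and URLs are illustrative assumptions, not the paper's method.

```python
# Sketch: submit a discovered search form with values from a domain vocabulary.
from urllib.parse import urljoin
import requests

DOMAIN_TERMS = ["used cars", "sedan", "hatchback"]  # example domain vocabulary

def submit_form(page_url, form, term):
    """Fill the first text field of `form` with `term` and submit it.
    `form` is a dict such as {"action": ..., "method": "get", "fields": [...]}."""
    target = urljoin(page_url, form["action"] or page_url)
    data = {form["fields"][0]: term}
    if form["method"] == "post":
        return requests.post(target, data=data, timeout=10)
    return requests.get(target, params=data, timeout=10)

# Hypothetical usage: try each domain term against a discovered form and
# score the result pages for relevance before choosing further terms.
# responses = [submit_form("https://example.org/search.html", discovered_form, t)
#              for t in DOMAIN_TERMS]
```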

    Search Interfaces on the Web: Querying and Characterizing

    Get PDF
    Current-day web search engines (e.g., Google) do not crawl and index a significant portion of the Web and, hence, web users relying on search engines alone are unable to discover and access a large amount of information in the non-indexable part of the Web. Specifically, dynamic pages generated from parameters provided by a user via web search forms (or search interfaces) are not indexed by search engines and cannot be found in searchers' results. Such search interfaces provide web users with online access to myriad databases on the Web. In order to obtain information from a web database of interest, a user issues a query by specifying query terms in a search form and receives the query results: a set of dynamic pages that embed the required information from the database. At the same time, issuing a query via an arbitrary search interface is an extremely complex task for any kind of automatic agent, including web crawlers, which, at least up to the present day, do not even attempt to pass through web forms on a large scale. In this thesis, our primary object of study is the huge portion of the Web (hereafter referred to as the deep Web) hidden behind web search interfaces. We concentrate on three classes of problems around the deep Web: characterizing the deep Web, finding and classifying deep web resources, and querying web databases. Characterizing the deep Web: Though the term deep Web was coined in 2000, a long time ago for any web-related concept or technology, we still do not know many important characteristics of the deep Web. Another concern is that existing surveys of the deep Web are predominantly based on studies of deep web sites in English, so their findings may be biased, especially given the steady increase in non-English web content. Surveying national segments of the deep Web is therefore of interest not only to national communities but to the whole web community. In this thesis, we propose two new methods for estimating the main parameters of the deep Web, use them to estimate the scale of one specific national segment of the Web, and report our findings. We also build and make publicly available a dataset describing more than 200 web databases from that national segment. Finding deep web resources: The deep Web has been growing at a very fast pace; it has been estimated that there are hundreds of thousands of deep web sites. Due to the huge volume of information in the deep Web, there has been significant interest in approaches that allow users and computer applications to leverage this information. Most approaches assume that the search interfaces to the web databases of interest are already discovered and known to query systems. Such assumptions rarely hold, mostly because of the large scale of the deep Web: for any given domain of interest there are too many web databases with relevant content. Thus, the ability to locate search interfaces to web databases becomes a key requirement for any application accessing the deep Web. In this thesis, we describe the architecture of the I-Crawler, a system for finding and classifying search interfaces. The I-Crawler is intentionally designed to be used in deep Web characterization studies and for constructing directories of deep web resources.
Unlike almost all other approaches to the deep Web so far, the I-Crawler is able to recognize and analyze JavaScript-rich and non-HTML searchable forms. Querying web databases: Retrieving information by filling out web search forms is a typical task for a web user, all the more so because the interfaces of conventional search engines are themselves web forms. At present, a user must manually provide input values to search interfaces and then extract the required data from the result pages. Filling out forms manually is cumbersome and often infeasible for complex queries, yet such queries are essential for many web searches, especially in e-commerce. Automating the querying and retrieval of data behind search interfaces is therefore desirable and essential for tasks such as building domain-independent deep web crawlers and automated web agents, searching for domain-specific information (vertical search engines), and extracting and integrating information from various deep web resources. We present a data model for representing search interfaces and discuss techniques for extracting field labels, client-side scripts and structured data from HTML pages. We also describe a representation of result pages and discuss how to extract and store the results of form queries. In addition, we present a user-friendly and expressive form query language that allows one to retrieve information behind search interfaces and extract useful data from the result pages based on specified conditions. We implement a prototype system for querying web databases and describe its architecture and component design.
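    A data model for representing search interfaces, along the general lines described above (fields with labels, types and value domains, plus a submission endpoint), might look like the following sketch. The class and field names are assumptions for illustration, not the thesis's actual model.

```python
# Illustrative sketch of a search-interface data model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FormField:
    name: str                     # HTML name attribute
    label: Optional[str] = None   # human-readable label extracted from the page
    kind: str = "text"            # text, select, checkbox, radio, hidden ...
    options: List[str] = field(default_factory=list)  # for select/radio fields

@dataclass
class SearchInterface:
    url: str                      # page hosting the form
    action: str                   # submission endpoint
    method: str = "get"
    fields: List[FormField] = field(default_factory=list)

    def query_fields(self):
        """Fields a crawler would actually fill in (ignore hidden inputs)."""
        return [f for f in self.fields if f.kind != "hidden"]

iface = SearchInterface(
    url="https://example.org/books",
    action="/search",
    fields=[FormField("title", label="Book title"),
            FormField("format", label="Format", kind="select",
                      options=["hardcover", "paperback", "ebook"])],
)
print([f.label for f in iface.query_fields()])
```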

    Bot crawler to retrieve data from Facebook based on the selection of posts and the extraction of user profiles

    Get PDF
    Introduction: Data can now be found both within organizations and outside of them, and it is growing exponentially. Today, the information available on the Internet and on social networks has become a generator of value: effective analysis of a specific situation, using appropriate techniques and methodologies, makes it possible to propose content-based solutions and thus carry out timely, intelligent and assertive decision-making processes. Objective: The main objective of this work is to develop a bot crawler that extracts information from Facebook without access restrictions or requests for credentials, based on web crawling and scraping techniques that select HTML tags in order to track content and identify patterns. Method: The development of this project consisted of four main stages: A) teamwork with SCRUM, B) comparison of web data extraction techniques, C) extraction and validation of the permissions needed to access Facebook data, and D) development of the bot crawler. Conclusions: As a result of this process, a graphical interface is created that allows checking the process of obtaining data derived from user profiles of this social network.
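    The tag-selection idea described in the objective can be sketched generically: fetch a public page and pull out post text and author names by CSS selector. The URL and selectors below are placeholders; the abstract does not give the actual markup, which in any case changes frequently.

```python
# Generic sketch of scraping posts by CSS selector (placeholder selectors).
import requests
from bs4 import BeautifulSoup

def extract_posts(url, post_selector="div.post", author_selector="span.author"):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.select(post_selector):
        author = node.select_one(author_selector)
        posts.append({
            "author": author.get_text(strip=True) if author else None,
            "text": node.get_text(" ", strip=True),
        })
    return posts

if __name__ == "__main__":
    for p in extract_posts("https://example.org/public-page"):
        print(p)
```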

    Personalization of Search Engine by Using Cache based Approach

    Get PDF
    As the deep web grows at a rapid pace, there has been increased interest in techniques that help locate deep web interfaces efficiently. However, because of the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging problem. This project proposes a three-stage framework for efficiently harvesting deep web interfaces. In the first stage, the web crawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages; to obtain more accurate results for a focused crawl, the crawler ranks websites so as to prioritize highly relevant ones for a given topic. In the second stage, the proposed system opens the pages within the application with the help of the Jsoup API, preprocesses them, and performs word counts on the query web pages. In the third stage, the proposed system performs frequency analysis based on TF and IDF and uses the combined TF*IDF score to rank web pages. To eliminate the bias toward visiting some highly relevant links in hidden web directories, the project proposes a link-tree data structure to achieve wider coverage of a site. Experimental results on a set of representative domains demonstrate the agility and accuracy of the proposed crawler framework, which efficiently retrieves deep web interfaces from large-scale sites and achieves higher harvest rates than other crawlers that use Naïve Bayes algorithms.
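    The third stage's TF*IDF ranking can be sketched as follows: score each fetched page against the query terms and rank pages by the sum of TF*IDF weights. The described system works in Java with Jsoup; this is an illustrative Python analogue over plain strings, with the example pages invented.

```python
# Sketch of TF*IDF scoring and ranking of fetched pages against a query.
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def rank_pages(pages, query):
    """pages: dict of {url: text}. Returns urls sorted by summed TF*IDF score."""
    docs = {url: Counter(tokenize(text)) for url, text in pages.items()}
    n = len(docs)
    scores = {}
    for url, tf in docs.items():
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for d in docs.values() if term in d)
            if df:
                idf = math.log(n / df)
                score += (tf[term] / sum(tf.values())) * idf
        scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)

pages = {
    "a.html": "deep web interfaces and search forms for deep web crawling",
    "b.html": "surface web pages linked by hyperlinks",
}
print(rank_pages(pages, "deep web interfaces"))  # a.html ranks first
```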

    EASIER: An Approach to Automatically Generate Active Ontologies for Intelligent Assistants

    Get PDF
    Intelligent assistants are ubiquitous and will grow in importance. Apple's well-known assistant Siri uses Active Ontologies to process user input and to model the provided functionalities. Supporting new features requires extending the ontologies or even building new ones. The question is no longer "How to build an intelligent assistant?" but "How to do it efficiently?" We propose EASIER, an approach to automate building and extending Active Ontologies. EASIER identifies new services automatically and classifies unseen service providers with a clustering-based approach. It proposes ontology elements for new service categories and service providers, respectively, to ease ontology building. We evaluate EASIER with 292 form-based web services and two different clustering algorithms from Weka, DBSCAN and spectral clustering. DBSCAN achieves an F1 score of 51% in a ten-fold cross-validation but is outperformed by spectral clustering, which achieves an F1 score of 70%.
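    The clustering step can be illustrated with a short sketch: represent each form-based web service as a feature vector and group similar services with DBSCAN. The evaluation above used Weka; scikit-learn is substituted here, and the feature encoding (counts of field types per form) is an assumption for illustration.

```python
# Sketch: cluster form-based web services by simple field-type features.
import numpy as np
from sklearn.cluster import DBSCAN

# One row per service: [text fields, select fields, checkboxes, date fields]
features = np.array([
    [2, 0, 0, 2],   # travel-style form: two text fields, two date fields
    [2, 1, 0, 2],   # similar travel form with one dropdown
    [1, 0, 0, 2],   # another travel form
    [1, 3, 5, 0],   # shopping-style form with many filters
    [1, 4, 6, 0],   # similar shopping form
])

labels = DBSCAN(eps=1.5, min_samples=2).fit_predict(features)
print(labels)  # services with the same label are treated as one category
```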