100 research outputs found

    A Study of Focused Web Crawling Techniques

    In recent years, the volume of data on the Web has grown exponentially, making it increasingly difficult to locate accurate and significant information. Web crawlers are programs that discover web pages on the World Wide Web by following hyperlinks. Search engines index these pages so that they can later be retrieved in response to user queries. The immense size and diversity of the Web make it impossible for any single crawler to retrieve every piece of relevant information, so many variations of Web crawling techniques have emerged as an active research area. In this paper, we survey learnable focused crawlers.
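
    The survey's framing — a crawler that follows hyperlinks but expands only topic-relevant pages — can be illustrated with a minimal best-first sketch. The keyword set, scoring rule, threshold, and use of requests/BeautifulSoup below are illustrative assumptions, not details taken from the paper.

```python
# Minimal focused-crawler sketch: a best-first crawl that follows hyperlinks
# but only expands pages whose text matches an assumed topic keyword set.
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TOPIC_KEYWORDS = {"crawler", "focused", "web", "search"}  # assumed topic terms

def relevance(text: str) -> float:
    """Fraction of topic keywords occurring in the page text (toy measure)."""
    words = set(text.lower().split())
    return len(TOPIC_KEYWORDS & words) / len(TOPIC_KEYWORDS)

def focused_crawl(seed_urls, max_pages=50, threshold=0.25):
    frontier = [(-1.0, url) for url in seed_urls]   # max-heap via negated score
    heapq.heapify(frontier)
    visited, harvested = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" "))
        if score >= threshold:
            harvested.append((url, score))
            # Only relevant pages contribute new links (focused expansion).
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link.startswith("http") and link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return harvested

if __name__ == "__main__":
    print(focused_crawl(["https://example.com"], max_pages=5))
```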

    A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network

    Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as the input feature vector for their learning algorithms. TF-IDF-based crawlers judge a web page relevant only if a topic word itself occurs on the page; a page that contains only synonyms of the topic terms is treated as irrelevant. To resolve this challenge, this paper proposes a new methodology that integrates Adagrad-optimized Skip-Gram Negative Sampling (A-SGNS)-based word embedding with a Recurrent Neural Network (RNN). Cosine similarities computed from the word embedding matrix form a feature vector that is given as input to the RNN to predict the relevance of a web page. Performance is evaluated using the harvest rate (hr) and irrelevance ratio (ir); the proposed methodology outperforms existing methods with an average harvest rate of 0.42 and an irrelevance ratio of 0.58.
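
    A minimal sketch of the pipeline this abstract describes: word embeddings, a cosine-similarity feature vector, and a recurrent network predicting page relevance. The embedding size, sequence length, random stand-in embeddings, and Keras layers are assumptions for illustration; the paper's A-SGNS training and exact feature construction are not reproduced.

```python
# Sketch: word embeddings -> cosine-similarity features -> RNN relevance score.
# The embedding matrix here is a random stand-in for the A-SGNS output.
import numpy as np
import tensorflow as tf

EMB_DIM, SEQ_LEN, VOCAB = 100, 50, 5000          # assumed sizes
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(VOCAB, EMB_DIM))   # stand-in embeddings
topic_vec = embedding_matrix[:20].mean(axis=0)          # assumed topic centroid

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def page_features(word_ids):
    """Cosine similarity of each page word to the topic, padded to SEQ_LEN."""
    sims = [cosine(embedding_matrix[w], topic_vec) for w in word_ids[:SEQ_LEN]]
    sims += [0.0] * (SEQ_LEN - len(sims))
    return np.array(sims, dtype="float32").reshape(SEQ_LEN, 1)

# Simple recurrent classifier over the similarity sequence.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN, 1)),
    tf.keras.layers.SimpleRNN(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy training data: random pages labelled relevant / irrelevant.
X = np.stack([page_features(rng.integers(0, VOCAB, size=60)) for _ in range(32)])
y = rng.integers(0, 2, size=32).astype("float32")
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2], verbose=0))
```

    In this setting the harvest rate is the fraction of downloaded pages judged relevant and the irrelevance ratio is its complement, which is consistent with the 0.42 and 0.58 figures quoted above.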

    Efficient Deep-Web-Harvesting Using Advanced Crawler

    Heavy use of the Internet has spread a large amount of diverse data across it, and fetching the data most relevant to a user's need is challenging and time-consuming for a search engine. To reduce the time spent searching for the most relevant data, we propose the "Advanced Crawler". In this approach, results collected from multiple web search engines for a user query are aggregated into a single result set (a meta-search approach), and two-stage crawling is then performed on the resulting URLs: site locating followed by in-site exploring, which finds the most relevant sites with the help of page-ranking and reverse-searching techniques. The system works in both online and offline modes.
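
    The meta-search step described above — merging results from several search engines into one candidate pool before the two-stage crawl — could look like the sketch below. The engine functions are placeholders returning fixed URL lists, and the reciprocal-rank fusion rule is an assumption, not the paper's aggregation method.

```python
# Sketch of meta-search aggregation: results from several search engines are
# merged into a single ranked candidate pool before two-stage crawling.
from collections import defaultdict

def engine_a(query):   # placeholder for a real search-engine query
    return ["http://site1.example/a", "http://site2.example/b"]

def engine_b(query):   # placeholder for a second engine
    return ["http://site2.example/b", "http://site3.example/c"]

def meta_search(query, engines):
    """Aggregate ranked results; URLs returned by more engines score higher."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query)):
            scores[url] += 1.0 / (rank + 1)      # reciprocal-rank fusion (assumed)
    return sorted(scores, key=scores.get, reverse=True)

candidates = meta_search("deep web interfaces", [engine_a, engine_b])
print(candidates)   # merged pool handed to site locating / in-site exploring
```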

    Ontology Based Approach for Services Information Discovery using Hybrid Self Adaptive Semantic Focused Crawler

    Focused crawling aims to fetch only pages that are relevant to a predefined set of topics. Since an ontology is a well-formed knowledge representation, ontology-based focused crawling approaches have become a subject of research. Crawling is one of the essential techniques for building knowledge repositories. The purpose of a semantic focused crawler is to automatically discover, annotate, and classify service information using Semantic Web technologies. Here, a framework for a hybrid self-adaptive semantic focused crawler (HSASF crawler) is presented, with the goal of effectively discovering and organizing service information over the Internet while addressing three essential issues. A semi-supervised scheme is designed to automatically select the optimal threshold value for each concept, maintaining optimal performance without being constrained by the size of the training data set. DOI: 10.17762/ijritcc2321-8169.15072
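
    One way to picture ontology-driven classification with per-concept thresholds is the sketch below. The toy ontology, the term-overlap scoring rule, and the way thresholds are chosen from a small labelled sample are all assumptions; the HSASF crawler's actual semi-supervised procedure is not reproduced here.

```python
# Sketch: classify text against ontology concepts, with a threshold selected
# per concept from a small labelled sample (assumed selection rule).
ONTOLOGY = {                       # concept -> associated terms (assumed)
    "transport_service": {"freight", "shipping", "logistics", "delivery"},
    "repair_service": {"repair", "maintenance", "spare", "parts"},
}

def concept_score(text: str, terms: set) -> float:
    words = set(text.lower().split())
    return len(words & terms) / len(terms)

def fit_thresholds(labelled_docs):
    """Pick, per concept, the lowest score seen on a positively labelled doc."""
    thresholds = {}
    for concept, terms in ONTOLOGY.items():
        positives = [concept_score(t, terms) for t, c in labelled_docs if c == concept]
        thresholds[concept] = min(positives) if positives else 0.5
    return thresholds

def classify(text, thresholds):
    return [c for c, terms in ONTOLOGY.items()
            if concept_score(text, terms) >= thresholds[c]]

sample = [("cheap freight shipping and delivery", "transport_service"),
          ("engine repair and spare parts", "repair_service")]
th = fit_thresholds(sample)
print(classify("logistics and freight delivery services", th))
```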

    Enhance Crawler For Efficiently Harvesting Deep Web Interfaces

    The Web is changing quickly and the volume of web resources is growing, so efficiency has become a challenging problem when crawling such data. Hidden web content is data that cannot be indexed by search engines because it sits behind searchable web interfaces. The proposed system develops a framework for a focused crawler that efficiently gathers hidden web interfaces. First, the crawler performs site-based searching for center pages with the help of web search tools, to avoid visiting an excessive number of pages. To obtain more specific results, the proposed crawler ranks websites, giving higher priority to those more relevant to a given search. The crawler achieves fast in-site searching by watching for the most relevant links using adaptive link ranking. We also incorporate a spell checker to correct the input query, and apply reverse searching with incremental site prioritizing for wide-ranging coverage of hidden web sites.
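
    A minimal sketch of adaptive link ranking for the in-site exploring stage: a link's priority blends anchor-text relevance with feedback from pages already fetched on the same site. The topic terms, weights, and relevance function are assumptions, not the paper's exact scoring.

```python
# Sketch of adaptive link ranking: link priority combines anchor-text relevance
# with feedback from pages already fetched on the same site (weights assumed).
from collections import defaultdict
from urllib.parse import urlparse

TOPIC = {"database", "search", "form", "query"}     # assumed topic terms

def text_relevance(text: str) -> float:
    words = set(text.lower().split())
    return len(words & TOPIC) / len(TOPIC)

class AdaptiveLinkRanker:
    def __init__(self, alpha=0.6):
        self.alpha = alpha                               # anchor text vs site feedback
        self.site_feedback = defaultdict(lambda: 0.5)    # prior score per site

    def update_site(self, url: str, page_relevance: float):
        """Blend in the relevance of a freshly fetched page from this site."""
        site = urlparse(url).netloc
        self.site_feedback[site] = 0.5 * self.site_feedback[site] + 0.5 * page_relevance

    def link_priority(self, url: str, anchor_text: str) -> float:
        site = urlparse(url).netloc
        return (self.alpha * text_relevance(anchor_text)
                + (1 - self.alpha) * self.site_feedback[site])

ranker = AdaptiveLinkRanker()
ranker.update_site("http://catalog.example/page1", 0.8)
print(ranker.link_priority("http://catalog.example/search-form", "advanced search query form"))
```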

    Focused crawling of thematic Web resources as a means of reducing search on the Web

    This paper examines the problem of creating a system for monitoring thematic Web resources in a corporate environment. A classification of the basic resource-traversal algorithms is proposed. A classification of site-ranking metrics is developed according to the objects on which the evaluation is based. A preliminary assessment of the suitability of focused search for this task is carried out.