
    A Study of Focused Web Crawling Techniques

    In recent years, the volume of data on the web has grown exponentially, making it crucial to find accurate and significant information on the Web. Web crawlers are programs that discover web pages on the World Wide Web by following hyperlinks. Search engines index these pages, which can then be retrieved in response to a user's query. The immense size and diversity of the Web make it difficult for any single crawler to retrieve all pertinent information, so many variants of web crawling techniques have emerged as an active research area. In this paper, we survey learnable focused crawlers.
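    As a concrete reference point, the following is a minimal sketch of the hyperlink-following crawl loop the abstract describes; the seed URL, the is_relevant() predicate, and the page budget are illustrative placeholders, not anything from the surveyed crawlers.

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests


def fetch(url):
    """Fetch a page, returning its HTML or None on failure."""
    try:
        resp = requests.get(url, timeout=5)
        return resp.text if resp.ok else None
    except requests.RequestException:
        return None


def crawl(seed, is_relevant, max_pages=100):
    """Breadth-first crawl from a seed, collecting pages judged relevant."""
    frontier = deque([seed])
    seen, harvested = {seed}, []
    while frontier and len(harvested) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):
            harvested.append(url)
        for href in re.findall(r'href="([^"]+)"', html):  # follow hyperlinks
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return harvested
```

    A focused crawler differs from this baseline mainly in replacing the FIFO frontier with a learned priority order, which is what the surveyed learnable approaches estimate.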

    A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network

    Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on that page, failing which the page is considered irrelevant; similarity is not credited even when a synonym of a topic term occurs on the page. To resolve this challenge, this paper proposes a new methodology that integrates Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding with a Recurrent Neural Network (RNN). Cosine similarity is calculated from the word embedding matrix to form a feature vector that is given as input to the RNN to predict the relevance of a website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and irrelevance ratio of 0.58.
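    The key step here is replacing exact TF-IDF term matching with embedding-based similarity, so synonyms on a page still contribute to relevance. Below is a minimal sketch of that feature construction, assuming the embeddings are available as a dict of numpy vectors (the paper trains them with A-SGNS) and that a simple max-pooling produces the fixed-length vector passed to the RNN; both choices are illustrative, not the paper's exact pipeline.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def page_features(page_words, topic_words, embeddings):
    """One feature per topic word: its best cosine match on the page.
    A synonym scores high even when the literal topic term is absent."""
    feats = []
    for t in topic_words:
        if t not in embeddings:
            feats.append(0.0)
            continue
        sims = [cosine(embeddings[t], embeddings[w])
                for w in page_words if w in embeddings]
        feats.append(max(sims, default=0.0))
    return np.array(feats)  # input feature vector for the RNN classifier
```

    Under the usual definitions, the harvest rate is the fraction of fetched pages that are relevant and the irrelevance ratio is its complement, which is consistent with the reported 0.42 and 0.58.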

    Analysis of Web Crawling Algorithms

    The web today is an enormous and ever-growing collection of data, so searching it for particular information is of significant importance. Current research emphasizes the relevance and relatedness of the data that is found; yet even when restricted to relevant pages, the result sets for any search topic remain too large to explore exhaustively. Another important consideration is that a user's standpoint differs from time to time and from topic to topic. Effective relevance prediction can help a crawler avoid downloading and visiting many irrelevant pages. The performance of a crawler depends mostly on the abundance of links within the specific topic being searched. This paper reviews research on the web crawling algorithms used for searching.
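    Most of the algorithms such a review covers are variants of best-first crawling, where relevance prediction orders the frontier. A sketch under that assumption follows, with fetch(), extract_links(), and score_link() as stand-ins for a concrete crawler's components.

```python
import heapq


def best_first_crawl(seed, fetch, extract_links, score_link, budget=100):
    """Fetch the highest-priority URL first. heapq is a min-heap,
    so scores are negated; fetch/extract_links/score_link are
    stand-ins for a concrete crawler's components."""
    frontier = [(-1.0, seed)]
    seen, visited = {seed}, []
    while frontier and len(visited) < budget:
        neg_score, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:
            continue
        visited.append((url, -neg_score))
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                # predicted relevance decides crawl order
                heapq.heappush(frontier, (-score_link(link, html), link))
    return visited
```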

    Accelerated focused crawling through online relevance feedback

    The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.
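    A rough sketch of the underlying idea: score an unseen HREF target using only information visible on the source page. Here that information is just the anchor text collected while parsing the tag tree; the paper draws on richer DOM context around each link, and the classifier below is an assumed stand-in.

```python
from html.parser import HTMLParser


class AnchorContext(HTMLParser):
    """Collect (href, anchor_text) pairs while parsing a source page."""

    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def score_links(html, classifier):
    """classifier maps anchor-context text to a priority in [0, 1]."""
    parser = AnchorContext()
    parser.feed(html)
    return [(href, classifier(text)) for href, text in parser.links]
```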

    Real-time focused extraction of social media users

    In this paper, we explore a real-time automation challenge: the focused extraction of Social Media users. This challenge can be seen as a special form of focused crawling whose main target is to detect users matching certain patterns. Given a specific user profile, the task consists of rapidly ingesting Social Media data and detecting target users early. This real-time intelligent automation task has numerous applications in domains such as safety, health or marketing. The volume and dynamics of Social Media contents demand efficient real-time solutions able to predict which users are worth exploring. To meet this aim, we propose and evaluate several methods that effectively allow us to harvest relevant users. Even with little contextual information (e.g., a single user submission), our methods quickly focus on the most promising users. We also developed a distributed microservice architecture that supports real-time parallel extraction of Social Media users. This modular architecture scales up in clusters of computers and can be easily adapted for user extraction in multiple domains and Social Media sources. Our experiments suggest that some of the proposed prioritisation methods, which work with minimal user context, are effective at rapidly focusing on the most relevant users. These methods perform satisfactorily with huge volumes of users and interactions, and lead to harvest ratios 2 to 9 times higher than those achieved by random prioritisation. This work was supported in part by the Ministerio de Ciencia e Innovación (MICINN) under Grant RTI2018-093336-B-C21 and Grant PLEC2021-007662; in part by Xunta de Galicia under Grant ED431G/08, Grant ED431G-2019/04, Grant ED431C 2018/19, and Grant ED431F 2020/08; and in part by the European Regional Development Fund (ERDF).
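    One way to picture the prioritisation described here, as a sketch: users surface from a live stream, receive a relevance score from minimal context (a single submission), and the most promising candidates are inspected first. The post format, score_user(), and inspect() below are illustrative assumptions, not the paper's API.

```python
import heapq


def extract_users(stream, score_user, inspect, target_profile, budget=1000):
    """Interleave real-time ingestion with exploration of the most
    promising users. stream yields posts as dicts; score_user() rates
    a user from one submission; inspect() does the full profile check."""
    queue, seen, matches = [], set(), []
    for post in stream:
        user = post["user"]
        if user not in seen:
            seen.add(user)
            heapq.heappush(queue, (-score_user(post), user))
        # between ingestion steps, explore the current best candidate
        if queue and len(matches) < budget:
            _, candidate = heapq.heappop(queue)
            if inspect(candidate, target_profile):
                matches.append(candidate)
    return matches
```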

    A Novel Cooperation and Competition Strategy Among Multi-Agent Crawlers

    Multi-agent theory, applied to communication and collaboration among focused crawlers, has been shown to improve the precision of returned results significantly. In this paper, we propose a new multi-agent organizational structure for focused crawlers in which agents are divided into three categories: F-Agent (Facilitator-Agent), As-Agent (Assistance-Agent) and C-Agent (Crawler-Agent). Each works on its own responsibilities, and they cooperate to complete a common web crawling task. In our proposed multi-agent architecture for focused crawlers, we focus on the collaborative process among multiple agents. To control this cooperation, we propose a negotiation protocol based on the contract net protocol and implement the collaboration model in JADE. Finally, comparative experiments show that our focused crawlers achieve higher precision and efficiency than crawlers using breadth-first, best-first and similar algorithms.
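    A toy illustration of contract-net style task allocation among crawler agents, in Python rather than the JADE (Java) agent platform the paper uses; the roles, the bid function, and the expertise table are simplified assumptions, not the paper's protocol messages.

```python
def contract_net(tasks, agents, bid):
    """Manager announces each task; agents bid; best bidder is awarded."""
    assignments = {}
    for task in tasks:                              # 1. task announcement
        bids = {a: bid(a, task) for a in agents}    # 2. bidding
        winner = max(bids, key=bids.get)            # 3. award to best bid
        assignments.setdefault(winner, []).append(task)
    return assignments


# Example: crawler agents bid on topic partitions by estimated expertise.
agents = ["C-Agent-1", "C-Agent-2", "C-Agent-3"]
expertise = {("C-Agent-1", "sports"): 0.9, ("C-Agent-2", "sports"): 0.4,
             ("C-Agent-3", "sports"): 0.1, ("C-Agent-1", "news"): 0.2,
             ("C-Agent-2", "news"): 0.8, ("C-Agent-3", "news"): 0.5}
print(contract_net(["sports", "news"], agents,
                   lambda a, t: expertise.get((a, t), 0.0)))
```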

    State of the art 2015: a literature review of social media intelligence capabilities for counter-terrorism

    This paper is a review of how information and insight can be drawn from open social media sources. It focuses on the specific research techniques that have emerged, the capabilities they provide, the possible insights they offer, and the ethical and legal questions they raise. These techniques are considered relevant and valuable in so far as they can help to maintain public safety by preventing terrorism, preparing for it, protecting the public from it and pursuing its perpetrators. The report also considers how far this can be achieved against the backdrop of radically changing technology and public attitudes towards surveillance. This is an updated version of a 2013 report on the same subject, State of the Art. Since 2013, there have been significant changes in social media, how it is used by terrorist groups, and the methods being developed to make sense of it. The paper is structured as follows: Part 1 is an overview of social media use, focused on how it is used by groups of interest to those involved in counter-terrorism; it includes new sections on trends in social media platforms and on Islamic State (IS). Part 2 provides an introduction to the key approaches of social media intelligence (henceforth ‘SOCMINT’) for counter-terrorism. Part 3 sets out a series of SOCMINT techniques; for each, the capabilities and insights it offers are described, its validity and reliability are assessed, and its possible application to counter-terrorism work is explored. Part 4 outlines a number of important legal, ethical and practical considerations when undertaking SOCMINT work.