
    A Study of Focused Web Crawling Techniques

    In recent years, the volume of data on the web has grown exponentially, making it crucial to find accurate and significant information on the Web. Web crawlers are programs that discover web pages on the World Wide Web by following hyperlinks. Search engines index these pages, which can then be retrieved in response to a user's query. The immense size and diversity of the Web make it difficult for any single crawler to retrieve all pertinent information, so many variants of web crawling techniques have emerged as an active research area. In this paper, we survey learnable focused crawlers.
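    As a concrete reference point, the following is a minimal sketch of the hyperlink-following crawl loop the abstract describes; the seed URL, the is_relevant() predicate, and the page budget are illustrative placeholders, not anything from the surveyed crawlers.

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests


def fetch(url):
    """Fetch a page, returning its HTML or None on failure."""
    try:
        resp = requests.get(url, timeout=5)
        return resp.text if resp.ok else None
    except requests.RequestException:
        return None


def crawl(seed, is_relevant, max_pages=100):
    """Breadth-first crawl from a seed, collecting pages judged relevant."""
    frontier = deque([seed])
    seen, harvested = {seed}, []
    while frontier and len(harvested) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(html):
            harvested.append(url)
        for href in re.findall(r'href="([^"]+)"', html):  # follow hyperlinks
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return harvested
```

    A focused crawler differs from this baseline mainly in replacing the FIFO frontier with a learned priority order, which is what the surveyed learnable approaches estimate.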

    A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network

    Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on that page, failing which the page is considered irrelevant; similarity is not credited even when a synonym of a topic term occurs on the page. To resolve this challenge, this paper proposes a new methodology that integrates Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding with a Recurrent Neural Network (RNN). Cosine similarity is calculated from the word embedding matrix to form a feature vector that is given as input to the RNN to predict the relevance of a website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and irrelevance ratio of 0.58.
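    The key step here is replacing exact TF-IDF term matching with embedding-based similarity, so synonyms on a page still contribute to relevance. Below is a minimal sketch of that feature construction, assuming the embeddings are available as a dict of numpy vectors (the paper trains them with A-SGNS) and that a simple max-pooling produces the fixed-length vector passed to the RNN; both choices are illustrative, not the paper's exact pipeline.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def page_features(page_words, topic_words, embeddings):
    """One feature per topic word: its best cosine match on the page.
    A synonym scores high even when the literal topic term is absent."""
    feats = []
    for t in topic_words:
        if t not in embeddings:
            feats.append(0.0)
            continue
        sims = [cosine(embeddings[t], embeddings[w])
                for w in page_words if w in embeddings]
        feats.append(max(sims, default=0.0))
    return np.array(feats)  # input feature vector for the RNN classifier
```

    Under the usual definitions, the harvest rate is the fraction of fetched pages that are relevant and the irrelevance ratio is its complement, which is consistent with the reported 0.42 and 0.58.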

    Analysis of Web Crawling Algorithms

    The web today is an enormous and ever-growing collection of data, so searching it for particular information is of significant importance. Current research emphasizes the relevance and relatedness of the data that is found; yet even when restricted to relevant pages, the result sets for any search topic remain too large to explore exhaustively. Another important consideration is that a user's standpoint differs from time to time and from topic to topic. Effective relevance prediction can help a crawler avoid downloading and visiting many irrelevant pages. The performance of a crawler depends mostly on the abundance of links within the specific topic being searched. This paper reviews research on the web crawling algorithms used for searching.
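    Most of the algorithms such a review covers are variants of best-first crawling, where relevance prediction orders the frontier. A sketch under that assumption follows, with fetch(), extract_links(), and score_link() as stand-ins for a concrete crawler's components.

```python
import heapq


def best_first_crawl(seed, fetch, extract_links, score_link, budget=100):
    """Fetch the highest-priority URL first. heapq is a min-heap,
    so scores are negated; fetch/extract_links/score_link are
    stand-ins for a concrete crawler's components."""
    frontier = [(-1.0, seed)]
    seen, visited = {seed}, []
    while frontier and len(visited) < budget:
        neg_score, url = heapq.heappop(frontier)
        html = fetch(url)
        if html is None:
            continue
        visited.append((url, -neg_score))
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                # predicted relevance decides crawl order
                heapq.heappush(frontier, (-score_link(link, html), link))
    return visited
```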

    Accelerated focused crawling through online relevance feedback

    The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.
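    A rough sketch of the underlying idea: score an unseen HREF target using only information visible on the source page. Here that information is just the anchor text collected while parsing the tag tree; the paper draws on richer DOM context around each link, and the classifier below is an assumed stand-in.

```python
from html.parser import HTMLParser


class AnchorContext(HTMLParser):
    """Collect (href, anchor_text) pairs while parsing a source page."""

    def __init__(self):
        super().__init__()
        self.links, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def score_links(html, classifier):
    """classifier maps anchor-context text to a priority in [0, 1]."""
    parser = AnchorContext()
    parser.feed(html)
    return [(href, classifier(text)) for href, text in parser.links]
```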

    Real-time focused extraction of social media users

    In this paper, we explore a real-time automation challenge: the focused extraction of Social Media users. This challenge can be seen as a special form of focused crawling whose main target is to detect users matching certain patterns. Given a specific user profile, the task consists of rapidly ingesting Social Media data and detecting target users early. This real-time intelligent automation task has numerous applications in domains such as safety, health or marketing. The volume and dynamics of Social Media contents demand efficient real-time solutions able to predict which users are worth exploring. To meet this aim, we propose and evaluate several methods that effectively allow us to harvest relevant users. Even with little contextual information (e.g., a single user submission), our methods quickly focus on the most promising users. We also developed a distributed microservice architecture that supports real-time parallel extraction of Social Media users. This modular architecture scales up in clusters of computers and can be easily adapted for user extraction in multiple domains and Social Media sources. Our experiments suggest that some of the proposed prioritisation methods, which work with minimal user context, are effective at rapidly focusing on the most relevant users. These methods perform satisfactorily with huge volumes of users and interactions, and lead to harvest ratios 2 to 9 times higher than those achieved by random prioritisation. This work was supported in part by the Ministerio de Ciencia e Innovación (MICINN) under Grant RTI2018-093336-B-C21 and Grant PLEC2021-007662; in part by Xunta de Galicia under Grant ED431G/08, Grant ED431G-2019/04, Grant ED431C 2018/19, and Grant ED431F 2020/08; and in part by the European Regional Development Fund (ERDF).
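    One way to picture the prioritisation described here, as a sketch: users surface from a live stream, receive a relevance score from minimal context (a single submission), and the most promising candidates are inspected first. The post format, score_user(), and inspect() below are illustrative assumptions, not the paper's API.

```python
import heapq


def extract_users(stream, score_user, inspect, target_profile, budget=1000):
    """Interleave real-time ingestion with exploration of the most
    promising users. stream yields posts as dicts; score_user() rates
    a user from one submission; inspect() does the full profile check."""
    queue, seen, matches = [], set(), []
    for post in stream:
        user = post["user"]
        if user not in seen:
            seen.add(user)
            heapq.heappush(queue, (-score_user(post), user))
        # between ingestion steps, explore the current best candidate
        if queue and len(matches) < budget:
            _, candidate = heapq.heappop(queue)
            if inspect(candidate, target_profile):
                matches.append(candidate)
    return matches
```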

    A Novel Cooperation and Competition Strategy Among Multi-Agent Crawlers

    Multi-agent theory, applied to communication and collaboration among focused crawlers, has been shown to improve the precision of returned results significantly. In this paper, we propose a new multi-agent organizational structure for focused crawlers in which agents are divided into three categories: F-Agent (Facilitator-Agent), As-Agent (Assistance-Agent) and C-Agent (Crawler-Agent). Each works on its own responsibilities, and they cooperate to complete a common web crawling task. In our proposed multi-agent architecture for focused crawlers, we focus on the collaborative process among multiple agents. To control this cooperation, we propose a negotiation protocol based on the contract net protocol and implement the collaboration model in JADE. Finally, comparative experiments show that our focused crawlers achieve higher precision and efficiency than crawlers using breadth-first, best-first and similar algorithms.
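    A toy illustration of contract-net style task allocation among crawler agents, in Python rather than the JADE (Java) agent platform the paper uses; the roles, the bid function, and the expertise table are simplified assumptions, not the paper's protocol messages.

```python
def contract_net(tasks, agents, bid):
    """Manager announces each task; agents bid; best bidder is awarded."""
    assignments = {}
    for task in tasks:                              # 1. task announcement
        bids = {a: bid(a, task) for a in agents}    # 2. bidding
        winner = max(bids, key=bids.get)            # 3. award to best bid
        assignments.setdefault(winner, []).append(task)
    return assignments


# Example: crawler agents bid on topic partitions by estimated expertise.
agents = ["C-Agent-1", "C-Agent-2", "C-Agent-3"]
expertise = {("C-Agent-1", "sports"): 0.9, ("C-Agent-2", "sports"): 0.4,
             ("C-Agent-3", "sports"): 0.1, ("C-Agent-1", "news"): 0.2,
             ("C-Agent-2", "news"): 0.8, ("C-Agent-3", "news"): 0.5}
print(contract_net(["sports", "news"], agents,
                   lambda a, t: expertise.get((a, t), 0.0)))
```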

    State of the art 2015: a literature review of social media intelligence capabilities for counter-terrorism

    This paper is a review of how information and insight can be drawn from open social media sources. It focuses on the specific research techniques that have emerged, the capabilities they provide, the possible insights they offer, and the ethical and legal questions they raise. These techniques are considered relevant and valuable in so far as they can help to maintain public safety by preventing terrorism, preparing for it, protecting the public from it and pursuing its perpetrators. The report also considers how far this can be achieved against the backdrop of radically changing technology and public attitudes towards surveillance. This is an updated version of a 2013 report on the same subject, State of the Art. Since 2013, there have been significant changes in social media, how it is used by terrorist groups, and the methods being developed to make sense of it. The paper is structured as follows: Part 1 is an overview of social media use, focused on how it is used by groups of interest to those involved in counter-terrorism; it includes new sections on trends in social media platforms and on Islamic State (IS). Part 2 provides an introduction to the key approaches of social media intelligence (henceforth ‘SOCMINT’) for counter-terrorism. Part 3 sets out a series of SOCMINT techniques; for each, the capabilities and insights it offers are described, its validity and reliability are assessed, and its possible application to counter-terrorism work is explored. Part 4 outlines a number of important legal, ethical and practical considerations when undertaking SOCMINT work.