1,911 research outputs found

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content on demand regarding current events such as the Ebola outbreak or the Ukraine crisis. However, existing focused crawling approaches consider only topical aspects and ignore temporal aspects, and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issues of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler. Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201

    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The quick expansion of the web and the complexity added to web applications have made the process of crawling a very challenging one. Throughout the history of web crawling, many researchers and industrial groups have addressed different issues and challenges that web crawlers face. Different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl is a challenging question. Additionally, capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to recent days. We introduce criteria to evaluate the relative performance of web crawlers. Based on these criteria we plot the evolution of web crawlers and compare their performance

    A Novel Framework for Context Based Distributed Focused Crawler (CBDFC)

    Focused crawling aims to search only the subset of the WWW that is relevant to a specific topic of user interest. This makes it necessary to decide how relevant a document is to the topic of interest, especially when the user cannot specify the exact context of the topic precisely. This paper provides a novel framework for a context based distributed focused crawler that maintains an index of web documents pertaining to the context of keywords, resulting in storage of more related documents

    A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

    Virtual Integration systems require a crawling tool able to navigate and reach relevant pages in the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but still most of them need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose in each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making it efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without previous intervention from the user. Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-