iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

Author: Diligenti M.
Mohr G.
Psallidas F.
Risse T.
Tannier X.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 19/12/2016

Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler. Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201

arXiv.org e-Print Archive

Crossref

Role of Ranking Algorithms for Information Retrieval

Author: Burdak Bhawani Shankar
Choudhary Laxmi
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 09/08/2012

As use of the web increases day by day, web users easily get lost in the web's rich hyper structure. The main aim of a website owner is to give users relevant information according to their needs. We explain how Web mining is used to categorize users and pages by analyzing user behavior and page content, and then describe Web Structure mining. This paper covers different page ranking algorithms and compares those used for Information Retrieval. Different PageRank-based algorithms, such as PageRank (PR), Weighted PageRank (WPR), HITS (Hyperlink-Induced Topic Search), DistanceRank and EigenRumor, are discussed and compared. A simulation interface has been designed for the PageRank and Weighted PageRank algorithms, but PageRank is the only ranking algorithm on which the Google search engine works. Comment: Keywords: Page Rank, Web Mining, Web Structured Mining, Web Content Minin

arXiv.org e-Print Archive

Crossref

A focused crawler combinatory link and content model based on T-graph principles

Author: Patel Ahmed
Seyfi Ali
Publication venue: 'Elsevier BV'
Publication date: 15/07/2015

Kingston University Research Repository

Accelerated focused crawling through online relevance feedback

Author: Chakrabarti Soumen
Mallela Subramanyam
Punera Kunal
Publication date: 01/01/2002

The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded

An Improved PageRank Method based on Genetic Algorithm for Web Search

Author: Du Wencai
Gui Zhanji
Guo Qingju
Yan Lili
Publication venue: 'Elsevier BV'
Publication date: 31/12/2011

Web search engines have become a very important tool for finding information efficiently in massive Web data. Based on the PageRank algorithm, a genetic PageRank algorithm (GPRA) is proposed. While preserving the advantages of the PageRank algorithm, GPRA takes advantage of the genetic algorithm to solve web search. Experimental results show that GPRA outperforms both the PageRank algorithm and the genetic algorithm

Elsevier - Publisher Connector

INJECT: Algorithms to Discover Creative Angles on News

Author: Maiden N.
Zachos K.

INJECT is a new digital tool to support journalists to think more creatively when discovering new angles on stories under development. It delivers interactive and intelligent support embedded in the text editors that journalists work with regularly. This support is generated by combining complex creative searches of millions of related news stories published in multiple languages with entity extraction algorithms and interactive creative guidance tailored to news. This paper reports the tool’s architecture, some of its algorithms, and the design decisions made to deliver a reliable and usable tool for journalists in different newsrooms and work contexts

City Research Online

A Novel Cooperation and Competition Strategy Among Multi-Agent Crawlers

Author: Du Yajun
Wang Min
Xu Yong
Publication venue: Institute of Informatics, Slovak Academy of Sciences
Publication date: 07/02/2017

Multi-agent theory, which is used for communication and collaboration among focused crawlers, has been shown to improve the precision of returned results significantly. In this paper, we propose a new multi-agent organizational structure for focused crawlers, in which the agents are divided into three categories, namely F-Agent (Facilitator-Agent), As-Agent (Assistance-Agent) and C-Agent (Crawler-Agent). Each agent carries out its own responsibilities and cooperates with the others to complete a common web crawling task. In our proposed multi-agent architecture for focused crawlers, we emphasize the collaborative process among multiple agents. To control the cooperation among agents, we propose a negotiation protocol based on the contract net protocol and implement the multi-agent collaboration model of focused crawlers with JADE. Finally, comparative experimental results show that our focused crawlers achieve higher precision and efficiency than other crawlers using algorithms such as breadth-first and best-first

Computing and Informatics (E-Journal - Institute of Informatics, SAS, Bratislava)