iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling
Researchers in the Digital Humanities and journalists need to monitor,
collect and analyze fresh online content regarding current events such as the
Ebola outbreak or the Ukraine crisis on demand. However, existing focused
crawling approaches only consider topical aspects while ignoring temporal
aspects and therefore cannot achieve thematically coherent and fresh Web
collections. Especially Social Media provide a rich source of fresh content,
which is not used by state-of-the-art focused crawlers. In this paper we
address the issues of enabling the collection of fresh and relevant Web and
Social Web content for a topic of interest through seamless integration of Web
and Social Media in a novel integrated focused crawler. The crawler collects
Web and Social Media content in a single system and exploits the stream of
fresh Social Media content for guiding the crawler.
Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference
on Digital Libraries 201
Role of Ranking Algorithms for Information Retrieval
As Web use grows day by day, users easily get lost in the Web's rich
hyperlink structure. The main aim of a website owner is to give users the
information relevant to their needs. We explain how Web mining is used to
categorize users and pages by analyzing user behavior and page content, and
then describe Web structure mining. This paper presents several page ranking
algorithms used for information retrieval and compares them: PageRank (PR),
Weighted PageRank (WPR), HITS (Hyperlink-Induced Topic Search), Distance Rank
and EigenRumor. A simulation interface has been designed for the PageRank and
Weighted PageRank algorithms, although PageRank is the only one of these on
which the Google search engine works.
Comment: Keywords: Page Rank, Web Mining, Web Structured Mining, Web Content
Minin
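As a rough illustration of the basic PageRank algorithm named in the abstract above, here is a minimal power-iteration sketch. It is not taken from the paper: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without out-links are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    All parameter names and defaults are illustrative assumptions.
    """
    # Collect every page that appears as a source or as a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly (assumption).
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes an equal share of its rank to every page it links to.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        rank = new_rank
    return rank


example_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
example_ranks = pagerank(example_links)
```

In this toy graph, page `c` is linked from both `a` and `b`, so it ends up with a higher score than `b`, which is linked only from `a`.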
Accelerated focused crawling through online relevance feedback
The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded
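The idea of prioritizing unvisited URLs in the crawl frontier can be sketched with a priority queue. This is only an illustration under stated assumptions: the `Frontier` class and `score_link` function are hypothetical names, and the keyword-overlap score is a simple stand-in for the learned relevance predictor the abstract describes, not the authors' method.

```python
import heapq


class Frontier:
    """Crawl frontier that returns the highest-scored unvisited URL first."""

    def __init__(self):
        self._heap = []      # entries are (-score, insertion order, url)
        self._seen = set()   # URLs already queued, to avoid duplicates
        self._counter = 0    # tie-breaker that keeps insertion order stable

    def push(self, url, score):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def score_link(anchor_text, topic_terms):
    """Stand-in relevance estimate: fraction of anchor words that are topic terms."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


topic = {"ebola", "outbreak"}
frontier = Frontier()
frontier.push("http://example.org/sports", score_link("Football results today", topic))
frontier.push("http://example.org/health", score_link("Ebola outbreak update", topic))
best_url, best_score = frontier.pop()
```

Only information from the source page (here, the anchor text) feeds the score, which mirrors the constraint in the abstract that the target page has not yet been fetched.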
An Improved PageRank Method based on Genetic Algorithm for Web Search
Web search engines have become a very important tool for finding information efficiently in massive Web data. Building on the PageRank algorithm, a genetic PageRank algorithm (GPRA) is proposed. While preserving the advantages of PageRank, GPRA applies a genetic algorithm to the Web search problem. Experimental results show that GPRA outperforms both the PageRank algorithm and the genetic algorithm
INJECT: Algorithms to Discover Creative Angles on News
INJECT is a new digital tool that helps journalists think more creatively when discovering new angles on stories under development. It delivers interactive and intelligent support embedded in the text editors that journalists work with regularly. This support is generated by combining complex creative searches of millions of related news stories published in multiple languages with entity extraction algorithms and interactive creative guidance tailored to news. This paper reports the tool's architecture, some of its algorithms, and the design decisions made to deliver a reliable and usable tool for journalists in different newsrooms and work contexts
A Novel Cooperation and Competition Strategy Among Multi-Agent Crawlers
Multi-agent theory, used for communication and collaboration among focused crawlers, has been shown to improve the precision of returned results significantly. In this paper, we propose a new multi-agent organizational structure for focused crawlers, in which the agents are divided into three categories: F-Agent (Facilitator-Agent), As-Agent (Assistance-Agent) and C-Agent (Crawler-Agent). Each agent handles its own responsibilities and cooperates with the others to complete a common Web crawling task. In the proposed multi-agent architecture for focused crawlers, we concentrate on the collaborative process among the agents. To control their cooperation, we propose a negotiation protocol based on the contract net protocol and implement the collaboration model with JADE. Finally, comparative experiments show that our focused crawlers achieve higher precision and efficiency than crawlers using breadth-first, best-first and similar algorithms