3 research outputs found

    An Agent-Based Focused Crawling Framework for Topic- and Genre-Related Web Document Discovery

    Get PDF
    The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retrieve topic- and genre-related web documents. Starting from a simple topic query, a set of focused crawler agents explore in parallel topic-specific web paths using dynamic seed URLs that belong to certain web genres and are collected from web search engines. The agents make use of an internal mechanism that weighs topic and genre relevance scores of unvisited web pages. They are able to adapt to the properties of a given topic by modifying their internal knowledge during search, handle ambiguous queries, ignore irrelevant pages with respect to the topic and retrieve collaboratively topic-relevant web pages. We performed an experimental study to evaluate the behavior of the agents for a variety of topic queries demonstrating the benefits and the capabilities of our framework

    Fine Grained Approach for Domain Specific Seed URL Extraction

    Get PDF
    Domain Specific Search Engines are expected to provide relevant search results. Availability of enormous number of URLs across subdomains improves relevance of domain specific search engines. The current methods for seed URLs can be systematic ensuring representation of subdomains. We propose a fine grained approach for automatic extraction of seed URLs at subdomain level using Wikipedia and Twitter as repositories. A SeedRel metric and a Diversity Index for seed URL relevance are proposed to measure subdomain coverage. We implemented our approach for \u27Security - Information and Cyber\u27 domain and identified 34,007 Seed URLs and 400,726 URLs across subdomains. The measured Diversity index value of 2.10 conforms that all subdomains are represented, hence, a relevant \u27Security Search Engine\u27 can be built. Our approach also extracted more URLs (seed and child) as compared to existing approaches for URL extraction

    Distinguishing the Popularity Between Topics: A System for Up-to-date Opinion Retrieval and Mining in the Web

    Get PDF
    The constantly increasing amount of opinionated texts found in the Web had a significant impact in the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies, demonstrates that is appropriate for opinion comparison between topics, since it provides useful indications on the popularity based on a relatively small amount of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers
    corecore