Search CORE

3 research outputs found

An Agent-Based Focused Crawling Framework for Topic- and Genre-Related Web Document Discovery

Author: Katsimpras Georgios
Pappas Nikolaos
Stamatatos Efstathios
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 19/12/2013
Field of study

The discovery of web documents about certain topics is an important task for web-based applications including web document retrieval, opinion mining and knowledge extraction. In this paper, we propose an agent-based focused crawling framework able to retrieve topic- and genre-related web documents. Starting from a simple topic query, a set of focused crawler agents explore in parallel topic-specific web paths using dynamic seed URLs that belong to certain web genres and are collected from web search engines. The agents make use of an internal mechanism that weighs topic and genre relevance scores of unvisited web pages. They are able to adapt to the properties of a given topic by modifying their internal knowledge during search, handle ambiguous queries, ignore irrelevant pages with respect to the topic and retrieve collaboratively topic-relevant web pages. We performed an experimental study to evaluate the behavior of the agents for a variety of topic queries demonstrating the benefits and the capabilities of our framework

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Fine Grained Approach for Domain Specific Seed URL Extraction

Author: Sanagavarapu Lalit Mohan
Sarangi Sourav
Varma Vasudeva
Y Raghu Reddy
Publication venue: AIS Electronic Library (AISeL)
Publication date: 01/01/2018
Field of study

Domain Specific Search Engines are expected to provide relevant search results. Availability of enormous number of URLs across subdomains improves relevance of domain specific search engines. The current methods for seed URLs can be systematic ensuring representation of subdomains. We propose a fine grained approach for automatic extraction of seed URLs at subdomain level using Wikipedia and Twitter as repositories. A SeedRel metric and a Diversity Index for seed URL relevance are proposed to measure subdomain coverage. We implemented our approach for \u27Security - Information and Cyber\u27 domain and identified 34,007 Seed URLs and 400,726 URLs across subdomains. The measured Diversity index value of 2.10 conforms that all subdomains are represented, hence, a relevant \u27Security Search Engine\u27 can be built. Our approach also extracted more URLs (seed and child) as compared to existing approaches for URL extraction

Crossref

ScholarSpace at University of Hawai'i at Manoa

AIS Electronic Library (AISeL)

Distinguishing the Popularity Between Topics: A System for Up-to-date Opinion Retrieval and Mining in the Web

Author: Katsimpras Georgios
Pappas Nikolaos
Stamatatos Efstathios
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 19/12/2013
Field of study

The constantly increasing amount of opinionated texts found in the Web had a significant impact in the development of sentiment analysis. So far, the majority of the comparative studies in this field focus on analyzing fixed (offline) collections from certain domains, genres, or topics. In this paper, we present an online system for opinion mining and retrieval that is able to discover up-to-date web pages on given topics using focused crawling agents, extract opinionated textual parts from web pages, and estimate their polarity using opinion mining agents. The evaluation of the system on real-world case studies, demonstrates that is appropriate for opinion comparison between topics, since it provides useful indications on the popularity based on a relatively small amount of web pages. Moreover, it can produce genre-aware results of opinion retrieval, a valuable option for decision-makers

Infoscience - École polytechnique fédérale de Lausanne