
    A Novel Cooperation and Competition Strategy Among Multi-Agent Crawlers

    Multi-agent theory, applied to communication and collaboration among focused crawlers, has been shown to improve the precision of returned results significantly. In this paper, we propose a new multi-agent organizational structure for focused crawlers in which the agents are divided into three categories: F-Agents (Facilitator-Agents), As-Agents (Assistance-Agents), and C-Agents (Crawler-Agents). Each category carries out its own responsibilities, and the agents cooperate to complete the common task of web crawling. In the proposed multi-agent architecture for focused crawlers, we emphasize the collaborative process among the agents. To control this cooperation, we propose a negotiation protocol based on the contract net protocol and implement the resulting collaboration model in JADE. Finally, comparative experiments show that our focused crawlers achieve higher precision and efficiency than crawlers using breadth-first, best-first, and similar algorithms.
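The negotiation the abstract describes follows the contract net protocol: a facilitator announces a task, candidate agents submit bids, and the task is awarded to the best bidder. A minimal Python sketch of that announce-bid-award cycle (the paper's implementation uses JADE in Java; the agent names and the load-based bid function here are illustrative assumptions, not the paper's design):

```python
class CrawlerAgent:
    """Hypothetical C-Agent that bids on announced crawl tasks."""

    def __init__(self, name, load):
        self.name = name
        self.load = load  # current queue length; a lighter load yields a better bid

    def bid(self, task):
        # Illustrative bid: inverse of load. A real bid could also
        # weigh topic relevance of the task to the agent's frontier.
        return 1.0 / (1 + self.load)


def contract_net_award(task, agents):
    """Facilitator announces a task, collects bids, awards the best bidder."""
    bids = {agent.name: agent.bid(task) for agent in agents}
    winner = max(bids, key=bids.get)
    return winner, bids


agents = [
    CrawlerAgent("C1", load=5),
    CrawlerAgent("C2", load=2),
    CrawlerAgent("C3", load=9),
]
winner, bids = contract_net_award("crawl:example-topic", agents)
print(winner)  # C2: lowest load, hence the highest bid
```

In the full protocol the award is followed by an accept/refuse message and a result report back to the facilitator; this sketch covers only the bidding round.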

    A data collection system for heterogeneous sources based on distributed computing

    Undergraduate thesis (TCC) - Universidade Federal de Santa Catarina, Campus Araranguá, Information and Communication Technologies program. The amount of information grows exponentially, whether on the Web or on internal organizational networks. One factor behind this growth is the increasing participation of ordinary users not only in consuming information but also in producing content. Efficient ways to collect and store large volumes of information are therefore required. As a task, information collection consists first of locating a given information source and then of collecting it. In general, information is spread across geographically distributed servers on the Web, but also within organizations, scattered over servers and personal computers. Collecting this amount of information demands high computational processing power. To support this demand, the present work proposes a system in which the data collection task is performed in a distributed way. Feasibility is demonstrated through a prototype built from a logical view (general operation) and a physical view (detailing of the technological components). The prototype provides services for analyzing the structure of a given web resource and for performing the collection itself in a distributed way, thereby meeting the demand of a large-scale information collection system. To analyze the proposed system, three scenarios were devised to verify the crawler's adaptability and its processing capacity. Applying the prototype to these scenarios showed that it obtains consistent and satisfactory results in terms of adaptation and performance under different configurations, in both the analysis and the collection phases.
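A common building block of such a distributed collector is partitioning the URL space among workers. A minimal sketch, assuming a hash-by-host assignment so that all URLs of one site land on the same worker (the prototype's actual partitioning scheme is not specified in the abstract):

```python
import hashlib
from urllib.parse import urlparse


def assign_worker(url, n_workers):
    """Map a URL to a worker by hashing its host.

    Keeping a whole site on one worker lets that worker hold the
    per-site politeness state (robots.txt, crawl delay) locally.
    """
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_workers


urls = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.org/index",
]
for u in urls:
    print(u, "-> worker", assign_worker(u, 4))
```

Because the assignment depends only on the host, `example.com/a` and `example.com/b` always go to the same worker, while different hosts spread across the pool.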

    New Weighting Schemes for Document Ranking and Ranked Query Suggestion

    Term weighting is a process of scoring and ranking a term's relevance to a user's information need, or the importance of a term to a document. This thesis investigates novel term weighting methods with applications in document representation for text classification, web document ranking, and ranked query suggestion. Firstly, this research proposes a new feature for document representation under the vector space model (VSM) framework, class-specific document frequency (CSDF), which leads to a new term weighting scheme based on term frequency (TF) and the newly proposed feature. The experimental results show that the proposed methods, CSDF and TF-CSDF, improve document classification performance in comparison with other widely used VSM document representations. Secondly, a new ranking method called GCrank is proposed for re-ranking web documents returned from search engines using document classification scores. The experimental results show that GCrank improves the ranking of returned web documents on several commonly used evaluation criteria. Finally, this research investigates several state-of-the-art ranked retrieval methods, adapting and combining them into a new method for ranked query suggestion called Tfjac, which combines TF-IDF with the Jaccard coefficient. The experimental results show that Tfjac is the best of the evaluated query suggestion methods, outperforming the widely used TF-IDF method by increasing the number of highly relevant query suggestions.
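The Tfjac idea of combining TF-IDF with the Jaccard coefficient can be sketched as below. The abstract does not spell out the combination formula, so the weighted sum and the parameter `alpha` are assumptions, and the TF-IDF variant shown is the standard textbook formulation rather than necessarily the thesis's own:

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Standard TF-IDF vectors over a small tokenized corpus."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
        for d in docs
    ]


def cosine(a, b):
    """Cosine similarity of two sparse vectors held as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Jaccard coefficient of two token lists (set overlap / set union)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def tfjac_score(q_tokens, c_tokens, vecs, qi, ci, alpha=0.5):
    """Illustrative combination: weighted sum of TF-IDF cosine and Jaccard."""
    return alpha * cosine(vecs[qi], vecs[ci]) + (1 - alpha) * jaccard(q_tokens, c_tokens)


queries = [
    ["web", "crawler", "agent"],
    ["term", "weighting"],
    ["web", "agent", "system"],
]
vecs = tfidf_vectors(queries)
score = tfjac_score(queries[0], queries[2], vecs, 0, 2)
print(round(score, 3))
```

The Jaccard term rewards raw vocabulary overlap between query strings, while the TF-IDF cosine discounts terms common to the whole query log; combining them is one plausible reading of how the two signals complement each other.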