3 research outputs found

    An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation

    A focused crawler is topic-specific: it selectively collects web pages relevant to a given topic from the Internet. However, current focused crawlers are easily affected by the surrounding context of web pages and by pages that cover multiple topics. During crawling, a highly relevant region may be ignored because the overall relevance of its page is low, and anchor text or link context may misguide the crawler. To solve these problems, this paper proposes a new focused crawler. First, we build a web page classifier based on an improved term-weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, this paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between the URLs on a web page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and that our focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor-text-only, link-context-only, and content block partition strategies in terms of harvest rate and target recall. In conclusion, our methods are significant and effective for focused crawling.
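The crawling loop such an approach plugs into can be sketched as a best-first frontier: a priority queue of unvisited URLs ordered by a relevance score. This is a minimal sketch, not the paper's implementation; `score_link`, `fetch`, and `extract_links` are hypothetical stand-ins for the LPE scorer, an HTTP fetcher, and an HTML link extractor.

```python
import heapq

def focused_crawl(seeds, score_link, fetch, extract_links, budget=100):
    """Best-first focused crawl over a priority frontier.

    score_link(page, link) -> float stands in for a link-relevance
    evaluator such as the paper's LPE (assumption, not their code).
    """
    # Python's heapq is a min-heap, so scores are negated for max-first order.
    frontier = [(-1.0, seed) for seed in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    pages = []
    while frontier and len(pages) < budget:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        pages.append((url, page))
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_link(page, link), link))
    return pages
```

Under this scheme the strategies compared in the abstract differ only in how `score_link` is computed (constant for breadth-first, page relevance for best-first, anchor-text or link-context similarity, or block-level scores).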

    A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network

    Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as the input feature vector for learning algorithms. TF-IDF-based crawlers compute the relevance of a web page only if a topic word co-occurs on that page; otherwise the page is considered irrelevant, and no similarity is credited even when a synonym of a topic term occurs on the page. To resolve this challenge, this paper proposes a new methodology that integrates Adagrad-optimized Skip-Gram Negative Sampling (A-SGNS)-based word embeddings with a recurrent neural network (RNN). Cosine similarity is computed from the word embedding matrix to form a feature vector that is given as input to the RNN to predict the relevance of the website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and an irrelevance ratio of 0.58.
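The two quantities at the core of this abstract, the cosine similarity between embedding vectors and the harvest-rate/irrelevance-ratio metrics, have standard definitions that can be written down directly. This is a sketch under those assumed definitions (cosine of the angle between vectors; harvest rate as the fraction of downloaded pages that are relevant, with irrelevance ratio as its complement), not the authors' code.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def harvest_rate(relevant, downloaded):
    """Fraction of downloaded pages judged relevant to the topic."""
    return relevant / downloaded

def irrelevance_ratio(relevant, downloaded):
    """Complement of the harvest rate (assumed definition)."""
    return 1.0 - harvest_rate(relevant, downloaded)
```

Unlike exact-match TF-IDF, embedding-based cosine similarity can assign a nonzero score to a page that contains only a synonym of a topic term, because synonyms tend to receive nearby vectors; note also that the reported 0.42 and 0.58 are complementary, consistent with these definitions.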

    What do we talk about when we talk about homeopathy on the web? A content analysis

    An analysis of the content of web pages dealing with homeopathy is carried out. The aim is to determine their trends, typology, most prominent domains, and most frequently used terms. To do this, the most representative terms in the field were identified and a group of seed pages was selected. Both elements are the starting point for "Crawler by domain", an application developed to collect web pages. The results show that many pages in the sector present a positive view of homeopathy, which is to be expected given that a large share of them exist to sell homeopathy products and/or services, or are portals specialized in it. As a general conclusion, the tendency of these sources to offer content with a positive bias that is easily understood by the average user, together with the relative scarcity of pages with a critical stance or even without bias, may be a factor that encourages users to turn to this pseudotherapy, since such content may be interpreted as effective proof of its benefits. The explosion of information on the Internet has left users struggling to discriminate reliable sources of information. This, combined with the rise of certain pseudosciences, can put at risk patients who abandon medical treatment in favour of pseudotherapies such as homeopathy because they mistake them for reliable sources of health information. Noting the lack of structure in the data, a crawler was built that retrieves pages with specific content, ensuring that the sources fit the domain under analysis.
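The term-frequency side of the study's content analysis can be sketched as counting occurrences of a fixed domain vocabulary across the collected page texts. This is a minimal illustration, not the "Crawler by domain" code; `seed_terms` stands in for the representative homeopathy vocabulary the authors compiled, and `pages` is assumed to be plain-text page bodies.

```python
import re
from collections import Counter

def top_terms(pages, seed_terms, k=5):
    """Rank a domain vocabulary by frequency across fetched page texts."""
    vocab = {t.lower() for t in seed_terms}
    counts = Counter()
    for text in pages:
        # Lowercase word tokenization; only vocabulary terms are counted.
        for token in re.findall(r"\w+", text.lower()):
            if token in vocab:
                counts[token] += 1
    return counts.most_common(k)
```

A frequency ranking like this is what supports the abstract's "most used terms" findings; the typology and bias judgments, by contrast, require manual coding of the pages.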