
    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
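
    Although the report itself contains no code, the kind of pairing it describes, matching RSS entries to their HTML pages and harvesting embedded microdata, can be illustrated with a minimal Python sketch. The helper names below are hypothetical and the parsing is deliberately simplistic (RSS 2.0 and flat itemprop markup assumed); it is not the BlogForever implementation.

```python
# A minimal sketch, assuming an RSS 2.0 feed and HTML microdata (itemprop
# attributes). Helper names are hypothetical; this is not the BlogForever code.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

def rss_items(rss_xml: str):
    """Yield (title, link) pairs from a blog's RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        yield item.findtext("title", ""), item.findtext("link", "")

class MicrodataCollector(HTMLParser):
    """Collect the text of elements carrying an itemprop attribute."""
    def __init__(self):
        super().__init__()
        self._prop = None          # itemprop currently being read, if any
        self.properties = {}       # itemprop name -> accumulated text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemprop" in attrs:
            self._prop = attrs["itemprop"]
            self.properties.setdefault(self._prop, "")

    def handle_data(self, data):
        if self._prop:
            self.properties[self._prop] += data.strip()

    def handle_endtag(self, tag):
        # Simplistic: assumes flat microdata markup (no nested itemprops).
        self._prop = None

def extract_post_semantics(html_page: str) -> dict:
    """Return e.g. {"headline": "...", "datePublished": "..."} for a blog post."""
    collector = MicrodataCollector()
    collector.feed(html_page)
    return collector.properties
```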

    A review of the state of the art in Machine Learning on the Semantic Web: Technical Report CSTR-05-003

    Collaboration Enabling Internet Resource Collection-Building Software and Technologies

    Over the last decade the Library of the University of California, Riverside and its collaborators have developed a number of systems, service designs, and projects that utilize innovative technologies to foster better Internet finding tools in libraries and more cooperative and efficient effort in Internet link and metadata collection building. The open-source software and projects discussed represent appropriate technologies and sustainable strategies that we believe will help Internet portals, digital libraries, virtual libraries, library catalogs with portal-like capabilities (IPDVLCs), and related collection-building efforts in academia to better scale and more accurately anticipate and meet the needs of scholarly and educational users.

    Methodologies for the Automatic Location of Academic and Educational Texts on the Internet

    Traditionally, online databases of web resources have been compiled by a human editor, or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis. This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, it looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data are presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
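
    As a rough illustration of the ‘appropriateness’ decision the paper discusses, the Python sketch below scores harvested text against a handful of academic cue phrases. The cue list, weights and threshold are invented for illustration and are not the component methodologies described in the paper.

```python
# A crude cue-phrase scorer, offered only as an illustration of the
# "appropriateness" decision; the cues, weights and threshold are invented.
import re

ACADEMIC_CUES = {
    "abstract": 2.0, "references": 2.0, "doi": 1.5, "et al": 1.0,
    "methodology": 1.0, "keywords": 1.0, "university": 0.5, "results": 0.5,
}

def academic_score(text: str) -> float:
    """Sum cue weights, dampened by document length, as a rough relevance signal."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    lowered = " ".join(words)
    score = sum(w for cue, w in ACADEMIC_CUES.items() if cue in lowered)
    return score / (len(words) ** 0.5)

def is_candidate(text: str, threshold: float = 0.15) -> bool:
    """Decide whether a harvested page is worth adding to the database."""
    return academic_score(text) >= threshold
```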

    Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems

    Master's dissertation in Computer Science. Continuous social and economic development has led over time to an increase in consumption, as well as greater demand from consumers for better and cheaper products. The selling price of a product therefore plays a fundamental role in the consumer's purchase decision. In this context, online stores must carefully analyse and define the best price for each product, based on several factors such as production/acquisition cost, positioning of the product (e.g. anchor product) and the strategies of competing companies. The work done by market analysts has changed drastically over the last years: as the number of web sites increases exponentially, the number of E-commerce web sites has also grown. Web page classification becomes more important in fields like web mining and information retrieval. Traditional classifiers are usually hand-crafted and non-adaptive, which makes them inappropriate in a broader context. We introduce an ensemble of methods, and a subsequent study of their results, to create a more generic and modular crawler and scraper for detecting and extracting information from E-commerce web pages. The collected information may then be processed and used in the pricing decision. This framework goes by the name Prometheus and has the goal of extracting knowledge from E-commerce web sites. The process requires crawling an online store and gathering product pages, which implies that, given a web page, the framework must be able to determine whether it is a product page. To achieve this we classify pages into three categories: catalogue, product and "spam". The page classification stage was addressed based on the HTML text as well as the visual layout, featuring both traditional methods and Deep Learning approaches. Once a set of product pages has been identified, we proceed to the extraction of the pricing information. This is not a trivial task due to the disparity of approaches used to create web pages. Furthermore, most product pages are dynamic in the sense that they are truly a page for a family of related products. For instance, when visiting a shoe store, a particular model is probably available in a number of sizes and colours; such a model may be displayed in a single dynamic web page, making it necessary for our framework to explore all the relevant combinations. This process is called scraping and is the last stage of the Prometheus framework.

    FilteredWeb: A Framework for the Automated Search-Based Discovery of Blocked URLs

    Various methods have been proposed for creating and maintaining lists of potentially filtered URLs to allow for measurement of ongoing internet censorship around the world. Whilst testing a known resource for evidence of filtering can be relatively simple, given appropriate vantage points, discovering previously unknown filtered web resources remains an open challenge. We present a new framework for automating the process of discovering filtered resources through the use of adaptive queries to well-known search engines. Our system applies information retrieval algorithms to isolate characteristic linguistic patterns in known filtered web pages; these are then used as the basis for web search queries. The results of these queries are then checked for evidence of filtering, and newly discovered filtered resources are fed back into the system to detect further filtered content. Our implementation of this framework, applied to China as a case study, shows that this approach is demonstrably effective at detecting significant numbers of previously unknown filtered web pages, making a significant contribution to the ongoing detection of internet filtering as it develops. Our tool is currently deployed and has been used to discover 1355 domains that are poisoned within China as of Feb 2017, 30 times more than are contained in the most widely-used public filter list. Of these, 759 are outside of the Alexa Top 1000 domains list, demonstrating the capability of this framework to find more obscure filtered content. Further, our initial analysis of filtered URLs, and the search terms that were used to discover them, gives further insight into the nature of the content currently being blocked in China. Comment: To appear in "Network Traffic Measurement and Analysis Conference 2017" (TMA2017).
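
    The query-generation step described above, isolating characteristic terms from known filtered pages and feeding them to a search engine, can be sketched with a small TF-IDF ranking. The scoring, tokenisation and query format below are illustrative assumptions, not FilteredWeb's published pipeline, and the downstream filtering check (e.g. detecting DNS poisoning) is omitted.

```python
# An illustrative TF-IDF ranking for the query-generation step; the scoring,
# tokenisation and query format are assumptions, not FilteredWeb's pipeline.
import math
import re
from collections import Counter

def tokens(text: str):
    """Lower-case alphabetic tokens of length >= 3."""
    return re.findall(r"[a-z]{3,}", text.lower())

def top_query_terms(filtered_pages, background_pages, k=5):
    """Pick k terms frequent in filtered pages but rare in the background set."""
    doc_freq = Counter()
    for page in background_pages:
        doc_freq.update(set(tokens(page)))
    n_docs = max(len(background_pages), 1)

    term_freq = Counter()
    for page in filtered_pages:
        term_freq.update(tokens(page))

    scored = {
        term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
        for term, count in term_freq.items()
    }
    return [t for t, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:k]]

def build_query(filtered_pages, background_pages) -> str:
    """Join the most characteristic terms into a single search-engine query."""
    return " ".join(top_query_terms(filtered_pages, background_pages))
```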