
    Brass: A Queueing Manager for Warrick

    When an individual loses their website and a backup cannot be found, they can download and run Warrick, a web-repository crawler which recovers the lost website by crawling the holdings of the Internet Archive and several search engine caches. Running Warrick locally requires some technical know-how, so we have created an online queueing system called Brass which simplifies the task of recovering lost websites. We discuss the technical aspects of reconstructing websites and the implementation of Brass. Our newly developed system allows anyone to recover a lost website with a few mouse clicks and allows us to track which websites the public is most interested in saving.
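    Recovering a page from a web repository starts with asking the archive which captures it holds. The sketch below shows that first step against the Internet Archive's public CDX index; the target URL and the fields printed are illustrative assumptions, and Warrick/Brass combine this kind of lookup with queries to search engine caches rather than using this exact code.

```python
import json
import urllib.parse
import urllib.request

def list_archived_copies(url):
    """Ask the Internet Archive CDX index which captures it holds for a URL."""
    query = urllib.parse.urlencode({
        "url": url,
        "output": "json",
        "limit": "20",
    })
    with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{query}") as resp:
        rows = json.load(resp)
    if not rows:
        print(f"no captures found for {url}")
        return
    # The first row is the header; the remaining rows are individual captures.
    header, captures = rows[0], rows[1:]
    for row in captures:
        capture = dict(zip(header, row))
        # Each capture can be fetched back from the Wayback Machine by timestamp.
        print(capture["timestamp"], capture["original"], capture["statuscode"])

if __name__ == "__main__":
    list_archived_copies("example.com")  # illustrative target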

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
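    As a rough illustration of the RSS-plus-HTML idea the report builds on, the sketch below aligns feed entries with post pages by looking for the smallest element whose text contains the feed summary. The feed URL, the feedparser/BeautifulSoup dependencies and the containment heuristic are assumptions made for the example; the deliverable's actual methodology is based on unsupervised learning and is considerably more involved.

```python
import urllib.request

import feedparser                      # assumed third-party dependency
from bs4 import BeautifulSoup          # assumed third-party dependency

def locate_post_bodies(feed_url):
    """For each feed entry, guess the HTML element on the post page that holds
    the post body, by finding the smallest element containing the feed summary."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        summary_html = entry.get("summary", "")
        link = entry.get("link")
        if not link or not summary_html:
            continue
        summary_text = BeautifulSoup(summary_html, "html.parser").get_text(" ", strip=True)
        with urllib.request.urlopen(link) as resp:
            page = BeautifulSoup(resp.read(), "html.parser")
        # Candidate elements whose rendered text contains the feed summary;
        # the smallest one is a crude guess at the post body container.
        candidates = [el for el in page.find_all(True)
                      if summary_text in el.get_text(" ", strip=True)]
        if candidates:
            best = min(candidates, key=lambda el: len(el.get_text()))
            print(link, "->", best.name, best.get("class"))

if __name__ == "__main__":
    locate_post_bodies("https://example.org/blog/feed")  # illustrative feed URL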

    Global-Scale Resource Survey and Performance Monitoring of Public OGC Web Map Services

    One of the most widely implemented service standards provided by the Open Geospatial Consortium (OGC) to the user community is the Web Map Service (WMS). WMS is widely employed globally, but there is limited knowledge of the global distribution, adoption status or service quality of these online WMS resources. To fill this void, we investigated global WMS resources and performed distributed performance monitoring of these services. This paper explicates a crawling method used to discover WMSs and a distributed monitoring framework that monitored 46,296 WMSs continuously for over one year. We analyzed server locations, provider types, themes, the spatiotemporal coverage of map layers and the service versions for 41,703 valid WMSs. Furthermore, we appraised the stability and performance of the basic operations (i.e., GetCapabilities and GetMap) for 1,210 selected WMSs. We discuss the major reasons for request errors and performance issues, as well as the relationship between service response times and the spatiotemporal distribution of client monitoring sites. This paper will help service providers, end users and developers of standards to grasp the status of global WMS resources, as well as to understand the adoption status of OGC standards. The conclusions drawn in this paper can benefit geospatial resource discovery and service performance evaluation, and guide service performance improvements.
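    Both monitored operations are ordinary HTTP requests with key-value parameters, so a single probe can be sketched in a few lines. The endpoint below is an illustrative placeholder; a monitoring framework of the kind described would distribute such probes across many client sites and handle many more failure modes.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def probe_wms(base_url):
    """Time a WMS GetCapabilities request and list the advertised layer names."""
    params = urllib.parse.urlencode({
        "SERVICE": "WMS",
        "REQUEST": "GetCapabilities",
        "VERSION": "1.3.0",
    })
    started = time.monotonic()
    with urllib.request.urlopen(f"{base_url}?{params}", timeout=30) as resp:
        body = resp.read()
    elapsed = time.monotonic() - started
    # Layer names live in <Layer><Name> elements; the XML namespace varies,
    # so match on the local part of the tag only.
    root = ET.fromstring(body)
    layers = [name.text
              for layer in root.iter() if layer.tag.endswith("Layer")
              for name in layer if name.tag.endswith("Name")]
    print(f"{base_url}: {elapsed:.2f}s to answer GetCapabilities, {len(layers)} named layers")

if __name__ == "__main__":
    probe_wms("https://example.com/geoserver/wms")  # illustrative endpoint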

    Heuristics for Crawling WSDL Descriptions of Web Service Interfaces - the Heritrix Case

    Summary: The goal of this bachelor's thesis is to configure and extend the open-source Heritrix web crawler so that it can find WSDL files that describe web services. A web crawler is a program that automatically traverses the Internet looking for web documents of interest. WSDL is an XML-based language that specifies the location and protocol of a web service and describes the methods and functions it offers. To achieve this goal, published articles describing different strategies for discovering web services with a crawler were studied, and based on these works a Heritrix configuration was created that makes it possible to search for WSDL service descriptions. In addition, a supplementary Heritrix class was written in Java that stores the crawl results in a simplified form. One of the articles described adding focused-crawling support to a Heritrix crawler searching for web services: focused crawling lets the crawler evaluate newly discovered web pages and concentrate on those more likely to contain the sought resources. Since Heritrix lacks this functionality, it was added in the course of this work by creating an additional module, with the algorithm based on the solution described in that article. To verify whether the enhancement made the crawl process more accurate or faster, an experiment with three runs was conducted: two Heritrix instances were started, both configured to search for WSDL service descriptions, but only one of them was given focused-crawling support, and the number of services found and the total number of pages crawled were observed. From the analysis of the results it could be concluded that focused crawling makes the crawl process more accurate and thereby allows WSDL service descriptions to be found faster.
    The goal of this thesis is to configure and modify the Heritrix web crawler to add support for finding WSDL description URIs. Heritrix is an open-source spider written in the Java programming language and designed to help the Internet Archive store the contents of the Internet. It already includes most of the common heuristics used for spidering, and its modular architecture makes it easy to alter. We gathered a collection of strategies and crawler job configuration options to be used with Heritrix, drawn from published work by other teams on the topic. In addition, we created a new module in the crawler's source code that allows logging of search results without any excessive data. With the job configuration changes mentioned, it was possible to spider the web for WSDL description URIs, but because Heritrix does not support focused crawling, the spider would explore every web site it happened to stumble upon, most of which contain no information relevant to finding web services. To guide the crawl towards resources potentially containing relevant data, we implemented support for focused crawling of WSDL URIs. The change required a new module in Heritrix's source code; the algorithm used as the basis for our solution was described in one of the articles. To see whether our enhancement improved the crawl process, a series of experiments was conducted in which we compared the performance and accuracy of two crawlers, both configured for WSDL description crawling, but only one fitted with the module providing focused-crawling support. From the analysis of the experiments' results we deduced that although the crawler job set as the experiments' baseline processed URIs slightly faster, the spider with the improvements found WSDL descriptions more accurately and was able to find more of them.
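    The focused-crawling extension can be thought of as a link-scoring rule: outlinks whose URL or anchor context hints at a service description are visited first. The sketch below illustrates such a heuristic in isolation, with made-up keyword weights and a plain priority queue; it is not the Java module the thesis actually adds to Heritrix.

```python
import heapq

# Illustrative keyword weights; the thesis module's scoring differs.
URL_HINTS = {"?wsdl": 5.0, ".wsdl": 5.0, "service": 2.0, "soap": 2.0, "api": 1.0, "ws": 0.5}

def score_outlink(url, anchor_text=""):
    """Score how likely a link is to lead toward a WSDL description."""
    text = (url + " " + anchor_text).lower()
    return sum(weight for hint, weight in URL_HINTS.items() if hint in text)

def pick_next(frontier):
    """Pop the most promising (score, url) pair from the max-priority frontier."""
    neg_score, url = heapq.heappop(frontier)
    return -neg_score, url

if __name__ == "__main__":
    frontier = []
    for link in ["http://example.com/ws/StockQuote?wsdl",
                 "http://example.com/about-us.html",
                 "http://example.com/services/soap/"]:
        # Negate the score because heapq is a min-heap.
        heapq.heappush(frontier, (-score_outlink(link), link))
    print(pick_next(frontier))  # the ?wsdl link should surface first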