Brass: A Queueing Manager for Warrick
When an individual loses their website and a backup cannot be found, they can download and run Warrick, a web-repository crawler which will recover their lost website by crawling the holdings of the Internet Archive and several search engine caches. Running Warrick locally requires some technical know-how, so we have created an online queueing system called Brass which simplifies the task of recovering lost websites. We discuss the technical aspects of reconstructing websites and the implementation of Brass. Our newly developed system allows anyone to recover a lost website with a few mouse clicks and allows us to track which websites the public is most interested in saving.
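The recovery step Warrick performs against the Internet Archive can be illustrated with the Archive's public CDX API. The sketch below is a minimal Python illustration of that idea, not Warrick's or Brass's actual code; the function name and the `limit` parameter are our own choices.

```python
import requests

def list_archived_snapshots(url, limit=10):
    """Ask the Internet Archive's CDX API which captures exist for a URL."""
    resp = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={"url": url, "output": "json", "limit": limit},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()  # JSON array of arrays; the first row is the header
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in captures]

# Each capture's timestamp and original URL identify a replayable copy at
# https://web.archive.org/web/<timestamp>/<original>
for capture in list_archived_snapshots("example.com"):
    print(capture["timestamp"], capture["original"])
```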
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
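The report's feed-based extraction idea can be sketched concretely. The following Python fragment is an illustrative simplification, not the BlogForever implementation: it uses an entry's RSS summary as a weak label and picks the HTML element whose text overlaps it most. The function name, the token-overlap score, and the restriction to div elements are our own assumptions.

```python
import feedparser
import requests
from bs4 import BeautifulSoup

def locate_post_bodies(feed_url):
    """Use each RSS entry's summary to find the matching content element
    in the post's HTML page (unsupervised: no hand-labelled templates)."""
    for entry in feedparser.parse(feed_url).entries:
        if not entry.get("link"):
            continue
        summary = BeautifulSoup(entry.get("summary", ""), "html.parser").get_text()
        tokens = set(summary.lower().split())
        page = BeautifulSoup(requests.get(entry.link, timeout=30).text, "html.parser")
        # Score candidate containers by token overlap with the feed summary;
        # the best-scoring one is taken to be the post body.
        best = max(
            page.find_all("div"),
            key=lambda el: len(tokens & set(el.get_text().lower().split())),
            default=None,
        )
        yield entry.link, best
```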
Global-Scale Resource Survey and Performance Monitoring of Public OGC Web Map Services
One of the most widely implemented service standards provided by the Open Geospatial Consortium (OGC) to the user community is the Web Map Service (WMS). WMS is widely employed globally, but there is limited knowledge of the global distribution, adoption status, or service quality of these online WMS resources. To fill this void, we surveyed global WMS resources and performed distributed performance monitoring of these services. This paper explicates a distributed monitoring framework that was used to monitor 46,296 WMSs continuously for over one year, along with the crawling method used to discover them. We analyzed server locations, provider types, themes, the spatiotemporal coverage of map layers, and the service versions for 41,703 valid WMSs. Furthermore, we appraised the stability and performance of the basic operations (i.e., GetCapabilities and GetMap) for 1,210 selected WMSs. We discuss the major reasons for request errors and performance issues, as well as the relationship between service response times and the spatiotemporal distribution of client monitoring sites. This paper will help service providers, end users, and developers of standards to grasp the status of global WMS resources and to understand the adoption status of OGC standards. The conclusions drawn in this paper can benefit geospatial resource discovery and service performance evaluation, and can guide service performance improvements.
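The basic probe in such a monitoring framework can be sketched as follows. GetCapabilities is a standard WMS operation, so the request parameters below follow the OGC specification; the function name, the timeout, and the simple success heuristic are our own, and a real deployment would also time GetMap requests and repeat the probe from distributed client sites.

```python
import time
import requests

def probe_wms(base_url, version="1.3.0"):
    """Issue a WMS GetCapabilities request and record availability and latency."""
    params = {"SERVICE": "WMS", "REQUEST": "GetCapabilities", "VERSION": version}
    start = time.monotonic()
    try:
        resp = requests.get(base_url, params=params, timeout=60)
        # A healthy WMS returns an XML capabilities document; an HTML error
        # page or a service exception report counts as a failure here.
        ok = resp.ok and b"Capabilities" in resp.content
        return {"url": base_url, "ok": ok, "status": resp.status_code,
                "seconds": time.monotonic() - start}
    except requests.RequestException as exc:
        return {"url": base_url, "ok": False, "error": str(exc),
                "seconds": time.monotonic() - start}
```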
Heuristics for Crawling WSDL Descriptions of Web Service Interfaces - the Heritrix Case
Summary
The goal of this bachelor's thesis is to configure and extend the open-source Heritrix web crawler. As a result of the changes, Heritrix must be able to find WSDL files that describe web services. A web crawler is a program that automatically roams the expanses of the Internet looking for the desired web documents. WSDL is an XML-format language that specifies the location and protocol of a web service and describes the methods and functions it offers.
To achieve this goal, published articles describing various strategies for finding web services on the Internet with a web crawler were studied. Based on these works, a Heritrix configuration was created that made it possible to search for WSDL service descriptions. In addition, a supplementary Heritrix class was written in the Java programming language that records the results of a web crawl in a simplified form.
One of the articles found described adding focused-crawling support to a Heritrix crawler searching for web services. Focused crawling lets the crawler assess newly discovered web pages and concentrate on the pages that are more likely to contain the resources being sought. Since the program under study lacks focused-crawling functionality, it was added in the course of this work by creating an additional module. The algorithm was based on the solution described in the aforementioned article.
To check whether the added enhancement made the crawling process more accurate or faster, an experiment with three trials was carried out. Two Heritrix instances were launched, both configured to search for WSDL service descriptions, but focused-crawling support was added to only one of them. During the trials, the number of services found and the total number of web pages combed through were observed.
From the analysis of the experimental results it could be concluded that the focused-crawling functionality makes the crawling process more accurate and thereby allows WSDL service descriptions to be found faster.
The goal of this thesis is to configure and modify the Heritrix web crawler to add support for finding WSDL description URIs. Heritrix is an open-source spider written in the Java programming language and designed to help the Internet Archive store the contents of the Internet. It already includes most of the common heuristics used for spidering, and its modular architecture makes it easy to alter.
We gathered a collection of strategies and crawler job configuration options to be used with Heritrix; these originated from published work that other teams had done on the topic. In addition, we created a new module in the crawler's source code that allows logging of search results without any extraneous data.
With the job configuration changes mentioned, it was possible to spider the web for WSDL description URIs, but because Heritrix does not support focused crawling, the spider would explore every website it happened to stumble upon. Most of these sites contain no information relevant to finding web services. To steer the spider's job toward resources potentially containing "interesting" data, we implemented support for focused crawling of WSDL URIs. The change required creating a new module in Heritrix's source code; the algorithm used as the basis for our solution was described in one of the articles.
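Heritrix's actual module lives in its Java code base and plugs into the crawler's frontier; purely to make the scoring idea concrete, here is a compact Python sketch of a focused frontier for WSDL hunting. The keyword weights and the bonus for the conventional `.wsdl` / `?wsdl` endpoints are illustrative assumptions, not the thesis's exact algorithm.

```python
import heapq
from urllib.parse import urlparse

# Illustrative weights: URLs mentioning these terms are assumed more likely
# to lead to WSDL service descriptions.
KEYWORDS = {"wsdl": 5, "soap": 3, "service": 2, "api": 1}

def score(url):
    """Heuristic relevance of a URL to the WSDL hunt (higher is better)."""
    parts = urlparse(url)
    path, query = parts.path.lower(), parts.query.lower()
    s = sum(w for kw, w in KEYWORDS.items() if kw in path or kw in query)
    if path.endswith(".wsdl") or query == "wsdl":
        s += 10  # the two conventional ways a WSDL description is exposed
    return s

class FocusedFrontier:
    """URL frontier that always hands back the most promising link first."""
    def __init__(self):
        self._heap, self._seen = [], set()

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score(url), url))

    def next_url(self):
        return heapq.heappop(self._heap)[1] if self._heap else None
```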
To see whether our enhancement improved the crawl process, a series of experiments was conducted in which we compared the performance and accuracy of two crawlers. Both were configured to crawl for WSDL descriptions, but one was also fitted with the module providing focused-crawling support. From the analysis of the experimental results we deduced that, although the baseline crawler job processed URIs slightly faster, the spider with the improvements found WSDL descriptions more accurately and was able to find more of them.