iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling
Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content about current events such as the Ebola outbreak or the Ukraine crisis on demand. Existing focused crawling approaches, however, consider only topical aspects and ignore temporal aspects, so they cannot produce thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content that state-of-the-art focused crawlers do not use. In this paper we address the problem of collecting fresh and relevant Web and Social Web content for a topic of interest by seamlessly integrating Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of
fresh Social Media content to guide the crawl.
Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 2015
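A minimal sketch of the general idea described above, not the iCrawl implementation itself: URLs harvested from a fresh Social Media stream enter the crawl frontier with a recency-weighted priority, so the crawler is continuously steered toward newly shared pages. The class name, half-life parameter and scoring function are illustrative assumptions.

```python
import heapq
import time

class SocialGuidedFrontier:
    """Crawl frontier that favours URLs recently shared on social media.

    Illustrative only: a real integrated crawler would combine this with
    topical relevance scoring and politeness/queue management.
    """

    def __init__(self, half_life_seconds=3600.0):
        self.half_life = half_life_seconds
        self._heap = []        # entries are (negative priority, url)
        self._seen = set()

    def _freshness(self, shared_at):
        """Exponentially decaying freshness score in (0, 1]."""
        age = max(0.0, time.time() - shared_at)
        return 0.5 ** (age / self.half_life)

    def add_from_social_stream(self, url, shared_at, topical_score=1.0):
        """Enqueue a URL seen in a social media post (e.g. a tweet)."""
        if url in self._seen:
            return
        self._seen.add(url)
        priority = topical_score * self._freshness(shared_at)
        heapq.heappush(self._heap, (-priority, url))

    def next_url(self):
        """Pop the most promising URL for the crawler to fetch next."""
        if not self._heap:
            return None
        _, url = heapq.heappop(self._heap)
        return url
```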
Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis
The Market Blended Insight project has the objective of improving UK business-to-business marketing performance using semantic web technologies. In this project, we are implementing an ontology-driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. It deals with both semi-structured data and unstructured text on the web, annotating and then translating the extracted data according to the backend schema.
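As a rough illustration of the "annotate and translate according to the backend schema" step, the sketch below maps fields extracted from a company web page into RDF triples with rdflib. The namespace, property names and input record are invented for the example and do not reflect the actual Market Blended Insight schema.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# Hypothetical backend schema namespace (not the real project ontology).
MBI = Namespace("http://example.org/mbi/schema#")

def company_to_triples(graph, company):
    """Translate one extracted record into triples for the backend triple store."""
    subject = URIRef("http://example.org/mbi/company/" + company["id"])
    graph.add((subject, RDF.type, MBI.Company))
    graph.add((subject, RDFS.label, Literal(company["name"])))
    graph.add((subject, MBI.locatedIn, Literal(company["region"])))
    return graph

g = Graph()
g.bind("mbi", MBI)
company_to_triples(g, {"id": "42", "name": "Acme Ltd", "region": "Sheffield"})
print(g.serialize(format="turtle"))
```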
Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler
Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a specialized dictionary is detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to a background or general language corpus. Terms found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
Methodologies for the Automatic Location of Academic and Educational Texts on the Internet
Traditionally, online databases of web resources have been compiled by a human editor or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as "appropriate" to a given database, a problem only solved by complex text content analysis.
This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, it looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data is presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
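The classification step can be approximated with a standard supervised text classifier; the sketch below (scikit-learn, with invented training data) trains a TF-IDF plus linear model to label harvested pages as academic/pedagogic or not, standing in for the fuller text content analysis the paper discusses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: texts of harvested pages plus manual labels
# (1 = academic/pedagogic material, 0 = other).
train_texts = ["...page text...", "...another page..."]
train_labels = [1, 0]

classifier = make_pipeline(
    TfidfVectorizer(stop_words="english", max_features=50000),
    LogisticRegression(max_iter=1000),
)
classifier.fit(train_texts, train_labels)

def is_academic(page_text, threshold=0.5):
    """Decide whether a harvested page belongs in the academic database."""
    return classifier.predict_proba([page_text])[0][1] >= threshold
```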
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
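One way to picture the RSS-plus-HTML idea, sketched under assumptions (feedparser and BeautifulSoup, entries carrying at least a summary, crude text matching): the text of an RSS entry is located inside the corresponding HTML page, and the enclosing element then serves as an extraction rule for posts whose feeds contain only summaries.

```python
import feedparser
import requests
from bs4 import BeautifulSoup

def locate_content_element(feed_url):
    """Find the HTML element holding the post body, using the RSS text as an anchor."""
    feed = feedparser.parse(feed_url)
    entry = feed.entries[0]
    snippet = BeautifulSoup(entry.summary, "html.parser").get_text()[:80]

    html = requests.get(entry.link, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # The smallest element containing the snippet approximates the post container;
    # its tag/class pair can then be reused as an extraction rule for other posts.
    best = None
    for element in soup.find_all(True):
        if snippet and snippet in element.get_text():
            best = element       # deeper matches overwrite shallower ones
    if best is not None:
        return best.name, tuple(best.get("class", []))
    return None
```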
Web Data Extraction For Content Aggregation From E-Commerce Websites
The World Wide Web has become an unlimited source of data, and search engines have made this information available to the everyday Internet user. Still, some information is not easily accessible through existing search engines, so there remains a need to create new search engines that present information better than before. In order to present data in a way that adds value, it must first be collected, then processed and analysed. This master's thesis focuses on the data collection part of that process. We present ZedBot, a modern information extraction system that transforms the semi-structured data found on web pages into structured form with high accuracy. It meets most of the requirements set for a modern data extraction system: it is platform independent, it has a powerful rule description system with semi-automatic rule (wrapper) generation, and it has an easy-to-use user interface for annotating data. A specially designed web crawler allows extraction to be performed across an entire web site without human interaction. We show that the presented tool is suitable for extracting highly accurate data from a large number of websites, and that the data it produces can be used as a source for product aggregation and for creating new added value.
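The rule-driven extraction described above can be pictured, in a hedged sketch unrelated to ZedBot's actual internals, as a small declarative wrapper: CSS selectors annotated or generated once per site and then applied by a crawler to every product page. The selectors and field names below are invented.

```python
import requests
from bs4 import BeautifulSoup

# A wrapper is a mapping from output fields to CSS selectors,
# defined (or generated) once per e-commerce site.
PRODUCT_WRAPPER = {
    "title": "h1.product-title",
    "price": "span.price",
    "description": "div.product-description",
}

def apply_wrapper(url, wrapper):
    """Extract a structured record from one semi-structured product page."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    record = {}
    for field, selector in wrapper.items():
        node = soup.select_one(selector)
        record[field] = node.get_text(strip=True) if node else None
    return record
```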
Heuristics for Crawling WSDL Descriptions of Web Service Interfaces - the Heritrix Case
The goal of this thesis is to configure and modify the Heritrix web crawler to add support for finding WSDL description URIs. Heritrix is an open-source spider, written in the Java programming language, designed to help the Internet Archive store the contents of the Internet. It already includes most of the common heuristics used for spidering and has a modular architecture that makes it easy to alter. WSDL is an XML-based language that specifies the location and protocol of a web service and describes the methods and functions it offers.
We gathered a collection of strategies and crawler job configuration options to be used with Heritrix, drawn from published work other teams had done on the topic. In addition, we created a new module in the crawler's source code that allows logging of search results without any excessive data.
With the job configuration changes mentioned, it was possible to spider the web for WSDL description URIs, but because Heritrix does not support focused crawling, the spider would explore every web site it happened to stumble upon, most of which contain no information relevant to finding web services. To guide the spider's job towards resources potentially containing "interesting" data, we implemented support for focused crawling of WSDL URIs. The change required a further module in Heritrix's source code; the algorithm used as the basis for our solution was described in one of the articles.
To see whether our enhancement improved the crawl process, a series of experiments was conducted in which we compared the performance and accuracy of two crawlers. Both were configured for WSDL description crawling, but one of them was also fitted with the module providing support for focused crawling. From the analysis of the experiments' results we concluded that, although the crawler job used as the experiments' baseline processed URIs slightly faster, the spider with the improvements found WSDL descriptions more accurately and was able to find more of them.
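The thesis implements its changes as Heritrix modules in Java; the short sketch below only illustrates the underlying heuristic, independently of Heritrix: discovered links are scored by URL and anchor-text cues that commonly indicate WSDL descriptions, and high-scoring links are fetched first. The cue list and weights are assumptions for illustration.

```python
from urllib.parse import urlparse

# URL / anchor-text cues that commonly point towards WSDL service descriptions.
WSDL_CUES = ("wsdl", "?wsdl", "service", "soap", "axis", "ws/")

def wsdl_link_score(url, anchor_text=""):
    """Heuristic priority for a discovered link; higher means 'fetch sooner'."""
    text = (url + " " + anchor_text).lower()
    score = sum(1.0 for cue in WSDL_CUES if cue in text)
    if urlparse(url).path.lower().endswith(".wsdl") or url.lower().endswith("?wsdl"):
        score += 5.0      # near-certain WSDL description
    return score

# Example: rank a frontier of discovered links before fetching.
frontier = [
    "http://example.org/services/QuoteService?wsdl",
    "http://example.org/about.html",
]
for link in sorted(frontier, key=wsdl_link_score, reverse=True):
    print(wsdl_link_score(link), link)
```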