537 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed for a given domain in other domains.
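
    As a hedged illustration of the kind of wrapper-based extraction such surveys cover (not an example taken from the survey itself), the sketch below pulls structured records out of an HTML page with fixed XPath rules; the URL and the XPath expressions are hypothetical placeholders.

```python
# Minimal wrapper-style extraction sketch (illustrative only; the URL and
# XPath expressions are hypothetical placeholders, not from the survey).
import requests
from lxml import html

def extract_products(url):
    """Fetch a page and extract (name, price) records with fixed XPath rules."""
    page = html.fromstring(requests.get(url, timeout=10).content)
    records = []
    for row in page.xpath('//div[@class="product"]'):
        name = row.xpath('string(.//h2)').strip()
        price = row.xpath('string(.//span[@class="price"])').strip()
        records.append({"name": name, "price": price})
    return records

if __name__ == "__main__":
    for rec in extract_products("http://example.com/catalog"):
        print(rec)
```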

    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations related to the historical, social, and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on spam that appears in the weblog context, concluding with a proposal for a spam detection workflow that might form the basis of the spam detection component of the BlogForever software.
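
    As one hedged illustration of a component such a detection workflow might include (not the BlogForever methodology itself), the sketch below trains a simple bag-of-words classifier on labelled blog comments; the training examples are made up.

```python
# Illustrative spam-detection sketch, not the BlogForever workflow itself:
# a naive Bayes classifier over blog-comment text with bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training set; a real system would use a labelled corpus.
comments = ["Buy cheap pills now!!!", "Great post, thanks for sharing",
            "Click here to win money", "I disagree with your second point"]
labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(comments, labels)

print(model.predict(["win free money now", "interesting analysis of weblog archiving"]))
```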

    MementoMap: A Web Archive Profiling Framework for Efficient Memento Routing

    With the proliferation of public web archives, it is becoming more important to better profile their contents, both to understand their immense holdings and to support routing of requests in Memento aggregators. A memento is a past version of a web page and a Memento aggregator is a tool or service that aggregates mementos from many different web archives. To save resources, the Memento aggregator should only poll the archives that are likely to have a copy of the requested Uniform Resource Identifier (URI). Using the Crawler Index (CDX), we generate profiles of the archives that summarize their holdings and use them to inform routing of the Memento aggregator's URI requests. Additionally, we use full-text search (when available) or sample URI lookups to build an understanding of an archive's holdings. Previous work in profiling ranged from using full URIs (no false positives, but with large profiles) to using only top-level domains (TLDs) (smaller profiles, but with many false positives). This work explores strategies in between these two extremes.

    For evaluation we used CDX files from Archive-It, UK Web Archive, Stanford Web Archive Portal, and Arquivo.pt. Moreover, we used web server access log files from the Internet Archive's Wayback Machine, UK Web Archive, Arquivo.pt, LANL's Memento Proxy, and ODU's MemGator server. In addition, we utilized a historical dataset of URIs from DMOZ. In early experiments with various URI-based static profiling policies we successfully identified about 78% of the URIs that were not present in the archive with less than 1% relative cost as compared to the complete knowledge profile, and 94% of such URIs with less than 10% relative cost, without any false negatives. In another experiment we found that we can correctly route 80% of the requests while maintaining about 0.9 recall by discovering only 10% of the archive holdings and generating a profile that costs less than 1% of the complete knowledge profile.

    We created MementoMap, a framework that allows web archives and third parties to express holdings and/or voids of an archive of any size with varying levels of detail to fulfil various application needs. Our archive profiling framework enables tools and services to predict and rank archives where mementos of a requested URI are likely to be present. In static profiling policies we predefined, for each policy, the maximum depth of host and path segments of URIs used as URI keys. This gave us a good baseline for evaluation, but was not suitable for merging profiles with different policies. Later, we introduced a more flexible means to represent URI keys that uses wildcard characters to indicate whether a URI key was truncated. Moreover, we developed an algorithm to roll up URI keys dynamically at arbitrary depths when sufficient archiving activity is detected under certain URI prefixes. In an experiment with dynamic profiling of archival holdings we found that a MementoMap of less than 1.5% relative cost can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive without any false negatives (i.e., 100% recall). In addition, we separately evaluated archival voids based on the most frequently accessed resources in the access log and found that we could have avoided more than 8% of the false positives without introducing any false negatives.

    We defined a routing score that can be used for Memento routing. Using a cut-off threshold technique on our routing score we achieved over 96% accuracy if we accept about 89% recall, and for a recall of 99% we achieved about 68% accuracy, which translates to about 72% savings in wasted lookup requests in our Memento aggregator. Moreover, when using top-k archives based on our routing score for routing and choosing only the topmost archive, we missed only about 8% of the sample URIs that are present in at least one archive, but when we selected the top-2 archives, we missed less than 2% of these URIs. We also evaluated a machine learning-based routing approach, which resulted in overall better accuracy, but poorer recall due to the low prevalence of the sample lookup URI dataset in different web archives.

    We contributed various algorithms, such as a space- and time-efficient approach to ingest large lists of URIs to generate MementoMaps and a Random Searcher Model to discover samples of the holdings of web archives. We contributed numerous tools to support various aspects of web archiving and replay, such as MemGator (a Memento aggregator), InterPlanetary Wayback (a novel archival replay system), Reconstructive (a client-side request rerouting ServiceWorker), and AccessLog Parser. Moreover, this work yielded a file format specification draft called Unified Key Value Store (UKVS) that we use for serialization and dissemination of MementoMaps. It is a flexible and extensible file format that allows easy interaction with Unix text processing tools. UKVS can be used in many applications beyond MementoMaps.
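
    A minimal sketch of the URI-key idea described above, assuming SURT-style host reversal and a wildcard to mark truncation; the exact MementoMap/UKVS serialization may differ.

```python
# Sketch of generating a truncated, SURT-like URI key with a wildcard,
# loosely following the static profiling policies described above.
# Exact MementoMap/UKVS serialization details may differ.
from urllib.parse import urlsplit

def uri_key(uri, max_host_segments=3, max_path_segments=1):
    """Reverse the hostname, keep limited host/path depth, mark truncation with '*'."""
    parts = urlsplit(uri)
    host = parts.hostname.split(".")[::-1]            # e.g. ['org', 'example', 'www']
    path = [p for p in parts.path.split("/") if p]
    truncated = len(host) > max_host_segments or len(path) > max_path_segments
    key = ",".join(host[:max_host_segments]) + ")/" + "/".join(path[:max_path_segments])
    return key + ("/*" if truncated else "")

print(uri_key("https://www.example.org/a/b/c.html"))   # org,example,www)/a/*
```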

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
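
    As a hedged sketch of the RSS-based entry point discussed above (not the BlogForever implementation), a blog's feed already exposes structured fields that can anchor extraction from the corresponding HTML pages; the feed URL below is a placeholder.

```python
# Illustrative sketch: use a blog's RSS feed as a structured entry point for
# extraction, as discussed above. Not the BlogForever implementation; the
# feed URL is a placeholder.
import feedparser

feed = feedparser.parse("http://example-blog.com/feed.rss")
for entry in feed.entries:
    post = {
        "title": entry.get("title", ""),
        "url": entry.get("link", ""),
        "published": entry.get("published", ""),
        "summary": entry.get("summary", ""),
    }
    # The HTML page at post["url"] could then be fetched and aligned with the
    # feed content to learn extraction rules, e.g. for comments or microdata.
    print(post["title"], "->", post["url"])
```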

    Heuristics for Crawling WSDL Descriptions of Web Service Interfaces - the Heritrix Case

    The goal of this thesis is to configure and modify the Heritrix web crawler to add support for finding WSDL description URIs. Heritrix is an open-source spider written in the Java programming language and designed to help the Internet Archive store the contents of the Internet. It already includes most of the common heuristics used for spidering, and its modular architecture makes it easy to alter. We gathered a collection of strategies and crawler job configuration options to be used on Heritrix, originating from published works by other teams on the topic. In addition, we created a new module in the crawler's source code that allows logging of search results without any excessive data. With the job configuration changes mentioned, it was possible to spider the web for WSDL description URIs, but as Heritrix does not support focused crawling, the spider would explore every website it happened to stumble upon. Most of these sites contain no information relevant to finding web services. To guide the course of the spider's job to the resources potentially containing "interesting" data, we implemented support for focused crawling of WSDL URIs. The change required the creation of a new module in Heritrix's source code; the algorithm used as the basis for our solution was described in one of the articles. To see whether our enhancement improved the crawl process, a series of experiments was conducted in which we compared the performance and accuracy of two crawlers. Both were configured to crawl for WSDL descriptions, but one was also fitted with the module providing support for focused crawling. From the analysis of the experiments' results we deduced that although the baseline crawler job processed URIs a bit faster, the spider with the improvements found WSDL descriptions more accurately and was able to find more of them.
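
    A hedged sketch of the kind of link-scoring heuristic that focused crawling relies on, not the actual Heritrix module developed in the thesis; the keywords and weights are illustrative.

```python
# Illustrative link-scoring heuristic for focused crawling of WSDL
# descriptions; a sketch, not the Heritrix module implemented in the thesis.
def wsdl_score(uri, anchor_text=""):
    """Higher scores mean the link is more likely to lead to a WSDL description."""
    uri_l, anchor_l = uri.lower(), anchor_text.lower()
    score = 0.0
    if uri_l.endswith(".wsdl") or uri_l.endswith("?wsdl"):
        score += 1.0                         # direct hit on a WSDL description
    for kw in ("service", "soap", "api", "ws"):
        if kw in uri_l:
            score += 0.2
        if kw in anchor_l:
            score += 0.1
    return score

frontier = ["http://example.com/StockQuote?wsdl",
            "http://example.com/about-us.html",
            "http://example.com/services/soap/"]
frontier.sort(key=wsdl_score, reverse=True)   # crawl the most promising URIs first
print(frontier)
```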

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Web archives have contained the cultural history of the web for many years, but they still have limited capabilities for access. Most web archiving research has focused on crawling and preservation activities, with little focus on delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all of the users' needs. In this dissertation, we focus on access methods for archived web data that enable users, third-party developers, researchers, and others to gain knowledge from web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services, ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in web archives. The third level is the URI level, which focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. At this level, we define the web archive by the characteristics of its corpus, building Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.
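
    As a concrete illustration of the archive-level access methods such a framework builds on, the sketch below fetches a Memento TimeMap from the Internet Archive's Wayback Machine; the endpoint path is the commonly documented one and is an assumption here, not part of ArcSys.

```python
# Illustrative Memento TimeMap lookup against the Internet Archive's Wayback
# Machine (a standard archive access method; the endpoint path used here is
# an assumption based on common documentation, not part of ArcSys).
import requests

def list_mementos(uri):
    """Fetch a link-format TimeMap and return the memento URIs it lists."""
    timemap_url = "https://web.archive.org/web/timemap/link/" + uri
    resp = requests.get(timemap_url, timeout=30)
    resp.raise_for_status()
    mementos = []
    for raw in resp.text.splitlines():
        line = raw.strip()
        if line.startswith("<") and 'rel="memento"' in line:
            mementos.append(line[1:line.index(">")])
    return mementos

print(len(list_mementos("http://example.com/")), "mementos found")
```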

    Web modelling for web warehouse design

    Doctoral thesis in Informatics (Informatics Engineering), presented to the Universidade de Lisboa through the Faculdade de Ciências, 2007. Users require applications to help them obtain knowledge from the web. However, the specific characteristics of web data make it difficult to create these applications. One possible solution to facilitate this task is to extract information from the web, transform it, and load it into a Web Warehouse, which provides uniform access methods for automatic processing of the data. Web Warehousing is conceptually similar to the Data Warehousing approaches used to integrate relational information from databases. However, the structure of the web is very dynamic and cannot be controlled by the Warehouse designers. Web models frequently do not reflect the current state of the web, so Web Warehouses must be redesigned at a late stage of development. These changes have high costs and may jeopardize entire projects. This thesis addresses the problem of modelling the web and its influence on the design of Web Warehouses. A model of a web portion was derived and, based on it, a Web Warehouse prototype was designed. The prototype was validated in several real-usage scenarios. The obtained results show that web modelling is a fundamental step of the web data integration process. Fundação para Computação Científica Nacional (FCCN); LaSIGE-Laboratório de Sistemas Informáticos de Grande Escala; Fundação para a Ciência e Tecnologia (FCT) (SFRH/BD/11062/2002).

    Ideas Matchmaking for Supporting Innovators and Entrepreneurs

    In this paper we present a system able to crawl content from the Web related to entrepreneurship and technology, to be matched with ideas proposed by users on the Innovvoice platform. We argue that such a service is a valuable component of an ideabator platform, supporting innovators and potential entrepreneurs.
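
    One common way to realize such matching is to compare crawled documents and user ideas in a shared vector space; the TF-IDF and cosine-similarity sketch below is an assumption for illustration, not necessarily the technique used in the Innovvoice platform.

```python
# Illustrative matching sketch: rank crawled web content against a user idea
# by TF-IDF cosine similarity. An assumed technique for illustration, not
# necessarily the one used in the Innovvoice platform; the texts are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

crawled_docs = [
    "New startup accelerator opens applications for fintech founders",
    "Survey of recent advances in battery technology for electric vehicles",
    "How to validate a mobile app idea with a landing-page experiment",
]
idea = "A mobile app that helps first-time founders validate startup ideas"

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(crawled_docs)
idea_vec = vectorizer.transform([idea])
scores = cosine_similarity(idea_vec, doc_matrix).ravel()

for doc, score in sorted(zip(crawled_docs, scores), key=lambda p: -p[1]):
    print(f"{score:.2f}  {doc}")
```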

    Expanding the Usage of Web Archives by Recommending Archived Webpages Using Only the URI

    Web archives are a window to view past versions of webpages. When a user requests a webpage on the live Web, such as http://tripadvisor.com/where_to_travel/, the webpage may not be found, which results in a HyperText Transfer Protocol (HTTP) 404 response. The user may then search for the webpage in a Web archive, such as the Internet Archive. Unfortunately, if this page has never been archived, the user will not be able to view the page, nor will the user gain any information on other webpages in the archive that have similar content, such as the archived webpage http://classy-travel.net. Similarly, if the user requests the webpage http://hokiesports.com/football/ from the Internet Archive, the user will only find the requested webpage and will not gain any information on other webpages in the archive that have similar content, such as the archived webpage http://techsideline.com. In this research, we build a model for selecting and ranking possible recommended webpages at a Web archive. This enhances both HTTP 404 responses and HTTP 200 responses by surfacing webpages in the archive that the user may not know existed. First, we detect semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and content similarity. We measure the performance of each step using different techniques, including calculating the F1 measure for different tokenization methods and for the classification. We tested the model using human evaluation to determine whether we could classify and find recommendations for a sample of requests from the Internet Archive's Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, which is much higher than "do not agree" and "I do not know". This indicates that reviewers are more likely to agree with the recommendations when the full categorization is selected. However, when selecting only the first level, reviewers agreed with only 25.5% of the recommendations. This indicates that deep-level categorization improves the performance of finding relevant recommendations.
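
    A minimal sketch of the first step described above, detecting semantics in the requested URI by tokenizing its host and path; the dissertation evaluates several tokenization methods, and this simple delimiter-based version is only a placeholder.

```python
# Illustrative URI tokenization, the first step of the recommendation pipeline
# sketched above. The dissertation compares several tokenization methods; this
# simple delimiter-based version is only a placeholder.
import re
from urllib.parse import urlsplit

def tokenize_uri(uri):
    """Split a URI into lowercase word tokens from its host and path."""
    parts = urlsplit(uri if "://" in uri else "http://" + uri)
    text = parts.hostname + " " + parts.path
    tokens = re.split(r"[\W_]+", text.lower())
    stop = {"www", "com", "org", "net", "http", "https", "html", "htm", ""}
    return [t for t in tokens if t not in stop]

print(tokenize_uri("http://hokiesports.com/football/"))
# ['hokiesports', 'football']  -> tokens that can then be classified against an ontology
```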
