
    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into web data extraction, conducted within the context of blog preservation. It reviews theoretical advances and practical developments in implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities for extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report then proposes a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

    Ranking XPaths for extracting search result records

    Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm that automatically determines XPath expressions for extracting SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that a useful XPath is determined for 85% of the tested search result pages. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
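
    The reuse step described in this abstract can be illustrated with a minimal sketch. It does not reproduce the paper's ranking algorithm: the XPath is simply assumed to be already known, and the expression, the page markup, and the function name `extract_srrs` are all hypothetical. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which only parses well-formed markup; real search result pages would likely need a tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath, assumed to have been determined once for this page template.
SRR_XPATH = ".//div[@class='result']"


def extract_srrs(page_markup, xpath=SRR_XPATH):
    """Apply one XPath to a page and return its search result records."""
    root = ET.fromstring(page_markup)  # requires well-formed markup
    records = []
    for node in root.findall(xpath):
        link = node.find("a")
        snippet = node.find("p")
        records.append({
            "title": (link.text or "").strip() if link is not None else "",
            "url": link.get("href", "") if link is not None else "",
            "snippet": (snippet.text or "").strip() if snippet is not None else "",
        })
    return records


# Two invented pages that share the same template.
PAGE_ONE = """<html><body>
<div class="nav">Home</div>
<div class="result"><a href="http://example.org/a">First hit</a><p>Snippet A</p></div>
<div class="result"><a href="http://example.org/b">Second hit</a><p>Snippet B</p></div>
</body></html>"""

PAGE_TWO = """<html><body>
<div class="result"><a href="http://example.org/c">Third hit</a><p>Snippet C</p></div>
<div class="ad">Sponsored</div>
</body></html>"""

if __name__ == "__main__":
    # The same expression is reused on both pages without re-analysis.
    for page in (PAGE_ONE, PAGE_TWO):
        for record in extract_srrs(page):
            print(record["title"], record["url"])
```

    The point of the sketch is only that the expression is computed once and then applied unchanged to any page built from the same template, so navigation and advertisement blocks are skipped without rerunning the analysis.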

    Proceedings of the First International Workshop on Mashup Personal Learning Environments

    Wild, F., Kalz, M., & Palmér, M. (Eds.) (2008). Proceedings of the First International Workshop on Mashup Personal Learning Environments (MUPPLE08). September 17, 2008, Maastricht, The Netherlands: CEUR Workshop Proceedings, ISSN 1613-0073. Available at http://ceur-ws.org/Vol-388. The work on this publication has been sponsored by the TENCompetence Integrated Project (funded by the European Commission's 6th Framework Programme, priority IST/Technology Enhanced Learning. Contract 027087 [http://www.tencompetence.org]) and partly sponsored by the LTfLL project (funded by the European Commission's 7th Framework Programme, priority ISCT. Contract 212578 [http://www.ltfll-project.org

    Development and Implementation of a Strategy for the Syndication of Linked Data (Entwicklung und Realisierung einer Strategie zur Syndikation von Linked Data)

    The publication of structured data on the Linked Data Web has increased sharply. For many Internet users, however, this data is not usable, because access is not possible without knowledge of a programming language. The web application LESS was developed as a template engine for Linked Data sources and SPARQL results. On the platform, templates can be created, published, and reused by other users. Users are supported in developing templates, so that working with Semantic Web data is possible even with limited technical knowledge. LESS enables the integration of data from different sources as well as the generation of text-based output formats such as RSS, XML, and HTML with JavaScript. Templates can be created for different resources and then easily integrated into existing web applications and websites. To improve the reliability and speed of the Linked Data Web, the data used is cached in LESS for a certain period of time, or for the case in which the data source fails.
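
    The caching behavior mentioned at the end of this abstract (keep data for a set time, and fall back on it when the source is down) might look roughly like the following sketch. This is not the LESS implementation, which the abstract does not describe; the class name `FallbackCache`, its parameters, and the injected `fetch` function are assumptions made for illustration.

```python
import time


class FallbackCache:
    """Time-limited cache that serves stale data when the source fails."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable: url -> data
        self._ttl = ttl_seconds    # how long an entry counts as fresh
        self._clock = clock        # injectable so the behavior can be tested
        self._entries = {}         # url -> (time stored, data)

    def get(self, url):
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]       # still fresh: skip the remote source
        try:
            data = self._fetch(url)
        except Exception:
            if cached is not None:
                return cached[1]   # source failed: serve the stale copy
            raise                  # nothing cached: surface the failure
        self._entries[url] = (now, data)
        return data


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for a Linked Data or SPARQL request.
        return "result for " + url

    cache = FallbackCache(fake_fetch, ttl_seconds=300)
    print(cache.get("http://example.org/resource"))
```

    Keeping expired entries instead of evicting them is what allows the fallback: freshness only decides whether the source is contacted, not whether the cached copy may be returned.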


    We live in a world full of news: every second, something is happening somewhere. Major IT companies have put considerable effort into helping users find and follow news through technologies such as Really Simple Syndication (RSS), online news portals, and SMS subscriptions. Each of these has problems. For example, RSS could send more than 12,000 emails in less than 8 hours. SMS lets the user follow only one news source, and the user has to pay for it. Newzkiosk offers the main headlines from the most famous news sources (CNN, BBC, NYT, Reuters). RSS and the WordPress platform will allow Newzkiosk to offer users the most compatible news portal on the web, with fewer advertisements for greater user satisfaction and convenience.
