
    Exploiting the social and semantic web for guided web archiving

    The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions, and other events. In this paper we present the ARCOMEM architecture, which uses semantic information such as entities, topics, and events, complemented with information from the Social Web, to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease access and to allow retrieval based on conditions that involve high-level concepts. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-33290-6_47.
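
    A minimal sketch of the guided-crawling idea described above, assuming a simple keyword-overlap relevance measure in place of real entity and topic extraction; the campaign specification, the function names, and the fetch/extract_links placeholders are illustrative assumptions, not components of the ARCOMEM system:

        # Illustrative sketch of semantically guided crawl prioritisation.
        # NOT the ARCOMEM implementation; entity/topic extraction is reduced
        # to keyword matching purely to show the control flow.
        import heapq

        CAMPAIGN = {
            "entities": {"court", "ruling", "defendant"},   # assumed campaign spec
            "topics": {"trial", "verdict", "appeal"},
        }

        def score(anchor_text: str) -> float:
            """Relevance of a candidate link to the campaign's entities/topics."""
            tokens = set(anchor_text.lower().split())
            hits = tokens & (CAMPAIGN["entities"] | CAMPAIGN["topics"])
            return len(hits) / (len(tokens) or 1)

        def crawl(seeds, fetch, extract_links, max_pages=100):
            """Priority-driven crawl: highest-scoring links are fetched first.
            `fetch` and `extract_links` are placeholders for real components."""
            frontier = [(-1.0, url) for url in seeds]     # max-heap via negation
            heapq.heapify(frontier)
            seen, archive = set(seeds), []
            while frontier and len(archive) < max_pages:
                _, url = heapq.heappop(frontier)
                page = fetch(url)
                # enrich the archived record with the matched concepts
                archive.append({"url": url, "entities": sorted(
                    set(page.lower().split()) & CAMPAIGN["entities"])})
                for link_url, anchor in extract_links(page):
                    if link_url not in seen:
                        seen.add(link_url)
                        heapq.heappush(frontier, (-score(anchor), link_url))
            return archive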

    Intelligent content acquisition in Web archiving

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content the pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose an efficient unsupervised Web crawling system, ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction that allows a user to extract the data hidden behind Web forms.
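
    The two-phase approach can be pictured with a compressed sketch. The URL-template pattern model and the content-length heuristic for "valuable content" below are simplifying assumptions made for illustration only; they are not ACEBot's actual navigation-pattern model:

        # Minimal sketch of a structure-driven, two-phase crawl in the spirit of
        # ACEBot. Navigation patterns are approximated by URL path templates and
        # "value" by text length; both are assumptions, not the real system.
        import re
        from collections import defaultdict

        def pattern_of(url: str) -> str:
            """Collapse a URL path into a template, e.g. /post/123 -> /post/*."""
            path = url.split("//", 1)[-1].split("/", 1)[-1]
            return "/" + re.sub(r"\d+", "*", path)

        def offline_phase(sampled_pages, min_text_len=500):
            """Learn which navigation patterns tend to lead to valuable content."""
            observations = defaultdict(list)
            for url, text in sampled_pages:          # a small sample, not the full site
                observations[pattern_of(url)].append(len(text) >= min_text_len)
            # keep patterns where most sampled pages carried substantial content
            return {p for p, hits in observations.items() if sum(hits) / len(hits) > 0.5}

        def online_phase(frontier, good_patterns, fetch):
            """Massive download restricted to the learned navigation patterns."""
            return [fetch(url) for url in frontier if pattern_of(url) in good_patterns]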

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events, such as the Ebola outbreak or the Ukraine crisis, on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects, and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of collecting fresh and relevant Web and Social Web content for a topic of interest through the seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawl. Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015.
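
    A rough sketch of the integration idea, assuming a frontier priority that combines topical relevance with an exponential freshness decay; the weights, class names, and helpers are assumptions for illustration and do not reflect iCrawl's actual API:

        # Sketch of feeding one crawl frontier from both page outlinks and a
        # social media stream, with a priority that decays as content ages.
        import time, heapq

        def priority(relevance: float, posted_at: float, half_life=3600.0) -> float:
            """Combine topical relevance with freshness (exponential decay)."""
            age = max(0.0, time.time() - posted_at)
            return relevance * 0.5 ** (age / half_life)

        class IntegratedFrontier:
            def __init__(self):
                self._heap, self._seen = [], set()

            def add(self, url: str, relevance: float, posted_at: float):
                if url not in self._seen:
                    self._seen.add(url)
                    heapq.heappush(self._heap, (-priority(relevance, posted_at), url))

            def next_url(self):
                return heapq.heappop(self._heap)[1] if self._heap else None

        # Usage idea: one worker pushes URLs found in fresh social media posts,
        # another pushes outlinks of fetched pages; both draw the next URL from
        # the same frontier, so fresh social content steers the whole crawl.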

    Should I Care about Your Opinion? Detection of Opinion Interestingness and Dynamics in Social Media

    In this paper, we describe a set of reusable text processing components for extracting opinionated information from social media, rating it for interestingness, and for detecting opinion events. We have developed applications in GATE to extract named entities, terms and events and to detect opinions about them, which are then used as the starting point for opinion event detection. The opinions are then aggregated over larger sections of text, to give some overall sentiment about topics and documents, and also some degree of information about interestingness based on opinion diversity. We go beyond traditional opinion mining techniques in a number of ways: by focusing on specific opinion-target extraction related to key terms and events, by examining and dealing with a number of specific linguistic phenomena, by analysing and visualising opinion dynamics over time, and by aggregating the opinions in different ways for a more flexible view of the information contained in the documents.
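
    One way to picture the aggregation and interestingness steps is the sketch below, which uses the entropy of the polarity distribution as a stand-in measure of opinion diversity; the GATE components described in the paper are not reproduced here, and this scoring choice is an assumption made for illustration:

        # Sketch of aggregating per-mention opinions into topic-level sentiment
        # and a diversity-based "interestingness" score (entropy of polarities).
        import math
        from collections import Counter

        def aggregate(opinions):
            """opinions: list of (topic, polarity) pairs, polarity in {-1, 0, +1}."""
            by_topic = {}
            for topic, polarity in opinions:
                by_topic.setdefault(topic, []).append(polarity)
            summary = {}
            for topic, pols in by_topic.items():
                counts = Counter(pols)
                total = sum(counts.values())
                # overall sentiment: mean polarity; interestingness: entropy of the
                # polarity distribution (higher when opinions are more diverse)
                entropy = -sum((c / total) * math.log2(c / total)
                               for c in counts.values())
                summary[topic] = {"sentiment": sum(pols) / total,
                                  "interestingness": entropy}
            return summary

        print(aggregate([("verdict", 1), ("verdict", -1), ("verdict", 1),
                         ("appeal", 1), ("appeal", 1)]))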

    Archiving digital knowledge – a theoretical and practical study based on the example of the National Archives of Estonia

    The electronic version of the dissertation does not contain the publications. Digital preservation of knowledge is a very broad and complex research area, and many aspects are still open for research. According to the literature, the accessibility and usability of digital information have been investigated more thoroughly than the long-term comprehensibility of important digital information. Although there are remedies (e.g. emulation and migration) for mitigating the risks related to accessibility and usability, how to guarantee the understandability of archived information is still an open research question. Understanding digital information first requires a representation of the archived information, which a user can then interpret and understand. However, it is a not-so-well-known fact that digital information has no fixed representation before some software is involved. For example, if we create a document in WordPad and open the same file in the Hex Editor Neo software, we will see the binary representation, which is also correct but not suitable for human users, as humans are not used to interpreting binary codes. When we open that file in Notepad++, we can see the structure of the RTF coding. Again, this is a correct interpretation of the file, but not understandable for the ordinary user, as it shows the technical view of the file format structure.
    When we open that file in Microsoft Word 2010 or LibreOffice Writer, we will notice some differences, although the original bits are the same and no errors are displayed by the software. Thus, all representations are technologically correct, and no errors are shown to the user when the file is opened. It is important to emphasise that in some cases even the original representation may not be understandable to the users. Therefore, it is important to know who the main users of the archives are and to ensure that the archived objects are independently understandable to that community over the long term. This dissertation therefore researches the meaningful use of digital objects by taking into account the designated users' knowledge and the Open Archival Information System (OAIS) model. The research also includes several practical experimental projects at the National Archives of Estonia which test important parts of the theoretical work.
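
    A tiny sketch of the point about representations: the same stored bytes yield different but equally "correct" views depending on the software that interprets them. The RTF snippet below is invented for illustration:

        # The same bits, three views: what the user sees depends entirely on
        # the interpreting software, not on the stored bytes themselves.
        data = rb"{\rtf1\ansi {\b Hello}, archive!}"   # made-up RTF fragment

        # 1. A hex editor's view: raw bytes
        print(" ".join(f"{b:02x}" for b in data[:16]), "...")

        # 2. A plain-text editor's view: the RTF control words
        print(data.decode("ascii"))

        # 3. A word processor would interpret the control words and render
        #    "Hello, archive!" with "Hello" in bold -- a third, different view.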

    Digital Preservation Services: State of the Art Analysis

    Research report funded by the DC-NET project (European Commission, FP7). The report gives an overview of the state of the art in service provision for digital preservation and curation, focusing on the areas where gaps need to be bridged between e-Infrastructures and efficient, forward-looking digital preservation services. Based on a desktop study and a rapid analysis of some 190 currently available tools and services for digital preservation, it provides a high-level view of the range of instruments currently on offer to support various functions within a preservation system.

    BlogForever: D3.1 Preservation Strategy Report

    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what exactly it is that we are trying to preserve. We then review past and present work and highlight why current practices in web archiving do not adequately address the needs of weblog preservation. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards in developing a weblog repository, b) we provide an automated approach to identifying significant properties of weblog content that builds on the notion of communities, and discuss how this affects previous strategies, and c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.

    Exploring entity-centric methods in the UK Government Web Archive

    Being able to explore large digital collections effectively is of interest to academics and practitioners alike. The need to go beyond keyword-driven functionality to features that support exploration and discovery is widely recognised. In addition, providers are seeking to support more diverse groups of users with varying information needs and tasks. Increasing amounts of cultural heritage material are being stored in web archives, which present unique challenges as a form of digital cultural heritage. This paper describes a collaboration between the University of Sheffield and the UK National Archives to investigate entity-based methods for exploring the UK Government Web Archive.