19 research outputs found

    Exploiting multimedia in creating and analysing multimedia Web archives

    No full text
    The data contained on the web and the social web are inherently multimedia, consisting of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by humankind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU-funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.

    Exploiting the social and semantic web for guided web archiving

    Get PDF
    The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions, and other events. In this paper we present the ARCOMEM architecture, which uses semantic information such as entities, topics, and events, complemented with information from the social Web, to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease access and allow retrieval based on conditions that involve high-level concepts. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-33290-6_47.
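
    As a minimal sketch of this kind of semantically guided crawling (not the ARCOMEM implementation), the snippet below scores frontier URLs by how many campaign entities, topics, or events occur in their anchor text and fetches the highest-scoring URLs first; the target terms, class names, and URLs are hypothetical.

```python
import heapq

# Hypothetical campaign specification: entities/topics/events of interest.
TARGET_TERMS = {"arcomem", "court decision", "election", "protest"}

def semantic_score(anchor_text: str) -> int:
    """Count how many target terms occur in the anchor/context text."""
    text = anchor_text.lower()
    return sum(1 for term in TARGET_TERMS if term in text)

class SemanticFrontier:
    """Crawl frontier that pops the most semantically relevant URL first."""

    def __init__(self):
        self._heap = []        # entries: (negative score, insertion order, url)
        self._seen = set()
        self._counter = 0

    def add(self, url: str, anchor_text: str) -> None:
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-semantic_score(anchor_text), self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = SemanticFrontier()
frontier.add("http://example.org/news/court-decision", "court decision on data retention")
frontier.add("http://example.org/random", "unrelated page")
print(frontier.pop())  # the court-decision URL is fetched first
```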

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Full text link
    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content about current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the collection of fresh and relevant Web and Social Web content for a topic of interest through the seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawl. Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015.
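
    As a minimal sketch of the integration idea (not the iCrawl system itself), the snippet below extracts URLs from a stream of social media posts, boosts them by freshness, and merges them into the same priority queue as links found by the focused crawler; the post format and decay parameters are assumptions.

```python
import heapq
import re
import time

URL_RE = re.compile(r"https?://\S+")

def freshness_boost(posted_at: float, half_life: float = 3600.0) -> float:
    """Exponentially decaying boost for recently shared URLs (seconds)."""
    age = max(0.0, time.time() - posted_at)
    return 0.5 ** (age / half_life)

def extract_social_urls(posts):
    """Yield (url, priority) pairs from a stream of social media posts.

    `posts` is assumed to be an iterable of dicts with 'text' and
    'timestamp' fields; real platform APIs differ.
    """
    for post in posts:
        for url in URL_RE.findall(post["text"]):
            yield url, 1.0 + freshness_boost(post["timestamp"])

def merge_into_frontier(frontier, social_urls, crawled_links):
    """Push both social and crawled URLs into one priority queue."""
    for url, priority in social_urls:
        heapq.heappush(frontier, (-priority, url))
    for url in crawled_links:
        heapq.heappush(frontier, (-1.0, url))   # baseline priority
    return frontier

posts = [{"text": "Live updates: http://example.org/outbreak", "timestamp": time.time() - 600}]
frontier = merge_into_frontier([], extract_social_urls(posts), ["http://example.org/archive"])
print(heapq.heappop(frontier)[1])  # the fresh social URL comes out first
```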

    Acquisition des contenus intelligents dans l'archivage du Web (Intelligent content acquisition in Web archiving)

    Get PDF
    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content the pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
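
    As a rough illustration of the two-phase, structure-driven idea (not the actual AAH or ACEBot code), the sketch below groups sampled URLs into crude path patterns, ranks the patterns offline by how much textual content their pages yield, and then filters the online frontier to the most valuable patterns; all names and the sample data are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def navigation_pattern(url: str) -> str:
    """Reduce a URL to a crude path template, e.g. /post/123 -> /post/*."""
    parts = urlparse(url).path.strip("/").split("/")
    return "/" + "/".join("*" if p.isdigit() else p for p in parts)

def rank_patterns(sampled_pages):
    """Offline phase: score each pattern by average extracted content length.

    `sampled_pages` is assumed to be a list of (url, extracted_text) pairs
    obtained from a small exploratory crawl.
    """
    totals = defaultdict(lambda: [0, 0])  # pattern -> [total chars, page count]
    for url, text in sampled_pages:
        pat = navigation_pattern(url)
        totals[pat][0] += len(text)
        totals[pat][1] += 1
    return sorted(totals, key=lambda p: totals[p][0] / totals[p][1], reverse=True)

def online_crawl_order(frontier_urls, top_patterns):
    """Online phase: keep only URLs matching the valuable navigation patterns."""
    keep = set(top_patterns)
    return [u for u in frontier_urls if navigation_pattern(u) in keep]

sample = [("http://blog.example/post/1", "long article text " * 50),
          ("http://blog.example/tag/misc", "tag listing")]
best = rank_patterns(sample)[:1]
print(online_crawl_order(["http://blog.example/post/2",
                          "http://blog.example/tag/other"], best))
```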

    Should I Care about Your Opinion? : Detection of Opinion Interestingness and Dynamics in Social Media

    Get PDF
    In this paper, we describe a set of reusable text processing components for extracting opinionated information from social media, rating it for interestingness, and detecting opinion events. We have developed applications in GATE to extract named entities, terms and events and to detect opinions about them, which are then used as the starting point for opinion event detection. The opinions are aggregated over larger sections of text to give an overall sentiment about topics and documents, as well as a degree of information about interestingness based on opinion diversity. We go beyond traditional opinion mining techniques in a number of ways: by focusing on specific opinion-target extraction related to key terms and events, by examining and dealing with a number of specific linguistic phenomena, by analysing and visualising opinion dynamics over time, and by aggregating the opinions in different ways for a more flexible view of the information contained in the documents.
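
    As a small illustration of the aggregation step (not the GATE components themselves), the sketch below averages opinion scores for an overall sentiment and approximates interestingness by the entropy of the positive/neutral/negative distribution; the scores and thresholds are made up.

```python
import math
from collections import Counter

def aggregate_sentiment(opinions):
    """Average polarity of individual opinion scores in [-1, 1]."""
    return sum(opinions) / len(opinions) if opinions else 0.0

def interestingness(opinions):
    """Entropy of the positive/neutral/negative label distribution.

    Strong disagreement (high diversity) gives high interestingness;
    unanimous opinion gives low interestingness.
    """
    labels = Counter("pos" if o > 0.1 else "neg" if o < -0.1 else "neu"
                     for o in opinions)
    total = sum(labels.values())
    probs = [c / total for c in labels.values()]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical opinion scores extracted for one target entity.
scores = [0.8, -0.6, 0.4, -0.7, 0.0]
print(round(aggregate_sentiment(scores), 2))   # overall sentiment ~ -0.02
print(round(interestingness(scores), 2))       # high diversity ~ 1.52
```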

    BlogForever: D3.1 Preservation Strategy Report

    Get PDF
    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what exactly it is that we are trying to preserve. We then review past and present work and highlight why current practices in web archiving do not adequately address the needs of weblog preservation. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards when developing a weblog repository, b) we provide an automated approach to identifying significant properties of weblog content that uses the notion of communities, and discuss how this affects previous strategies, and c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.

    Digitaalse teadmuse arhiveerimine – teoreetilis-praktiline uurimistöö Rahvusarhiivi näitel (Archiving digital knowledge: a theoretical and practical study based on the National Archives of Estonia)

    Get PDF
    The electronic version of the dissertation does not contain the publications. Digital preservation of knowledge is a very broad and complex research area, and many aspects are still open for research. According to the literature, the accessibility and usability of digital information have been investigated more than the long-term comprehensibility of important digital information. Although there are remedies (e.g. emulation and migration) for mitigating the risks related to accessibility and usability, how to guarantee the understandability of archived information is still an open research question. Understanding digital information first requires a representation of the archived information, which a user can then interpret and understand. However, it is a not-so-well-known fact that digital information has no fixed representation before some software is involved. For example, if we create a document in WordPad and open the same file in Hex Editor Neo, we see the binary representation, which is also correct but not suitable for human users, as humans are not used to interpreting binary codes. When we open that file in Notepad++, we can see the structure of the RTF coding. Again, this is a correct interpretation of the file, but not understandable for the ordinary user, as it shows the technical view of the file format structure. When we open that file in Microsoft Word 2010 or LibreOffice Writer, we will notice some differences between the two representations, although the original bits are the same and no errors are displayed by the software. Thus, all representations are technologically correct and no errors are shown to the user when opening this file. It is important to emphasise that in some cases even the original representation may not be understandable to the users. Therefore, it is important to know who the main users of the archives are and to ensure that the archived objects are independently understandable to that community over the long term. This dissertation therefore researches the meaningful use of digital objects by taking into account the designated users' knowledge and the Open Archival Information System (OAIS) model. The research also includes several practical experimental projects at the National Archives of Estonia which test some important parts of the theoretical work.
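
    The point that stored bits have no single fixed representation until software interprets them can be shown with a short sketch: the same RTF bytes are printed as a hex editor, a plain-text editor, and (approximately) a word processor would present them; the file content is a made-up example.

```python
import re

# One and the same sequence of stored bits, three ways of presenting it.
rtf_bytes = br"{\rtf1\ansi Hello, archive!\par}"

# 1) A hex editor's view: raw bytes rendered as hexadecimal.
print(rtf_bytes.hex(" "))

# 2) A plain-text editor's view: the RTF control words become visible.
print(rtf_bytes.decode("ascii"))

# 3) Roughly what a word processor aims to show: only the readable content
#    (a crude strip of RTF control words, for illustration only).
print(re.sub(r"[{}]|\\[a-z]+\d* ?", "", rtf_bytes.decode("ascii")).strip())
```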

    Bootstrapping Web Archive Collections From Micro-Collections in Social Media

    Get PDF
    In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high-quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but this ability comes at a cost: collecting these seeds is time consuming. The result is a shortage of curators, a lack of Web archive collections for various important news events, and a need for an automatic system for generating seeds. We investigate the problem of generating seed URIs automatically and explore the state of the art in collection building and seed selection. Attempts at generating seeds automatically have mostly relied on scraping Web or social media Search Engine Result Pages (SERPs). In this work, we introduce a novel source for generating seeds: URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share narratives about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. We call these posts Micro-collections, whether shared on Reddit or Twitter, and we consider them an important source of seeds, because the effort taken to create Micro-collections is an indication of editorial activity and a demonstration of domain expertise. We therefore propose a model for generating seeds from Micro-collections. We begin by introducing a simple vocabulary, called post class, for describing social media posts across different platforms, and extract seeds from the Micro-collections post class. We further propose Quality Proxies for seeds by extending the idea of collection comparison to evaluation, and present our Micro-collection/Quality Proxy (MCQP) framework for bootstrapping Web archive collections from Micro-collections in social media.
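
    As a minimal sketch of the seed-generation idea (not the MCQP framework itself), the snippet below pulls URIs out of a threaded conversation, normalises and deduplicates them, and ranks them by how many distinct posts shared them, a crude stand-in for a quality proxy; the post structure is hypothetical and real platform APIs differ.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"https?://\S+")

def normalise(url: str) -> str:
    """Drop query strings and fragments so duplicate seeds collapse."""
    parts = urlsplit(url.rstrip(").,"))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def seeds_from_thread(posts):
    """Extract candidate seed URIs from a threaded conversation.

    `posts` is assumed to be a list of dicts with 'author' and 'text'
    fields; real platform APIs (Reddit, Twitter) differ.
    """
    support = defaultdict(set)               # seed -> indices of contributing posts
    for i, post in enumerate(posts):
        for url in URL_RE.findall(post["text"]):
            support[normalise(url)].add(i)
    # Rank seeds by how many separate posts shared them (a crude quality proxy).
    return sorted(support, key=lambda s: len(support[s]), reverse=True)

thread = [
    {"author": "a", "text": "Timeline of the event: http://news.example/story1?ref=tw"},
    {"author": "b", "text": "Also covered here http://news.example/story1 and http://other.example/analysis"},
]
print(seeds_from_thread(thread))
```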

    A Framework for More Effective Dark Web Marketplace Investigations

    Get PDF
    The success of the Silk Road has prompted the growth of many Dark Web marketplaces. This exponential growth has provided criminal enterprises with new outlets to sell illicit items. Thus, the Dark Web has generated great interest from academics and governments who have sought to unveil the identities of participants in these highly lucrative, yet illegal, marketplaces. Traditional Web scraping methodologies and investigative techniques have proven inept at unmasking these marketplace participants. This research provides an analytical framework for automating Dark Web scraping and analysis with free tools found on the World Wide Web. Using a case-study marketplace, we successfully tested a Web crawler, developed using AppleScript, to retrieve the account information of thousands of vendors and their respective marketplace listings. The paper details why AppleScript was the most viable and efficient method for scraping Dark Web marketplaces. The results from our case study validate the efficacy of our proposed analytical framework, which has relevance for academics studying this growing phenomenon and for investigators examining criminal activity on the Dark Web.
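
    The scraping automation itself is generic; as a rough illustration (in Python rather than the AppleScript approach the paper describes), the sketch below pages through a hypothetical vendor listing via a local Tor SOCKS proxy and collects vendor names. The URL, markup pattern, and proxy settings are placeholders, and requests needs the requests[socks] extra (PySocks) for SOCKS support.

```python
import re
import time
import requests

# Hypothetical marketplace listing URL and markup; a real investigation would
# adapt these to the site under study. Routing through a local Tor client on
# port 9050 requires the `requests[socks]` extra.
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}
LISTING_URL = "http://marketplaceexample.onion/vendors?page={page}"
VENDOR_RE = re.compile(r'class="vendor-name">([^<]+)<')  # hypothetical markup

def scrape_vendor_names(max_pages: int = 3, delay: float = 5.0):
    """Politely page through vendor listings and collect vendor names."""
    vendors = set()
    for page in range(1, max_pages + 1):
        resp = requests.get(LISTING_URL.format(page=page),
                            proxies=TOR_PROXIES, timeout=60)
        vendors.update(VENDOR_RE.findall(resp.text))
        time.sleep(delay)                     # rate-limit the crawl
    return sorted(vendors)

if __name__ == "__main__":
    print(scrape_vendor_names())
```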