19 research outputs found

    Exploiting multimedia in creating and analysing multimedia Web archives

    No full text
    The data contained on the web and the social web are inherently multimedia, consisting of a mixture of textual, visual and audio modalities. Community memories embodied on the web and social web contain a rich mixture of data from these modalities. In many ways, the web is the greatest resource ever created by humankind. However, due to the dynamic and distributed nature of the web, its content changes, appears and disappears on a daily basis. Web archiving provides a way of capturing snapshots of (parts of) the web for preservation and future analysis. This paper provides an overview of techniques we have developed within the context of the EU-funded ARCOMEM (ARchiving COmmunity MEMories) project to allow multimedia web content to be leveraged during the archival process and for post-archival analysis. Through a set of use cases, we explore several practical applications of multimedia analytics within the realm of web archiving, web archive analysis and multimedia data on the web in general.

    Exploiting the social and semantic web for guided web archiving

    Get PDF
    The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions, and other events. In this paper we present the ARCOMEM architecture, which uses semantic information such as entities, topics, and events, complemented with information from the social Web, to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease access and allow retrieval based on conditions that involve high-level concepts. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-33290-6_47.
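
    As a minimal sketch of this kind of semantically guided crawling (not the ARCOMEM implementation), the snippet below scores frontier URLs by how many campaign entities, topics, or events occur in their anchor text and fetches the highest-scoring URLs first; the target terms, class names, and URLs are hypothetical.

```python
import heapq

# Hypothetical campaign specification: entities/topics/events of interest.
TARGET_TERMS = {"arcomem", "court decision", "election", "protest"}

def semantic_score(anchor_text: str) -> int:
    """Count how many target terms occur in the anchor/context text."""
    text = anchor_text.lower()
    return sum(1 for term in TARGET_TERMS if term in text)

class SemanticFrontier:
    """Crawl frontier that pops the most semantically relevant URL first."""

    def __init__(self):
        self._heap = []        # entries: (negative score, insertion order, url)
        self._seen = set()
        self._counter = 0

    def add(self, url: str, anchor_text: str) -> None:
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-semantic_score(anchor_text), self._counter, url))
        self._counter += 1

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

frontier = SemanticFrontier()
frontier.add("http://example.org/news/court-decision", "court decision on data retention")
frontier.add("http://example.org/random", "unrelated page")
print(frontier.pop())  # the court-decision URL is fetched first
```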

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Full text link
    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content about current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the collection of fresh and relevant Web and Social Web content for a topic of interest through the seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawl. Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015.
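
    As a minimal sketch of the integration idea (not the iCrawl system itself), the snippet below extracts URLs from a stream of social media posts, boosts them by freshness, and merges them into the same priority queue as links found by the focused crawler; the post format and decay parameters are assumptions.

```python
import heapq
import re
import time

URL_RE = re.compile(r"https?://\S+")

def freshness_boost(posted_at: float, half_life: float = 3600.0) -> float:
    """Exponentially decaying boost for recently shared URLs (seconds)."""
    age = max(0.0, time.time() - posted_at)
    return 0.5 ** (age / half_life)

def extract_social_urls(posts):
    """Yield (url, priority) pairs from a stream of social media posts.

    `posts` is assumed to be an iterable of dicts with 'text' and
    'timestamp' fields; real platform APIs differ.
    """
    for post in posts:
        for url in URL_RE.findall(post["text"]):
            yield url, 1.0 + freshness_boost(post["timestamp"])

def merge_into_frontier(frontier, social_urls, crawled_links):
    """Push both social and crawled URLs into one priority queue."""
    for url, priority in social_urls:
        heapq.heappush(frontier, (-priority, url))
    for url in crawled_links:
        heapq.heappush(frontier, (-1.0, url))   # baseline priority
    return frontier

posts = [{"text": "Live updates: http://example.org/outbreak", "timestamp": time.time() - 600}]
frontier = merge_into_frontier([], extract_social_urls(posts), ["http://example.org/archive"])
print(heapq.heappop(frontier)[1])  # the fresh social URL comes out first
```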

    Acquisition des contenus intelligents dans l'archivage du Web (Intelligent content acquisition in Web archiving)

    Get PDF
    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content the pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
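
    As a rough illustration of the two-phase, structure-driven idea (not the actual AAH or ACEBot code), the sketch below groups sampled URLs into crude path patterns, ranks the patterns offline by how much textual content their pages yield, and then filters the online frontier to the most valuable patterns; all names and the sample data are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

def navigation_pattern(url: str) -> str:
    """Reduce a URL to a crude path template, e.g. /post/123 -> /post/*."""
    parts = urlparse(url).path.strip("/").split("/")
    return "/" + "/".join("*" if p.isdigit() else p for p in parts)

def rank_patterns(sampled_pages):
    """Offline phase: score each pattern by average extracted content length.

    `sampled_pages` is assumed to be a list of (url, extracted_text) pairs
    obtained from a small exploratory crawl.
    """
    totals = defaultdict(lambda: [0, 0])  # pattern -> [total chars, page count]
    for url, text in sampled_pages:
        pat = navigation_pattern(url)
        totals[pat][0] += len(text)
        totals[pat][1] += 1
    return sorted(totals, key=lambda p: totals[p][0] / totals[p][1], reverse=True)

def online_crawl_order(frontier_urls, top_patterns):
    """Online phase: keep only URLs matching the valuable navigation patterns."""
    keep = set(top_patterns)
    return [u for u in frontier_urls if navigation_pattern(u) in keep]

sample = [("http://blog.example/post/1", "long article text " * 50),
          ("http://blog.example/tag/misc", "tag listing")]
best = rank_patterns(sample)[:1]
print(online_crawl_order(["http://blog.example/post/2",
                          "http://blog.example/tag/other"], best))
```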

    Should I Care about Your Opinion? : Detection of Opinion Interestingness and Dynamics in Social Media

    Get PDF
    In this paper, we describe a set of reusable text processing components for extracting opinionated information from social media, rating it for interestingness, and detecting opinion events. We have developed applications in GATE to extract named entities, terms and events and to detect opinions about them, which are then used as the starting point for opinion event detection. The opinions are aggregated over larger sections of text to give an overall sentiment about topics and documents, as well as a degree of information about interestingness based on opinion diversity. We go beyond traditional opinion mining techniques in a number of ways: by focusing on specific opinion-target extraction related to key terms and events, by examining and dealing with a number of specific linguistic phenomena, by analysing and visualising opinion dynamics over time, and by aggregating the opinions in different ways for a more flexible view of the information contained in the documents.
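
    As a small illustration of the aggregation step (not the GATE components themselves), the sketch below averages opinion scores for an overall sentiment and approximates interestingness by the entropy of the positive/neutral/negative distribution; the scores and thresholds are made up.

```python
import math
from collections import Counter

def aggregate_sentiment(opinions):
    """Average polarity of individual opinion scores in [-1, 1]."""
    return sum(opinions) / len(opinions) if opinions else 0.0

def interestingness(opinions):
    """Entropy of the positive/neutral/negative label distribution.

    Strong disagreement (high diversity) gives high interestingness;
    unanimous opinion gives low interestingness.
    """
    labels = Counter("pos" if o > 0.1 else "neg" if o < -0.1 else "neu"
                     for o in opinions)
    total = sum(labels.values())
    probs = [c / total for c in labels.values()]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical opinion scores extracted for one target entity.
scores = [0.8, -0.6, 0.4, -0.7, 0.0]
print(round(aggregate_sentiment(scores), 2))   # overall sentiment ~ -0.02
print(round(interestingness(scores), 2))       # high diversity ~ 1.52
```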

    BlogForever: D3.1 Preservation Strategy Report

    Get PDF
    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what exactly it is that we are trying to preserve. We then review past and present work and highlight why current practices in web archiving do not adequately address the needs of weblog preservation. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards when developing a weblog repository, b) we provide an automated approach to identifying significant properties of weblog content that uses the notion of communities, and discuss how this affects previous strategies, and c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.

    Digitaalse teadmuse arhiveerimine – teoreetilis-praktiline uurimistöö Rahvusarhiivi näitel (Archiving digital knowledge: a theoretical and practical study based on the National Archives of Estonia)

    Get PDF
    The electronic version of the dissertation does not contain the publications. Digital preservation of knowledge is a very broad and complex research area, and many aspects are still open for research. According to the literature, the accessibility and usability of digital information have been investigated more than the long-term comprehensibility of important digital information. Although there are remedies (e.g. emulation and migration) for mitigating the risks related to accessibility and usability, how to guarantee the understandability of archived information is still an open research question. Understanding digital information first requires a representation of the archived information, which a user can then interpret and understand. However, it is a not-so-well-known fact that digital information has no fixed representation before some software is involved. For example, if we create a document in WordPad and open the same file in Hex Editor Neo, we see the binary representation, which is also correct but not suitable for human users, as humans are not used to interpreting binary codes. When we open that file in Notepad++, we can see the structure of the RTF coding. Again, this is a correct interpretation of the file, but not understandable for the ordinary user, as it shows the technical view of the file format structure. When we open that file in Microsoft Word 2010 or LibreOffice Writer, we will notice some differences between the two representations, although the original bits are the same and no errors are displayed by the software. Thus, all representations are technologically correct and no errors are shown to the user when opening this file. It is important to emphasise that in some cases even the original representation may not be understandable to the users. Therefore, it is important to know who the main users of the archives are and to ensure that the archived objects are independently understandable to that community over the long term. This dissertation therefore researches the meaningful use of digital objects by taking into account the designated users' knowledge and the Open Archival Information System (OAIS) model. The research also includes several practical experimental projects at the National Archives of Estonia which test some important parts of the theoretical work.
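
    The point that stored bits have no single fixed representation until software interprets them can be shown with a short sketch: the same RTF bytes are printed as a hex editor, a plain-text editor, and (approximately) a word processor would present them; the file content is a made-up example.

```python
import re

# One and the same sequence of stored bits, three ways of presenting it.
rtf_bytes = br"{\rtf1\ansi Hello, archive!\par}"

# 1) A hex editor's view: raw bytes rendered as hexadecimal.
print(rtf_bytes.hex(" "))

# 2) A plain-text editor's view: the RTF control words become visible.
print(rtf_bytes.decode("ascii"))

# 3) Roughly what a word processor aims to show: only the readable content
#    (a crude strip of RTF control words, for illustration only).
print(re.sub(r"[{}]|\\[a-z]+\d* ?", "", rtf_bytes.decode("ascii")).strip())
```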

    Bootstrapping Web Archive Collections From Micro-Collections in Social Media

    Get PDF
    In a Web plagued by disappearing resources, Web archive collections provide a valuable means of preserving Web resources important to the study of past events. These archived collections start with seed URIs (Uniform Resource Identifiers) hand-selected by curators. Curators produce high-quality seeds by removing non-relevant URIs and adding URIs from credible and authoritative sources, but this ability comes at a cost: collecting these seeds is time consuming. The result is a shortage of curators, a lack of Web archive collections for various important news events, and a need for an automatic system for generating seeds. We investigate the problem of generating seed URIs automatically and explore the state of the art in collection building and seed selection. Attempts at generating seeds automatically have mostly relied on scraping Web or social media Search Engine Result Pages (SERPs). In this work, we introduce a novel source for generating seeds: URIs in the threaded conversations of social media posts created by single or multiple users. Users on social media sites routinely create and share narratives about news events consisting of hand-selected URIs of news stories, tweets, videos, etc. We call these posts Micro-collections, whether shared on Reddit or Twitter, and we consider them an important source of seeds, because the effort taken to create Micro-collections is an indication of editorial activity and a demonstration of domain expertise. We therefore propose a model for generating seeds from Micro-collections. We begin by introducing a simple vocabulary, called post class, for describing social media posts across different platforms, and extract seeds from the Micro-collections post class. We further propose Quality Proxies for seeds by extending the idea of collection comparison to evaluation, and present our Micro-collection/Quality Proxy (MCQP) framework for bootstrapping Web archive collections from Micro-collections in social media.
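
    As a minimal sketch of the seed-generation idea (not the MCQP framework itself), the snippet below pulls URIs out of a threaded conversation, normalises and deduplicates them, and ranks them by how many distinct posts shared them, a crude stand-in for a quality proxy; the post structure is hypothetical and real platform APIs differ.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit

URL_RE = re.compile(r"https?://\S+")

def normalise(url: str) -> str:
    """Drop query strings and fragments so duplicate seeds collapse."""
    parts = urlsplit(url.rstrip(").,"))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def seeds_from_thread(posts):
    """Extract candidate seed URIs from a threaded conversation.

    `posts` is assumed to be a list of dicts with 'author' and 'text'
    fields; real platform APIs (Reddit, Twitter) differ.
    """
    support = defaultdict(set)               # seed -> indices of contributing posts
    for i, post in enumerate(posts):
        for url in URL_RE.findall(post["text"]):
            support[normalise(url)].add(i)
    # Rank seeds by how many separate posts shared them (a crude quality proxy).
    return sorted(support, key=lambda s: len(support[s]), reverse=True)

thread = [
    {"author": "a", "text": "Timeline of the event: http://news.example/story1?ref=tw"},
    {"author": "b", "text": "Also covered here http://news.example/story1 and http://other.example/analysis"},
]
print(seeds_from_thread(thread))
```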

    A Framework for More Effective Dark Web Marketplace Investigations

    Get PDF
    The success of the Silk Road has prompted the growth of many Dark Web marketplaces. This exponential growth has provided criminal enterprises with new outlets to sell illicit items. Thus, the Dark Web has generated great interest from academics and governments who have sought to unveil the identities of participants in these highly lucrative, yet illegal, marketplaces. Traditional Web scraping methodologies and investigative techniques have proven inept at unmasking these marketplace participants. This research provides an analytical framework for automating Dark Web scraping and analysis with free tools found on the World Wide Web. Using a case-study marketplace, we successfully tested a Web crawler, developed using AppleScript, to retrieve the account information of thousands of vendors and their respective marketplace listings. The paper details why AppleScript was the most viable and efficient method for scraping Dark Web marketplaces. The results from our case study validate the efficacy of our proposed analytical framework, which has relevance for academics studying this growing phenomenon and for investigators examining criminal activity on the Dark Web.
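
    The scraping automation itself is generic; as a rough illustration (in Python rather than the AppleScript approach the paper describes), the sketch below pages through a hypothetical vendor listing via a local Tor SOCKS proxy and collects vendor names. The URL, markup pattern, and proxy settings are placeholders, and requests needs the requests[socks] extra (PySocks) for SOCKS support.

```python
import re
import time
import requests

# Hypothetical marketplace listing URL and markup; a real investigation would
# adapt these to the site under study. Routing through a local Tor client on
# port 9050 requires the `requests[socks]` extra.
TOR_PROXIES = {"http": "socks5h://127.0.0.1:9050",
               "https": "socks5h://127.0.0.1:9050"}
LISTING_URL = "http://marketplaceexample.onion/vendors?page={page}"
VENDOR_RE = re.compile(r'class="vendor-name">([^<]+)<')  # hypothetical markup

def scrape_vendor_names(max_pages: int = 3, delay: float = 5.0):
    """Politely page through vendor listings and collect vendor names."""
    vendors = set()
    for page in range(1, max_pages + 1):
        resp = requests.get(LISTING_URL.format(page=page),
                            proxies=TOR_PROXIES, timeout=60)
        vendors.update(VENDOR_RE.findall(resp.text))
        time.sleep(delay)                     # rate-limit the crawl
    return sorted(vendors)

if __name__ == "__main__":
    print(scrape_vendor_names())
```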