
    The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

    Collections of Web documents about specific topics are needed in many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs). These are also used to describe the expected topic of the collection. The choice of seed URLs influences the quality of the resulting collection and requires considerable expertise. In this demonstration we present the iCrawl Wizard, a tool that assists users in defining focused crawls efficiently and semi-automatically. Our tool uses major search engines and Social Media APIs as well as information extraction techniques to find seed URLs and a semantic description of the crawl intent. Using the iCrawl Wizard, even non-expert users can create semantic specifications for focused crawlers interactively and efficiently. Comment: Published in the Proceedings of the European Conference on Information Retrieval (ECIR) 201
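
    A minimal sketch of the kind of seed-discovery step described above, assuming a generic `search_web` helper as a stand-in for whichever search engine or Social Media API is available; it is illustrative only and not the iCrawl implementation:

```python
# Hypothetical sketch of semi-automatic seed URL discovery: query a search or
# Social Media API for a topic and propose one candidate seed per frequent host.
# `search_web` is a placeholder, not part of the iCrawl Wizard.
from collections import Counter
from urllib.parse import urlparse


def search_web(query: str, limit: int = 50) -> list:
    """Placeholder wrapper around a search engine or Social Media API."""
    raise NotImplementedError("plug in a real search API here")


def suggest_seeds(topic_query: str, max_seeds: int = 10) -> list:
    """Rank result hosts by frequency and propose one seed URL per host."""
    urls = search_web(topic_query)
    first_url_per_host = {}
    host_counts = Counter()
    for url in urls:
        host = urlparse(url).netloc
        host_counts[host] += 1
        first_url_per_host.setdefault(host, url)  # keep the first URL seen per host
    return [first_url_per_host[host] for host, _ in host_counts.most_common(max_seeds)]
```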

    Analyzing web archives through topic and event focused sub-collections

    Web archives capture the history of the Web and are therefore an important source for studying how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose a research methodology based on extracting and studying sub-collections of the archive focused on specific topics and events. We discuss the opportunities and challenges of this approach and suggest a framework for creating such sub-collections.
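
    As an illustration of the proposed methodology, the sketch below filters a CDX index of an archive by a date window and by topic keywords in the captured URLs; the field layout and the keyword matching are simplifying assumptions, not the framework from the paper:

```python
# Sketch of extracting a topic/event focused sub-collection by filtering a
# CDX index of the archive. Assumes the common "urlkey timestamp original ..."
# field order; adjust the field positions for your own index layout.
def select_subcollection(cdx_path: str, keywords: list, start: str, end: str) -> list:
    """Return "timestamp original-URL" entries for captures in [start, end]
    (YYYYMMDD) whose URL contains any of the topic keywords."""
    selected = []
    with open(cdx_path, encoding="utf-8") as cdx:
        for line in cdx:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip malformed index lines
            timestamp, original = fields[1], fields[2]
            if start <= timestamp[:8] <= end and any(k in original.lower() for k in keywords):
                selected.append(f"{timestamp} {original}")
    return selected


# e.g. captures about the Fukushima nuclear disaster in spring 2011:
# select_subcollection("archive.cdx", ["fukushima", "nuclear"], "20110301", "20110601")
```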

    Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

    Long-term Web archives comprise Web documents gathered over long time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, annotating the entire archive content at this scale is often infeasible. The most efficient way to access the documents within Web archives is through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of named entity extraction from the URLs in the Popular German Web dataset and analyse, for archived URLs from 1,444 popular domains covering 2000 to 2012, the proportion to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provides a good starting point for efficiently annotating large-scale collections of Web documents.
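
    A minimal sketch of applying named entity extraction to archived URLs, using spaCy as an off-the-shelf NER component; this is an assumption for illustration, not necessarily the extractor used in the paper:

```python
# Sketch of named entity extraction over archived URLs. spaCy stands in for
# the NER component; for German URLs a German pipeline such as de_core_news_sm
# would be the more natural choice.
import re
from urllib.parse import urlparse, unquote

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any NER-capable pipeline works


def url_to_text(url: str) -> str:
    """Turn the path and query of a URL into a whitespace-separated pseudo-sentence."""
    parsed = urlparse(url)
    raw = unquote(parsed.path + " " + parsed.query)
    return " ".join(t for t in re.split(r"[/\-_+.?&=]+", raw) if t)


def entities_from_url(url: str) -> list:
    """Return (entity text, entity label) pairs extracted from the URL tokens only."""
    doc = nlp(url_to_text(url))
    return [(ent.text, ent.label_) for ent in doc.ents]


# entities_from_url("http://example.com/news/angela-merkel-visits-berlin")
# may yield person and location entities without fetching the page itself.
```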

    MOSAiC-ACA and AFLUX - Arctic airborne campaigns characterizing the exit area of MOSAiC

    Two airborne field campaigns focusing on observations of Arctic mixed-phase clouds and boundary layer processes, and their role with respect to Arctic amplification, have been carried out in spring 2019 and late summer 2020 over the Fram Strait northwest of Svalbard. The latter campaign was closely connected to the Multidisciplinary drifting Observatory for the Study of Arctic Climate (MOSAiC) expedition. Comprehensive data sets of the cloudy Arctic atmosphere have been collected by operating remote sensing instruments, in situ probes, instruments for the measurement of turbulent fluxes of energy and momentum, and dropsondes on board the AWI research aircraft Polar 5. In total, 24 flights with 111 flight hours have been performed over open ocean, the marginal sea ice zone, and sea ice. The data sets follow documented methods and quality assurance procedures and are suited for studies of Arctic mixed-phase clouds and their transformation processes, for studies with a focus on Arctic boundary layer processes, and for satellite validation applications.

    Extracting event-centric document collections from large-scale web archives

    Web archives created by the Internet Archive (IA) (https://archive.org), national libraries and other archiving services contain large amounts of information collected over a period of more than twenty years. These archives constitute a valuable source for research in many disciplines, including the digital humanities and the historical sciences, by offering a unique possibility to look into past events and their representation on the Web. Most Web archive services aim to capture the entire Web (IA) or national top-level domains and are therefore broad in their scope, diverse regarding the topics they contain and the time intervals they cover. Due to their large size and broad scope, it is difficult for interested researchers to locate relevant information in the archives, as search facilities are very limited. Many users are more interested in studying smaller and topically coherent event-centric collections of documents contained in a Web archive [1,2]. Such collections can reflect specific events such as elections or natural disasters, e.g. the Fukushima nuclear disaster (2011) or the German federal elections.
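
    One way to locate candidate captures for such an event-centric collection is sketched below against the Internet Archive's public CDX API; the URL prefix and date window are placeholders for whatever event is being studied:

```python
# Sketch of locating event-related captures through the Internet Archive's
# public CDX API (https://web.archive.org/cdx/search/cdx). The URL prefix and
# date window below are placeholders, not part of the cited work.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def event_captures(url_prefix: str, start: str, end: str, limit: int = 100) -> list:
    """Return capture records (as dicts) for url_prefix between start and end (YYYYMMDD)."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",  # all captured URLs under this prefix
        "from": start,
        "to": end,
        "output": "json",
        "limit": limit,
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=30).json()
    if not rows:
        return []
    header, records = rows[0], rows[1:]  # first row of the JSON output names the fields
    return [dict(zip(header, row)) for row in records]


# e.g. news captures around the Fukushima nuclear disaster:
# event_captures("bbc.co.uk/news", "20110311", "20110430")
```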

    The Past Web: exploring web archives (pre-print)

    This book provides practical information about web archives, offers inspiring examples for web archivists, raises new challenges, and shares recent research results about access methods for exploring information from the past preserved by web archives.

    Platform and App Histories: Assessing Source Availability in Web Archives and App Repositories

    In this chapter, we discuss the research opportunities for historical studies of apps and platforms by focusing on their distinctive characteristics and material traces. We demonstrate the value and explore the utility and breadth of web archives and software repositories for building corpora of archived platform and app sources. Platforms and apps notoriously resist archiving due to their ephemerality and continuous updates. As a consequence, their histories are being overwritten with each update rather than written and preserved. We present a method to assess the availability of archived web sources for social media platforms and apps across the leading web archives and app repositories. Additionally, we conduct a comparative source set availability analysis to establish how, and how well, various source sets are represented across web archives. Our preliminary results indicate that despite the challenges of social media and app archiving, many material traces of platforms and apps are in fact well preserved. We understand these contextual materials as important primary sources through which digital objects such as platforms and apps co-author their own “biographies” with web archives and software repositories.
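
    A sketch of one building block of such an availability assessment, checking candidate source URLs against the Wayback Machine availability API; other web archives and app repositories would need their own lookups, and the example URLs are illustrative only:

```python
# Sketch of checking platform/app source availability via the Wayback Machine
# availability API (https://archive.org/wayback/available). The example URLs
# are placeholders, not a source set from the chapter.
import requests

AVAILABILITY_API = "https://archive.org/wayback/available"


def closest_snapshot(url: str, timestamp: str = "") -> dict:
    """Return the closest archived snapshot record for url, or an empty dict."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # preferred capture date, YYYYMMDD
    data = requests.get(AVAILABILITY_API, params=params, timeout=30).json()
    return data.get("archived_snapshots", {}).get("closest") or {}


def availability_report(urls: list) -> dict:
    """Map each source URL to whether at least one snapshot is preserved."""
    return {url: bool(closest_snapshot(url)) for url in urls}


# availability_report(["https://www.instagram.com/about/", "https://apps.apple.com/"])
```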

    Analysing and Enriching Focused Semantic Web Archives for Parliament Applications

    The web and the social web play an increasingly important role as information sources for Members of Parliament and their assistants, journalists, political analysts and researchers. They provide important background information, such as reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets the effective exploration of political web archives. In this paper, we describe the semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results.