
    The iCrawl Wizard -- Supporting Interactive Focused Crawl Specification

    Collections of Web documents about specific topics are needed in many areas of current research. Focused crawling enables the creation of such collections on demand. Current focused crawlers require the user to manually specify starting points for the crawl (seed URLs). These are also used to describe the expected topic of the collection. The choice of seed URLs influences the quality of the resulting collection and requires considerable expertise. In this demonstration we present the iCrawl Wizard, a tool that assists users in defining focused crawls efficiently and semi-automatically. Our tool uses major search engines and Social Media APIs as well as information extraction techniques to find seed URLs and a semantic description of the crawl intent. Using the iCrawl Wizard, even non-expert users can create semantic specifications for focused crawlers interactively and efficiently. Comment: Published in the Proceedings of the European Conference on Information Retrieval (ECIR) 201
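
    A minimal sketch of the kind of seed-discovery step described above, assuming a generic `search_web` helper as a stand-in for whichever search engine or Social Media API is available; it is illustrative only and not the iCrawl implementation:

```python
# Hypothetical sketch of semi-automatic seed URL discovery: query a search or
# Social Media API for a topic and propose one candidate seed per frequent host.
# `search_web` is a placeholder, not part of the iCrawl Wizard.
from collections import Counter
from urllib.parse import urlparse


def search_web(query: str, limit: int = 50) -> list:
    """Placeholder wrapper around a search engine or Social Media API."""
    raise NotImplementedError("plug in a real search API here")


def suggest_seeds(topic_query: str, max_seeds: int = 10) -> list:
    """Rank result hosts by frequency and propose one seed URL per host."""
    urls = search_web(topic_query)
    first_url_per_host = {}
    host_counts = Counter()
    for url in urls:
        host = urlparse(url).netloc
        host_counts[host] += 1
        first_url_per_host.setdefault(host, url)  # keep the first URL seen per host
    return [first_url_per_host[host] for host, _ in host_counts.most_common(max_seeds)]
```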

    Analyzing web archives through topic and event focused sub-collections

    Web archives capture the history of the Web and are therefore an important source for studying how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose a research methodology based on extracting and studying sub-collections of the archive focused on specific topics and events. We discuss the opportunities and challenges of this approach and suggest a framework for creating such sub-collections.
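
    As an illustration of the proposed methodology, the sketch below filters a CDX index of an archive by a date window and by topic keywords in the captured URLs; the field layout and the keyword matching are simplifying assumptions, not the framework from the paper:

```python
# Sketch of extracting a topic/event focused sub-collection by filtering a
# CDX index of the archive. Assumes the common "urlkey timestamp original ..."
# field order; adjust the field positions for your own index layout.
def select_subcollection(cdx_path: str, keywords: list, start: str, end: str) -> list:
    """Return "timestamp original-URL" entries for captures in [start, end]
    (YYYYMMDD) whose URL contains any of the topic keywords."""
    selected = []
    with open(cdx_path, encoding="utf-8") as cdx:
        for line in cdx:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip malformed index lines
            timestamp, original = fields[1], fields[2]
            if start <= timestamp[:8] <= end and any(k in original.lower() for k in keywords):
                selected.append(f"{timestamp} {original}")
    return selected


# e.g. captures about the Fukushima nuclear disaster in spring 2011:
# select_subcollection("archive.cdx", ["fukushima", "nuclear"], "20110301", "20110601")
```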

    Semantic URL Analytics to Support Efficient Annotation of Large Scale Web Archives

    Long-term Web archives comprise Web documents gathered over long time periods and can easily reach hundreds of terabytes in size. Semantic annotations such as named entities can facilitate intelligent access to the Web archive data. However, annotating the entire archive content at this scale is often infeasible. The most efficient way to access the documents within Web archives is through their URLs, which are typically stored in dedicated index files. The URLs of the archived Web documents can contain semantic information and offer an efficient way to obtain initial semantic annotations for the archived documents. In this paper, we analyse the applicability of semantic analysis techniques such as named entity extraction to the URLs in a Web archive. We evaluate the precision of named entity extraction from the URLs in the Popular German Web dataset and analyse, for archived URLs from 1,444 popular domains covering 2000 to 2012, the proportion to which these techniques are applicable. Our results demonstrate that named entity recognition can be successfully applied to a large number of URLs in our Web archive and provides a good starting point for efficiently annotating large-scale collections of Web documents.
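
    A minimal sketch of applying named entity extraction to archived URLs, using spaCy as an off-the-shelf NER component; this is an assumption for illustration, not necessarily the extractor used in the paper:

```python
# Sketch of named entity extraction over archived URLs. spaCy stands in for
# the NER component; for German URLs a German pipeline such as de_core_news_sm
# would be the more natural choice.
import re
from urllib.parse import urlparse, unquote

import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any NER-capable pipeline works


def url_to_text(url: str) -> str:
    """Turn the path and query of a URL into a whitespace-separated pseudo-sentence."""
    parsed = urlparse(url)
    raw = unquote(parsed.path + " " + parsed.query)
    return " ".join(t for t in re.split(r"[/\-_+.?&=]+", raw) if t)


def entities_from_url(url: str) -> list:
    """Return (entity text, entity label) pairs extracted from the URL tokens only."""
    doc = nlp(url_to_text(url))
    return [(ent.text, ent.label_) for ent in doc.ents]


# entities_from_url("http://example.com/news/angela-merkel-visits-berlin")
# may yield person and location entities without fetching the page itself.
```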

    MOSAiC-ACA and AFLUX - Arctic airborne campaigns characterizing the exit area of MOSAiC

    Two airborne field campaigns focusing on observations of Arctic mixed-phase clouds and boundary layer processes, and their role with respect to Arctic amplification, have been carried out in spring 2019 and late summer 2020 over the Fram Strait northwest of Svalbard. The latter campaign was closely connected to the Multidisciplinary drifting Observatory for the Study of Arctic Climate (MOSAiC) expedition. Comprehensive data sets of the cloudy Arctic atmosphere have been collected by operating remote sensing instruments, in situ probes, instruments for the measurement of turbulent fluxes of energy and momentum, and dropsondes on board the AWI research aircraft Polar 5. In total, 24 flights with 111 flight hours have been performed over open ocean, the marginal sea ice zone, and sea ice. The data sets follow documented methods and quality assurance procedures and are suited for studies of Arctic mixed-phase clouds and their transformation processes, for studies with a focus on Arctic boundary layer processes, and for satellite validation applications.

    Extracting event-centric document collections from large-scale web archives

    Web archives created by the Internet Archive (IA) (https://archive.org), national libraries and other archiving services contain large amounts of information collected over a period of more than twenty years. These archives constitute a valuable source for research in many disciplines, including the digital humanities and the historical sciences, by offering a unique possibility to look into past events and their representation on the Web. Most Web archive services aim to capture the entire Web (IA) or national top-level domains and are therefore broad in their scope, diverse regarding the topics they contain and the time intervals they cover. Due to their large size and broad scope, it is difficult for interested researchers to locate relevant information in the archives, as search facilities are very limited. Many users are more interested in studying smaller and topically coherent event-centric collections of documents contained in a Web archive [1,2]. Such collections can reflect specific events such as elections or natural disasters, e.g. the Fukushima nuclear disaster (2011) or the German federal elections.
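
    One way to locate candidate captures for such an event-centric collection is sketched below against the Internet Archive's public CDX API; the URL prefix and date window are placeholders for whatever event is being studied:

```python
# Sketch of locating event-related captures through the Internet Archive's
# public CDX API (https://web.archive.org/cdx/search/cdx). The URL prefix and
# date window below are placeholders, not part of the cited work.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def event_captures(url_prefix: str, start: str, end: str, limit: int = 100) -> list:
    """Return capture records (as dicts) for url_prefix between start and end (YYYYMMDD)."""
    params = {
        "url": url_prefix,
        "matchType": "prefix",  # all captured URLs under this prefix
        "from": start,
        "to": end,
        "output": "json",
        "limit": limit,
    }
    rows = requests.get(CDX_ENDPOINT, params=params, timeout=30).json()
    if not rows:
        return []
    header, records = rows[0], rows[1:]  # first row of the JSON output names the fields
    return [dict(zip(header, row)) for row in records]


# e.g. news captures around the Fukushima nuclear disaster:
# event_captures("bbc.co.uk/news", "20110311", "20110430")
```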

    The Past Web: exploring web archives (pre-print)

    This book provides practical information about web archives, offers inspiring examples for web archivists, raises new challenges, and shares recent research results about access methods for exploring information from the past preserved by web archives.

    Platform and App Histories: Assessing Source Availability in Web Archives and App Repositories

    In this chapter, we discuss the research opportunities for historical studies of apps and platforms by focusing on their distinctive characteristics and material traces. We demonstrate the value and explore the utility and breadth of web archives and software repositories for building corpora of archived platform and app sources. Platforms and apps notoriously resist archiving due to their ephemerality and continuous updates. As a consequence, their histories are being overwritten with each update rather than written and preserved. We present a method to assess the availability of archived web sources for social media platforms and apps across the leading web archives and app repositories. Additionally, we conduct a comparative source set availability analysis to establish how, and how well, various source sets are represented across web archives. Our preliminary results indicate that despite the challenges of social media and app archiving, many material traces of platforms and apps are in fact well preserved. We understand these contextual materials as important primary sources through which digital objects such as platforms and apps co-author their own “biographies” with web archives and software repositories.
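
    A sketch of one building block of such an availability assessment, checking candidate source URLs against the Wayback Machine availability API; other web archives and app repositories would need their own lookups, and the example URLs are illustrative only:

```python
# Sketch of checking platform/app source availability via the Wayback Machine
# availability API (https://archive.org/wayback/available). The example URLs
# are placeholders, not a source set from the chapter.
import requests

AVAILABILITY_API = "https://archive.org/wayback/available"


def closest_snapshot(url: str, timestamp: str = "") -> dict:
    """Return the closest archived snapshot record for url, or an empty dict."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # preferred capture date, YYYYMMDD
    data = requests.get(AVAILABILITY_API, params=params, timeout=30).json()
    return data.get("archived_snapshots", {}).get("closest") or {}


def availability_report(urls: list) -> dict:
    """Map each source URL to whether at least one snapshot is preserved."""
    return {url: bool(closest_snapshot(url)) for url in urls}


# availability_report(["https://www.instagram.com/about/", "https://apps.apple.com/"])
```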

    Analysing and Enriching Focused Semantic Web Archives for Parliament Applications

    The web and the social web play an increasingly important role as information sources for Members of Parliament and their assistants, journalists, political analysts and researchers. They provide important background information, such as reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets the effective exploration of political web archives. In this paper, we describe the semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results.