
    Power to the people: end-user building of digital library collections

    Naturally, digital library systems focus principally on the reader: the consumer of the material that constitutes the library. In contrast, this paper describes an interface that makes it easy for people to build their own library collections. Collections may be built and served locally from the user's own web server, or (given appropriate permissions) remotely on a shared digital library host. End users can easily build new collections styled after existing ones from material on the Web, from their local files, or both, and collections can be updated and new ones brought online at any time. The interface, which is intended for non-professional end users, is modeled after widely used commercial software installation packages. Lest one quail at the prospect of end users building their own collections on a shared system, we also describe an interface for the administrative user who is responsible for maintaining a digital library installation.
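    The paper describes a wizard-style interface rather than an API, but the workflow it automates is easy to illustrate. Below is a minimal Python sketch of that workflow under stated assumptions: every name, path, and the configuration-as-template idea are hypothetical stand-ins, not the system's actual interface.

```python
# Hypothetical sketch of the end-user workflow the paper's interface automates:
# gather source documents (local files and/or web pages), style the new
# collection after an existing one, then record the built collection.
import pathlib
import urllib.request

def gather_sources(local_dir: str, web_urls: list[str]) -> list[bytes]:
    """Collect raw documents from the user's files and from the Web."""
    docs = []
    root = pathlib.Path(local_dir).expanduser()
    if root.is_dir():
        docs += [p.read_bytes() for p in root.rglob("*.html")]
    for url in web_urls:
        with urllib.request.urlopen(url) as resp:
            docs.append(resp.read())
    return docs

# "Styled after an existing collection": reuse its configuration as a template.
template = {"indexes": ["text", "title"], "browse_by": ["title", "date"]}
docs = gather_sources("~/my_papers", ["https://example.org/"])
collection = {"name": "my-collection", "config": template, "size": len(docs)}
print(f"built {collection['name']} with {collection['size']} documents")
```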

    Focused Crawl of Web Archives to Build Event Collections

    Event collections are frequently built by crawling the live web on the basis of seed URIs nominated by human experts. Focused web crawling is a technique in which the crawler is guided by reference content pertaining to the event. Given the dynamic nature of the web and the pace with which topics evolve, the timing of the crawl is a concern for both approaches. We investigate the feasibility of performing focused crawls on the archived web. By utilizing the Memento infrastructure, we obtain resources from 22 web archives that contribute to building event collections. We create collections on four events and compare the relevance of their resources to collections built from crawling the live web as well as to a manually curated collection. Our results show that focused crawling on the archived web can be done and indeed results in highly relevant collections, especially for events that happened further in the past.
    Comment: accepted for publication at WebSci 201
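    The Memento lookup at the heart of this approach can be sketched briefly. The Time Travel aggregator's TimeMap endpoint used below is a real public service; the keyword-overlap relevance score and the threshold are hypothetical stand-ins for the paper's actual focus measure.

```python
# Minimal sketch of one focused-crawl step over the archived web via Memento:
# look up archived copies (mementos) of a seed URI, fetch them, and keep the
# ones whose content matches the event's reference terms.
import re
import urllib.request

TIMEMAP = "http://timetravel.mementoweb.org/timemap/link/"

def mementos(uri: str) -> list[str]:
    """List archived copies of a URI across web archives."""
    with urllib.request.urlopen(TIMEMAP + uri) as resp:
        body = resp.read().decode("utf-8", errors="replace")
    # Each memento line looks like: <archive-url>; rel="memento"; datetime="..."
    return re.findall(r'<([^>]+)>;\s*rel="[^"]*memento', body)

def relevance(text: str, reference_terms: set[str]) -> float:
    """Toy focus score: fraction of reference terms present in the page."""
    words = set(re.findall(r"\w+", text.lower()))
    return len(words & reference_terms) / len(reference_terms)

frontier = ["http://example.org/"]          # seed URIs nominated by experts
reference = {"ebola", "outbreak", "virus"}  # terms from reference content
for uri in frontier:
    for m in mementos(uri)[:5]:
        with urllib.request.urlopen(m) as resp:
            page = resp.read().decode("utf-8", errors="replace")
        if relevance(page, reference) > 0.3:   # hypothetical threshold
            print("keep", m)
```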

    Moving towards Adaptive Search

    Information retrieval has become very popular over the last decade with the advent of the Web. Nevertheless, searching the Web is very different from searching smaller, often more structured collections such as intranets and digital libraries. Such collections are the focus of the recently started AutoAdapt project. The project seeks to aid user search by providing well-structured domain knowledge to assist query modification and navigation. There are two challenges: acquiring the domain knowledge and adapting it automatically to the specific interests of the user community. At the workshop we will demonstrate an implemented prototype that serves as a starting point on the way to truly adaptive search.
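    As an illustration of the idea (not the AutoAdapt implementation), the sketch below expands a query from a toy domain-knowledge graph and then adapts the expansion weights from user click feedback, so the knowledge drifts toward the community's interests. Every name and constant is invented.

```python
# Illustrative sketch: domain-knowledge-driven query expansion whose
# expansion weights adapt to what the user community actually clicks.
from collections import defaultdict

related = {"search": ["retrieval", "query"], "archive": ["repository"]}
weight = defaultdict(lambda: 0.5)   # expansion confidence, adapted by use

def expand(query: str) -> list[tuple[str, float]]:
    """Return the query terms plus weighted expansions from the graph."""
    terms = [(t, 1.0) for t in query.split()]
    for t in query.split():
        terms += [(r, weight[(t, r)]) for r in related.get(t, [])]
    return terms

def feedback(term: str, expansion: str, clicked: bool) -> None:
    """Nudge an expansion's weight up or down from user behaviour."""
    w = weight[(term, expansion)]
    weight[(term, expansion)] = min(1.0, w + 0.1) if clicked else max(0.0, w - 0.1)

print(expand("search archive"))
feedback("search", "retrieval", clicked=True)   # "retrieval" gains weight
```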

    Using semantic indexing to improve searching performance in web archives

    The sheer volume of electronic documents being published on the Web can be overwhelming for users if search is not properly addressed. This problem is particularly acute in archives and repositories containing large collections of web resources or, more precisely, web pages and other web objects. With the existing search capabilities of web archives, results can be compromised by the size of the data, content heterogeneity, and changes in scientific terminology and meaning. In the course of this research, we will explore whether semantic web technologies, particularly ontology-based annotation and retrieval, can improve the precision of search results in multi-disciplinary web archives.
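    A minimal sketch of ontology-based annotation and retrieval, using toy data: documents are indexed by canonical concepts rather than surface terms, so a query phrased in outdated terminology still matches documents that use the newer term. The vocabulary mapping below is a placeholder for a real ontology.

```python
# Sketch: annotate documents with canonical ontology concepts, then search
# by concept, bridging changes in scientific terminology over time.
concept_of = {            # term -> canonical ontology concept (toy data)
    "consumption": "tuberculosis",
    "tb": "tuberculosis",
    "tuberculosis": "tuberculosis",
}

docs = {
    1: "early papers speak of consumption among factory workers",
    2: "modern surveillance of tuberculosis in urban areas",
}

# Semantic index: concept -> set of documents annotated with it.
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for term, concept in concept_of.items():
        if term in text:
            index.setdefault(concept, set()).add(doc_id)

def search(term: str) -> set[int]:
    """Resolve the query term to a concept, then retrieve by concept."""
    return index.get(concept_of.get(term.lower(), term.lower()), set())

print(search("TB"))   # finds both documents despite differing terminology
```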

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content about current events, such as the Ebola outbreak or the Ukraine crisis, on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects, and therefore cannot achieve thematically coherent and fresh Web collections. Social media in particular provide a rich source of fresh content that is not exploited by state-of-the-art focused crawlers. In this paper we address the issue of collecting fresh and relevant Web and Social Web content for a topic of interest through the seamless integration of Web and Social Media crawling in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawl.
    Comment: published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 2015
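    The integration idea can be sketched as a single crawl frontier whose priorities blend topical relevance with the freshness of social-media posts. This is an illustrative sketch, not iCrawl's code; the scoring formula and decay constant are invented.

```python
# Sketch: one priority queue mixes ordinary web links with URLs harvested
# from a social-media stream; fresh posts boost priority because they
# signal pages that are currently relevant to the event.
import heapq
import time

frontier: list[tuple[float, str]] = []   # (negative priority, url)

def enqueue(url: str, topical: float, posted_at: float | None = None) -> None:
    """Priority = topical relevance, boosted by social-post freshness."""
    freshness = 0.0
    if posted_at is not None:                  # URL came from the social stream
        age_h = (time.time() - posted_at) / 3600
        freshness = max(0.0, 1.0 - age_h / 24)  # decays over a day (toy value)
    heapq.heappush(frontier, (-(topical + freshness), url))

enqueue("https://example.org/background", topical=0.6)
enqueue("https://example.org/breaking", topical=0.5, posted_at=time.time())
print(heapq.heappop(frontier)[1])   # the socially fresh URL is crawled first
```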

    Enriching Existing Test Collections with OXPath

    Extending TREC-style test collections by incorporating external resources is a time-consuming and challenging task. Making use of freely available web data requires technical skills to work with APIs or to create a web-scraping program specifically tailored to the task at hand. We present a lightweight alternative that employs the web data extraction language OXPath to harvest data from web resources and add it to an existing test collection. We demonstrate this by creating an extended version of GIRT4, called GIRT4-XT, with additional metadata fields harvested via OXPath from the social sciences portal Sowiport. This allows the collection to be re-used for other evaluation purposes, such as bibliometrics-enhanced retrieval. The demonstrated method can be applied to a variety of similar scenarios and is not limited to extending existing collections; it can also be used to create completely new ones with little effort.
    Comment: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017
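    OXPath itself is a declarative extraction language; for a self-contained illustration in Python, the sketch below does the same kind of harvesting with plain XPath via the third-party lxml library. The URL, XPath expressions, and field names are hypothetical placeholders for the real portal's page structure, not the paper's actual extraction program.

```python
# Sketch: harvest extra metadata fields from a record page and attach them
# to an existing test-collection record (the GIRT4-XT idea, approximated
# with plain XPath instead of OXPath).
import urllib.request
import lxml.html

def harvest(record_url: str) -> dict[str, list[str]]:
    """Scrape additional metadata fields from a (hypothetical) record page."""
    with urllib.request.urlopen(record_url) as resp:
        page = lxml.html.fromstring(resp.read())
    return {
        "references": page.xpath("//li[@class='reference']/text()"),
        "keywords": page.xpath("//span[@class='keyword']/text()"),
    }

record = {"id": "GIRT4-0001"}                      # existing collection entry
record.update(harvest("https://example.org/record/GIRT4-0001"))
print(record)
```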

    ArchivePress: A Really Simple Solution to Archiving Blog Content

    ArchivePress is a new technical solution for collecting and archiving content from blogs. Current solutions are commonly based on typical web archiving activities, whereby a crawler is configured to harvest a copy of the blog and return the copy to a web archive. This approach is perfectly acceptable if the requirement is that the site be presented as an integral whole. However, ArchivePress is based on the premise that blogs are a distinct class of web-based resource in which the post, not the page, is atomic, and certain properties, such as layouts and colours, are demonstrably superfluous for many (if not most) users. As a result, an approach that builds on the functionality provided by web feeds to capture only selected aspects of a blog offers more potential, particularly when institutions wish to develop collections of aggregated blog content from a range of different sources. The presentation will describe our research to develop such an approach, including work to define the significant properties of blogs, details of the technical development, and pilot collections against which the tool has been tested.
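    The feed-based approach can be sketched in a few lines: read the blog's web feed and store each post as an atomic record, discarding layout and styling. The sketch uses the third-party feedparser library; the feed URL, output layout, and stored fields are assumptions, not ArchivePress's actual design.

```python
# Sketch of feed-based blog archiving: the post, not the page, is the
# atomic unit, so each feed entry is stored as its own record.
import json
import pathlib
import feedparser

def archive_blog(feed_url: str, out_dir: str = "archive") -> int:
    """Fetch a blog's feed and write each post as a JSON record."""
    feed = feedparser.parse(feed_url)
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for i, entry in enumerate(feed.entries):
        post = {
            "title": entry.get("title", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", ""),
            "content": entry.get("summary", ""),
        }
        (out / f"post-{i:04d}.json").write_text(json.dumps(post, indent=2))
    return len(feed.entries)

print(archive_blog("https://example.org/blog/feed/"), "posts archived")
```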