Power to the people: end-user building of digital library collections
Naturally, digital library systems focus principally on the reader: the consumer of the material that constitutes the library. In contrast, this paper describes an interface that makes it easy for people to build their own library collections. Collections may be built and served locally from the user's own web server, or (given appropriate permissions) remotely on a shared digital library host. End users can easily build new collections styled after existing ones from material on the Web, from their local files, or both, and collections can be updated and new ones brought online at any time. The interface, which is intended for non-professional end users, is modeled after widely used commercial software installation packages. Lest one quail at the prospect of end users building their own collections on a shared system, we also describe an interface for the administrative user who is responsible for maintaining a digital library installation.
Focused Crawl of Web Archives to Build Event Collections
Event collections are frequently built by crawling the live web on the basis
of seed URIs nominated by human experts. Focused web crawling is a technique
where the crawler is guided by reference content pertaining to the event. Given
the dynamic nature of the web and the pace with which topics evolve, the timing
of the crawl is a concern for both approaches. We investigate the feasibility
of performing focused crawls on the archived web. By utilizing the Memento
infrastructure, we obtain resources from 22 web archives that contribute to
building event collections. We create collections on four events and compare
the relevance of their resources to collections built from crawling the live
web as well as from a manually curated collection. Our results show that
focused crawling on the archived web can be done and indeed results in highly
relevant collections, especially for events that happened further in the past.
Comment: accepted for publication at WebSci 201
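The Memento lookup at the heart of this approach can be pictured as follows: an archive publishes a TimeMap (in the link format of RFC 7089) listing the captures it holds for a URI, and the crawler selects the capture closest to the event date. The sketch below is a minimal illustration under assumed URLs and a simple nearest-capture policy, not the authors' implementation.

```python
import re
from datetime import datetime, timezone

def parse_timemap(timemap):
    """Parse a link-format TimeMap into (memento_url, capture_datetime) pairs."""
    mementos = []
    # Each entry looks like: <url>;rel="memento";datetime="..., ... GMT"
    for url, attr_str in re.findall(r'<([^>]+)>([^<]*)', timemap):
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', attr_str))
        if "memento" in attrs.get("rel", ""):
            dt = datetime.strptime(attrs["datetime"], "%a, %d %b %Y %H:%M:%S GMT")
            mementos.append((url, dt.replace(tzinfo=timezone.utc)))
    return mementos

def closest_memento(mementos, target):
    """Pick the capture whose timestamp is nearest the event date."""
    return min(mementos, key=lambda m: abs(m[1] - target))

# Hypothetical TimeMap with two captures of the same page
SAMPLE_TIMEMAP = (
    '<http://example.org/page>;rel="original",\n'
    '<http://archive.example/20110415120000/http://example.org/page>'
    ';rel="memento";datetime="Fri, 15 Apr 2011 12:00:00 GMT",\n'
    '<http://archive.example/20140103080000/http://example.org/page>'
    ';rel="memento";datetime="Fri, 03 Jan 2014 08:00:00 GMT"'
)
```

In a real focused crawl of the archived web, one such lookup would be issued per candidate URI, across each of the participating archives.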
Moving towards Adaptive Search
Information retrieval has become very popular over the last decade with the advent of the Web. Nevertheless, searching on the Web is very different from searching on smaller, often more structured collections such as intranets and digital libraries. Such collections are the focus of the recently started AutoAdapt project. The project seeks to aid user search by providing well-structured domain knowledge to assist query modification and navigation. There are two challenges: acquiring the domain knowledge and adapting it automatically to the specific interest of the user community. At the workshop we will demonstrate an implemented prototype that serves as a starting point on the way to truly adaptive search.
Using semantic indexing to improve searching performance in web archives
The sheer volume of electronic documents being published on the Web can be overwhelming for users if the searching aspect is not properly addressed. This problem is particularly acute inside archives and repositories containing large collections of web resources or, more precisely, web pages and other web objects. Using the existing search capabilities in web archives, results can be compromised by the size of the data, content heterogeneity, and changes in scientific terminologies and meanings. During the course of this research, we will explore whether semantic web technologies, particularly ontology-based annotation and retrieval, could improve precision in search results in multi-disciplinary web archives.
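A minimal sketch of what ontology-based annotation and retrieval could look like: documents are annotated with concept identifiers rather than indexed only by surface strings, so a query matches documents that use an older or variant term for the same concept. The mini-ontology, concept IDs, and documents below are invented for illustration and are not drawn from the project.

```python
import re

# Hypothetical mini-ontology: surface forms -> concept identifier.
# Term variants from different eras map to the same concept.
TERM_TO_CONCEPT = {
    "heart attack": "onto:MyocardialInfarction",
    "myocardial infarction": "onto:MyocardialInfarction",
    "stroke": "onto:Stroke",
}

def annotate(text):
    """Map surface terms found in the text to ontology concept IDs."""
    return {cid for term, cid in TERM_TO_CONCEPT.items()
            if re.search(r'\b' + re.escape(term) + r'\b', text, re.IGNORECASE)}

def build_index(docs):
    """Invert the annotations: concept ID -> set of document IDs."""
    index = {}
    for doc_id, text in docs.items():
        for cid in annotate(text):
            index.setdefault(cid, set()).add(doc_id)
    return index

def semantic_search(index, query):
    """Retrieve documents annotated with any concept found in the query."""
    hits = set()
    for cid in annotate(query):
        hits |= index.get(cid, set())
    return hits
```

Because matching happens at the concept level, a 1998 document that says "heart attack" is retrieved for the query "myocardial infarction", which is exactly the terminology-drift problem the abstract describes.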
iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling
Researchers in the Digital Humanities and journalists need to monitor,
collect and analyze fresh online content regarding current events such as the
Ebola outbreak or the Ukraine crisis on demand. However, existing focused
crawling approaches only consider topical aspects while ignoring temporal
aspects and therefore cannot achieve thematically coherent and fresh Web
collections. Especially Social Media provide a rich source of fresh content,
which is not used by state-of-the-art focused crawlers. In this paper we
address the issues of enabling the collection of fresh and relevant Web and
Social Web content for a topic of interest through seamless integration of Web
and Social Media in a novel integrated focused crawler. The crawler collects
Web and Social Media content in a single system and exploits the stream of
fresh Social Media content for guiding the crawler.
Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201
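One way to picture the integration is a shared crawl frontier in which URLs mined from the social stream receive a score boost, so fresh socially-surfaced content is fetched ahead of equally topical but older links. The class below is a minimal sketch under an assumed scoring scheme, not the iCrawl implementation; the boost constant is arbitrary.

```python
import heapq
import itertools

class Frontier:
    """Priority frontier for a focused crawler: topical relevance plus a
    freshness boost for URLs discovered via a social-media stream."""
    SOCIAL_BOOST = 0.5  # assumed weight, not from the paper

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker for equal scores

    def add(self, url, relevance, from_social=False):
        """Enqueue a URL once, boosting those from the social stream."""
        if url in self._seen:
            return
        self._seen.add(url)
        score = relevance + (self.SOCIAL_BOOST if from_social else 0.0)
        # heapq is a min-heap, so store the negated score
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def next_url(self):
        """Pop the highest-scoring URL, or None when the frontier is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

A single frontier shared by the Web and Social Media fetchers is enough to make the social stream "guide" the crawl: whatever it surfaces simply outranks stale candidates.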
Enriching Existing Test Collections with OXPath
Extending TREC-style test collections by incorporating external resources is
a time consuming and challenging task. Making use of freely available web data
requires technical skills to work with APIs or to create a web scraping program
specifically tailored to the task at hand. We present a light-weight
alternative that employs the web data extraction language OXPath to harvest
data to be added to an existing test collection from web resources. We
demonstrate this by creating an extended version of GIRT4 called GIRT4-XT with
additional metadata fields harvested via OXPath from the social sciences portal
Sowiport. This allows the re-use of this collection for other evaluation
purposes like bibliometrics-enhanced retrieval. The demonstrated method can be
applied to a variety of similar scenarios and is not limited to extending
existing collections but can also be used to create completely new ones with
little effort.
Comment: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 201
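Once OXPath has harvested the extra metadata, the enrichment step itself amounts to merging new fields into the existing TREC-style records. The sketch below assumes hypothetical record and field names (GIRT4's actual schema may differ) and deliberately does not reproduce the OXPath extraction itself.

```python
def enrich(collection, harvested, field_tag):
    """Add a harvested metadata field to each matching TREC-style record.

    collection: {doc_id: record string in <DOC>...</DOC> form}
    harvested:  {doc_id: value scraped from an external portal}
    field_tag:  name of the new field element to insert
    """
    enriched = {}
    for doc_id, record in collection.items():
        if doc_id in harvested:
            extra = f"<{field_tag}>{harvested[doc_id]}</{field_tag}>\n"
            # Insert the new field just before the record's closing tag
            record = record.replace("</DOC>", extra + "</DOC>")
        enriched[doc_id] = record
    return enriched
```

Documents without harvested data pass through unchanged, so the extended collection remains usable for the original evaluation tasks as well as for new ones such as bibliometrics-enhanced retrieval.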
ArchivePress: A Really Simple Solution to Archiving Blog Content
ArchivePress is a new technical solution for collecting and archiving content from blogs. Current solutions are commonly based on typical web archiving activities, whereby a crawler is configured to harvest a copy of the blog and return the copy to a web archive. This approach is perfectly acceptable if the requirement is that the site be presented as an integral whole. However, ArchivePress is based upon the premise that blogs are a distinct class of web-based resource, in which the post, not the page, is atomic, and certain properties, such as layouts and colours, are demonstrably superfluous for many (if not most) users. As a result, an approach that builds on the functionality provided by web feeds to capture only selected aspects of the blog offers more potential. This is particularly the case when institutions wish to develop collections of aggregated blog content from a range of different sources. The presentation will describe our research to develop such an approach, including work to define the significant properties of blogs, details of the technical development, and pilot collections against which the tool has been tested.
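The feed-based capture this premise implies can be sketched with nothing more than an XML parser: harvest each post's title, date, and content from the blog's feed and discard the presentation layer entirely. The element names below follow standard RSS 2.0; the sample feed is invented for illustration and this is not the ArchivePress code.

```python
import xml.etree.ElementTree as ET

def capture_posts(feed_xml):
    """Extract post-level content from an RSS 2.0 feed, treating the post
    (not the rendered page) as the atomic unit of the archive."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "published": item.findtext("pubDate"),
            "content": item.findtext("description"),
        })
    return posts

# Hypothetical feed used for illustration only
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>A blog</title>
<item><title>First post</title>
<pubDate>Mon, 06 Jul 2009 10:00:00 GMT</pubDate>
<description>Hello world.</description></item>
<item><title>Second post</title>
<pubDate>Tue, 07 Jul 2009 10:00:00 GMT</pubDate>
<description>More content.</description></item>
</channel></rss>"""
```

Because every blog platform exposes such a feed, the same harvester works across heterogeneous sources, which is what makes feed-based capture attractive for aggregated institutional collections.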
