
    Using semantic indexing to improve searching performance in web archives

    The sheer volume of electronic documents being published on the Web can be overwhelming for users if the searching aspect is not properly addressed. This problem is particularly acute inside archives and repositories containing large collections of web resources or, more precisely, web pages and other web objects. With the existing search capabilities in web archives, results can be compromised by the size of the data, content heterogeneity, and changes in scientific terminologies and meanings. During the course of this research, we will explore whether semantic web technologies, particularly ontology-based annotation and retrieval, could improve precision in search results in multi-disciplinary web archives.
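
    As a rough illustration of the kind of ontology-based annotation and retrieval the abstract refers to (not taken from the work itself), the sketch below maps document terms to concept identifiers in a toy ontology so that a query can also match older terminology; the concept IDs and vocabulary are invented for the example.

```python
# Minimal sketch of ontology-based annotation and retrieval (illustrative only).
# The "ontology" below is a hypothetical toy mapping of concept IDs to surface forms.

ONTOLOGY = {
    "C001": {"consumption", "phthisis", "tuberculosis", "tb"},
    "C002": {"aeroplane", "airplane", "aircraft"},
}

def annotate(text):
    """Map a document's terms to ontology concept IDs."""
    terms = set(text.lower().split())
    return {cid for cid, forms in ONTOLOGY.items() if terms & forms}

def search(query, documents):
    """Retrieve documents sharing at least one concept with the query,
    so 'tuberculosis' also matches archived pages that say 'phthisis'."""
    query_concepts = annotate(query)
    return [doc for doc in documents if annotate(doc) & query_concepts]

docs = ["early treatments for phthisis", "history of the aeroplane"]
print(search("tuberculosis research", docs))  # -> ['early treatments for phthisis']
```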

    Applying digital content management to support localisation

    The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is expanding interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) is seeking to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how they can support localisation and localised content before concluding with some impressions of future directions in DCM.

    The First International Conference on Building and Exploring Web Based Environments (WEB2013)

    The sheer volume of electronic documents being published on the Web can be overwhelming for users if the searching aspect is not properly addressed. This problem is particularly acute inside archives and repositories containing large collections of web resources or, more precisely, web pages and other web objects. With the existing search capabilities in web archives, results can be compromised by the size of the data, content heterogeneity, and changes in scientific terminologies and meanings. During the course of this research, we will explore whether semantic web technologies, particularly ontology-based annotation and retrieval, could improve precision in search results in multi-disciplinary web archives.

    See a little Warclight: building an open-source web archive portal with project blacklight

    In 2014-15, through close collaboration between UK-based researchers and the UK Web Archive, the open-source Shine project was launched. It allowed faceted search, trend diagram exploration, and other advanced methods of exploring web archives. It had two limitations, however: it was based on the Play framework (which is relatively obscure, especially within library settings), and after the Big UK Domain Data for the Arts and Humanities (BUDDAH) project came to an end, development largely languished. The idea of Shine is an important one, however, and our project team wanted to explore how we could take this great work and begin to move it into the wider, open-source library community. Hence the idea of a Project Blacklight-based engine for exploring web archives. Blacklight, an open-source library discovery engine, would be familiar to library IT managers and other technical community members. But what if Blacklight could work with WARCs? The Archives Unleashed team’s first foray towards what we now call “Warclight” (a portmanteau of Blacklight and the ISO-standardized Web ARChive file format) was building a standalone Blacklight Rails application. As we began to realize this would not help those who would like to implement it themselves, development pivoted to building a Rails Engine, which “allows you to wrap a specific Rails application or subset of functionality and share it with other applications or within a larger packaged application.” Put another way, it allows others to use an existing Warclight template to build their own web archive search application. Drawing inspiration from UKWA’s Shine, it allows faceted full-text search, record view, and other advanced discovery options. Warclight is designed to work with web archive data that is indexed via the UK Web Archive’s webarchive-discovery project. Webarchive-discovery is a utility that parses ARCs and WARCs and indexes them using Apache Solr, an open-source search platform. Once these ARCs and WARCs have been indexed into Solr, we gain searchable fields including title, host, crawl date, and content type. One of the biggest strengths of Warclight is that it is based on Blacklight. This opens up a mature open-source community, which could allow us to go further if we follow the old idiom: “If you want to go fast, go alone. If you want to go further, go together.” This presentation will provide an overview of Warclight and implementation patterns, including the Archives Unleashed at-scale implementation of over 1 billion Solr documents using Apache SolrCloud. This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.
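
    To make the indexing pipeline concrete, here is a hedged sketch of a faceted full-text query against a webarchive-discovery-style Solr index, in the spirit of what Warclight exposes. The Solr URL, core name, and exact field names (content, host, crawl_date, content_type) are assumptions for illustration and may not match the real schema.

```python
# Hedged sketch: faceted search against a Solr index of WARC records.
# Endpoint and field names are assumed, not taken from webarchive-discovery's schema.
import requests

SOLR_SELECT = "http://localhost:8983/solr/warc-index/select"  # hypothetical core name

def search_archive(query, host=None, rows=10):
    params = {
        "q": f"content:({query})",      # full-text field (assumed name)
        "rows": rows,
        "wt": "json",
        "facet": "true",
        "facet.field": ["host", "crawl_date", "content_type"],  # assumed facet fields
    }
    if host:
        params["fq"] = f"host:{host}"   # narrow results to one harvested host
    resp = requests.get(SOLR_SELECT, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return data["response"]["docs"], data.get("facet_counts", {})

docs, facets = search_archive("geocities", host="www.geocities.com")
for doc in docs:
    print(doc.get("title"), doc.get("crawl_date"))
```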

    Lost but not forgotten: finding pages on the unarchived web

    Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
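
    A minimal sketch of the underlying idea (not the paper's actual pipeline): aggregate the anchor text of links pointing from crawled pages to unarchived target URLs, then use those aggregated terms as surrogate representations for known-item search. All data and scoring below are toy examples.

```python
# Toy sketch: represent unarchived pages by the anchor text of links pointing at them.
from collections import Counter, defaultdict

archived = {"http://example.org/home"}  # pages actually captured in the archive

# (source page, target URL, anchor text) triples extracted from crawled pages.
links = [
    ("http://example.org/home", "http://example.org/projects", "our research projects"),
    ("http://example.org/home", "http://other.org/report",     "annual report 2010"),
    ("http://example.org/news", "http://other.org/report",     "report on web archiving"),
]

# Aggregate anchor-text terms per unarchived target URL.
representations = defaultdict(Counter)
for src, target, anchor in links:
    if target not in archived:
        representations[target].update(anchor.lower().split())

def known_item_search(query):
    """Rank unarchived URLs by how many query terms their anchor-text profile contains."""
    terms = query.lower().split()
    scored = [(sum(rep[t] for t in terms), url) for url, rep in representations.items()]
    return [url for score, url in sorted(scored, reverse=True) if score > 0]

print(known_item_search("annual report"))  # -> ['http://other.org/report']
```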

    Using Google Analytics Data to Expand Discovery and Use of Digital Archival Content

    This article presents opportunities for the use of Google Analytics, a popular and freely available web analytics tool, to inform decision making for digital archivists managing online digital archives content. Emphasis is placed on the analysis of Google Analytics data to increase the visibility and discoverability of content. The article describes the use of Google Analytics to support fruitful digital outreach programs, to guide metadata creation for enhancing access, and to measure user demand to aid selection for digitization. Valuable reports, features, and tools in Google Analytics are identified and the use of these tools to gather meaningful data is explained.
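
    As one hedged example of the "measuring user demand" use case, the sketch below ranks item pages by pageviews from a CSV export of analytics data. The column names ("Page", "Pageviews") and the item URL prefix are assumptions about a typical export, not a workflow prescribed by the article.

```python
# Hedged sketch: rank digital collection items by pageviews from an analytics CSV export.
# Column names and the item URL prefix are assumptions for illustration.
import csv
from collections import Counter

def rank_items_by_demand(csv_path, item_prefix="/digital/collection/"):
    demand = Counter()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            page = row["Page"]
            if page.startswith(item_prefix):          # keep only item-level pages
                demand[page] += int(row["Pageviews"])
    return demand.most_common()

# e.g. use the top of this list to prioritise items for digitization or metadata work
for page, views in rank_items_by_demand("analytics_export.csv")[:10]:
    print(views, page)
```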

    Profiling Web Archive Coverage for Top-Level Domain and Content Language

    The Memento aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of only sending queries to archives likely to hold the archived page. We profile twelve public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and discover that only sending queries to the top three web archives (i.e., a 75% reduction in the number of queries) for any request produces the full TimeMaps in 84% of the cases. Comment: Appeared in TPDL 201
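
    A toy sketch of the routing idea (not the paper's actual profiles or scoring): score each archive's likelihood of holding a URL from a coverage profile keyed by top-level domain, and send queries only to the top-ranked archives.

```python
# Toy sketch: route a lookup to the few archives most likely to hold the URL,
# instead of querying every archive. Profiles and scores are invented illustrations.
from urllib.parse import urlparse

# Hypothetical profile: archive name -> {top-level domain: fraction of holdings}.
PROFILES = {
    "archive-A": {"uk": 0.80, "com": 0.15},
    "archive-B": {"nl": 0.70, "com": 0.20},
    "archive-C": {"com": 0.60, "org": 0.30},
}

def route(url, k=3):
    """Return the k archives most likely to hold captures of this URL."""
    tld = urlparse(url).hostname.rsplit(".", 1)[-1]
    ranked = sorted(PROFILES, key=lambda a: PROFILES[a].get(tld, 0.0), reverse=True)
    return ranked[:k]

print(route("http://www.example.co.uk/page", k=2))  # -> ['archive-A', ...]
```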

    BlogForever D3.2: Interoperability Prospects

    This report evaluates the interoperability prospects of the BlogForever platform. To this end, existing interoperability models are reviewed, a Delphi study to identify crucial aspects for the interoperability of web archives and digital libraries is conducted, technical interoperability standards and protocols are reviewed regarding their relevance for BlogForever, a simple approach to consider interoperability in specific usage scenarios is proposed, and a tangible approach to develop a succession plan that would allow a reliable transfer of content from the current digital archive to other digital repositories is presented.

    Content-Based Exploration of Archival Images Using Neural Networks

    We present DAIRE (Deep Archival Image Retrieval Engine), an image exploration tool based on latent representations derived from neural networks, which allows scholars to "query" using an image of interest to rapidly find related images within a web archive. This work represents one part of our broader effort to move away from text-centric analyses of web archives and scholarly tools that are direct reflections of methods for accessing the live web. This short piece describes the implementation of our system and a case study on a subset of the GeoCities web archive. This research was supported in part by the Andrew W. Mellon Foundation and the Social Sciences and Humanities Research Council of Canada.
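
    A hedged sketch of this style of content-based retrieval, not DAIRE's actual implementation: embed each image with a pretrained CNN and return nearest neighbours by cosine similarity. The model choice and file paths are placeholders.

```python
# Hedged sketch: image similarity search via CNN embeddings (not DAIRE's implementation).
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Pretrained ResNet-50 with the classification head removed -> 2048-d embeddings.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    vec = encoder(img).flatten()
    return vec / vec.norm()                      # L2-normalise for cosine similarity

def most_similar(query_path, corpus_paths, top_k=5):
    query = embed(query_path)
    scored = [(float(torch.dot(query, embed(p))), p) for p in corpus_paths]
    return sorted(scored, reverse=True)[:top_k]

# e.g. most_similar("query.jpg", ["a.jpg", "b.jpg", "c.jpg"])  # placeholder file names
```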

    Eprints and the Open Archives Initiative

    The Open Archives Initiative (OAI) was created as a practical way to promote interoperability between eprint repositories. Although the scope of the OAI has been broadened, eprint repositories still represent a significant fraction of OAI data providers. In this article I present a brief survey of OAI eprint repositories, and of services using metadata harvested from eprint repositories using the OAI protocol for metadata harvesting (OAI-PMH). I then discuss several situations where metadata harvesting may be used to further improve the utility of eprint archives as a component of the scholarly communication infrastructure. Comment: 13 page
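
    For readers unfamiliar with OAI-PMH, the sketch below shows a minimal harvest using the standard ListRecords verb with Dublin Core metadata and resumption tokens; the endpoint URL is a placeholder.

```python
# Minimal OAI-PMH harvesting sketch (ListRecords with oai_dc and resumption tokens).
import requests
import xml.etree.ElementTree as ET

BASE_URL = "https://example.org/oai"  # placeholder OAI-PMH endpoint
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def harvest(base_url=BASE_URL):
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while True:
        root = ET.fromstring(requests.get(base_url, params=params, timeout=60).content)
        for record in root.findall(".//oai:record", NS):
            ident = record.find(".//oai:identifier", NS)   # header identifier
            title = record.find(".//dc:title", NS)          # Dublin Core title
            yield (ident.text if ident is not None else None,
                   title.text if title is not None else None)
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break
        params = {"verb": "ListRecords", "resumptionToken": token.text}

# e.g. for oai_id, title in harvest(): print(oai_id, title)
```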