Bots, Seeds and People: Web Archives as Infrastructure
The field of web archiving provides a unique mix of human and automated
agents collaborating to preserve the web. Centuries-old theories of archival
appraisal are being transplanted into the sociotechnical environment of the
World Wide Web with varying degrees of success. The work of archivists and
bots in contact with the material of the web presents a distinctive and
understudied CSCW-shaped problem. To investigate this space we conducted
semi-structured interviews with archivists and technologists who were
directly involved in selecting content from the web for archives. These
semi-structured interviews identified thematic areas that inform the appraisal
process in web archives, some of which are encoded in heuristics and
algorithms. Making the infrastructure of web archives legible to the archivist,
the automated agents, and the future researcher is presented as a challenge to
the CSCW and archival communities.
Illinois Digital Scholarship: Preserving and Accessing the Digital Past, Present, and Future
Since the University's establishment in 1867, its scholarly output has been issued primarily in print, and the University Library and Archives have been readily able to collect, preserve, and provide access to that output. Today, technological, economic, political, and social forces are buffeting all means of scholarly communication. Scholars, academic institutions, and publishers are engaged in debate about the impact of digital scholarship and open access publishing on the promotion and tenure process. The upsurge in digital scholarship affects many aspects of the academic enterprise, including how we record, evaluate, preserve, organize, and disseminate scholarly work. As a result, the Library has no ready means by which to archive digitally produced publications, reports, presentations, and learning objects, much of which cannot be adequately represented in print form. In this incredibly fluid environment of digital scholarship, the critical question of how we will collect, preserve, and manage access to this important part of the University's scholarly record demands a rational and forward-looking plan - one that includes perspectives from diverse scholarly disciplines, incorporates significant research breakthroughs in information science and computer science, and makes effective projections for future integration within the Library and computing services as part of the campus infrastructure. Prepared jointly by the University of Illinois Library and CITES at the University of Illinois at Urbana-Champaign.
Digital Archiving in the Context of Cultural Change
The term 'archiving' has its origins in the context of the printing press. As our social constructs change and evolve with the advent and ubiquity of the network, it is necessary to recognize that established terms can hamper our adjustment to this inevitable evolution.
Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool
Conventional Web archives are created by periodically crawling a web site and
archiving the responses from the Web server. Although easy to implement and
commonly deployed, this form of archiving typically misses updates and may not
be suitable for all preservation scenarios, for example a site that is required
(perhaps for records compliance) to keep a copy of all pages it has served. In
contrast, transactional archives work in conjunction with a Web server to
record all pages that have been served. Los Alamos National Laboratory has
developed SiteStory, an open-source transactional archive written in Java
that runs on Apache Web servers, provides a Memento-compatible access
interface, and offers WARC file export features. We used the ApacheBench
utility on a pre-release version of SiteStory to measure response time and
content delivery time in different environments and on different machines. The
performance tests were designed to determine the feasibility of SiteStory as a
production-level solution for high-fidelity automatic Web archiving. We found
that SiteStory does not significantly affect content server performance when it
is performing transactional archiving. Content server performance slows from
0.076 seconds to 0.086 seconds per Web page access when the content server is
under load, and from 0.15 seconds to 0.21 seconds when the resource has many
embedded and changing resources. Comment: 13 pages, Technical Report
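The benchmarking idea described above, timing repeated fetches of a page to get a mean seconds-per-access figure, can be sketched in plain Python. This is an illustrative stand-in for ApacheBench, not SiteStory's actual test harness; the local test server and request count are assumptions for the sake of a self-contained example.

```python
import statistics
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Page(BaseHTTPRequestHandler):
    """Placeholder content server standing in for an Apache host."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>hello</body></html>")
    def log_message(self, *args):
        pass  # silence per-request logging

def measure(url, requests=20):
    """Time sequential GETs, ApacheBench-style; return mean seconds per page."""
    samples = []
    for _ in range(requests):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # include content delivery time, not just headers
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Start the test server on an ephemeral port and benchmark it.
server = HTTPServer(("127.0.0.1", 0), Page)
threading.Thread(target=server.serve_forever, daemon=True).start()
mean_latency = measure(f"http://127.0.0.1:{server.server_port}/", requests=20)
server.shutdown()
print(f"mean response time: {mean_latency:.4f} s/page")
```

Running the same measurement with archiving enabled and disabled, then comparing the two means, is the essence of the overhead comparison the paper reports.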
The selection, appraisal and retention of digital scientific data: highlights of an ERPANET/CODATA workshop
CODATA and ERPANET collaborated to convene an international archiving workshop on the selection, appraisal, and retention of digital scientific data, which was held on 15-17 December 2003 at the Biblioteca Nacional in Lisbon, Portugal. The workshop brought together more than 65 researchers, data and information managers, archivists, and librarians from 13 countries to discuss the issues involved in making critical decisions regarding the long-term preservation of the scientific record. One of the major aims for this workshop was to provide an international forum to exchange information about data archiving policies and practices across different scientific, institutional, and national contexts. Highlights from the workshop discussions are presented.
Representing Dataset Quality Metadata using Multi-Dimensional Views
Data quality is commonly defined as fitness for use. The problem of
identifying quality of data is faced by many data consumers. Data publishers
often do not have the means to identify quality problems in their data. To make
the task for both stakeholders easier, we have developed the Dataset Quality
Ontology (daQ). daQ is a core vocabulary for representing the results of
quality benchmarking of a linked dataset. It represents quality metadata as
multi-dimensional and statistical observations using the Data Cube vocabulary.
Quality metadata are organised as a self-contained graph, which can, e.g., be
embedded into linked open datasets. We discuss the design considerations, give
examples for extending daQ by custom quality metrics, and present use cases
such as analysing data versions, browsing datasets by quality, and link
identification. We finally discuss how data cube visualisation tools enable
data publishers and consumers to better analyse the quality of their data. Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5 September 2014, Leipzig, Germany
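The core idea of representing a quality measurement as a multi-dimensional observation can be illustrated with a few lines of Python that render a Data Cube-style Turtle snippet. The IRIs, prefixes, and property names below are illustrative placeholders, not the actual daQ vocabulary terms.

```python
from textwrap import dedent

def quality_observation(dataset, metric, value, computed_on):
    """Render one quality measurement as a Data Cube-style observation.
    All ex: terms are hypothetical; only the qb:Observation pattern is real."""
    return dedent(f"""\
        ex:obs1 a qb:Observation ;
            ex:computedOn <{dataset}> ;
            ex:metric ex:{metric} ;
            ex:value "{value}"^^xsd:double ;
            ex:timestamp "{computed_on}"^^xsd:date .
        """)

snippet = quality_observation(
    "http://example.org/dataset", "dereferenceability", 0.87, "2014-09-04")
print(snippet)
```

The dimensions (dataset, metric, time) are what make the observation multi-dimensional: the same graph can hold many observations sliced by any of them, which is what enables the version-comparison and quality-browsing use cases the abstract mentions.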
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool for performing data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather the large amounts of structured data continuously generated
and disseminated by Web 2.0, Social Media, and Online Social Network users,
offering unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential of cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed to
work in a given domain in other domains. Comment: Knowledge-Based Systems
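As a toy illustration of the wrapper-style extraction that such surveys cover, a few lines of Python using the standard-library HTML parser can pull structured records out of markup. The tag and class names are assumptions for the example, not taken from any particular system discussed in the survey.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Minimal wrapper: collect the text inside <h2 class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = '<div><h2 class="title">Item A</h2><p>x</p><h2 class="title">Item B</h2></div>'
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # → ['Item A', 'Item B']
```

Real systems differ mainly in how the extraction rules are obtained: hand-written as here, induced from examples, or learned, which is one axis along which the survey organizes the field.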
Research assessment in the humanities: problems and challenges
Research assessment is coming to play a new role in the governance of universities and research institutions. Evaluation of results is evolving from a simple tool for resource allocation towards an instrument of policy design. In this respect, "measuring" implies a different approach to quantitative aspects, as well as an estimation of qualitative criteria that are difficult to define. Bibliometrics became so popular, in spite of its limits, precisely because it offers a simple solution to complex problems. The theory behind it is not especially robust, but the available results confirm the method as a reasonable trade-off between costs and benefits.
Indeed, there are some fields of science where quantitative indicators are very difficult to apply due to the lack of databases and data - in short, the credibility of the existing information. The humanities and social sciences (HSS) need a coherent methodology for assessing research outputs, but current projects are not very convincing.
The possibility of creating a shared ranking of journals by the value of their contents at the institutional, national, or European level is not enough: it raises the same biases as in the hard sciences, and it does not solve the problem of the various types of outputs and their different, much longer times of creation and dissemination.
The web (and web 2.0) represents a revolution in the communication of research results, especially in the HSS, and their evaluation has to take this change into account. Furthermore, the growth of open access initiatives (the green and gold roads) offers a large quantity of transparent, verifiable data structured according to international standards, which allows comparability beyond national limits and, above all, is independent of commercial agents.
The pilot scheme carried out at the University of Milan for the Faculty of Humanities demonstrated that it is possible to build quantitative, on average more robust indicators that could provide a proxy of research production and productivity even in the HSS.