Bots, Seeds and People: Web Archives as Infrastructure
The field of web archiving provides a unique mix of human and automated
agents collaborating to preserve the web. Centuries-old theories of archival
appraisal are being transplanted into the sociotechnical environment of the
World Wide Web with varying degrees of success. The work of archivists and
bots in contact with the material of the web presents a distinctive and
understudied CSCW-shaped problem. To investigate this space we conducted
semi-structured interviews with archivists and technologists who were
directly involved in selecting content from the web for archives. These
semi-structured interviews identified thematic areas that inform the appraisal
process in web archives, some of which are encoded in heuristics and
algorithms. Making the infrastructure of web archives legible to the archivist,
the automated agents, and the future researcher is presented as a challenge to
the CSCW and archival communities.
Illinois Digital Scholarship: Preserving and Accessing the Digital Past, Present, and Future
Since the University's establishment in 1867, its scholarly output has been issued primarily in print, and the University Library and Archives have been readily able to collect, preserve, and provide access to that output. Today, technological, economic, political, and social forces are buffeting all means of scholarly communication. Scholars, academic institutions, and publishers are engaged in debate about the impact of digital scholarship and open access publishing on the promotion and tenure process. The upsurge in digital scholarship affects many aspects of the academic enterprise, including how we record, evaluate, preserve, organize, and disseminate scholarly work. As a result, the Library has no ready means by which to archive digitally produced publications, reports, presentations, and learning objects, much of which cannot be adequately represented in print form. In this incredibly fluid environment of digital scholarship, the critical question of how we will collect, preserve, and manage access to this important part of the University's scholarly record demands a rational and forward-looking plan - one that includes perspectives from diverse scholarly disciplines, incorporates significant research breakthroughs in information science and computer science, and makes effective projections for future integration within the Library and computing services as part of the campus infrastructure. Prepared jointly by the University of Illinois Library and CITES at the University of Illinois at Urbana-Champaign.
Digital Archiving in the Context of Cultural Change
The term 'archiving' has its origins in the context of the printing press. As our social constructs change and evolve with the advent and ubiquity of the network, it is necessary to recognize that established terms can hamper our adjustment to this inevitable evolution.
Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool
Conventional Web archives are created by periodically crawling a web site and
archiving the responses from the Web server. Although easy to implement and
commonly deployed, this form of archiving typically misses updates and may not
be suitable for all preservation scenarios, for example a site that is required
(perhaps for records compliance) to keep a copy of all pages it has served. In
contrast, transactional archives work in conjunction with a Web server to
record all pages that have been served. Los Alamos National Laboratory has
developed SiteStory, an open-source transactional archive written in Java
that runs on Apache Web servers, provides a Memento-compatible access
interface, and offers WARC file export features. We used the ApacheBench
utility on a pre-release version of SiteStory to measure response time and
content delivery time in different environments and on different machines. The
performance tests were designed to determine the feasibility of SiteStory as a
production-level solution for high-fidelity automatic Web archiving. We found
that SiteStory does not significantly affect content server performance when it
is performing transactional archiving. Content server performance slows from
0.076 seconds to 0.086 seconds per Web page access when the content server is
under load, and from 0.15 seconds to 0.21 seconds when the resource has many
embedded and changing resources. Comment: 13 pages, Technical Report
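The benchmarking idea described above, timing repeated fetches of a page to get a mean seconds-per-access figure, can be sketched in plain Python. This is an illustrative stand-in for ApacheBench, not SiteStory's actual test harness; the local test server and request count are assumptions for the sake of a self-contained example.

```python
import statistics
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Page(BaseHTTPRequestHandler):
    """Placeholder content server standing in for an Apache host."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>hello</body></html>")
    def log_message(self, *args):
        pass  # silence per-request logging

def measure(url, requests=20):
    """Time sequential GETs, ApacheBench-style; return mean seconds per page."""
    samples = []
    for _ in range(requests):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # include content delivery time, not just headers
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# Start the test server on an ephemeral port and benchmark it.
server = HTTPServer(("127.0.0.1", 0), Page)
threading.Thread(target=server.serve_forever, daemon=True).start()
mean_latency = measure(f"http://127.0.0.1:{server.server_port}/", requests=20)
server.shutdown()
print(f"mean response time: {mean_latency:.4f} s/page")
```

Running the same measurement with archiving enabled and disabled, then comparing the two means, is the essence of the overhead comparison the paper reports.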
The selection, appraisal and retention of digital scientific data: highlights of an ERPANET/CODATA workshop
CODATA and ERPANET collaborated to convene an international archiving workshop on the selection, appraisal, and retention of digital scientific data, which was held on 15-17 December 2003 at the Biblioteca Nacional in Lisbon, Portugal. The workshop brought together more than 65 researchers, data and information managers, archivists, and librarians from 13 countries to discuss the issues involved in making critical decisions regarding the long-term preservation of the scientific record. One of the major aims for this workshop was to provide an international forum to exchange information about data archiving policies and practices across different scientific, institutional, and national contexts. Highlights from the workshop discussions are presented.
Representing Dataset Quality Metadata using Multi-Dimensional Views
Data quality is commonly defined as fitness for use. The problem of
identifying quality of data is faced by many data consumers. Data publishers
often do not have the means to identify quality problems in their data. To make
the task for both stakeholders easier, we have developed the Dataset Quality
Ontology (daQ). daQ is a core vocabulary for representing the results of
quality benchmarking of a linked dataset. It represents quality metadata as
multi-dimensional and statistical observations using the Data Cube vocabulary.
Quality metadata are organised as a self-contained graph, which can, e.g., be
embedded into linked open datasets. We discuss the design considerations, give
examples for extending daQ by custom quality metrics, and present use cases
such as analysing data versions, browsing datasets by quality, and link
identification. We finally discuss how data cube visualisation tools enable
data publishers and consumers to better analyse the quality of their data. Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5 September 2014, Leipzig, Germany
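The core idea of representing a quality measurement as a multi-dimensional observation can be illustrated with a few lines of Python that render a Data Cube-style Turtle snippet. The IRIs, prefixes, and property names below are illustrative placeholders, not the actual daQ vocabulary terms.

```python
from textwrap import dedent

def quality_observation(dataset, metric, value, computed_on):
    """Render one quality measurement as a Data Cube-style observation.
    All ex: terms are hypothetical; only the qb:Observation pattern is real."""
    return dedent(f"""\
        ex:obs1 a qb:Observation ;
            ex:computedOn <{dataset}> ;
            ex:metric ex:{metric} ;
            ex:value "{value}"^^xsd:double ;
            ex:timestamp "{computed_on}"^^xsd:date .
        """)

snippet = quality_observation(
    "http://example.org/dataset", "dereferenceability", 0.87, "2014-09-04")
print(snippet)
```

The dimensions (dataset, metric, time) are what make the observation multi-dimensional: the same graph can hold many observations sliced by any of them, which is what enables the version-comparison and quality-browsing use cases the abstract mentions.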
Web Data Extraction, Applications and Techniques: A Survey
Web Data Extraction is an important problem that has been studied by means of
different scientific tools and in a broad range of applications. Many
approaches to extracting data from the Web have been designed to solve specific
problems and operate in ad-hoc domains. Other approaches, instead, heavily
reuse techniques and algorithms developed in the field of Information
Extraction.
This survey aims at providing a structured and comprehensive overview of the
literature in the field of Web Data Extraction. We provide a simple
classification framework in which existing Web Data Extraction applications are
grouped into two main classes, namely applications at the Enterprise level and
at the Social Web level. At the Enterprise level, Web Data Extraction
techniques emerge as a key tool for performing data analysis in Business and
Competitive Intelligence systems as well as for business process
re-engineering. At the Social Web level, Web Data Extraction techniques make it
possible to gather the large amounts of structured data continuously generated
and disseminated by Web 2.0, Social Media, and Online Social Network users,
offering unprecedented opportunities to analyze human behavior at a very large
scale. We also discuss the potential of cross-fertilization, i.e., the
possibility of reusing Web Data Extraction techniques originally designed to
work in a given domain in other domains. Comment: Knowledge-Based Systems
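As a toy illustration of the wrapper-style extraction that such surveys cover, a few lines of Python using the standard-library HTML parser can pull structured records out of markup. The tag and class names are assumptions for the example, not taken from any particular system discussed in the survey.

```python
from html.parser import HTMLParser

class TitleExtractor(HTMLParser):
    """Minimal wrapper: collect the text inside <h2 class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True
    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

html = '<div><h2 class="title">Item A</h2><p>x</p><h2 class="title">Item B</h2></div>'
parser = TitleExtractor()
parser.feed(html)
print(parser.titles)  # → ['Item A', 'Item B']
```

Real systems differ mainly in how the extraction rules are obtained: hand-written as here, induced from examples, or learned, which is one axis along which the survey organizes the field.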
Research assessment in the humanities: problems and challenges
Research assessment is coming to play a new role in the governance of universities and research institutions. Evaluation of results is evolving from a simple tool for resource allocation towards an instrument of policy design. In this respect, "measuring" implies a different approach to quantitative aspects, as well as an estimation of qualitative criteria that are difficult to define. Bibliometrics became so popular, in spite of its limits, precisely because it offers a simple solution to complex problems. The theory behind it is not especially robust, but the available results confirm the method as a reasonable trade-off between costs and benefits.
Indeed, there are some fields of science where quantitative indicators are very difficult to apply due to the lack of databases and data - in short, the credibility of the existing information. The humanities and social sciences (HSS) need a coherent methodology for assessing research outputs, but current projects are not very convincing.
The possibility of creating a shared ranking of journals by the value of their contents at the institutional, national, or European level is not enough: it raises the same biases as in the hard sciences, and it does not solve the problem of the various types of outputs and their different, much longer times of creation and dissemination.
The web (and web 2.0) represents a revolution in the communication of research results, especially in the HSS, and their evaluation has to take this change into account. Furthermore, the growth of open access initiatives (the green and gold roads) offers a large quantity of transparent, verifiable data structured according to international standards, which allows comparability beyond national limits and, above all, is independent of commercial agents.
The pilot scheme carried out at the University of Milan for the Faculty of Humanities demonstrated that it is possible to build quantitative, on average more robust indicators that could provide a proxy of research production and productivity even in the HSS.