
    Bots, Seeds and People: Web Archives as Infrastructure

    The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries-old theories of archival appraisal are being transplanted into the sociotechnical environment of the World Wide Web with varying degrees of success. The work of archivists and bots in contact with the material of the web presents a distinctive and understudied CSCW-shaped problem. To investigate this space we conducted semi-structured interviews with archivists and technologists who were directly involved in the selection of content from the web for archives. These interviews identified thematic areas that inform the appraisal process in web archives, some of which are encoded in heuristics and algorithms. Making the infrastructure of web archives legible to the archivist, the automated agents, and the future researcher is presented as a challenge to the CSCW and archival community.
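    The paper itself is qualitative and does not publish its heuristics. Purely as an illustration of the kind of appraisal heuristic the interviews surfaced, the Python sketch below scores candidate seed URLs by blending a human nomination signal with automated crawl-derived signals; the Seed record, the weights, and the domain allowlist are all hypothetical.

    from dataclasses import dataclass
    from urllib.parse import urlparse

    # Hypothetical seed record; the fields are illustrative, not from the paper.
    @dataclass
    class Seed:
        url: str
        nominated_by_archivist: bool  # human appraisal signal
        inbound_links: int            # automated, crawl-derived signal

    TRUSTED_SUFFIXES = (".gov", ".edu")  # assumed allowlist, purely illustrative

    def appraisal_score(seed: Seed) -> float:
        """Blend human and automated signals into one crawl-priority score."""
        host = urlparse(seed.url).hostname or ""
        score = 1.0 if seed.nominated_by_archivist else 0.0  # human judgement weighted heavily
        if host.endswith(TRUSTED_SUFFIXES):
            score += 0.5
        # Cap the link-popularity signal so it cannot swamp human input.
        score += min(seed.inbound_links / 100.0, 0.5)
        return score

    seeds = [
        Seed("https://example.gov/annual-report", True, 40),
        Seed("https://blog.example.com/post", False, 500),
    ]
    for s in sorted(seeds, key=appraisal_score, reverse=True):
        print(f"{appraisal_score(s):.2f}  {s.url}")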

    Illinois Digital Scholarship: Preserving and Accessing the Digital Past, Present, and Future

    Since the University's establishment in 1867, its scholarly output has been issued primarily in print, and the University Library and Archives have been readily able to collect, preserve, and provide access to that output. Today, technological, economic, political, and social forces are buffeting all means of scholarly communication. Scholars, academic institutions, and publishers are engaged in debate about the impact of digital scholarship and open access publishing on the promotion and tenure process. The upsurge in digital scholarship affects many aspects of the academic enterprise, including how we record, evaluate, preserve, organize, and disseminate scholarly work. As a result, the Library has no ready means by which to archive digitally produced publications, reports, presentations, and learning objects, much of which cannot be adequately represented in print form. In this incredibly fluid environment of digital scholarship, the critical question of how we will collect, preserve, and manage access to this important part of the University scholarly record demands a rational and forward-looking plan: one that includes perspectives from diverse scholarly disciplines, incorporates significant research breakthroughs in information science and computer science, and makes effective projections for future integration within the Library and computing services as a part of the campus infrastructure. Prepared jointly by the University of Illinois Library and CITES at the University of Illinois at Urbana-Champaign.

    Digital Archiving in the Context of Cultural Change

    The term 'archiving' has its origins in the context of the printing press. As our social constructs change and evolve with the advent and ubiquity of the network, it is necessary to recognize that established terms can hamper our adjustment to this inevitable evolution.

    Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool

    Conventional Web archives are created by periodically crawling a web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of all pages it has served. In contrast, transactional archives work in conjunction with a Web server to record all pages that have been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive written in Java that runs on Apache Web servers, provides a Memento-compatible access interface, and offers WARC file export features. We used the ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments and on different machines. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high-fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources. Comment: 13 pages, Technical Report
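    The abstract does not reproduce the exact ApacheBench invocation used in the tests. As a minimal sketch of how such a measurement can be driven, the Python below shells out to ab and parses the mean time per request; the URLs, request count, and concurrency level are placeholders, not the paper's settings.

    import re
    import subprocess

    def mean_time_per_request(url: str, requests: int = 1000, concurrency: int = 10) -> float:
        """Run ApacheBench against url; return mean time per request in seconds."""
        out = subprocess.run(
            ["ab", "-n", str(requests), "-c", str(concurrency), url],
            capture_output=True, text=True, check=True,
        ).stdout
        # ab prints a line such as: "Time per request:  86.347 [ms] (mean)"
        match = re.search(r"Time per request:\s+([\d.]+) \[ms\] \(mean\)", out)
        if match is None:
            raise RuntimeError("could not parse ab output")
        return float(match.group(1)) / 1000.0

    # Placeholder URLs: the same page served with archiving off and then on.
    baseline = mean_time_per_request("http://localhost/page.html")
    archived = mean_time_per_request("http://localhost:8080/page.html")
    print(f"baseline {baseline:.3f}s, with transactional archiving {archived:.3f}s")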

    The selection, appraisal and retention of digital scientific data: highlights of an ERPANET/CODATA workshop

    CODATA and ERPANET collaborated to convene an international archiving workshop on the selection, appraisal, and retention of digital scientific data, which was held on 15-17 December 2003 at the Biblioteca Nacional in Lisbon, Portugal. The workshop brought together more than 65 researchers, data and information managers, archivists, and librarians from 13 countries to discuss the issues involved in making critical decisions regarding the long-term preservation of the scientific record. One of the major aims for this workshop was to provide an international forum to exchange information about data archiving policies and practices across different scientific, institutional, and national contexts. Highlights from the workshop discussions are presented.

    Guidance for selecting materials for digitisation

    28-30 September 199

    Representing Dataset Quality Metadata using Multi-Dimensional Views

    Data quality is commonly defined as fitness for use. The problem of identifying the quality of data is faced by many data consumers, while data publishers often do not have the means to identify quality problems in their data. To make the task easier for both stakeholders, we have developed the Dataset Quality Ontology (daQ). daQ is a core vocabulary for representing the results of quality benchmarking of a linked dataset. It represents quality metadata as multi-dimensional and statistical observations using the Data Cube vocabulary. Quality metadata are organised as a self-contained graph, which can, e.g., be embedded into linked open datasets. We discuss the design considerations, give examples of extending daQ with custom quality metrics, and present use cases such as analysing data versions, browsing datasets by quality, and link identification. We finally discuss how data cube visualisation tools enable data publishers and consumers to better analyse the quality of their data. Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5 September 2014, Leipzig, Germany
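    As a rough illustration of the central idea, quality metadata expressed as Data Cube observations, the rdflib sketch below records a single quality measurement for a dataset version. The daq: property and metric names are approximations for illustration only; the daQ specification defines the exact vocabulary.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    # qb is the W3C Data Cube vocabulary; the daq terms used here are
    # illustrative approximations of the Dataset Quality Ontology.
    QB = Namespace("http://purl.org/linked-data/cube#")
    DAQ = Namespace("http://purl.org/eis/vocab/daq#")
    EX = Namespace("http://example.org/")  # placeholder dataset namespace

    g = Graph()
    g.bind("qb", QB)
    g.bind("daq", DAQ)

    obs = EX["obs/dereferenceability/1"]
    g.add((obs, RDF.type, QB.Observation))
    g.add((obs, DAQ.computedOn, EX["dataset/v1"]))          # assumed property name
    g.add((obs, DAQ.metric, DAQ.DereferenceabilityMetric))  # hypothetical metric
    g.add((obs, DAQ.value, Literal(0.87, datatype=XSD.double)))

    print(g.serialize(format="turtle"))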

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of application domains. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains; others heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for performing data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyse human behaviour at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of reusing Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-based Systems
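    Among the techniques such surveys cover are hand-written wrappers that map page structure to records. As a minimal, standard-library-only illustration of that idea, the sketch below extracts (title, price) pairs from an invented product-listing fragment; the HTML structure and class names are hypothetical, not drawn from the survey.

    from html.parser import HTMLParser

    SAMPLE = """
    <div class="product"><span class="title">Widget</span><span class="price">9.99</span></div>
    <div class="product"><span class="title">Gadget</span><span class="price">24.50</span></div>
    """

    class ProductWrapper(HTMLParser):
        """A tiny hand-written wrapper: map class attributes to record fields."""
        def __init__(self):
            super().__init__()
            self.records, self._field = [], None

        def handle_starttag(self, tag, attrs):
            cls = dict(attrs).get("class")
            if cls == "product":
                self.records.append({})      # start a new record per product div
            elif cls in ("title", "price"):
                self._field = cls            # next text node fills this field

        def handle_data(self, data):
            if self._field and self.records:
                self.records[-1][self._field] = data.strip()
                self._field = None

    wrapper = ProductWrapper()
    wrapper.feed(SAMPLE)
    print(wrapper.records)  # [{'title': 'Widget', 'price': '9.99'}, ...]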

    Research assessment in the humanities: problems and challenges

    Research assessment is going to play a new role in the governance of universities and research institutions. The evaluation of results is evolving from a simple tool for resource allocation into an instrument of policy design. In this respect, 'measuring' implies a different approach to quantitative aspects, as well as the estimation of qualitative criteria that are difficult to define. Bibliometrics became popular, in spite of its limits, precisely because it offers a simple solution to complex problems. The theory behind it is not especially robust, but the available results confirm the method as a reasonable trade-off between costs and benefits. There are, however, fields of science where quantitative indicators are very difficult to apply owing to the lack of databases and data, in short, to the limited credibility of the existing information. The humanities and social sciences (HSS) need a coherent methodology for assessing research outputs, but current projects are not very convincing. Creating a shared ranking of journals by the value of their contents at the institutional, national, or European level is not enough: it raises the same biases as in the hard sciences, and it does not address the variety of output types or their different, much longer timescales of creation and dissemination. The web (and Web 2.0) represents a revolution in the communication of research results, above all in the HSS, and their evaluation has to take this change into account. Furthermore, the growth of open access initiatives (the green and gold roads) offers a large quantity of transparent, verifiable data structured according to international standards, which allows comparability beyond national boundaries and, crucially, is independent of commercial agents. The pilot scheme carried out at the University of Milan for the Faculty of Humanities demonstrated that it is possible to build quantitative, on average more robust, indicators that can provide a proxy for research production and productivity even in the HSS.
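    The abstract does not name the specific indicators used in the Milan pilot. Purely as an illustration of the kind of simple, citation-based quantitative indicator under discussion, the sketch below computes the standard h-index from per-publication citation counts.

    def h_index(citations: list[int]) -> int:
        """Largest h such that at least h publications have >= h citations each."""
        h = 0
        for rank, cites in enumerate(sorted(citations, reverse=True), start=1):
            if cites >= rank:
                h = rank
            else:
                break
        return h

    # Example: five publications with these citation counts give an h-index of 3.
    print(h_index([10, 8, 5, 2, 1]))  # -> 3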