
    Preserving Social Media: the Problem of Access

    As the applications and services made possible through Web 2.0 continue to proliferate and influence the way individuals exchange information, the landscape of social science research, as well as research in the humanities and the arts, has the potential to change dramatically and to be enriched by a wealth of new, user-generated data. In response to this phenomenon, the UK Data Service have commissioned the Digital Preservation Coalition to undertake a 12-month study into the preservation of social media as part of the ‘Big Data Network’ programme funded by the Economic and Social Research Council (ESRC). The larger study focuses on the potential uses and accompanying challenges of data generated by social networking applications. This paper, ‘Preserving Social Media: the Problem of Access’, comprises an excerpt of that longer study, allowing the authors a space to explore in closer detail the issue of making social media archives accessible to researchers and students now and in the future. To do this, the paper addresses use cases that demonstrate the potential value of social media to academic social science. Furthermore, it examines how researchers and collecting institutions acquire and preserve social media data within a context of curatorial and legislative restrictions that may prove an even greater obstacle to access than any technical restrictions. Based on analysis of these obstacles, it will first examine existing methods of curating and preserving social media archives, and second, make some recommendations for how collecting institutions might approach the long-term preservation of social media in a way that protects the individuals represented in the data and complies with the conditions of third-party platforms.
With the understanding that web-based communication technologies will continue to evolve, this paper will focus on the overarching properties of social media, analysing and comparing current methods of curation and preservation that provide sustainable solutions.

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather large amounts of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains. Comment: Knowledge-based System

    iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling

    Researchers in the Digital Humanities and journalists need to monitor, collect and analyze fresh online content regarding current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects and therefore cannot achieve thematically coherent and fresh Web collections. Social Media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of enabling the collection of fresh and relevant Web and Social Web content for a topic of interest through seamless integration of Web and Social Media in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content for guiding the crawler. Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201

    Describing Web Archives: A Computer-Assisted Approach

    Currently, web archives are challenging for users to discover and use. Many archives and libraries are actively collecting web archives, but description in this area has been dominated by bibliographic approaches, which do not connect web archives to existing description or contextual information, and have often resulted in format-based silos. This is primarily because web archiving tools such as Archive-It arrange materials by seeds and groups of seeds, which reflect the complex technical process of web crawling or web recording, and are often not very meaningful to users or helpful for discovery. This article makes the case for arranging and describing web archives in meaningful aggregates according to established standards, showing how archival practices allow archivists to arrange the diversity of web content according to their common forms and functions while empowering them to be creative with their time and thoughtful with their labor. It provides a path to exposing important provenance information to users and demonstrates an existing proof of concept. Finally, it outlines a possible integration between ArchivesSpace and Archive-It that is feasible to implement for many archives and would automate the repetitive parts of creating and updating description for new web crawls.

    Unruly Records: Personal Archives, Sociotechnical Infrastructure, and Archival Practice

    Personal records have long occupied a complicated space within archival theory and practice. The archival profession, as it is practiced in the United States today, developed with organizational records, such as those created by governments and businesses, in mind. Personal records were considered to fall beyond the bounds of archival work and were primarily cared for by libraries and other cultural heritage institutions. Since the mid-20th century, this divide has become less pronounced, and it has become common to find personal records within archival institutions. As a result of these conditions in the development of the profession, the archivists who work with personal records have had to reconcile the specific characteristics of personal materials with theoretical and practical approaches that were designed not only to accommodate organizational records but to explicitly exclude personal records. These conditions have been further complicated by the continually changing technological landscape in which personal records are now created. As ownership of personal computers, access to the World Wide Web, and the use of networked social platforms have grown, personal records have increasingly come to be created, stored, and accessed within complex socio-technical systems. The infrastructures that support personal digital record creation today precipitate new methods and strategies, and an abundance of new questions, for the archivists who are responsible for collecting and preserving digital cultural heritage. This dissertation considers how both the history of excluding personal records in the archival profession and the socio-technical systems that support contemporary personal record creation impact archival practice today. This research considers archival approaches to working with personal records created within three environments: personal computers, the open web, and networked social platforms. 
Ultimately, this dissertation seeks to reevaluate the role that personal records have previously occupied, and to center the personal in archival practice today.

    Web archives: the future

    This report is structured first to engage in some speculative thought about the possible futures of the web, as an exercise in prompting us to think about what we need to do now in order to make sure that we can reliably and fruitfully use archives of the web in the future. Next, we turn to considering the methods and tools being used to research the live web, as a pointer to the types of things that can be developed to help understand the archived web. Then, we turn to a series of topics and questions that researchers want or may want to address using the archived web. In this final section, we identify some of the challenges individuals, organizations, and international bodies can target to increase our ability to explore these topics and answer these questions. We end the report with some conclusions based on what we have learned from this exercise.

    Accessing Web Archives: Integrating an Archive-It Collection into EBSCO Discovery Service

    Effective collaboration between archives and technical services can increase the discoverability of special collection materials. Archivists at the University of Dayton Libraries began using Archive-It to capture websites relevant to their collecting policies in 2015. However, the collections were only made available to users from the University of Dayton page on the Archive-It website. Content was isolated in a separate platform and was not promoted to users. Working together, the team of archivists and technical services librarians incorporated the web archive collections into the Libraries’ EBSCO Discovery Service (EDS) discovery layer. A local data dictionary was created based on OCLC’s Descriptive Metadata for Web Archiving report (2018), and metadata was added at the seed and collection levels. The result was indexed content on a single, user-friendly platform. The web archive collections were then marketed to the University of Dayton community, and statistics were generated on their use.