
    "Q i-jtb the Raven": Taking Dirty OCR Seriously

    This article argues that scholars must understand mass digitized texts as assemblages of new editions, subsidiary editions, and impressions of their historical sources, and that these various parts require sustained bibliographic analysis and description. To adequately theorize any research conducted in large-scale text archives—including research that includes primary or secondary sources discovered through keyword search—we must avoid the myth of surrogacy proffered by page images and instead consider directly the text files they overlay. Focusing on the OCR (optical character recognition) from which most large-scale historical text data derives, this article argues that the results of this "automatic" process are in fact new editions of their source texts that offer unique insights into both the historical texts they remediate and the more recent era of their remediation. The constitution and provenance of digitized archives are, to some extent at least, knowable and describable. Just as details of type, ink, or paper, or paratext such as printer's records can help us establish the histories under which a printed book was created, details of format, interface, and even grant proposals can help us establish the histories of corpora created under conditions of mass digitization

    Digital Libraries, Intelligent Data Analytics, and Augmented Description: A Demonstration Project

    From July 16 to November 8, 2019, the Aida digital libraries research team at the University of Nebraska-Lincoln collaborated with the Library of Congress on “Digital Libraries, Intelligent Data Analytics, and Augmented Description: A Demonstration Project.” This demonstration project sought to (1) develop and investigate the viability and feasibility of textual and image-based data analytics approaches to support and facilitate discovery; (2) understand technical tools and requirements for the Library of Congress to improve access and discovery of its digital collections; and (3) enable the Library of Congress to plan for future possibilities. In pursuit of these goals, we focused our work around two areas: extracting and foregrounding visual content from Chronicling America (chroniclingamerica.loc.gov) and applying a series of image processing and machine learning methods to minimally processed manuscript collections featured in By the People (crowd.loc.gov). We undertook a series of explorations and investigated a range of issues and challenges related to machine learning and the Library’s collections. This final report details the explorations, addresses social and technical challenges that arose in the explorations and that are critical context for the development of machine learning in the cultural heritage sector, and makes several recommendations to the Library of Congress as it plans for future possibilities. We propose two top-level recommendations. First, the Library should focus the weight of its machine learning efforts and energies on social and technical infrastructures for the development of machine learning in cultural heritage organizations, research libraries, and digital libraries. Second, we recommend that the Library invest in continued, ongoing, intentional explorations and investigations of particular machine learning applications to its collections.
Both of these top-level recommendations map to the three goals of the Library’s 2019 digital strategy. Within each top-level recommendation, we offer three more concrete, short- and medium-term recommendations. They include, under social and technical infrastructures: (1) Develop a statement of values or principles that will guide how the Library of Congress pursues the use, application, and development of machine learning for cultural heritage. (2) Create and scope a machine learning roadmap for the Library that looks both internally to the Library of Congress and its needs and goals and externally to the larger cultural heritage and other research communities. (3) Focus efforts on developing ground truth sets and benchmarking data and making these easily available. Nested under the recommendation to support ongoing explorations and investigations, we recommend that the Library: (4) Join the Library of Congress’s emergent efforts in machine learning with its existing expertise and leadership in crowdsourcing. Combine these areas as “informed crowdsourcing” as appropriate. (5) Sponsor challenges for teams to create additional metadata for digital collections in the Library of Congress. As part of these challenges, require teams to engage across a range of social and technical questions and problem areas. (6) Continue to create and support opportunities for researchers to partner in substantive ways with the Library of Congress on machine learning explorations. Each of these recommendations speaks to the investigation and challenge areas identified by Thomas Padilla in Responsible Operations: Data Science, Machine Learning, and AI in Libraries. This demonstration project, through its explorations, discussion, and recommendations, shows the potential of machine learning toward a variety of goals and use cases, and it argues that the technology itself will not be the hardest part of this work.
The hardest part will be the myriad challenges to undertaking this work in ways that are socially and culturally responsible, while also upholding responsibility to make the Library of Congress’s materials available in timely and accessible ways. Fortunately, the Library of Congress is in a remarkable position to advance machine learning for cultural heritage organizations, through its size, the diversity of its collections, and its commitment to digital strategy

    DARIAH and the Benelux


    Tear Down This (Pay)Wall! Equality, Equity, Liberation for Archivists

    This paper critically examines the practice of placing archival collections behind paywalls, starting with a microfilming decision that led to portions of collections stewarded by the author’s archives being offered for sale as part of large for-profit subject-based collections. The author uses economic and values-based arguments to illustrate how commodifying the archives by putting collections behind paywalls can be harmful for university libraries, archives, and the communities whose histories are hidden from them. The author then questions the existence of paywalled resources based on our professional associations’ codes of ethics. The author offers a tool from the field of service-learning that might be used to evaluate how archives can interact ethically with communities, and uses a radical empathy lens to illustrate how various digital initiatives have wrestled with the ethics of paywalled resources and the marginalized communities they originate from. Finally, the author describes efforts to critically examine and disrupt current practices using a radical empathy framing, and offers practical solutions for archival institutions to take the first step toward a liberatory digital archive available to all. Pre-print first published online 04/14/202

    Jefferson Institute's Military Archives Project in Serbia: From Ruins of War, a Nation's History Preserved

    Analyzes the impact and challenges of a project supported by Knight to digitize Serbia's military documents and make them publicly available in a searchable archive, including evidence for prosecuting war criminals and locating secret mass graves

    Increasing Our Vision for 21st-Century Digital Libraries

    This presentation reads digital library interfaces, or their main door interfaces, as glimpses into what we have thus far valued in the development of digital libraries; frames a visual way of thinking about textual materials; introduces the work of our research team (where we are now, and where we're headed); and draws some connections between the parts. This presentation is very much a look into thinking in process and work in progress and proposes the following ideas. As a community, we can do much more with the digital images we're creating of textual materials than we've heretofore done. We aspire to have additional layers or levels of image analysis become part of the default processing work in the creation of digital libraries, not only as something that happens external or parallel to digital libraries, and not only toward the purpose of generating text. We aspire to more processing up front and iterative processing of materials, so that digital libraries' materials are not once and done, and so that this additional processing is presented to users as additional options for how they can explore digital libraries, find materials of relevance, and imagine new possibilities. Even as the digital libraries community focuses on supporting computational use of digital libraries (and our research team recognizes that our project very much depends on that computational use being supported), we should not leave behind, in 1998, those users of digital libraries for whom computational use is not their point of entry. (More on that date in a moment.

    Improving Public Record Access

    Nantucket's public and historic records are maintained by many different institutions and are kept in various forms. The project's goal was to address this fragmentation and find a way to improve access to public and historic records on Nantucket. The team researched other collaborative digitization projects and interviewed record-holding organizations on the island to create an inventory of existing records and to gauge interest in the creation of a single website to provide access to Nantucket's records. The team identified the key steps for a successful digital collaborative project, developed a prototype records database and web interface, and recommended how Nantucket should move this effort forward

    Microfilm, Manuscripts, and Photographs: A Case Study Comparing Three Large-Scale Digitization Projects

    This article is a case study comparing three large-scale digitization projects at the University of Nevada, Las Vegas (UNLV) Libraries: the Culinary Union Workers Local 226 Photographs, the Nevada Digital Newspaper Project, and the Entertainment Project. The authors compare the project management, workflows, and decision-making related to the many aspects of digitizing special collections and archives materials. The projects used both outsourced vendors and in-house labor and equipment to digitize microfilmed newspapers, mixed-materials manuscript collections, and photographic prints and negatives. Roles and responsibilities; grant funding; copyright, privacy, and confidentiality; arrangement; formats; and metadata are all discussed in relation to large-scale digitization

    The Vermont Digital Newspaper Project and the National Digital Newspaper Program: Cooperative Efforts in Long-Term Digital Newspaper Access and Preservation

    The Vermont Digital Newspaper Project (VTDNP) is a state partner in the National Digital Newspaper Program (NDNP). Developed by the National Endowment for the Humanities (NEH) and the Library of Congress (LC), the NDNP is a long-term, national effort to build a freely accessible, searchable Internet database of historical US newspapers. NEH provides funding to state projects to select and digitize historic newspapers published between 1836 and 1922. LC provides the technical support and framework for preservation digitization. Digitized newspapers are archived by LC and made freely available through the website Chronicling America: Historic American Newspapers. Vermont joined the NDNP in July 2010, when the University of Vermont Libraries were awarded NEH funding to embark collaboratively with state partners—including the Vermont Department of Libraries, the Ilsley Public Library of Middlebury, and the Vermont Historical Society—on the Vermont Digital Newspaper Project. Institutional partnerships and the engagement of committed individuals serve as a foundation to the VTDNP and provide an avenue to expand statewide infrastructures to accommodate large-scale microfilm-to-digital conversion and preservation efforts. Through collaboration and outreach, project partners select and digitize historical newspapers from microfilm and promote Chronicling America, a tool for discovery of these primary historical resources