3,616 research outputs found

    Impact Analysis of OCR Quality on Research Tasks in Digital Archives

    Humanities scholars increasingly rely on digital archives for their research in place of time-consuming visits to physical archives. This shift in research methodology has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. Based on the interviews and a literature study, we provide a classification scheme relating scholarly research tasks to their specific OCR-induced uncertainty and the data required for more reliable uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers and data providers’ side is insufficient and needs further research to be improved

    Impact Analysis of OCR Quality on Research Tasks in Digital Archives

    Humanities scholars increasingly rely on digital archives for their research instead of time-consuming visits to physical archives. This shift in research method has the hidden cost of working with digitally processed historical documents: how much trust can a scholar place in noisy representations of source texts? In a series of interviews with historians about their use of digital archives, we found that scholars are aware that optical character recognition (OCR) errors may bias their results. They were, however, unable to quantify this bias or to indicate what information they would need to estimate it. This, however, would be important to assess whether the results are publishable. Based on the interviews and a literature study, we provide a classification of scholarly research tasks that gives account of their susceptibility to specific OCR-induced biases and the data required for uncertainty estimations. We conducted a use case study on a national newspaper archive with example research tasks. From this we learned what data is typically available in digital archives and how it could be used to reduce and/or assess the uncertainty in result sets. We conclude that the current knowledge situation on the users’ side as well as on the tool makers’ and data providers’ side is insufficient and needs to be improved

    Examining and improving the effectiveness of relevance feedback for retrieval of scanned text documents

    Important legacy paper documents are digitized and collected in online accessible archives. This enables the preservation, sharing, and, significantly, the searching of these documents. The text contents of these document images can be transcribed automatically using OCR systems and then stored in an information retrieval system. However, OCR systems make errors in character recognition which have previously been shown to impact document retrieval behaviour. In particular, relevance feedback query-expansion methods, which are often effective for improving electronic text retrieval, are observed to be less reliable for retrieval of scanned document images. Our experimental examination of the effects of character recognition errors on an ad hoc OCR retrieval task demonstrates that, while baseline information retrieval can remain relatively unaffected by transcription errors, relevance feedback via query expansion becomes highly unstable. This paper examines the reason for this behaviour, and introduces novel modifications to standard relevance feedback methods. These methods are shown experimentally to improve the effectiveness of relevance feedback for errorful OCR transcriptions. The new methods combine similar recognised character strings based on term collection frequency and a string edit-distance measure. The techniques are domain independent and make no use of external resources such as dictionaries or training data

    Access and Preservation in Archival Mass Digitization Projects

    [Excerpt] In 2014, the Dalhousie University Archives began its first archival mass digitization project with the Elisabeth Mann Borgese fonds. The successful completion of this project required the project team to address both broad and specific technical and intellectual challenges, from rights management in an online access environment to the durability of the equipment used. To best understand the challenges faced, there will first be a brief introduction to the fonds and project goals of balancing preservation and access before moving on to a discussion of these challenges in further detail, and finally, concluding with a discussion of some considerations, best practices, and lessons learned from this project

    Diploma qualifications monitoring: findings from the scrutiny of level 2 and level 3 diploma constituent qualifications in 2010

    In 2010, the Office of Qualifications and Examinations Regulation (Ofqual) monitored three new specifications in principal learning: Edexcel level 3 in Construction and the Built Environment; OCR level 3 in Information Technology; and VTCT level 2 in Hair and Beauty Studies. We also completed scrutinies of AQA-City & Guilds level 2 in Engineering; Edexcel level 2 in Society, Health and Development; and OCR level 2 in Creative and Media that were begun in 2009; and we conducted a scrutiny of AQA-City & Guilds level 3 extended project qualification. The findings from our monitoring of these qualifications are detailed in this report. In 2010, the number of candidates who entered the principal learning qualifications remained relatively small, and there was little evidence of candidate performance available at the higher grades. Centres and candidates are still adapting to the demands of these new qualifications, and awarding organisations are still establishing the standards. In general, we found that the scrutinised qualifications addressed the specification content and learning outcomes appropriately, and assessments were varied and challenging for the full range of candidates. However, there were a number of findings that related to each specification individually. These findings included opportunities to improve: the design of question papers and mark schemes; the guidance to centres and consortia; and awarding organisation procedures for the training of examiners and moderators, and for setting grades. We have required awarding organisations to agree appropriate action plans to address the issues raised by our monitoring. We will monitor the implementation of these action plans in future series

    Beyond English text: Multilingual and multimedia information retrieval.

    Non

    The TREC2001 video track: information retrieval on digital video information

    The development of techniques to support content-based access to archives of digital video information has recently started to receive much attention from the research community. During 2001, the annual TREC activity, which has been benchmarking the performance of information retrieval techniques on a range of media for 10 years, included a “track” or activity which allowed investigation into approaches to support searching through a video library. This paper is not intended to provide a comprehensive picture of the different approaches taken by the TREC2001 video track participants but instead we give an overview of the TREC video search task and a thumbnail sketch of the approaches taken by different groups. The reason for writing this paper is to highlight the message from the TREC video track that there are now a variety of approaches available for searching and browsing through digital video archives, that these approaches do work, are scalable to larger archives and can yield useful retrieval performance for users. This has important implications in making digital libraries of video information attainable

    DARIAH and the Benelux
