13 research outputs found

    Optimizing the neural network training for OCR error correction of historical Hebrew texts

    Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition (OCR). This technology is error-prone, especially for historical documents. To correct OCR errors, post-processing algorithms have been proposed based on natural language analysis and machine learning techniques such as neural networks. A disadvantage of neural networks is the vast amount of manually labeled training data they require, which is often unavailable. This paper proposes an innovative method for training a lightweight neural network for Hebrew OCR post-correction using significantly less manually created data. The main research goal is to develop a method for automatically generating language- and task-specific training data to improve the neural network results for OCR post-correction, and to investigate which type of dataset is the most effective for OCR post-correction of historical documents. To this end, a series of experiments using several datasets was conducted. The evaluation corpus was based on Hebrew newspapers from the JPress project. Historical OCRed newspapers were analyzed to learn common language- and corpus-specific OCR errors. We found that training the network using the proposed method is more effective than using randomly generated errors. The results also show that the performance of the neural network for OCR post-correction strongly depends on the genre and area of the training data. Moreover, neural networks that were trained with the proposed method outperform other state-of-the-art neural networks for OCR post-correction and complex spellcheckers. These results may have practical implications for many digital humanities projects.

    Toward the Optimized Crowdsourcing Strategy for OCR Post-Correction

    Digitization of historical documents is a challenging task in many digital humanities projects. A popular approach for digitization is to scan the documents into images, and then convert images into text using Optical Character Recognition (OCR) algorithms. However, the outcome of OCR processing of historical documents is usually inaccurate and requires post-processing error correction. This study investigates how crowdsourcing can be utilized to correct OCR errors in historical text collections, and which crowdsourcing methodology is the most effective in different scenarios and for various research objectives. A series of experiments with different micro-task structures and text lengths was conducted with 753 workers on Amazon's Mechanical Turk platform. The workers had to fix OCR errors in a selected historical text. To analyze the results, new accuracy and efficiency measures were devised. The analysis suggests that in terms of accuracy, the optimal text length is medium (paragraph-size) and the optimal structure of the experiment is two-phase with a scanned image. In terms of efficiency, the best results were obtained when using longer text in the single-stage structure with no image. The study provides practical recommendations to researchers on how to build the optimal crowdsourcing task for OCR post-correction. The developed methodology can also be utilized to create gold-standard historical texts for automatic OCR post-correction. This is the first attempt to systematically investigate the influence of various factors on crowdsourcing-based OCR post-correction and propose an optimal strategy for this process.
    Comment: 25 pages, 12 figures, 1 table

    Trust and attitude toward information presented using augmented reality and other technological means

    In recent years, augmented reality (AR) technology has grown, and its use has become widespread among smartphone users. People are consuming more and more digital information from various sources and in different presentation modes. Therefore, in this study, we investigate the extent to which different presentation modes relate to the level of trust in information, while considering demographic variables, as well as personality traits and thinking styles. The participants in our experiments were asked to indicate whether certain statements that were presented in various presentation modes (image + text, image + audio, AR + text, AR + audio) were true or false. The results indicate that users are more likely to trust statements that are accompanied by AR than statements that are accompanied by a static image. In addition, younger participants have greater trust in audio-presented information than text-presented information. As AR is expected to grow considerably in popularity in the next few years, users should be cautious of the potential impact on their trust in digital information while using AR.

    When Suboptimal Rules

    This paper represents a paradigm shift in what advice agents should provide people. Contrary to what was previously thought, we empirically show that agents that dispense optimal advice will not necessarily facilitate the best improvement in people's strategies. Instead, we claim that agents should at times advise suboptimally. We provide results demonstrating the effectiveness of a suboptimal advising approach in extensive experiments in two canonical mixed agent-human advice-giving domains. Our proposed guideline for suboptimal advising is to rely on the level of intuitiveness of the optimal advice as a measure for how much the suboptimal advice presented to the user should drift from the optimal value.

    Enhancing Crowdworkers' Vigilance

    This paper presents methods for improving the attention span of workers in tasks that heavily rely on their attention to the occurrence of rare events. The underlying idea in our approach is to dynamically augment the task with some dummy (artificial) events at different times throughout the task, rewarding the worker upon identifying and reporting them. The proposed approach is an alternative to the traditional approach of exclusively relying on rewarding the worker for successfully identifying the event of interest itself. We propose three methods for timing the dummy events throughout the task. Two of these methods are static and determine the timing of the dummy events at random or uniformly throughout the task. The third method is dynamic and uses the identification (or misidentification) of dummy events as a signal for the worker's attention to the task, adjusting the rate of dummy event generation accordingly.
    Engineering and Applied Science

    Monetary Compensation and Private Information Sharing in Augmented Reality Applications

    This research studied people’s responses to requests for access to their personal information when using augmented reality (AR) technology. AR is a new technology that superimposes digital information onto the real world, creating a unique user experience. As such, AR is often associated with the collection and use of personal information, which may lead to significant privacy concerns. To investigate these potential concerns, we adopted an experimental approach and examined people’s actual responses to real-world requests for various types of personal information while using a designated AR application on their personal smartphones. Our results indicate that the majority (57%) of people are willing to share sensitive personal information with an unknown third party without any compensation other than using the application. Moreover, individuals’ willingness to allow access varies across kinds of personal information. For example, while 75% of participants were open to granting access to their microphone, only 35% of participants agreed to allow access to their contacts. Lastly, monetary compensation is linked with an increased willingness to share personal information. When no compensation was offered, only 35% of the participants agreed to grant access to their contacts, but when a low compensation was offered, 57.5% of the participants agreed. Together, these findings suggest several practical implications for the development and distribution of AR technologies.