
    CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

    In recent years, the field of document understanding has progressed considerably. A significant part of this progress has been made possible by language models pretrained on large collections of documents. However, the pretraining corpora used in document understanding are single-domain, monolingual, or non-public. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from across the Internet using Common Crawl, as PDF files are the most canonical type of document considered in document understanding. We extensively analysed all steps of the pipeline and propose a solution that trades off data quality against processing time. We also share the CCpdf corpus in the form of an index of PDF files, along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.
    Comment: Accepted at ICDAR 202
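
    As a rough illustration of the kind of harvesting step such a pipeline builds on, the sketch below queries the public Common Crawl CDX index for captures whose MIME type is PDF and fetches one record's bytes with an HTTP range request. The crawl label (CC-MAIN-2023-23) and the example domain are placeholders, and the paper's actual pipeline adds filtering, deduplication, and language identification on top of a step like this.

"""Minimal sketch: query the Common Crawl CDX index for PDF captures and
fetch one record's raw bytes from the corresponding WARC file via an HTTP
range request. Crawl label and domain are illustrative placeholders."""
import gzip
import io
import json

import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-23-index"  # example crawl


def find_pdf_records(domain: str, limit: int = 5) -> list[dict]:
    # Ask the CDX index for captures under `domain` whose MIME type is PDF.
    params = {
        "url": f"{domain}/*",
        "output": "json",
        "filter": "mime:application/pdf",
        "limit": limit,
    }
    resp = requests.get(CDX_API, params=params, timeout=30)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_warc_record(record: dict) -> bytes:
    # Each index entry points at a byte range inside a WARC file hosted on
    # data.commoncrawl.org; a Range request retrieves just that record.
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    url = "https://data.commoncrawl.org/" + record["filename"]
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    resp.raise_for_status()
    # The slice is a gzip member holding WARC and HTTP headers plus the PDF body.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    for rec in find_pdf_records("example.com"):
        raw = fetch_warc_record(rec)
        print(rec["url"], len(raw), "bytes (headers + PDF payload)")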

    Traktat Parkosza. Eksperymentalna edycja elektroniczna (Parkosz's Treatise: An Experimental Electronic Edition)

    The fifteenth-century Latin manuscript presenting the proposal for Polish spelling formulated by Jakub Parkosz poses an interesting challenge for editors, primarily because it introduces special characters that were not used later. The text of the treatise was presented and discussed in a book by Marian Kucała (available under an open licence), which was used as the basis for an experimental electronic edition of the treatise in the form of an interactive PDF file, also available as a LuaLaTeX source. For transliteration, the electronic edition uses existing Unicode letters with some additions from the Private Use Area of the Medieval Unicode Font Initiative. This paper presents the rationale for the transliteration and discusses some possible alternative forms of electronic editions.

    Developing Quantitative Methodologies for the Digital Humanities: A Case Study of 20th Century American Commentary on Russian Literature

    Using scientific methods in the humanities is at the forefront of objective literary analysis. However, processing big data is particularly complex when the subject matter is qualitative rather than numerical. Large volumes of text require specialized tools to produce quantifiable data from ideas and sentiments. Our team researched the extent to which tools such as Weka and MALLET can test hypotheses about qualitative information. We examined the claim that literary commentary exists within political environments, using US periodical articles concerning Russian literature in the early twentieth century as a case study. These tools generated useful quantitative data that allowed us to run stepwise binary logistic regressions. These statistical tests allowed for time series experiments using sea change and emergency models of history, as well as classification experiments with regard to author characteristics, social issues, and sentiment expressed. Both types of experiments supported our claim to varying degrees but, more importantly, served as a definitive demonstration that digitally enhanced quantitative forms of analysis can apply to qualitative data. Our findings set the foundation for further experiments in the emerging field of digital humanities.
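
    As a hedged illustration of the classification side of such experiments, the sketch below fits a binary logistic regression to bag-of-words features derived from a handful of made-up article snippets. The texts, labels, and feature construction are placeholders, not the study's corpus; the study itself worked from Weka and MALLET outputs with stepwise variable selection.

"""Minimal sketch of a binary logistic regression over bag-of-words features,
in the spirit of the classification experiments described above. Texts and
labels are toy placeholders, not the study's corpus."""
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy article snippets and hypothetical labels (1 = politically framed
# commentary, 0 = aesthetically framed commentary).
articles = [
    "Tolstoy's moral vision speaks to universal human questions",
    "Dostoevsky read as a warning against revolutionary politics",
    "Turgenev's realism praised purely for its artistry",
    "Soviet literature dismissed as state propaganda",
]
labels = [0, 1, 0, 1]

# Turn the texts into a document-term count matrix.
X = CountVectorizer(min_df=1).fit_transform(articles)

# Fit and evaluate a logistic regression; with real data one would add
# stepwise (or regularized) feature selection and a proper evaluation split.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, labels, cv=2)
print("cross-validated accuracy:", scores.mean())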

    Arabic Typed Text Recognition in Graphics Images (ATTR-GI)

    While optical character recognition (OCR) techniques may perform well on standard text documents, their performance degrades significantly on graphics images. In standard scanned text documents, OCR techniques enjoy a number of convenient assumptions, such as clear backgrounds, standard fonts, predefined line orientation, page size, and the starting point of the writing. These assumptions do not hold in graphics documents such as Arabic advertisements, personal cards, and screenshots. Therefore, in such images, greater attention is required in the initial stage of detecting Arabic text regions in order for subsequent character recognition steps to be successful. Special features of the Arabic alphabet introduce additional challenges that are not present in Latin alphabet characters. In this research we propose a new technique for automatically detecting text in graphics documents and preparing it for OCR processing. Our detection approach is based on several mathematical measurements used to decide whether a region contains text and whether that text is Arabic-based or Latin-based. The measurements are as follows: the baseline (the row with the maximum number of black pixels), the item area (the content of the extracted sub-images), and finally the maximum peak of adjacent black pixels on the baseline together with the maximum length of sub-adjacent black pixel runs. Our experimental results are presented in more detail in the paper. We believe our technique will enable OCR systems to overcome a major shortcoming when dealing with text in graphics images. This will further enable a variety of OCR-based applications to extend their operation to graphics documents, for example spam detection in images, reading advertisements to blind people, searching and indexing documents that contain images, adapting output to printer properties (black-and-white or colour printers), and enhancing OCR.
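
    A minimal sketch of the baseline measurement described above follows: on a binarized sub-image, the baseline is taken as the row with the maximum number of black pixels, found from a horizontal projection profile, and the longest run of adjacent black pixels on that row is measured as a crude script cue. The thresholding and the toy input are assumptions for illustration, not the paper's exact procedure.

"""Minimal sketch of the baseline measurement described above: the baseline of
a binarized sub-image is the row with the maximum number of black pixels, and
the longest run of adjacent black pixels on that row is a crude script cue."""
import numpy as np


def baseline_row(binary: np.ndarray) -> int:
    # `binary` is a 2D array with 1 for black (ink) pixels and 0 for background.
    projection = binary.sum(axis=1)    # horizontal projection: black pixels per row
    return int(np.argmax(projection))  # row index with the most black pixels


def longest_black_run(row: np.ndarray) -> int:
    # Longest run of adjacent black pixels in a single row.
    best = run = 0
    for px in row:
        run = run + 1 if px else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    # Toy binarized patch; a real pipeline would threshold a scanned sub-image.
    patch = (np.random.rand(32, 64) > 0.85).astype(int)
    b = baseline_row(patch)
    print("baseline row:", b, "| longest run on baseline:", longest_black_run(patch[b]))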

    Content-based image analysis with applications to the multifunction printer imaging pipeline and image databases

    Image understanding is one of the most important topics for various applications. Most image understanding studies focus on content-based approaches, while some also rely on image metadata. Image understanding includes several heavily studied sub-topics such as classification, segmentation, retrieval, and automatic annotation. This thesis proposes several new methods and algorithms for image classification, retrieval, and automatic tag generation. The proposed algorithms have been tested and verified on multiple platforms. For image classification, our proposed method can complete classification in real time under the hardware constraints of an all-in-one printer and adaptively improve itself by online learning. Another image understanding engine, which includes both classification and image quality analysis, is designed to solve the optimal compression problem of a printing system. Our proposed image retrieval algorithm can be applied to either a PC or a mobile device to improve the hybrid learning experience. We also develop a new matrix factorization algorithm to better recover image metadata (tags). The proposed algorithm outperforms other existing matrix factorization methods.
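
    For context on the tag-recovery setting, the sketch below shows a generic low-rank matrix factorization baseline: a partially observed image-by-tag matrix is factored with gradient descent and the reconstruction is used to score unobserved (image, tag) pairs. The data, rank, and learning rate are placeholders, and this is a standard baseline rather than the thesis's proposed algorithm.

"""Minimal sketch of low-rank matrix factorization for recovering missing
image tags: factor a partially observed image-by-tag matrix with gradient
descent and score unobserved (image, tag) pairs from the reconstruction.
A generic baseline, not the thesis's proposed algorithm."""
import numpy as np


def factorize(R, mask, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    # R: image-by-tag matrix (1 = tag present); mask: 1 where R is observed.
    rng = np.random.default_rng(seed)
    n_imgs, n_tags = R.shape
    U = 0.1 * rng.standard_normal((n_imgs, rank))
    V = 0.1 * rng.standard_normal((n_tags, rank))
    for _ in range(epochs):
        E = mask * (R - U @ V.T)          # error on observed entries only
        U += lr * (E @ V - reg * U)       # gradient step on image factors
        V += lr * (E.T @ U - reg * V)     # gradient step on tag factors
    return U @ V.T                        # dense score matrix for all pairs


if __name__ == "__main__":
    R = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    mask = np.array([[1.0, 1.0, 0.0],     # (image 0, tag 2) is held out
                     [1.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0]])
    scores = factorize(R, mask)
    print("score for held-out (image 0, tag 2):", round(float(scores[0, 2]), 3))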