Steganoscription: exploring techniques for privacy-preserving crowdsourced transcription of handwritten documents
The focus of my research is the historical document format represented by the Central State Hospital (CSH) dataset: handwritten medical records. The specific problem inherent to the CSH dataset is how to transcribe sensitive, cursive-handwritten documents through a manual process such as crowdsourcing. Manual methods are necessary, no matter how sophisticated the optical character recognition system used, because of the inconsistencies within cursive script. To address this problem I have developed an application that enables users to transcribe sensitive handwritten document images while preserving the privacy of the context around the transcribed text, via random word selection and visual manipulation of the displayed text. This is made possible through several algorithms that process documents in a top-down fashion. These operations detect and segment lines of text in images, reverse the slant common to cursive script, detect and segment words, and finally manipulate word images before they are displayed to users; combinations of color, noise, and geometric manipulations are currently supported and applied at random. This system, called Steganoscription, combines the concepts of steganography and transcription.
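The word-image manipulation step lends itself to a short illustration. The following is a minimal sketch of how a random combination of color, noise, and geometric manipulations might be applied to a segmented word image before display, assuming Pillow and NumPy; the function name, probabilities, and parameter ranges are hypothetical rather than Steganoscription's actual values.

```python
# Hypothetical sketch of the word-image obfuscation step described above:
# apply a random combination of color, noise, and geometric manipulations
# to a segmented word image before showing it to a crowd worker.
import random
import numpy as np
from PIL import Image, ImageOps

def obfuscate_word_image(word: Image.Image) -> Image.Image:
    img = word.convert("RGB")
    if random.random() < 0.5:                      # color manipulation
        img = ImageOps.invert(img)
    if random.random() < 0.5:                      # additive Gaussian noise
        arr = np.asarray(img).astype(np.int16)
        noise = np.random.normal(0, 25, arr.shape)
        arr = np.clip(arr + noise, 0, 255).astype(np.uint8)
        img = Image.fromarray(arr)
    if random.random() < 0.5:                      # geometric (shear) manipulation
        shear = random.uniform(-0.2, 0.2)
        img = img.transform(img.size, Image.AFFINE,
                            (1, shear, 0, 0, 1, 0),
                            resample=Image.BILINEAR,
                            fillcolor=(255, 255, 255))
    return img
```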
CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data
In recent years, the field of document understanding has progressed considerably. A significant part of this progress has been possible thanks to the use of language models pretrained on large collections of documents. However, the pretraining corpora used in the domain of document understanding are single-domain, monolingual, or non-public. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from all over the Internet using Common Crawl, as PDF files are the most canonical type of document considered in document understanding. We analysed all of the steps of the pipeline extensively and proposed a solution that is a trade-off between data quality and processing time. We also share the CCpdf corpus in the form of an index of PDF files, along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.
Comment: Accepted at ICDAR 202
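Since the corpus is shared as an index of PDF files plus a download script, a downloader is easy to picture. The sketch below assumes a plain-text index with one URL per line and uses the requests library; the index file name and layout are illustrative, not CCpdf's actual format.

```python
# Hypothetical sketch of a downloader for an index of PDF URLs such as the
# one CCpdf ships. The one-URL-per-line layout is an assumption.
import hashlib
import pathlib
import requests

def download_index(index_path: str, out_dir: str = "pdfs") -> None:
    out = pathlib.Path(out_dir)
    out.mkdir(exist_ok=True)
    for line in pathlib.Path(index_path).read_text().splitlines():
        url = line.strip()
        if not url:
            continue
        # Name each file by the hash of its URL to avoid collisions.
        name = hashlib.sha1(url.encode()).hexdigest() + ".pdf"
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            (out / name).write_bytes(resp.content)
        except requests.RequestException as exc:
            print(f"skipped {url}: {exc}")

download_index("ccpdf_index.txt")
```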
Traktat Parkosza. An Experimental Electronic Edition
The fifteenth-century Latin manuscript presenting the proposal of Polish spelling formulated by Jakub Parkosz is an interesting challenge to editors, primarily because it introduces special characters that were not used later. The text of the treatise was presented and discussed in a book by Marian Kucała (available under an open licence), which was used as the basis for an experimental electronic edition of the treatise in the form of an interactive PDF file, also available as LuaLaTeX source. For transliteration, the electronic edition uses existing Unicode letters with some additions from the private use area of the Medieval Unicode Font Initiative. This paper presents the rationale for the transliteration and discusses some possible alternative forms of electronic editions.
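The transliteration strategy can be illustrated with a small mapping table. The sketch below shows the general idea of combining existing Unicode letters with Private Use Area code points such as those registered by the Medieval Unicode Font Initiative; the token names and code point assignments are invented for illustration and do not reproduce the edition's actual table.

```python
# Illustrative transliteration table: map the treatise's special characters
# to existing Unicode letters, falling back to Private Use Area code points
# (the MUFI range lies within U+E000..U+F8FF). Mappings here are made up.
TRANSLITERATION = {
    "a_acute": "\u00E1",   # existing Unicode letter: á
    "b_soft": "\uE000",    # hypothetical MUFI PUA code point
}

def transliterate(tokens: list[str]) -> str:
    """Replace symbolic token names with their Unicode representations."""
    return "".join(TRANSLITERATION.get(t, t) for t in tokens)

print(transliterate(["b_soft", "a_acute"]))
```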
Developing Quantitative Methodologies for the Digital Humanities: A Case Study of 20th Century American Commentary on Russian Literature
Using scientific methods in the humanities is at the forefront of objective literary analysis. However, processing big data is particularly complex when the subject matter is qualitative rather than numerical. Large volumes of text require specialized tools to produce quantifiable data from ideas and sentiments. Our team researched the extent to which tools such as Weka and MALLET can test hypotheses about qualitative information. We examined the claim that literary commentary exists within political environments, using US periodical articles concerning Russian literature in the early twentieth century as a case study. These tools generated useful quantitative data that allowed us to run stepwise binary logistic regressions. These statistical tests allowed for time series experiments using sea change and emergency models of history, as well as classification experiments with regard to author characteristics, social issues, and sentiment expressed. Both types of experiments supported our claim to varying degrees but, more importantly, served as a definitive demonstration that digitally enhanced quantitative forms of analysis can apply to qualitative data. Our findings set the foundation for further experiments in the emerging field of digital humanities.
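As an illustration of the statistical step, the sketch below runs a binary logistic regression with stepwise (sequential) feature selection over a stand-in feature matrix. The team worked with Weka and MALLET; scikit-learn stands in here, and the features (e.g. per-article topic proportions) and labels are synthetic.

```python
# Minimal sketch of a stepwise binary logistic regression over quantitative
# features derived from text. X and y below are synthetic stand-ins.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((200, 10))                  # 200 articles x 10 topic proportions
y = (X[:, 0] + X[:, 3] > 1).astype(int)    # stand-in binary outcome

model = LogisticRegression(max_iter=1000)
selector = SequentialFeatureSelector(model, n_features_to_select=3,
                                     direction="forward")
selector.fit(X, y)
print("selected features:", np.flatnonzero(selector.get_support()))
model.fit(X[:, selector.get_support()], y)  # final model on selected features
```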
Arabic Typed Text Recognition in Graphics Images (ATTR-GI)
While optical character recognition (OCR) techniques may perform well on standard text documents, their performance degrades significantly on graphics images. In standard scanned text documents, OCR techniques enjoy a number of convenient assumptions, such as clear backgrounds, standard fonts, a predefined line orientation, a known page size, and a known starting point of the written text. These assumptions do not hold in graphics documents such as Arabic advertisements, personal cards, and screenshots. In such images, therefore, greater attention is required in the initial stage of detecting Arabic text regions in order for subsequent character recognition steps to succeed. Special features of the Arabic alphabet introduce additional challenges that are not present in Latin alphabet characters. In this research we propose a new technique for automatically detecting text in graphics documents and preparing it for OCR processing. Our detection approach is based on mathematical measurements that determine whether a region contains text and, if so, whether the text is Arabic-based or Latin-based. These measurements are as follows: the baseline (the line with the maximum number of black pixels); the item area (the content of the extracted sub-images); and, finally, the maximum peak of adjacent black pixels in the baseline and the maximum length of runs of adjacent black pixels. Our experimental results are presented in more detail in the paper. We believe our technique will enable OCR systems to overcome a major shortcoming when dealing with text in graphics images. This will further enable a variety of OCR-based applications to extend their operation to graphics documents, for example detecting image-based SPAM, reading advertisements aloud for blind people, searching and indexing documents that contain images, selecting appropriate printer properties (black-and-white or color), and enhancing OCR itself.
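The baseline measurement is concrete enough to sketch. Assuming a binarized word image with text pixels equal to 1, the baseline is the row with the maximum number of black pixels, and the longest run of adjacent black pixels on that row can be measured alongside it; the code below is a minimal NumPy illustration, not the paper's implementation.

```python
# Minimal sketch of the baseline measurement described above, for a
# binarized image where text pixels are 1 and background pixels are 0.
import numpy as np

def baseline_features(binary: np.ndarray) -> tuple[int, int]:
    """Return (baseline row index, longest black-pixel run on that row)."""
    row_counts = binary.sum(axis=1)          # black pixels per row
    baseline = int(np.argmax(row_counts))    # row with the maximum count
    run = longest = 0
    for px in binary[baseline]:              # scan the baseline row
        run = run + 1 if px else 0
        longest = max(longest, run)
    return baseline, longest
```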
Content-based image analysis with applications to the multifunction printer imaging pipeline and image databases
Image understanding is one of the most important topics for a variety of applications. Most image understanding studies focus on a content-based approach, while some others also rely on image metadata. Image understanding includes several heavily studied sub-topics, such as classification, segmentation, retrieval, and automatic annotation. This thesis proposes several new methods and algorithms for image classification, retrieval, and automatic tag generation. The proposed algorithms have been tested and verified on multiple platforms. For image classification, our proposed method can complete classification in real time under the hardware constraints of an all-in-one printer and can adaptively improve itself through online learning. Another image understanding engine, which includes both classification and image quality analysis, is designed to solve the optimal compression problem of a printing system. Our proposed image retrieval algorithm can be applied to either a PC or a mobile device to improve the hybrid learning experience. We also develop a new matrix factorization algorithm to better recover image metadata (tags). The proposed algorithm outperforms other existing matrix factorization methods.
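The tag recovery step rests on matrix factorization, which is easy to sketch in its generic form. The code below factors a partially observed image-by-tag matrix with simple gradient steps and scores missing entries from the reconstruction; it illustrates the broad technique only and is not the thesis's algorithm.

```python
# Generic matrix factorization for tag recovery (NOT the thesis's method).
# A partially observed image-by-tag matrix R is factored as U @ V.T, and
# missing tags are scored from the reconstruction.
import numpy as np

def factorize(R, mask, rank=5, lr=0.01, reg=0.1, epochs=200):
    """R: image-by-tag matrix; mask: 1 where a tag is observed, else 0."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(R.shape[0], rank))
    V = rng.normal(scale=0.1, size=(R.shape[1], rank))
    for _ in range(epochs):
        err = mask * (R - U @ V.T)        # error on observed entries only
        U += lr * (err @ V - reg * U)     # gradient steps with L2 penalty
        V += lr * (err.T @ U - reg * V)
    return U @ V.T                        # scores for all (image, tag) pairs
```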
Educational use cases from a shared exploration of e-books and iPads
E-books and e-book readers are becoming increasingly widely available, particularly for the general reader, and there have been many studies on their adoption. However, less is known about their use for educational and academic purposes. We report here on work carried out by academic and teaching staff on e-books and e-book applications using iPads. After considering pedagogical issues and reporting survey results, we identify a spiral of six key use case areas for e-books. This spiral moves from basic e-book use, through situational reading, e-books and learning, using multiple learning resources, and collaborative/group learning, to e-book production. We discuss each of these use case areas and provide guidelines that will be of interest to practitioners and researchers alike.