
    CCpdf: Building a High Quality Corpus for Visually Rich Documents from Web Crawl Data

    In recent years, the field of document understanding has progressed considerably. A significant part of this progress has been made possible by language models pretrained on large collections of documents. However, the pretraining corpora used in document understanding are single-domain, monolingual, or non-public. Our goal in this paper is to propose an efficient pipeline for creating a large-scale, diverse, multilingual corpus of PDF files from across the Internet using Common Crawl, as PDF files are the most canonical type of document considered in document understanding. We extensively analysed all steps of the pipeline and propose a solution that trades off data quality against processing time. We also share the CCpdf corpus in the form of an index of PDF files, along with a script for downloading them, which produces a collection useful for language model pretraining. The dataset and tools published with this paper offer researchers the opportunity to develop even better multilingual language models.
    Comment: Accepted at ICDAR 202
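
    As a rough illustration of the kind of harvesting step such a pipeline builds on, the sketch below queries the public Common Crawl CDX index for captures whose MIME type is PDF and fetches one record's bytes with an HTTP range request. The crawl label (CC-MAIN-2023-23) and the example domain are placeholders, and the paper's actual pipeline adds filtering, deduplication, and language identification on top of a step like this.

"""Minimal sketch: query the Common Crawl CDX index for PDF captures and
fetch one record's raw bytes from the corresponding WARC file via an HTTP
range request. Crawl label and domain are illustrative placeholders."""
import gzip
import io
import json

import requests

CDX_API = "https://index.commoncrawl.org/CC-MAIN-2023-23-index"  # example crawl


def find_pdf_records(domain: str, limit: int = 5) -> list[dict]:
    # Ask the CDX index for captures under `domain` whose MIME type is PDF.
    params = {
        "url": f"{domain}/*",
        "output": "json",
        "filter": "mime:application/pdf",
        "limit": limit,
    }
    resp = requests.get(CDX_API, params=params, timeout=30)
    resp.raise_for_status()
    return [json.loads(line) for line in resp.text.splitlines()]


def fetch_warc_record(record: dict) -> bytes:
    # Each index entry points at a byte range inside a WARC file hosted on
    # data.commoncrawl.org; a Range request retrieves just that record.
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    url = "https://data.commoncrawl.org/" + record["filename"]
    resp = requests.get(url, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    resp.raise_for_status()
    # The slice is a gzip member holding WARC and HTTP headers plus the PDF body.
    return gzip.GzipFile(fileobj=io.BytesIO(resp.content)).read()


if __name__ == "__main__":
    for rec in find_pdf_records("example.com"):
        raw = fetch_warc_record(rec)
        print(rec["url"], len(raw), "bytes (headers + PDF payload)")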

    Traktat Parkosza. Eksperymentalna edycja elektroniczna (Parkosz's Treatise: An Experimental Electronic Edition)

    The fifteenth-century Latin manuscript presenting the proposal for Polish spelling formulated by Jakub Parkosz poses an interesting challenge for editors, primarily because it introduces special characters that were not used later. The text of the treatise was presented and discussed in a book by Marian Kucała (available under an open licence), which was used as the basis for an experimental electronic edition of the treatise in the form of an interactive PDF file, also available as a LuaLaTeX source. For transliteration, the electronic edition uses existing Unicode letters with some additions from the Private Use Area of the Medieval Unicode Font Initiative. This paper presents the rationale for the transliteration and discusses some possible alternative forms of electronic editions.

    Developing Quantitative Methodologies for the Digital Humanities: A Case Study of 20th Century American Commentary on Russian Literature

    Using scientific methods in the humanities is at the forefront of objective literary analysis. However, processing big data is particularly complex when the subject matter is qualitative rather than numerical. Large volumes of text require specialized tools to produce quantifiable data from ideas and sentiments. Our team researched the extent to which tools such as Weka and MALLET can test hypotheses about qualitative information. We examined the claim that literary commentary exists within political environments, using US periodical articles concerning Russian literature in the early twentieth century as a case study. These tools generated useful quantitative data that allowed us to run stepwise binary logistic regressions. These statistical tests allowed for time series experiments using sea change and emergency models of history, as well as classification experiments with regard to author characteristics, social issues, and sentiment expressed. Both types of experiments supported our claim to varying degrees but, more importantly, served as a definitive demonstration that digitally enhanced quantitative forms of analysis can apply to qualitative data. Our findings set the foundation for further experiments in the emerging field of digital humanities.
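
    As a hedged illustration of the classification side of such experiments, the sketch below fits a binary logistic regression to bag-of-words features derived from a handful of made-up article snippets. The texts, labels, and feature construction are placeholders, not the study's corpus; the study itself worked from Weka and MALLET outputs with stepwise variable selection.

"""Minimal sketch of a binary logistic regression over bag-of-words features,
in the spirit of the classification experiments described above. Texts and
labels are toy placeholders, not the study's corpus."""
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy article snippets and hypothetical labels (1 = politically framed
# commentary, 0 = aesthetically framed commentary).
articles = [
    "Tolstoy's moral vision speaks to universal human questions",
    "Dostoevsky read as a warning against revolutionary politics",
    "Turgenev's realism praised purely for its artistry",
    "Soviet literature dismissed as state propaganda",
]
labels = [0, 1, 0, 1]

# Turn the texts into a document-term count matrix.
X = CountVectorizer(min_df=1).fit_transform(articles)

# Fit and evaluate a logistic regression; with real data one would add
# stepwise (or regularized) feature selection and a proper evaluation split.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, labels, cv=2)
print("cross-validated accuracy:", scores.mean())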

    Arabic Typed Text Recognition in Graphics Images (ATTR-GI)

    While optical character recognition (OCR) techniques may perform well on standard text documents, their performance degrades significantly on graphics images. In standard scanned text documents, OCR techniques enjoy a number of convenient assumptions, such as clear backgrounds, standard fonts, predefined line orientation, page size, and the starting point of the writing. These assumptions do not hold in graphics documents such as Arabic advertisements, personal cards, and screenshots. Therefore, in such images, greater attention is required in the initial stage of detecting Arabic text regions in order for subsequent character recognition steps to be successful. Special features of the Arabic alphabet introduce additional challenges that are not present in Latin alphabet characters. In this research we propose a new technique for automatically detecting text in graphics documents and preparing it for OCR processing. Our detection approach is based on several mathematical measurements used to decide whether a region contains text and whether that text is Arabic-based or Latin-based. The measurements are as follows: the baseline (the row with the maximum number of black pixels), the item area (the content of the extracted sub-images), and finally the maximum peak of adjacent black pixels on the baseline together with the maximum length of sub-adjacent black pixel runs. Our experimental results are presented in more detail in the paper. We believe our technique will enable OCR systems to overcome a major shortcoming when dealing with text in graphics images. This will further enable a variety of OCR-based applications to extend their operation to graphics documents, for example spam detection in images, reading advertisements to blind people, searching and indexing documents that contain images, adapting output to printer properties (black-and-white or colour printers), and enhancing OCR.
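
    A minimal sketch of the baseline measurement described above follows: on a binarized sub-image, the baseline is taken as the row with the maximum number of black pixels, found from a horizontal projection profile, and the longest run of adjacent black pixels on that row is measured as a crude script cue. The thresholding and the toy input are assumptions for illustration, not the paper's exact procedure.

"""Minimal sketch of the baseline measurement described above: the baseline of
a binarized sub-image is the row with the maximum number of black pixels, and
the longest run of adjacent black pixels on that row is a crude script cue."""
import numpy as np


def baseline_row(binary: np.ndarray) -> int:
    # `binary` is a 2D array with 1 for black (ink) pixels and 0 for background.
    projection = binary.sum(axis=1)    # horizontal projection: black pixels per row
    return int(np.argmax(projection))  # row index with the most black pixels


def longest_black_run(row: np.ndarray) -> int:
    # Longest run of adjacent black pixels in a single row.
    best = run = 0
    for px in row:
        run = run + 1 if px else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    # Toy binarized patch; a real pipeline would threshold a scanned sub-image.
    patch = (np.random.rand(32, 64) > 0.85).astype(int)
    b = baseline_row(patch)
    print("baseline row:", b, "| longest run on baseline:", longest_black_run(patch[b]))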

    Content-based image analysis with applications to the multifunction printer imaging pipeline and image databases

    Image understanding is one of the most important topics for various applications. Most image understanding studies focus on content-based approaches, while some also rely on image metadata. Image understanding includes several heavily studied sub-topics such as classification, segmentation, retrieval, and automatic annotation. This thesis proposes several new methods and algorithms for image classification, retrieval, and automatic tag generation. The proposed algorithms have been tested and verified on multiple platforms. For image classification, our proposed method can complete classification in real time under the hardware constraints of an all-in-one printer and adaptively improve itself by online learning. Another image understanding engine, which includes both classification and image quality analysis, is designed to solve the optimal compression problem of a printing system. Our proposed image retrieval algorithm can be applied to either a PC or a mobile device to improve the hybrid learning experience. We also develop a new matrix factorization algorithm to better recover image metadata (tags). The proposed algorithm outperforms other existing matrix factorization methods.
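
    For context on the tag-recovery setting, the sketch below shows a generic low-rank matrix factorization baseline: a partially observed image-by-tag matrix is factored with gradient descent and the reconstruction is used to score unobserved (image, tag) pairs. The data, rank, and learning rate are placeholders, and this is a standard baseline rather than the thesis's proposed algorithm.

"""Minimal sketch of low-rank matrix factorization for recovering missing
image tags: factor a partially observed image-by-tag matrix with gradient
descent and score unobserved (image, tag) pairs from the reconstruction.
A generic baseline, not the thesis's proposed algorithm."""
import numpy as np


def factorize(R, mask, rank=2, lr=0.05, reg=0.01, epochs=500, seed=0):
    # R: image-by-tag matrix (1 = tag present); mask: 1 where R is observed.
    rng = np.random.default_rng(seed)
    n_imgs, n_tags = R.shape
    U = 0.1 * rng.standard_normal((n_imgs, rank))
    V = 0.1 * rng.standard_normal((n_tags, rank))
    for _ in range(epochs):
        E = mask * (R - U @ V.T)          # error on observed entries only
        U += lr * (E @ V - reg * U)       # gradient step on image factors
        V += lr * (E.T @ U - reg * V)     # gradient step on tag factors
    return U @ V.T                        # dense score matrix for all pairs


if __name__ == "__main__":
    R = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])
    mask = np.array([[1.0, 1.0, 0.0],     # (image 0, tag 2) is held out
                     [1.0, 1.0, 1.0],
                     [1.0, 1.0, 1.0]])
    scores = factorize(R, mask)
    print("score for held-out (image 0, tag 2):", round(float(scores[0, 2]), 3))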