
    BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset

    While strides have been made in deep-learning-based Bengali Optical Character Recognition (OCR) in the past decade, the absence of large Document Layout Analysis (DLA) datasets has hindered the application of OCR in document transcription, e.g., transcribing historical documents and newspapers. Moreover, the rule-based DLA systems currently employed in practice are not robust to domain variations and out-of-distribution layouts. To this end, we present the first large multi-domain Bengali Document Layout Analysis Dataset: BaDLAD. This dataset contains 33,695 human-annotated document samples from six domains - i) books and magazines, ii) public domain govt. documents, iii) liberation war documents, iv) newspapers, v) historical newspapers, and vi) property deeds, with 710K polygon annotations for four unit types: text-box, paragraph, image, and table. Through preliminary experiments benchmarking the performance of existing state-of-the-art deep learning architectures for English DLA, we demonstrate the efficacy of our dataset in training deep-learning-based Bengali document digitization models.

    Optical Font Recognition in Smartphone-Captured Images, and its Applicability for ID Forgery Detection

    In this paper, we consider the problem of detecting counterfeit identity documents in images captured with smartphones. As a number of documents contain special fonts, we study the applicability of convolutional neural networks (CNNs) for checking whether the fonts used conform to those specified by government standards. Here, we use multi-task learning to differentiate samples by both fonts and characters and compare the resulting classifier with its analogue trained for binary font classification. We train neural networks for authenticity estimation of the fonts used in machine-readable zones and ID numbers of the Russian national passport and test them on samples of individual characters acquired from 3238 images of the Russian national passport. Our results show that the use of multi-task learning increases the sensitivity and specificity of the classifier. Moreover, the resulting CNNs demonstrate high generalization ability, as they correctly classify fonts that were not present in the training set. We conclude that the proposed method is sufficient for authentication of the fonts and can be used as part of a forgery detection system for images acquired with a smartphone camera.

    Active OCR: Tightening the Loop in Human Computing for OCR Correction

    We propose a proof-of-concept application that will experiment with the use of active learning and other iterative techniques for the correction of eighteenth-century texts provided by the HathiTrust Digital Library and the 2,231 ECCO text transcriptions released into the public domain by Gale and distributed by the Text Creation Partnership (TCP) and 18thConnect. In an application based on active learning or a similar approach, the user could identify dozens or hundreds of difficult characters that appear in the articles from that same time period, and the system would use this new knowledge to improve optical character recognition (OCR) across the entire corpus. A portion of our efforts will focus on the need to incentivize engagement in tasks of this type, whether they are traditionally crowdsourced or carried out through a more active, iterative process like the one we propose. We intend to examine how explorations of users' preferences can improve their engagement with corpora of materials.

    Two More Candidate AM Canum Venaticorum (AM CVn) Binaries from the Sloan Digital Sky Survey

    AM CVn systems are a select group of ultracompact binaries with the shortest orbital periods of any known binary subclass; mass-transfer is likely from a low-mass (partially-)degenerate secondary onto a white dwarf primary, driven by gravitational radiation. In the past few years, the Sloan Digital Sky Survey (SDSS) has provided five new AM CVns. Here we report on two further candidates selected from more recent SDSS data. SDSS J1208+3550 is similar to the earlier SDSS discoveries, recognized as an AM CVn via its distinctive spectrum which is dominated by helium emission. From the expanded SDSS Data Release 6 (DR6) spectroscopic area, we provide an updated surface density estimate for such AM CVns of order 10^{-3.1} to 10^{-2.5} per deg^2 for 15<g<20.5. In addition, we present another new candidate AM CVn, SDSS J2047+0008, that was discovered in the course of followup of SDSS-II supernova candidates. It shows nova-like outbursts in multi-epoch imaging data; in contrast to the other SDSS AM CVn discoveries, its (outburst) spectrum is dominated by helium absorption lines, reminiscent of KL Dra and 2003aw. The variability selection of SDSS J2047+0008 from the 300 deg^2 of SDSS Stripe 82 presages further AM CVn discoveries in future deep, multicolor, and time-domain surveys such as LSST. The new additions bring the total SDSS yield to seven AM CVns thus far, a substantial contribution to this rare subclass, versus the dozen previously known. Comment: 19 pages, 5 figures, 1 table; submitted to A