BaDLAD: A Large Multi-Domain Bengali Document Layout Analysis Dataset
While strides have been made in deep learning based Bengali Optical Character
Recognition (OCR) in the past decade, the absence of large Document Layout
Analysis (DLA) datasets has hindered the application of OCR in document
transcription, e.g., transcribing historical documents and newspapers.
Moreover, rule-based DLA systems that are currently being employed in practice
are not robust to domain variations and out-of-distribution layouts. To this
end, we present the first multidomain large Bengali Document Layout Analysis
Dataset: BaDLAD. This dataset contains 33,695 human annotated document samples
from six domains - i) books and magazines, ii) public domain govt. documents,
iii) liberation war documents, iv) newspapers, v) historical newspapers, and
vi) property deeds, with 710K polygon annotations for four unit types:
text-box, paragraph, image, and table. Through preliminary experiments
benchmarking the performance of existing state-of-the-art deep learning
architectures for English DLA, we demonstrate the efficacy of our dataset in
training deep learning based Bengali document digitization models.
Optical Font Recognition in Smartphone-Captured Images, and its Applicability for ID Forgery Detection
In this paper, we consider the problem of detecting counterfeit identity
documents in images captured with smartphones. As a number of documents
contain special fonts, we study the applicability of convolutional neural
networks (CNNs) for detecting the conformance of the fonts used with the
ones corresponding to government standards. Here, we use multi-task
learning to differentiate samples by both fonts and characters and compare the
resulting classifier with its analogue trained for binary font classification.
We train neural networks for authenticity estimation of the fonts used in
machine-readable zones and ID numbers of the Russian national passport and test
them on samples of individual characters acquired from 3238 images of the
Russian national passport. Our results show that the usage of multi-task
learning increases sensitivity and specificity of the classifier. Moreover, the
resulting CNNs demonstrate high generalization ability as they correctly
classify fonts which were not present in the training set. We conclude that the
proposed method is sufficient for authentication of the fonts and can be used
as a part of the forgery detection system for images acquired with a smartphone
camera.
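The sensitivity and specificity mentioned in the abstract above can be illustrated with a minimal sketch. The labels below are hypothetical, and treating "forged font" as the positive class is an assumption; the abstract does not state which class the authors count as positive.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives flagged
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical per-character labels: 1 = forged font, 0 = genuine font (assumed).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Sensitivity is the fraction of positive samples correctly flagged, and specificity is the fraction of negative samples correctly passed, so a classifier can only be said to improve on both if neither error type grows.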
Active OCR: Tightening the Loop in Human Computing for OCR Correction
We propose a proof-of-concept application that will experiment with the use of active learning and other iterative techniques for the correction of eighteenth-century texts provided by the HathiTrust Digital Library and the 2,231 ECCO text transcriptions released into the public domain by Gale and distributed by the Text Creation Partnership (TCP) and 18thConnect. In an application based on active learning or a similar approach, the user could identify dozens or hundreds of difficult characters that appear in articles from the same time period, and the system would use this new knowledge to improve optical character recognition (OCR) across the entire corpus. A portion of our efforts will focus on the need to incentivize engagement in tasks of this type, whether they are traditionally crowdsourced or carried out through a more active, iterative process like the one we propose. We intend to examine how exploring users' preferences can improve their engagement with corpora of materials.
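The selection step of such a loop might be sketched as follows. The abstract does not specify a query strategy, so least-confidence sampling is swapped in here as a common choice; the function name, character identifiers, and confidence values are all hypothetical.

```python
def select_for_review(char_confidences, budget):
    """Return the ids of the `budget` least-confident character images.

    `char_confidences` maps a character-image id to the OCR model's
    confidence in its current label (assumed to lie in [0, 1]).
    """
    ranked = sorted(char_confidences.items(), key=lambda item: item[1])
    return [char_id for char_id, _ in ranked[:budget]]


# Hypothetical confidences for four character images.
confidences = {
    "p12_c3": 0.98,
    "p12_c4": 0.41,
    "p13_c1": 0.67,
    "p13_c9": 0.35,
}
queue = select_for_review(confidences, budget=2)
print(queue)  # ['p13_c9', 'p12_c4']
```

In a full loop, the user's corrections for the queued characters would be fed back as training data and the confidences recomputed before the next round.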
Two More Candidate AM Canum Venaticorum (AM CVn) Binaries from the Sloan Digital Sky Survey
AM CVn systems are a select group of ultracompact binaries with the shortest
orbital periods of any known binary subclass; mass-transfer is likely from a
low-mass (partially-)degenerate secondary onto a white dwarf primary, driven by
gravitational radiation. In the past few years, the Sloan Digital Sky Survey
(SDSS) has provided five new AM CVns. Here we report on two further candidates
selected from more recent SDSS data. SDSS J1208+3550 is similar to the earlier
SDSS discoveries, recognized as an AM CVn via its distinctive spectrum which is
dominated by helium emission. From the expanded SDSS Data Release 6 (DR6)
spectroscopic area, we provide an updated surface density estimate for such AM
CVns of order 10^{-3.1} to 10^{-2.5} per deg^2 for 15<g<20.5. In addition, we
present another new candidate AM CVn, SDSS J2047+0008, that was discovered in
the course of followup of SDSS-II supernova candidates. It shows nova-like
outbursts in multi-epoch imaging data; in contrast to the other SDSS AM CVn
discoveries, its (outburst) spectrum is dominated by helium absorption lines,
reminiscent of KL Dra and 2003aw. The variability selection of SDSS J2047+0008
from the 300 deg^2 of SDSS Stripe 82 presages further AM CVn discoveries in
future deep, multicolor, and time-domain surveys such as LSST. The new
additions bring the total SDSS yield to seven AM CVns thus far, a substantial
contribution to this rare subclass, versus the dozen previously known.
Comment: 19 pages, 5 figures, 1 table; submitted to A
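To give a feel for the quoted surface density, the short calculation below converts the logarithmic bounds into objects per square degree. Scaling them to the 300 deg^2 of Stripe 82 is purely an illustrative assumption: the estimate in the abstract applies to spectroscopically selected AM CVns with 15<g<20.5, whereas SDSS J2047+0008 was found through variability.

```python
# Bounds quoted in the abstract, in objects per square degree.
density_low = 10 ** -3.1
density_high = 10 ** -2.5

# Area used only for illustration (Stripe 82, as quoted in the abstract).
area_deg2 = 300.0

expected_low = density_low * area_deg2
expected_high = density_high * area_deg2

print(f"density: {density_low:.2e} to {density_high:.2e} per deg^2")
print(f"one object per {1 / density_high:.0f} to {1 / density_low:.0f} deg^2")
print(f"expected in {area_deg2:.0f} deg^2: {expected_low:.2f} to {expected_high:.2f}")
```

Under these assumptions the density corresponds to roughly one such system per few hundred to about a thousand square degrees, which is consistent with the abstract's description of the subclass as rare.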