Non-Visual Representation of Complex Documents for Use in Digital Talking Books
Essential written information such as textbooks, bills, and catalogues needs to be accessible to everyone. However, such access is not always available to vision-impaired people, as they require electronic documents to be available in specific formats. To address the accessibility issues of electronic documents, this research aims to design an affordable, portable, standalone and simple-to-use complete reading system that will convert and describe complex components in electronic documents for print-disabled users
Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation
Information can include text, pictures and signatures that can be scanned into a document format, such as the Portable Document Format (PDF), and easily emailed to recipients around the world. Upon the document’s arrival, the receiver can open and view it using a vast array of different PDF viewing applications such as Adobe Reader and Apple Preview. Hence, today the use of the PDF has become pervasive. Since a scanned PDF is an image format, it is inaccessible to assistive technologies such as a screen reader. Therefore, retrieving the information requires Optical Character Recognition (OCR). The OCR software processes the scanned PDF file and, through text extraction, generates an editable text formatted document. This text document can then be edited, formatted, searched and indexed, as well as translated or converted to speech. A problem that OCR software does not solve is the accurate regeneration of the full text layout. This paper presents a technology that addresses this issue by closely preserving the original textual layout of the scanned PDF using the open-source document analysis and OCR system OCRopus, based on geometric layout and positioning information. The main issues considered in this research are the preservation of the correct reading order, and the representation of common logical structured elements such as section headings, line breaks, paragraphs, captions, sidebars, footers, running headers, embedded images, graphics, tables and mathematical expressions
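The geometric reading-order idea in the abstract can be illustrated with a small sketch (our own simplification, not the OCRopus algorithm): OCR text blocks carry bounding boxes, and a plausible reading order for a two-column page falls out of sorting those boxes by column and then by vertical position.

```python
# Illustrative sketch (not the OCRopus pipeline): recovering a plausible
# reading order for OCR text blocks from their bounding boxes.
# Each block is (x, y, width, height, text); a two-column page is assumed.

def reading_order(blocks, page_width):
    """Sort blocks column by column (left column first), top to bottom."""
    mid = page_width / 2
    left = [b for b in blocks if b[0] + b[2] / 2 < mid]
    right = [b for b in blocks if b[0] + b[2] / 2 >= mid]
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [b[4] for b in ordered]

blocks = [
    (320, 40, 250, 20, "right column, first paragraph"),
    (20, 40, 250, 20, "left column, heading"),
    (20, 80, 250, 60, "left column, body text"),
    (320, 120, 250, 60, "right column, second paragraph"),
]
print(reading_order(blocks, page_width=600))
```

Real layout analysis must also handle spanning elements such as running headers and tables, which is where the geometric positioning information discussed in the paper comes in.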
Development of a scalable database for recognition of printed mathematical expressions
Searching information in printed scientific documents is a challenging problem that has recently received special attention from the Pattern Recognition research community. Mathematical Expressions are complex elements that appear
in scientific documents, and developing techniques for locating and recognizing
them requires preparation of data sets that can be used as benchmarks. Most
of the current techniques for dealing with Mathematical Expressions are based
on Pattern Recognition and Machine Learning techniques, and therefore these data sets have to be prepared with ground-truth information for automatic training and testing. However, preparing large data sets with ground-truth is a very expensive and time-consuming task. This project introduces a data set of scientific documents that has
been prepared for Mathematical Expression recognition and searching. This data
set has been automatically generated from the LaTeX version of the documents
and consequently can be enlarged easily. The ground-truth includes the position
at page level, the LaTeX version of Mathematical Expressions both embedded in the text and displayed, and the sequence of mathematical symbols represented as Unicode code points used to define these expressions. Based on this data set,
statistics such as the total number and type of expressions, the average number
of expressions per document and their frequency distribution were extracted. A
baseline classification experiment with mathematical symbols from this data set
is also reported in this paper.

Anitei, D. (2020). Development of a scalable database for recognition of printed mathematical expressions. Universitat Politècnica de València. http://hdl.handle.net/10251/150390 (TFG)
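As a rough illustration of the kind of ground-truth the data set describes (the regular expression and helper below are our own simplification, not the tool from the thesis), math expressions can be pulled from LaTeX source and each symbol recorded as a Unicode code point:

```python
# Illustrative sketch (assumed helper, not the thesis tool): extracting
# inline ($...$) and displayed ($$...$$) expressions from LaTeX source and
# recording each non-space character as a Unicode code point.
import re

MATH_RE = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.S)

def extract_expressions(latex_source):
    """Return (kind, expression, code_points) for each math expression."""
    out = []
    for m in MATH_RE.finditer(latex_source):
        displayed, inline = m.group(1), m.group(2)
        expr = displayed if displayed is not None else inline
        kind = "displayed" if displayed is not None else "embedded"
        points = ["U+%04X" % ord(ch) for ch in expr if not ch.isspace()]
        out.append((kind, expr, points))
    return out

sample = r"Euler's identity $e^{i\pi}+1=0$ and the integral $$\int_0^1 x\,dx$$"
for kind, expr, points in extract_expressions(sample):
    print(kind, expr, points)
```

A real pipeline would additionally record the page-level position of each expression, which requires compiling the LaTeX source rather than scanning it textually.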
Scanning Single Shot Detector for Math in Document Images
We introduce the Scanning Single Shot Detector (ScanSSD) for detecting both embedded and displayed math expressions in document images using a single-stage network that does not require page layout, font, or character information. ScanSSD uses sliding windows to generate sub-images of large document page images rendered at 600 dpi and applies a Single Shot Detector (SSD) to each sub-image. Detection results from sub-images are pooled to generate page-level results. For pooling sub-image level detections, we introduce new methods based on the confidence scores and density of detections. ScanSSD is a modular architecture that can be easily applied to detecting other objects in document images.
For the math expression detection task, we have created a new dataset called TFD-ICDAR 2019 from the existing GTDB datasets. Our dataset has 569 pages for training with 26,396 math expressions and 236 pages for testing with 11,885 math expressions. ScanSSD achieves an 80.19% F-score at IOU50 and a 72.96% F-score at IOU75 on the TFD-ICDAR 2019 test dataset. An earlier version of ScanSSD placed 2nd in the ICDAR 2019 competition on Typeset Formula Detection (TFD). Our data and code are publicly available at https://github.com/MaliParag/TFD-ICDAR2019 and https://github.com/MaliParag/ScanSSD, respectively
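The two ideas named in the abstract, sliding windows over a large page image and confidence-based pooling of window-level detections, can be sketched as follows (a simplified illustration with assumed window sizes, not the ScanSSD code):

```python
# Illustrative sketch of the two ScanSSD ideas named in the abstract
# (assumed, simplified geometry): slicing a large page into overlapping
# windows, then pooling window-level detections by confidence.

def sliding_windows(page_w, page_h, win=512, stride=256):
    """Yield (x, y) origins of overlapping square windows covering the page."""
    for y in range(0, max(page_h - win, 0) + 1, stride):
        for x in range(0, max(page_w - win, 0) + 1, stride):
            yield (x, y)

def pool_detections(dets, min_conf=0.5):
    """Keep detections above a confidence threshold, shifted to page coords.

    dets: list of (window_origin, box_in_window, confidence)."""
    pooled = []
    for (wx, wy), (x1, y1, x2, y2), conf in dets:
        if conf >= min_conf:
            pooled.append((wx + x1, wy + y1, wx + x2, wy + y2, conf))
    return pooled

wins = list(sliding_windows(1024, 768))
dets = [((0, 0), (10, 10, 50, 30), 0.9), ((256, 0), (5, 5, 40, 25), 0.3)]
print(len(wins), pool_detections(dets))
```

The paper's pooling also uses the density of overlapping detections, since a true formula is typically detected by several adjacent windows; the threshold-only version above is the simplest possible variant.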
Ontologies and Information Extraction
This report argues that, even in the simplest cases, IE is an ontology-driven
process. It is not a mere text filtering method based on simple pattern
matching and keywords, because the extracted pieces of texts are interpreted
with respect to a predefined partial domain model. This report shows that
depending on the nature and the depth of the interpretation to be done for
extracting the information, more or less knowledge must be involved. This
report is mainly illustrated in biology, a domain in which there are critical
needs for content-based exploration of the scientific literature and which
becomes a major application domain for IE
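A toy example (ours, not from the report) makes the argument concrete: the same surface pattern only yields a usable fact once the matched strings are interpreted against a partial domain model:

```python
# Illustrative sketch of the report's point (a toy example, not from the
# report itself): the pattern "X interacts with Y" is mere text filtering
# until X and Y are interpreted against a partial domain model.
import re

# A tiny, hypothetical partial domain model for biology.
ONTOLOGY = {"TP53": "protein", "MDM2": "protein", "glucose": "metabolite"}

def extract_interactions(text):
    """Keep 'X interacts with Y' matches only when the ontology types fit."""
    facts = []
    for x, y in re.findall(r"(\w+) interacts with (\w+)", text):
        if ONTOLOGY.get(x) == "protein" and ONTOLOGY.get(y) == "protein":
            facts.append(("protein-protein-interaction", x, y))
    return facts

text = "TP53 interacts with MDM2. Sunlight interacts with glucose."
print(extract_interactions(text))
```

The second match is discarded because its arguments do not satisfy the domain model, which is exactly the kind of interpretation step the report argues distinguishes IE from keyword matching.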
CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero
configuration or upfront annotation. In contrast to previous work, CloudScan
does not rely on templates of invoice layout; instead, it learns a single global
model of invoices that naturally generalizes to unseen invoice layouts. The
model is trained using data automatically extracted from end-user provided
feedback. This automatic training data extraction removes the requirement for
users to annotate the data precisely. We describe a recurrent neural network
model that can capture long range context and compare it to a baseline logistic
regression model corresponding to the current CloudScan production system. We
train and evaluate the system on 8 important fields using a dataset of 326,471
invoices. The recurrent neural network and baseline model achieve 0.891 and
0.887 average F1 scores respectively on seen invoice layouts. For the harder
task of unseen invoice layouts, the recurrent neural network model outperforms
the baseline with 0.840 average F1 compared to 0.788.

Comment: Presented at ICDAR 201
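The automatic training data extraction mentioned above can be sketched as follows (an assumed simplification, not CloudScan's implementation): end users only confirm field values, and token labels are produced by matching those values back onto the invoice text:

```python
# Illustrative sketch (assumed, not CloudScan's implementation) of
# automatic training data extraction: labels come from matching
# user-confirmed field values back onto the invoice's token stream.

def auto_label(tokens, confirmed_fields):
    """Label each token span that exactly matches a confirmed field value."""
    labels = ["O"] * len(tokens)
    for field, value in confirmed_fields.items():
        want = value.split()
        for i in range(len(tokens) - len(want) + 1):
            if tokens[i:i + len(want)] == want:
                for j in range(i, i + len(want)):
                    labels[j] = field
    return labels

tokens = "Invoice 2042 Total 117.50 EUR Date 2016-03-01".split()
fields = {"invoice_number": "2042", "total": "117.50", "date": "2016-03-01"}
print(list(zip(tokens, auto_label(tokens, fields))))
```

Labels produced this way are noisy (a value can match in more than one place), which is one reason the abstract notes that users need not annotate the data precisely: the model is trained to tolerate such noise.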
From Pixels and Minds to the Mathematical Knowledge in a Digital Library
Experience in setting up a workflow from scanned images of mathematical papers to a fully fledged mathematical library is described using the example of the Czech Digital Mathematics Library project DML-CZ. An overview of the whole process is given, with a description of all main production steps. DML-CZ has recently been launched to the public with more than 100,000 digitized pages
A direct TeX-to-Braille transcribing method
The TeX/LaTeX typesetting system is the most widespread system for creating documents in Mathematics and Science. However, no reliable tool exists to this day for automatically transcribing documents from these formats into Braille/Nemeth code. Thus, visually impaired students of related fields do not have access to the bulk of study material available in LaTeX format. We have developed a tool, named latex2nemeth, for directly transcribing LaTeX documents to Nemeth Braille, thus facilitating the access of blind students to Science. In order to support the extensive set of mathematical symbols covered by TeX, we propose some new symbols based on the extension mechanisms of the Nemeth code
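Direct token-level transcription can be sketched as below; note that the Braille cells in the mapping table are placeholders invented for this example, not the actual Nemeth code assignments used by latex2nemeth:

```python
# Illustrative sketch of direct token-level LaTeX-to-Braille transcription.
# The Braille cells below are placeholders chosen for this example only,
# NOT real Nemeth code, and the mapping table is hypothetical.

TOKEN_TO_BRAILLE = {
    "\\frac": "⠹", "{": "⠷", "}": "⠾", "+": "⠬", "1": "⠂", "2": "⠆",
}

def transcribe(tokens):
    """Map each LaTeX token through the table; unknown tokens pass through."""
    return "".join(TOKEN_TO_BRAILLE.get(t, t) for t in tokens)

# \frac{1}{2} as a token stream:
print(transcribe(["\\frac", "{", "1", "}", "{", "2", "}"]))
```

A direct transcriber of this kind avoids lossy intermediate formats, but the hard part, as the abstract notes, is covering TeX's very large symbol inventory, including symbols with no existing Nemeth assignment.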