Recognition and identification of form document layouts
This thesis introduces a hierarchical tree representation of the logical structure of a form document. Because different forms can share the same logical structure, this representation alone is ambiguous. The thesis resolves the ambiguity by adding the physical information of the blocks. To build the hierarchical tree and extract the blocks' physical information, a pixel tracing approach is used to extract form layout structures from form documents. Compared with the Hough transform, the pixel tracing algorithm requires less computation. The algorithm has been tested on 50 different table forms. It effectively extracts all the line information required for the hierarchical tree representation, represents the form by a hierarchical tree, and distinguishes the different forms. The algorithm applies to table form documents.
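As a rough illustration of the ambiguity described in this abstract, the sketch below models a form as a tree of blocks. The `Block` class, its `bbox` field, and both signature functions are hypothetical names introduced here for illustration; they are not taken from the thesis. Two forms with the same nesting yield the same logical signature, while including each block's geometry tells them apart.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    # Bounding box as (x, y, width, height): the assumed "physical information".
    bbox: Tuple[int, int, int, int]
    children: List["Block"] = field(default_factory=list)


def logical_signature(block: Block) -> str:
    """Encode only the nesting of blocks (the logical structure)."""
    return "(" + "".join(logical_signature(c) for c in block.children) + ")"


def physical_signature(block: Block) -> str:
    """Encode the nesting together with each block's bounding box."""
    inner = "".join(physical_signature(c) for c in block.children)
    return f"({block.bbox}{inner})"


# Two hypothetical forms: one outer block split into two cells,
# but with the split at different horizontal positions.
form_a = Block((0, 0, 100, 50), [Block((0, 0, 50, 50)), Block((50, 0, 50, 50))])
form_b = Block((0, 0, 100, 50), [Block((0, 0, 30, 50)), Block((30, 0, 70, 50))])

same_logical = logical_signature(form_a) == logical_signature(form_b)
same_physical = physical_signature(form_a) == physical_signature(form_b)
```

Here `same_logical` is true and `same_physical` is false, which is the kind of distinction the abstract attributes to the added physical information.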
Music-reading expertise modulates the visual span for English letters but not Chinese characters.
Recent research has suggested that the visual span in stimulus identification can be enlarged through perceptual learning. Since both English and music reading involve left-to-right sequential symbol processing, music-reading experience may enhance symbol identification through perceptual learning particularly in the right visual field (RVF). In contrast, as Chinese can be read in all directions, and components of Chinese characters do not consistently form a left-right structure, this hypothesized RVF enhancement effect may be limited in Chinese character identification. To test these hypotheses, here we recruited musicians and nonmusicians who read Chinese as their first language (L1) and English as their second language (L2) to identify music notes, English letters, Chinese characters, and novel symbols (Tibetan letters) presented at different eccentricities and visual field locations on the screen while maintaining central fixation. We found that in English letter identification, significantly more musicians achieved above-chance performance in the center-RVF locations than nonmusicians. This effect was not observed in Chinese character or novel symbol identification. We also found that in music note identification, musicians outperformed nonmusicians in accuracy in the center-RVF condition, consistent with the RVF enhancement effect in the visual span observed in English-letter identification. These results suggest that the modulation of music-reading experience on the visual span for stimulus identification depends on the similarities in the perceptual processes involved.
CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero
configuration or upfront annotation. In contrast to previous work, CloudScan
does not rely on templates of invoice layout; instead, it learns a single global
model of invoices that naturally generalizes to unseen invoice layouts. The
model is trained using data automatically extracted from end-user provided
feedback. This automatic training data extraction removes the requirement for
users to annotate the data precisely. We describe a recurrent neural network
model that can capture long range context and compare it to a baseline logistic
regression model corresponding to the current CloudScan production system. We
train and evaluate the system on 8 important fields using a dataset of 326,471
invoices. The recurrent neural network and baseline model achieve 0.891 and
0.887 average F1 scores respectively on seen invoice layouts. For the harder
task of unseen invoice layouts, the recurrent neural network model outperforms
the baseline with 0.840 average F1 compared to 0.788. Comment: Presented at ICDAR 201
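For readers unfamiliar with the metric, the sketch below shows one way an average F1 over several fields can be computed. The abstract does not say how its average is formed, so the unweighted mean over fields is an assumption here, and the field names and counts are made up for illustration.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-field counts as (tp, fp, fn); not data from the paper.
counts = {"total": (90, 10, 10), "date": (80, 20, 0)}

per_field = {name: f1_score(*c) for name, c in counts.items()}

# Assumed averaging scheme: unweighted mean over fields.
average_f1 = sum(per_field.values()) / len(per_field)
```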
A survey of comics research in computer science
Graphic novels such as comics and manga are well known all over the world.
The digital transition started to change the way people are reading comics,
more and more on smartphones and tablets and less and less on paper. In
recent years, a wide variety of research about comics has been proposed and
might change the way comics are created, distributed and read in future years.
Early work focuses on low-level document image analysis: indeed, comic books are
complex; they contain text, drawings, balloons, panels, onomatopoeia, etc.
Different fields of computer science covered research about user interaction
and content generation such as multimedia, artificial intelligence,
human-computer interaction, etc. with different sets of values. We propose in
this paper to review the previous research about comics in computer science, to
state what has been done and to give some insights about the main outlooks.
XLIndy: interactive recognition and information extraction in spreadsheets
Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise due to spreadsheets being partially structured and carrying implicit (visual and textual) information. This translates into a bottleneck when it comes to automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine learning back-end, written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. The latter is used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON. Peer Reviewed. Postprint (author's final draft)
Identification of Technical Journals by Image Processing Techniques
This study develops an automatic approach to identifying an unknown technical journal from its cover page. Since journal cover pages contain a great deal of information, determining the title of an unknown journal using optical character recognition (OCR) techniques seems difficult. Comparing the layout structures of text blocks on the journal cover pages is an effective method for distinguishing one journal from another. To make layout-structure comparison efficient, a left-to-right hidden Markov model (HMM) is used to represent the layout structure of text blocks for each kind of journal. The title of an unknown input journal can then be determined by comparing its layout structure to each HMM in the database. In addition, from the layout structure of the best-matched HMM, we can locate the text block of the issue date, which is then recognized by OCR techniques to support an automatic journal registration system. Experimental results show the feasibility of the proposed approach.
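The matching step in this abstract can be sketched with the standard forward algorithm: each journal has its own left-to-right HMM, and the unknown cover page is assigned to the model with the highest likelihood. Everything concrete below is an assumption made for illustration, including the two-state models, the "wide"/"narrow" block symbols, and all probabilities; the abstract does not describe the paper's actual features or parameters.

```python
from typing import Dict, List


def forward_likelihood(
    obs: List[str],
    start: List[float],
    trans: List[List[float]],
    emit: List[Dict[str, float]],
) -> float:
    """Likelihood of an observation sequence under a discrete HMM."""
    n = len(start)
    alpha = [start[s] * emit[s].get(obs[0], 0.0) for s in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s].get(symbol, 0.0)
            for s in range(n)
        ]
    return sum(alpha)


# Left-to-right topology: a state can only stay put or move forward,
# so the transition matrix has zeros below the diagonal.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]

# Hypothetical per-journal emission tables over text-block symbols.
models = {
    "journal_a": [{"wide": 0.9, "narrow": 0.1}, {"wide": 0.2, "narrow": 0.8}],
    "journal_b": [{"wide": 0.1, "narrow": 0.9}, {"wide": 0.8, "narrow": 0.2}],
}

# Text blocks of the unknown cover page, read in a fixed order.
observed = ["wide", "narrow", "narrow"]

scores = {name: forward_likelihood(observed, start, trans, emit)
          for name, emit in models.items()}
best = max(scores, key=scores.get)
```

With these made-up numbers, `best` is `"journal_a"`. For long sequences, a practical version would work in log space to avoid numerical underflow.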
Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents
This paper introduces a deep learning model tailored for document information
analysis, emphasizing document classification, entity relation extraction, and
document visual question answering. The proposed model leverages
transformer-based models to encode all the information present in a document
image, including textual, visual, and layout information. The model is
pre-trained and subsequently fine-tuned for various document image analysis
tasks. The proposed model incorporates three additional tasks during the
pre-training phase, including reading order identification of different layout
segments in a document image, layout segments categorization as per PubLayNet,
and generation of the text sequence within a given layout segment (text block).
The model also incorporates a collective pre-training scheme where losses of
all the tasks under consideration, including pre-training and fine-tuning tasks
with all datasets, are considered. Additional encoder and decoder blocks are
added to the RoBERTa network to generate results for all tasks. The proposed
model achieved impressive results across all tasks, with an accuracy of 95.87%
on the RVL-CDIP dataset for document classification, F1 scores of 0.9306,
0.9804, 0.9794, and 0.8742 on the FUNSD, CORD, SROIE, and Kleister-NDA datasets
respectively for entity relation extraction, and an ANLS score of 0.8468 on the
DocVQA dataset for visual question answering. The results highlight the
effectiveness of the proposed model in understanding and interpreting complex
document layouts and content, making it a promising tool for document analysis
tasks.
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last
decades naturally lend themselves to automatic processing and exploration.
Research work seeking to automatically process facsimiles and extract
information thereby are multiplying with, as a first essential step, document
layout analysis. Although the identification and categorization of segments of
interest in document images have seen significant progress over the last years
thanks to deep learning techniques, many challenges remain with, among others,
the use of finer-grained segmentation typologies and the consideration of
complex, heterogeneous documents such as historical newspapers. Besides, most
approaches consider visual features only, ignoring textual signal. In this
context, we introduce a multimodal approach for the semantic segmentation of
historical newspapers that combines visual and textual features. Based on a
series of experiments on diachronic Swiss and Luxembourgish newspapers, we
investigate, among others, the predictive power of visual and textual features
and their capacity to generalize across time and sources. Results show
consistent improvement of multimodal models in comparison to a strong visual
baseline, as well as better robustness to high material variance.
Program documentation standards
A style manual is presented to serve as a reference and guide for system and program documentation. It is intended to set standards for documentation, prescribing the procedures to be followed, format to be used, and information to be produced. The standards for program documentation specify the extent to which the programmer should support his efforts in writing. The first three sections of the manual (system, program, and operation descriptions) contain information of particular interest to management, operators, and program users, respectively. Each section was designed as a self-sufficient description from the management, operator, or user point of view.