Bridging the Performance Gap between DETR and R-CNN for Graphical Object Detection in Document Images
This paper takes an important step in bridging the performance gap between
DETR and R-CNN for graphical object detection. Existing graphical object
detection approaches have enjoyed recent enhancements in CNN-based object
detection methods, achieving remarkable progress. Recently, Transformer-based
detectors have considerably boosted the generic object detection performance,
eliminating the need for hand-crafted features or post-processing steps such as
Non-Maximum Suppression (NMS) using object queries. However, the effectiveness
of such enhanced transformer-based detection algorithms has yet to be verified
for the problem of graphical object detection. Inspired by the
latest advancements in DETR, we employ the existing detection transformer
with a few modifications for graphical object detection. We modify object queries
in different ways, using points, anchor boxes and adding positive and negative
noise to the anchors to boost performance. These modifications allow for better
handling of objects with varying sizes and aspect ratios, more robustness to
small variations in object positions and sizes, and improved image
discrimination between objects and non-objects. We evaluate our approach on
four graphical datasets: PubTables, TableBank, NTable and PubLayNet. Upon
integrating query modifications in DETR, we outperform prior works and
achieve new state-of-the-art results with mAP of 96.9%, 95.7% and 99.3%
on TableBank, PubLayNet and PubTables, respectively. The results from extensive
ablations show that transformer-based methods are more effective for document
analysis, as in other applications. We hope this study draws more
attention to research on using detection transformers in document image
analysis.
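The idea of adding positive and negative noise to anchors can be sketched as follows. This is a minimal illustration and not the paper's implementation: the abstract does not give the noise scheme, so the function names, the `(cx, cy, w, h)` box format, the `scale` parameter, and the rule "small jitter for positives, larger jitter for negatives" are all assumptions.

```python
import random

def jitter_box(box, low, high, rng):
    """Shift and rescale a (cx, cy, w, h) box by a relative amount in [low, high]."""
    cx, cy, w, h = box

    def signed():
        # Random magnitude in [low, high] with a random sign.
        mag = rng.uniform(low, high)
        return mag if rng.random() < 0.5 else -mag

    return (cx + signed() * w / 2,
            cy + signed() * h / 2,
            w * (1 + signed()),
            h * (1 + signed()))

def make_noised_queries(gt_boxes, scale=0.4, seed=0):
    """Build hypothetical positive and negative anchor queries from ground-truth boxes.

    Positives receive noise below scale / 2 and should still be matched to their box;
    negatives receive noise between scale / 2 and scale and should be rejected.
    """
    rng = random.Random(seed)
    positives = [jitter_box(b, 0.0, scale / 2, rng) for b in gt_boxes]
    negatives = [jitter_box(b, scale / 2, scale, rng) for b in gt_boxes]
    return positives, negatives
```

In a training loop of this kind, the positive queries would be supervised to reconstruct the original box, while the negative queries would be supervised to predict "no object".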
Learning Object Categories From Internet Image Searches
In this paper, we describe a simple approach to learning models of visual object categories from images gathered from Internet image search engines. The images for a given keyword are typically highly variable, with a large fraction being unrelated to the query term, and thus pose a challenging environment from which to learn. By training our models directly from Internet images, we remove the need to laboriously compile training data sets, required by most other recognition approaches; this opens up the possibility of learning object category models “on-the-fly.” We describe two simple approaches, derived from the probabilistic latent semantic analysis (pLSA) technique for text document analysis, that can be used to automatically learn object models from these data. We show two applications of the learned model: first, to rerank the images returned by the search engine, thus improving the quality of the search engine; and second, to recognize objects in other image data sets.
WordFences: Text localization and recognition
In collaboration with the Universitat de Barcelona (UB) and the Universitat Rovira i Virgili (URV).
In recent years, text recognition has achieved remarkable success in recognizing scanned
document text. However, word recognition in natural images is still an open problem,
which generally requires time-consuming post-processing steps. We present a novel architecture
for individual word detection in scene images based on semantic segmentation.
Our contributions are twofold: the concept of WordFence, which detects border areas
surrounding each individual word and a unique pixelwise weighted softmax loss function
which penalizes background and emphasizes small text regions. WordFence ensures that
each word is detected individually, and the new loss function provides a strong training
signal to both text and word border localization. The proposed technique avoids intensive
post-processing by combining semantic word segmentation with a voting scheme
for merging segmentations of multiple scales, producing an end-to-end word detection
system. We achieve superior localization recall on common benchmark datasets - 92%
recall on ICDAR11 and ICDAR13 and 63% recall on SVT. Furthermore, end-to-end
word recognition achieves state-of-the-art 86% F-Score on ICDAR13.
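A pixelwise weighted softmax loss of the kind described can be sketched as below. This is a hedged illustration only: the abstract does not specify how the weights are computed, so the `weights` map (for example, low values on background and high values on small text regions), the normalisation by the weight sum, and the function name are assumptions.

```python
import math

def weighted_softmax_loss(logits, labels, weights):
    """Weighted mean of per-pixel softmax cross-entropy.

    logits:  H x W x C nested lists of class scores
    labels:  H x W nested lists of integer class ids
    weights: H x W nested lists of non-negative per-pixel weights (hypothetical)
    """
    total, norm = 0.0, 0.0
    for row_logits, row_labels, row_weights in zip(logits, labels, weights):
        for z, y, w in zip(row_logits, row_labels, row_weights):
            # Numerically stable log-sum-exp over the class scores.
            m = max(z)
            log_sum = m + math.log(sum(math.exp(v - m) for v in z))
            # Cross-entropy for this pixel is log_sum - z[y]; scale it by the weight.
            total += w * (log_sum - z[y])
            norm += w
    return total / norm if norm > 0 else 0.0
```

Raising the weight of a pixel increases its share of the loss, which is one way to keep small text regions from being dominated by the far more numerous background pixels.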
Automatic Structural Scene Digitalization
In this paper, we present an automatic system for the analysis and labeling
of structural scenes, namely floor plan drawings in Computer-Aided Design (CAD)
format. The proposed system applies a fusion strategy to detect and recognize
various components of CAD floor plans, such as walls, doors, windows and other
ambiguous assets. Technically, a general rule-based filter parsing method is
first adopted to extract effective information from the original floor plan.
Then, an image-processing based recovery method is employed to correct
information extracted in the first step. Our proposed method is fully automatic
and real-time. The analysis system provides high accuracy and is also
evaluated on a public website that, on average, achieves more than ten
thousand effective uses per day and reaches a relatively high satisfaction
rate. Comment: paper submitted to PloS On