Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation
This paper presents a comprehensive evaluation of the Optical Character
Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large
Multimodal Model (LMM). We assess the model's performance across a range of OCR
tasks, including scene text recognition, handwritten text recognition,
handwritten mathematical expression recognition, table structure recognition,
and information extraction from visually rich documents. The evaluation reveals
that GPT-4V performs well in recognizing and understanding Latin-script content,
but struggles with multilingual scenarios and complex tasks. Specifically, it
shows limitations when dealing with non-Latin languages and with complex tasks
such as handwritten mathematical expression recognition, table structure
recognition, and end-to-end semantic entity recognition and pair extraction
from document images. Based on these observations, we affirm the necessity and
continued research value of specialized OCR models. Overall, despite its
versatility in handling diverse OCR tasks, GPT-4V does not outperform existing
state-of-the-art OCR models. How to fully utilize pre-trained general-purpose
LMMs such as GPT-4V for downstream OCR tasks remains an open problem. The study
offers a critical reference for future research in OCR with LMMs. The evaluation
pipeline and results are available at
https://github.com/SCUT-DLVCLab/GPT-4V_OCR
Multi-Scale Attention with Dense Encoder for Handwritten Mathematical Expression Recognition
Handwritten mathematical expression recognition is a challenging problem due
to the complicated two-dimensional structures, ambiguous handwriting input, and
varying scales of handwritten math symbols. To address this problem, we use an
attention-based encoder-decoder model that converts mathematical expression
images from two-dimensional layouts into one-dimensional LaTeX strings. We
improve the encoder by employing densely connected convolutional networks, as
they can strengthen feature extraction and facilitate gradient propagation,
especially on a small training set. We also present a novel multi-scale
attention model that handles the recognition of math symbols at different
scales and preserves the fine-grained details that would otherwise be dropped
by pooling operations. Validated on the CROHME competition task, the proposed
method significantly outperforms the state-of-the-art methods, with an
expression recognition accuracy of 52.8% on CROHME 2014 and 50.1% on CROHME
2016, using only the official training dataset.
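The multi-scale attention idea can be illustrated with a toy sketch: the decoder attends separately over annotation vectors from a coarse (pooled) feature map and from a finer, higher-resolution one, then concatenates the two context vectors so that small symbols still contribute detail. This is one plausible reading of the abstract, not the authors' implementation. It assumes dot-product scoring (the abstract only says "attention-based"), uses plain Python lists in place of tensors, and the names `attend` and `multi_scale_context` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, annotations: List[Vector]) -> Vector:
    """Dot-product attention (an assumption): weight each annotation
    vector by softmax(query . a_i) and return the weighted sum."""
    scores = [sum(q * a for q, a in zip(query, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    return [sum(w * ann[d] for w, ann in zip(weights, annotations))
            for d in range(dim)]


def multi_scale_context(query: Vector, coarse: List[Vector],
                        fine: List[Vector]) -> Vector:
    """Attend over coarse (pooled) and fine (pre-pooling) annotations
    separately, then concatenate the two context vectors."""
    return attend(query, coarse) + attend(query, fine)


if __name__ == "__main__":
    # Hypothetical 2-D annotations: 2 coarse positions, 4 fine positions.
    coarse = [[1.0, 0.0], [0.0, 1.0]]
    fine = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.2, 0.8]]
    ctx = multi_scale_context([2.0, 0.0], coarse, fine)
    print(len(ctx))  # two 2-D context vectors, concatenated
```

Concatenating rather than summing keeps the coarse and fine evidence separate, leaving it to the decoder to decide how much each scale matters for the next LaTeX token.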
Symbol detection in online handwritten graphics using Faster R-CNN
Symbol detection techniques in online handwritten graphics (e.g., diagrams and
mathematical expressions) consist of methods specifically designed for a single
graphic type. In this work, we evaluate the Faster R-CNN object detection
algorithm as a general method for detecting symbols in handwritten graphics.
We evaluate different configurations of the Faster R-CNN method and point out
issues related to the handwritten nature of the data. Considering the online
recognition context, we evaluate the efficiency and accuracy trade-offs of using
deep neural networks of different complexities as feature extractors. We
evaluate the method on publicly available flowchart and mathematical expression
(CROHME-2016) datasets. Results show that Faster R-CNN can be used effectively
on both datasets, enabling the development of general methods for symbol
detection and, furthermore, of general graphic understanding methods that
could be built on top of the algorithm.
Comment: Submitted to DAS-201
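Detectors such as Faster R-CNN are usually scored by matching predicted boxes to ground-truth symbols through intersection over union (IoU). The sketch below shows one common way to do this: greedy matching from the highest-scoring detection down, with each ground-truth box used at most once. The 0.5 IoU threshold, the `(x_min, y_min, x_max, y_max)` box format, and the helper names `iou` and `match_detections` are assumptions for illustration. The abstract does not state the paper's exact evaluation protocol.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(dets: List[Tuple[Box, float]], truths: List[Box],
                     thresh: float = 0.5) -> Tuple[float, float]:
    """Greedy matching, highest confidence first; each ground-truth box
    can be matched at most once. Returns (precision, recall)."""
    matched = set()
    tp = 0
    for box, _score in sorted(dets, key=lambda d: d[1], reverse=True):
        best_j, best_iou = -1, thresh
        for j, gt in enumerate(truths):
            if j in matched:
                continue
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
            tp += 1
    precision = tp / len(dets) if dets else 0.0
    recall = tp / len(truths) if truths else 0.0
    return precision, recall


if __name__ == "__main__":
    truths = [(0, 0, 10, 10), (20, 0, 30, 10)]  # hypothetical symbol boxes
    dets = [((1, 1, 10, 10), 0.9), ((0, 0, 10, 10), 0.8),
            ((50, 50, 60, 60), 0.3)]
    print(match_detections(dets, truths))
```

Here the duplicate detection of the first symbol counts as a false positive, because its ground-truth box has already been claimed.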
Math Search for the Masses: Multimodal Search Interfaces and Appearance-Based Retrieval
We summarize the math search engines and search interfaces produced by the
Document and Pattern Recognition Lab in recent years, in particular the min
math search interface and the Tangent search engine. Source code for both
systems is publicly available. "The Masses" refers to our emphasis on creating
systems for mathematical non-experts, who may be looking to define unfamiliar
notation or to browse documents based on the visual appearance of formulae
rather than their mathematical semantics.
Comment: Paper for Invited Talk at 2015 Conference on Intelligent Computer
Mathematics (July, Washington DC)
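Appearance-based retrieval can be illustrated with a toy version of symbol-pair matching. Each formula is reduced to a set of (symbol, symbol, spatial relation) tuples that describe how its symbols are laid out, and candidate formulae are ranked by how much their pair sets overlap the query's. This simplified sketch only conveys the idea of matching on visual layout rather than semantics. The tuple format, the relation labels `"next"` and `"sup"`, the hand-written index, and the Dice-coefficient scoring are all assumptions. None of them is the Tangent engine's actual index format or ranking function.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical pair format: (left/parent symbol, right/child symbol, relation).
Pair = Tuple[str, str, str]


def dice(query: Set[Pair], candidate: Set[Pair]) -> float:
    """Dice coefficient between two sets of symbol pairs."""
    if not query and not candidate:
        return 0.0
    return 2 * len(query & candidate) / (len(query) + len(candidate))


def rank(query: Set[Pair],
         index: Dict[str, Set[Pair]]) -> List[Tuple[str, float]]:
    """Return (formula id, score) sorted by descending score, then by id."""
    scored = [(fid, dice(query, pairs)) for fid, pairs in index.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hand-written pairs for x^2 + y, x^2, and a + b.
    index = {
        "f1": {("x", "2", "sup"), ("x", "+", "next"), ("+", "y", "next")},
        "f2": {("x", "2", "sup")},
        "f3": {("a", "+", "next"), ("+", "b", "next")},
    }
    query = {("x", "2", "sup"), ("x", "+", "next")}  # pairs for x^2 + ...
    for fid, score in rank(query, index):
        print(fid, round(score, 3))
```

Because the score depends only on which symbols appear in which spatial arrangement, a user can find formulae that look like the query without knowing what the notation means. This is the appearance-based emphasis the abstract describes.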