Recognition-based Approach of Numeral Extraction in Handwritten Chemistry Documents using Contextual Knowledge

Abstract

This paper presents a complete procedure that uses contextual and syntactic information to identify and recognize amount fields in the table regions of chemistry documents. The proposed method is composed of two main modules. First, a structural analysis based on the dimensions and positions of connected components (CCs) identifies some special symbols and clusters the remaining CCs into three groups: fragments of characters, isolated characters, and connected characters. Second, each group of CCs receives its own processing. Fragments of characters are merged with the nearest character or string using rules based on geometric relationships. Isolated characters are sent to a recognition module that identifies the numeral components. For connected characters, the final decision on the nature of the string (numeric or non-numeric) is based on a global score computed over the full string from the height regularity property and the recognition probabilities of its segmented fragments. Finally, a simple syntactic verification at the table row level is conducted to correct possible errors. The experimental tests are carried out on real-world chemistry documents provided by our industrial partner eNovalys. The results obtained show the effectiveness of the proposed system in extracting amount fields.
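
The abstract does not give the grouping rules or the form of the global score, so the following Python sketch is only an illustration of how the two steps could fit together. Every name, threshold, and weight in it (`classify_cc`, `height_regularity`, `numeric_score`, `alpha`, the 0.4 and 1.2 size ratios, the 0.7 decision threshold) is a hypothetical choice made for this example and is not taken from the paper.

```python
from statistics import mean, pstdev


def classify_cc(width, height, ref_height):
    """Assign a connected component to one of three groups from its bounding box.

    ref_height is an assumed estimate of the typical character height.
    The 0.4 and 1.2 ratios are hypothetical thresholds, not the paper's.
    """
    if height < 0.4 * ref_height and width < 0.4 * ref_height:
        return "fragment"      # small in both directions: likely a piece of a character
    if width <= 1.2 * ref_height:
        return "isolated"      # about one character wide
    return "connected"         # wide component: likely several touching characters


def height_regularity(heights):
    """Return 1.0 when all segment heights are equal, lower as they vary more."""
    m = mean(heights)
    if m == 0:
        return 0.0
    return max(0.0, 1.0 - pstdev(heights) / m)


def numeric_score(heights, digit_probs, alpha=0.5):
    """Combine height regularity with the mean digit-recognition probability.

    The weighted sum and the weight alpha are assumptions for illustration.
    """
    return alpha * height_regularity(heights) + (1 - alpha) * mean(digit_probs)


def is_numeric(heights, digit_probs, threshold=0.7):
    """Decide numeric vs non-numeric for a segmented string (hypothetical threshold)."""
    return numeric_score(heights, digit_probs) >= threshold
```

Under these assumptions, a string whose segments share the same height and are each recognized as a digit with high probability scores close to 1, while a string with irregular heights and low digit probabilities falls below the threshold and is treated as non-numeric.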
