
    SpottingNet: Learning the Similarity of Word Images with Convolutional Neural Network for Word Spotting in Handwritten Historical Documents

    Word spotting is a content-based retrieval process that obtains a ranked list of word-image candidates similar to a query word in digitized document images. In this paper, we propose a similarity score fusion method integrated with hybrid deep-learning classification and regression models to enhance performance for Query-by-Example (QBE) word spotting. Built on an end-to-end convolutional neural network framework, the presented models jointly learn representative word-image descriptors and evaluate the similarity measure between descriptors directly from the word images, which are the two crucial factors in this task. In addition, we present a sample generation method that uses location jitter to balance similar and dissimilar image pairs and to enlarge the dataset. Experiments are conducted on the classical George Washington (GW) dataset without involving any recognition methods or prior word-category information. Our experiments show that the proposed model yields a state-of-the-art mean average precision (mAP) of 80.03%, significantly outperforming previous results.
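    The score fusion the abstract describes can be illustrated with a minimal sketch: a classification head outputs a probability that two word images match, a regression head outputs a continuous similarity, and the two are blended to rank candidates. The function names, the mixing weight `alpha`, and the toy scores below are all hypothetical stand-ins, not the paper's actual formulation.

    ```python
    def fuse_similarity(cls_prob, reg_score, alpha=0.6):
        """Weighted fusion of a classification probability and a regression
        similarity score (alpha is a hypothetical mixing weight)."""
        return alpha * cls_prob + (1.0 - alpha) * reg_score

    def rank_candidates(query_scores):
        """query_scores: list of (candidate_id, cls_prob, reg_score) tuples
        for one query word image. Returns candidate ids, best match first."""
        return [cid for cid, c, r in
                sorted(query_scores,
                       key=lambda t: fuse_similarity(t[1], t[2]),
                       reverse=True)]

    # Toy scores for three candidate word images against one query.
    ranked = rank_candidates([("img_a", 0.9, 0.8),
                              ("img_b", 0.4, 0.9),
                              ("img_c", 0.2, 0.1)])
    print(ranked)  # → ['img_a', 'img_b', 'img_c']
    ```

    Ranking the entire collection by such a fused score is what produces the retrieval list that mAP is computed over.
    
    
    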

    A Question-Answering Approach to Key Value Pair Extraction from Form-Like Document Images

    In this paper, we present a new question-answering (QA) based key-value pair extraction approach, called KVPFormer, to robustly extract key-value relationships between entities from form-like document images. Specifically, KVPFormer first identifies key entities from all entities in an image with a Transformer encoder, then takes these key entities as questions and feeds them into a Transformer decoder to predict their corresponding answers (i.e., value entities) in parallel. To achieve higher answer prediction accuracy, we further propose a coarse-to-fine answer prediction approach, which first extracts multiple answer candidates for each identified question in the coarse stage and then selects the most likely one among these candidates in the fine stage. In this way, the learning difficulty of answer prediction is effectively reduced, so prediction accuracy improves. Moreover, we introduce a spatial compatibility attention bias into the self-attention/cross-attention mechanism so that KVPFormer better models the spatial interactions between entities. With these new techniques, our proposed KVPFormer achieves state-of-the-art results on the FUNSD and XFUND datasets, outperforming the previous best-performing method by 7.2% and 13.2% in F1 score, respectively.
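    The coarse-to-fine selection described above can be sketched as two plain ranking passes: a cheap coarse score keeps a shortlist of answer candidates, and a more careful fine score picks one winner from that shortlist. The scoring lambdas below are hypothetical toy proxies for the model's coarse and fine prediction heads, not the paper's actual scoring.

    ```python
    def coarse_to_fine(question, candidates, coarse_score, fine_score, top_k=3):
        """Two-stage answer selection: keep the top_k candidates by the coarse
        score, then rescore only those with the fine score and return the best."""
        shortlist = sorted(candidates,
                           key=lambda c: coarse_score(question, c),
                           reverse=True)[:top_k]
        return max(shortlist, key=lambda c: fine_score(question, c))

    # Toy example: a "question" (key entity) and candidate "answers" (value entities).
    coarse = lambda q, c: -abs(len(q) - len(c))      # cheap length-based proxy
    fine = lambda q, c: sum(ch.isdigit() for ch in c)  # prefer numeric values here
    best = coarse_to_fine("total:", ["$12.00", "date", "name", "12"], coarse, fine)
    print(best)  # → $12.00
    ```

    Restricting the expensive fine stage to a small shortlist is what the abstract credits with reducing the learning difficulty of answer prediction.
    
    
    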