441 research outputs found

    Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition

    Full text link
    Handwritten mathematical expression recognition (HMER) has attracted extensive attention recently. However, current methods cannot explicitly study the interactions between different symbols, which may fail when faced similar symbols. To alleviate this issue, we propose a simple but efficient method to enhance semantic interaction learning (SIL). Specifically, we firstly construct a semantic graph based on the statistical symbol co-occurrence probabilities. Then we design a semantic aware module (SAM), which projects the visual and classification feature into semantic space. The cosine distance between different projected vectors indicates the correlation between symbols. And jointly optimizing HMER and SIL can explicitly enhances the model's understanding of symbol relationships. In addition, SAM can be easily plugged into existing attention-based models for HMER and consistently bring improvement. Extensive experiments on public benchmark datasets demonstrate that our proposed module can effectively enhance the recognition performance. Our method achieves better recognition performance than prior arts on both CROHME and HME100K datasets.Comment: 12 Page

    Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation

    Full text link
    The recently rising markup-to-image generation poses greater challenges as compared to natural image generation, due to its low tolerance for errors as well as the complex sequence and context correlations between markup and rendered image. This paper proposes a novel model named "Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment" (FSA-CDM), which introduces contrastive positive/negative samples into the diffusion model to boost performance for markup-to-image generation. Technically, we design a fine-grained cross-modal alignment module to well explore the sequence similarity between the two modalities for learning robust feature representations. To improve the generalization ability, we propose a contrast-augmented diffusion model to explicitly explore positive and negative samples by maximizing a novel contrastive variational objective, which is mathematically inferred to provide a tighter bound for the model's optimization. Moreover, the context-aware cross attention module is developed to capture the contextual information within markup language during the denoising process, yielding better noise prediction results. Extensive experiments are conducted on four benchmark datasets from different domains, and the experimental results demonstrate the effectiveness of the proposed components in FSA-CDM, significantly exceeding state-of-the-art performance by about 2%-12% DTW improvements. The code will be released at https://github.com/zgj77/FSACDM.Comment: Accepted to ACM MM 2023. The code will be released at https://github.com/zgj77/FSACD

    LiveSketch: Query Perturbations for Guided Sketch-based Visual Search

    Get PDF
    LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than one-shot process that helps disambiguate users' search intent. Our technical contributions are: a triplet convnet architecture that incorporates an RNN based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and so, targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, so suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a 67M image corpus.Comment: Accepted to CVPR 201

    BiOcularGAN: Bimodal Synthesis and Annotation of Ocular Images

    Full text link
    Current state-of-the-art segmentation techniques for ocular images are critically dependent on large-scale annotated datasets, which are labor-intensive to gather and often raise privacy concerns. In this paper, we present a novel framework, called BiOcularGAN, capable of generating synthetic large-scale datasets of photorealistic (visible light and near-infrared) ocular images, together with corresponding segmentation labels to address these issues. At its core, the framework relies on a novel Dual-Branch StyleGAN2 (DB-StyleGAN2) model that facilitates bimodal image generation, and a Semantic Mask Generator (SMG) component that produces semantic annotations by exploiting latent features of the DB-StyleGAN2 model. We evaluate BiOcularGAN through extensive experiments across five diverse ocular datasets and analyze the effects of bimodal data generation on image quality and the produced annotations. Our experimental results show that BiOcularGAN is able to produce high-quality matching bimodal images and annotations (with minimal manual intervention) that can be used to train highly competitive (deep) segmentation models (in a privacy aware-manner) that perform well across multiple real-world datasets. The source code for the BiOcularGAN framework is publicly available at https://github.com/dariant/BiOcularGAN.Comment: 13 pages, 14 figure

    DenseBAM-GI: Attention Augmented DeneseNet with momentum aided GRU for HMER

    Full text link
    The task of recognising Handwritten Mathematical Expressions (HMER) is crucial in the fields of digital education and scholarly research. However, it is difficult to accurately determine the length and complex spatial relationships among symbols in handwritten mathematical expressions. In this study, we present a novel encoder-decoder architecture (DenseBAM-GI) for HMER, where the encoder has a Bottleneck Attention Module (BAM) to improve feature representation and the decoder has a Gated Input-GRU (GI-GRU) unit with an extra gate to make decoding long and complex expressions easier. The proposed model is an efficient and lightweight architecture with performance equivalent to state-of-the-art models in terms of Expression Recognition Rate (exprate). It also performs better in terms of top 1, 2, and 3 error accuracy across the CROHME 2014, 2016, and 2019 datasets. DenseBAM-GI achieves the best exprate among all models on the CROHME 2019 dataset. Importantly, these successes are accomplished with a drop in the complexity of the calculation and a reduction in the need for GPU memory
    • …
    corecore