386 research outputs found
Convolutional neural network architecture for geometric matching
We address the problem of determining correspondences between two images in
agreement with a geometric model such as an affine or thin-plate spline
transformation, and estimating its parameters. The contributions of this work
are three-fold. First, we propose a convolutional neural network architecture
for geometric matching. The architecture is based on three main components that
mimic the standard steps of feature extraction, matching and simultaneous
inlier detection and model parameter estimation, while being trainable
end-to-end. Second, we demonstrate that the network parameters can be trained
from synthetically generated imagery without the need for manual annotation and
that our matching layer significantly increases generalization capabilities to
never seen before images. Finally, we show that the same model can perform both
instance-level and category-level matching giving state-of-the-art results on
the challenging Proposal Flow dataset.Comment: In 2017 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR 2017
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
The recent progress in diffusion-based text-to-image generation models has
significantly expanded generative capabilities via conditioning the text
descriptions. However, since relying solely on text prompts is still
restrictive for fine-grained customization, we aim to extend the boundaries of
conditional generation to incorporate diverse types of modalities, e.g.,
sketch, box, and style embedding, simultaneously. We thus design a multimodal
text-to-image diffusion model, coined as DiffBlender, that achieves the
aforementioned goal in a single model by training only a few small
hypernetworks. DiffBlender facilitates a convenient scaling of input
modalities, without altering the parameters of an existing large-scale
generative model to retain its well-established knowledge. Furthermore, our
study sets new standards for multimodal generation by conducting quantitative
and qualitative comparisons with existing approaches. By diversifying the
channels of conditioning modalities, DiffBlender faithfully reflects the
provided information or, in its absence, creates imaginative generation.Comment: 18 pages, 16 figures, and 3 table
SFNet: Learning Object-aware Semantic Correspondence
We address the problem of semantic correspondence, that is, establishing a
dense flow field between images depicting different instances of the same
object or scene category. We propose to use images annotated with binary
foreground masks and subjected to synthetic geometric deformations to train a
convolutional neural network (CNN) for this task. Using these masks as part of
the supervisory signal offers a good compromise between semantic flow methods,
where the amount of training data is limited by the cost of manually selecting
point correspondences, and semantic alignment ones, where the regression of a
single global geometric transformation between images may be sensitive to
image-specific details such as background clutter. We propose a new CNN
architecture, dubbed SFNet, which implements this idea. It leverages a new and
differentiable version of the argmax function for end-to-end training, with a
loss that combines mask and flow consistency with smoothness terms.
Experimental results demonstrate the effectiveness of our approach, which
significantly outperforms the state of the art on standard benchmarks.Comment: cvpr 2019 oral pape
Learning Semantic Correspondence Exploiting an Object-level Prior
We address the problem of semantic correspondence, that is, establishing a dense flow field between images depicting different instances of the same object or scene category. We propose to use images annotated with binary foreground masks and subjected to synthetic geometric deformations to train a convolutional neural network (CNN) for this task. Using these masks as part of the supervisory signal provides an object-level prior for the semantic correspondence task and offers a good compromise between semantic flow methods, where the amount of training data is limited by the cost of manually selecting point correspondences, and semantic alignment ones, where the regression of a single global geometric transformation between images may be sensitive to image-specific details such as background clutter. We propose a new CNN architecture, dubbed SFNet, which implements this idea. It leverages a new and differentiable version of the argmax function for end-to-end training, with a loss that combines mask and flow consistency with smoothness terms. Experimental results demonstrate the effectiveness of our approach, which significantly outperforms the state of the art on standard benchmarks
VGFlow: Visibility guided Flow Network for Human Reposing
The task of human reposing involves generating a realistic image of a person
standing in an arbitrary conceivable pose. There are multiple difficulties in
generating perceptually accurate images, and existing methods suffer from
limitations in preserving texture, maintaining pattern coherence, respecting
cloth boundaries, handling occlusions, manipulating skin generation, etc. These
difficulties are further exacerbated by the fact that the possible space of
pose orientation for humans is large and variable, the nature of clothing items
is highly non-rigid, and the diversity in body shape differs largely among the
population. To alleviate these difficulties and synthesize perceptually
accurate images, we propose VGFlow. Our model uses a visibility-guided flow
module to disentangle the flow into visible and invisible parts of the target
for simultaneous texture preservation and style manipulation. Furthermore, to
tackle distinct body shapes and avoid network artifacts, we also incorporate a
self-supervised patch-wise "realness" loss to improve the output. VGFlow
achieves state-of-the-art results as observed qualitatively and quantitatively
on different image quality metrics (SSIM, LPIPS, FID).Comment: 9 pages, 18 figures, computer visio
A framework for ancient and machine-printed manuscripts categorization
Document image understanding (DIU) has attracted a lot of attention and became an of active fields of research. Although, the ultimate goal of DIU is extracting textual information of a document image, many steps are involved in a such a process such as categorization, segmentation and layout analysis. All of these steps are needed in order to obtain an accurate result from character recognition or word recognition of a document image. One of the important steps in DIU is document image categorization (DIC) that is needed in many situations such as document image written or printed in more than one script, font or language. This step provides useful information for recognition system and helps in reducing its error by allowing to incorporate a category-specific Optical Character Recognition (OCR) system or word recognition (WR) system. This research focuses on the problem of DIC in different categories of scripts, styles and languages and establishes a framework for flexible representation and feature extraction that can be adapted to many DIC problem. The current methods for DIC have many limitations and drawbacks that restrict the practical usage of these methods. We proposed an efficient framework for categorization of document image based on patch representation and Non-negative Matrix Factorization (NMF). This framework is flexible and can be adapted to different categorization problem.
Many methods exist for script identification of document image but few of them addressed the problem in handwritten manuscripts and they have many limitations and drawbacks. Therefore, our first goal is to introduce a novel method for script identification of ancient manuscripts. The proposed method is based on patch representation in which the patches are extracted using skeleton map of a document images. This representation overcomes the limitation of the current methods about the fixed level of layout. The proposed feature extraction scheme based on Projective Non-negative Matrix Factorization (PNMF) is robust against noise and handwriting variation and can be used for different scripts. The proposed method has higher performance compared to state of the art methods and can be applied to different levels of layout.
The current methods for font (style) identification are mostly proposed to be applied on machine-printed document image and many of them can only be used for a specific level of layout. Therefore, we proposed new method for font and style identification of printed and handwritten manuscripts based on patch representation and Non-negative Matrix Tri-Factorization (NMTF). The images are represented by overlapping patches obtained from the foreground pixels. The position of these patches are set based on skeleton map to reduce the number of patches. Non-Negative Matrix Tri-Factorization is used to learn bases from each fonts (style) and then these bases are used to classify a new image based on minimum representation error. The proposed method can easily be extended to new fonts as the bases for each font are learned separately from the other fonts. This method is tested on two datasets of machine-printed and ancient manuscript and the results confirmed its performance compared to the state of the art methods.
Finally, we proposed a novel method for language identification of printed and handwritten manuscripts based on patch representation and Non-negative Matrix Tri-Factorization (NMTF). The current methods for language identification are based on textual data obtained by OCR engine or images data through coding and comparing with textual data. The OCR based method needs lots of processing and the current image based method are not applicable to cursive scripts such as Arabic. In this work we introduced a new method for language identification of machine-printed and handwritten manuscripts based on patch representation and NMTF. The patch representation provides the component of the Arabic script (letters) that can not be extracted simply by segmentation methods. Then NMTF is used for dictionary learning and generating codebooks that will be used to represent document image with a histogram. The proposed method is tested on two datasets of machine-printed and handwritten manuscripts and compared to n-gram features (text-based), texture features and codebook features (imagebased) to validate the performance.
The above proposed methods are robust against variation in handwritings, changes in the font (handwriting style) and presence of degradation and are flexible that can be used to various levels of layout (from a textline to paragraph). The methods in this research have been tested on datasets of handwritten and machine-printed manuscripts and compared to state-of-the-art methods. All of the evaluations show the efficiency, robustness and flexibility of the proposed methods for categorization of document image. As mentioned before the proposed strategies provide a framework for efficient and flexible representation and feature extraction for document image categorization. This frame work can be applied to different levels of layout, the information from different levels of layout can be merged and mixed and this framework can be extended to more complex situations and different tasks
- …