Few-to-few Cross-domain Object Matching
Cross-domain object matching refers to the task of inferring an unknown alignment between objects in two data collections that do not share a data representation. In recent years, several methods have been proposed for the special case in which each object is paired with exactly one object, resulting in a constrained optimization problem over permutations. A related problem formulation, cluster matching, seeks to match a cluster of objects in one data set to a cluster of objects in the other set; it can be considered a many-to-many extension of cross-domain object matching and can be solved without explicit constraints. In this work we study the intermediate region between these two special cases, presenting a range of Bayesian inference algorithms that also work for few-to-few cross-domain object matching problems, where constrained optimization is necessary but the optimization domain is broader than just permutations.
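As a rough, self-contained illustration of the constraint structure (not the paper's Bayesian inference procedure), the sketch below solves the one-to-one special case as a linear assignment problem and hints at a few-to-few relaxation; the data, cost function, and capacity k are hypothetical placeholders.

```python
# Minimal sketch of the one-to-one special case of cross-domain object
# matching, solved as a linear assignment problem. The paper's few-to-few
# setting broadens the permutation constraint so each object may match a
# small group of objects; its Bayesian algorithms are not reproduced here.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # objects in domain A (hypothetical features)
Y = rng.normal(size=(5, 3))   # objects in domain B; a shared space is assumed
                              # here purely for illustration

# Pairwise matching cost (e.g., a negative log-likelihood in a Bayesian model).
cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)

# Permutation (one-to-one) solution: each row matched to exactly one column.
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)))

# Few-to-few relaxation (greedy sketch only): allow each object up to k
# partners by keeping the k cheapest matches per row.
k = 2
few_to_few = {i: np.argsort(cost[i])[:k] for i in range(cost.shape[0])}
```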
Cross-Modal Concept Learning and Inference for Vision-Language Models
Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP,
establish the correlation between texts and images, achieving remarkable
success on various downstream tasks with fine-tuning. In existing fine-tuning
methods, the class-specific text description is matched against the whole
image. We recognize that this whole-image matching is not effective, since
images from the same class often contain a set of different semantic objects,
and an object further consists of a set of semantic parts or concepts.
Individual semantic parts or concepts may appear in image samples from
different classes. To address this issue, in this paper, we develop a new
method called cross-modal concept learning and inference (CCLI). Using the
powerful text-image correlation capability of CLIP, our method automatically
learns a large set of distinctive visual concepts from images using a set of
semantic text concepts. Based on these visual concepts, we construct a
discriminative representation of images and learn a concept inference network
to perform downstream image classification tasks, such as few-shot learning and
domain generalization. Extensive experimental results demonstrate that our CCLI
method improves upon current state-of-the-art methods by large margins, for example, by up to 8.0% on few-shot learning and by up to 1.3% on domain generalization.
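The following is a minimal sketch of the concept-activation idea using the public openai/CLIP package, not the authors' CCLI implementation; the concept bank, image path, and class count are hypothetical placeholders.

```python
# Sketch: score an image against a bank of semantic concepts with CLIP,
# then feed the concept activations to a small classification head.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

concepts = ["a wheel", "a wing", "fur", "feathers"]   # hypothetical concept bank
text = clip.tokenize([f"a photo of {c}" for c in concepts]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)

img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)

# Concept-activation vector: similarity of the image to each concept.
concept_scores = img_feat @ txt_feat.T        # shape (1, num_concepts)

# A downstream classifier (standing in for the paper's "concept inference
# network") would consume such concept-based representations.
classifier = torch.nn.Linear(len(concepts), 10)   # 10 hypothetical classes
logits = classifier(concept_scores.float())
```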
Class-Agnostic Counting
Nearly all existing counting methods are designed for a specific object
class. Our work, however, aims to create a counting model able to count any
class of object. To achieve this goal, we formulate counting as a matching
problem, enabling us to exploit the image self-similarity property that
naturally exists in object counting problems. We make the following three
contributions: first, a Generic Matching Network (GMN) architecture that can
potentially count any object in a class-agnostic manner; second, by
reformulating the counting problem as one of matching objects, we can take
advantage of the abundance of video data labeled for tracking, which contains
natural repetitions suitable for training a counting model. Such data enables
us to train the GMN. Third, to customize the GMN to different user
requirements, an adapter module is used to specialize the model with minimal
effort, i.e. using a few labeled examples, and adapting only a small fraction
of the trained parameters. This is a form of few-shot learning, which is
practical for domains where labels are limited due to requiring expert
knowledge (e.g. microbiology). We demonstrate the flexibility of our method on
a diverse set of existing counting benchmarks: specifically cells, cars, and
human crowds. The model achieves competitive performance on cell and crowd
counting datasets, and surpasses the state-of-the-art on the car dataset using
only three training images. When training on the entire dataset, the proposed
method outperforms all previous methods by a large margin.
Comment: Asian Conference on Computer Vision (ACCV), 2018
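A minimal sketch of the counting-as-matching idea, with random tensors standing in for CNN features; it shows only a cross-correlation matching step and a crude count, not the GMN architecture or its adapter module.

```python
# Sketch: match an exemplar patch against an image by treating its feature
# block as a convolution kernel, then count high-similarity responses.
import torch
import torch.nn.functional as F

# Hypothetical feature maps, e.g., from a shared CNN backbone.
image_feat = torch.randn(1, 64, 32, 32)    # (B, C, H, W) full-image features
exemplar_feat = torch.randn(1, 64, 3, 3)   # features of one exemplar patch

# Matching by cross-correlation: exemplar features act as the kernel.
similarity = F.conv2d(image_feat, exemplar_feat, padding=1)   # (1, 1, 32, 32)

# Crude count: standardize, threshold, and sum the similarity map. The GMN
# instead regresses a density/count from such matching features.
sim = (similarity - similarity.mean()) / similarity.std()
count = (sim > 2.0).float().sum()
print(count.item())
```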
Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching
This paper presents a robotic pick-and-place system that is capable of
grasping and recognizing both known and novel objects in cluttered
environments. The key new feature of the system is that it handles a wide range
of object categories without needing any task-specific training data for novel
objects. To achieve this, it first uses a category-agnostic affordance
prediction algorithm to select and execute among four different grasping
primitive behaviors. It then recognizes picked objects with a cross-domain
image classification framework that matches observed images to product images.
Since product images are readily available for a wide range of objects (e.g.,
from the web), the system works out-of-the-box for novel objects without
requiring any additional training data. Exhaustive experimental results
demonstrate that our multi-affordance grasping achieves high success rates for
a wide variety of objects in clutter, and our recognition algorithm achieves
high accuracy for both known and novel grasped objects. The approach was part
of the MIT-Princeton Team system that took 1st place in the stowing task at the
2017 Amazon Robotics Challenge. All code, datasets, and pre-trained models are
available online at http://arc.cs.princeton.edu
Comment: Project webpage: http://arc.cs.princeton.edu Summary video: https://youtu.be/6fG7zwGfIk
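A minimal sketch of the cross-domain recognition step as nearest-neighbor retrieval over product-image embeddings; the embedding network, dimensions, and catalog size are hypothetical, and this is not the authors' released code.

```python
# Sketch: recognize a picked object by matching its observed-image embedding
# to the closest product-image embedding by cosine similarity.
import torch
import torch.nn.functional as F

# Hypothetical embeddings from a cross-domain metric-learning network.
observed = torch.randn(1, 256)      # embedding of the camera observation
products = torch.randn(100, 256)    # embeddings of 100 product images

observed = F.normalize(observed, dim=-1)
products = F.normalize(products, dim=-1)

# Nearest product image = predicted identity of the grasped object.
scores = observed @ products.T       # (1, 100) cosine similarities
best = scores.argmax(dim=-1)
print(f"matched product index: {best.item()}")
```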
Sketch-based 3D Shape Retrieval using Convolutional Neural Networks
Retrieving 3D models from 2D human sketches has received considerable
attention in the areas of graphics, image retrieval, and computer vision.
In almost all state-of-the-art approaches, a large number of "best views" are
computed for 3D models, with the hope that the query sketch matches one of
these 2D projections of the 3D models using predefined features.
We argue that this two-stage approach (view selection -- matching) is
pragmatic but also problematic because the "best views" are subjective and
ambiguous, which makes the matching inputs obscure. This imprecise nature of
matching further makes it challenging to choose features manually. Instead of
relying on the elusive concept of "best views" and the hand-crafted features,
we propose to define our views using a minimalist approach and learn features
for both sketches and views. Specifically, we drastically reduce the number of
views to only two predefined directions for the whole dataset. Then, we learn
two Siamese Convolutional Neural Networks (CNNs), one for the views and one for
the sketches. The loss function is defined on the within-domain as well as the
cross-domain similarities. Our experiments on three benchmark datasets
demonstrate that our method is significantly better than state-of-the-art
approaches, and outperforms them in all conventional metrics.
Comment: CVPR 2015
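A minimal sketch of a Siamese loss combining within-domain and cross-domain similarity terms, mirroring the loss structure the abstract describes; the contrastive formulation, batch size, margin, and labels are assumptions, not the paper's exact loss.

```python
# Sketch: two embedding branches (sketches, views) trained with a contrastive
# loss over within-domain pairs plus a cross-domain sketch-view term.
import torch
import torch.nn.functional as F

def contrastive(a, b, same, margin=1.0):
    # Pull same-shape pairs together, push different-shape pairs apart.
    d = F.pairwise_distance(a, b)
    return (same * d.pow(2) + (1.0 - same) * F.relu(margin - d).pow(2)).mean()

# Hypothetical embeddings from the two CNNs: a sketch branch and a view branch.
sketch_emb = torch.randn(8, 128, requires_grad=True)
view_emb = torch.randn(8, 128, requires_grad=True)
classes = torch.randint(0, 4, (8,))     # hypothetical shape-class labels

perm = torch.randperm(8)
same = (classes == classes[perm]).float()

# Within-domain terms (sketch-sketch, view-view) plus the cross-domain term.
loss = (contrastive(sketch_emb, sketch_emb[perm], same)
        + contrastive(view_emb, view_emb[perm], same)
        + contrastive(sketch_emb, view_emb[perm], same))
loss.backward()
```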