Attentional Selection in Object Recognition
A key problem in object recognition is selection, namely, the problem of identifying regions in an image within which to start the recognition process, ideally by isolating regions that are likely to come from a single object. Such a selection mechanism has been found to be crucial in reducing the combinatorial search involved in the matching stage of object recognition. Although selection helps recognition, it has remained largely unsolved because of the difficulty of isolating regions belonging to objects under complex imaging conditions involving occlusions and changes in illumination and object appearance. This thesis presents a novel approach to the selection problem by proposing a computational model of visual attentional selection as a paradigm for selection in recognition. In particular, it proposes two modes of attentional selection, namely, the attracted and pay-attention modes, as being appropriate for data-driven and model-driven selection in recognition. An implementation of this model has led to new ways of extracting color, texture, and line group information in images, and of using that information to isolate areas of the scene likely to contain the model object. Among the specific results in this thesis are a method of specifying color by perceptual color categories for fast color region segmentation and color-based localization of objects, and a result showing that texture patterns on model objects can be recognized under changes in orientation and occlusions without detailed segmentation. The thesis also presents an evaluation of the proposed model by integrating it with a 3D-from-2D object recognition system and recording the improvement in performance. These results indicate that attentional selection can significantly reduce the computational bottleneck in object recognition, both by reducing the number of features and by reducing the number of matches during recognition, using the information derived during selection. Finally, these studies have revealed a surprising use of selection, namely, in the partial solution of the pose of a 3D object.
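As a rough illustration of specifying color by perceptual categories and using the result for region segmentation, consider the sketch below. It is not the method of the thesis: the hue boundaries, the saturation and brightness thresholds, the category names, and the functions `color_category` and `segment` are all hypothetical choices made for the example, since the abstract does not give the actual categories.

```python
import colorsys
from collections import deque

# Hypothetical hue boundaries in degrees (upper bound, category name).
HUE_BINS = [(30, "red"), (90, "yellow"), (150, "green"),
            (210, "cyan"), (270, "blue"), (330, "magenta"), (360, "red")]


def color_category(r, g, b):
    """Map an RGB pixel (components 0-255) to a coarse color name."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if v < 0.2:                      # assumed darkness threshold
        return "black"
    if s < 0.2:                      # assumed threshold for achromatic pixels
        return "white" if v > 0.8 else "gray"
    degrees = h * 360
    for upper, name in HUE_BINS:
        if degrees < upper:
            return name
    return "red"


def segment(image):
    """Label 4-connected regions whose pixels share a color category.

    image is a list of rows, each row a list of (r, g, b) tuples.
    Returns (labels, regions): labels[y][x] is a region id and
    regions[id] is that region's category name.
    """
    height, width = len(image), len(image[0])
    cats = [[color_category(*px) for px in row] for row in image]
    labels = [[-1] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue
            region_id = len(regions)
            regions.append(cats[y][x])
            labels[y][x] = region_id
            queue = deque([(y, x)])
            while queue:             # breadth-first flood fill
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx),
                               (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and cats[ny][nx] == cats[y][x]):
                        labels[ny][nx] = region_id
                        queue.append((ny, nx))
    return labels, regions
```

Because pixels are compared by category name rather than by raw RGB distance, small shading differences within one category do not split a region, which is one plausible reason such a scheme could make segmentation fast.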
Construction of Latent Descriptor Space and Inference Model of Hand-Object Interactions
Appearance-based generic object recognition is a challenging problem because
all possible appearances of objects cannot be registered, especially as new
objects are produced every day. Object functions, however, have a
comparatively small number of prototypes. Therefore, function-based
classification of new objects could be a valuable tool for generic object
recognition. Object functions are closely related to hand-object interactions
during handling of a functional object; i.e., how the hand approaches the
object, which parts of the object contact the hand, and the shape of the
hand during interaction. Hand-object interactions are helpful for modeling
object functions. However, it is difficult to assign discrete labels to
interactions because object shapes and grasping hand postures intrinsically
vary continuously. To describe these interactions, we propose an
interaction descriptor space that is acquired from unlabeled appearances of
human hand-object interactions. By using interaction descriptors, we can
numerically describe the relation between an object's appearance and its
possible interaction with the hand. The model infers the quantitative state of
the interaction from the object image alone. It also identifies the parts of
objects designed for hand interactions such as grips and handles. We
demonstrate that the proposed method can generate, without supervision,
interaction descriptors that form clusters corresponding to interaction types.
We also demonstrate that the model can infer possible hand-object interactions
Flow-Guided Feature Aggregation for Video Object Detection
Extending state-of-the-art object detectors from image to video is
challenging. Detection accuracy suffers from degenerated object
appearances in videos, such as motion blur, video defocus, and rare poses.
Existing work attempts to exploit temporal information at the box level, but
such methods are not trained end-to-end. We present flow-guided feature
aggregation, an accurate, end-to-end learning framework for video object
detection. It instead leverages temporal coherence at the feature level: it
improves the per-frame features by aggregating nearby features along the
motion paths, and thus improves video recognition accuracy. Our method
significantly improves upon strong single-frame baselines on ImageNet VID,
especially for the more challenging fast-moving objects. Our framework is principled, and on par
with the best engineered systems winning the ImageNet VID challenges 2016,
without additional bells-and-whistles. The proposed method, together with Deep
Feature Flow, powered the winning entry of ImageNet VID challenges 2017. The
code is available at
https://github.com/msracver/Flow-Guided-Feature-Aggregation
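The idea of aggregating nearby features along motion paths can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released implementation: the function names `warp_bilinear` and `aggregate`, the (dx, dy) layout of the flow field, and the cosine-similarity softmax weighting are choices made for this example, and the feature maps and flow fields are taken as given arrays rather than computed by networks.

```python
import numpy as np


def warp_bilinear(feat, flow):
    """Warp a neighbor feature map onto the reference frame.

    feat: (C, H, W) features of a nearby frame.
    flow: (2, H, W) displacement (dx, dy) from each reference position
          to the matching position in the nearby frame (assumed layout).
    """
    _, h, w = feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling positions, clamped to the feature map borders.
    x = np.clip(xs + flow[0], 0, w - 1)
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    # Bilinear interpolation between the four surrounding cells.
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def aggregate(ref_feat, neighbor_feats, flows, eps=1e-8):
    """Average the reference features with flow-warped neighbor features.

    Each position is weighted by a softmax over the cosine similarity
    between the warped features and the reference features (an assumed
    weighting scheme for this sketch).
    """
    warped = [ref_feat] + [warp_bilinear(f, m)
                           for f, m in zip(neighbor_feats, flows)]
    warped = np.stack(warped)                            # (N, C, H, W)
    ref_norm = np.linalg.norm(ref_feat, axis=0) + eps    # (H, W)
    warped_norm = np.linalg.norm(warped, axis=1) + eps   # (N, H, W)
    sims = (warped * ref_feat).sum(axis=1) / (warped_norm * ref_norm)
    weights = np.exp(sims)
    weights /= weights.sum(axis=0, keepdims=True)        # softmax over frames
    return (weights[:, None] * warped).sum(axis=0)       # (C, H, W)
```

With zero flow and neighbors identical to the reference frame, `aggregate` returns the reference features unchanged; neighbors whose warped features disagree with the reference receive lower weight at that position.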