Occlusion-Aware Instance Segmentation via BiLayer Network Architectures
Segmenting highly overlapping image objects is challenging because there is
typically no distinction between real object contours and occlusion boundaries
in images. Unlike previous instance segmentation methods, we model image
formation as a composition of two overlapping layers, and propose Bilayer
Convolutional Network (BCNet), where the top layer detects occluding objects
(occluders) and the bottom layer infers partially occluded instances
(occludees). The explicit modeling of occlusion relationship with bilayer
structure naturally decouples the boundaries of both the occluding and occluded
instances, and considers the interaction between them during mask regression.
We investigate the efficacy of bilayer structure using two popular
convolutional network designs, namely, Fully Convolutional Network (FCN) and
Graph Convolutional Network (GCN). Further, we formulate bilayer decoupling
using the vision transformer (ViT), by representing instances in the image as
separate learnable occluder and occludee queries. Large and consistent
improvements using one/two-stage and query-based object detectors with various
backbones and network layer choices validate the generalization ability of
bilayer decoupling, as shown by extensive experiments on image instance
segmentation benchmarks (COCO, KINS, COCOA) and video instance segmentation
benchmarks (YTVIS, OVIS, BDD100K MOTS), especially for heavy occlusion cases.
Code and data are available at https://github.com/lkeab/BCNet.
Comment: Extended version of "Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers", CVPR 2021 (arXiv:2103.12340)
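The bilayer decoupling idea can be caricatured in a few lines. The following is a minimal NumPy sketch, not BCNet's actual implementation: layer shapes, names, and the use of 1x1 convolutions are all illustrative. The point it shows is the structure: a top branch predicts the occluder mask, and the bottom branch conditions on that mask when predicting the occludee.

```python
import numpy as np

def conv1x1(x, w, b):
    # x: (C_in, H, W), w: (C_out, C_in), b: (C_out,) -> (C_out, H, W)
    return np.tensordot(w, x, axes=([1], [0])) + b[:, None, None]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bilayer_heads(roi_feat, w_top, b_top, w_bot, b_bot):
    # Top branch: predict the occluder's mask from the ROI features.
    occluder = sigmoid(conv1x1(roi_feat, w_top, b_top))    # (1, H, W)
    # Bottom branch: predict the occludee, conditioned on the occluder
    # by concatenating its mask back into the input features.
    cond = np.concatenate([roi_feat, occluder], axis=0)    # (C_in + 1, H, W)
    occludee = sigmoid(conv1x1(cond, w_bot, b_bot))        # (1, H, W)
    return occluder[0], occludee[0]
```

The explicit occluder output lets the occludee head "see" the occlusion boundary instead of entangling both contours in one mask.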
Robustness of Object Recognition under Extreme Occlusion in Humans and Computational Models
Most objects in the visual world are partially occluded, but humans can
recognize them without difficulty. However, it remains unknown whether object
recognition models such as convolutional neural networks (CNNs) can handle
real-world occlusion. It is also unclear whether efforts to make these models
robust to constant mask occlusion are effective against real-world occlusion.
We test both humans and these computational models on
a challenging task of object recognition under extreme occlusion, where target
objects are heavily occluded by irrelevant real objects in real backgrounds.
Our results show that human vision is very robust to extreme occlusion while
CNNs are not, even with modifications to handle constant mask occlusion. This
implies that the ability to handle constant mask occlusion does not entail
robustness to real-world occlusion. As a comparison, we propose another
computational model that utilizes object parts/subparts in a compositional
manner to build robustness to occlusion. This model performs significantly
better than CNN-based models on our task, with error patterns similar to those
of humans. These
findings suggest that testing under extreme occlusion can better reveal the
robustness of visual recognition, and that the principle of composition can
encourage such robustness.
Comment: To be presented at the 41st Annual Meeting of the Cognitive Science Society
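The compositional principle can be illustrated with a toy scoring rule; this is a schematic sketch, not the paper's model, and the scoring functions here are invented for illustration. A holistic template match is corrupted whenever any region is occluded, whereas a part-based score that keeps only the best-matching parts simply ignores parts an occluder has wiped out.

```python
import numpy as np

def holistic_score(image_vec, template):
    # One whole-object template: occluding any region corrupts the score.
    denom = np.linalg.norm(image_vec) * np.linalg.norm(template) + 1e-9
    return float(image_vec @ template) / denom

def compositional_score(part_vecs, part_templates, k=2):
    # Score a class by its k best-matching parts only, so parts wiped
    # out by an occluder are ignored rather than dragging the score down.
    sims = []
    for t in part_templates:
        sims.append(max(float(p @ t) for p in part_vecs))
    sims.sort(reverse=True)
    return sum(sims[:k]) / k
```

With one part zeroed out by occlusion, the compositional score is unchanged as long as k intact parts remain, which mirrors the robustness argument in the abstract.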
Learning Robust Object Recognition Using Composed Scenes from Generative Models
Recurrent feedback connections in the mammalian visual system have been
hypothesized to play a role in synthesizing input in the theoretical framework
of analysis by synthesis. The comparison of internally synthesized
representation with that of the input provides a validation mechanism during
perceptual inference and learning. Inspired by these ideas, we propose that
the synthesis machinery can compose new, unobserved images by imagination to
train the network itself so as to increase the robustness of the system in
novel scenarios. As a proof of concept, we investigated whether images composed
by imagination could help an object recognition system to deal with occlusion,
which is challenging for the current state-of-the-art deep convolutional neural
networks. We fine-tuned a network on images containing objects in various
occlusion scenarios that are imagined or self-generated through a deep
generator network. Trained on imagined occluded scenarios under the object
persistence constraint, our network discovered more subtle and localized image
features that were neglected by the original network for object classification,
obtaining better separability of different object classes in the feature space.
This leads to significant improvement of object recognition under occlusion for
our network relative to the original network trained only on un-occluded
images. In addition to providing practical benefits in object recognition under
occlusion, this work demonstrates that self-generated composition of visual
scenes through the synthesis loop, combined with the object persistence
constraint, can provide opportunities for neural networks to discover new
relevant patterns in the data and become more flexible in dealing with novel
situations.
Comment: Accepted by the 14th Conference on Computer and Robot Vision
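The data-composition step can be sketched as follows; this is an illustrative stand-in for the paper's deep-generator pipeline, with simple patch pasting in place of generative synthesis, and both function names are hypothetical. The key point it encodes is the object persistence constraint: every composed occluded image keeps the label of the underlying object.

```python
import numpy as np

def compose_occlusion(image, occluder, top, left):
    # Paste an occluder patch onto a clean image. The caller keeps the
    # original class label: the same object is still there behind it.
    out = image.copy()
    h, w = occluder.shape[:2]
    out[top:top + h, left:left + w] = occluder
    return out

def make_finetune_set(images, labels, occluder, positions):
    # Expand each labeled image into several occluded variants that all
    # inherit the original label (object persistence constraint).
    pairs = []
    for img, lab in zip(images, labels):
        for (t, l) in positions:
            pairs.append((compose_occlusion(img, occluder, t, l), lab))
    return pairs
```

Fine-tuning on such pairs forces the network to find features outside the occluded region, which is the mechanism the abstract credits for the improved separability.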
Boundary, Brightness, and Depth Interactions During Preattentive Representation and Attentive Recognition of Figure and Ground
This article applies a recent theory of 3-D biological vision, called FACADE Theory, to explain several percepts which Kanizsa pioneered. These include 3-D pop-out of an occluding form in front of an occluded form, leading to completion and recognition of the occluded form; 3-D transparent and opaque percepts of Kanizsa squares, with and without Varin wedges; and interactions between percepts of illusory contours, brightness, and depth in response to 2-D Kanizsa images. These explanations clarify how a partially occluded object representation can be completed for purposes of object recognition, without the completed part of the representation necessarily being seen. The theory traces these percepts to neural mechanisms that compensate for measurement uncertainty and complementarity at individual cortical processing stages by using parallel and hierarchical interactions among several cortical processing stages. These interactions are modelled by a Boundary Contour System (BCS) that generates emergent boundary segmentations and a complementary Feature Contour System (FCS) that fills-in surface representations of brightness, color, and depth. The BCS and FCS interact reciprocally with an Object Recognition System (ORS) that binds BCS boundary and FCS surface representations into attentive object representations. The BCS models the parvocellular LGN→Interblob→Interstripe→V4 cortical processing stream, the FCS models the parvocellular LGN→Blob→Thin Stripe→V4 cortical processing stream, and the ORS models inferotemporal cortex.
Air Force Office of Scientific Research (F49620-92-J-0499); Defense Advanced Research Projects Agency (N00014-92-J-4015); Office of Naval Research (N00014-91-J-4100)
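The FCS filling-in mechanism admits a compact computational sketch: a feature signal diffuses between neighbouring pixels except where a boundary signal blocks the exchange, so brightness fills in only within closed boundary compartments. This is a toy discretization for illustration, not Grossberg's full model, and the function name and iteration scheme are assumptions.

```python
import numpy as np

def fill_in(feature, boundary, iters=100):
    # Iteratively average each pixel with its 4-neighbours, but block the
    # exchange wherever a boundary signal is present, so the feature
    # signal spreads only within closed boundary compartments.
    f = feature.astype(float).copy()
    H, W = f.shape
    for _ in range(iters):
        new = f.copy()
        for y in range(H):
            for x in range(W):
                if boundary[y, x]:
                    continue  # boundary cells do not fill in
                vals = [f[y, x]]
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and not boundary[ny, nx]:
                        vals.append(f[ny, nx])
                new[y, x] = sum(vals) / len(vals)
        f = new
    return f
```

A bright seed on one side of a complete boundary spreads through its own compartment but never crosses the boundary, which is the sense in which BCS boundaries gate FCS surface filling-in.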
A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects
Tracking humans that are interacting with other subjects or the environment
remains unsolved in visual tracking, because the visibility of the humans of
interest in videos is unknown and might vary over time. In particular, it is
still difficult for state-of-the-art human trackers to recover complete human
trajectories in crowded scenes with frequent human interactions. In this work,
we consider the visibility status of a subject as a fluent variable, whose
change is mostly attributed to the subject's interaction with the surrounding,
e.g., crossing behind another object, entering a building, or getting into a
vehicle, etc. We introduce a Causal And-Or Graph (C-AOG) to represent the
causal-effect relations between an object's visibility fluent and its
activities, and develop a probabilistic graph model to jointly reason the
visibility fluent change (e.g., from visible to invisible) and track humans in
videos. We formulate this joint task as an iterative search for a feasible
causal graph structure that enables fast search algorithms, e.g., dynamic
programming. We apply the proposed method to challenging video sequences
to evaluate its capabilities of estimating visibility fluent changes of
subjects and tracking subjects of interest over time. Results with comparisons
demonstrate that our method outperforms the alternative trackers and can
recover complete trajectories of humans in complicated scenarios with frequent
human interactions.
Comment: Accepted by CVPR 2018
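The joint reasoning over visibility fluents and tracks can be illustrated with a standard dynamic program over per-frame states. This is a generic Viterbi-style sketch, not the paper's C-AOG formulation: the state names, score shapes, and transition costs are made up for illustration.

```python
def best_fluent_sequence(obs_scores, trans_cost):
    # obs_scores: list over frames of {state: score of the detector
    # evidence under that state}. trans_cost: {(prev_state, state): cost}
    # penalising fluent changes. Returns the max-score state sequence.
    states = list(obs_scores[0].keys())
    score = {s: obs_scores[0][s] for s in states}
    back = []
    for t in range(1, len(obs_scores)):
        new_score, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: score[p] - trans_cost.get((p, s), 1.0))
            new_score[s] = score[prev] - trans_cost.get((prev, s), 1.0) + obs_scores[t][s]
            ptr[s] = prev
        back.append(ptr)
        score = new_score
    # Backtrack from the best final state.
    last = max(states, key=lambda s: score[s])
    seq = [last]
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return list(reversed(seq))
```

The dynamic program recovers, e.g., a visible → occluded → visible sequence when the evidence for each frame points that way, which is the flavour of fluent-change reasoning the abstract describes.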
Class-Agnostic Counting
Nearly all existing counting methods are designed for a specific object
class. Our work, however, aims to create a counting model able to count any
class of object. To achieve this goal, we formulate counting as a matching
problem, enabling us to exploit the image self-similarity property that
naturally exists in object counting problems. We make the following three
contributions: first, a Generic Matching Network (GMN) architecture that can
potentially count any object in a class-agnostic manner; second, by
reformulating the counting problem as one of matching objects, we can take
advantage of the abundance of video data labeled for tracking, which contains
natural repetitions suitable for training a counting model. Such data enables
us to train the GMN. Third, to customize the GMN to different user
requirements, an adapter module is used to specialize the model with minimal
effort, i.e. using a few labeled examples, and adapting only a small fraction
of the trained parameters. This is a form of few-shot learning, which is
practical for domains where labels are limited due to requiring expert
knowledge (e.g. microbiology). We demonstrate the flexibility of our method on
a diverse set of existing counting benchmarks: specifically cells, cars, and
human crowds. The model achieves competitive performance on cell and crowd
counting datasets, and surpasses the state-of-the-art on the car dataset using
only three training images. When training on the entire dataset, the proposed
method outperforms all previous methods by a large margin.
Comment: Asian Conference on Computer Vision (ACCV), 2018
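The matching view of counting can be sketched with plain normalized cross-correlation: slide one exemplar over the image and count windows that match above a threshold. This is a toy stand-in for the learned GMN features and adapter; the function name and threshold are illustrative assumptions.

```python
import numpy as np

def match_count(image, exemplar, thresh=0.95):
    # Count windows whose zero-mean normalized correlation with the
    # exemplar exceeds a threshold: counting reduced to matching, so
    # any class is countable from a single example.
    ph, pw = exemplar.shape
    ex = exemplar - exemplar.mean()
    ex_n = np.linalg.norm(ex) + 1e-9
    H, W = image.shape
    count = 0
    for y in range(H - ph + 1):
        for x in range(W - pw + 1):
            win = image[y:y + ph, x:x + pw]
            w = win - win.mean()
            sim = float((w * ex).sum()) / ((np.linalg.norm(w) + 1e-9) * ex_n)
            if sim > thresh:
                count += 1
    return count
```

A real system would run the correlation in a learned feature space and suppress overlapping detections; here exact copies of a distinctive patch score near 1.0 while partial overlaps score well below the threshold.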