20,040 research outputs found
Fast, invariant representation for human action in the visual system
Humans can effortlessly recognize others' actions in the presence of complex
transformations, such as changes in viewpoint. Several studies have located the
regions in the brain involved in invariant action recognition, however, the
underlying neural computations remain poorly understood. We use
magnetoencephalography (MEG) decoding and a dataset of well-controlled,
naturalistic videos of five actions (run, walk, jump, eat, drink) performed by
different actors at different viewpoints to study the computational steps used
to recognize actions across complex transformations. In particular, we ask when
the brain discounts changes in 3D viewpoint relative to when it initially
discriminates between actions. We measure the latency difference between
invariant and non-invariant action decoding when subjects view full videos as
well as form-depleted and motion-depleted stimuli. Our results show no
difference in decoding latency or temporal profile between invariant and
non-invariant action recognition in full videos. However, when either form or
motion information is removed from the stimulus set, we observe a decrease and
delay in invariant action decoding. Our results suggest that the brain
recognizes actions and builds invariance to complex transformations at the same
time, and that both form and motion information are crucial for fast, invariant
action recognition
Person Re-identification by Local Maximal Occurrence Representation and Metric Learning
Person re-identification is an important technique towards automatic search
of a person's presence in a surveillance video. Two fundamental problems are
critical for person re-identification, feature representation and metric
learning. An effective feature representation should be robust to illumination
and viewpoint changes, and a discriminant metric should be learned to match
various person images. In this paper, we propose an effective feature
representation called Local Maximal Occurrence (LOMO), and a subspace and
metric learning method called Cross-view Quadratic Discriminant Analysis
(XQDA). The LOMO feature analyzes the horizontal occurrence of local features,
and maximizes the occurrence to make a stable representation against viewpoint
changes. Besides, to handle illumination variations, we apply the Retinex
transform and a scale invariant texture operator. To learn a discriminant
metric, we propose to learn a discriminant low dimensional subspace by
cross-view quadratic discriminant analysis, and simultaneously, a QDA metric is
learned on the derived subspace. We also present a practical computation method
for XQDA, as well as its regularization. Experiments on four challenging person
re-identification databases, VIPeR, QMUL GRID, CUHK Campus, and CUHK03, show
that the proposed method improves the state-of-the-art rank-1 identification
rates by 2.2%, 4.88%, 28.91%, and 31.55% on the four databases, respectively.Comment: This paper has been accepted by CVPR 2015. For source codes and
extracted features please visit
http://www.cbsr.ia.ac.cn/users/scliao/projects/lomo_xqda
Learning optimised representations for view-invariant gait recognition
Gait recognition can be performed without subject cooperation under harsh conditions, thus it is an important tool in forensic gait analysis, security control, and other commercial applications. One critical issue that prevents gait recognition systems from being widely accepted is the performance drop when the camera viewpoint varies between the registered templates and the query data. In this paper, we explore the potential of combining feature optimisers and representations learned by convolutional neural networks (CNN) to achieve efficient view-invariant gait recognition. The experimental results indicate that CNN learns highly discriminative representations across moderate view variations, and these representations can be further improved using view-invariant feature selectors, achieving a high matching accuracy across views
Automatic vehicle tracking and recognition from aerial image sequences
This paper addresses the problem of automated vehicle tracking and
recognition from aerial image sequences. Motivated by its successes in the
existing literature focus on the use of linear appearance subspaces to describe
multi-view object appearance and highlight the challenges involved in their
application as a part of a practical system. A working solution which includes
steps for data extraction and normalization is described. In experiments on
real-world data the proposed methodology achieved promising results with a high
correct recognition rate and few, meaningful errors (type II errors whereby
genuinely similar targets are sometimes being confused with one another).
Directions for future research and possible improvements of the proposed method
are discussed
How can cells in the anterior medial face patch be viewpoint invariant?
In a recent paper, Freiwald and Tsao (2010) found evidence that the responses of cells in the macaque anterior medial (AM) face patch are invariant to significant changes in viewpoint. The monkey subjects had no prior experience with the individuals depicted in the stimuli and were never given an opportunity to view the same individual from different viewpoints sequentially. These results cannot be explained by a mechanism based on temporal association of experienced views. Employing a biologically plausible model of object recognition (software available at cbcl.mit.edu), we show two mechanisms which could account for these results. First, we show that hair style and skin color provide sufficient information to enable viewpoint recognition without resorting to any mechanism that associates images across views. It is likely that a large part of the effect described in patch AM is attributable to these cues. Separately, we show that it is possible to further improve view-invariance using class-specific features (see Vetter 1997). Faces, as a class, transform under 3D rotation in similar enough ways that it is possible to use previously viewed example faces to learn a general model of how all faces rotate. Novel faces can be encoded relative to these previously encountered “template” faces and thus recognized with some degree of invariance to 3D rotation. Since each object class transforms differently under 3D rotation, it follows that invariant recognition from a single view requires a recognition architecture with a detection step determining the class of an object (e.g. face or non-face) prior to a subsequent identification stage utilizing the appropriate class-specific features
- …