430 research outputs found
Functional Compartmentalization and Viewpoint Generalization Within the Macaque Face-Processing System
Primates can recognize faces across a range of viewing conditions. Representations of individual identity should thus exist that are invariant to accidental image transformations like view direction. We targeted the recently discovered face-processing network of the macaque monkey that consists of six interconnected face-selective regions and recorded from the two middle patches (ML, middle lateral, and MF, middle fundus) and two anterior patches (AL, anterior lateral, and AM, anterior medial). We found that the anatomical position of a face patch was associated with a unique functional identity: Face patches differed qualitatively in how they represented identity across head orientations. Neurons in ML and MF were view-specific; neurons in AL were tuned to identity mirror-symetrically across views, thus achieving partial view invariance; and neurons in AM, the most anterior face patch, achieved almost full view invariance
Evidence for view-invariant Face Recognition Units in unfamiliar face learning
Many models of face recognition incorporate the idea of a face recognition unit (FRU). This is an abstracted representation formed from each experience of a face. Longmore et al. (2008) devised a face learning experiment to investigate such a construct (i.e., view-invariance) but failed to find evidence of its existence. Three experiments developed Longmore et al.’s study further by using a different learning task, by employing more stimuli. One or two views of previously unfamiliar faces were shown to participants in a serial matching task (learning). Later, participants attempted to recognise both seen and novel views of the learned faces. Experiment one tested participants’ recognition of a novel view, a day after learning. Experiment two was identical, but tested participants on the same day as learning. And experiment three repeated experiment one, but tested participants on a novel view that was outside the rotation of those views learned. Results revealed a significant advantage for recognising a novel view when two views had been learned, rather than a single learned view – for all experiments. The effect of view-invariance found when both views were learned is discussed
Multi-View Region Adaptive Multi-temporal DMM and RGB Action Recognition
Human action recognition remains an important yet challenging task. This work
proposes a novel action recognition system. It uses a novel Multiple View
Region Adaptive Multi-resolution in time Depth Motion Map (MV-RAMDMM)
formulation combined with appearance information. Multiple stream 3D
Convolutional Neural Networks (CNNs) are trained on the different views and
time resolutions of the region adaptive Depth Motion Maps. Multiple views are
synthesised to enhance the view invariance. The region adaptive weights, based
on localised motion, accentuate and differentiate parts of actions possessing
faster motion. Dedicated 3D CNN streams for multi-time resolution appearance
information (RGB) are also included. These help to identify and differentiate
between small object interactions. A pre-trained 3D-CNN is used here with
fine-tuning for each stream along with multiple class Support Vector Machines
(SVM)s. Average score fusion is used on the output. The developed approach is
capable of recognising both human action and human-object interaction. Three
public domain datasets including: MSR 3D Action,Northwestern UCLA multi-view
actions and MSR 3D daily activity are used to evaluate the proposed solution.
The experimental results demonstrate the robustness of this approach compared
with state-of-the-art algorithms.Comment: 14 pages, 6 figures, 13 tables. Submitte
Study Of Human Activity In Video Data With An Emphasis On View-invariance
The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, HCI, ergonomics, etc. In this thesis, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed include perspective distortions, differences in viewpoints, anthropometric variations, and the large degrees of freedom of articulated bodies. In addition, we are interested in methods that require little or no training. The current solutions to action recognition usually assume that there is a huge dataset of actions available so that a classifier can be trained. However, this means that in order to define a new action, the user has to record a number of videos from different viewpoints with varying camera intrinsic parameters and then retrain the classifier, which is not very practical from a development point of view. We propose algorithms that overcome these challenges and require just a few instances of the action from any viewpoint with any intrinsic camera parameters. Our first algorithm is based on the rank constraint on the family of planar homographies associated with triplets of body points. We represent action as a sequence of poses, and decompose the pose into triplets. Therefore, the pose transition is broken down into a set of movement of body point planes. In this way, we transform the non-rigid motion of the body points into a rigid motion of body point iii planes. We use the fact that the family of homographies associated with two identical poses would have rank 4 to gauge similarity of the pose between two subjects, observed by different perspective cameras and from different viewpoints. This method requires only one instance of the action. We then show that it is possible to extend the concept of triplets to line segments. In particular, we establish that if we look at the movement of line segments instead of triplets, we have more redundancy in data thus leading to better results. We demonstrate this concept on “fundamental ratios.” We decompose a human body pose into line segments instead of triplets and look at set of movement of line segments. This method needs only three instances of the action. If a larger dataset is available, we can also apply weighting on line segments for better accuracy. The last method is based on the concept of “Projective Depth”. Given a plane, we can find the relative depth of a point relative to the given plane. We propose three different ways of using “projective depth:” (i) Triplets - the three points of a triplet along with the epipole defines the plane and the movement of points relative to these body planes can be used to recognize actions; (ii) Ground plane - if we are able to extract the ground plane, we can find the “projective depth” of the body points with respect to it. Therefore, the problem of action recognition would translate to curve matching; and (iii) Mirror person - We can use the mirror view of the person to extract mirror symmetric planes. This method also needs only one instance of the action. Extensive experiments are reported on testing view invariance, robustness to noisy localization and occlusions of body points, and action recognition. The experimental results are very promising and demonstrate the efficiency of our proposed invariants. i
View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation
The primate brain contains a hierarchy of visual areas, dubbed the ventral
stream, which rapidly computes object representations that are both specific
for object identity and relatively robust against identity-preserving
transformations like depth-rotations. Current computational models of object
recognition, including recent deep learning networks, generate these properties
through a hierarchy of alternating selectivity-increasing filtering and
tolerance-increasing pooling operations, similar to simple-complex cells
operations. While simulations of these models recapitulate the ventral stream's
progression from early view-specific to late view-tolerant representations,
they fail to generate the most salient property of the intermediate
representation for faces found in the brain: mirror-symmetric tuning of the
neural population to head orientation. Here we prove that a class of
hierarchical architectures and a broad set of biologically plausible learning
rules can provide approximate invariance at the top level of the network. While
most of the learning rules do not yield mirror-symmetry in the mid-level
representations, we characterize a specific biologically-plausible Hebb-type
learning rule that is guaranteed to generate mirror-symmetric tuning to faces
tuning at intermediate levels of the architecture
Histogram of Oriented Principal Components for Cross-View Action Recognition
Existing techniques for 3D action recognition are sensitive to viewpoint
variations because they extract features from depth images which are viewpoint
dependent. In contrast, we directly process pointclouds for cross-view action
recognition from unknown and unseen views. We propose the Histogram of Oriented
Principal Components (HOPC) descriptor that is robust to noise, viewpoint,
scale and action speed variations. At a 3D point, HOPC is computed by
projecting the three scaled eigenvectors of the pointcloud within its local
spatio-temporal support volume onto the vertices of a regular dodecahedron.
HOPC is also used for the detection of Spatio-Temporal Keypoints (STK) in 3D
pointcloud sequences so that view-invariant STK descriptors (or Local HOPC
descriptors) at these key locations only are used for action recognition. We
also propose a global descriptor computed from the normalized spatio-temporal
distribution of STKs in 4-D, which we refer to as STK-D. We have evaluated the
performance of our proposed descriptors against nine existing techniques on two
cross-view and three single-view human action recognition datasets. The
Experimental results show that our techniques provide significant improvement
over state-of-the-art methods
How can cells in the anterior medial face patch be viewpoint invariant?
In a recent paper, Freiwald and Tsao (2010) found evidence that the responses of cells in the macaque anterior medial (AM) face patch are invariant to significant changes in viewpoint. The monkey subjects had no prior experience with the individuals depicted in the stimuli and were never given an opportunity to view the same individual from different viewpoints sequentially. These results cannot be explained by a mechanism based on temporal association of experienced views. Employing a biologically plausible model of object recognition (software available at cbcl.mit.edu), we show two mechanisms which could account for these results. First, we show that hair style and skin color provide sufficient information to enable viewpoint recognition without resorting to any mechanism that associates images across views. It is likely that a large part of the effect described in patch AM is attributable to these cues. Separately, we show that it is possible to further improve view-invariance using class-specific features (see Vetter 1997). Faces, as a class, transform under 3D rotation in similar enough ways that it is possible to use previously viewed example faces to learn a general model of how all faces rotate. Novel faces can be encoded relative to these previously encountered “template” faces and thus recognized with some degree of invariance to 3D rotation. Since each object class transforms differently under 3D rotation, it follows that invariant recognition from a single view requires a recognition architecture with a detection step determining the class of an object (e.g. face or non-face) prior to a subsequent identification stage utilizing the appropriate class-specific features
- …