K-means clustering for efficient and robust registration of multi-view point sets
Generally, there are three main factors that determine the practical
usability of registration, i.e., accuracy, robustness, and efficiency. In
real-time applications, efficiency and robustness are more important. To
promote these two abilities, we cast the multi-view registration into a
clustering task. All the centroids are uniformly sampled from the initially
aligned point sets involved in the multi-view registration, which makes it
rather efficient and effective for the clustering. Then, each point is assigned
to a single cluster and each cluster centroid is updated accordingly.
Subsequently, the shape comprised by all cluster centroids is used to
sequentially estimate the rigid transformation for each point set. For accuracy
and stability, clustering and transformation estimation are alternately and
iteratively applied to all point sets. We tested our proposed approach on
several benchmark datasets and compared it with state-of-the-art approaches.
Experimental results validate its efficiency and robustness for the
registration of multi-view point sets.
Supervised multiview learning based on simultaneous learning of multiview intact and single view classifier
The multiview learning problem refers to the problem of learning a classifier
from multiple-view data. In such a data set, each data point is represented by
multiple different views. In this paper, we propose a novel method for this
problem. This method is based on two assumptions. The first assumption is that
each data point has an intact feature vector, and each view is obtained by a
linear transformation from the intact vector. The second assumption is that the
intact vectors are discriminative, and in the intact space, we have a linear
classifier to separate the positive class from the negative class. We define an
intact vector for each data point, and a view-conditional transformation matrix
for each view, and propose to reconstruct the multiple view feature vectors by
the product of the corresponding intact vectors and transformation matrices.
Moreover, we also propose a linear classifier in the intact space, and learn it
jointly with the intact vectors. The learning problem is modeled by a
minimization problem, and the objective function is composed of a Cauchy error
estimator-based view-conditional reconstruction term over all data points and
views, and a classification error term measured by hinge loss over all the
intact vectors of all the data points. Some regularization terms are also
imposed on different variables in the objective function. The minimization
problem is solved by an iterative algorithm using an alternating optimization
strategy and gradient descent. The proposed algorithm shows its
advantage in comparison to other multiview learning algorithms on
benchmark data sets.
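To make the structure of the objective in this abstract concrete, the sketch below evaluates a reconstruction term, a hinge-loss term, and a regularizer for given variables. It is an illustration under assumptions, not the paper's formulation: the exact form of the Cauchy estimator (`log(1 + r^2 / c^2)`), the scale `c`, the squared-norm regularizer, the trade-off weights `alpha` and `beta`, and all function names are hypothetical.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def matvec(M, x):
    """Matrix-vector product with M given as a list of rows."""
    return [dot(row, x) for row in M]

def cauchy(residual_sq, c=1.0):
    """Cauchy error estimator on a squared residual norm (assumed form)."""
    return math.log(1.0 + residual_sq / (c * c))

def objective(views, intact, transforms, w, labels, alpha=1.0, beta=0.1):
    """views[j][i]: features of point i in view j; intact[i]: intact vector of point i;
    transforms[j]: matrix mapping the intact space to view j; labels[i] in {-1, +1}."""
    # Reconstruction: each view should be a linear transformation of the intact vector.
    recon = 0.0
    for j, W in enumerate(transforms):
        for i, x in enumerate(intact):
            pred = matvec(W, x)
            recon += cauchy(sum((a - b) ** 2 for a, b in zip(views[j][i], pred)))
    # Classification: hinge loss of a linear classifier w in the intact space.
    hinge = sum(max(0.0, 1.0 - y * dot(w, x)) for x, y in zip(intact, labels))
    # Regularization: squared norms of all variables (an assumption for this sketch).
    reg = dot(w, w) + sum(dot(x, x) for x in intact)
    reg += sum(dot(row, row) for W in transforms for row in W)
    return recon + alpha * hinge + beta * reg
```

An alternating scheme of the kind the abstract mentions would fix two of the three variable groups (intact vectors, transformation matrices, classifier) and take gradient steps on the third, repeating until this value stops decreasing.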
A Fast and Accurate Unconstrained Face Detector
We propose a method to address challenges in unconstrained face detection,
such as arbitrary pose variations and occlusions. First, a new image feature
called Normalized Pixel Difference (NPD) is proposed. The NPD feature is computed
as the difference-to-sum ratio between two pixel values, inspired by the Weber
Fraction in experimental psychology. The new feature is scale invariant,
bounded, and is able to reconstruct the original image. Second, we propose a
deep quadratic tree to learn the optimal subset of NPD features and their
combinations, so that complex face manifolds can be partitioned by the learned
rules. This way, only a single soft-cascade classifier is needed to handle
unconstrained face detection. Furthermore, we show that the NPD features can be
efficiently obtained from a look up table, and the detection template can be
easily scaled, making the proposed face detector very fast. Experimental
results on three public face datasets (FDDB, GENKI, and CMU-MIT) show that the
proposed method achieves state-of-the-art performance in detecting
unconstrained faces with arbitrary pose variations and occlusions in cluttered
scenes.
Comment: This paper has been accepted by TPAMI. The source code is available
on the project page
http://www.cbsr.ia.ac.cn/users/scliao/projects/npdface/index.htm
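The difference-to-sum ratio and the look-up table mentioned in this abstract are simple enough to sketch directly. This is a minimal illustration, not the released detector code: the names `npd`, `NPD_TABLE`, and `npd_lookup` are hypothetical, the image is assumed to hold 8-bit intensities, and treating the (0, 0) pair as 0 is a convention assumed here to avoid division by zero.

```python
def npd(x, y):
    """Normalized pixel difference: (x - y) / (x + y), with npd(0, 0) taken as 0."""
    return 0.0 if x + y == 0 else (x - y) / (x + y)

# Precompute every pair of 8-bit intensities once; feature extraction then
# reduces to indexing this 256 x 256 table.
NPD_TABLE = [[npd(x, y) for y in range(256)] for x in range(256)]

def npd_lookup(image, p, q):
    """NPD between pixels p=(row, col) and q=(row, col) of a 2D list of 8-bit values."""
    return NPD_TABLE[image[p[0]][p[1]]][image[q[0]][q[1]]]
```

Multiplying both intensities by the same positive factor leaves the ratio unchanged, which is the scale invariance the abstract refers to, and for non-negative intensities the value always lies in [-1, 1].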
Cell identification in whole-brain multiview images of neural activation
We present a scalable method for brain cell identification in multiview
confocal light sheet microscopy images. Our algorithmic pipeline includes a
hierarchical registration approach and a novel multiview version of semantic
deconvolution that simultaneously enhances the visibility of fluorescent cell
bodies, equalizes their contrast, and fuses adjacent views into a single 3D
image on which cell identification is performed with mean shift.
We present empirical results on a whole-brain image of an adult Arc-dVenus
mouse acquired at 4 micron resolution. Based on an annotated test volume
containing 3278 cells, our algorithm achieves an F1 measure of 0.89.
Lifting Object Detection Datasets into 3D
While data has certainly taken the center stage in computer vision in recent
years, it can still be difficult to obtain in certain scenarios. In particular,
acquiring ground truth 3D shapes of objects pictured in 2D images remains a
challenging feat and this has hampered progress in recognition-based object
reconstruction from a single image. Here we propose to bypass previous
solutions such as 3D scanning or manual design, that scale poorly, and instead
populate object category detection datasets semi-automatically with dense,
per-object 3D reconstructions, bootstrapped from:(i) class labels, (ii) ground
truth figure-ground segmentations and (iii) a small set of keypoint
annotations. Our proposed algorithm first estimates camera viewpoint using
rigid structure-from-motion and then reconstructs object shapes by optimizing
over visual hull proposals guided by loose within-class shape similarity
assumptions. The visual hull sampling process attempts to intersect an object's
projection cone with the cones of minimal subsets of other similar objects
among those pictured from certain vantage points. We show that our method is
able to produce convincing per-object 3D reconstructions and to accurately
estimate camera viewpoints on one of the most challenging existing
object-category detection datasets, PASCAL VOC. We hope that our results will
re-stimulate interest in joint object recognition and 3D reconstruction from a
single image.
3D human pose estimation from depth maps using a deep combination of poses
Many real-world applications require the estimation of human body joints for
higher-level tasks such as human behaviour understanding. In recent
years, depth sensors have become a popular approach to obtain three-dimensional
information. The depth maps generated by these sensors provide information that
can be employed to disambiguate the poses observed in two-dimensional images.
This work addresses the problem of 3D human pose estimation from depth maps
employing a Deep Learning approach. We propose a model, named Deep Depth Pose
(DDP), which receives a depth map containing a person and a set of predefined
3D prototype poses and returns the 3D position of the body joints of the
person. In particular, DDP is defined as a ConvNet that computes the specific
weights needed to linearly combine the prototypes for the given input. We have
thoroughly evaluated DDP on the challenging 'ITOP' and 'UBC3V' datasets, which
respectively depict realistic and synthetic samples, defining a new
state-of-the-art on them.
Comment: Accepted for publication at "Journal of Visual Communication and
Image Representation"
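The final step described in this abstract, linearly combining prototype poses with input-specific weights, can be sketched as below. Only the combination is shown; in DDP the weights come from a ConvNet applied to the depth map, which is replaced here by a plain list. The function name `combine_prototypes` and the data layout (a pose as a list of (x, y, z) joints) are assumptions for the example.

```python
def combine_prototypes(weights, prototypes):
    """Weighted sum of K prototype poses, each a list of J (x, y, z) joints.

    weights: K numbers, standing in for the network output for one depth map.
    prototypes: K poses that all share the same joint ordering.
    """
    num_joints = len(prototypes[0])
    return [
        tuple(sum(w * proto[j][axis] for w, proto in zip(weights, prototypes))
              for axis in range(3))
        for j in range(num_joints)
    ]
```

For example, `combine_prototypes([0.25, 0.75], [pose_a, pose_b])` returns a pose whose every joint lies three quarters of the way from `pose_a` to `pose_b`.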
Effective Image Retrieval via Multilinear Multi-index Fusion
Multi-index fusion has demonstrated impressive performance in retrieval tasks
by integrating different visual representations in a unified framework.
However, previous works mainly consider propagating similarities via neighbor
structure, ignoring the high order information among different visual
representations. In this paper, we propose a new multi-index fusion scheme for
image retrieval. By formulating this procedure as a multilinear based
optimization problem, the complementary information hidden in different indexes
can be explored more thoroughly. Specifically, we first build our multiple indexes
from various visual representations. Then a so-called index-specific functional
matrix, which aims to propagate similarities, is introduced for updating the
original index. The functional matrices are then optimized in a unified tensor
space to achieve a refinement, such that the relevant images can be pushed
closer together. The optimization problem can be efficiently solved by the augmented
Lagrangian method with a theoretical convergence guarantee. Unlike the
traditional multi-index fusion scheme, our approach embeds the multi-index
subspace structure into the new indexes with a sparse constraint, so it incurs
little additional memory consumption in the online query stage. Experimental
evaluation on three benchmark datasets reveals that the proposed approach
achieves state-of-the-art performance, i.e., N-score 3.94 on UKBench, mAP
94.1% on Holiday and 62.39% on Market-1501.
Comment: 12 pages
Multiview Detection with Feature Perspective Transformation
Incorporating multiple camera views for detection alleviates the impact of
occlusions in crowded scenes. In a multiview system, we need to answer two
important questions when dealing with ambiguities that arise from occlusions.
First, how should we aggregate cues from the multiple views? Second, how should
we aggregate unreliable 2D and 3D spatial information that has been tainted by
occlusions? To address these questions, we propose a novel multiview detection
system, MVDet. For multiview aggregation, existing methods combine anchor box
features from the image plane, which potentially limits performance due to
inaccurate anchor box shapes and sizes. In contrast, we take an anchor-free
approach to aggregate multiview information by projecting feature maps onto the
ground plane (bird's eye view). To resolve any remaining spatial ambiguity, we
apply large kernel convolutions on the ground plane feature map and infer
locations from detection peaks. Our entire model is end-to-end learnable and
achieves 88.2% MODA on the standard Wildtrack dataset, outperforming the
state-of-the-art by 14.1%. We also provide detailed analysis of MVDet on a
newly introduced synthetic dataset, MultiviewX, which allows us to control the
level of occlusion. Code and MultiviewX dataset are available at
https://github.com/hou-yz/MVDet
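The idea of projecting per-view feature maps onto the ground plane can be illustrated with a plain homography warp. This is a rough sketch under assumptions and is not taken from the MVDet repository: each feature map is a single-channel 2D list, each view is assumed to come with a 3x3 ground-to-image homography, sampling is nearest-neighbour, and views are combined by summation. All function names are hypothetical.

```python
def apply_homography(H, x, y):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w

def warp_to_ground(feature_map, ground_to_image, grid_h, grid_w):
    """Fill a ground-plane grid by nearest-neighbour sampling of one view's feature map."""
    h, w = len(feature_map), len(feature_map[0])
    ground = [[0.0] * grid_w for _ in range(grid_h)]
    for gy in range(grid_h):
        for gx in range(grid_w):
            u, v = apply_homography(ground_to_image, gx, gy)
            col, row = int(round(u)), int(round(v))
            if 0 <= row < h and 0 <= col < w:  # cells outside this view stay zero
                ground[gy][gx] = feature_map[row][col]
    return ground

def aggregate(feature_maps, homographies, grid_h, grid_w):
    """Sum the warped maps of all views cell by cell (summation is an assumption here)."""
    total = [[0.0] * grid_w for _ in range(grid_h)]
    for fmap, H in zip(feature_maps, homographies):
        warped = warp_to_ground(fmap, H, grid_h, grid_w)
        for gy in range(grid_h):
            for gx in range(grid_w):
                total[gy][gx] += warped[gy][gx]
    return total
```

Because every view is resampled onto the same grid, evidence for one ground location from different cameras lands in the same cell, and no anchor boxes are involved at any point.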
Investigating and Mitigating the Side Effects of Noisy Views in Multi-view Clustering in Practical Scenarios
Multi-view clustering (MvC) aims at exploring category structures among
multi-view data without label supervision. Multiple views provide more
information than single views and thus existing MvC methods can achieve
satisfactory performance. However, their performance might seriously degenerate
when the views are noisy in practical scenarios. In this paper, we first
formally investigate the drawback of noisy views and then propose a
theoretically grounded deep MvC method (namely MvCAN) to address this issue.
Specifically, we propose a novel MvC objective that enables un-shared
parameters and inconsistent clustering predictions across multiple views to
reduce the side effects of noisy views. Furthermore, a non-parametric iterative
process is designed to generate a robust learning target for mining multiple
views' useful information. Theoretical analysis reveals that MvCAN works by
achieving the multi-view consistency, complementarity, and noise robustness.
Finally, experiments on extensive public datasets demonstrate that MvCAN
outperforms state-of-the-art methods and is robust against the existence of
noisy views.
Learning Articulated Motion Models from Visual and Lingual Signals
In order for robots to operate effectively in homes and workplaces, they must
be able to manipulate the articulated objects common within environments built
for and by humans. Previous work learns kinematic models that prescribe this
manipulation from visual demonstrations. Lingual signals, such as natural
language descriptions and instructions, offer a complementary means of
conveying knowledge of such manipulation models and are suitable to a wide
range of interactions (e.g., remote manipulation). In this paper, we present a
multimodal learning framework that incorporates both visual and lingual
information to estimate the structure and parameters that define kinematic
models of articulated objects. The visual signal takes the form of an RGB-D
image stream that opportunistically captures object motion in an unprepared
scene. Accompanying natural language descriptions of the motion constitute the
lingual signal. We present a probabilistic language model that uses word
embeddings to associate lingual verbs with their corresponding kinematic
structures. By exploiting the complementary nature of the visual and lingual
input, our method infers correct kinematic structures for various multiple-part
objects on which the previous state-of-the-art, visual-only system fails. We
evaluate our multimodal learning framework on a dataset comprised of a variety
of household objects, and demonstrate a 36% improvement in model accuracy over
the vision-only baseline.