Exploring 3D Data and Beyond in a Low Data Regime
3D object classification of point clouds is an essential task, as the laser scanners and other depth sensors that produce point clouds are now commodity hardware on, e.g., autonomous vehicles, surveying vehicles, service robots, and drones. There have been fewer advances using deep learning methods on point clouds than on 2D images and videos, partly because the points in a point cloud are typically unordered, unlike the pixels in a 2D image, which means standard deep learning architectures are not directly applicable. Additionally, we identify a shortage of labelled 3D data in many computer vision tasks, as collecting 3D data is significantly more costly and difficult. This motivates zero- or few-shot learning approaches, where some classes are observed rarely or not at all during training. As our first objective, we study the problem of 3D object classification of point clouds in a supervised setting where there are labelled samples for each class in the dataset. To this end, we introduce the 3DCapsule, a 3D extension of the Capsule concept recently introduced by Hinton et al. that makes it applicable to unordered point sets. The 3DCapsule is a drop-in replacement for the commonly used fully connected classifier. We demonstrate that when the 3DCapsule is applied to contemporary 3D point set classification architectures, it consistently yields an improvement, in particular on noisy data.
We then turn our attention to 3D object classification of point clouds in a Zero-Shot Learning (ZSL) setting, where there are no labelled data for some classes. Several recent 3D point cloud recognition algorithms are adapted to the ZSL setting with some necessary changes to their respective architectures. To the best of our knowledge, this was at the time the first attempt to classify unseen 3D point cloud objects in a ZSL setting. A standard protocol (which includes the choice of datasets and determines the seen/unseen split) to evaluate such systems is also proposed. In the next contribution, we address the hubness problem on 3D point cloud data, which arises when a model is biased to predict only a few particular labels for most of the test instances. To this end, we propose a loss function that is useful for both Zero-Shot and Generalized Zero-Shot Learning. We also tackle 3D object classification of point clouds in a different setting, called the transductive setting, wherein the test samples may be observed during the training stage, albeit as unlabelled data. We extend, for the first time, transductive Zero-Shot Learning (ZSL) and Generalized Zero-Shot Learning (GZSL) approaches to the domain of 3D point cloud classification by developing a novel triplet loss that takes advantage of the unlabelled test data. While designed for the task of 3D point cloud classification, the method is also shown to be applicable to the more common use case of 2D image classification. Lastly, we study the Generalized Zero-Shot Learning (GZSL) problem in the 2D image domain, and also demonstrate that our proposed method is applicable to 3D point cloud data. We propose using a mixture of subspaces that represents input features and semantic information in a way that reduces the imbalance between seen and unseen prediction scores.
Subspaces define the cluster structure of the visual domain and help describe the visual and semantic domains while accounting for the overall distribution of the data.
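The transductive triplet loss mentioned above can be illustrated with a minimal NumPy sketch. This is a simplified, illustrative formulation, not the thesis's actual loss: the rule of treating the nearest unseen-class prototype of an unlabelled test feature as a pseudo-positive and the second nearest as the negative is an assumption made here for clarity.

```python
import numpy as np

def transductive_triplet_loss(x, prototypes, margin=0.2):
    """Illustrative triplet loss on an UNLABELLED test feature x:
    treat the nearest class prototype as a pseudo-positive and the
    second nearest as the negative, enforcing a separation margin."""
    d = np.linalg.norm(prototypes - x, axis=1)   # distance to each prototype
    order = np.argsort(d)
    d_pos, d_neg = d[order[0]], d[order[1]]
    return max(0.0, d_pos - d_neg + margin)
```

When the feature sits clearly inside one cluster the loss is zero; when it is ambiguous between two prototypes the loss pushes the model to separate them, which is the intuition behind exploiting unlabelled test data in the transductive setting.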
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
As the most fundamental tasks of computer vision, object detection and
segmentation have made tremendous progress in the deep learning era. Due to the
expensive manual labeling, the annotated categories in existing datasets are
often small-scale and pre-defined, i.e., state-of-the-art detectors and
segmentors fail to generalize beyond the closed vocabulary. To resolve this
limitation, the last few years have witnessed increasing attention toward
Open-Vocabulary Detection (OVD) and Segmentation (OVS). In this survey, we
provide a comprehensive review on the past and recent development of OVD and
OVS. To this end, we develop a taxonomy according to the type of task and
methodology. We find that the permission and usage of weak supervision signals
can well discriminate different methodologies, including: visual-semantic space
mapping, novel visual feature synthesis, region-aware training,
pseudo-labeling, knowledge distillation-based, and transfer learning-based. The
proposed taxonomy is universal across different tasks, covering object
detection, semantic/instance/panoptic segmentation, 3D scene and video
understanding. In each category, its main principles, key challenges,
development routes, strengths, and weaknesses are thoroughly discussed. In
addition, we benchmark each task along with the vital components of each
method. Finally, several promising directions are provided to stimulate future
research.
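Of the methodologies the survey lists, visual-semantic space mapping is the simplest to sketch: region features and class-name text embeddings are compared in a shared space, so any class with a text embedding can be recognised, including ones unseen at training time. The snippet below is a generic illustration with made-up shapes and names, not the procedure of any specific paper.

```python
import numpy as np

def open_vocab_classify(region_feats, text_embeds, class_names):
    """Assign each region the class whose text embedding is most
    cosine-similar to the region's visual feature."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = r @ t.T                       # (num_regions, num_classes)
    return [class_names[i] for i in sims.argmax(axis=1)]
```

Extending the vocabulary then amounts to appending a new text embedding, with no detector retraining, which is the core appeal of the open-vocabulary setting.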
Cross-View Learning
Key to achieving more efficient machine intelligence is the capability to analyse and understand
data across different views, which can be camera views or modality views (such as
visual and textual). One generic learning paradigm for automatically understanding data from different
views is cross-view learning, which includes cross-view matching, cross-view fusion,
and cross-view generation. Specifically, this thesis investigates two of them, cross-view matching
and cross-view generation, by developing new methods for addressing the following specific
computer vision problems.
The first problem is cross-view matching for person re-identification, in which a person is captured
by multiple non-overlapping camera views and the objective is to match him/her across views
among a large number of imposters. Typically a person's appearance is represented using features
of thousands of dimensions, whilst only hundreds of training samples are available due
to the difficulties in collecting matched training samples. With the number of training samples
much smaller than the feature dimension, the existing methods thus face the classic small sample
size (SSS) problem and have to resort to dimensionality reduction techniques and/or matrix
regularisation, which lead to loss of discriminative power for cross-view matching. To that end,
this thesis proposes to overcome the SSS problem in subspace learning by matching cross-view
data in a discriminative null space of the training data.
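The null-space idea exploits the small sample size regime directly: when the number of samples is below the feature dimension, the data matrix has a non-trivial null space, and the thesis projects into a discriminative null space in which training samples of the same class collapse to a single point. The helper below only computes a null-space basis via SVD as a minimal illustration; it is not the thesis's full discriminative formulation.

```python
import numpy as np

def null_space(A, tol=1e-10):
    """Return an orthonormal basis (as columns) for the null space of A,
    computed from the right singular vectors with zero singular values."""
    _, s, vt = np.linalg.svd(A)
    rank = int((s > tol).sum())
    return vt[rank:].T        # shape: (feature_dim, feature_dim - rank)
```

In the SSS setting (rows of A far fewer than columns), this basis is guaranteed to be non-empty, which is exactly why dimensionality reduction or matrix regularisation can be avoided.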
The second problem is cross-view matching for zero-shot learning, where data are drawn
from different modalities, one for each view (e.g. visual or textual), versus the single-modal
data considered in the first problem. This is inherently more challenging as the gap between
different views becomes larger. Specifically, the zero-shot learning problem can be solved if
the visual representation/view of the data (object) and its textual view are matched. Moreover,
it requires learning a joint embedding space into which data from different views can be projected for
nearest neighbour search. This thesis argues that the key to making zero-shot learning models succeed
is to choose the right embedding space. Different from most existing zero-shot learning
models, which utilise a textual or an intermediate space as the embedding space for achieving cross-view
matching, the proposed method uniquely explores the visual space as the embedding space.
This thesis finds that in the visual space, the subsequent nearest neighbour search would suffer
much less from the hubness problem and thus become more effective. Moreover, a natural mechanism
in this model for jointly optimising multiple textual modalities in an end-to-end manner
demonstrates significant advantages over existing methods.
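A minimal sketch of the "visual space as embedding space" direction: map class semantic vectors into the visual feature space and run nearest-neighbour search there. Plain ridge regression is used here as an illustrative stand-in for the thesis's end-to-end deep embedding model, and all names are hypothetical.

```python
import numpy as np

def semantic_to_visual(S, V, lam=1.0):
    """Fit a ridge-regression map W from semantic vectors S (n x s)
    to visual features V (n x v): the projection direction the thesis
    argues suffers less from hubness than the reverse mapping."""
    d = S.shape[1]
    return np.linalg.solve(S.T @ S + lam * np.eye(d), S.T @ V)

def classify(x, S_unseen, W):
    """Project unseen-class semantic vectors into visual space and
    return the index of the prototype nearest to visual feature x."""
    protos = S_unseen @ W
    return int(np.linalg.norm(protos - x, axis=1).argmin())
```

Because the search happens among class prototypes embedded in the visual space, no test image ever needs to be projected into a semantic space, which is the structural change that mitigates hubness.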
The last problem is cross-view generation for image captioning which aims to automatically
generate textual sentences from visual images. Most existing image captioning studies are limited
to investigating variants of deep learning-based image encoders, improving the inputs for the
subsequent deep sentence decoders. Existing methods have two limitations: (i) They are trained
to maximise the likelihood of each ground-truth word given the previous ground-truth words and
the image, termed Teacher-Forcing. This strategy may cause a mismatch between training and
testing since at test-time the model uses the previously generated words from the model distribution
to predict the next word. This exposure bias can result in error accumulation in sentence
generation during test time, since the model has never been exposed to its own predictions. (ii)
The training supervision metric, such as the widely used cross entropy loss, is different from
the evaluation metrics at test time. In other words, the model is not directly optimised towards
the task expectation. The learned model is therefore suboptimal. One main underlying reason
is that the evaluation metrics are non-differentiable and therefore much harder to
optimise against. This thesis overcomes these problems by exploring the reinforcement
learning idea. Specifically, a novel actor-critic based learning approach is formulated to directly
maximise the reward - the actual Natural Language Processing quality metrics of interest. As
compared to existing reinforcement learning based captioning models, the new method has the
unique advantage of enabling per-token advantage and value computation, leading to better
model training.
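The per-token advantage idea can be sketched in a few lines: the critic predicts a value at each generation step, and the advantage for a token is the (discounted) return from that step minus the critic's baseline. This is a generic actor-critic computation, not the thesis's exact model; in captioning the reward is typically a sentence-level NLP metric (e.g. CIDEr) delivered only at the final token.

```python
import numpy as np

def per_token_advantages(rewards, values, gamma=1.0):
    """Compute return-minus-baseline advantages for each token:
    returns[t] = rewards[t] + gamma * returns[t+1], then subtract
    the critic's per-step value estimates."""
    T = len(rewards)
    returns = np.zeros(T)
    g = 0.0
    for t in reversed(range(T)):      # accumulate discounted return
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns - np.asarray(values, dtype=float)
```

Even with a single terminal reward, every token receives its own advantage signal through the backward accumulation, which is what distinguishes this from methods that assign one scalar reward to the whole sentence.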
PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning
Contrastive Language-Image Pre-training (CLIP) has shown promising open-world
performance on 2D image tasks, while its transferred capacity on 3D point
clouds, i.e., PointCLIP, is still far from satisfactory. In this work, we
propose PointCLIP V2, a powerful 3D open-world learner, to fully unleash the
potential of CLIP on 3D point cloud data. First, we introduce a realistic shape
projection module to generate more realistic depth maps for CLIP's visual
encoder, which is efficient and narrows the domain gap between projected
point clouds and natural images. Second, we leverage large-scale language
models to automatically design a more descriptive 3D-semantic prompt for CLIP's
textual encoder, instead of the previous hand-crafted one. Without introducing
any training in 3D domains, our approach significantly surpasses PointCLIP by
+42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D
classification. Furthermore, PointCLIP V2 can be extended to few-shot
classification, zero-shot part segmentation, and zero-shot 3D object detection
in a simple manner, demonstrating our superior generalization ability for 3D
open-world learning. Code will be available at
https://github.com/yangyangyang127/PointCLIP_V2
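The projection step that underlies this family of methods can be sketched as a naive z-buffer rendering of a point cloud into a depth map; PointCLIP V2's realistic shape projection module additionally densifies and smooths the result, which is not reproduced here. The snippet assumes point coordinates normalised to [-1, 1].

```python
import numpy as np

def project_to_depth_map(points, res=32):
    """Project an (N, 3) point cloud to a res x res depth map along the
    z-axis: scatter each point's xy position into a pixel and keep the
    largest depth value seen there (a simple z-buffer)."""
    img = np.zeros((res, res))
    xy = ((points[:, :2] + 1.0) / 2.0 * (res - 1)).astype(int)  # to pixels
    depth = (points[:, 2] + 1.0) / 2.0                          # z -> [0, 1]
    for (u, v), d in zip(xy, depth):
        img[v, u] = max(img[v, u], d)
    return img
```

The resulting single-channel map can then be replicated to three channels and fed to CLIP's visual encoder; closing the gap between such renders and the natural images CLIP was trained on is exactly the domain-gap problem the abstract describes.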