PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding
The recent trend in deep learning methods for 3D point cloud understanding is
to propose increasingly sophisticated architectures either to better capture 3D
geometries or by introducing possibly undesired inductive biases. Moreover,
prior works introducing novel architectures compared their performance on the
same domain, devoting less attention to their generalization to other domains.
We argue that the ability of a model to transfer the learnt knowledge to
different domains is an important feature that should be evaluated to
exhaustively assess the quality of a deep network architecture. In this work we
propose PatchMixer, a simple yet effective architecture that extends the ideas
behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our
approach are the processing of local patches instead of the whole shape to
promote robustness to partial point clouds, and the aggregation of patch-wise
features using an MLP as a simpler alternative to the graph convolutions or the
attention mechanisms that are used in prior works. We evaluated our method on
the shape classification and part segmentation tasks, achieving superior
generalization performance compared to a selection of the most relevant deep
architectures.
Comment: Published in the Image and Vision Computing journal.
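The idea of aggregating patch-wise features with an MLP can be illustrated with a minimal sketch of an MLP-Mixer-style block. This is not the PatchMixer implementation: the function names (`mixer_block`, `init_mlp`), the ReLU activation, the omission of layer normalisation, and the toy sizes are all assumptions made for illustration. The input is a list of P patch descriptors with C channels each.

```python
import random


def linear(x, w, b):
    """Compute y = x @ w + b for one vector x; w has shape (len(x), len(b))."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def mlp(x, params):
    """Two-layer MLP with a ReLU in between (assumed activation)."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def init_mlp(rng, dim, hidden):
    """Random weights for an MLP mapping dim -> hidden -> dim."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return (mat(dim, hidden), [0.0] * hidden), (mat(hidden, dim), [0.0] * dim)


def transpose(m):
    return [list(col) for col in zip(*m)]


def mixer_block(feats, patch_mlp, channel_mlp):
    """feats is P x C. Mix across patches, then across channels."""
    # Patch mixing: each channel is a length-P vector passed through patch_mlp.
    mixed = transpose([mlp(col, patch_mlp) for col in transpose(feats)])
    feats = [[a + b for a, b in zip(row, m)] for row, m in zip(feats, mixed)]
    # Channel mixing: each patch descriptor is passed through channel_mlp.
    return [[a + b for a, b in zip(row, mlp(row, channel_mlp))]
            for row in feats]


rng = random.Random(0)
P, C = 4, 3  # toy sizes: 4 patches, 3 channels per patch
patch_feats = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(P)]
mixed_feats = mixer_block(patch_feats, init_mlp(rng, P, 8), init_mlp(rng, C, 8))
```

Both mixing steps use residual connections, so the block preserves the P x C shape and reduces to the identity when the MLP weights are zero.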
Distinctive 3D local deep descriptors
We present a simple yet effective method for learning distinctive 3D
local deep descriptors (DIPs) that can be used to register point clouds without
requiring an initial alignment. Point cloud patches are extracted,
canonicalised with respect to their estimated local reference frame and encoded
into rotation-invariant compact descriptors by a PointNet-based deep neural
network. DIPs can effectively generalise across different sensor modalities
because they are learnt end-to-end from locally and randomly sampled points.
Because DIPs encode only local geometric information, they are robust to
clutter, occlusions and missing regions. We evaluate and compare DIPs against
alternative hand-crafted and deep descriptors on several indoor and outdoor
datasets consisting of point clouds reconstructed using different sensors.
Results show that DIPs (i) achieve comparable results to the state-of-the-art
on RGB-D indoor scenes (3DMatch dataset), (ii) outperform state-of-the-art by a
large margin on laser-scanner outdoor scenes (ETH dataset), and (iii)
generalise to indoor scenes reconstructed with the Visual-SLAM system of
Android ARCore. Source code: https://github.com/fabiopoiesi/dip.
Comment: IEEE International Conference on Pattern Recognition 202
TOWARDS GESTURE-BASED MULTI-USER INTERACTIONS IN COLLABORATIVE VIRTUAL ENVIRONMENTS
We present a virtual reality (VR) setup that enables multiple users to participate in collaborative virtual environments and interact via gestures. A collaborative VR session is established through a network of users that is composed of a server and a set of clients. The server manages the communication amongst clients and is created by one of the users. Each user's VR setup consists of a Head Mounted Display (HMD) for immersive visualisation, a hand tracking system to interact with virtual objects and a single-hand joypad to move in the virtual environment. We use Google Cardboard as an HMD for the VR experience and a Leap Motion for hand tracking, thus making our solution low cost. We evaluate our VR setup through a forensics use case, where real-world objects pertaining to a simulated crime scene are included in a VR environment, acquired using a smartphone-based 3D reconstruction pipeline. Users can interact using virtual gesture-based tools such as pointers and rulers.
Multi-view data capture for dynamic object reconstruction using handheld augmented reality mobiles
We propose a system to capture nearly-synchronous frame streams from multiple
and moving handheld mobiles that is suitable for dynamic object 3D
reconstruction. Each mobile executes Simultaneous Localisation and Mapping
on-board to estimate its pose, and uses a wireless communication channel to
send or receive synchronisation triggers. Our system can harvest frames and
mobile poses in real time using a decentralised triggering strategy and a
data-relay architecture that can be deployed either at the Edge or in the
Cloud. We show the effectiveness of our system by employing it for 3D skeleton
and volumetric reconstructions. Our triggering strategy achieves equal
performance to that of an NTP-based synchronisation approach, but offers higher
flexibility, as it can be adjusted online based on application needs. We
created a challenging new dataset, namely 4DM, that involves six handheld
augmented reality mobiles recording an actor performing sports actions
outdoors. We validate our system on 4DM, analyse its strengths and limitations,
and compare its modules with alternative ones.
Comment: Accepted in Journal of Real-Time Image Processing.
Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation
Recent works on 6D object pose estimation focus on learning keypoint
correspondences between images and object models, and then determine the object
pose through RANSAC-based algorithms or by directly regressing the pose with
end-to-end optimisations. We argue that learning point-level discriminative
features is overlooked in the literature. To this end, we revisit Fully
Convolutional Geometric Features (FCGF) and tailor it for object 6D pose
estimation to achieve state-of-the-art performance. FCGF employs sparse
convolutions and learns point-level features using a fully-convolutional
network by optimising a hardest contrastive loss. We can outperform recent
competitors on popular benchmarks by adopting key modifications to the loss and
to the input data representations, by carefully tuning the training strategies,
and by employing data augmentations suitable for the underlying problem. We
carry out a thorough ablation to study the contribution of each modification.
Comment: 17 pages. Preprint, currently under review.
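For readers unfamiliar with the hardest contrastive loss mentioned above, the following is a simplified sketch of the general idea: matched features are pulled together, and each feature is pushed away from its closest non-matching feature. It does not include the modifications proposed in the paper. The function name, the margin values, and the assumption that every non-matching cross pair counts as a negative are illustrative choices, not taken from the abstract.

```python
import math


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hardest_contrastive_loss(feats_a, feats_b, pos_margin=0.1, neg_margin=1.4):
    """Simplified hardest contrastive loss (margins are assumed values).

    feats_a[i] and feats_b[i] form a positive pair; every other cross
    pair is treated as a negative. Requires at least two pairs.
    """
    n = len(feats_a)
    pos_term = 0.0
    neg_term = 0.0
    for i in range(n):
        # Positive term: penalise matched features farther apart than pos_margin.
        pos_term += max(0.0, dist(feats_a[i], feats_b[i]) - pos_margin) ** 2
        # Hardest negatives: the closest non-matching feature on each side.
        hardest_a = min(dist(feats_a[i], feats_b[k]) for k in range(n) if k != i)
        hardest_b = min(dist(feats_b[i], feats_a[k]) for k in range(n) if k != i)
        neg_term += 0.5 * (max(0.0, neg_margin - hardest_a) ** 2
                           + max(0.0, neg_margin - hardest_b) ** 2)
    return (pos_term + neg_term) / n


# Toy example: two perfectly matched pairs whose negatives are 1.0 apart.
toy_loss = hardest_contrastive_loss([[0.0, 0.0], [1.0, 0.0]],
                                    [[0.0, 0.0], [1.0, 0.0]])
```

In the toy example the positive term vanishes and only the negatives that fall inside the negative margin contribute to the loss.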
Multi-target tracking and performance evaluation on videos
PhD
Multi-target tracking is the process that allows the extraction of object motion patterns of
interest from a scene. Motion patterns are often described through metadata representing object
locations and shape information. In the first part of this thesis we discuss the state-of-the-art
methods aimed at accomplishing this task on monocular views and also analyse the methods for
evaluating their performance. The second part of the thesis describes our research contribution
to these topics.
We begin by presenting a method for multi-target tracking based on track-before-detect (MT-TBD)
formulated as a particle filter. The novelty involves the inclusion of the target identity
(ID) into the particle state, which enables the algorithm to deal with an unknown and unlimited
number of targets. We propose a probabilistic model of particle birth and death based on Markov
Random Fields. This model allows us to overcome the problem of the mixing of IDs of close
targets.
We then propose three evaluation measures that take into account target-size variations, combine
accuracy and cardinality errors, quantify long-term tracking accuracy at different accuracy
levels, and evaluate ID changes relative to the duration of the track in which they occur. This
set of measures does not require pre-setting of parameters and allows one to holistically evaluate
tracking performance in an application-independent manner.
Lastly, we present a framework for multi-target localisation applied to scenes with a high
density of compact objects. Candidate target locations are initially generated by extracting object
features from intensity maps using an iterative method based on a gradient-climbing technique
and an isocontour slicing approach. A graph-based data association method for multi-target
tracking is then applied to link valid candidate target locations over time and to discard those
which are spurious. This method can deal with point targets having indistinguishable appearance
and unpredictable motion.
MT-TBD is evaluated and compared with state-of-the-art methods on real-world surveillance
This work was supported by the EU, under the FP7 project APIDIS (ICT-216023) and the
Artemis JU and TSB as part of the COPCAMS project (332913).
Data augmentation for NeRF: a geometric consistent solution based on view morphing
NeRF aims to learn a continuous neural scene representation by using a finite
set of input images taken from different viewpoints. The fewer the viewpoints,
the higher the likelihood of overfitting to them. This paper
mitigates this limitation by presenting a novel data augmentation approach to
generate geometrically consistent image transitions between viewpoints using
view morphing. View morphing is a highly versatile technique that does not
require any prior knowledge about the 3D scene because it is based on general
principles of projective geometry. A key novelty of our method is to use the
very same depths predicted by NeRF to generate the image transitions that are
then added to NeRF training. We experimentally show that this procedure enables
NeRF to improve the quality of its synthesised novel views in the case of
datasets with few training viewpoints. We improve PSNR by up to 1.8 dB and 10.5 dB
when eight and four views are used for training, respectively. To the best of
our knowledge, this is the first data augmentation strategy for NeRF that
explicitly synthesises additional new input images to improve the model
generalisation.
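To put the reported PSNR gains in perspective, recall that PSNR is a logarithmic function of the mean squared error (MSE) between a synthesised view and its reference, so a gain of g dB corresponds to the MSE shrinking by a factor of 10^(g/10). The sketch below uses the standard PSNR definition. The function names and the peak value of 1.0 (images scaled to [0, 1]) are assumptions for illustration.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a PSNR gain of gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


ratio_eight_views = mse_ratio_for_gain(1.8)   # gain reported for eight views
ratio_four_views = mse_ratio_for_gain(10.5)   # gain reported for four views
```

Under this definition, a 1.8 dB gain corresponds to an MSE roughly 1.5 times smaller, and a 10.5 dB gain to an MSE roughly 11 times smaller.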
Support Vector Motion Clustering
This work was supported in part by the Erasmus Mundus Joint Doctorate in Interactive and Cognitive Environments (which is funded by the EACEA Agency of the European Commission under EMJD ICE FPA n 2010-0012) and by the Artemis JU and the UK Technology Strategy Board through COPCAMS Project under Grant 332913
Survey on video anomaly detection in dynamic scenes with moving cameras
The increasing popularity of compact and inexpensive cameras, e.g. dash
cameras, body cameras, and cameras equipped on robots, has sparked a growing
interest in detecting anomalies within dynamic scenes recorded by moving
cameras. However, existing reviews primarily concentrate on Video Anomaly
Detection (VAD) methods assuming static cameras. The VAD literature with moving
cameras remains fragmented, lacking comprehensive reviews to date. To address
this gap, we endeavor to present the first comprehensive survey on Moving
Camera Video Anomaly Detection (MC-VAD). We delve into the research papers
related to MC-VAD, critically assessing their limitations and highlighting
associated challenges. Our exploration encompasses three application domains:
security, urban transportation, and marine environments, which in turn cover
six specific tasks. We compile an extensive list of 25 publicly-available
datasets spanning four distinct environments: underwater, water surface,
ground, and aerial. We summarize the types of anomalies these datasets
correspond to or contain, and present five main categories of approaches for
detecting such anomalies. Lastly, we identify future research directions and
discuss novel contributions that could advance the field of MC-VAD. With this
survey, we aim to offer a valuable reference for researchers and practitioners
striving to develop and advance state-of-the-art MC-VAD methods.
Comment: Under review.
Novel-View Human Action Synthesis
Novel-View Human Action Synthesis aims to synthesize the movement of a body
from a virtual viewpoint, given a video from a real viewpoint. We present a
novel 3D reasoning to synthesize the target viewpoint. We first estimate the 3D
mesh of the target body and transfer the rough textures from the 2D images to
the mesh. As this transfer may generate sparse textures on the mesh due to
frame resolution or occlusions, we produce a semi-dense textured mesh by
propagating the transferred textures both locally, within local geodesic
neighborhoods, and globally, across symmetric semantic parts. Next, we
introduce a context-based generator to learn how to correct and complete the
residual appearance information. This allows the network to independently focus
on learning the foreground and background synthesis tasks. We validate the
proposed solution on the public NTU RGB+D dataset. The code and resources are
available at https://bit.ly/36u3h4K.
Comment: Asian Conference on Computer Vision (ACCV) 202