164 research outputs found

    PatchMixer: Rethinking network design to boost generalization for 3D point cloud understanding

    The recent trend in deep learning methods for 3D point cloud understanding is to propose increasingly sophisticated architectures that either better capture 3D geometry or introduce possibly undesired inductive biases. Moreover, prior works introducing novel architectures compared their performance on the same domain, devoting less attention to their generalization to other domains. We argue that the ability of a model to transfer the learnt knowledge to different domains is an important feature that should be evaluated to exhaustively assess the quality of a deep network architecture. In this work we propose PatchMixer, a simple yet effective architecture that extends the ideas behind the recent MLP-Mixer paper to 3D point clouds. The novelties of our approach are the processing of local patches instead of the whole shape, to promote robustness to partial point clouds, and the aggregation of patch-wise features using an MLP, as a simpler alternative to the graph convolutions or attention mechanisms used in prior works. We evaluated our method on the shape classification and part segmentation tasks, achieving superior generalization performance compared to a selection of the most relevant deep architectures. Comment: Published in the Image and Vision Computing journal.
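    The patch-based design described above can be illustrated with a minimal sketch (not the authors' implementation): local patches are encoded independently by a shared point-wise MLP, and patch-wise features are then mixed by a plain MLP rather than attention or graph convolutions. Patch sampling, layer sizes, and the classification head are assumptions made for the example.

```python
# Illustrative sketch of a PatchMixer-style model; not the reference implementation.
import torch
import torch.nn as nn

class ToyPatchMixer(nn.Module):
    def __init__(self, num_patches=32, patch_size=64, feat_dim=128, num_classes=40):
        super().__init__()
        self.num_patches, self.patch_size = num_patches, patch_size
        self.patch_encoder = nn.Sequential(          # shared point-wise MLP applied per patch
            nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, feat_dim))
        self.patch_mixer = nn.Sequential(            # plain MLP that mixes features across patches
            nn.Linear(num_patches, num_patches), nn.ReLU(),
            nn.Linear(num_patches, num_patches))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, pts):                          # pts: (B, N, 3)
        # naive patch extraction: random points as patch centres, nearest neighbours as members
        centres = pts[:, torch.randperm(pts.shape[1])[: self.num_patches]]       # (B, P, 3)
        d = torch.cdist(centres, pts)                                             # (B, P, N)
        idx = d.topk(self.patch_size, largest=False).indices                      # (B, P, k)
        patches = torch.gather(
            pts.unsqueeze(1).expand(-1, self.num_patches, -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, 3))                              # (B, P, k, 3)
        patches = patches - centres.unsqueeze(2)                                  # centre each patch
        feats = self.patch_encoder(patches).max(dim=2).values                     # (B, P, C)
        mixed = self.patch_mixer(feats.transpose(1, 2)).transpose(1, 2)           # mix across patches
        return self.head(mixed.max(dim=1).values)                                 # global pooling -> logits

logits = ToyPatchMixer()(torch.rand(2, 1024, 3))
print(logits.shape)   # torch.Size([2, 40])
```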

    Distinctive 3D local deep descriptors

    We present a simple yet effective method for learning distinctive 3D local deep descriptors (DIPs) that can be used to register point clouds without requiring an initial alignment. Point cloud patches are extracted, canonicalised with respect to their estimated local reference frame, and encoded into rotation-invariant compact descriptors by a PointNet-based deep neural network. DIPs can effectively generalise across different sensor modalities because they are learnt end-to-end from locally and randomly sampled points. Because DIPs encode only local geometric information, they are robust to clutter, occlusions, and missing regions. We evaluate and compare DIPs against alternative hand-crafted and deep descriptors on several indoor and outdoor datasets consisting of point clouds reconstructed using different sensors. Results show that DIPs (i) achieve comparable results to the state of the art on RGB-D indoor scenes (3DMatch dataset), (ii) outperform the state of the art by a large margin on laser-scanner outdoor scenes (ETH dataset), and (iii) generalise to indoor scenes reconstructed with the Visual-SLAM system of Android ARCore. Source code: https://github.com/fabiopoiesi/dip. Comment: IEEE International Conference on Pattern Recognition 2020.
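    A hedged sketch of the pipeline summarised above: extract a local patch, canonicalise it with an estimated local reference frame (here a simple PCA frame, only an approximation of the LRF used by DIP), and encode it with a small PointNet-style network into a unit-norm descriptor. Network sizes are illustrative assumptions.

```python
# Illustrative patch canonicalisation + encoding; not the DIP reference code.
import torch
import torch.nn as nn

def canonicalise(patch):
    """Centre a patch (k, 3) and rotate it into its PCA reference frame (stand-in for the LRF)."""
    centred = patch - patch.mean(dim=0, keepdim=True)
    cov = centred.T @ centred / (patch.shape[0] - 1)
    _, eigvecs = torch.linalg.eigh(cov)               # columns: eigenvectors of the covariance
    return centred @ eigvecs                          # express points in the local frame

class ToyPatchEncoder(nn.Module):
    def __init__(self, desc_dim=32):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 128))
        self.head = nn.Linear(128, desc_dim)

    def forward(self, patch):                         # patch: (k, 3), already canonicalised
        feats = self.point_mlp(patch).max(dim=0).values            # order-invariant pooling
        return nn.functional.normalize(self.head(feats), dim=0)    # unit-norm compact descriptor

patch = torch.rand(256, 3)
desc = ToyPatchEncoder()(canonicalise(patch))
print(desc.shape, desc.norm())                        # torch.Size([32]) ~1.0
```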

    Towards gesture-based multi-user interactions in collaborative virtual environments

    We present a virtual reality (VR) setup that enables multiple users to participate in collaborative virtual environments and interact via gestures. A collaborative VR session is established through a network of users composed of a server and a set of clients. The server manages the communication amongst clients and is created by one of the users. Each user's VR setup consists of a Head Mounted Display (HMD) for immersive visualisation, a hand-tracking system to interact with virtual objects, and a single-hand joypad to move in the virtual environment. We use a Google Cardboard as the HMD for the VR experience and a Leap Motion for hand tracking, thus making our solution low cost. We evaluate our VR setup through a forensics use case, in which real-world objects pertaining to a simulated crime scene are acquired with a smartphone-based 3D reconstruction pipeline and included in the virtual environment. Users can interact using virtual gesture-based tools such as pointers and rulers.
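    The server/client session model described above can be sketched as follows (illustrative Python, not the paper's implementation): one user hosts a relay server, the other clients connect to it, and every update a client sends, e.g. a head pose or a gesture event, is forwarded to all other clients. The message format and port are assumptions made for the example.

```python
# Toy relay-server session; message schema, host and port are assumptions for illustration.
import json
import socket
import threading
import time

HOST, PORT = "127.0.0.1", 5005

def relay_server():
    """Server created by one of the users: relays every message to the other clients."""
    clients = []
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((HOST, PORT))
    srv.listen()

    def handle(conn):
        clients.append(conn)
        for line in conn.makefile():              # one JSON message per line
            for other in clients:
                if other is not conn:
                    other.sendall(line.encode())

    while True:
        conn, _ = srv.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

def client(name):
    sock = socket.create_connection((HOST, PORT))
    update = {"user": name, "event": "gesture", "tool": "ruler", "pose": [0.0, 1.6, 0.3]}
    sock.sendall((json.dumps(update) + "\n").encode())
    return sock

threading.Thread(target=relay_server, daemon=True).start()
time.sleep(0.2)                                   # let the server start listening
alice = client("alice")
time.sleep(0.2)
bob = client("bob")                               # bob's update is relayed to alice
print(alice.makefile().readline().strip())
```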

    Multi-view data capture for dynamic object reconstruction using handheld augmented reality mobiles

    We propose a system to capture nearly-synchronous frame streams from multiple moving handheld mobiles that is suitable for dynamic object 3D reconstruction. Each mobile runs Simultaneous Localisation and Mapping on-board to estimate its pose, and uses a wireless communication channel to send or receive synchronisation triggers. Our system can harvest frames and mobile poses in real time using a decentralised triggering strategy and a data-relay architecture that can be deployed either at the Edge or in the Cloud. We show the effectiveness of our system by employing it for 3D skeleton and volumetric reconstructions. Our triggering strategy achieves performance equal to that of an NTP-based synchronisation approach, but offers higher flexibility, as it can be adjusted online based on application needs. We created a challenging new dataset, namely 4DM, that involves six handheld augmented reality mobiles recording an actor performing sports actions outdoors. We validate our system on 4DM, analyse its strengths and limitations, and compare its modules with alternative ones. Comment: Accepted in the Journal of Real-Time Image Processing.
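    The frame-harvesting idea can be illustrated with a small sketch: given the timestamped frames produced by several mobiles, group them into nearly-synchronous sets around each trigger time. The tolerance value and the greedy matching are assumptions for the example, not the paper's exact algorithm.

```python
# Toy grouping of per-device frames into near-synchronous multi-view sets.
from bisect import bisect_left

def harvest(trigger_times, frames_per_device, tolerance=0.040):
    """frames_per_device: {device_id: sorted list of frame timestamps (seconds)}."""
    groups = []
    for t in trigger_times:
        group = {}
        for device, stamps in frames_per_device.items():
            i = bisect_left(stamps, t)                            # frames straddling the trigger
            candidates = [s for s in stamps[max(i - 1, 0): i + 1] if abs(s - t) <= tolerance]
            if candidates:
                group[device] = min(candidates, key=lambda s: abs(s - t))
        if len(group) == len(frames_per_device):                  # keep only complete multi-view sets
            groups.append((t, group))
    return groups

frames = {"mobileA": [0.01, 0.31, 0.62], "mobileB": [0.02, 0.33, 0.70], "mobileC": [0.00, 0.29, 0.61]}
print(harvest([0.0, 0.3, 0.6], frames))   # the last trigger is dropped: mobileB has no frame within tolerance
```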

    Revisiting Fully Convolutional Geometric Features for Object 6D Pose Estimation

    Recent works on 6D object pose estimation focus on learning keypoint correspondences between images and object models, and then determine the object pose through RANSAC-based algorithms or by directly regressing the pose with end-to-end optimisation. We argue that learning point-level discriminative features is overlooked in the literature. To this end, we revisit Fully Convolutional Geometric Features (FCGF) and tailor it to object 6D pose estimation to achieve state-of-the-art performance. FCGF employs sparse convolutions and learns point-level features with a fully-convolutional network by optimising a hardest contrastive loss. We outperform recent competitors on popular benchmarks by adopting key modifications to the loss and to the input data representations, by carefully tuning the training strategies, and by employing data augmentations suited to the underlying problem. We carry out a thorough ablation to study the contribution of each modification. Comment: 17 pages. Preprint, currently under review.
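    A hedged sketch of a hardest-contrastive loss of the kind mentioned above: matched descriptor pairs are pulled together, and each anchor is pushed away from its hardest (closest) non-matching descriptor. The margin values are illustrative; the exact loss, mining pool, and normalisation used in the paper may differ.

```python
# Illustrative hardest-contrastive loss; margins and mining are simplified assumptions.
import torch

def hardest_contrastive_loss(f0, f1, pos_margin=0.1, neg_margin=1.4):
    """f0, f1: (N, D) descriptors of N matched point pairs (row i of f0 matches row i of f1)."""
    d = torch.cdist(f0, f1)                                               # (N, N) pairwise distances
    pos = d.diag()                                                        # distances of the true matches
    mask = torch.eye(len(f0), dtype=torch.bool, device=d.device)
    hardest_neg0 = d.masked_fill(mask, float("inf")).min(dim=1).values    # hardest negative for each f0
    hardest_neg1 = d.masked_fill(mask, float("inf")).min(dim=0).values    # hardest negative for each f1
    loss_pos = torch.clamp(pos - pos_margin, min=0).pow(2).mean()
    loss_neg = (torch.clamp(neg_margin - hardest_neg0, min=0).pow(2).mean()
                + torch.clamp(neg_margin - hardest_neg1, min=0).pow(2).mean()) / 2
    return loss_pos + loss_neg

f0 = torch.nn.functional.normalize(torch.randn(128, 32), dim=1)
f1 = torch.nn.functional.normalize(f0 + 0.05 * torch.randn(128, 32), dim=1)
print(hardest_contrastive_loss(f0, f1).item())
```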

    Multi-target tracking and performance evaluation on videos

    PhD thesis. Multi-target tracking is the process that allows the extraction of object motion patterns of interest from a scene. Motion patterns are often described through metadata representing object locations and shape information. In the first part of this thesis we discuss the state-of-the-art methods aimed at accomplishing this task on monocular views and also analyse the methods for evaluating their performance. The second part of the thesis describes our research contribution to these topics. We begin by presenting a method for multi-target tracking based on track-before-detect (MT-TBD) formulated as a particle filter. The novelty is the inclusion of the target identity (ID) in the particle state, which enables the algorithm to deal with an unknown and unlimited number of targets. We propose a probabilistic model of particle birth and death based on Markov Random Fields. This model allows us to overcome the problem of the mixing of IDs of close targets. We then propose three evaluation measures that take into account target-size variations, combine accuracy and cardinality errors, quantify long-term tracking accuracy at different accuracy levels, and evaluate ID changes relative to the duration of the track in which they occur. This set of measures does not require pre-setting of parameters and allows one to holistically evaluate tracking performance in an application-independent manner. Lastly, we present a framework for multi-target localisation applied to scenes with a high density of compact objects. Candidate target locations are initially generated by extracting object features from intensity maps using an iterative method based on a gradient-climbing technique and an isocontour slicing approach. A graph-based data association method for multi-target tracking is then applied to link valid candidate target locations over time and to discard those which are spurious. This method can deal with point targets having indistinguishable appearance and unpredictable motion. MT-TBD is evaluated and compared with state-of-the-art methods on real-world surveillance videos. This work was supported by the EU, under the FP7 project APIDIS (ICT-216023), and by the Artemis JU and TSB as part of the COPCAMS project (332913).
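    The idea of carrying a target identity inside the particle state can be illustrated with a toy sketch; the dynamics, weighting, and resampling below are deliberately simplified placeholders and do not reproduce the MT-TBD filter of the thesis.

```python
# Toy particle filter state carrying a target ID; a simplified illustration, not MT-TBD.
import numpy as np

rng = np.random.default_rng(0)

def init_particles(n, num_targets):
    # state: [x, y, vx, vy, target_id]; the ID travels with the particle through resampling
    ids = rng.integers(1, num_targets + 1, size=n)
    return np.column_stack([rng.uniform(0, 100, (n, 2)), rng.normal(0, 1, (n, 2)), ids])

def predict(p, dt=1.0):
    p = p.copy()
    p[:, :2] += p[:, 2:4] * dt + rng.normal(0, 0.5, (len(p), 2))   # constant-velocity model + noise
    return p

def resample(p, weights):
    idx = rng.choice(len(p), size=len(p), p=weights / weights.sum())
    return p[idx]                                                   # IDs are copied with the state

particles = init_particles(1000, num_targets=3)
particles = predict(particles)
weights = np.exp(-np.linalg.norm(particles[:, :2] - np.array([50, 50]), axis=1) / 20)  # toy likelihood
particles = resample(particles, weights)
print({int(i): int((particles[:, 4] == i).sum()) for i in np.unique(particles[:, 4])})  # particles per ID
```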

    Data augmentation for NeRF: a geometric consistent solution based on view morphing

    NeRF aims to learn a continuous neural scene representation by using a finite set of input images taken from different viewpoints. The fewer the viewpoints, the higher the likelihood of overfitting to them. This paper mitigates this limitation by presenting a novel data augmentation approach to generate geometrically consistent image transitions between viewpoints using view morphing. View morphing is a highly versatile technique that does not require any prior knowledge about the 3D scene because it is based on general principles of projective geometry. A key novelty of our method is to use the very same depths predicted by NeRF to generate the image transitions that are then added to NeRF training. We experimentally show that this procedure enables NeRF to improve the quality of its synthesised novel views in the case of datasets with few training viewpoints. We improve PSNR by up to 1.8 dB and 10.5 dB when eight and four views are used for training, respectively. To the best of our knowledge, this is the first data augmentation strategy for NeRF that explicitly synthesises additional new input images to improve the model's generalisation.
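    A simplified sketch of the augmentation idea: use per-pixel depths (here assumed to come from a trained NeRF) to unproject a training view and reproject it into a camera placed between two training viewpoints. This is a plain depth-based warp with a naive pose blend, not the full view-morphing procedure of the paper.

```python
# Toy depth-based reprojection into an intermediate camera; a simplified stand-in for view morphing.
import numpy as np

def interpolate_pose(T0, T1, alpha=0.5):
    """Naive linear blend of two camera-to-world poses (rotation re-orthonormalised via SVD)."""
    T = (1 - alpha) * T0 + alpha * T1
    U, _, Vt = np.linalg.svd(T[:3, :3])
    T[:3, :3] = U @ Vt
    return T

def warp_to_new_view(image, depth, K, T_src, T_dst):
    """Reproject pixels of `image` (H, W, 3) with depth (H, W) from pose T_src into pose T_dst."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T        # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                      # back-project to 3D
    world = T_src[:3, :3] @ cam + T_src[:3, 3:4]
    cam_dst = T_dst[:3, :3].T @ (world - T_dst[:3, 3:4])                     # into the target camera
    proj = K @ cam_dst
    uv = (proj[:2] / proj[2:]).round().astype(int)
    out = np.zeros_like(image)
    valid = (cam_dst[2] > 0) & (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H)
    out[uv[1, valid], uv[0, valid]] = image.reshape(-1, 3)[valid]            # forward splat (holes remain)
    return out

K = np.array([[500.0, 0, 160], [0, 500.0, 120], [0, 0, 1]])
T0, T1 = np.eye(4), np.eye(4)
T1[0, 3] = 0.2                                                               # two nearby training poses
img, depth = np.random.rand(240, 320, 3), np.full((240, 320), 2.0)
new_view = warp_to_new_view(img, depth, K, T0, interpolate_pose(T0, T1))
print(new_view.shape)
```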

    Support Vector Motion Clustering

    This work was supported in part by the Erasmus Mundus Joint Doctorate in Interactive and Cognitive Environments (funded by the EACEA Agency of the European Commission under EMJD ICE FPA no. 2010-0012) and by the Artemis JU and the UK Technology Strategy Board through the COPCAMS Project under Grant 332913.

    Survey on video anomaly detection in dynamic scenes with moving cameras

    The increasing popularity of compact and inexpensive cameras, e.g. dash cameras, body cameras, and cameras equipped on robots, has sparked a growing interest in detecting anomalies within dynamic scenes recorded by moving cameras. However, existing reviews primarily concentrate on Video Anomaly Detection (VAD) methods that assume static cameras. The VAD literature with moving cameras remains fragmented, lacking comprehensive reviews to date. To address this gap, we endeavor to present the first comprehensive survey on Moving Camera Video Anomaly Detection (MC-VAD). We delve into the research papers related to MC-VAD, critically assessing their limitations and highlighting associated challenges. Our exploration encompasses three application domains: security, urban transportation, and marine environments, which in turn cover six specific tasks. We compile an extensive list of 25 publicly available datasets spanning four distinct environments: underwater, water surface, ground, and aerial. We summarize the types of anomalies these datasets correspond to or contain, and present five main categories of approaches for detecting such anomalies. Lastly, we identify future research directions and discuss novel contributions that could advance the field of MC-VAD. With this survey, we aim to offer a valuable reference for researchers and practitioners striving to develop and advance state-of-the-art MC-VAD methods. Comment: Under review.

    Novel-View Human Action Synthesis

    Novel-View Human Action Synthesis aims to synthesize the movement of a body from a virtual viewpoint, given a video from a real viewpoint. We present a novel 3D reasoning approach to synthesize the target viewpoint. We first estimate the 3D mesh of the target body and transfer the rough textures from the 2D images to the mesh. As this transfer may generate sparse textures on the mesh due to frame resolution or occlusions, we produce a semi-dense textured mesh by propagating the transferred textures both locally, within local geodesic neighborhoods, and globally, across symmetric semantic parts. Next, we introduce a context-based generator to learn how to correct and complete the residual appearance information. This allows the network to independently focus on learning the foreground and background synthesis tasks. We validate the proposed solution on the public NTU RGB+D dataset. The code and resources are available at https://bit.ly/36u3h4K. Comment: Asian Conference on Computer Vision (ACCV) 2020.
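    A toy sketch of the local propagation step described above: colours transferred to only some mesh vertices are spread to their uncoloured neighbours. Vertex adjacency here stands in for the geodesic neighbourhoods of the paper, and the symmetric, part-based propagation is omitted; both are simplifications.

```python
# Toy colour propagation over mesh vertices; a simplified stand-in for geodesic propagation.
import numpy as np

def propagate_colours(colours, known, adjacency, iters=10):
    """colours: (V, 3); known: (V,) bool; adjacency: list of neighbour index lists."""
    colours, known = colours.copy(), known.copy()
    for _ in range(iters):
        for v in range(len(colours)):
            if not known[v]:
                neigh = [n for n in adjacency[v] if known[n]]
                if neigh:
                    colours[v] = np.mean(colours[neigh], axis=0)   # borrow colour from textured neighbours
                    known[v] = True
    return colours

# tiny 5-vertex "mesh": a chain 0-1-2-3-4 where only the two end vertices received a texture
adjacency = [[1], [0, 2], [1, 3], [2, 4], [3]]
colours = np.zeros((5, 3))
colours[0], colours[4] = [1, 0, 0], [0, 0, 1]
known = np.array([True, False, False, False, True])
print(propagate_colours(colours, known, adjacency))
```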