Local features for view matching across independently moving cameras.
PhD Thesis
Moving platforms, such as wearable and robotic cameras, need to recognise the same place
observed from different viewpoints in order to collaboratively reconstruct a 3D scene and to support
augmented reality or autonomous navigation. However, matching views is challenging for
independently moving cameras that directly interact with each other, because severe geometric and
photometric differences, such as viewpoint, scale, and illumination changes, can considerably
decrease the matching performance. This thesis proposes novel, compact, local features that can
cope with scale and viewpoint variations. We extract and describe an image patch at different
scales of an image pyramid by comparing intensity values between learnt pixel pairs (binary
test), and employ a cross-scale distance when matching these features. We capture, at multiple
scales, the temporal changes of a 3D point, as observed in the image sequence of a camera, by
tracking local binary descriptors. After validating the feature-point trajectories through 3D reconstruction,
we reduce, for each scale, the sequence of binary features to a compact, fixed-length
descriptor that identifies the most frequent and the most stable binary tests over time. We then
propose XC-PR, a cross-camera place recognition approach that stores locally, for each uncalibrated
camera, spatio-temporal descriptors, extracted at a single scale, in a tree that is selectively
updated, as the camera moves. Cameras exchange descriptors selected from previous frames
within an adaptive temporal window and with the highest number of local features corresponding
to the descriptors. The other camera locally searches and matches the received descriptors to
identify and geometrically validate a previously seen place. Experiments on different scenarios
show the improved matching accuracy of the joint multi-scale extraction and temporal reduction
through comparisons of different temporal reduction strategies, as well as the cross-camera
matching strategy based on Bag of Binary Words, and the application to several binary descriptors.
We also show that XC-PR achieves similar accuracy to, but is faster on average than, a baseline
consisting of an incremental list of spatio-temporal descriptors. Moreover, XC-PR achieves similar
accuracy to a frame-based Bag of Binary Words approach adapted to our approach, while
avoiding matching features that cannot be informative, e.g. for 3D reconstruction
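The binary test described in the abstract above (comparing intensity values between pixel pairs) can be illustrated with a minimal Python sketch. It assumes a grey-level patch stored as a list of rows and a hand-written list of pixel pairs; in the thesis the pairs are learnt, and the function name `binary_descriptor`, the direction of the comparison, the patch, and the pairs below are all hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int]  # (row, column) inside the patch


def binary_descriptor(
    patch: Sequence[Sequence[int]],
    pairs: Sequence[Tuple[Pixel, Pixel]],
) -> List[int]:
    """Return one bit per pixel pair: 1 if the first pixel is darker than the second."""
    bits = []
    for (r1, c1), (r2, c2) in pairs:
        # Binary test: compare the two intensity values (direction is an assumption).
        bits.append(1 if patch[r1][c1] < patch[r2][c2] else 0)
    return bits


# Hypothetical 2x2 patch and three illustrative pixel pairs.
patch = [[10, 20],
         [30, 5]]
pairs = [((0, 0), (0, 1)), ((1, 0), (1, 1)), ((0, 1), (1, 0))]
descriptor = binary_descriptor(patch, pairs)
```

Applying the same pairs to the patch resampled at each level of an image pyramid would give one such bit string per scale.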
Improving filling level classification with adversarial training
We investigate the problem of classifying - from a single image - the level
of content in a cup or a drinking glass. This problem is made challenging by
several ambiguities caused by transparencies, shape variations and partial
occlusions, and by the availability of only small training datasets. In this
paper, we tackle this problem with an appropriate strategy for transfer
learning. Specifically, we use adversarial training in a generic source dataset
and then refine the training with a task-specific dataset. We also discuss and
experimentally evaluate several training strategies and their combination on a
range of container types of the CORSMAL Containers Manipulation dataset. We
show that transfer learning with adversarial training in the source domain
consistently improves the classification accuracy on the test set and limits
the overfitting of the classifier to specific features of the training data.
Comment: Accepted to the 28th IEEE International Conference on Image
Processing (ICIP) 202
Affordance segmentation of hand-occluded containers from exocentric images
Visual affordance segmentation identifies the surfaces of an object an agent
can interact with. Common challenges for the identification of affordances are
the variety of the geometry and physical properties of these surfaces as well
as occlusions. In this paper, we focus on occlusions of an object that is
hand-held by a person manipulating it. To address this challenge, we propose an
affordance segmentation model that uses auxiliary branches to process the
object and hand regions separately. The proposed model learns affordance
features under hand-occlusion by weighting the feature map through hand and
object segmentation. To train the model, we annotated the visual affordances of
an existing dataset with mixed-reality images of hand-held containers in
third-person (exocentric) images. Experiments on both real and mixed-reality
images show that our model achieves better affordance segmentation and
generalisation than existing models.
Comment: Paper accepted to Workshop on Assistive Computer Vision and Robotics
(ACVR) in International Conference on Computer Vision (ICCV) 2023; 10 pages,
4 figures, 2 tables. Data, code, and trained models are available at
https://apicis.github.io/projects/acanet.htm
Astro MBSE: model based system engineering synthesized for the Italian astronomical community
Systems Engineering requires the involvement of different engineering disciplines: Software, Electronics, Mechanics
(often nowadays together as Mechatronics), Optics etc. Systems Engineering of Astronomical Instrumentation is no
exception to this. A critical point is the handling of the different points of view introduced by these disciplines, often
related to different tools and cultures. A Model Based Systems Engineering (MBSE) approach can help the Systems
Engineer to always have a complete view of the full system. Moreover, in an ideal situation, all of the information
resides in the model thus allowing different views of the System without having to resort to different sources of
information, often outdated. In the real world, however, this does not happen because the different actors (Optical
Designers, Mechanical Engineers, Astronomers etc.) should adopt the same language and this is clearly, at least
nowadays and for the immediate future, close to impossible.
In the Italian Astronomical Community, we are developing methodologies and tools to share the expertise in this field
among the different projects. In this paper we present the status of this activity that aims to deliver to the community
proper tools and templates to enable a uniform use of MBSE (friendly name Astro MBSE) among different projects
(ground and space based, …). We will analyze here different software and different approaches. The target and synthesis
of this work will be a support framework for the MBSE based system Engineering activity to the Italian Astronomical
Community (INAF)
Astro MBSE: overview on requirement management approaches for astronomical instrumentation
Systems Engineering requires the involvement of different engineering disciplines: Software, Electronics, Mechanics
(often nowadays together as Mechatronics), Optics etc. Astronomical Instrumentation is no exception to this. A critical
point is the handling of the requirements, their tracing, flow down and the interaction with stakeholders (flow up) and
subsystems (flow down) in order to have traceable and methodical evolution and management.
In the Italian Astronomical Community, we are developing methodologies and tools to share the expertise in this field
among the different projects. In this paper we will focus on the requirement management approach among different
projects (ground and space based, …). The target and synthesis of this work will be a support framework for the
Requirement management of the Italian Astronomical Community (INAF) projects
MORB: A Multi-Scale Binary Descriptor
Local image features play an important role in matching images under different geometric and photometric transformations. However, as the scale difference across views increases, the matching performance may considerably decrease. To address this problem we propose MORB, a multi-scale binary descriptor that is based on ORB and that improves the accuracy of feature matching under scale changes. MORB describes an image patch at different scales using an oriented sampling pattern of intensity comparisons in a predefined set of pixel pairs. We also propose a matching strategy that estimates the cross-scale match between MORB descriptors across views. Experiments show that MORB outperforms state-of-the-art binary descriptors under several transformations
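As a rough illustration of cross-scale matching between multi-scale binary descriptors, the sketch below takes the minimum Hamming distance over all pairs of scales. The abstract does not state how MORB estimates the cross-scale match, so this rule is an assumption, as are the function names and the toy 4-bit descriptors (stored as Python integers, one per scale).

```python
from typing import Sequence, Tuple


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two descriptors stored as integers."""
    return bin(a ^ b).count("1")


def cross_scale_distance(
    desc_a: Sequence[int], desc_b: Sequence[int]
) -> Tuple[int, int, int]:
    """Return (distance, scale index in A, scale index in B) for the best scale pair.

    Assumption: the cross-scale distance is the minimum Hamming distance over
    all scale pairs; ties resolve to the lowest scale indices.
    """
    return min(
        (hamming(a, b), i, j)
        for i, a in enumerate(desc_a)
        for j, b in enumerate(desc_b)
    )


# Toy example: two scales per feature, 4 bits per scale (hypothetical values).
feature_a = [0b1111, 0b1010]
feature_b = [0b0000, 0b1011]
best = cross_scale_distance(feature_a, feature_b)
```

Returning the scale indices along with the distance makes it possible to see which pyramid levels were matched, and so to read off the relative scale between the two views.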
The CORSMAL Benchmark for the Prediction of the Properties of Containers
The contactless estimation of the weight of a container and the amount of its content manipulated by a person are key pre-requisites for safe human-to-robot handovers. However, opaqueness and transparencies of the container and the content, and variability of materials, shapes, and sizes, make this estimation difficult. In this paper, we present a range of methods and an open framework to benchmark acoustic and visual perception for the estimation of the capacity of a container, and the type, mass, and amount of its content. The framework includes a dataset, specific tasks and performance measures. We conduct an in-depth comparative analysis of methods that used this framework and audio-only or vision-only baselines designed from related works. Based on this analysis, we can conclude that audio-only and audio-visual classifiers are suitable for the estimation of the type and amount of the content using different types of convolutional neural networks, combined with either recurrent neural networks or a majority voting strategy, whereas computer vision methods are suitable to determine the capacity of the container using regression and geometric approaches. Classifying the content type and level using only audio achieves a weighted average F1-score up to 81% and 97%, respectively. Estimating the container capacity with vision-only approaches and estimating the filling mass with audio-visual multi-stage approaches reach up to 65% weighted average capacity and mass scores. These results show that there is still room for improvement on the design of new methods. These new methods can be ranked and compared on the individual leaderboards provided by our open framework. © 2013 IEEE
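The weighted average F1-score reported above can be sketched as follows, assuming the common definition in which each class's F1-score is weighted by its share of the ground-truth samples. Whether the benchmark uses exactly this weighting is not stated in the abstract; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def weighted_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Support-weighted mean of per-class F1-scores (assumed definition)."""
    support = Counter(y_true)  # number of ground-truth samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight each class by its support
    return score


# Toy example with two classes (hypothetical labels).
y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
result = weighted_f1(y_true, y_pred)
```

In this toy example class 0 has F1 = 2/3 and class 1 has F1 = 4/5, each with half of the samples, so the weighted average is their mean.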
Multi-camera Matching of Spatio-Temporal Binary Features
Local image features are generally robust to different geometric and photometric transformations on planar surfaces or under narrow baseline views. However, the matching performance decreases considerably across cameras with unknown poses separated by a wide baseline. To address this problem, we accumulate temporal information within each view by tracking local binary features, which encode intensity comparisons of pixel pairs in an image patch. We then encode the spatio-temporal features into fixed-length binary descriptors by selecting temporally dominant binary values. We complement the descriptor with a binary vector that identifies intensity comparisons that are temporally unstable. Finally, we use this additional vector to ignore the corresponding binary values in the fixed-length binary descriptor when matching the features across cameras. We analyse the performance of the proposed approach and compare it with baselines
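The temporal reduction and masked matching described above can be sketched in a few lines of Python. The sketch assumes that a bit counts as temporally stable when its dominant value appears in at least a given fraction of the tracked frames, and that a bit is ignored during matching if either camera flags it as unstable. The threshold value, the function names, and the toy track are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def reduce_track(
    track: Sequence[Sequence[int]], stability: float = 0.8
) -> Tuple[List[int], List[int]]:
    """Reduce a track of binary descriptors to (dominant bits, unstable mask)."""
    n = len(track)
    dominant, unstable = [], []
    for i in range(len(track[0])):
        ones = sum(d[i] for d in track)
        bit = 1 if 2 * ones >= n else 0  # temporally dominant value
        agree = ones if bit == 1 else n - ones
        dominant.append(bit)
        # Flag the bit as unstable if the dominant value is not frequent enough.
        unstable.append(0 if agree / n >= stability else 1)
    return dominant, unstable


def masked_hamming(
    a: Sequence[int], mask_a: Sequence[int],
    b: Sequence[int], mask_b: Sequence[int],
) -> int:
    """Hamming distance that skips bits flagged unstable in either descriptor."""
    return sum(
        1
        for x, y, ma, mb in zip(a, b, mask_a, mask_b)
        if not (ma or mb) and x != y
    )


# Toy track: four frames of a 4-bit descriptor (hypothetical values).
track = [[1, 0, 1, 1], [1, 0, 0, 1], [1, 0, 1, 1], [1, 1, 0, 1]]
dominant, unstable = reduce_track(track)
distance = masked_hamming(dominant, unstable, [0, 1, 0, 1], [0, 0, 0, 0])
```

Here the second and third bits change over time and are masked out, so only the first and last bits contribute to the distance.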
3D mouth tracking from a compact microphone array co-located with a camera
We address the problem of 3D audio-visual person tracking using a compact platform with co-located audio-visual sensors, without a depth camera. We present a face detection driven approach supported by 3D hypothesis mapping to the image plane for visual feature matching. We then propose a video-assisted audio likelihood computation, which relies on a GCC-PHAT based acoustic map. Audio and video likelihoods are fused together in a particle filtering framework. The proposed approach copes with a reverberant and noisy environment, and can deal with a person being occluded, outside the camera’s Field of View (FoV), as well as not facing or far from the sensing platform. Experimental results show that we can provide accurate person tracking in both 3D and on imag
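For readers unfamiliar with GCC-PHAT, the following minimal sketch estimates the delay between two microphone signals with the generic textbook formulation: cross-power spectrum, phase transform (magnitude normalisation), and inverse transform. It is not the acoustic map of the paper. It uses a naive pure-Python DFT for self-containment, treats the delay as circular, and the function names and impulse test signals are assumptions.

```python
import cmath
from typing import List, Sequence


def dft(x: Sequence[complex], inverse: bool = False) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform (enough for short signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig: Sequence[float], ref: Sequence[float], eps: float = 1e-12) -> int:
    """Delay (in samples) of `sig` relative to `ref`, estimated with GCC-PHAT."""
    n = len(sig)
    assert len(ref) == n, "signals must have the same length"
    spec_sig, spec_ref = dft(sig), dft(ref)
    cross = [s * r.conjugate() for s, r in zip(spec_sig, spec_ref)]
    # Phase transform: keep only the phase of the cross-power spectrum.
    whitened = [c / (abs(c) + eps) for c in cross]
    cc = [v.real for v in dft(whitened, inverse=True)]
    peak = max(range(n), key=lambda i: cc[i])
    # Lags beyond n/2 wrap around and correspond to negative delays.
    return peak if peak <= n // 2 else peak - n


# Toy example: the second signal is the first one delayed by 3 samples.
ref = [0.0] * 16
ref[5] = 1.0
sig = [0.0] * 16
sig[8] = 1.0
delay = gcc_phat_delay(sig, ref)
```

With a known microphone spacing and sampling rate, such a delay would translate into a direction-of-arrival constraint; how the paper turns these correlations into an acoustic map is not detailed in the abstract.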
Accurate Target Annotation in 3D from Multimodal Streams
Accurate annotation is fundamental to quantify the performance of multi-sensor and multi-modal object detectors and trackers. However, invasive or expensive instrumentation is needed to automatically generate these annotations. To mitigate this problem, we present a multi-modal approach that leverages annotations from reference streams (e.g. individual camera views) and measurements from unannotated additional streams (e.g. audio) to infer 3D trajectories through an optimization. The core of our approach is a multi-modal extension of Bundle Adjustment with a cross-modal correspondence detection that selectively uses measurements in the optimization. We apply the proposed approach to fully annotate a new multi-modal and multi-view dataset for multi-speaker 3D tracking