94 research outputs found
SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising
Recently, promptable segmentation models, such as the Segment Anything Model
(SAM), have demonstrated robust zero-shot generalization capabilities on static
images. These promptable models exhibit denoising abilities for imprecise
prompt inputs, such as imprecise bounding boxes. In this paper, we explore the
potential of applying SAM to track and segment objects in videos where we
recognize the tracking task as a prompt denoising task. Specifically, we
iteratively propagate the bounding box of each object's mask in the preceding
frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising
capability against position and size variations, we propose a multi-prompt
strategy where we provide multiple jittered and scaled box prompts for each
object and preserve the mask prediction with the highest semantic similarity to
the template mask. We also introduce a point-based refinement stage to handle
occlusions and reduce cumulative errors. Without involving tracking modules,
our approach demonstrates comparable performance in video object/instance
segmentation tasks on three datasets: DAVIS2017, YouTubeVOS2018, and UVO,
serving as a concise baseline and endowing SAM-based downstream applications
with tracking capabilities.
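As an illustration of the box propagation and multi-prompt idea described in the abstract above, here is a minimal Python sketch. The function names (mask_to_box, jittered_boxes), the jitter ranges, and the uniform sampling are assumptions made for illustration, not details taken from SAM-PD. Running the segmentation model on each box and scoring the resulting masks against the template mask are left out.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def mask_to_box(mask: List[List[int]]) -> Box:
    """Return the tight bounding box around the nonzero pixels of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask is empty; there is no box to propagate")
    return (float(min(cols)), float(min(rows)), float(max(cols)), float(max(rows)))


def jittered_boxes(
    box: Box,
    n: int,
    max_shift: float = 0.1,   # assumed: shift up to 10% of box width/height
    max_scale: float = 0.2,   # assumed: scale factor in [0.8, 1.2]
    seed: Optional[int] = None,
) -> List[Box]:
    """Return the original box followed by n randomly shifted and scaled copies."""
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    boxes = [box]
    for _ in range(n):
        scale = 1.0 + rng.uniform(-max_scale, max_scale)
        new_cx = cx + rng.uniform(-max_shift, max_shift) * w
        new_cy = cy + rng.uniform(-max_shift, max_shift) * h
        half_w, half_h = w * scale / 2.0, h * scale / 2.0
        boxes.append((new_cx - half_w, new_cy - half_h, new_cx + half_w, new_cy + half_h))
    return boxes
```

In a full pipeline, each returned box would be passed as a prompt for the next frame, and only the predicted mask most similar to the template mask would be kept. The similarity measure itself is not specified here.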
InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild
Understanding human interaction with objects is an important research topic
for embodied Artificial Intelligence and identifying the objects that humans
are interacting with is a primary problem for interaction understanding.
Existing methods rely on frame-based detectors to locate interacting objects.
However, this approach is susceptible to heavy occlusions, background clutter,
and distracting objects. To address these limitations, we propose to leverage
spatio-temporal information of hand-object interaction to track interactive
objects under these challenging cases. Since, unlike in standard object
tracking problems, no prior knowledge of the objects to be tracked is
available, we first
utilize the spatial relation between hands and objects to adaptively discover
the interacting objects from the scene. Second, the consistency and continuity
of the appearance of objects between successive frames are exploited to track
the objects. With this tracking formulation, our method also benefits from
training on large-scale general object-tracking datasets. We further curate a
video-level hand-object interaction dataset for testing and evaluation from
100DOH. The quantitative results demonstrate that our proposed method
outperforms the state-of-the-art methods. Specifically, in scenes with
continuous interaction with different objects, we achieve an impressive
improvement of about 10% as evaluated using the Average Precision (AP) metric.
Our qualitative findings also illustrate that our method can produce more
continuous trajectories for interacting objects.
Comment: IROS 202
mmBody Benchmark: 3D Body Reconstruction Dataset and Analysis for Millimeter Wave Radar
Millimeter Wave (mmWave) Radar is gaining popularity as it can work in
adverse environments like smoke, rain, snow, poor lighting, etc. Prior work has
explored the possibility of reconstructing 3D skeletons or meshes from the
noisy and sparse mmWave Radar signals. However, it is unclear how accurately we
can reconstruct the 3D body from mmWave signals across scenes and how this
accuracy compares with that of cameras, both important aspects to consider
when either using mmWave radars alone or combining them with
cameras. To answer these questions, an automatic 3D body annotation system is
first designed and built up with multiple sensors to collect a large-scale
dataset. The dataset consists of synchronized and calibrated mmWave radar point
clouds and RGB(D) images in different scenes and skeleton/mesh annotations for
humans in the scenes. With this dataset, we train state-of-the-art methods with
inputs from different sensors and test them in various scenarios. The results
demonstrate that 1) despite the noise and sparsity of the generated point
clouds, the mmWave radar can achieve better reconstruction accuracy than the
RGB camera but worse than the depth camera; 2) the reconstruction from the
mmWave radar is only moderately affected by adverse weather conditions, while
the RGB(D) camera is severely affected. Further, analysis of the dataset and
the results sheds light on improving the reconstruction from the mmWave radar
and the combination of signals from different sensors.
Comment: ACM Multimedia 2022, Project Page:
https://chen3110.github.io/mmbody/index.htm
Context-Aware Integration of Language and Visual References for Natural Language Tracking
Tracking by natural language specification (TNL) aims to consistently
localize a target in a video sequence given a linguistic description in the
initial frame. Existing methods perform language-based and template-based
matching for target reasoning separately and then merge the matching results
from the two sources. They suffer from tracking drift when the language and
visual templates misalign with the dynamic target state, and from ambiguity in
the later merging stage. To tackle these issues, we propose a joint multi-modal
tracking
framework with 1) a prompt modulation module to leverage the complementarity
between temporal visual templates and language expressions, enabling precise
and context-aware appearance and linguistic cues, and 2) a unified target
decoding module that integrates the multi-modal reference cues and executes the
integrated queries on the search image to directly predict the target location
in an end-to-end manner. This design ensures spatio-temporal consistency by
leveraging historical visual information and introduces an integrated solution,
generating predictions in a single step. Extensive experiments conducted on
TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed
approach. The results demonstrate competitive performance against
state-of-the-art methods for both tracking and grounding.
Comment: Accepted by CVPR202
NeurAR: Neural Uncertainty for Autonomous 3D Reconstruction
Implicit neural representations have shown compelling results in offline 3D
reconstruction and also recently demonstrated the potential for online SLAM
systems. However, applying them to autonomous 3D reconstruction, where robots
are required to explore a scene and plan a view path for the reconstruction,
has not been studied. In this paper, we explore for the first time the
possibility of using implicit neural representations for autonomous 3D scene
reconstruction by addressing two key challenges: 1) seeking a criterion to
measure the quality of the candidate viewpoints for the view planning based on
the new representations, and 2) learning the criterion from data that can
generalize to different scenes instead of hand-crafting one. For the first
challenge, a proxy of Peak Signal-to-Noise Ratio (PSNR) is proposed to quantify
viewpoint quality. The proxy is acquired by treating the color of a spatial
point in a scene as a random variable under a Gaussian distribution rather than
a deterministic one; the variance of the distribution quantifies the
uncertainty of the reconstruction and composes the proxy. For the second
challenge, the proxy is optimized jointly with the parameters of an implicit
neural network for the scene. With the proposed view quality criterion, we can
then apply the new representations to autonomous 3D reconstruction. Our method
demonstrates significant improvements on various metrics for the rendered image
quality and the geometry quality of the reconstructed 3D models when compared
with variants using TSDF or reconstruction without view planning.
Comment: 8 pages, 6 figures, 2 table
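To make the link between predicted color variance and a PSNR-like score concrete, here is a small Python sketch. It assumes that the mean predicted variance over a view's pixels can stand in for the mean squared error in the standard PSNR formula, 10 * log10(MAX^2 / MSE). That substitution, the function names, and the eps floor are illustrative assumptions; the paper's exact proxy may be defined differently.

```python
import math
from typing import Sequence


def psnr(mse: float, max_value: float = 1.0) -> float:
    """Standard PSNR in dB for a given mean squared error."""
    if mse <= 0.0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def psnr_proxy(variances: Sequence[float], max_value: float = 1.0,
               eps: float = 1e-12) -> float:
    """PSNR-like score where mean predicted variance replaces the MSE.

    Assumption: the per-pixel variance of the Gaussian color distribution
    approximates the expected squared reconstruction error at that pixel.
    """
    if not variances:
        raise ValueError("variances must not be empty")
    mean_variance = sum(variances) / len(variances)
    # eps guards against log10 of infinity when all variances are zero.
    return psnr(max(mean_variance, eps), max_value)
```

Under this assumption, a candidate viewpoint with high predicted variance gets a low score, which marks it as poorly reconstructed so far.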
Multiple target tracking under occlusions using modified Joint Probabilistic Data Association
Target size degrades tracking performance, an effect that most previous studies have neglected for simplicity. In multiple target tracking, target size causes occlusions: one target can become a moving obstacle that blocks the direct channel between the anchor and another target. This paper investigates the data association problem in multiple target tracking. To reduce the computational complexity of the traditional Joint Probabilistic Data Association (JPDA) algorithm, a modified JPDA algorithm is proposed that performs data association using information about occlusion conditions, which are identified by a three-step algorithm. Simulation results show that the proposed algorithm achieves good tracking performance with low computational complexity.
ImmFusion: Robust mmWave-RGB Fusion for 3D Human Body Reconstruction in All Weather Conditions
3D human reconstruction from RGB images achieves decent results in good
weather conditions but degrades dramatically in rough weather. As a complement,
mmWave radars have been employed to reconstruct 3D human joints and meshes in
rough weather. However, combining RGB and mmWave signals for robust all-weather
3D human reconstruction is still an open challenge, given the sparse nature of
mmWave and the vulnerability of RGB images. In this paper, we present
ImmFusion, the first mmWave-RGB fusion solution to robustly reconstruct 3D human
bodies in all weather conditions. Specifically, ImmFusion consists of
image and point backbones for token feature extraction and a Transformer module
for token fusion. The image and point backbones refine global and local
features from original data, and the Fusion Transformer Module aims for
effective information fusion of two modalities by dynamically selecting
informative tokens. Extensive experiments on a large-scale dataset, mmBody,
captured in various environments demonstrate that ImmFusion can efficiently
utilize the information of two modalities to achieve a robust 3D human body
reconstruction in all weather conditions. In addition, our method's accuracy is
significantly superior to that of state-of-the-art Transformer-based
LiDAR-camera fusion methods.
Realtime characteristic of FF like centralized control fieldbus and its state-of-art
The temporal properties of a fieldbus MAC protocol are critical to meeting the real-time constraints of field devices on the factory floor. Among the various types of MAC protocols, those using a centralized strategy are characterized not only by providing a feasible schedule online that meets the different temporal constraints of field devices, but also by allowing schedulability analysis offline, a priori. WorldFIP and FF, two popular international fieldbus standards, both adopt a centralized strategy, which is implemented mainly through a schedule table (ST). This paper discusses how to construct the ST, including its size, the scheduling algorithm, and schedulability analysis, so as to meet the requirements of field devices on response time, jitter, and synchronization, and reviews the state of the art.
Comment: International conference with proceedings and peer review.
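The question of schedule table size can be illustrated with a short Python sketch. It assumes the layout commonly described for centralized fieldbus scheduling: an elementary cycle (microcycle) equal to the greatest common divisor of the variable periods, and a table that repeats every macrocycle equal to their least common multiple. The function names, the example periods, and the naive placement rule are illustrative assumptions, not the construction proposed in the paper.

```python
from functools import reduce
from math import gcd
from typing import Dict, List, Tuple


def schedule_table_size(periods_ms: Dict[str, int]) -> Tuple[int, int, int]:
    """Return (microcycle, macrocycle, number of microcycles in the table)."""
    values = list(periods_ms.values())
    if not values or any(p <= 0 for p in values):
        raise ValueError("periods must be positive integers")
    microcycle = reduce(gcd, values)
    macrocycle = reduce(lambda a, b: a * b // gcd(a, b), values)
    return microcycle, macrocycle, macrocycle // microcycle


def build_schedule_table(periods_ms: Dict[str, int]) -> List[List[str]]:
    """Place each variable in every microcycle that starts on a multiple of its period.

    This naive rule releases all variables together at time 0 and does not
    balance load across microcycles or check that a microcycle can hold them.
    """
    microcycle, _, slots = schedule_table_size(periods_ms)
    table: List[List[str]] = [[] for _ in range(slots)]
    for k in range(slots):
        start = k * microcycle
        for name, period in periods_ms.items():
            if start % period == 0:
                table[k].append(name)
    return table
```

For hypothetical periods of 10, 20, and 40 ms, this gives a 10 ms microcycle and a 40 ms macrocycle, so the table has four entries. A schedulability check would then compare the transmission time needed in each microcycle with its length.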