BID: Boundary-Interior Decoding for Unsupervised Temporal Action Localization Pre-Training
Skeleton-based motion representations are robust for action localization and
understanding because, compared with images, they are invariant to
perspective, lighting, and occlusion. Yet they are often ambiguous and
incomplete when taken out of context, even for human annotators. Just as
infants discern gestures before associating them with words, actions can be
conceptualized before being grounded with labels. We therefore propose the
first unsupervised pre-training framework, Boundary-Interior Decoding (BID),
which partitions a skeleton-based motion sequence into discovered,
semantically meaningful pre-action segments. By fine-tuning our pre-trained
network on a small amount of annotated data, we show results outperforming
SOTA methods by a large margin.
Comment: 18 pages, 8 figures
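Since the abstract does not spell out the segmentation mechanism, the sketch below is only a hypothetical illustration of unsupervised temporal boundary discovery on a skeleton sequence. It uses a generic feature-distance change-point heuristic, not BID's actual boundary-interior decoder; the function name and threshold are assumptions.

```python
import numpy as np

def discover_boundaries(skeleton_seq, threshold=1.5):
    """Hypothetical sketch: mark frames where per-frame motion features
    change sharply as candidate segment boundaries. This is a generic
    change-point heuristic, NOT the BID decoder from the paper."""
    # skeleton_seq: (T, J, 3) array of T frames, J joints, 3D coordinates
    feats = skeleton_seq.reshape(len(skeleton_seq), -1)     # flatten joints per frame
    diffs = np.linalg.norm(np.diff(feats, axis=0), axis=1)  # frame-to-frame motion magnitude
    z = (diffs - diffs.mean()) / (diffs.std() + 1e-8)       # normalize change scores
    boundaries = np.where(z > threshold)[0] + 1             # frames with unusually large change
    return boundaries
```

Segments between consecutive boundaries would then serve as candidate "pre-action" units for pre-training.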
SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution
Diffusion-based super-resolution (SR) models have recently garnered
significant attention due to their potent restoration capabilities. However,
conventional diffusion models sample noise from a single distribution,
constraining their ability to handle real-world scenes and complex textures
across semantic regions. With the success of the Segment Anything Model
(SAM), generating sufficiently fine-grained region masks can enhance the
detail recovery of diffusion-based SR models. However, directly integrating
SAM into SR models would incur a much higher computational cost. In this
paper, we propose the SAM-DiffSR model, which utilizes the fine-grained
structural information from SAM when sampling noise to improve image quality
without additional computational cost during inference. During training, we
encode structural position information into the segmentation mask from SAM.
The encoded mask is then integrated into the forward diffusion process by
modulating the sampled noise with it. This adjustment allows us to
independently adapt the noise mean within each corresponding segmentation
region. The diffusion model is trained to estimate this modulated noise.
Crucially, our proposed framework does NOT change the reverse diffusion
process and does NOT require SAM at inference. Experimental results
demonstrate the effectiveness of our proposed method, which suppresses
artifacts and surpasses existing diffusion-based methods by up to 0.74 dB in
PSNR on the DIV2K dataset. The code and dataset are available at
https://github.com/lose4578/SAM-DiffSR
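The core idea, a per-region shift of the forward-process noise mean, can be sketched as follows. This is a minimal illustration assuming a DDPM-style forward process; the names `region_shifts` and `alphas_cumprod` and the exact shapes are placeholders, not the authors' implementation.

```python
import torch

def modulated_forward_sample(x0, t, alphas_cumprod, seg_mask, region_shifts):
    """Minimal sketch of a structure-modulated forward diffusion step.

    x0:             (B, C, H, W) clean images
    t:              (B,) integer timesteps
    alphas_cumprod: (T,) cumulative alpha-bar schedule
    seg_mask:       (B, H, W) integer region ids from a SAM-style mask
    region_shifts:  (R, C) per-region noise-mean shift (assumption)
    """
    eps = torch.randn_like(x0)                               # standard Gaussian noise
    mu = region_shifts[seg_mask]                             # (B, H, W, C) per-pixel mean shift
    mu = mu.permute(0, 3, 1, 2)                              # -> (B, C, H, W)
    eps_mod = eps + mu                                       # shift noise mean per region
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps_mod  # noisy training sample
    return xt, eps_mod                                       # model learns to predict eps_mod
```

Because only the training-time forward noise is modulated, the reverse sampler is unchanged and the mask is not needed at inference, consistent with the abstract's claim.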
MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos
Convolutional neural network inference on video input is computationally
expensive and requires high memory bandwidth. Recently, DeltaCNN reduced
this cost by processing only pixels with significant updates over the
previous frame. However, DeltaCNN relies on static camera input. Moving
cameras add new challenges: newly unveiled image regions must be fused
efficiently with already processed regions to minimize the update rate,
without increasing memory overhead and without knowing the camera extrinsics
of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN
inference framework that supports moving cameras. We introduce spherical
buffers and padded convolutions to enable seamless fusion of newly unveiled
regions and previously processed regions without increasing the memory
footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for
moving camera videos.
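As a rough illustration of the delta-inference idea MotionDeltaCNN builds on (recompute only where the input changed significantly), here is a hypothetical sketch; the threshold and function names are assumptions, and the real systems propagate sparse per-layer deltas through the network rather than falling back to a full re-run.

```python
import torch

def delta_update(model, prev_frame, prev_output, new_frame, threshold=0.05):
    """Toy sketch of delta-based inference: skip work when the input
    barely changed. DeltaCNN/MotionDeltaCNN instead propagate sparse
    deltas through every CNN layer; this only conveys the intuition."""
    delta = (new_frame - prev_frame).abs().amax(dim=1, keepdim=True)  # per-pixel change
    active = delta > threshold                                        # mask of updated pixels
    if not active.any():
        return prev_output                                            # nothing changed: reuse output
    # Naive fallback: recompute the full frame when anything changed.
    # Sparse kernels would restrict computation to the active mask.
    return model(new_frame)
```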
Road Assessment Model and Pilot Application in China
Risk assessment of roads is an effective approach for road agencies to determine safety improvement investments, and it can increase the cost-effective returns in crash and injury reductions. To develop a robust risk assessment model for China, the Research Institute of Highway (RIOH) is developing the China Road Assessment Programme (ChinaRAP) model to characterize traffic crashes in China, in partnership with the International Road Assessment Programme (iRAP). The ChinaRAP model is based upon RIOH's achievements and iRAP models. This paper documents part of ChinaRAP's research work, mainly the RIOH model and its pilot application in a province in China.
EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild
We present EMDB, the Electromagnetic Database of Global 3D Human Pose and
Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL
pose and shape parameters with global body and camera trajectories for
in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and
a hand-held iPhone to record a total of 58 minutes of motion data, distributed
over 81 indoor and outdoor sequences and 10 participants. Together with
accurate body poses and shapes, we also provide global camera poses and body
root trajectories. To construct EMDB, we propose a multi-stage optimization
procedure, which first fits SMPL to the 6-DoF EM measurements and then refines
the poses via image observations. To achieve high-quality results, we leverage
a neural implicit avatar model to reconstruct detailed human surface geometry
and appearance, which allows for improved alignment and smoothness via a dense
pixel-level objective. Our evaluations, conducted with a multi-view volumetric
capture system, indicate that EMDB has an expected accuracy of 2.3 cm
positional and 10.6 degrees angular error, surpassing the accuracy of previous
in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB
methods for camera-relative and global pose estimation on EMDB. EMDB is
publicly available at https://ait.ethz.ch/emdb
Comment: Accepted to ICCV 2023
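The multi-stage fitting described above (first fit SMPL to the 6-DoF EM measurements, then refine against image observations) can be caricatured as a two-term optimization. The sketch below assumes a differentiable `smpl` forward function and simple L2 residuals; these are illustrative stand-ins, not the authors' pipeline.

```python
import torch

def fit_smpl_to_em(smpl, em_rot, em_pos, kp_2d, project, steps=200, lr=0.05, w_img=0.0):
    """Toy two-stage fit. Stage 1: call with w_img=0 (EM terms only).
    Stage 2: call again with w_img > 0 to refine with image keypoints.
    smpl(pose) -> (sensor_rot, sensor_pos, joints_3d) is an assumed interface."""
    pose = torch.zeros(72, requires_grad=True)          # SMPL pose parameters
    opt = torch.optim.Adam([pose], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        rot, pos, joints = smpl(pose)
        loss = ((rot - em_rot) ** 2).sum() + ((pos - em_pos) ** 2).sum()  # 6-DoF EM residuals
        if w_img > 0:
            loss = loss + w_img * ((project(joints) - kp_2d) ** 2).sum()  # 2D reprojection term
        loss.backward()
        opt.step()
    return pose.detach()
```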
Visual Estimation of Fingertip Pressure on Diverse Surfaces using Easily Captured Data
People often use their hands to make contact with the world and apply
pressure. Machine perception of this important human activity could be widely
applied. Prior research has shown that deep models can estimate hand pressure
based on a single RGB image. Yet, evaluations have been limited to controlled
settings, since performance relies on training data with high-resolution
pressure measurements that are difficult to obtain. We present a novel approach
that enables diverse data to be captured with only an RGB camera and a
cooperative participant. Our key insight is that people can be prompted to
perform actions that correspond with categorical labels describing contact
pressure (contact labels), and that the resulting weakly labeled data can be
used to train models that perform well under varied conditions. We demonstrate
the effectiveness of our approach by training on a novel dataset with 51
participants making fingertip contact with instrumented and uninstrumented
objects. Our network, ContactLabelNet, dramatically outperforms prior work,
performs well under diverse conditions, and matches or exceeds the
performance of human annotators.
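The weak-supervision recipe (prompt participants into actions that imply categorical pressure labels, then train on those labels) amounts to standard image classification. The label set, backbone, and function names below are assumptions for illustration, not the paper's ContactLabelNet architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Hypothetical categorical contact labels implied by prompted actions.
CONTACT_LABELS = ["no_contact", "light_press", "firm_press"]

def build_contact_classifier(num_labels=len(CONTACT_LABELS)):
    """Sketch: an RGB backbone trained with cross-entropy on weak,
    categorical pressure labels instead of dense pressure maps."""
    net = models.resnet18(weights=None)                 # any RGB backbone works here
    net.fc = nn.Linear(net.fc.in_features, num_labels)  # predict a contact class
    return net

def train_step(net, images, labels, opt):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(net(images), labels)  # weak-label objective
    loss.backward()
    opt.step()
    return loss.item()
```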