176 research outputs found

    BID: Boundary-Interior Decoding for Unsupervised Temporal Action Localization Pre-Trainin

    Full text link
    Skeleton-based motion representations are robust for action localization and understanding for their invariance to perspective, lighting, and occlusion, compared with images. Yet, they are often ambiguous and incomplete when taken out of context, even for human annotators. As infants discern gestures before associating them with words, actions can be conceptualized before being grounded with labels. Therefore, we propose the first unsupervised pre-training framework, Boundary-Interior Decoding (BID), that partitions a skeleton-based motion sequence into discovered semantically meaningful pre-action segments. By fine-tuning our pre-training network with a small number of annotated data, we show results out-performing SOTA methods by a large margin.Comment: 18 pages, 8 figure

    SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution

    Full text link
    Diffusion-based super-resolution (SR) models have recently garnered significant attention due to their potent restoration capabilities. But conventional diffusion models perform noise sampling from a single distribution, constraining their ability to handle real-world scenes and complex textures across semantic regions. With the success of segment anything model (SAM), generating sufficiently fine-grained region masks can enhance the detail recovery of diffusion-based SR model. However, directly integrating SAM into SR models will result in much higher computational cost. In this paper, we propose the SAM-DiffSR model, which can utilize the fine-grained structure information from SAM in the process of sampling noise to improve the image quality without additional computational cost during inference. In the process of training, we encode structural position information into the segmentation mask from SAM. Then the encoded mask is integrated into the forward diffusion process by modulating it to the sampled noise. This adjustment allows us to independently adapt the noise mean within each corresponding segmentation area. The diffusion model is trained to estimate this modulated noise. Crucially, our proposed framework does NOT change the reverse diffusion process and does NOT require SAM at inference. Experimental results demonstrate the effectiveness of our proposed method, showcasing superior performance in suppressing artifacts, and surpassing existing diffusion-based methods by 0.74 dB at the maximum in terms of PSNR on DIV2K dataset. The code and dataset are available at https://github.com/lose4578/SAM-DiffSR

    MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos

    Full text link
    Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Recently, DeltaCNN managed to reduce the cost by only processing pixels with significant updates over the previous frame. However, DeltaCNN relies on static camera input. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate - without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that supports moving cameras. We introduce spherical buffers and padded convolutions to enable seamless fusion of newly unveiled regions and previously processed regions -- without increasing memory footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for moving camera videos

    Road Assessment Model and Pilot Application in China

    Get PDF
    Risk assessment of roads is an effective approach for road agencies to determine safety improvement investments. It can increases the cost-effective returns in crash and injury reductions. To get a powerful Chinese risk assessment model, Research Institute of Highway (RIOH) is developing China Road Assessment Programme (ChinaRAP) model to show the traffic crashes in China in partnership with International Road Assessment Programme (iRAP). The ChinaRAP model is based upon RIOH’s achievements and iRAP models. This paper documents part of ChinaRAP’s research work, mainly including the RIOH model and its pilot application in a province in China

    EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild

    Full text link
    We present EMDB, the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. EMDB is a novel dataset that contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record a total of 58 minutes of motion data, distributed over 81 indoor and outdoor sequences and 10 participants. Together with accurate body poses and shapes, we also provide global camera poses and body root trajectories. To construct EMDB, we propose a multi-stage optimization procedure, which first fits SMPL to the 6-DoF EM measurements and then refines the poses via image observations. To achieve high-quality results, we leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance, which allows for improved alignment and smoothness via a dense pixel-level objective. Our evaluations, conducted with a multi-view volumetric capture system, indicate that EMDB has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available under https://ait.ethz.ch/emdbComment: Accepted to ICCV 202

    Visual Estimation of Fingertip Pressure on Diverse Surfaces using Easily Captured Data

    Full text link
    People often use their hands to make contact with the world and apply pressure. Machine perception of this important human activity could be widely applied. Prior research has shown that deep models can estimate hand pressure based on a single RGB image. Yet, evaluations have been limited to controlled settings, since performance relies on training data with high-resolution pressure measurements that are difficult to obtain. We present a novel approach that enables diverse data to be captured with only an RGB camera and a cooperative participant. Our key insight is that people can be prompted to perform actions that correspond with categorical labels describing contact pressure (contact labels), and that the resulting weakly labeled data can be used to train models that perform well under varied conditions. We demonstrate the effectiveness of our approach by training on a novel dataset with 51 participants making fingertip contact with instrumented and uninstrumented objects. Our network, ContactLabelNet, dramatically outperforms prior work, performs well under diverse conditions, and matched or exceeded the performance of human annotators
    • …
    corecore