
    REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos

    Reconstructing dynamic 3D garment surfaces with open boundaries from monocular videos is an important problem, as it provides a practical and low-cost solution for clothes digitization. Recent neural rendering methods achieve high-quality dynamic clothed-human reconstruction from monocular video, but they cannot separate the garment surface from the body. Moreover, although existing garment reconstruction methods based on feature-curve representations demonstrate impressive results on single images, they struggle to generate temporally consistent surfaces for video input. To address these limitations, we formulate this task as an optimization problem of 3D garment feature curves and surface reconstruction from monocular video. We introduce a novel approach, called REC-MV, that jointly optimizes the explicit feature curves and the implicit signed distance field (SDF) of the garments. The open garment meshes can then be extracted via garment template registration in the canonical space. Experiments on multiple casually captured datasets show that our approach outperforms existing methods and produces high-quality dynamic garment surfaces. The source code is available at https://github.com/GAP-LAB-CUHK-SZ/REC-MV. (CVPR 2023; project page: https://lingtengqiu.github.io/2023/REC-MV)
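    To make the joint explicit/implicit optimization concrete, here is a minimal sketch pairing a small SDF network with explicitly optimized curve control points. All shapes, loss terms, and weights are hypothetical stand-ins (rendering losses are omitted); this is not the REC-MV implementation.

    ```python
    import torch
    import torch.nn as nn

    class SDFNet(nn.Module):
        """Tiny MLP mapping a 3D point to a signed distance."""
        def __init__(self, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, x):                      # x: (N, 3)
            return self.net(x)                     # (N, 1) signed distances

    sdf = SDFNet()
    # Explicit feature curves: 4 curves x 64 control points, optimized jointly.
    curves = nn.Parameter(torch.randn(4, 64, 3) * 0.1)
    opt = torch.optim.Adam(list(sdf.parameters()) + [curves], lr=1e-4)

    for step in range(1000):
        # Curve points should lie on the garment surface (zero level set).
        loss_curve = sdf(curves.reshape(-1, 3)).abs().mean()
        # Eikonal regularizer keeps the SDF well-behaved (|grad| = 1).
        pts = (torch.rand(2048, 3) * 2 - 1).requires_grad_(True)
        grad = torch.autograd.grad(sdf(pts).sum(), pts, create_graph=True)[0]
        loss_eik = ((grad.norm(dim=-1) - 1) ** 2).mean()
        loss = loss_curve + 0.1 * loss_eik         # rendering terms omitted
        opt.zero_grad(); loss.backward(); opt.step()
    ```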

    MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency

    Masked Modeling (MM) has demonstrated widespread success in various vision tasks by reconstructing masked visual patches. Yet applying MM to large-scale 3D scenes remains an open problem due to data sparsity and scene complexity. The conventional random-masking paradigm used on 2D images often carries a high risk of ambiguity when recovering masked regions of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which exploits local statistics to discover and preserve representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction manner, our method can concentrate on modeling regional geometry and suffers less ambiguity in masked reconstruction. Besides, scenes under progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, which requires learning consistent representations from unmasked areas. By elegantly combining informative-preserved reconstruction on masked areas and consistency self-distillation from unmasked areas, we obtain a unified framework called MM-3DScene. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach.
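    A rough sketch of the informative-preserved masking idea follows. The local statistic used here (distance of each point to its k-NN centroid, a cheap proxy for geometric variation) is an assumption standing in for the paper's actual choice.

    ```python
    import numpy as np
    from scipy.spatial import cKDTree

    def informative_preserved_mask(points, mask_ratio, k=16):
        """Mask low-information points while preserving structured ones.

        points: (N, 3) scene point cloud; mask_ratio in (0, 1).
        """
        tree = cKDTree(points)
        _, idx = tree.query(points, k=k + 1)       # (N, k+1) neighbor indices
        idx = idx[:, 1:]                           # drop self-match
        centroids = points[idx].mean(axis=1)       # (N, 3) neighborhood means
        score = np.linalg.norm(points - centroids, axis=1)

        n_mask = int(mask_ratio * len(points))
        order = np.argsort(score)                  # low score = flat/redundant
        mask = np.zeros(len(points), dtype=bool)
        mask[order[:n_mask]] = True                # mask least informative first
        return mask

    # Progressive masking: grow the ratio over training stages so the model
    # first recovers easy regions, then harder ones.
    pts = np.random.rand(4096, 3)
    for ratio in (0.3, 0.5, 0.7):
        m = informative_preserved_mask(pts, ratio)
    ```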

    SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation

    We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the 3D point cloud and multiple posed 2D frames of a 3D scene, our approach segments the scene by applying the pretrained Segment Anything Model (SAM) to the 2D frames. Our key idea is to locate 3D points in the scene as natural 3D prompts and align their projected pixel prompts across frames, ensuring frame-consistency in both the pixel prompts and their SAM-predicted masks. Moreover, we filter out low-quality 3D prompts based on feedback from all 2D frames to enhance segmentation quality, and we consolidate different 3D prompts that segment the same object, yielding more comprehensive segmentation. Notably, our method requires no additional training on domain-specific data, preserving the zero-shot power of SAM. Extensive qualitative and quantitative results show that our method consistently achieves higher-quality and more diverse segmentation than previous zero-shot or fully supervised approaches, and in many cases even surpasses human-level annotations. The project page can be accessed at https://mutianxu.github.io/sampro3d/.
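    The frame-alignment step reduces to standard pinhole projection: the same 3D prompt is projected into every posed frame, giving consistent pixel prompts for SAM. A minimal sketch (variable names are hypothetical):

    ```python
    import numpy as np

    def project_prompt(point_3d, K, world_to_cam):
        """Project one 3D prompt into a posed frame to get its pixel prompt.

        point_3d: (3,) world coordinates; K: (3, 3) intrinsics;
        world_to_cam: (4, 4) extrinsics. Returns (u, v) or None if the
        point falls behind the camera in this frame.
        """
        p = world_to_cam @ np.append(point_3d, 1.0)   # to camera coordinates
        if p[2] <= 0:                                 # behind the camera
            return None
        uv = K @ (p[:3] / p[2])                       # perspective division
        return uv[0], uv[1]
    ```

    Because every frame receives the projection of the same 3D point, the resulting SAM masks can be compared across frames, which is what enables filtering low-quality prompts from aggregated 2D feedback.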

    Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks

    Despite the rapid advancement of unsupervised learning in visual representation, it requires training on large-scale datasets that demand costly data collection and pose additional challenges due to data-privacy concerns. Recently, synthetic images generated by text-to-image diffusion models have shown great potential for image recognition. Although promising, unsupervised learning on diffusion-generated images has been inadequately explored. To address this, we start by uncovering that the cross-attention layers of diffusion models inherently provide annotation-free attention masks aligned with corresponding text inputs on generated images. We then investigate the problems of three prevalent unsupervised learning techniques (i.e., contrastive learning, masked modeling, and vision-language pretraining) and introduce customized solutions that fully exploit these free attention masks. Our approach is validated through extensive experiments showing consistent improvements over baseline models across various downstream tasks, including image classification, detection, segmentation, and image-text retrieval. By utilizing our method, it is possible to close the performance gap between unsupervised pretraining on synthetic data and real-world scenarios.
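    The "free mask" idea can be sketched as below. The attention-map shapes, and the assumption that the maps were already captured from a cross-attention layer during generation, are hypothetical; the abstract does not specify which layers are used.

    ```python
    import torch

    def attention_to_masks(cross_attn, token_ids, image_hw, thresh=0.5):
        """Turn diffusion cross-attention into per-token binary masks.

        cross_attn: (heads, H*W, T) attention from image queries to text
        tokens; token_ids: indices of the text tokens to extract masks for.
        """
        h, w = image_hw
        attn = cross_attn.mean(dim=0)                    # average heads -> (H*W, T)
        masks = []
        for t in token_ids:
            m = attn[:, t].reshape(h, w)                 # spatial map for token t
            m = (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalize to [0, 1]
            masks.append(m > thresh)                     # annotation-free mask
        return torch.stack(masks)                        # (len(token_ids), H, W)
    ```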

    Unveiling the comprehensive resources and environmental efficiency and its influencing factors: Within and across the five urban agglomerations in Northwest China

    Promoting comprehensive resources and environmental efficiency (CREE) in urban agglomerations (UAs) is of great practical significance for China’s sustainable development. However, CREE in the UAs of underdeveloped regions has not received enough attention. Against this background, we constructed a systematic and coherent framework to study CREE, taking the five UAs of Northwest China as a case. The super epsilon-based measure (EBM) model was applied to quantify CREE during 2000–2017, and we then analyzed the spatio-temporal patterns in detail. Through the super-EBM and GTWR (geographically and temporally weighted regression) models, the endogenous components and exogenous determinants of CREE were examined. The results indicated that CREE in the five UAs of Northwest China decreased slightly as a whole and showed intensified spatial divergence. It exhibited obvious discontinuity and path bifurcation while being negatively correlated with spatial imbalance across the UAs, and the CREE of different UAs showed various spatial distribution characteristics. Regarding the endogenous mechanism, the UAs had certain commonalities as well as their own characteristics. The exogenous mechanism manifested spatial heterogeneity across UAs while being generally consistent within each single UA. These results provide insightful recommendations for resources and environmental governance in the study area and other similar regions.
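    GTWR differs from ordinary regression by weighting observations with a joint spatio-temporal kernel. A minimal sketch of such a kernel follows; the Gaussian form, `bandwidth`, and the spatial/temporal balance `lam` are hypothetical calibration choices (normally selected by cross-validation), not the paper's exact specification.

    ```python
    import numpy as np

    def gtwr_weights(coords, times, i, bandwidth, lam=1.0):
        """Gaussian spatio-temporal kernel weights for observation i.

        coords: (N, 2) city locations; times: (N,) observation years.
        Returns (N,) weights for a locally weighted regression at point i.
        """
        d_space = np.sum((coords - coords[i]) ** 2, axis=1)  # squared spatial dist
        d_time = (times - times[i]) ** 2                     # squared temporal dist
        return np.exp(-(d_space + lam * d_time) / bandwidth ** 2)
    ```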

    China’s Land Uses in the Multi-Region Input–Output Framework

    The finite resource of land is subject to competing pressures from food demand, urbanization, and ecosystem-service provision. Linking land resource use to the whole production chain and to the final consumption of various products and services offers a new perspective for understanding and managing land use. This study conducted a systematic analysis of land use at the provincial level in China using a multi-region input–output model for 2012. Land use patterns related to sectoral production and consumption in different provinces were examined. The results indicated that land use transfers between provinces in China have formed a highly interactive network. Products and services involved in inter-provincial trade in China embodied 2.3 million km² of land use, approximately 40% of the total national land use finally consumed in China. Agriculture was the most land-intensive sector in terms of direct land use, and industry was the most intensive in terms of indirect land use. Land-scarce provinces with low per capita land availability have outsourced part of their land use by net-importing land from other provinces. The results have important policy implications for sustainable land use in China.
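    The standard way to attribute land use to final consumption in an input–output framework is via the Leontief inverse. A toy single-region sketch is shown below (a real MRIO stacks provinces × sectors into one large system; all numbers here are illustrative).

    ```python
    import numpy as np

    def embodied_land(A, land_direct, output, final_demand):
        """Land embodied in final demand via the Leontief framework.

        A: (S, S) technical coefficient matrix; land_direct: (S,) direct
        land use by sector; output: (S,) gross output; final_demand: (S,).
        Returns (S,) land attributed to each sector's final demand.
        """
        intensity = land_direct / output                # land per unit output
        L = np.linalg.inv(np.eye(len(A)) - A)           # Leontief inverse
        return intensity @ L * final_demand             # total (direct + indirect)

    # Toy 2-sector example: agriculture (land intensive) and industry.
    A = np.array([[0.2, 0.1],
                  [0.3, 0.25]])
    land = embodied_land(A, land_direct=np.array([100.0, 5.0]),
                         output=np.array([200.0, 300.0]),
                         final_demand=np.array([120.0, 180.0]))
    ```

    The `intensity @ L` term is what captures indirect land use: industry scores low on direct land but accumulates land embodied in its upstream agricultural inputs.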

    TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes

    Many basic indoor activities, such as eating or writing, are conducted on tabletops (e.g., coffee tables, writing desks), so understanding tabletop scenes is indispensable for 3D indoor scene parsing applications. Unfortunately, this demand is hard to meet by directly deploying data-driven algorithms, since 3D tabletop scenes are rarely available in current datasets. To remedy this, we introduce TO-Scene, a large-scale dataset focusing on tabletop scenes, which contains 20,740 scenes in three variants. To acquire the data, we design an efficient and scalable framework in which a crowdsourcing UI transfers CAD objects from ModelNet and ShapeNet onto tables from ScanNet; the resulting tabletop scenes are then simulated into realistic scans and annotated automatically. Furthermore, a tabletop-aware learning strategy is proposed for better perceiving the small-sized tabletop instances. Notably, we also provide a real scanned test set, TO-Real, to verify the practical value of TO-Scene. Experiments show that algorithms trained on TO-Scene indeed work on the realistic test data, and our tabletop-aware learning strategy greatly improves the state-of-the-art results on both 3D semantic segmentation and object detection. Dataset and code are available at https://github.com/GAP-LAB-CUHK-SZ/TO-Scene. (ECCV 2022, oral presentation)
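    For a flavor of the placement step, here is a toy, fully automated stand-in for what the paper does through a crowdsourcing UI: sampling non-overlapping 2D footprints for objects on a table's top surface via rejection sampling. Everything here is hypothetical illustration, not the TO-Scene pipeline.

    ```python
    import numpy as np

    def place_on_table(table_bbox, obj_footprints, seed=0):
        """Sample non-overlapping placements for objects on a tabletop.

        table_bbox: (xmin, ymin, xmax, ymax) of the table's top surface;
        obj_footprints: list of (w, d) object footprints in table coords.
        """
        rng = np.random.default_rng(seed)
        xmin, ymin, xmax, ymax = table_bbox
        placed = []
        for w, d in obj_footprints:
            for _ in range(100):                       # rejection sampling
                x = rng.uniform(xmin, xmax - w)
                y = rng.uniform(ymin, ymax - d)
                box = (x, y, x + w, y + d)
                # Accept only if the new box is disjoint from all placed boxes.
                if all(box[2] <= p[0] or box[0] >= p[2] or
                       box[3] <= p[1] or box[1] >= p[3] for p in placed):
                    placed.append(box)
                    break
        return placed
    ```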

    Learning Geometry-Disentangled Representation for Complementary Understanding of 3D Object Point Cloud

    In 2D image processing, some approaches decompose images into high- and low-frequency components that describe edges and smooth regions, respectively. Similarly, the contours and flat areas of 3D objects, such as the boundary and seat of a chair, describe different but complementary geometries. However, such an investigation is missing from previous deep networks, which understand point clouds by treating all points or local patches equally. To solve this problem, we propose the Geometry-Disentangled Attention Network (GDANet). GDANet introduces a Geometry-Disentangle Module that dynamically disentangles point clouds into the contour and flat parts of 3D objects, denoted by sharp and gentle variation components, respectively. GDANet then exploits a Sharp-Gentle Complementary Attention Module that regards the features from the sharp and gentle variation components as two holistic representations and pays different attention to each while fusing them with the original point cloud features. In this way, our method captures and refines the holistic and complementary 3D geometric semantics from two distinct disentangled components to supplement the local information. Extensive experiments on 3D object classification and segmentation benchmarks demonstrate that GDANet achieves state-of-the-art results with fewer parameters. Code is released at https://github.com/mutianxu/GDANet. (AAAI 2021)
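    To illustrate the disentangling idea, the sketch below splits a point cloud into sharp- and gentle-variation parts using a simple geometric high-frequency proxy. The paper's Geometry-Disentangle Module is learned; this fixed heuristic (k-NN centroid distance) only demonstrates the concept.

    ```python
    import torch

    def disentangle_points(points, k=16, n_sharp=256):
        """Split points into sharp (contour) and gentle (flat) components.

        points: (N, 3) object point cloud.
        """
        dists = torch.cdist(points, points)                      # (N, N)
        knn = dists.topk(k + 1, largest=False).indices[:, 1:]    # drop self
        centroid = points[knn].mean(dim=1)                       # (N, 3)
        score = (points - centroid).norm(dim=1)                  # high = edge-like
        order = score.argsort(descending=True)
        sharp = points[order[:n_sharp]]                          # contour points
        gentle = points[order[n_sharp:]]                         # flat-area points
        return sharp, gentle
    ```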

    HybridCap: Inertia-Aid Monocular Capture of Challenging Human Motions

    Monocular 3D motion capture (mocap) is beneficial to many applications. The use of a single camera, however, often fails to handle occlusions of different body parts, and is thus limited to capturing relatively simple movements. We present a lightweight hybrid mocap technique called HybridCap that augments the camera with only four inertial measurement units (IMUs) in a novel learning-and-optimization framework. We first employ a weakly supervised, hierarchical motion inference module based on cooperative pure-residual recurrent blocks that serve as limb, body, and root trackers, as well as an inverse kinematics solver. Our network effectively narrows the search space of plausible motions via coarse-to-fine pose estimation and tackles challenging movements with high efficiency. We further develop a hybrid optimization scheme that combines inertial feedback and visual cues to improve tracking accuracy. Extensive experiments on various datasets demonstrate that HybridCap robustly handles challenging movements ranging from fitness actions to Latin dance, and it achieves real-time performance up to 60 fps with state-of-the-art accuracy.
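    A hybrid optimization of this kind typically minimizes a two-term residual: an inertial term matching predicted bone orientations to IMU measurements, and a visual term matching projected joints to detected 2D keypoints. The sketch below shows such a combined residual; all names and the exact term weighting are hypothetical, not HybridCap's formulation.

    ```python
    import numpy as np

    def hybrid_residual(theta, imu_orient, pred_orient, kp2d, proj_fn, w_imu=1.0):
        """Combined inertial + visual residual for pose refinement.

        theta: pose parameters being optimized; imu_orient: (4, 3, 3)
        measured orientations of the 4 IMUs; pred_orient(theta) -> (4, 3, 3)
        corresponding bone orientations from forward kinematics;
        kp2d: (J, 2) detected 2D keypoints; proj_fn(theta) -> (J, 2)
        projected model joints.
        """
        r_imu = (pred_orient(theta) - imu_orient).reshape(-1)  # inertial term
        r_vis = (proj_fn(theta) - kp2d).reshape(-1)            # reprojection term
        # Stacked residual vector, e.g. for scipy.optimize.least_squares.
        return np.concatenate([w_imu * r_imu, r_vis])
    ```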