REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos
Reconstructing dynamic 3D garment surfaces with open boundaries from
monocular videos is an important problem as it provides a practical and
low-cost solution for clothes digitization. Recent neural rendering methods
achieve high-quality dynamic clothed human reconstruction results from
monocular video, but these methods cannot separate the garment surface from the
body. Moreover, although existing garment reconstruction methods based on
feature curve representation demonstrate impressive results for garment
reconstruction from a single image, they struggle to generate temporally
consistent surfaces for video input. To address the above limitations, in
this paper, we formulate this task as an optimization problem of 3D garment
feature curves and surface reconstruction from monocular video. We introduce a
novel approach, called REC-MV, to jointly optimize the explicit feature curves
and the implicit signed distance field (SDF) of the garments. Then the open
garment meshes can be extracted via garment template registration in the
canonical space. Experiments on multiple casually captured datasets show that
our approach outperforms existing methods and can produce high-quality dynamic
garment surfaces. The source code is available at
https://github.com/GAP-LAB-CUHK-SZ/REC-MV.
Comment: CVPR 2023; Project Page: https://lingtengqiu.github.io/2023/REC-MV
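The joint curve-and-surface objective above can be caricatured with a toy stand-in: here `sphere_sdf` replaces REC-MV's learned garment SDF, and the loss simply asks explicit feature-curve points to lie on the implicit zero level set. This is a sketch under those assumptions, not the paper's actual optimization.

```python
import numpy as np

def sphere_sdf(pts, radius=1.0):
    # Stand-in implicit surface: signed distance to a sphere of given radius.
    return np.linalg.norm(pts, axis=-1) - radius

def curve_surface_loss(curve_pts, sdf):
    # Explicit feature curves should lie on the implicit surface,
    # i.e. the SDF evaluated at curve points should vanish.
    return float(np.mean(np.abs(sdf(curve_pts))))

# Curve points sampled exactly on the unit sphere -> (near-)zero loss.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
on_surface = np.stack(
    [np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=-1
)
assert curve_surface_loss(on_surface, sphere_sdf) < 1e-8
```

In the real method both the curves and the SDF are free variables updated jointly; here the surface is fixed only to keep the sketch self-contained.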
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
Masked Modeling (MM) has demonstrated widespread success in various vision
challenges, by reconstructing masked visual patches. Yet, applying MM for
large-scale 3D scenes remains an open problem due to the data sparsity and
scene complexity. The conventional random masking paradigm used in 2D images
often causes a high risk of ambiguity when recovering the masked region of 3D
scenes. To this end, we propose a novel informative-preserved reconstruction,
which explores local statistics to discover and preserve the representative
structured points, effectively enhancing the pretext masking task for 3D scene
understanding. Integrated with a progressive reconstruction scheme, our method
can concentrate on modeling regional geometry and suffers less ambiguity in
masked reconstruction. Besides, scenes under progressive masking ratios can
also serve to self-distill their intrinsic spatial consistency, encouraging the
model to learn consistent representations from unmasked areas. By elegantly
combining informative-preserved reconstruction on masked areas and consistency
self-distillation from unmasked areas, a unified framework called MM-3DScene is
yielded. We conduct comprehensive experiments on a host of downstream tasks.
The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2%
mIoU on semantic segmentation) demonstrates the superiority of our approach.
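One way to realize the "discover and preserve representative structured points" idea is a simple local statistic: score each point by how far it sits from the centroid of its nearest neighbors, then exempt the top-scoring points from masking. The brute-force neighbor search and the particular statistic below are illustrative assumptions, not MM-3DScene's exact module.

```python
import numpy as np

def informative_mask(points, keep_ratio=0.3, k=8):
    # Local statistic: distance from each point to the centroid of its
    # k nearest neighbors; large values mark structured, informative points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # skip self at column 0
    score = np.linalg.norm(points - points[nn].mean(axis=1), axis=-1)
    n_keep = max(1, int(keep_ratio * len(points)))
    keep = np.zeros(len(points), dtype=bool)
    keep[np.argsort(score)[::-1][:n_keep]] = True    # preserve most informative
    return keep                                      # True = never mask
```

A progressive schedule would then shrink `keep_ratio` over training so reconstruction gradually gets harder.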
SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation
We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the
3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments
3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames.
Our key idea involves locating 3D points in scenes as natural 3D prompts to
align their projected pixel prompts across frames, ensuring frame-consistency
in both pixel prompts and their SAM-predicted masks. Moreover, we suggest
filtering out low-quality 3D prompts based on feedback from all 2D frames, for
enhancing segmentation quality. We also propose to consolidate different 3D
prompts if they are segmenting the same object, bringing a more comprehensive
segmentation. Notably, our method does not require any additional training on
domain-specific data, enabling us to preserve the zero-shot power of SAM.
Extensive qualitative and quantitative results show that our method
consistently achieves higher quality and more diverse segmentation than
previous zero-shot or fully supervised approaches, and in many cases even
surpasses human-level annotations. The project page can be accessed at
https://mutianxu.github.io/sampro3d/.
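The core geometric step, turning one 3D prompt point into frame-aligned pixel prompts, is just pinhole projection with each frame's pose. A minimal sketch, assuming a 3x3 intrinsic matrix `K` and a 4x4 world-to-camera transform per frame (variable names are ours, not the paper's):

```python
import numpy as np

def project_prompt(X_world, K, T_world_to_cam):
    # Project one 3D prompt point into a posed frame: x = K [R|t] X.
    X_h = np.append(X_world, 1.0)            # homogeneous coordinates
    X_cam = (T_world_to_cam @ X_h)[:3]       # world -> camera frame
    uvw = K @ X_cam
    return uvw[:2] / uvw[2]                  # pixel prompt (u, v)
```

Running this for every frame yields the consistent per-frame pixel prompts that SAM is queried with; low-quality 3D prompts can then be filtered by aggregating SAM's mask quality across frames.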
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks
Despite the rapid advancement of unsupervised learning in visual
representation, it requires training on large-scale datasets that demand costly
data collection, and pose additional challenges due to concerns regarding data
privacy. Recently, synthetic images generated by text-to-image diffusion
models have shown great potential for benefiting image recognition. Although
promising, there has been inadequate exploration dedicated to unsupervised
learning on diffusion-generated images. To address this, we start by uncovering
that diffusion models' cross-attention layers inherently provide
annotation-free attention masks aligned with corresponding text inputs on
generated images. We then investigate the problems of three prevalent
unsupervised learning techniques (i.e., contrastive learning, masked modeling,
and vision-language pretraining) and introduce customized solutions by fully
exploiting the aforementioned free attention masks. Our approach is validated
through extensive experiments that show consistent improvements in baseline
models across various downstream tasks, including image classification,
detection, segmentation, and image-text retrieval. Our method helps close the
performance gap between unsupervised pretraining on synthetic data and on
real-world data.
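A common way such free attention masks are exploited is mask-weighted pooling: collapse a spatial feature map into one object-level embedding under the mask, which can then serve, e.g., as one side of a contrastive pair. The shapes and pooling scheme below are our assumptions for illustration, not Free-ATM's exact formulation.

```python
import numpy as np

def mask_pool(feat_map, attn_mask):
    # feat_map: (H, W, C) features; attn_mask: (H, W) non-negative weights
    # from a cross-attention layer. Returns a (C,) object-level embedding.
    w = attn_mask / (attn_mask.sum() + 1e-8)         # normalize to sum ~1
    return (feat_map * w[..., None]).sum(axis=(0, 1))
```

Pooling the same text token's mask across two augmented views of a generated image gives annotation-free positive pairs.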
Unveiling the comprehensive resources and environmental efficiency and its influencing factors: Within and across the five urban agglomerations in Northwest China
Promoting the comprehensive resources and environmental efficiency (CREE) in urban agglomerations (UAs) is of great practical significance for China’s sustainable development. However, CREE in UAs of underdeveloped regions has not received enough attention. Against this background, we constructed a systematic and coherent framework to study CREE, taking the five UAs of Northwest China as a case study. The super epsilon-based measure (EBM) model was applied to quantify CREE during 2000–2017. Subsequently, we analyzed the spatio-temporal patterns in detail. Through the Super-EBM and GTWR (geographically and temporally weighted regression) models, the endogenous components and exogenous determinants of CREE were examined. The results indicated that the CREE in the five UAs of Northwest China underwent a slight decrease as a whole and showed an intensified spatial divergence. It exhibited an obvious discontinuity and path bifurcation while being negatively correlated with spatial imbalance across the UAs. The CREE of different UAs showed various spatial distribution characteristics. Regarding the endogenous mechanism, the UAs shared certain commonalities while retaining individual characteristics. The exogenous mechanism manifested certain spatial heterogeneity across UAs while being generally consistent within each single UA. These results could provide insightful recommendations for resources and environmental governance in the study area and other similar regions.
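GTWR estimates a local regression at each observation, with other observations down-weighted by their spatio-temporal distance. A Gaussian kernel is one common choice; the specific kernel form, bandwidths, and scale factor below are assumptions for illustration, not necessarily those used in the study.

```python
import numpy as np

def gtwr_weight(ds, dt, hs, ht, lam=1.0):
    # Gaussian spatio-temporal kernel: observations close in both space
    # (distance ds, bandwidth hs) and time (distance dt, bandwidth ht)
    # receive larger weights in the local regression; lam balances the two.
    return np.exp(-(ds**2 / hs**2 + lam * dt**2 / ht**2))
```

Each local coefficient vector is then a weighted least-squares fit using these weights, which is how the exogenous determinants can vary over both space and time.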
China’s Land Uses in the Multi-Region Input–Output Framework
The finite resource of land is subject to competing pressures from food demand, urbanization, and ecosystem service provision. Linking land resource use to the whole production chain and the final consumption of various products and services offers a new perspective for understanding and managing land uses. This study conducted a systematic analysis of land uses at the provincial level in China using the multi-region input–output model for 2012. Land use patterns related to sectoral production and consumption in different provinces were examined. The results indicated that land use transfers between different provinces in China have formed a highly interacting network. Products and services involved in inter-provincial trade in China embodied 2.3 million km² of land uses, which constituted approximately 40% of the total national land uses finally consumed in China. Agriculture was the most direct-land-use-intensive sector, and industry was the most indirect-land-use-intensive sector. Land-scarce provinces with low per capita land availability have outsourced part of their land uses by net importing land from other provinces. The results have important policy implications for sustainable land uses in China.
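The standard input–output accounting behind such embodied-land results is the Leontief model: land embodied in final demand is f = diag(e)(I − A)⁻¹y, where A is the technical-coefficient matrix, e the direct land use per unit output, and y final demand. A minimal two-sector sketch (toy numbers, not the study's data):

```python
import numpy as np

def land_footprint(A, land_per_output, final_demand):
    # Land embodied in final demand via the Leontief inverse:
    # f = diag(e) (I - A)^{-1} y,
    # where e = direct land use per unit of sectoral output.
    L = np.linalg.inv(np.eye(len(A)) - A)   # Leontief inverse (I - A)^-1
    return np.diag(land_per_output) @ L @ final_demand
```

With A = 0 the footprint reduces to direct land use e·y; any positive intermediate inputs make the embodied footprint strictly larger, which is the indirect land use the abstract refers to.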
TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes
Many basic indoor activities such as eating or writing are conducted
upon different tabletops (e.g., coffee tables, writing desks). Understanding
tabletop scenes is thus indispensable for 3D indoor scene parsing
applications. Unfortunately, it is hard to meet this demand by directly
deploying data-driven algorithms, since 3D tabletop scenes are rarely available
in current datasets. To remedy this defect, we introduce TO-Scene, a
large-scale dataset focusing on tabletop scenes, which contains 20,740 scenes
with three variants. To acquire the data, we design an efficient and scalable
framework, where a crowdsourcing UI is developed to transfer CAD objects from
ModelNet and ShapeNet onto tables from ScanNet, then the output tabletop scenes
are simulated into real scans and annotated automatically.
Further, a tabletop-aware learning strategy is proposed for better perceiving
the small-sized tabletop instances. Notably, we also provide a real scanned
test set TO-Real to verify the practical value of TO-Scene. Experiments show
that the algorithms trained on TO-Scene indeed work on the realistic test data,
and our proposed tabletop-aware learning strategy greatly improves the
state-of-the-art results on both 3D semantic segmentation and object detection
tasks. Dataset and code are available at
https://github.com/GAP-LAB-CUHK-SZ/TO-Scene.
Comment: ECCV 2022 (Oral Presentation)
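The object-transfer step of such a pipeline boils down to sampling a valid pose on a detected table surface. A toy version, assuming an axis-aligned tabletop bounding box and an object footprint half-size (the real crowdsourcing UI and physics checks are far richer):

```python
import numpy as np

def place_on_table(table_bbox, obj_halfsize, rng):
    # table_bbox: (xmin, ymin, xmax, ymax, top_z) of the tabletop surface.
    # Sample a position that keeps the object's footprint fully on the table.
    xmin, ymin, xmax, ymax, top_z = table_bbox
    hx, hy = obj_halfsize
    x = rng.uniform(xmin + hx, xmax - hx)
    y = rng.uniform(ymin + hy, ymax - hy)
    return np.array([x, y, top_z])  # object base rests on the table surface
```

Repeating this for many CAD objects, then simulating scan noise, yields labeled tabletop scenes for free since object identities are known at placement time.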
Learning Geometry-Disentangled Representation for Complementary Understanding of 3D Object Point Cloud
In 2D image processing, some attempts decompose images into high and low
frequency components for describing edge and smooth parts respectively.
Similarly, the contour and flat area of 3D objects, such as the boundary and
seat area of a chair, describe different but also complementary geometries.
However, such investigation is missing from previous deep networks, which
understand point clouds by treating all points or local patches equally. To solve
this problem, we propose Geometry-Disentangled Attention Network (GDANet).
GDANet introduces Geometry-Disentangle Module to dynamically disentangle point
clouds into the contour and flat part of 3D objects, respectively denoted by
sharp and gentle variation components. Then GDANet exploits Sharp-Gentle
Complementary Attention Module that regards the features from sharp and gentle
variation components as two holistic representations, and pays different
attentions to them while fusing them respectively with original point cloud
features. In this way, our method captures and refines the holistic and
complementary 3D geometric semantics from two distinct disentangled components
to supplement the local information. Extensive experiments on 3D object
classification and segmentation benchmarks demonstrate that GDANet achieves the
state-of-the-arts with fewer parameters. Code is released on
https://github.com/mutianxu/GDANet.
Comment: Accepted by AAAI 2021
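The sharp/gentle disentanglement can be approximated with a high-frequency response per point: the residual between a point and the mean of its neighborhood. Points with the largest response form the sharp (contour) component, the smallest the gentle (flat) component. The brute-force neighbor search and this particular statistic are our simplifications of the Geometry-Disentangle Module.

```python
import numpy as np

def disentangle(points, m, k=8):
    # High-frequency response: residual of each point from the mean of its
    # k nearest neighbors. Large response -> sharp variation (contours),
    # small response -> gentle variation (flat areas).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # skip self
    resp = np.linalg.norm(points - points[nn].mean(axis=1), axis=-1)
    order = np.argsort(resp)
    return points[order[-m:]], points[order[:m]]    # (sharp, gentle)
```

The two components are then processed as holistic representations and fused back with the original point features via attention in the full network.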
HybridCap: Inertia-Aid Monocular Capture of Challenging Human Motions
Monocular 3D motion capture (mocap) is beneficial to many applications. The use of a single camera, however, often fails to handle occlusions of different body parts, and hence is limited to capturing relatively simple movements. We present a lightweight, hybrid mocap technique called HybridCap that augments the camera with only 4 Inertial Measurement Units (IMUs) in a novel learning-and-optimization framework. We first employ a weakly-supervised and hierarchical motion inference module based on cooperative pure residual recurrent blocks that serve as limb, body, and root trackers as well as an inverse kinematics solver. Our network effectively narrows the search space of plausible motions via coarse-to-fine pose estimation and manages to tackle challenging movements with high efficiency. We further develop a hybrid optimization scheme that combines inertial feedback and visual cues to improve tracking accuracy. Extensive experiments on various datasets demonstrate that HybridCap can robustly handle challenging movements ranging from fitness actions to Latin dance. It also achieves real-time performance of up to 60 fps with state-of-the-art accuracy.
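The spirit of a hybrid optimization that mixes inertial and visual cues can be shown on a one-dimensional toy pose: minimize a weighted sum of a visual residual and an inertial residual. Everything here (scalar pose, quadratic residuals, weights) is a deliberate simplification, not HybridCap's actual solver.

```python
import numpy as np

def hybrid_update(pose, visual_obs, imu_obs, w_vis=1.0, w_imu=0.5):
    # One step on E(p) = w_vis (p - visual_obs)^2 + w_imu (p - imu_obs)^2.
    grad = 2 * w_vis * (pose - visual_obs) + 2 * w_imu * (pose - imu_obs)
    lr = 1.0 / (2 * (w_vis + w_imu))   # for this quadratic, one exact step
    return pose - lr * grad            # lands on the weighted mean of cues
```

When the camera cue degrades (e.g., occlusion), raising `w_imu` lets the inertial feedback dominate, which is the qualitative behavior a hybrid scheme exploits.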