REC-MV: REconstructing 3D Dynamic Cloth from Monocular Videos
Reconstructing dynamic 3D garment surfaces with open boundaries from
monocular videos is an important problem as it provides a practical and
low-cost solution for clothes digitization. Recent neural rendering methods
achieve high-quality dynamic clothed human reconstruction results from
monocular video, but these methods cannot separate the garment surface from the
body. Moreover, although existing garment reconstruction methods based on
feature curve representation demonstrate impressive results for garment
reconstruction from a single image, they struggle to generate temporally
consistent surfaces for video input. To address the above limitations, in
this paper, we formulate this task as an optimization problem of 3D garment
feature curves and surface reconstruction from monocular video. We introduce a
novel approach, called REC-MV, to jointly optimize the explicit feature curves
and the implicit signed distance field (SDF) of the garments. Then the open
garment meshes can be extracted via garment template registration in the
canonical space. Experiments on multiple casually captured datasets show that
our approach outperforms existing methods and can produce high-quality dynamic
garment surfaces. The source code is available at
https://github.com/GAP-LAB-CUHK-SZ/REC-MV.
Comment: CVPR 2023; Project Page: https://lingtengqiu.github.io/2023/REC-MV
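The joint curve-and-surface objective above can be caricatured with a toy stand-in: here `sphere_sdf` replaces REC-MV's learned garment SDF, and the loss simply asks explicit feature-curve points to lie on the implicit zero level set. This is a sketch under those assumptions, not the paper's actual optimization.

```python
import numpy as np

def sphere_sdf(pts, radius=1.0):
    # Stand-in implicit surface: signed distance to a sphere of given radius.
    return np.linalg.norm(pts, axis=-1) - radius

def curve_surface_loss(curve_pts, sdf):
    # Explicit feature curves should lie on the implicit surface,
    # i.e. the SDF evaluated at curve points should vanish.
    return float(np.mean(np.abs(sdf(curve_pts))))

# Curve points sampled exactly on the unit sphere -> (near-)zero loss.
theta = np.linspace(0, 2 * np.pi, 64, endpoint=False)
on_surface = np.stack(
    [np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=-1
)
assert curve_surface_loss(on_surface, sphere_sdf) < 1e-8
```

In the real method both the curves and the SDF are free variables updated jointly; here the surface is fixed only to keep the sketch self-contained.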
MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency
Masked Modeling (MM) has demonstrated widespread success in various vision
challenges, by reconstructing masked visual patches. Yet, applying MM for
large-scale 3D scenes remains an open problem due to the data sparsity and
scene complexity. The conventional random masking paradigm used in 2D images
often causes a high risk of ambiguity when recovering the masked region of 3D
scenes. To this end, we propose a novel informative-preserved reconstruction,
which explores local statistics to discover and preserve the representative
structured points, effectively enhancing the pretext masking task for 3D scene
understanding. Integrated with a progressive reconstruction scheme, our method
can concentrate on modeling regional geometry and suffers less ambiguity in
masked reconstruction. Besides, scenes under progressive masking ratios can
also serve to self-distill their intrinsic spatial consistency, encouraging the
model to learn consistent representations from unmasked areas. By elegantly
combining informative-preserved reconstruction on masked areas and consistency
self-distillation from unmasked areas, a unified framework called MM-3DScene is
yielded. We conduct comprehensive experiments on a host of downstream tasks.
The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2%
mIoU on semantic segmentation) demonstrates the superiority of our approach.
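One way to realize the "discover and preserve representative structured points" idea is a simple local statistic: score each point by how far it sits from the centroid of its nearest neighbors, then exempt the top-scoring points from masking. The brute-force neighbor search and the particular statistic below are illustrative assumptions, not MM-3DScene's exact module.

```python
import numpy as np

def informative_mask(points, keep_ratio=0.3, k=8):
    # Local statistic: distance from each point to the centroid of its
    # k nearest neighbors; large values mark structured, informative points.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # skip self at column 0
    score = np.linalg.norm(points - points[nn].mean(axis=1), axis=-1)
    n_keep = max(1, int(keep_ratio * len(points)))
    keep = np.zeros(len(points), dtype=bool)
    keep[np.argsort(score)[::-1][:n_keep]] = True    # preserve most informative
    return keep                                      # True = never mask
```

A progressive schedule would then shrink `keep_ratio` over training so reconstruction gradually gets harder.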
SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation
We introduce SAMPro3D for zero-shot 3D indoor scene segmentation. Given the
3D point cloud and multiple posed 2D frames of 3D scenes, our approach segments
3D scenes by applying the pretrained Segment Anything Model (SAM) to 2D frames.
Our key idea involves locating 3D points in scenes as natural 3D prompts to
align their projected pixel prompts across frames, ensuring frame-consistency
in both pixel prompts and their SAM-predicted masks. Moreover, we suggest
filtering out low-quality 3D prompts based on feedback from all 2D frames, for
enhancing segmentation quality. We also propose to consolidate different 3D
prompts if they are segmenting the same object, bringing a more comprehensive
segmentation. Notably, our method does not require any additional training on
domain-specific data, enabling us to preserve the zero-shot power of SAM.
Extensive qualitative and quantitative results show that our method
consistently achieves higher quality and more diverse segmentation than
previous zero-shot or fully supervised approaches, and in many cases even
surpasses human-level annotations. The project page can be accessed at
https://mutianxu.github.io/sampro3d/.
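The core geometric step, turning one 3D prompt point into frame-aligned pixel prompts, is just pinhole projection with each frame's pose. A minimal sketch, assuming a 3x3 intrinsic matrix `K` and a 4x4 world-to-camera transform per frame (variable names are ours, not the paper's):

```python
import numpy as np

def project_prompt(X_world, K, T_world_to_cam):
    # Project one 3D prompt point into a posed frame: x = K [R|t] X.
    X_h = np.append(X_world, 1.0)            # homogeneous coordinates
    X_cam = (T_world_to_cam @ X_h)[:3]       # world -> camera frame
    uvw = K @ X_cam
    return uvw[:2] / uvw[2]                  # pixel prompt (u, v)
```

Running this for every frame yields the consistent per-frame pixel prompts that SAM is queried with; low-quality 3D prompts can then be filtered by aggregating SAM's mask quality across frames.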
Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks
Despite the rapid advancement of unsupervised learning in visual
representation, it requires training on large-scale datasets that demand costly
data collection, and pose additional challenges due to concerns regarding data
privacy. Recently, synthetic images generated by text-to-image diffusion
models have shown great potential for benefiting image recognition. Although
promising, there has been inadequate exploration dedicated to unsupervised
learning on diffusion-generated images. To address this, we start by uncovering
that diffusion models' cross-attention layers inherently provide
annotation-free attention masks aligned with corresponding text inputs on
generated images. We then investigate the problems of three prevalent
unsupervised learning techniques (i.e., contrastive learning, masked modeling,
and vision-language pretraining) and introduce customized solutions by fully
exploiting the aforementioned free attention masks. Our approach is validated
through extensive experiments that show consistent improvements in baseline
models across various downstream tasks, including image classification,
detection, segmentation, and image-text retrieval. Our method helps close the
performance gap between unsupervised pretraining on synthetic data and on
real-world data.
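A common way such free attention masks are exploited is mask-weighted pooling: collapse a spatial feature map into one object-level embedding under the mask, which can then serve, e.g., as one side of a contrastive pair. The shapes and pooling scheme below are our assumptions for illustration, not Free-ATM's exact formulation.

```python
import numpy as np

def mask_pool(feat_map, attn_mask):
    # feat_map: (H, W, C) features; attn_mask: (H, W) non-negative weights
    # from a cross-attention layer. Returns a (C,) object-level embedding.
    w = attn_mask / (attn_mask.sum() + 1e-8)         # normalize to sum ~1
    return (feat_map * w[..., None]).sum(axis=(0, 1))
```

Pooling the same text token's mask across two augmented views of a generated image gives annotation-free positive pairs.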
Unveiling the comprehensive resources and environmental efficiency and its influencing factors: Within and across the five urban agglomerations in Northwest China
Promoting the comprehensive resources and environmental efficiency (CREE) in urban agglomerations (UAs) is of great practical significance for China’s sustainable development. However, CREE in UAs of underdeveloped regions has not received enough attention. Against this background, we constructed a systematic and coherent framework to study CREE, taking the five UAs of Northwest China as a case study. The super epsilon-based measure (EBM) model was applied to quantify CREE during 2000–2017. Subsequently, we analyzed the spatio-temporal patterns in detail. Through the Super-EBM and GTWR (geographically and temporally weighted regression) models, the endogenous components and exogenous determinants of CREE were examined. The results indicated that the CREE in the five UAs of Northwest China underwent a slight decrease as a whole and showed an intensified spatial divergence. It exhibited an obvious discontinuity and path bifurcation while being negatively correlated with spatial imbalance across the UAs. The CREE of different UAs showed various spatial distribution characteristics. Regarding the endogenous mechanism, the UAs shared certain commonalities while retaining individual characteristics. The exogenous mechanism manifested certain spatial heterogeneity across UAs while being generally consistent within each single UA. These results could provide insightful recommendations for resources and environmental governance in the study area and other similar regions.
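GTWR estimates a local regression at each observation, with other observations down-weighted by their spatio-temporal distance. A Gaussian kernel is one common choice; the specific kernel form, bandwidths, and scale factor below are assumptions for illustration, not necessarily those used in the study.

```python
import numpy as np

def gtwr_weight(ds, dt, hs, ht, lam=1.0):
    # Gaussian spatio-temporal kernel: observations close in both space
    # (distance ds, bandwidth hs) and time (distance dt, bandwidth ht)
    # receive larger weights in the local regression; lam balances the two.
    return np.exp(-(ds**2 / hs**2 + lam * dt**2 / ht**2))
```

Each local coefficient vector is then a weighted least-squares fit using these weights, which is how the exogenous determinants can vary over both space and time.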
China’s Land Uses in the Multi-Region Input–Output Framework
The finite resource of land is subject to competing pressures from food demand, urbanization, and ecosystem service provision. Linking land resource use to the whole production chain and the final consumption of various products and services offers a new perspective for understanding and managing land uses. This study conducted a systematic analysis of land uses at the provincial level in China using the multi-region input–output model for 2012. Land use patterns related to sectoral production and consumption in different provinces were examined. The results indicated that land use transfers between different provinces in China have formed a highly interacting network. Products and services involved in inter-provincial trade in China embodied 2.3 million km² of land uses, which constituted approximately 40% of the total national land uses finally consumed in China. Agriculture was the most direct-land-use-intensive sector, and industry was the most indirect-land-use-intensive sector. Land-scarce provinces with low per capita land availability have outsourced part of their land uses by net importing land from other provinces. The results have important policy implications for sustainable land uses in China.
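The standard input–output accounting behind such embodied-land results is the Leontief model: land embodied in final demand is f = diag(e)(I − A)⁻¹y, where A is the technical-coefficient matrix, e the direct land use per unit output, and y final demand. A minimal two-sector sketch (toy numbers, not the study's data):

```python
import numpy as np

def land_footprint(A, land_per_output, final_demand):
    # Land embodied in final demand via the Leontief inverse:
    # f = diag(e) (I - A)^{-1} y,
    # where e = direct land use per unit of sectoral output.
    L = np.linalg.inv(np.eye(len(A)) - A)   # Leontief inverse (I - A)^-1
    return np.diag(land_per_output) @ L @ final_demand
```

With A = 0 the footprint reduces to direct land use e·y; any positive intermediate inputs make the embodied footprint strictly larger, which is the indirect land use the abstract refers to.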
TO-Scene: A Large-scale Dataset for Understanding 3D Tabletop Scenes
Many basic indoor activities such as eating or writing are conducted
upon different tabletops (e.g., coffee tables, writing desks). Understanding
tabletop scenes is thus indispensable for 3D indoor scene parsing
applications. Unfortunately, it is hard to meet this demand by directly
deploying data-driven algorithms, since 3D tabletop scenes are rarely available
in current datasets. To remedy this defect, we introduce TO-Scene, a
large-scale dataset focusing on tabletop scenes, which contains 20,740 scenes
with three variants. To acquire the data, we design an efficient and scalable
framework, where a crowdsourcing UI is developed to transfer CAD objects from
ModelNet and ShapeNet onto tables from ScanNet, then the output tabletop scenes
are simulated into real scans and annotated automatically.
Further, a tabletop-aware learning strategy is proposed for better perceiving
the small-sized tabletop instances. Notably, we also provide a real scanned
test set TO-Real to verify the practical value of TO-Scene. Experiments show
that the algorithms trained on TO-Scene indeed work on the realistic test data,
and our proposed tabletop-aware learning strategy greatly improves the
state-of-the-art results on both 3D semantic segmentation and object detection
tasks. Dataset and code are available at
https://github.com/GAP-LAB-CUHK-SZ/TO-Scene.
Comment: ECCV 2022 (Oral Presentation)
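The object-transfer step of such a pipeline boils down to sampling a valid pose on a detected table surface. A toy version, assuming an axis-aligned tabletop bounding box and an object footprint half-size (the real crowdsourcing UI and physics checks are far richer):

```python
import numpy as np

def place_on_table(table_bbox, obj_halfsize, rng):
    # table_bbox: (xmin, ymin, xmax, ymax, top_z) of the tabletop surface.
    # Sample a position that keeps the object's footprint fully on the table.
    xmin, ymin, xmax, ymax, top_z = table_bbox
    hx, hy = obj_halfsize
    x = rng.uniform(xmin + hx, xmax - hx)
    y = rng.uniform(ymin + hy, ymax - hy)
    return np.array([x, y, top_z])  # object base rests on the table surface
```

Repeating this for many CAD objects, then simulating scan noise, yields labeled tabletop scenes for free since object identities are known at placement time.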
Learning Geometry-Disentangled Representation for Complementary Understanding of 3D Object Point Cloud
In 2D image processing, some attempts decompose images into high and low
frequency components for describing edge and smooth parts respectively.
Similarly, the contour and flat area of 3D objects, such as the boundary and
seat area of a chair, describe different but also complementary geometries.
However, such investigation is missing from previous deep networks, which
understand point clouds by treating all points or local patches equally. To solve
this problem, we propose Geometry-Disentangled Attention Network (GDANet).
GDANet introduces Geometry-Disentangle Module to dynamically disentangle point
clouds into the contour and flat part of 3D objects, respectively denoted by
sharp and gentle variation components. Then GDANet exploits Sharp-Gentle
Complementary Attention Module that regards the features from sharp and gentle
variation components as two holistic representations, and pays different
attentions to them while fusing them respectively with original point cloud
features. In this way, our method captures and refines the holistic and
complementary 3D geometric semantics from two distinct disentangled components
to supplement the local information. Extensive experiments on 3D object
classification and segmentation benchmarks demonstrate that GDANet achieves the
state-of-the-arts with fewer parameters. Code is released on
https://github.com/mutianxu/GDANet.
Comment: Accepted by AAAI 2021
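The sharp/gentle disentanglement can be approximated with a high-frequency response per point: the residual between a point and the mean of its neighborhood. Points with the largest response form the sharp (contour) component, the smallest the gentle (flat) component. The brute-force neighbor search and this particular statistic are our simplifications of the Geometry-Disentangle Module.

```python
import numpy as np

def disentangle(points, m, k=8):
    # High-frequency response: residual of each point from the mean of its
    # k nearest neighbors. Large response -> sharp variation (contours),
    # small response -> gentle variation (flat areas).
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]          # skip self
    resp = np.linalg.norm(points - points[nn].mean(axis=1), axis=-1)
    order = np.argsort(resp)
    return points[order[-m:]], points[order[:m]]    # (sharp, gentle)
```

The two components are then processed as holistic representations and fused back with the original point features via attention in the full network.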
HybridCap: Inertia-Aid Monocular Capture of Challenging Human Motions
Monocular 3D motion capture (mocap) is beneficial to many applications. The use of a single camera, however, often fails to handle occlusions of different body parts, and hence is limited to capturing relatively simple movements. We present a lightweight, hybrid mocap technique called HybridCap that augments the camera with only 4 Inertial Measurement Units (IMUs) in a novel learning-and-optimization framework. We first employ a weakly-supervised and hierarchical motion inference module based on cooperative pure residual recurrent blocks that serve as limb, body, and root trackers as well as an inverse kinematics solver. Our network effectively narrows the search space of plausible motions via coarse-to-fine pose estimation and manages to tackle challenging movements with high efficiency. We further develop a hybrid optimization scheme that combines inertial feedback and visual cues to improve tracking accuracy. Extensive experiments on various datasets demonstrate that HybridCap can robustly handle challenging movements ranging from fitness actions to Latin dance. It also achieves real-time performance of up to 60 fps with state-of-the-art accuracy.
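The spirit of a hybrid optimization that mixes inertial and visual cues can be shown on a one-dimensional toy pose: minimize a weighted sum of a visual residual and an inertial residual. Everything here (scalar pose, quadratic residuals, weights) is a deliberate simplification, not HybridCap's actual solver.

```python
import numpy as np

def hybrid_update(pose, visual_obs, imu_obs, w_vis=1.0, w_imu=0.5):
    # One step on E(p) = w_vis (p - visual_obs)^2 + w_imu (p - imu_obs)^2.
    grad = 2 * w_vis * (pose - visual_obs) + 2 * w_imu * (pose - imu_obs)
    lr = 1.0 / (2 * (w_vis + w_imu))   # for this quadratic, one exact step
    return pose - lr * grad            # lands on the weighted mean of cues
```

When the camera cue degrades (e.g., occlusion), raising `w_imu` lets the inertial feedback dominate, which is the qualitative behavior a hybrid scheme exploits.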