Learning Independent Instance Maps for Crowd Localization
Accurately locating each head's position in crowd scenes is a crucial task
in crowd analysis. However, traditional density-based methods produce only
coarse predictions, and segmentation/detection-based methods cannot handle
extremely dense scenes or crowds with large scale variations. To this end,
we propose a straightforward end-to-end framework for crowd localization,
named Independent Instance Map segmentation (IIM). Different from
density maps and box regression, instances in IIM do not overlap. By
segmenting crowds into independent connected components, the positions and the
crowd counts (the centers and the number of components, respectively) are
obtained. Furthermore, to improve the segmentation quality for different
density regions, we present a differentiable Binarization Module (BM) to output
structured instance maps. BM brings two advantages to localization models: 1)
it adaptively learns a threshold map for each image to detect instances more
accurately; 2) it allows the model to be trained directly with a loss on
binary predictions and labels. Extensive experiments verify that the proposed
method is effective and outperforms state-of-the-art methods on five popular
crowd datasets.
Notably, IIM improves the F1-measure by 10.4\% on the NWPU-Crowd Localization
task. The source code and pre-trained models will be released at
\url{https://github.com/taohan10200/IIM}
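
A minimal sketch of the mechanism the abstract describes, assuming a PyTorch
score map and SciPy post-processing; the steepness constant k, the 0.5 cutoff,
and the function names are illustrative assumptions, not the authors'
released implementation:

import numpy as np
import torch
from scipy import ndimage

def differentiable_binarize(score_map: torch.Tensor,
                            threshold_map: torch.Tensor,
                            k: float = 50.0) -> torch.Tensor:
    # Soft binarization B = sigmoid(k * (P - T)): near-binary yet
    # differentiable, so a loss on binary-like predictions can train
    # both the score map P and the learned threshold map T.
    return torch.sigmoid(k * (score_map - threshold_map))

def localize(binary_map: np.ndarray):
    # Label independent connected components: the component centers give
    # the head positions, and the component count gives the crowd count.
    labels, num = ndimage.label(binary_map > 0.5)
    centers = ndimage.center_of_mass(binary_map, labels, range(1, num + 1))
    return centers, num

At inference time the soft map is thresholded and passed to localize; during
training the sigmoid keeps gradients flowing to both the score branch and the
threshold branch.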
CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
Contrastive Language-Image Pre-training (CLIP) starts to emerge in many
computer vision tasks and has achieved promising performance. However, it
remains underexplored whether CLIP can be generalized to 3D hand pose
estimation, as bridging text prompts with pose-aware features presents
significant challenges due to the discrete nature of joint positions in 3D
space. In this paper, we make one of the first attempts to propose a novel 3D
hand pose estimator from monocular images, dubbed CLIP-Hand3D, which
successfully bridges the gap between text prompts and the irregular,
fine-grained pose
distribution. In particular, the distribution order of hand joints in various
3D space directions is derived from pose labels, forming corresponding text
prompts that are subsequently encoded into text representations.
Simultaneously, 21 hand joints in the 3D space are retrieved, and their spatial
distribution (in x, y, and z axes) is encoded to form pose-aware features.
Subsequently, we maximize semantic consistency for a pair of pose-text features
following a CLIP-based contrastive learning paradigm. Furthermore, a
coarse-to-fine mesh regressor is designed, which is capable of effectively
querying joint-aware cues from the feature pyramid. Extensive experiments on
several public hand benchmarks show that the proposed model attains a
significantly faster inference speed while achieving state-of-the-art
performance compared to methods using a backbone of similar scale.
Comment: Accepted in Proceedings of the 31st ACM International Conference on
Multimedia (MM '23).
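
The pose-text alignment above follows the standard CLIP contrastive recipe.
A minimal sketch, assuming batched pose-aware and text features of matching
dimension; the temperature value, shapes, and function name are illustrative
assumptions rather than the paper's settings:

import torch
import torch.nn.functional as F

def pose_text_contrastive_loss(pose_feats: torch.Tensor,  # (B, D) pose-aware features
                               text_feats: torch.Tensor,  # (B, D) encoded prompts
                               temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize both modalities so dot products are cosine similarities.
    pose = F.normalize(pose_feats, dim=-1)
    text = F.normalize(text_feats, dim=-1)
    logits = pose @ text.t() / temperature  # (B, B) similarity matrix
    # Matching pose-text pairs lie on the diagonal; treat alignment
    # as a classification problem over the batch.
    targets = torch.arange(pose.size(0), device=pose.device)
    # Symmetric cross-entropy over rows (pose->text) and columns (text->pose).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

Maximizing the diagonal similarities while suppressing the off-diagonal ones
is what the abstract calls maximizing semantic consistency for a pair of
pose-text features.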