SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data
Data mixing augmentation has proved effective in training deep models. Recent
methods mix labels mainly based on the mixture proportion of image pixels. As
the main discriminative information of a fine-grained image usually resides in
subtle regions, methods along this line are prone to heavy label noise in
fine-grained recognition. In this paper, we propose a novel scheme, termed
Semantically Proportional Mixing (SnapMix), which exploits the class activation
map (CAM) to lessen label noise when augmenting fine-grained data. SnapMix
generates the target label for a mixed image by estimating its intrinsic
semantic composition; it allows asymmetric mixing operations while ensuring
semantic correspondence between synthetic images and target labels. Experiments
show that our method consistently outperforms existing mixing-based approaches
on various datasets and under different network depths. Furthermore, by
incorporating the mid-level features, the proposed SnapMix achieves top-level
performance, demonstrating its potential to serve as a solid baseline for
fine-grained recognition. Our code is available at
https://github.com/Shaoli-Huang/SnapMix.git. Comment: Accepted by AAAI 2021.
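As a rough illustration of the semantic-composition idea above, the sketch below weights the two labels by the CAM mass kept from one image and pasted from the other; the helper name and the box convention are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def snapmix_label_weights(cam_a, cam_b, box_a, box_b):
    """Estimate mixed-label weights from normalized CAMs (H x W arrays summing to 1).

    box_* = (y1, y2, x1, x2): box_a is the region removed from image A,
    box_b is the region cut from image B and pasted into A.  Hypothetical
    helper illustrating the idea, not the authors' exact implementation.
    """
    ya1, ya2, xa1, xa2 = box_a
    yb1, yb2, xb1, xb2 = box_b
    # Semantic content of A that survives after its region is cut out.
    rho_a = 1.0 - cam_a[ya1:ya2, xa1:xa2].sum()
    # Semantic content of B carried into the mixed image by the pasted patch.
    rho_b = cam_b[yb1:yb2, xb1:xb2].sum()
    # The two weights need not sum to 1, which is what permits asymmetric mixing.
    return float(rho_a), float(rho_b)
```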
Deep representation learning for keypoint localization
University of Technology Sydney, Faculty of Engineering and Information Technology.
Keypoint localization aims to locate points of interest in an input image. This technique has become an important tool for many computer vision tasks such as fine-grained visual categorization, object detection, and pose estimation. Tremendous effort has therefore been devoted to improving the performance of keypoint localization. However, most of the proposed methods supervise keypoint detectors using a confidence map generated from ground-truth keypoint locations. Furthermore, the maximum achievable localization accuracy differs from keypoint to keypoint, because it is determined by the underlying keypoint structure. Thus, the keypoint detector often fails to detect ambiguous keypoints if trained with strict supervision, that is, supervision permitting only a small localization error. Training with looser supervision could help detect the ambiguous keypoints, but this comes at a cost to localization accuracy for keypoints with distinctive appearances. In this thesis, we propose hierarchically supervised nets (HSNs), a method that imposes hierarchical supervision within deep convolutional neural networks (CNNs) for keypoint localization. To achieve this, we first propose a fully convolutional Inception network with several branches of varying depths to obtain hierarchical feature representations. Then, we build a coarse part detector on top of each branch of features and a fine part detector that takes features from all the branches as input.
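The hierarchical supervision above can be pictured as giving each branch a ground-truth confidence map of different strictness. A minimal sketch, assuming Gaussian confidence maps; the map size and per-branch widths are illustrative values, not the thesis' settings.

```python
import numpy as np

def keypoint_heatmap(h, w, cy, cx, sigma):
    """Gaussian confidence map centered on a ground-truth keypoint at (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

# Looser supervision (larger sigma) for the coarse detectors on shallow branches,
# stricter supervision (smaller sigma) for the fine detector -- illustrative values.
coarse_target = keypoint_heatmap(64, 64, cy=20, cx=30, sigma=6.0)
fine_target = keypoint_heatmap(64, 64, cy=20, cx=30, sigma=2.0)
```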
Collecting image data with keypoint annotations is harder than with image-level labels. One may collect images from Flickr or Google Images by searching keywords and then apply refinement processes to build a classification dataset, whereas keypoint annotation requires a human to click the rough location of each keypoint in every image. To address the problem of insufficient part annotations, we propose a part detection framework that combines deep representation learning and domain adaptation within the same training process. We adopt one of the coarse detectors from HSNs as the baseline and perform a quantitative evaluation on the CUB-200-2011 and BirdSnap datasets. Interestingly, our method, trained on images of only 10 species, achieves 61.4% PCK accuracy on the test set of 190 unseen species.
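PCK, the accuracy reported above, counts a predicted keypoint as correct when it lies within a fraction α of a per-image reference length (commonly the longer side of the object bounding box). A minimal sketch; α = 0.1 and the reference length are the usual conventions, not necessarily the thesis' exact protocol.

```python
import numpy as np

def pck(pred, gt, visible, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints.

    pred, gt: (N, K, 2) arrays of predicted / ground-truth (x, y) coordinates.
    visible:  (N, K) boolean mask of annotated keypoints.
    ref_size: (N,) per-image reference length, e.g. max(bbox width, height).
    """
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) pixel distances
    correct = dist <= alpha * ref_size[:, None]      # (N, K) hit/miss
    return correct[visible].mean()
```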
Finally, we explore the application of keypoint localization to the task of fine-grained visual categorization. We propose a new part-based model that consists of a localization module to detect object parts (the "where" pathway) and a classification module to classify fine-grained categories at the subordinate level (the "what" pathway). Experimental results reveal that our method with keypoint localization achieves state-of-the-art performance on the Caltech-UCSD Birds-200-2011 dataset.
SignAvatars: A Large-scale 3D Sign Language Holistic Motion Dataset and Benchmark
In this paper, we present SignAvatars, the first large-scale multi-prompt 3D
sign language (SL) motion dataset designed to bridge the communication gap for
hearing-impaired individuals. While research on digital communication has grown
exponentially, the majority of existing communication technologies primarily
cater to spoken or written languages rather than SL, the essential
communication method for hearing-impaired communities. Existing SL datasets,
dictionaries, and sign language production (SLP) methods are typically limited
to 2D, because annotating 3D models and avatars for SL is usually an entirely
manual, labor-intensive process conducted by SL experts, often resulting in
unnatural avatars. In response to
these challenges, we compile and curate the SignAvatars dataset, which
comprises 70,000 videos from 153 signers, totaling 8.34 million frames,
covering both isolated signs and continuous, co-articulated signs, with
multiple prompts including HamNoSys, spoken language, and words. To yield 3D
holistic annotations, including meshes and biomechanically-valid poses of body,
hands, and face, as well as 2D and 3D keypoints, we introduce an automated
annotation pipeline operating on our large corpus of SL videos. SignAvatars
facilitates various tasks such as 3D sign language recognition (SLR) and the
novel 3D SL production (SLP) from diverse inputs like text scripts, individual
words, and HamNoSys notation. Hence, to evaluate the potential of SignAvatars,
we further propose a unified benchmark of 3D SL holistic motion production. We
believe that this work is a significant step towards bringing the digital world
to hearing-impaired communities. Our project page is at
https://signavatars.github.io/. Comment: 9 pages.
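The automated annotation pipeline is only described at a high level above. A common ingredient of such pipelines is fitting a parametric body model to detected 2D keypoints by minimizing reprojection error; the sketch below illustrates that generic step under assumed `body_model` and `project` helpers, and is not the authors' pipeline.

```python
import torch

def fit_pose_to_2d(body_model, project, kp2d, conf, iters=200, lr=0.01):
    """Fit body-model pose parameters so projected 3D joints match 2D detections.

    Assumptions (hypothetical stand-ins, not SignAvatars' actual API):
      body_model(pose) -> (J, 3) joint positions for a pose vector
      project(joints3d) -> (J, 2) pixel coordinates under a known camera
      kp2d: (J, 2) detected keypoints, conf: (J,) detector confidences
    """
    pose = torch.zeros(body_model.num_pose_params, requires_grad=True)
    optimizer = torch.optim.Adam([pose], lr=lr)
    for _ in range(iters):
        optimizer.zero_grad()
        joints2d = project(body_model(pose))
        # Confidence-weighted reprojection error.
        loss = (conf[:, None] * (joints2d - kp2d) ** 2).sum()
        loss.backward()
        optimizer.step()
    return pose.detach()
```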
Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection
Human-Object Interaction (HOI) detection is a core task for high-level image
understanding. Recently, Detection Transformer (DETR)-based HOI detectors have
become popular due to their superior performance and efficient structure.
However, these approaches typically adopt fixed HOI queries for all test
images, making them vulnerable to changes in object locations within a given
image. Accordingly, in this paper, we propose to enhance DETR's robustness by
mining hard-positive queries, which are forced to make correct predictions
using partial visual cues. First, we explicitly compose hard-positive queries
according to the ground-truth (GT) position of labeled human-object pairs for
each training image. Specifically, we shift the GT bounding boxes of each
labeled human-object pair so that the shifted boxes cover only a certain
portion of the GT ones. We encode the coordinates of the shifted boxes for each
labeled human-object pair into an HOI query. Second, we implicitly construct
another set of hard-positive queries by masking the top scores in
cross-attention maps of the decoder layers. The masked attention maps then
cover only part of the important cues for HOI predictions. Finally, an
alternating strategy is proposed that efficiently combines both types of
hard-positive queries. In
each iteration, both DETR's learnable queries and one selected type of
hard-positive queries are adopted for loss computation. Experimental results
show that our proposed approach can be widely applied to existing DETR-based
HOI detectors. Moreover, we consistently achieve state-of-the-art performance
on three benchmarks: HICO-DET, V-COCO, and HOI-A. Code is available at
https://github.com/MuchHair/HQM. Comment: Accepted by ECCV 2022.
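As a rough sketch of the box-shifting step described above, the snippet below jitters and shrinks a ground-truth box so it covers only part of the original, then concatenates the shifted human and object coordinates as a query; the jitter ranges and the plain concatenation are illustrative assumptions, not the released code.

```python
import random

def shift_box(box, max_shift=0.3, min_scale=0.7):
    """Shift and shrink a GT box (cx, cy, w, h, normalized) so it covers only part of it."""
    cx, cy, w, h = box
    cx += random.uniform(-max_shift, max_shift) * w
    cy += random.uniform(-max_shift, max_shift) * h
    w *= random.uniform(min_scale, 1.0)
    h *= random.uniform(min_scale, 1.0)
    return (cx, cy, w, h)

def hard_positive_query(human_box, object_box):
    # A real model would map these coordinates through a positional encoding
    # or an MLP to obtain the query embedding; here we just concatenate them.
    return shift_box(human_box) + shift_box(object_box)
```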
HuTuMotion: Human-Tuned Navigation of Latent Motion Diffusion Models with Minimal Feedback
We introduce HuTuMotion, an innovative approach for generating natural human
motions that navigates latent motion diffusion models by leveraging few-shot
human feedback. Unlike existing approaches that sample latent variables from a
standard normal prior distribution, our method adapts the prior distribution to
better suit the characteristics of the data, as indicated by human feedback,
thus enhancing the quality of motion generation. Furthermore, our findings
reveal that utilizing few-shot feedback can yield performance levels on par
with those attained through extensive human feedback. This discovery emphasizes
the potential and efficiency of incorporating few-shot human-guided
optimization within latent diffusion models for personalized and style-aware
human motion generation applications. Experimental results show that our method
significantly outperforms existing state-of-the-art approaches. Comment:
Accepted by AAAI 2024 Main Track.
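One simple way to picture the feedback-tuned prior described above: keep the handful of latents users preferred and sample later generations from a Gaussian fitted to them instead of N(0, I). This is an illustrative sketch of the idea, not the paper's optimization procedure.

```python
import numpy as np

def fit_prior_from_feedback(preferred_latents):
    """Fit a diagonal Gaussian to the few latents that users preferred."""
    z = np.stack(preferred_latents)          # (n, d)
    return z.mean(axis=0), z.std(axis=0) + 1e-6

def sample_adapted_prior(mu, sigma, n_samples, rng=None):
    # Draw z ~ N(mu, diag(sigma^2)) rather than the standard normal prior.
    rng = rng or np.random.default_rng()
    return rng.normal(mu, sigma, size=(n_samples, mu.shape[0]))
```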
SemanticBoost: Elevating Motion Generation with Augmented Textual Cues
Current techniques face difficulties in generating motions from intricate
semantic descriptions, primarily due to insufficient semantic annotations in
datasets and weak contextual understanding. To address these issues, we present
SemanticBoost, a novel framework that tackles both challenges simultaneously.
Our framework comprises a Semantic Enhancement module and a Context-Attuned
Motion Denoiser (CAMD). The Semantic Enhancement module extracts supplementary
semantics from motion data, enriching the dataset's textual description and
ensuring precise alignment between text and motion data without depending on
large language models. On the other hand, the CAMD approach provides an
all-encompassing solution for generating high-quality, semantically consistent
motion sequences by effectively capturing context information and aligning the
generated motion with the given textual descriptions. Distinct from existing
methods, our approach can synthesize accurate orientational movements, combined
motions based on specific body part descriptions, and motions generated from
complex, extended sentences. Our experimental results demonstrate that
SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based
techniques, achieving cutting-edge performance on the HumanML3D dataset while
maintaining realistic and smooth motion generation quality.
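The abstract does not spell out the denoiser's sampling rule; a common pattern for keeping generated motion aligned with text in diffusion models is classifier-free guidance, sketched below with a hypothetical `denoiser` callable. This is a generic illustration, not SemanticBoost's exact sampler.

```python
import torch

def guided_noise_prediction(denoiser, x_t, t, text_emb, guidance_scale=2.5):
    """One classifier-free-guidance noise prediction for a text-conditioned denoiser.

    Assumes denoiser(x_t, t, cond) -> predicted noise, with cond=None meaning
    the unconditional branch.  Hypothetical interface, not the paper's API.
    """
    eps_uncond = denoiser(x_t, t, None)
    eps_text = denoiser(x_t, t, text_emb)
    # Push the prediction toward the text-conditioned direction.
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)
```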
TapMo: Shape-aware Motion Generation of Skeleton-free Characters
Previous motion generation methods are limited to pre-rigged 3D human models,
hindering their application to animating various non-rigged
characters. In this work, we present TapMo, a Text-driven Animation Pipeline
for synthesizing Motion in a broad spectrum of skeleton-free 3D characters. The
pivotal innovation in TapMo is its use of shape deformation-aware features as a
condition to guide the diffusion model, thereby enabling the generation of
mesh-specific motions for various characters. Specifically, TapMo comprises two
main components: a Mesh Handle Predictor and a Shape-aware Motion Diffusion
module. The Mesh Handle Predictor predicts skinning weights and clusters mesh
vertices into adaptive handles for deformation control, eliminating the need
for traditional skeletal rigging. The Shape-aware Motion Diffusion module synthesizes motion
with mesh-specific adaptations. This module employs text-guided motions and
mesh features extracted during the first stage, preserving the geometric
integrity of the animations by accounting for the character's shape and
deformation. Trained in a weakly-supervised manner, TapMo can accommodate a
multitude of non-human meshes, both with and without associated text motions.
We demonstrate the effectiveness and generalizability of TapMo through rigorous
qualitative and quantitative experiments. Our results reveal that TapMo
consistently outperforms existing auto-animation methods, delivering
superior-quality animations for both seen and unseen heterogeneous 3D
characters.
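The handle mechanism described above can be pictured as linear blend skinning with predicted per-vertex handle weights: each vertex moves by a weighted combination of its handles' rigid transforms. A minimal sketch under that assumption; the weight matrix and per-handle transforms are placeholders for the predictor's outputs.

```python
import numpy as np

def deform_with_handles(vertices, weights, rotations, translations):
    """Deform a mesh by blending per-handle rigid transforms.

    vertices:     (V, 3) rest-pose positions.
    weights:      (V, H) predicted skinning weights (each row sums to 1).
    rotations:    (H, 3, 3) per-handle rotations for the current frame.
    translations: (H, 3) per-handle translations for the current frame.
    """
    # Apply every handle transform to every vertex: result shape (H, V, 3).
    transformed = np.einsum('hij,vj->hvi', rotations, vertices) + translations[:, None, :]
    # Blend the per-handle results with the predicted weights: shape (V, 3).
    return np.einsum('vh,hvi->vi', weights, transformed)
```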
- …