57 research outputs found
Body-Part Joint Detection and Association via Extended Object Representation
The detection of the human body and its related parts (e.g., face, head or hands)
has been intensively studied and greatly improved since the breakthrough of
deep CNNs. However, most of these detectors are trained independently, making
it a challenging task to associate detected body parts with people. This paper
focuses on the problem of joint detection of human body and its corresponding
parts. Specifically, we propose a novel extended object representation that
integrates the center location offsets of body or its parts, and construct a
dense single-stage anchor-based Body-Part Joint Detector (BPJDet). Body-part
associations in BPJDet are embedded into the unified representation which
contains both the semantic and geometric information. Therefore, BPJDet does
not suffer from error-prone association post-matching, and has a better
accuracy-speed trade-off. Furthermore, BPJDet can be seamlessly generalized to
jointly detect any body part. To verify the effectiveness and superiority of
our method, we conduct extensive experiments on the CityPersons, CrowdHuman and
BodyHands datasets. The proposed BPJDet detector achieves state-of-the-art
association performance on these three benchmarks while maintaining high detection
accuracy. Code is available at https://github.com/hnuzhy/BPJDet. Comment: accepted by ICME202
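To make the idea of an extended object representation concrete, the following minimal sketch attaches a center offset to each body box and uses it to look up the matching part box. It is an illustration under stated assumptions, not the decoding used in the BPJDet repository: the names `Box`, `ExtendedBody`, `part_offset`, `associate_parts` and the `max_dist` threshold are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Point:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


@dataclass
class ExtendedBody:
    """A body box extended with a predicted offset to its part center (hypothetical layout)."""
    box: Box
    part_offset: Point  # assumed (dx, dy) from the body center to the part center

    def predicted_part_center(self) -> Point:
        cx, cy = self.box.center()
        return (cx + self.part_offset[0], cy + self.part_offset[1])


def associate_parts(bodies: List[ExtendedBody], parts: List[Box],
                    max_dist: float) -> List[Optional[int]]:
    """For each body, return the index of the unused part box whose center is
    closest to the body's predicted part center, or None if none is within max_dist."""
    assignment: List[Optional[int]] = []
    used = set()
    for body in bodies:
        px, py = body.predicted_part_center()
        best, best_dist = None, max_dist
        for j, part in enumerate(parts):
            if j in used:
                continue
            qx, qy = part.center()
            dist = math.hypot(px - qx, py - qy)
            if dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            used.add(best)
        assignment.append(best)
    return assignment
```

Because the offset is predicted together with the body box, the association step reduces to a lookup rather than a separate learned matching stage, which is the property the abstract emphasizes.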
SSDA-YOLO: Semi-supervised Domain Adaptive YOLO for Cross-Domain Object Detection
Domain adaptive object detection (DAOD) aims to alleviate transfer
performance degradation caused by the cross-domain discrepancy. However, most
existing DAOD methods are dominated by outdated and computationally intensive
two-stage Faster R-CNN, which is not the first choice for industrial
applications. In this paper, we propose a novel semi-supervised domain adaptive
YOLO (SSDA-YOLO) based method to improve cross-domain detection performance by
integrating the compact one-stage stronger detector YOLOv5 with domain
adaptation. Specifically, we adapt the knowledge distillation framework with
the Mean Teacher model to assist the student model in obtaining instance-level
features of the unlabeled target domain. We also use scene style
transfer to cross-generate pseudo images in different domains to remedy
image-level differences. In addition, an intuitive consistency loss is proposed
to further align cross-domain predictions. We evaluate SSDA-YOLO on public
benchmarks including PascalVOC, Clipart1k, Cityscapes, and Foggy Cityscapes.
Moreover, to verify its generalization, we conduct experiments on yawning
detection datasets collected from various real classrooms. The results show
considerable improvements of our method in these DAOD tasks, which reveals both
the effectiveness of the proposed adaptive modules and the urgency of applying more
advanced detectors in DAOD. Our code is available at
https://github.com/hnuzhy/SSDA-YOLO. Comment: submitted to CVI
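In the Mean Teacher scheme mentioned above, the teacher's weights are usually kept as an exponential moving average (EMA) of the student's weights. The sketch below shows that update on plain dictionaries of floats. The function name `ema_update` and the decay value are assumptions for illustration, not taken from the SSDA-YOLO code.

```python
from typing import Dict


def ema_update(teacher: Dict[str, float], student: Dict[str, float],
               decay: float = 0.99) -> None:
    """Update teacher parameters in place as an EMA of the student parameters.

    teacher[name] <- decay * teacher[name] + (1 - decay) * student[name]
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
```

A decay close to 1 makes the teacher change slowly, so its predictions on unlabeled target-domain images are smoother than the student's.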
DirectMHP: Direct 2D Multi-Person Head Pose Estimation with Full-range Angles
Existing head pose estimation (HPE) mainly focuses on a single person with
pre-detected frontal heads, which limits its application in real complex
scenarios with multiple persons. We argue that these single HPE methods are
fragile and inefficient for Multi-Person Head Pose Estimation (MPHPE) since
they rely on the separately trained face detector that cannot generalize well
to full viewpoints, especially for heads with invisible face areas. In this
paper, we focus on the full-range MPHPE problem, and propose a direct
end-to-end simple baseline named DirectMHP. Due to the lack of datasets
applicable to full-range MPHPE, we first construct two benchmarks by
extracting ground-truth labels for head detection and head orientation from
public datasets AGORA and CMU Panoptic. They are rather challenging because they
contain many truncated, occluded, tiny and unevenly illuminated human heads. Then, we
design a novel end-to-end trainable one-stage network architecture that jointly
regresses the locations and orientations of multiple heads to address the MPHPE
problem. Specifically, we regard pose as an auxiliary attribute of the head,
and append it after the traditional object prediction. Any pose
representation, such as Euler angles, is supported by this flexible design.
Then, we jointly optimize these two tasks by sharing features and utilizing
appropriate multiple losses. In this way, our method can implicitly benefit
from more surroundings to improve HPE accuracy while maintaining head detection
performance. We present comprehensive comparisons with state-of-the-art single
HPE methods on public benchmarks, as well as superior baseline results on our
constructed MPHPE datasets. Datasets and code are released at
https://github.com/hnuzhy/DirectMHP. Comment: 13 page
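As a rough illustration of treating pose as an auxiliary attribute appended after the object prediction, the sketch below splits one flat prediction vector into box, objectness, class scores and three Euler angles. The vector layout, the assumption that angles arrive normalized to [0, 1], and the angle ranges (pitch and roll in [-90, 90], yaw in [-180, 180]) are choices made for this example, not details confirmed by the abstract.

```python
from typing import Dict, List, Sequence, Tuple


def split_prediction(pred: Sequence[float], num_classes: int = 1
                     ) -> Tuple[Tuple[float, ...], float, List[float], Dict[str, float]]:
    """Split [x, y, w, h, obj, cls..., pitch, yaw, roll] into its parts.

    The last three values are assumed to be normalized to [0, 1] and are
    mapped back to degrees.
    """
    expected = 4 + 1 + num_classes + 3
    if len(pred) != expected:
        raise ValueError(f"expected {expected} values, got {len(pred)}")
    box = tuple(pred[0:4])
    objectness = pred[4]
    class_scores = list(pred[5:5 + num_classes])
    pitch_n, yaw_n, roll_n = pred[5 + num_classes:]
    pose = {
        "pitch": (pitch_n - 0.5) * 180.0,  # assumed range [-90, 90]
        "yaw": (yaw_n - 0.5) * 360.0,      # assumed range [-180, 180]
        "roll": (roll_n - 0.5) * 180.0,    # assumed range [-90, 90]
    }
    return box, objectness, class_scores, pose
```

Swapping Euler angles for another pose representation would only change the length and decoding of the trailing slice, which is one way to read the flexibility claimed in the abstract.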
Joint Multi-Person Body Detection and Orientation Estimation via One Unified Embedding
Human body orientation estimation (HBOE) is widely used in various
applications, including robotics, surveillance, pedestrian analysis and
autonomous driving. Although many approaches have addressed the HBOE
problem, from specific under-controlled scenes to challenging in-the-wild
environments, they assume human instances are already detected and take a well
cropped sub-image as the input. This setting is less efficient and prone to
errors in real applications, such as crowds of people. In this paper, we propose
a single-stage end-to-end trainable framework for tackling the HBOE problem
with multi-persons. By integrating the prediction of bounding boxes and
direction angles in one embedding, our method can jointly estimate the location
and orientation of all bodies in one image directly. Our key idea is to
integrate the HBOE task into the multi-scale anchor channel predictions of
persons for concurrently benefiting from engaged intermediate features.
Therefore, our approach can naturally adapt to difficult instances involving
low resolution and occlusion as in object detection. We validate the
efficiency and effectiveness of our method on the recently presented benchmark
MEBOW with extensive experiments. In addition, we complete the ambiguous instances
ignored by the MEBOW dataset and provide corresponding weak body-orientation
labels, keeping the dataset complete and consistent to support studies
of multi-person scenes. Our work is available at
https://github.com/hnuzhy/JointBDOE
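Body orientation is a circular quantity, so a predicted angle of 350 degrees is only 20 degrees away from a ground truth of 10 degrees. The helper below is a small sketch of that wrap-around error. The abstract does not state the metric used on MEBOW, so the function name and its use as an evaluation measure are assumptions.

```python
def angular_error(pred_deg: float, gt_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees, in [0, 180]."""
    diff = abs(pred_deg - gt_deg) % 360.0
    return min(diff, 360.0 - diff)
```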
StuArt: Individualized Classroom Observation of Students with Automatic Behavior Recognition and Tracking
Each student matters, but it is hard for instructors to observe all the
students during a course and immediately help those who need it.
In this paper, we present StuArt, a novel automatic system designed for
individualized classroom observation, which enables instructors to attend to the
learning status of each student. StuArt can recognize five representative
student behaviors (hand-raising, standing, sleeping, yawning, and smiling) that
are highly related to the engagement and track their variation trends during
the course. To protect the privacy of students, all the variation trends are
indexed by the seat numbers without any personal identification information.
Furthermore, StuArt adopts various user-friendly visualization designs to help
instructors quickly understand the individual and whole learning status.
Experimental results on real classroom videos have demonstrated the superiority
and robustness of the embedded algorithms. We expect our system to promote the
development of large-scale individualized guidance of students. Comment: Novel pedagogical approaches in signal processing for K-12 educatio
BPJDet: Extended Object Representation for Generic Body-Part Joint Detection
Detection of human body and its parts (e.g., head or hands) has been
intensively studied. However, most of these CNN-based detectors are trained
independently, making it difficult to associate detected parts with body. In
this paper, we focus on the joint detection of human body and its corresponding
parts. Specifically, we propose a novel extended object representation
integrating center-offsets of body parts, and construct a dense one-stage
generic Body-Part Joint Detector (BPJDet). In this way, body-part associations
are neatly embedded in a unified object representation containing both semantic
and geometric contents. Therefore, we can perform multi-loss optimizations to
tackle multi-tasks synergistically. BPJDet does not suffer from error-prone
post matching, and keeps a better trade-off between speed and accuracy.
Furthermore, BPJDet can be generalized to detect any one or more body parts. To
verify the superiority of BPJDet, we conduct experiments on three body-part
datasets (CityPersons, CrowdHuman and BodyHands) and one body-parts dataset
COCOHumanParts. While keeping high detection accuracy, BPJDet achieves
state-of-the-art association performance on all datasets compared with its
counterparts. Besides, we show benefits of advanced body-part association
capability by improving performance of two representative downstream
applications: accurate crowd head detection and hand contact estimation. Code
is released at https://github.com/hnuzhy/BPJDet. Comment: 15 pages. arXiv admin note: text overlap with arXiv:2212.0765
LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification
Extreme Multi-label text Classification (XMC) is a task of finding the most
relevant labels from a large label set. Nowadays deep learning-based methods
have shown significant success in XMC. However, the existing methods (e.g.,
AttentionXML and X-Transformer) still suffer from 1) combining several
models to train and predict for one dataset, and 2) sampling negative labels
statically during the process of training label ranking model, which reduces
both the efficiency and accuracy of the model. To address the above problems,
we propose LightXML, which adopts end-to-end training and dynamic negative
label sampling. In LightXML, we use generative cooperative networks to recall
and rank labels, in which the label recalling part generates negative and positive
labels, and the label ranking part distinguishes positive labels from these labels.
Through these networks, negative labels are sampled dynamically while the label
ranking part is trained, with both parts fed the same text representation. Extensive
experiments show that LightXML outperforms state-of-the-art methods on five
extreme multi-label datasets with much smaller model size and lower
computational complexity. In particular, on the Amazon dataset with 670K
labels, LightXML can reduce the model size by up to 72% compared to AttentionXML
- …
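The recall-then-rank idea with dynamic negatives can be sketched as follows. The recalling step is modelled here as scoring label groups and taking the top-scoring ones. That grouping, the function name `sample_candidates` and the parameter `top_k` are assumptions for illustration and are not stated in the abstract.

```python
from typing import List, Sequence, Set, Tuple


def sample_candidates(group_scores: Sequence[float],
                      group_to_labels: Sequence[Sequence[int]],
                      positives: Set[int],
                      top_k: int) -> List[Tuple[int, bool]]:
    """Return (label, is_positive) pairs for the ranking step.

    Candidates are the labels in the top_k highest-scoring groups plus all
    true positives. Any candidate that is not a positive acts as a negative.
    """
    order = sorted(range(len(group_scores)),
                   key=lambda g: group_scores[g], reverse=True)
    candidates = set(positives)
    for group in order[:top_k]:
        candidates.update(group_to_labels[group])
    return [(label, label in positives) for label in sorted(candidates)]
```

Since the group scores come from a model that is still being trained, the set of negatives changes from step to step, which is what makes the sampling dynamic rather than static.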