165 research outputs found

    Hypergraph Transformer for Skeleton-based Action Recognition

    Full text link
    Skeleton-based action recognition aims to predict human actions given human joint coordinates with skeletal interconnections. To model such off-grid data points and their co-occurrences, Transformer-based formulations would be a natural choice. However, Transformers still lag behind state-of-the-art methods using graph convolutional networks (GCNs). Transformers assume that the input is permutation-invariant and homogeneous (partially alleviated by positional encoding), which ignores an important characteristic of skeleton data, i.e., bone connectivity. Furthermore, each type of body joint has a clear physical meaning in human motion, i.e., motion retains an intrinsic relationship regardless of the joint coordinates, which is not explored in Transformers. In fact, certain re-occurring groups of body joints are often involved in specific actions, such as the subconscious hand movement for keeping balance. Vanilla attention is incapable of describing such underlying relations that are persistent and beyond pair-wise. In this work, we aim to exploit these unique aspects of skeleton data to close the performance gap between Transformers and GCNs. Specifically, we propose a new self-attention (SA) extension, named Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order relations into the model. The K-hop relative positional embeddings are also employed to take bone connectivity into account. We name the resulting model Hyperformer, and it achieves comparable or better performance w.r.t. accuracy and efficiency than state-of-the-art GCN architectures on NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets. On the largest NTU RGB+D 120 dataset, the significantly improved performance reached by our Hyperformer demonstrates the underestimated potential of Transformer models in this field

    Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation

    Full text link
    Accurately estimating the 3D pose of humans in video sequences requires both accuracy and a well-structured architecture. With the success of transformers, we introduce the Refined Temporal Pyramidal Compression-and-Amplification (RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends intra-block temporal modeling via its Temporal Pyramidal Compression-and-Amplification (TPCA) structure and refines inter-block feature interaction with a Cross-Layer Refinement (XLR) module. In particular, TPCA block exploits a temporal pyramid paradigm, reinforcing key and value representation capabilities and seamlessly extracting spatial semantics from motion sequences. We stitch these TPCA blocks with XLR that promotes rich semantic representation through continuous interaction of queries, keys, and values. This strategy embodies early-stage information with current flows, addressing typical deficits in detail and stability seen in other transformer-based methods. We demonstrate the effectiveness of RTPCA by achieving state-of-the-art results on Human3.6M, HumanEva-I, and MPI-INF-3DHP benchmarks with minimal computational overhead. The source code is available at https://github.com/hbing-l/RTPCA.Comment: 11 pages, 5 figure

    DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving

    Full text link
    Real-time perception, or streaming perception, is a crucial aspect of autonomous driving that has yet to be thoroughly explored in existing research. To address this gap, we present DAMO-StreamNet, an optimized framework that combines recent advances from the YOLO series with a comprehensive analysis of spatial and temporal perception mechanisms, delivering a cutting-edge solution. The key innovations of DAMO-StreamNet are: (1) A robust neck structure incorporating deformable convolution, enhancing the receptive field and feature alignment capabilities. (2) A dual-branch structure that integrates short-path semantic features and long-path temporal features, improving motion state prediction accuracy. (3) Logits-level distillation for efficient optimization, aligning the logits of teacher and student networks in semantic space. (4) A real-time forecasting mechanism that updates support frame features with the current frame, ensuring seamless streaming perception during inference. Our experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art methods, achieving 37.8% (normal size (600, 960)) and 43.3% (large size (1200, 1920)) sAP without using extra data. This work not only sets a new benchmark for real-time perception but also provides valuable insights for future research. Additionally, DAMO-StreamNet can be applied to various autonomous systems, such as drones and robots, paving the way for real-time perception. The code is available at https://github.com/zhiqic/DAMO-StreamNet

    Pollen morphology of selected tundra plants from the high Arctic of Ny-Ålesund, Svalbard

    Get PDF
    Documenting morphological features of modern pollen is fundamental for the identification of fossil pollen, which will assist researchers to reconstruct the vegetation and climate of a particular geologic period. This paper presents the pollen morphology of 20 species of tundra plants from the high Arctic of Ny-Ålesund, Svalbard, using light and scanning electron microscopy. The plants used in this study belong to 12 families: Brassicaceae, Caryophyllaceae, Cyperaceae, Ericaceae, Juncaceae, Papaveraceae, Poaceae, Polygonaceae, Ranunculaceae, Rosaceae, Salicaceae, and Scrophulariaceae. Pollen grain shapes included: spheroidal, subprolate, and prolate. Variable apertural patterns ranged from 2-syncolpate, 3-colpate, 3-(-4)-colpate, 3-(-5)-colpate, 3-colporate, 5-poroid, ulcerate, ulcus to pantoporate. Exine ornamentations comprised psilate, striate-perforate, reticulate, microechinate, microechinate-perforate, scabrate, granulate, and granulate-perforate. This study provided a useful reference for comparative studies of fossil pollen and for the reconstruction of paleovegetation and paleoclimate in Svalbard region of Arctic

    Effects of rotation angle and metal foam on natural convection of nanofluids in a cavity under an adjustable magnetic field

    Get PDF
    © 2019 Elsevier Ltd To investigate the natural convection heat transfer of Fe3O4-water nanofluids in a rectangular cavity under an adjustable magnetic field, two experimental systems are established. Meanwhile, several factors, such as nanoparticle mass fractions (ω = 0%, 0.1%, 0.3%, 0.5%), magnetic field directions (horizontal and vertical), magnetic field intensities (B = 0.0 T, 0.01 T, 0.02 T), rotation angles of the cavity (α = 0°, 45°, 90°, 135°), and PPI of Cu metal foam (PPI = 0, 5, 15) are taken into consideration to research the natural convection of Fe3O4-water nanofluids in a rectangular cavity. With the increasing nanoparticle mass fraction, Nusselt number firstly rises but then falls, and the maximum value of which appears at a nanoparticle mass fraction ω = 0.3%. Horizontal magnetic field is not significant to the thermal performance enhancement, but vertical magnetic field shows an opposite trend and makes a positive contribution to the thermal performance. The cavity with a rotation angle α = 90° shows the highest thermal performance. Nusselt number of the cavity filled with metal foam can be improved obviously compared with that without metal foam. But the increasing PPI of metal foam is disadvantageous to heat transfer performance

    PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation

    Full text link
    Existing 3D human pose estimators face challenges in adapting to new datasets due to the lack of 2D-3D pose pairs in training sets. To overcome this issue, we propose \textit{Multi-Hypothesis \textbf{P}ose \textbf{Syn}thesis \textbf{D}omain \textbf{A}daptation} (\textbf{PoSynDA}) framework to bridge this data disparity gap in target domain. Typically, PoSynDA uses a diffusion-inspired structure to simulate 3D pose distribution in the target domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse pose hypotheses and aligns them with the target domain. To do this, it first utilizes target-specific source augmentation to obtain the target domain distribution data from the source domain by decoupling the scale and position parameters. The process is then further refined through the teacher-student paradigm and low-rank adaptation. With extensive comparison of benchmarks such as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance, even comparable to the target-trained MixSTE model\cite{zhang2022mixste}. This work paves the way for the practical application of 3D human pose estimation in unseen domains. The code is available at https://github.com/hbing-l/PoSynDA.Comment: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynD

    KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration

    Full text link
    In the realm of facial analysis, accurate landmark detection is crucial for various applications, ranging from face recognition and expression analysis to animation. Conventional heatmap or coordinate regression-based techniques, however, often face challenges in terms of computational burden and quantization errors. To address these issues, we present the KeyPoint Positioning System (KeyPosS) - a groundbreaking facial landmark detection framework that stands out from existing methods. The framework utilizes a fully convolutional network to predict a distance map, which computes the distance between a Point of Interest (POI) and multiple anchor points. These anchor points are ingeniously harnessed to triangulate the POI's position through the True-range Multilateration algorithm. Notably, the plug-and-play nature of KeyPosS enables seamless integration into any decoding stage, ensuring a versatile and adaptable solution. We conducted a thorough evaluation of KeyPosS's performance by benchmarking it against state-of-the-art models on four different datasets. The results show that KeyPosS substantially outperforms leading methods in low-resolution settings while requiring a minimal time overhead. The code is available at https://github.com/zhiqic/KeyPosS.Comment: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPos
    corecore