7 research outputs found

    Towards more efficient few-shot learning based human gesture recognition via dynamic vision sensors

    No full text
    For the human gesture recognition task, recent fully supervised deep learning models have achieved impressive performance when sufficient samples of predefined gesture classes are provided. However, these models do not generalise well to new classes, which limits accuracy on unforeseen gesture categories. Few-shot learning based human gesture recognition (FSL-HGR) addresses this problem by supporting faster learning from only a few samples of new gesture classes. In this paper, we aim to develop a novel FSL-HGR method that is suitable for deployment on affordable edge devices and enables energy-efficient inference across a large number of classes. Specifically, we adapt a surrogate gradient-based spiking neural network (SNN) model to efficiently process video sequences collected via dynamic vision sensors. With a focus on energy efficiency, we design two novel strategies, spiking noise suppression and emission sparsity learning, which significantly reduce the spike emission rate in all layers of the network. In addition, we introduce a dual-speed stream contrastive learning algorithm that achieves higher performance without the additional computational burden associated with dual-stream inference. Our experimental results demonstrate the effectiveness of our approach: we achieve state-of-the-art accuracy of 84.75% and 92.82% on the 5-way 1-shot and 5-way 5-shot learning tasks on the DVS128 Gesture dataset, with 60.02% and 58.21% fewer emitted spikes, respectively, compared to a standard SNN architecture without the spiking noise suppression and emission sparsity learning strategies.
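
    The abstract above does not include code; the following is a minimal, hypothetical sketch (in PyTorch, not the authors' implementation) of the kind of surrogate gradient-based spiking layer it describes, assuming a leaky integrate-and-fire neuron and a rectangular surrogate derivative.

    # Minimal sketch of a surrogate-gradient spiking layer (hypothetical, not the authors' code).
    # Assumes a leaky integrate-and-fire (LIF) neuron and a rectangular surrogate derivative.
    import torch
    import torch.nn as nn


    class SurrogateSpike(torch.autograd.Function):
        """Heaviside spike in the forward pass, rectangular surrogate gradient in the backward pass."""

        @staticmethod
        def forward(ctx, membrane_potential, threshold=1.0):
            ctx.save_for_backward(membrane_potential)
            ctx.threshold = threshold
            return (membrane_potential >= threshold).float()

        @staticmethod
        def backward(ctx, grad_output):
            (membrane_potential,) = ctx.saved_tensors
            # Pass gradients only near the firing threshold (width 0.5 on each side).
            surrogate = (torch.abs(membrane_potential - ctx.threshold) < 0.5).float()
            return grad_output * surrogate, None


    class LIFLayer(nn.Module):
        """Fully connected layer followed by LIF dynamics, unrolled over time."""

        def __init__(self, in_features, out_features, decay=0.9, threshold=1.0):
            super().__init__()
            self.fc = nn.Linear(in_features, out_features)
            self.decay = decay
            self.threshold = threshold

        def forward(self, x):                      # x: (time, batch, in_features)
            mem = torch.zeros(x.size(1), self.fc.out_features, device=x.device)
            spikes = []
            for t in range(x.size(0)):
                mem = self.decay * mem + self.fc(x[t])
                out = SurrogateSpike.apply(mem, self.threshold)
                mem = mem * (1.0 - out)            # hard reset after a spike
                spikes.append(out)
            return torch.stack(spikes)             # (time, batch, out_features)


    if __name__ == "__main__":
        layer = LIFLayer(in_features=128, out_features=64)
        events = torch.rand(20, 8, 128)             # toy stand-in for DVS event frames
        out_spikes = layer(events)
        print(out_spikes.shape, out_spikes.mean().item())  # average spike rate per unit

    An emission-sparsity objective such as the one mentioned in the abstract could, for example, be approximated by adding each layer's mean spike rate to the task loss as a regulariser; the exact formulation used by the authors is not given in the abstract.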

    CrossBind: Collaborative cross-modal identification of protein nucleic-acid-binding residues

    No full text
    Accurate identification of protein nucleic-acid-binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. Specifically, our multi-modal approach leverages a contrastive learning technique and atom-wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine-grained local geometric knowledge for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next best state-of-the-art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1-score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind.
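
    The abstract does not reproduce the model; below is a minimal, hypothetical sketch (PyTorch, not the released CrossBind code) of an atom-wise cross-attention block of the kind described, in which per-residue protein language model embeddings attend over per-atom geometric features. All layer names and dimensions are assumptions.

    # Hypothetical sketch (not the released CrossBind code) of atom-wise cross-attention that
    # fuses per-atom geometric features with per-residue embeddings from a protein language model.
    import torch
    import torch.nn as nn


    class AtomResidueCrossAttention(nn.Module):
        """Residues attend over atom-level geometric features; output feeds a binding classifier."""

        def __init__(self, atom_dim=64, residue_dim=1280, hidden_dim=256, num_heads=4):
            super().__init__()
            self.atom_proj = nn.Linear(atom_dim, hidden_dim)
            self.residue_proj = nn.Linear(residue_dim, hidden_dim)
            self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
            self.classifier = nn.Linear(hidden_dim, 1)   # binding vs. non-binding residue

        def forward(self, residue_emb, atom_feat):
            # residue_emb: (batch, num_residues, residue_dim)  e.g. from a protein language model
            # atom_feat:   (batch, num_atoms, atom_dim)         e.g. from a geometric encoder
            q = self.residue_proj(residue_emb)
            kv = self.atom_proj(atom_feat)
            fused, _ = self.attn(query=q, key=kv, value=kv)
            return self.classifier(fused + q).squeeze(-1)    # per-residue binding logits


    if __name__ == "__main__":
        model = AtomResidueCrossAttention()
        residues = torch.randn(2, 150, 1280)   # toy language-model embeddings
        atoms = torch.randn(2, 1200, 64)       # toy atom-level geometric features
        logits = model(residues, atoms)
        print(logits.shape)                    # (2, 150)

    The contrastive component mentioned in the abstract could additionally align the two modalities with an InfoNCE-style objective, but its exact form is not specified there.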

    X4D-SceneFormer: Enhanced scene understanding on 4D point cloud videos through cross-modal knowledge transfer

    No full text
    The field of 4D point cloud understanding is rapidly developing with the goal of analyzing dynamic 3D point cloud sequences. However, it remains a challenging task due to the sparsity and lack of texture in point clouds. Moreover, the irregularity of point clouds makes it difficult to align temporal information within video sequences. To address these issues, we propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer. This framework enhances 4D scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining. Specifically, the framework is designed with a dual-branch architecture, consisting of a 4D point cloud transformer and a Gradient-aware Image Transformer (GIT). The GIT combines visual texture and temporal correlation features to offer rich semantics and dynamics for better point cloud representation. During training, we employ multiple knowledge transfer techniques, including temporal consistency losses and masked self-attention, to strengthen the knowledge transfer between modalities. This leads to enhanced performance during inference using single-modal 4D point cloud inputs. Extensive experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks, including action recognition, action segmentation and semantic segmentation. Our results rank 1st on the HOI4D challenge, with 85.3% (+7.9%) accuracy on 4D action segmentation and 47.3% (+5.0%) mIoU on semantic segmentation, outperforming the previous state-of-the-art by a large margin. We release the code at https://github.com/jinglinglingling/X4D
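
    As a rough illustration of the training-time knowledge transfer described above (not the released X4D-SceneFormer code), the hypothetical sketch below combines a per-frame feature alignment loss with a temporal consistency term between the image and point cloud branches; the image branch is detached so that only the point cloud branch, the one kept at inference, receives gradients. The loss weighting and feature shapes are assumptions.

    # Hypothetical sketch (not the released X4D-SceneFormer code) of cross-modal knowledge
    # transfer: the point-cloud branch is trained to match an RGB image branch, with a
    # temporal consistency term, while only the point-cloud branch is used at inference.
    import torch
    import torch.nn.functional as F


    def cross_modal_transfer_loss(point_feat, image_feat, lambda_temporal=0.5):
        """point_feat, image_feat: (batch, time, dim) frame-level features of the two branches."""
        # Feature alignment: pull per-frame point features towards the (detached) image features.
        align = F.mse_loss(point_feat, image_feat.detach())

        # Temporal consistency: match frame-to-frame feature changes across the two modalities.
        point_delta = point_feat[:, 1:] - point_feat[:, :-1]
        image_delta = (image_feat[:, 1:] - image_feat[:, :-1]).detach()
        temporal = F.mse_loss(point_delta, image_delta)

        return align + lambda_temporal * temporal


    if __name__ == "__main__":
        point_feat = torch.randn(4, 16, 256, requires_grad=True)  # point-branch features
        image_feat = torch.randn(4, 16, 256)                      # frozen image-branch features
        loss = cross_modal_transfer_loss(point_feat, image_feat)
        loss.backward()
        print(loss.item())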

    HPL-ESS: hybrid pseudo-labeling for unsupervised event-based semantic segmentation

    No full text
    Event-based semantic segmentation has gained popularity due to its capability to deal with scenarios under high-speed motion and extreme lighting conditions, which cannot be addressed by conventional RGB cameras. Since it is hard to annotate event data, previous approaches rely on event-to-image reconstruction to obtain pseudo labels for training. However, this inevitably introduces noise, and learning from noisy pseudo labels, especially when generated from a single source, may reinforce the errors. This drawback is also called confirmation bias in pseudo-labeling. In this paper, we propose a novel hybrid pseudo-labeling framework for unsupervised event-based semantic segmentation, HPL-ESS, to alleviate the influence of noisy pseudo labels. Specifically, we first employ a plain unsupervised domain adaptation framework as our baseline, which can generate a set of pseudo labels through self-training. Then, we incorporate offline event-to-image reconstruction into the framework, and obtain another set of pseudo labels by predicting segmentation maps on the reconstructed images. A noisy label learning strategy is designed to mix the two sets of pseudo labels and enhance their quality. Moreover, we propose a soft prototypical alignment (SPA) module to further improve the consistency of target domain features. Extensive experiments show that the proposed method outperforms existing state-of-the-art methods by a large margin (e.g., +5.88% accuracy and +10.32% mIoU on the DSEC-Semantic dataset), and even surpasses several supervised methods.
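
    A minimal, hypothetical sketch (not the authors' HPL-ESS code) of the hybrid pseudo-labeling idea: pixels where the self-training branch and the event-to-image reconstruction branch agree keep their label, while disagreeing pixels are assigned an ignore index and excluded from the segmentation loss. The agreement rule and ignore index are assumptions; the paper's noisy label learning strategy is more involved.

    # Hypothetical sketch (not the authors' HPL-ESS code) of mixing two sets of pseudo labels:
    # pixels where the self-training branch and the reconstruction branch agree keep their label,
    # all other pixels are marked with an ignore index and excluded from the loss.
    import torch
    import torch.nn.functional as F

    IGNORE_INDEX = 255


    def mix_pseudo_labels(labels_self_training, labels_reconstruction):
        """Both inputs: (batch, H, W) integer class maps from the two pseudo-label sources."""
        agree = labels_self_training == labels_reconstruction
        mixed = torch.where(agree, labels_self_training,
                            torch.full_like(labels_self_training, IGNORE_INDEX))
        return mixed


    if __name__ == "__main__":
        a = torch.randint(0, 11, (2, 64, 64))        # pseudo labels from self-training
        b = torch.randint(0, 11, (2, 64, 64))        # pseudo labels from reconstructed images
        targets = mix_pseudo_labels(a, b)
        logits = torch.randn(2, 11, 64, 64)          # segmentation network output
        loss = F.cross_entropy(logits, targets, ignore_index=IGNORE_INDEX)
        print(loss.item())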
