Spiking Neural Network for Ultra-low-latency and High-accurate Object Detection
Spiking Neural Networks (SNNs) have garnered widespread interest for their
energy efficiency and brain-inspired event-driven properties. While recent
methods like Spiking-YOLO have extended SNNs to more challenging object
detection tasks, they often suffer from high latency and low detection
accuracy, making them difficult to deploy on latency-sensitive mobile
platforms. Furthermore, methods that convert Artificial Neural Networks
(ANNs) to SNNs struggle to preserve the complete structure of the ANNs,
resulting in poor feature representation and high conversion errors. To address
these challenges, we propose two methods: timesteps compression and
spike-time-dependent integrated (STDI) coding. The former reduces the timesteps
required in ANN-SNN conversion by compressing information, while the latter
sets a time-varying threshold to expand the information-holding capacity. We
also present an SNN-based ultra-low-latency and high-accuracy object detection
model (SUHD) that achieves state-of-the-art performance on nontrivial datasets
like PASCAL VOC and MS COCO, with about 750x fewer timesteps and a 30%
mean average precision (mAP) improvement compared to Spiking-YOLO on the MS
COCO dataset. To the best of our knowledge, SUHD is the deepest spike-based
object detection model to date that achieves lossless conversion with an
ultra-low number of timesteps.
Comment: 14 pages, 10 figures
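The link between timestep count and conversion error can be illustrated with a minimal sketch. It assumes plain rate coding with an integrate-and-fire neuron and reset by subtraction, a common ANN-SNN conversion setup. It is not the timesteps compression or STDI coding proposed in the abstract, and the name `if_neuron_rate` and its parameters are hypothetical.

```python
def if_neuron_rate(x, timesteps, threshold=1.0):
    """Firing rate of an integrate-and-fire neuron driven by a constant input x.

    Assumes 0 <= x <= threshold and reset by subtraction (soft reset).
    The returned rate approximates the ANN activation x / threshold.
    """
    membrane = 0.0
    spikes = 0
    for _ in range(timesteps):
        membrane += x                  # integrate the constant input
        if membrane >= threshold:      # fire once the threshold is reached
            membrane -= threshold      # soft reset keeps the residual charge
            spikes += 1
    return spikes / timesteps


# The approximation error shrinks roughly as 1 / timesteps.
coarse = if_neuron_rate(0.3, timesteps=4)
fine = if_neuron_rate(0.3, timesteps=64)
```

Under these assumptions the rate differs from `x / threshold` by less than `1 / timesteps`, which is why naive rate-coded conversion tends to need many timesteps to match ANN accuracy.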
Hierarchical Graph Neural Networks for Proprioceptive 6D Pose Estimation of In-hand Objects
Robotic manipulation, in particular in-hand object manipulation, often
requires an accurate estimate of the object's 6D pose. To improve the accuracy
of the estimated pose, state-of-the-art approaches in 6D object pose estimation
use observational data from one or more modalities, e.g., RGB images, depth,
and tactile readings. However, existing approaches make limited use of the
underlying geometric structure of the object captured by these modalities,
thereby increasing their reliance on visual features. This results in poor
performance when presented with objects that lack such visual features or when
visual features are simply occluded. Furthermore, current approaches do not
take advantage of the proprioceptive information embedded in the position of
the fingers. To address these limitations, in this paper: (1) we introduce a
hierarchical graph neural network architecture for combining multimodal (vision
and touch) data that allows for a geometrically informed 6D object pose
estimation, (2) we introduce a hierarchical message passing operation that
flows the information within and across modalities to learn a graph-based
object representation, and (3) we introduce a method that accounts for the
proprioceptive information for in-hand object representation. We evaluate our
model on a diverse subset of objects from the YCB Object and Model Set, and
show that our method substantially outperforms existing state-of-the-art work
in accuracy and robustness to occlusion. We also deploy our proposed framework
on a real robot and qualitatively demonstrate successful transfer to real
settings.
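The general idea of passing messages within and across modalities can be sketched as one round of mean-aggregation message passing on a small graph. This is a generic illustration only: it has no learned weights or hierarchy and is not the architecture described in the abstract. The node names, the edge list, and the function `message_passing_step` are hypothetical.

```python
def message_passing_step(features, edges):
    """One round of mean-aggregation message passing on an undirected graph.

    features: dict mapping node name -> feature vector (list of floats)
    edges: list of (node_a, node_b) pairs
    Each node's new feature is the average of its own feature and the
    mean of its neighbours' features.
    """
    neighbours = {node: [] for node in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)

    updated = {}
    for node, own in features.items():
        messages = [features[n] for n in neighbours[node]]
        if not messages:               # isolated node keeps its feature
            updated[node] = list(own)
            continue
        mean = [sum(vals) / len(messages) for vals in zip(*messages)]
        updated[node] = [(o + m) / 2 for o, m in zip(own, mean)]
    return updated


# Two vision nodes and one touch node; the touch node is linked to vision_0.
features = {"vision_0": [1.0, 0.0], "vision_1": [0.0, 1.0], "touch_0": [1.0, 1.0]}
edges = [("vision_0", "vision_1"), ("vision_0", "touch_0")]
updated = message_passing_step(features, edges)
```

Here `vision_0` receives information from both a vision node and a touch node, which is the cross-modal flow in its simplest form.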
Advancing Adversarial Training by Injecting Booster Signal
Recent works have demonstrated that deep neural networks (DNNs) are highly
vulnerable to adversarial attacks. To defend against adversarial attacks, many
defense strategies have been proposed, among which adversarial training has
been demonstrated to be the most effective strategy. However, adversarial
training is known to sometimes hurt natural accuracy. Consequently, many works
focus on optimizing model parameters to handle the problem. Different from the
previous approaches, in this paper, we propose a new approach to improve the
adversarial robustness by using an external signal rather than model
parameters. In the proposed method, a well-optimized universal external signal
called a booster signal is injected outside the image so that it does
not overlap with the original content. It then boosts both adversarial
robustness and natural accuracy. The booster signal is optimized jointly with
the model parameters, step by step. Experimental results show that
the booster signal can improve both the natural and robust accuracies over the
recent state-of-the-art adversarial training methods. Also, optimizing the
booster signal is general and flexible enough to be adopted in any existing
adversarial training method.
Comment: Accepted at IEEE Transactions on Neural Networks and Learning Systems
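The placement of a signal outside the image can be sketched as pasting the image into the centre of a larger canvas whose border carries the booster. This is a minimal illustration under assumed details: the padding layout, the function name `inject_booster`, and the constant border values are hypothetical, and the optimization of the booster is not shown.

```python
def inject_booster(image, booster, pad):
    """Place `image` at the centre of `booster` so the two never overlap.

    image:   H x W list of lists
    booster: (H + 2*pad) x (W + 2*pad) list of lists holding the shared signal
    Only the border of `booster` survives; its centre is replaced by the image.
    """
    height, width = len(image), len(image[0])
    if len(booster) != height + 2 * pad or any(
        len(row) != width + 2 * pad for row in booster
    ):
        raise ValueError("booster must be the image size plus padding on each side")

    canvas = [row[:] for row in booster]          # copy so the booster is reusable
    for r in range(height):
        for c in range(width):
            canvas[r + pad][c + pad] = image[r][c]
    return canvas


image = [[1, 2], [3, 4]]
booster = [[9] * 4 for _ in range(4)]   # one universal signal shared by all images
padded = inject_booster(image, booster, pad=1)
```

Because the booster is copied rather than modified, the same signal can be applied to every image in a batch.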
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Large vision-language models have achieved outstanding performance, but their
size and computational requirements make their deployment on
resource-constrained devices and time-sensitive tasks impractical. Model
distillation, the process of creating smaller, faster models that maintain the
performance of larger models, is a promising direction towards the solution.
This paper investigates the distillation of visual representations in large
teacher vision-language models into lightweight student models using a small-
or mid-scale dataset. Notably, this study focuses on open-vocabulary
out-of-distribution (OOD) generalization, a challenging problem that has been
overlooked in previous model distillation literature. We propose two principles
from vision and language modality perspectives to enhance the student's OOD
generalization: (1) better imitating the teacher's visual representation space
and carefully promoting better coherence in vision-language alignment with the
teacher; (2) enriching the teacher's language representations with
informative and fine-grained semantic attributes to effectively distinguish
between different labels. We propose several metrics and conduct extensive
experiments to investigate these techniques. The results demonstrate
significant improvements in zero-shot and few-shot student performance on
open-vocabulary out-of-distribution classification, highlighting the
effectiveness of our proposed approaches. Our code will be released at
https://github.com/xuanlinli17/large_vlm_distillation_oo
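Imitating a teacher's visual representation space can be sketched with a cosine-distance loss between student and teacher embeddings. This is one common choice, shown as an assumption; the abstract does not specify the losses actually used, and the names `cosine_distance` and `imitation_loss` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def imitation_loss(student_feats, teacher_feats):
    """Mean cosine distance over paired student and teacher embeddings."""
    pairs = list(zip(student_feats, teacher_feats))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)


teacher = [[1.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 1.0]]      # same directions, different scales
orthogonal = [[0.0, 1.0], [1.0, 0.0]]   # perpendicular to the teacher

loss_aligned = imitation_loss(aligned, teacher)
loss_orthogonal = imitation_loss(orthogonal, teacher)
```

The loss ignores embedding scale and depends only on direction, so a student whose features point the same way as the teacher's incurs no penalty.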
A case of pure apraxia of speech after left hemisphere stroke: behavioral findings and neural correlates
Introduction: Apraxia of speech (AOS) is a motor speech disorder impairing the coordination of complex articulatory movements needed to produce speech. AOS typically co-occurs with a non-fluent aphasia, or language disorder, making it challenging to determine the specific brain structures that cause AOS. Cases of pure AOS without aphasia are rare but offer the best window into the neural correlates that support articulatory planning. The goal of the current study was to explore patterns of apraxic speech errors and their underlying neural correlates in a case of pure AOS.
Methods: A 67-year-old right-handed man presented with severe AOS resulting from a fronto-insular lesion caused by an ischemic stroke. The participant’s speech and language were evaluated at 1, 3, and 12 months post-onset. High-resolution structural MRI, including diffusion weighted imaging, was acquired at 12 months post-onset.
Results: At the first assessment, the participant made minor errors on the Comprehensive Aphasia Test, demonstrating mild deficits in writing, auditory comprehension, and repetition. By the second assessment, he no longer had aphasia. On the Motor Speech Evaluation, the severity of his AOS was initially rated as 5 (out of 7) and improved to a score of 4 by the second visit, likely due to training by his SLP at the time to slow his speech. Structural MRI data showed a fronto-insular lesion encompassing the superior precentral gyrus of the insula and portions of the inferior and middle frontal gyri and precentral gyrus. Tractography derived from diffusion MRI showed partial damage to the frontal aslant tract and arcuate fasciculus along the white matter projections to the insula.
Discussion: This pure case of severe AOS without aphasia affords a unique window into the behavioral and neural mechanisms of this motor speech disorder. The current findings support previous observations that AOS and aphasia are dissociable and confirm a role for the precentral gyrus of the insula and BA44, as well as underlying white matter, in supporting the coordination of complex articulatory movements. Additionally, other regions including the precentral gyrus, Broca’s area, and Area 55b are discussed regarding their potential role in successful speech production.
Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition
Most research on facial expression recognition (FER) is conducted in highly
controlled environments, but its performance is often unacceptable when applied
to real-world situations. This is because when unexpected objects occlude the
face, the FER network faces difficulties extracting facial features and
accurately predicting facial expressions. Therefore, occluded FER (OFER) is a
challenging problem. Previous studies on occlusion-aware FER have typically
required fully annotated facial images for training. However, collecting facial
images with various occlusions and expression annotations is time-consuming and
expensive. Latent-OFER, the proposed method, can detect occlusions, restore
occluded parts of the face as if they were unoccluded, and recognize them,
improving FER accuracy. This approach involves three steps: First, the vision
transformer (ViT)-based occlusion patch detector masks the occluded position by
training only latent vectors from the unoccluded patches using the support
vector data description algorithm. Second, the hybrid reconstruction network
generates the masking position as a complete image using the ViT and
convolutional neural network (CNN). Last, the expression-relevant latent vector
extractor retrieves and uses expression-related information from all latent
vectors by applying a CNN-based class activation map. This mechanism has a
significant advantage in preventing performance degradation from occlusion by
unseen objects. The experimental results on several databases demonstrate the
superiority of the proposed method over state-of-the-art methods.
Comment: 11 pages, 8 figures
RoboChop: Autonomous Framework for Fruit and Vegetable Chopping Leveraging Foundational Models
With the goal of developing fully autonomous cooking robots, it is important
to build robust systems that can chop a wide variety of objects. Existing
approaches focus primarily on the low-level dynamics of the cutting action,
which overlooks some of the practical real-world challenges of implementing
autonomous cutting systems. In this work we propose an autonomous framework to
sequence together action primitives for the purpose of chopping fruits and
vegetables on a cluttered cutting board. We present a novel technique to
leverage vision foundational models SAM and YOLO to accurately detect, segment,
and track fruits and vegetables as they visually change through the sequences
of chops, finetuning YOLO on a novel dataset of whole and chopped fruits and
vegetables. In our experiments, we demonstrate that our simple pipeline is able
to reliably chop a variety of fruits and vegetables ranging in size,
appearance, and texture, meeting a variety of chopping specifications,
including fruit type, number of slices, and types of slices.
Segmentation of Pathology Images: A Deep Learning Strategy with Annotated Data
Cancer has significantly threatened human life and health for many years. In the clinic, histopathology image segmentation is the gold standard for predicting patient prognosis and treatment outcome. Generally, manually labelling tumour regions in hundreds of high-resolution histopathological images is time-consuming and expensive for pathologists. Recently, advances in hardware and computer vision have made deep-learning-based methods the mainstream approach to automatic tumour segmentation, significantly reducing the workload of pathologists. However, most current methods rely on large-scale labelled histopathological images. Therefore, this research studies label-effective tumour segmentation methods using deep-learning paradigms to relieve the annotation limitations. Chapter 3 proposes an ensemble framework for fully-supervised tumour segmentation. Usually, the performance of an individually trained network is limited by significant morphological variance in histopathological images. We propose a fully-supervised ensemble fusion model that uses both shallow and deep U-Nets, trained with images of different resolutions and subsets of images, for robust predictions of tumour regions. Noise elimination is achieved with Convolutional Conditional Random Fields. Two open datasets are used to evaluate the proposed method: the ACDC@LungHP challenge at ISBI2019 and the DigestPath challenge at MICCAI2019. With a Dice coefficient of 79.7 %, the proposed method takes third place in ACDC@LungHP. In DigestPath 2019, the proposed method achieves a Dice coefficient of 77.3 %. Well-annotated images are an indispensable part of training fully-supervised segmentation strategies. However, large-scale histopathology images are rarely finely annotated in clinical practice. It is common for labels to be of poor quality or for only a few images to be manually marked by experts. Consequently, fully-supervised methods cannot perform well in these cases.
Chapter 4 proposes a self-supervised contrastive learning method for tumour segmentation. A self-supervised cancer segmentation framework is proposed to reduce label dependency. An innovative contrastive learning scheme is developed to represent tumour features based on unlabelled images. Unlike a normal U-Net, the backbone is a patch-based segmentation network. Additionally, data augmentation and contrastive losses are applied to improve the discriminability of tumour features. A convolutional Conditional Random Field is used to smooth and eliminate noise. Three labelled and fourteen unlabelled images are collected from a private skin cancer dataset called BSS. Experimental results show that the proposed method achieves better tumour segmentation performance than other popular self-supervised methods. However, when evaluated on the same public dataset as in Chapter 3, the proposed self-supervised method struggles with fine-grained segmentation around tumour boundaries compared to the supervised method we proposed. Chapter 5 proposes a sketch-based weakly-supervised tumour segmentation method. To segment tumour regions precisely with coarse annotations, a sketch-supervised method is proposed, containing a dual CNN-Transformer network and a global normalised class activation map. CNN-Transformer networks simultaneously model global and local tumour features. With the global normalised class activation map, a gradient-based tumour representation can be obtained from the dual network predictions. We invited experts to mark fine and coarse annotations in the private BSS and the public PAIP2019 datasets to facilitate reproducible performance comparisons. On the BSS dataset, the proposed method achieves an IoU of 76.686 % and a Dice score of 86.6 %, outperforming state-of-the-art methods. Additionally, the proposed method achieves a Dice gain of 8.372 % compared with U-Net on the PAIP2019 dataset.
The thesis presents three approaches to segmenting cancers from histology images: fully-supervised, self-supervised, and weakly-supervised methods. This research effectively segments tumour regions based on histopathological annotations and well-designed modules. Our studies comprehensively demonstrate label-effective automatic histopathological image segmentation. Experimental results show that our work achieves state-of-the-art segmentation performance on private and public datasets. In the future, we plan to integrate more tumour feature representation technologies with other medical modalities and apply them to clinical research.
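The Dice and IoU scores reported above are standard overlap metrics for binary masks. A minimal sketch of how they are computed follows; the function name `dice_and_iou`, the toy masks, and the convention of scoring two empty masks as 1.0 are assumptions for illustration and are not taken from the thesis.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two binary masks of equal shape.

    pred, target: lists of lists containing 0 or 1.
    Returns (dice, iou). Two empty masks score (1.0, 1.0) by convention here.
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]

    intersection = sum(1 for p, t in zip(flat_pred, flat_target) if p and t)
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    if union == 0:                      # both masks are empty
        return 1.0, 1.0
    dice = 2 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
```

For a single pair of masks the two metrics are related by Dice = 2 IoU / (1 + IoU). Scores averaged over many images need not satisfy this identity exactly.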
Research progress on deep learning in magnetic resonance imaging–based diagnosis and treatment of prostate cancer: a review on the current status and perspectives
Multiparametric magnetic resonance imaging (mpMRI) has emerged as a first-line screening and diagnostic tool for prostate cancer, aiding in treatment selection and noninvasive radiotherapy guidance. However, the manual interpretation of MRI data is challenging and time-consuming, which may impact sensitivity and specificity. With recent technological advances, artificial intelligence (AI) in the form of computer-aided diagnosis (CAD) based on MRI data has been applied to prostate cancer diagnosis and treatment. Among AI techniques, deep learning involving convolutional neural networks contributes to detection, segmentation, scoring, grading, and prognostic evaluation of prostate cancer. CAD systems offer automatic operation, rapid processing, and high accuracy, incorporating multiple sequences of multiparametric MRI data of the prostate gland into the deep learning model. Thus, they have become a research direction of great interest, especially in smart healthcare. This review highlights the current progress of deep learning technology in MRI-based diagnosis and treatment of prostate cancer. The key elements of deep learning-based MRI image processing in CAD systems and radiotherapy of prostate cancer are briefly described, making the topic understandable not only for radiologists but also for general physicians without specialized imaging interpretation training. Deep learning technology enables lesion identification, detection, and segmentation, grading and scoring of prostate cancer, and prediction of postoperative recurrence and prognostic outcomes. The diagnostic accuracy of deep learning can be improved by optimizing models and algorithms, expanding medical database resources, and combining multi-omics data and comprehensive analysis of various morphological data. Deep learning has the potential to become the key diagnostic method in prostate cancer diagnosis and treatment in the future.