Vision- and tactile-based continuous multimodal intention and attention recognition for safer physical human-robot interaction
Employing skin-like tactile sensors on robots enhances both the safety and
usability of collaborative robots by adding the capability to detect human
contact. Unfortunately, simple binary tactile sensors alone cannot determine
the context of the human contact -- whether it is a deliberate interaction or
an unintended collision that requires safety manoeuvres. Many published methods
classify discrete interactions using more advanced tactile sensors or by
analysing joint torques. Instead, we propose to augment the intention
recognition capabilities of simple binary tactile sensors by adding a
robot-mounted camera for human posture analysis. Different interaction
characteristics, including touch location, human pose, and gaze direction, are
used to train a supervised machine learning algorithm to classify whether a
touch is intentional or not with an F1-score of 86%. We demonstrate that
multimodal intention recognition is significantly more accurate than monomodal
analyses with the collaborative robot Baxter. Furthermore, our method can also
continuously monitor interactions that fluidly change between intentional or
unintentional by gauging the user's attention through gaze. If a user stops
paying attention mid-task, the proposed intention and attention recognition
algorithm can activate safety features to prevent unsafe interactions. We also
employ a feature reduction technique that reduces the number of inputs to five
to achieve a more generalized low-dimensional classifier. This simplification
both reduces the amount of training data required and improves real-world
classification accuracy. It also renders the method potentially agnostic to the
robot and touch sensor architectures while achieving a high degree of task
adaptability.
Comment: 11 pages, 8 figures, preprint under review
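The abstract leaves the supervised learner unspecified; the sketch below assumes a scikit-learn random forest with univariate feature selection down to five inputs, and its feature layout and data are placeholders standing in for touch location, pose, and gaze, not the paper's actual pipeline.

```python
# Minimal sketch of a multimodal touch-intention classifier of the kind
# described above. The classifier choice (random forest) and the univariate
# feature-selection step are illustrative assumptions; the paper states only
# that a supervised model is trained and the inputs are reduced to five.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder data: rows are touch events, columns are multimodal features
# (e.g., touch location, gaze direction, human-pose angles).
X = rng.normal(size=(500, 12))
y = rng.integers(0, 2, size=500)  # 1 = intentional touch, 0 = accidental

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

selector = SelectKBest(f_classif, k=5).fit(X_tr, y_tr)  # keep five inputs
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(selector.transform(X_tr), y_tr)

pred = clf.predict(selector.transform(X_te))
print("F1-score:", f1_score(y_te, pred))
```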
The Metaverse: Survey, Trends, Novel Pipeline Ecosystem & Future Directions
The Metaverse offers a second world beyond reality, where boundaries are
non-existent, and possibilities are endless through engagement and immersive
experiences using the virtual reality (VR) technology. Many disciplines can
benefit from the advancement of the Metaverse when accurately developed,
including the fields of technology, gaming, education, art, and culture.
Nevertheless, developing the Metaverse environment to its full potential is an
ambiguous task that needs proper guidance and directions. Existing surveys on
the Metaverse focus only on a specific aspect and discipline of the Metaverse
and lack a holistic view of the entire process. To this end, a more holistic,
multi-disciplinary, in-depth review, oriented to both academia and industry, is
required to provide a thorough study of the Metaverse development pipeline. To
address these issues, we present in this survey a novel multi-layered pipeline
ecosystem composed of (1) the Metaverse computing, networking, communications
and hardware infrastructure, (2) environment digitization, and (3) user
interactions. For every layer, we discuss the components that detail the steps
of its development. Also, for each of these components, we examine the impact
of a set of enabling technologies and empowering domains (e.g., Artificial
Intelligence, Security & Privacy, Blockchain, Business, Ethics, and Social) on
its advancement. In addition, we explain the importance of these technologies
to support decentralization, interoperability, user experiences, interactions,
and monetization. Our presented study highlights the existing challenges for
each component, followed by research directions and potential solutions. To the
best of our knowledge, this survey is the most comprehensive to date, allowing
users, scholars, and entrepreneurs to gain an in-depth understanding of the
Metaverse ecosystem and to identify their opportunities and potential
contributions.
ADS_UNet: A Nested UNet for Histopathology Image Segmentation
The UNet model consists of fully convolutional network (FCN) layers arranged
as contracting encoder and upsampling decoder maps. Nested arrangements of
these encoder and decoder maps give rise to extensions of the UNet model, such
as UNet^e and UNet++. Other refinements include constraining the outputs of the
convolutional layers to discriminate between segment labels when trained end to
end, a property called deep supervision. This reduces feature diversity in
these nested UNet models despite their large parameter space. Furthermore, for
texture segmentation, pixel correlations at multiple scales contribute to the
classification task; hence, explicit deep supervision of shallower layers is
likely to enhance performance. In this paper, we propose ADS_UNet, a stage-wise
additive training algorithm that incorporates resource-efficient deep
supervision in shallower layers and takes performance-weighted combinations of
the sub-UNets to create the segmentation model. We provide empirical evidence
on three histopathology datasets to support the claim that the proposed
ADS_UNet reduces correlations between constituent features and improves
performance while being more resource efficient. We demonstrate that ADS_UNet
outperforms state-of-the-art Transformer-based models by 1.08 and 0.6 points on
the CRAG and BCSS datasets, yet requires only 37% of the GPU consumption and
34% of the training time that Transformers require.
Comment: To be published in Expert Systems With Applications
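As a rough illustration of taking performance-weighted combinations of sub-UNets, the PyTorch sketch below weights each sub-network's logits by its validation Dice score; the exact weighting scheme used by ADS_UNet is an assumption here.

```python
# Sketch of a performance-weighted combination of sub-UNet outputs.
# Normalizing validation Dice scores into combination weights is an
# illustrative choice, not necessarily the paper's exact rule.
import torch

def combine_subunets(logits_per_stage, val_dice_per_stage):
    """logits_per_stage: list of (B, C, H, W) tensors, one per sub-UNet.
    val_dice_per_stage: list of floats measured on a validation set."""
    w = torch.tensor(val_dice_per_stage)
    w = w / w.sum()                          # performance-based weights
    stacked = torch.stack(logits_per_stage)  # (S, B, C, H, W)
    return (w.view(-1, 1, 1, 1, 1) * stacked).sum(dim=0)

# Example: three sub-UNets with dummy logits for a 2-class segmentation task.
logits = [torch.randn(1, 2, 64, 64) for _ in range(3)]
fused = combine_subunets(logits, [0.78, 0.82, 0.84])
print(fused.shape)  # torch.Size([1, 2, 64, 64])
```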
Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in
Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors
mounted at different locations to monitor the driver and the vehicle's interior
scene and employ decision-level fusion to integrate these heterogeneous data.
However, this fusion method may not fully utilize the complementarity of
different data sources and may overlook their relative importance. To address
these limitations, we propose a novel multiview multimodal driver monitoring
system based on feature-level fusion through multi-head self-attention (MHSA).
We demonstrate its effectiveness by comparing it against four alternative
fusion strategies (Sum, Conv, SE, and AFF). We also present a novel
GPU-friendly supervised contrastive learning framework SuMoCo to learn better
representations. Furthermore, we provide fine-grained annotations for the test
split of the DAD dataset to enable multi-class recognition of drivers'
activities. Experiments on
this enhanced database demonstrate that 1) the proposed MHSA-based fusion
method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and
2) training MHSA with patch masking can improve its robustness against
modality/view collapses. The code and annotations are publicly available.
Comment: 9 pages (1 for reference); accepted by the 6th Multimodal Learning
and Applications Workshop (MULA) at CVPR 2023
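A minimal sketch of what MHSA-based feature-level fusion can look like, assuming one feature token per modality/view fed through PyTorch's nn.MultiheadAttention; the dimensions, head count, and mean pooling are illustrative choices, and the key-padding mask hints at how modality/view masking could be emulated during training.

```python
# Feature-level fusion with multi-head self-attention: each sensor
# view/modality contributes one token, attention mixes them, and the pooled
# result feeds a classifier. All sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class MHSAFusion(nn.Module):
    def __init__(self, dim=256, heads=4, num_classes=10):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens, key_padding_mask=None):
        # tokens: (B, N, dim), one token per modality/view; the mask can drop
        # tokens to emulate masking a modality or view.
        fused, _ = self.attn(tokens, tokens, tokens,
                             key_padding_mask=key_padding_mask)
        return self.head(fused.mean(dim=1))  # pool over tokens, classify

feats = torch.randn(8, 4, 256)        # batch of 8, four sensor streams
print(MHSAFusion()(feats).shape)      # torch.Size([8, 10])
```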
CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes
Training models to apply linguistic knowledge and visual concepts from 2D
images to 3D world understanding is a promising direction that researchers have
only recently started to explore. In this work, we design a novel 3D
pre-training Vision-Language method that helps a model learn semantically
meaningful and transferable 3D scene point cloud representations. We inject the
representational power of the popular CLIP model into our 3D encoder by
aligning the encoded 3D scene features with the corresponding 2D image and text
embeddings produced by CLIP. To assess our model's 3D world reasoning
capability, we evaluate it on the downstream task of 3D Visual Question
Answering. Experimental quantitative and qualitative results show that our
pre-training method outperforms state-of-the-art works in this task and leads
to an interpretable representation of 3D scene features.
Comment: CVPRW 2023. Code will be made publicly available:
https://github.com/AlexDelitzas/3D-VQ
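A hedged sketch of the alignment objective: the symmetric InfoNCE-style loss below pulls each 3D scene feature toward its matching frozen CLIP embedding. The loss form and temperature are assumptions; the abstract says only that the 3D features are aligned with the corresponding CLIP image and text embeddings.

```python
# Contrastive alignment of 3D scene features with frozen CLIP embeddings.
# The symmetric cross-entropy over a similarity matrix is an illustrative
# choice of alignment loss, not confirmed by the abstract.
import torch
import torch.nn.functional as F

def clip_alignment_loss(feat3d, clip_emb, temperature=0.07):
    """feat3d: (B, D) 3D-encoder outputs; clip_emb: (B, D) frozen CLIP
    image or text embeddings for the matching scenes."""
    f3d = F.normalize(feat3d, dim=-1)
    fclip = F.normalize(clip_emb, dim=-1)
    logits = f3d @ fclip.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(f3d.size(0))      # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = clip_alignment_loss(torch.randn(16, 512), torch.randn(16, 512))
print(loss.item())
```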
TransFusionOdom: Interpretable Transformer-based LiDAR-Inertial Fusion Odometry Estimation
Multi-modal fusion of sensors is a commonly used approach to enhance the
performance of odometry estimation, which is also a fundamental module for
mobile robots. However, the question of how to perform fusion among different
modalities in a supervised sensor-fusion odometry estimation task remains a
challenging open issue. Simple operations such as element-wise summation and
concatenation cannot assign adaptive attentional weights to incorporate
different modalities efficiently, which makes it difficult to achieve
competitive odometry results. Recently, the Transformer
architecture has shown potential for multi-modal fusion tasks, particularly in
the domains of vision with language. In this work, we propose an end-to-end
supervised Transformer-based LiDAR-Inertial fusion framework (namely
TransFusionOdom) for odometry estimation. The multi-attention fusion module
demonstrates different fusion approaches for homogeneous and heterogeneous
modalities to address the overfitting problem that can arise from blindly
increasing the complexity of the model. Additionally, to interpret the learning
process of the Transformer-based multi-modal interactions, a general
visualization approach is introduced to illustrate the interactions between
modalities. Moreover, exhaustive ablation studies evaluate different
multi-modal fusion strategies to verify the performance of the proposed fusion
strategy. A synthetic multi-modal dataset is made public to validate the
generalization ability of the proposed fusion strategy, which also works for
other combinations of different modalities. The quantitative and qualitative
odometry evaluations on the KITTI dataset verify that the proposed
TransFusionOdom achieves superior performance compared with other related works.
Comment: Submitted to IEEE Sensors Journal with some modifications. This work
has been submitted to the IEEE for possible publication. Copyright may be
transferred without notice, after which this version may no longer be
accessible
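One plausible instantiation of attention-based fusion of heterogeneous modalities, shown below in PyTorch, has LiDAR tokens cross-attend to IMU tokens; this is an illustrative block under stated assumptions, not the exact multi-attention fusion module of TransFusionOdom.

```python
# Cross-attention fusion of heterogeneous LiDAR and IMU features: LiDAR
# tokens query IMU tokens, and a residual connection preserves the LiDAR
# geometry. Dimensions and head count are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_tokens, imu_tokens):
        attended, _ = self.cross(lidar_tokens, imu_tokens, imu_tokens)
        return self.norm(lidar_tokens + attended)

lidar = torch.randn(2, 64, 128)  # e.g., tokens from a LiDAR feature map
imu = torch.randn(2, 10, 128)    # e.g., an embedded IMU measurement window
print(CrossModalFusion()(lidar, imu).shape)  # torch.Size([2, 64, 128])
```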
ARA-net: an attention-aware retinal atrophy segmentation network coping with fundus images
Background: Accurately detecting and segmenting areas of retinal atrophy are paramount for early medical intervention in pathological myopia (PM). However, segmenting retinal atrophic areas based on a two-dimensional (2D) fundus image poses several challenges, such as blurred boundaries, irregular shapes, and size variation. To overcome these challenges, we propose an attention-aware retinal atrophy segmentation network (ARA-Net) to segment retinal atrophy areas from the 2D fundus image.
Methods: ARA-Net adopts a strategy similar to UNet to perform the area segmentation. A skip self-attention connection (SSA) block, comprising a shortcut and a parallel polarized self-attention (PPSA) block, is proposed to deal with the blurred boundaries and irregular shapes of the retinal atrophic region. Further, we propose a multi-scale feature flow (MSFF) to address the size variation. The flow is added between the SSA connection blocks, allowing the network to capture considerable semantic information and detect retinal atrophy across various area sizes.
Results: The proposed method has been validated on the Pathological Myopia (PALM) dataset. Experimental results demonstrate that our method yields a Dice coefficient (DICE) of 84.26%, a Jaccard index (JAC) of 72.80%, and an F1-score of 84.57%, significantly outperforming other methods.
Conclusion: Our results demonstrate that ARA-Net is an effective and efficient approach for retinal atrophic area segmentation in PM.
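The reported metrics follow their standard definitions; for reference, a minimal NumPy sketch of computing Dice and Jaccard scores for binary atrophy masks:

```python
# Standard Dice coefficient and Jaccard index for binary segmentation masks.
import numpy as np

def dice_and_jaccard(pred, target, eps=1e-7):
    """pred, target: boolean arrays of the same shape."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2 * inter / (pred.sum() + target.sum() + eps)
    jaccard = inter / (union + eps)
    return dice, jaccard

pred = np.zeros((64, 64), bool)
pred[8:40, 8:40] = True
gt = np.zeros((64, 64), bool)
gt[16:48, 16:48] = True
print(dice_and_jaccard(pred, gt))  # overlapping squares: ~0.56, ~0.39
```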
Neural Architecture Search: Insights from 1000 Papers
In the past decade, advances in deep learning have resulted in breakthroughs
in a variety of areas, including computer vision, natural language
understanding, speech recognition, and reinforcement learning. Specialized,
high-performing neural architectures are crucial to the success of deep
learning in these areas. Neural architecture search (NAS), the process of
automating the design of neural architectures for a given task, is an
inevitable next step in automating machine learning and has already outpaced
the best human-designed architectures on many tasks. In the past few years,
research in NAS has been progressing rapidly, with over 1000 papers released
since 2020 (Deng and Lindauer, 2021). In this survey, we provide an organized
and comprehensive guide to neural architecture search. We give a taxonomy of
search spaces, algorithms, and speedup techniques, and we discuss resources
such as benchmarks, best practices, other surveys, and open-source libraries.
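To make the task concrete, the sketch below shows NAS in its most basic form: random search over a toy search space with a stubbed-out evaluation. Both the space and the scorer are illustrative; real systems replace them with the search spaces, algorithms, and speedup techniques the survey catalogues.

```python
# Random-search NAS baseline: sample architectures from a small search space
# and keep the best-scoring candidate. The evaluate() stub stands in for
# "train the candidate and measure validation accuracy".
import random

SEARCH_SPACE = {
    "depth": [2, 4, 8],
    "width": [64, 128, 256],
    "op": ["conv3x3", "conv5x5", "depthwise"],
}

def sample_architecture():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def evaluate(arch):
    return random.random()  # placeholder validation score

best_arch, best_score = None, -1.0
for _ in range(20):
    arch = sample_architecture()
    score = evaluate(arch)
    if score > best_score:
        best_arch, best_score = arch, score
print(best_arch, best_score)
```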
Loop Closure Detection Based on Object-level Spatial Layout and Semantic Consistency
Visual simultaneous localization and mapping (SLAM) systems face challenges
in detecting loop closure under large viewpoint changes. In
this paper, we present an object-based loop closure detection method based on
the spatial layout and semantic consistency of the 3D scene graph. Firstly, we
propose an object-level data association approach based on the semantic
information from semantic labels, intersection over union (IoU), object color,
and object embedding. Subsequently, multi-view bundle adjustment with the
associated objects is utilized to jointly optimize the poses of objects and
cameras. We represent the refined objects as a 3D spatial graph with semantics
and topology. Then, we propose a graph matching approach to select
corresponding objects based on the structural layout and the semantic-property
similarity of each vertex's neighbors. Finally, we jointly optimize camera
trajectories and object poses in an object-level pose graph optimization, which
results in a globally consistent map. Experimental results demonstrate that our
proposed data association approach can construct more accurate 3D semantic
maps, and our loop closure method is more robust than point-based and
object-based methods under large viewpoint changes.
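A hedged sketch of what an object-level association score combining the listed cues (semantic label, IoU, color, embedding) might look like; the cue weights, bounding-box IoU, and score form below are assumptions for illustration, not the paper's formulation.

```python
# Toy object-level association score combining semantic label agreement,
# bounding-box IoU, color similarity, and embedding similarity.
import numpy as np

def box_iou(a, b):
    """a, b: [x1, y1, x2, y2] axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-7)

def association_score(obj_a, obj_b, w=(0.4, 0.2, 0.4)):
    if obj_a["label"] != obj_b["label"]:       # semantic labels must agree
        return 0.0
    iou = box_iou(obj_a["box"], obj_b["box"])
    color_sim = 1.0 - np.abs(obj_a["color"] - obj_b["color"]).mean()
    ea, eb = obj_a["embedding"], obj_b["embedding"]
    embed_sim = float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb) + 1e-7))
    return w[0] * iou + w[1] * color_sim + w[2] * embed_sim

a = {"label": "chair", "box": [0, 0, 2, 2],
     "color": np.array([0.6, 0.4, 0.3]), "embedding": np.ones(8)}
b = {"label": "chair", "box": [1, 1, 3, 3],
     "color": np.array([0.5, 0.45, 0.3]), "embedding": np.ones(8)}
print(association_score(a, b))  # ~0.65 for these overlapping chairs
```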
Joint Video Multi-Frame Interpolation and Deblurring under Unknown Exposure Time
Natural videos captured by consumer cameras often suffer from low framerate
and motion blur due to the combination of dynamic scene complexity, lens and
sensor imperfections, and less-than-ideal exposure settings. As a result,
computational methods that jointly perform video frame interpolation and
deblurring begin to emerge with the unrealistic assumption that the exposure
time is known and fixed. In this work, we aim ambitiously for a more realistic
and challenging task - joint video multi-frame interpolation and deblurring
under unknown exposure time. Toward this goal, we first adopt a variant of
supervised contrastive learning to construct an exposure-aware representation
from input blurred frames. We then train two U-Nets for intra-motion and
inter-motion analysis, respectively, adapting to the learned exposure
representation via gain tuning. We finally build our video reconstruction
network upon the exposure and motion representation by progressive
exposure-adaptive convolution and motion refinement. Extensive experiments on
both simulated and real-world datasets show that our optimized method achieves
notable performance gains over the state-of-the-art on the joint video x8
interpolation and deblurring task. Moreover, on the seemingly implausible x16
interpolation task, our method outperforms existing methods by more than 1.5 dB
in terms of PSNR.
Comment: Accepted by CVPR 2023, available at
https://github.com/shangwei5/VIDU
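Since the reported gain is stated in PSNR, the standard computation is worth recalling; a minimal NumPy sketch follows. Note that a 1.5 dB PSNR gain corresponds to roughly a 29% reduction in mean squared error.

```python
# Standard PSNR between a restored frame and its ground truth,
# for images in the [0, max_val] range.
import numpy as np

def psnr(pred, target, max_val=1.0):
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(max_val ** 2 / mse)

gt = np.random.rand(64, 64, 3).astype(np.float32)
noisy = np.clip(gt + np.random.normal(0, 0.05, gt.shape), 0, 1)
print(f"{psnr(noisy, gt):.2f} dB")
```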