Rehabilitation Exercise Repetition Segmentation and Counting using Skeletal Body Joints
Physical exercise is an essential component of rehabilitation programs that
improve quality of life and reduce mortality and re-hospitalization rates. In
AI-driven virtual rehabilitation programs, patients complete their exercises
independently at home, while AI algorithms analyze the exercise data to provide
feedback to patients and report their progress to clinicians. To analyze
exercise data, the first step is to segment it into consecutive repetitions.
Much research has addressed segmenting and counting the repetitive activities
of healthy individuals using raw video data, an approach that raises privacy
concerns and is computationally intensive.
Previous research on patients' rehabilitation exercise segmentation relied on
data collected by multiple wearable sensors, which are difficult to use at home
by rehabilitation patients. Compared to healthy individuals, segmenting and
counting exercise repetitions in patients is more challenging because of the
irregular repetition duration and the variation between repetitions. This paper
presents a novel approach for segmenting and counting the repetitions of
rehabilitation exercises performed by patients, based on their skeletal body
joints. Skeletal body joints can be acquired through depth cameras or computer
vision techniques applied to RGB videos of patients. Various sequential neural
networks are designed to analyze the sequences of skeletal body joints and
perform repetition segmentation and counting. Extensive experiments on three
publicly available rehabilitation exercise datasets, KIMORE, UI-PRMD, and
IntelliRehabDS, demonstrate the superiority of the proposed method compared to
previous methods. The proposed method enables accurate exercise analysis while
preserving privacy, facilitating the effective delivery of virtual
rehabilitation programs. Comment: 8 pages, 1 figure, 2 table
H-TSP: Hierarchically Solving the Large-Scale Travelling Salesman Problem
We propose an end-to-end learning framework based on hierarchical
reinforcement learning, called H-TSP, for addressing the large-scale Travelling
Salesman Problem (TSP). The proposed H-TSP constructs a solution of a TSP
instance from scratch, relying on two components: the upper-level
policy chooses a small subset of nodes (up to 200 in our experiment) from all
nodes that are to be traversed, while the lower-level policy takes the chosen
nodes as input and outputs a tour connecting them to the existing partial route
(initially only containing the depot). After jointly training the upper-level
and lower-level policies, our approach can directly generate solutions for the
given TSP instances without relying on any time-consuming search procedures. To
demonstrate the effectiveness of the proposed approach, we have conducted
extensive experiments on randomly generated TSP instances with different
numbers of nodes. We show that H-TSP can achieve results comparable to SOTA
search-based approaches (gap 3.42% vs. 7.32%), and more importantly, we reduce
the time consumption by up to two orders of magnitude (3.32s vs. 395.85s). To the best of
our knowledge, H-TSP is the first end-to-end deep reinforcement learning
approach that can scale to TSP instances of up to 10000 nodes. Although there
are still gaps to SOTA results with respect to solution quality, we believe
that H-TSP will be useful for practical applications, particularly those that
are time-sensitive, e.g., on-call routing and ride-hailing services. Comment: Accepted by AAAI 2023, February 202
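The two figures of merit quoted in the H-TSP abstract above, the optimality gap and the running-time ratio, can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function names and the tour lengths passed to `optimality_gap` are hypothetical, and the abstract does not state which reference tour the authors use. Only the two timings (3.32s and 395.85s) come from the abstract.

```python
def optimality_gap(tour_length, reference_length):
    """Relative excess of a tour over a reference tour, in percent."""
    return 100.0 * (tour_length - reference_length) / reference_length


def speedup(baseline_seconds, method_seconds):
    """How many times faster the method is than the baseline."""
    return baseline_seconds / method_seconds


# Hypothetical tour lengths, chosen only to show how a 3.42% gap arises.
example_gap = optimality_gap(tour_length=103.42, reference_length=100.0)

# Timings quoted in the abstract: 3.32s for H-TSP vs. 395.85s for the baseline.
time_ratio = speedup(baseline_seconds=395.85, method_seconds=3.32)

print(f"gap: {example_gap:.2f}%")
print(f"speedup: {time_ratio:.1f}x")
```

The ratio lands slightly above 100, which is what "up to two orders of magnitude" refers to.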
Learning Robust Visual-Semantic Embedding for Generalizable Person Re-identification
Generalizable person re-identification (Re-ID) is a very hot research topic
in machine learning and computer vision, which plays a significant role in
realistic scenarios due to its various applications in public security and
video surveillance. However, previous methods mainly focus on visual
representation learning while neglecting to explore the potential of semantic
features during training, which easily leads to poor generalization capability
when adapted to a new domain. In this paper, we propose a Multi-Modal
Equivalent Transformer called MMET for more robust visual-semantic embedding
learning on visual, textual and visual-textual tasks respectively. To further
enhance the robust feature learning in the context of transformer, a dynamic
masking mechanism called Masked Multimodal Modeling strategy (MMM) is
introduced to mask both the image patches and the text tokens, which can
jointly work on multimodal or unimodal data and significantly boost the
performance of generalizable person Re-ID. Extensive experiments on benchmark
datasets demonstrate the competitive performance of our method over previous
approaches. We hope this method could advance the research towards
visual-semantic representation learning. Our source code is also publicly
available at https://github.com/JeremyXSC/MMET
The Metaverse: Survey, Trends, Novel Pipeline Ecosystem & Future Directions
The Metaverse offers a second world beyond reality, where boundaries are
non-existent, and possibilities are endless through engagement and immersive
experiences using virtual reality (VR) technology. Many disciplines can
benefit from the advancement of the Metaverse when accurately developed,
including the fields of technology, gaming, education, art, and culture.
Nevertheless, developing the Metaverse environment to its full potential is an
ambiguous task that needs proper guidance and directions. Existing surveys on
the Metaverse focus only on a specific aspect and discipline of the Metaverse
and lack a holistic view of the entire process. To this end, a more holistic,
multi-disciplinary, in-depth, and academic and industry-oriented review is
required to provide a thorough study of the Metaverse development pipeline. To
address these issues, we present in this survey a novel multi-layered pipeline
ecosystem composed of (1) the Metaverse computing, networking, communications
and hardware infrastructure, (2) environment digitization, and (3) user
interactions. For every layer, we discuss the components that detail the steps
of its development. Also, for each of these components, we examine the impact
of a set of enabling technologies and empowering domains (e.g., Artificial
Intelligence, Security & Privacy, Blockchain, Business, Ethics, and Social) on
its advancement. In addition, we explain the importance of these technologies
to support decentralization, interoperability, user experiences, interactions,
and monetization. Our presented study highlights the existing challenges for
each component, followed by research directions and potential solutions. To the
best of our knowledge, this survey is the most comprehensive and allows users,
scholars, and entrepreneurs to get an in-depth understanding of the Metaverse
ecosystem to find their opportunities and potentials for contribution
Multi-Graph Convolution Network for Pose Forecasting
Recently, there has been a growing interest in predicting human motion, which
involves forecasting future body poses based on observed pose sequences. This
task is complex because it requires modeling both spatial and temporal relationships. The most
commonly used models for this task are autoregressive models, such as recurrent
neural networks (RNNs) or variants, and Transformer Networks. However, RNNs
have several drawbacks, such as vanishing or exploding gradients. Other
researchers have attempted to solve the communication problem in the spatial
dimension by integrating Graph Convolutional Networks (GCN) and Long Short-Term
Memory (LSTM) models. These works deal with temporal and spatial information
separately, which limits the effectiveness. To fix this problem, we propose a
novel approach called the multi-graph convolution network (MGCN) for 3D human
pose forecasting. This model simultaneously captures spatial and temporal
information by introducing an augmented graph for pose sequences. Multiple
frames give multiple parts, joined together in a single graph instance.
Furthermore, we also explore the influence of natural structure and
sequence-aware attention on our model. In our experimental evaluation on the
large-scale benchmark datasets Human3.6M, AMASS and 3DPW, MGCN outperforms the
state-of-the-art in pose prediction. Comment: arXiv admin note: text overlap with arXiv:2110.04573 by other author
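One way to picture "multiple frames joined together in a single graph instance" is a single adjacency matrix over all (frame, joint) pairs. The sketch below is an assumption about what such an augmented graph could look like, not the MGCN construction itself: it links joints along bones within each frame and links each joint to its own copy in the next frame. The function name, the toy skeleton, and the choice of temporal edges are all hypothetical.

```python
def build_spatiotemporal_adjacency(num_frames, num_joints, bones):
    """Return a symmetric 0/1 adjacency matrix over num_frames * num_joints nodes.

    Node index for joint j in frame t is t * num_joints + j.
    Assumed edge types (illustrative only):
      - spatial: the two joints of each bone, within the same frame
      - temporal: the same joint in consecutive frames
    """
    n = num_frames * num_joints
    adj = [[0] * n for _ in range(n)]

    def node(t, j):
        return t * num_joints + j

    def connect(u, v):
        adj[u][v] = 1
        adj[v][u] = 1

    for t in range(num_frames):
        # Spatial edges inside frame t.
        for a, b in bones:
            connect(node(t, a), node(t, b))
        # Temporal edges from frame t to frame t + 1.
        if t + 1 < num_frames:
            for j in range(num_joints):
                connect(node(t, j), node(t + 1, j))
    return adj


# Toy skeleton: 3 joints in a chain, observed over 2 frames.
toy_bones = [(0, 1), (1, 2)]
toy_adj = build_spatiotemporal_adjacency(num_frames=2, num_joints=3, bones=toy_bones)
num_undirected_edges = sum(map(sum, toy_adj)) // 2
```

A graph convolution applied to this one matrix mixes information across space and time in the same layer, which is the property the abstract contrasts with GCN-plus-LSTM pipelines that handle the two separately.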
Ambiguous Medical Image Segmentation using Diffusion Models
Collective insights from a group of experts have always proven to outperform
an individual's best diagnostic for clinical tasks. For the task of medical
image segmentation, existing research on AI-based alternatives focuses more on
developing models that can imitate the best individual rather than harnessing
the power of expert groups. In this paper, we introduce a single diffusion
model-based approach that produces multiple plausible outputs by learning a
distribution over group insights. Our proposed model generates a distribution
of segmentation masks by leveraging the inherent stochastic sampling process of
diffusion using only minimal additional learning. We demonstrate on three
different medical image modalities (CT, ultrasound, and MRI) that our model is
capable of producing several possible variants while capturing the frequencies
of their occurrences. Comprehensive results show that our proposed approach
outperforms existing state-of-the-art ambiguous segmentation networks in terms
of accuracy while preserving naturally occurring variation. We also propose a
new metric to evaluate the diversity as well as the accuracy of segmentation
predictions that aligns with the interest of clinical practice of collective
insights
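The idea of "capturing the frequencies of their occurrences" can be illustrated by tallying repeated samples from a stochastic segmenter. The sketch below assumes the samples are already available as small binary masks; the hard-coded `samples` list stands in for draws from a diffusion model and is entirely hypothetical, as is the function name.

```python
from collections import Counter


def mask_frequencies(sampled_masks):
    """Map each distinct mask (as a tuple of row tuples) to its empirical frequency."""
    counts = Counter(tuple(map(tuple, mask)) for mask in sampled_masks)
    total = sum(counts.values())
    return {mask: count / total for mask, count in counts.items()}


# Hypothetical stand-in for four stochastic samples of a 2x2 segmentation mask.
samples = [
    [[0, 1], [1, 1]],
    [[0, 1], [1, 1]],
    [[0, 0], [1, 1]],
    [[0, 1], [1, 1]],
]
freqs = mask_frequencies(samples)
for mask, freq in freqs.items():
    print(mask, freq)
```

Comparing such empirical frequencies with the proportion of experts who drew each variant is one plausible way to check whether a model reflects group disagreement rather than a single annotator.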
One Small Step for Generative AI, One Giant Leap for AGI: A Complete Survey on ChatGPT in AIGC Era
OpenAI has recently released GPT-4 (a.k.a. ChatGPT plus), which is
demonstrated to be one small step for generative AI (GAI), but one giant leap
for artificial general intelligence (AGI). Since its official release in
November 2022, ChatGPT has quickly attracted numerous users with extensive
media coverage. Such unprecedented attention has also motivated numerous
researchers to investigate ChatGPT from various aspects. According to Google
Scholar, there are more than 500 articles with ChatGPT in their titles or
mentioning it in their abstracts. Considering this, a review is urgently
needed, and our work fills this gap. Overall, this work is the first to survey
ChatGPT with a comprehensive review of its underlying technology, applications,
and challenges. Moreover, we present an outlook on how ChatGPT might evolve to
realize general-purpose AIGC (a.k.a. AI-generated content), which will be a
significant milestone for the development of AGI. Comment: A Survey on ChatGPT and GPT-4, 29 pages. Feedback is appreciated
Wav2code: Restore Clean Speech Representations via Codebook Lookup for Noise-Robust ASR
Automatic speech recognition (ASR) has achieved remarkable success thanks to
recent advances in deep learning, but it usually degrades significantly under
real-world noisy conditions. Recent works introduce speech enhancement (SE) as
a front-end to improve speech quality, which has proved effective but may not be
optimal for downstream ASR due to the speech distortion problem. Based on that,
latest works combine SE and currently popular self-supervised learning (SSL) to
alleviate distortion and improve noise robustness. Despite the effectiveness,
the speech distortion caused by conventional SE still cannot be completely
eliminated. In this paper, we propose a self-supervised framework named
Wav2code to implement a generalized SE without distortions for noise-robust
ASR. First, in the pre-training stage, the clean speech representations from
the SSL model are used to look up a discrete codebook via nearest-neighbor
feature matching; the resulting code sequence is then used to reconstruct the
original clean representations, so that they are stored in the codebook as a prior.
Second, during finetuning we propose a Transformer-based code predictor to
accurately predict clean codes by modeling the global dependency of input noisy
representations, which enables discovery and restoration of high-quality clean
representations without distortions. Furthermore, we propose an interactive
feature fusion network to combine the original noisy and the restored clean
representations to consider both fidelity and quality, resulting in even more
informative features for downstream ASR. Finally, experiments on both synthetic
and real noisy datasets demonstrate that Wav2code can solve the speech
distortion and improve ASR performance under various noisy conditions,
resulting in stronger robustness. Comment: 12 pages, 7 figures, Submitted to IEEE/ACM TASL
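The codebook lookup "via nearest-neighbor feature matching" mentioned in the Wav2code abstract above is a standard vector-quantization step. The sketch below shows that generic step only, assuming squared Euclidean distance and plain Python lists. The codebook values, feature vectors, and function names are hypothetical and do not come from the paper.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared Euclidean distance)."""
    best_index, best_distance = 0, float("inf")
    for index, code in enumerate(codebook):
        distance = sum((f - c) ** 2 for f, c in zip(feature, code))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def quantize(features, codebook):
    """Replace each frame-level feature with its nearest code; return indices and codes."""
    indices = [nearest_code(feature, codebook) for feature in features]
    return indices, [codebook[i] for i in indices]


# Hypothetical 2-dimensional codebook with three entries.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
# Hypothetical frame-level representations.
toy_features = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]]

code_indices, quantized = quantize(toy_features, toy_codebook)
print(code_indices)
```

In the abstract's setting, the index sequence plays the role of the "code sequence", and a separate predictor would later estimate these indices from noisy input instead of computing them from clean features.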
Quantifying and Explaining Machine Learning Uncertainty in Predictive Process Monitoring: An Operations Research Perspective
This paper introduces a comprehensive, multi-stage machine learning
methodology that effectively integrates information systems and artificial
intelligence to enhance decision-making processes within the domain of
operations research. The proposed framework adeptly addresses common
limitations of existing solutions, such as the neglect of data-driven
estimation for vital production parameters, exclusive generation of point
forecasts without considering model uncertainty, and the lack of explanations
regarding the sources of such uncertainty. Our approach employs Quantile
Regression Forests for generating interval predictions, alongside both local
and global variants of SHapley Additive Explanations for the examined
predictive process monitoring problem. The practical applicability of the
proposed methodology is substantiated through a real-world production planning
case study, emphasizing the potential of prescriptive analytics in refining
decision-making procedures. This paper accentuates the imperative of addressing
these challenges to fully harness the extensive and rich data resources
accessible for well-informed decision-making
CoRe-Sleep: A Multimodal Fusion Framework for Time Series Robust to Imperfect Modalities
Sleep abnormalities can have severe health consequences. Automated sleep
staging, i.e. labelling the sequence of sleep stages from the patient's
physiological recordings, could simplify the diagnostic process. Previous work
on automated sleep staging has achieved great results, mainly relying on the
EEG signal. However, often multiple sources of information are available beyond
EEG. This can be particularly beneficial when the EEG recordings are noisy or
even missing completely. In this paper, we propose CoRe-Sleep, a Coordinated
Representation multimodal fusion network that is particularly focused on
improving the robustness of signal analysis on imperfect data. We demonstrate
how appropriately handling multimodal information can be the key to achieving
such robustness. CoRe-Sleep tolerates noisy or missing modality segments,
allowing training on incomplete data. Additionally, it shows state-of-the-art
performance when testing on both multimodal and unimodal data using a single
model on SHHS-1, the largest publicly available study that includes sleep stage
labels. The results indicate that training the model on multimodal data does
positively influence performance when tested on unimodal data. This work aims
at bridging the gap between automated analysis tools and their clinical
utility. Comment: 10 pages, 4 figures, 2 tables, journa