Rehabilitation robot cell for multimodal standing-up motion augmentation
The paper presents a robot cell for multimodal standing-up motion augmentation. The robot cell is aimed at augmenting the standing-up capabilities of impaired or paraplegic subjects. The setup incorporates a rehabilitation robot device, a functional electrical stimulation system, measurement instrumentation, and a cognitive feedback system. To control the standing-up process, a novel approach was developed that integrates the voluntary activity of the person into the control scheme of the rehabilitation robot. The simulation results demonstrate the possibility of “patient-driven” robot-assisted standing-up training. Moreover, to extend the system's capabilities, audio cognitive feedback is intended to guide the subject throughout rising. The feedback is generated with a granular synthesis method that displays high-dimensional, dynamic data. The principle of operation and an example sonification of standing-up are presented. By integrating the cognitive feedback and “patient-driven” actuation systems, an effective motion augmentation system is proposed in which motion coordination is under the voluntary control of the user.
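As a rough illustration of the granular synthesis idea mentioned in this abstract, the sketch below overlap-adds short Hann-windowed sine grains whose pitch follows a data value. It is only a minimal sketch under stated assumptions: the function names, the frequency range, the grain length, and the normalized input signal are all hypothetical and are not taken from the paper.

```python
import math

def hann_grain(freq_hz, n_samples, sample_rate):
    """One grain: a sine burst shaped by a Hann window."""
    grain = []
    for i in range(n_samples):
        window = 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n_samples - 1)))
        grain.append(window * math.sin(2.0 * math.pi * freq_hz * i / sample_rate))
    return grain

def sonify(values, sample_rate=8000, grain_len=400, hop=200,
           f_low=220.0, f_high=880.0):
    """Overlap-add one grain per data value; a value in [0, 1] sets the pitch."""
    out = [0.0] * (hop * (len(values) - 1) + grain_len)
    for k, v in enumerate(values):
        v = min(1.0, max(0.0, v))              # clamp to the assumed [0, 1] range
        freq = f_low + v * (f_high - f_low)    # linear value-to-pitch mapping (assumption)
        for i, s in enumerate(hann_grain(freq, grain_len, sample_rate)):
            out[k * hop + i] += s
    return out

# Hypothetical normalized signal, e.g. a quantity that rises during standing-up.
signal = sonify([0.0, 0.25, 0.5, 0.75, 1.0])
```

The returned list holds raw audio samples; a real system would map several data dimensions to several grain parameters (density, duration, amplitude) instead of pitch alone.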
Fast Predictive Multimodal Image Registration
We introduce a deep encoder-decoder architecture for image deformation
prediction from multimodal images. Specifically, we design an image-patch-based
deep network that jointly learns (i) an image similarity measure and (ii) the
relationship between image patches and deformation parameters. While our method
can be applied to general image registration formulations, we focus on the
Large Deformation Diffeomorphic Metric Mapping (LDDMM) registration model. By
predicting the initial momentum of the shooting formulation of LDDMM, we
preserve its mathematical properties and drastically reduce the computation
time, compared to optimization-based approaches. Furthermore, we create a
Bayesian probabilistic version of the network that allows evaluation of
registration uncertainty via sampling of the network at test time. We evaluate
our method on a 3D brain MRI dataset using both T1- and T2-weighted images. Our
experiments show that our method generates accurate predictions and that
learning the similarity measure leads to more consistent registrations than
relying on generic multimodal image similarity measures, such as mutual
information. Our approach is an order of magnitude faster than
optimization-based LDDMM. Comment: Accepted as a conference paper for ISBI 201
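The abstract states that uncertainty is evaluated by sampling the network at test time. One common way to do this, assumed here and not confirmed by the abstract, is to keep dropout active during inference and summarize repeated stochastic predictions. The toy linear model, weights, and function names below are hypothetical and stand in for the actual network.

```python
import random
import statistics

def predict_with_dropout(x, weights, drop_prob, rng):
    """One stochastic forward pass of a toy linear model with dropout left on."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += xi * wi / keep   # inverted-dropout scaling keeps the expectation
    return total

def sample_prediction(x, weights, drop_prob=0.2, n_samples=200, seed=0):
    """Return the mean and standard deviation over repeated stochastic passes."""
    rng = random.Random(seed)
    draws = [predict_with_dropout(x, weights, drop_prob, rng)
             for _ in range(n_samples)]
    return statistics.mean(draws), statistics.stdev(draws)

# Hypothetical input features and weights.
mean, spread = sample_prediction([1.0, 2.0, 3.0], [0.5, -0.25, 0.1])
```

The spread across samples serves as a simple per-prediction uncertainty estimate; with `drop_prob=0.0` every pass is identical and the spread collapses to zero.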
Deep-Learning-Driven Techniques for Real-Time Multimodal Health and Physical Data Synthesis
With the advent of Artificial Intelligence for healthcare, data synthesis methods present crucial benefits in facilitating the fast development of AI models while protecting data subjects and bypassing the need to engage with the complexity of data sharing and processing agreements. Existing technologies focus on synthesising real-time physiological and physical records based on regular time intervals. Real health data are, however, characterised by irregularities and multimodal variables that are still hard to reproduce while preserving the correlation across time and different dimensions. This paper presents two novel techniques for synthetic data generation of real-time multimodal electronic health and physical records: (a) the Temporally Correlated Multimodal Generative Adversarial Network and (b) the Document Sequence Generator. The paper illustrates the need for and use of these techniques through a real use case, the H2020 GATEKEEPER project of AI for healthcare. Furthermore, the paper presents an evaluation of both individual cases and a discussion of the comparability between the techniques and the potential applications of synthetic data at the different stages of the software development life-cycle.
Speech-driven Animation with Meaningful Behaviors
Conversational agents (CAs) play an important role in human-computer
interaction. Creating believable movements for CAs is challenging, since the
movements have to be meaningful and natural, reflecting the coupling between
gestures and speech. Studies in the past have mainly relied on rule-based or
data-driven approaches. Rule-based methods focus on creating meaningful
behaviors conveying the underlying message, but the gestures cannot be easily
synchronized with speech. Data-driven approaches, especially speech-driven
models, can capture the relationship between speech and gestures. However, they
create behaviors disregarding the meaning of the message. This study proposes
to bridge the gap between these two approaches, overcoming their limitations.
The approach builds a dynamic Bayesian network (DBN), where a discrete variable
is added to constrain the behaviors on the underlying constraint. The study
implements and evaluates the approach with two constraints: discourse functions
and prototypical behaviors. By constraining on the discourse functions (e.g.,
questions), the model learns the characteristic behaviors associated with a
given discourse class, learning the rules from the data. By constraining on
prototypical behaviors (e.g., head nods), the approach can be embedded in a
rule-based system as a behavior realizer, creating trajectories that are
synchronized in time with speech. The study proposes a DBN structure and a training
approach that (1) models the cause-effect relationship between the constraint
and the gestures, (2) initializes the state configuration models, increasing the
range of the generated behaviors, and (3) captures the differences in the
behaviors across constraints by enforcing sparse transitions between shared and
exclusive states per constraint. Objective and subjective evaluations
demonstrate the benefits of the proposed approach over an unconstrained model. Comment: 13 pages, 12 figures, 5 tables
Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding
Modeling virtual agents with behavior style is one factor for personalizing
human-agent interaction. We propose an efficient yet effective machine learning
approach to synthesize gestures driven by prosodic features and text in the
style of different speakers including those unseen during training. Our model
performs zero-shot multimodal style transfer driven by multimodal data from the
PATS database containing videos of various speakers. We view style as being
pervasive while speaking: it colors the expressivity of communicative behaviors,
while speech content is carried by multimodal signals and text. This
disentanglement scheme of content and style allows us to directly infer the
style embedding even of a speaker whose data are not part of the training phase,
without requiring any further training or fine-tuning. The first goal of our
model is to generate the gestures of a source speaker based on the content of
two modalities, audio and text. The second goal is to condition the source
speaker's predicted gestures on the multimodal behavior style embedding of a
target speaker. The third goal is to allow zero-shot style transfer for speakers
unseen during training without retraining the model. Our system consists of:
(1) a speaker style encoder network that learns to generate a fixed-dimensional
speaker style embedding from a target speaker's multimodal data and (2) a
sequence to sequence synthesis network that synthesizes gestures based on the
content of the input modalities of a source speaker and conditioned on the
speaker style embedding. We show that our model can synthesize gestures of
a source speaker and transfer the knowledge of target speaker style variability
to the gesture generation task in a zero-shot setup. We convert the 2D gestures
to 3D poses and produce 3D animations. We conduct objective and subjective
evaluations to validate our approach and compare it with a baseline.
AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks
In this paper, we propose an Attentional Generative Adversarial Network
(AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained
text-to-image generation. With a novel attentional generative network, the
AttnGAN can synthesize fine-grained details at different subregions of the
image by paying attention to the relevant words in the natural language
description. In addition, a deep attentional multimodal similarity model is
proposed to compute a fine-grained image-text matching loss for training the
generator. The proposed AttnGAN significantly outperforms the previous state of
the art, boosting the best reported inception score by 14.14% on the CUB
dataset and 170.25% on the more challenging COCO dataset. A detailed analysis
is also performed by visualizing the attention layers of the AttnGAN. It shows for
the first time that the layered attentional GAN is able to automatically
select the condition at the word level for generating different parts of the
image.
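The word-level attention described in this abstract can be illustrated with generic dot-product attention: an image subregion scores each word, the scores are normalized with a softmax, and the word features are averaged with those weights. This is a minimal sketch of that general mechanism, not the AttnGAN formulation itself; the two-dimensional toy features and the function names are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(region_query, word_features):
    """Return attention weights over words and the weighted word context."""
    scores = [sum(q * w for q, w in zip(region_query, feat))
              for feat in word_features]
    weights = softmax(scores)
    dim = len(region_query)
    context = [sum(a * feat[d] for a, feat in zip(weights, word_features))
               for d in range(dim)]
    return weights, context

# Toy word features (hypothetical); the region query is most similar to "red".
words = {"red": [1.0, 0.0], "bird": [0.0, 1.0], "small": [0.5, 0.5]}
weights, context = attend([2.0, 0.0], list(words.values()))
```

Here the weight for "red" is the largest, so the context vector passed to that subregion is dominated by the feature of the most relevant word.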