Robot Learning from Human Demonstration: Interpretation, Adaptation, and Interaction
Robot Learning from Demonstration (LfD) is a research area that focuses on how robots can learn new skills by observing how people perform various activities. As humans, we have a remarkable ability to imitate other humans’ behaviors and adapt to new situations. Endowing robots with these critical capabilities is a significant but very challenging problem, considering the complexity and variation of human activities in highly dynamic environments.
This research focuses on how robots can learn new skills by interpreting human activities, adapting the learned skills to new situations, and naturally interacting with humans. This dissertation begins with a discussion of the challenges in each of these three problems. A new unified representation approach is introduced to enable robots to simultaneously interpret the high-level semantic meanings and generalize the low-level trajectories of a broad range of human activities. An adaptive framework based on feature space decomposition is then presented for robots to not only reproduce skills, but also autonomously and efficiently adjust the learned skills to new environments that are significantly different from the demonstrations. To achieve natural Human-Robot Interaction (HRI), this dissertation presents a Recurrent Neural Network-based deep perceptual control approach, which is capable of integrating multi-modal perception sequences with actions for robots to interact with humans in long-term tasks.
Overall, by combining the above approaches, an autonomous system is created for robots to acquire important skills that can be applied to human-centered applications. Finally, this dissertation concludes with a discussion of future directions that could accelerate the upcoming technological revolution of robot learning from human demonstration.
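The recurrent perceptual-control idea above can be sketched minimally: a hidden state summarises the perception sequence seen so far, and an action is read out at every step. The toy loop below is not the dissertation's model; the scalar state, fixed weights, and direct action read-out are invented for illustration.

```python
import math

def rnn_step(h, obs, w_h=0.5, w_x=0.5):
    # One recurrent update over a 1-D state (toy fixed weights,
    # not learned parameters): tanh keeps the state bounded.
    return math.tanh(w_h * h + w_x * obs)

def control(observations):
    """Map a sequence of fused perception values to a sequence of actions."""
    h, actions = 0.0, []
    for obs in observations:
        h = rnn_step(h, obs)   # state summarises the history
        actions.append(h)      # action read out directly from the state
    return actions

acts = control([0.5, -0.2, 0.1])
print(len(acts))  # 3: one action per observation
```

In a real system the recurrence would be a learned deep network over multi-modal feature vectors rather than a scalar, but the structure of the loop is the same.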
An Outlook into the Future of Egocentric Vision
What will the future be? We wonder! In this survey, we explore the gap
between current research in egocentric vision and the ever-anticipated future,
where wearable computing, with outward facing cameras and digital overlays, is
expected to be integrated in our every day lives. To understand this gap, the
article starts by envisaging the future through character-based stories,
showcasing through examples the limitations of current technology. We then
provide a mapping between this future and previously defined research tasks.
For each task, we survey its seminal works, current state-of-the-art
methodologies and available datasets, then reflect on shortcomings that limit
its applicability to future research. Note that this survey focuses on software
models for egocentric vision, independent of any specific hardware. The paper
concludes with recommendations for areas of immediate explorations so as to
unlock our path to the future always-on, personalised and life-enhancing
egocentric vision.
Comment: We invite comments, suggestions and corrections here:
https://openreview.net/forum?id=V3974SUk1
WEAR: A Multimodal Dataset for Wearable and Egocentric Video Activity Recognition
Though research has shown the complementarity of camera- and inertial-based
data, datasets which offer both modalities remain scarce. In this paper we
introduce WEAR, a multimodal benchmark dataset for both vision- and
wearable-based Human Activity Recognition (HAR). The dataset comprises data
from 18 participants performing a total of 18 different workout activities with
untrimmed inertial (acceleration) and camera (egocentric video) data recorded
at 10 different outside locations. WEAR features a diverse set of activities
which are low in inter-class similarity and which, unlike those in previous
egocentric datasets, are neither defined by human-object interactions nor
drawn from inherently distinct activity categories. The provided benchmark
results reveal that
single-modality architectures have different strengths and weaknesses in their
prediction performance. Further, in light of the recent success of
transformer-based video action detection models, we demonstrate their
versatility by applying them in a plain fashion using vision, inertial and
combined (vision + inertial) features as input. Results show that vision
transformers are not only able to produce competitive results using only
inertial data, but also can function as an architecture to fuse both modalities
by means of simple concatenation, with the multimodal approach being able to
produce the highest average mAP, precision and close-to-best F1-scores. Until
now, vision-based transformers had been explored in neither inertial nor
multimodal human activity recognition, making our approach the first to do
so. The dataset and code to reproduce our experiments are publicly available
via: mariusbock.github.io/wear
Comment: 12 pages, 2 figures, 2 tables
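The "simple concatenation" fusion described above can be illustrated in a few lines: per-frame vision and inertial feature vectors are joined into one input vector before being fed to the action-detection model. The dimensions and function name below are illustrative, not taken from the WEAR codebase.

```python
def fuse_features(vision_feats, inertial_feats):
    """Fusion by per-frame concatenation of two aligned feature streams."""
    assert len(vision_feats) == len(inertial_feats), "streams must be aligned"
    # Join each pair of per-frame vectors into one longer vector.
    return [v + i for v, i in zip(vision_feats, inertial_feats)]

vision = [[0.1, 0.2], [0.3, 0.4]]   # 2 frames x 2-dim video features (toy)
inertial = [[0.5], [0.6]]           # 2 frames x 1-dim acceleration features (toy)
fused = fuse_features(vision, inertial)
print(fused)  # [[0.1, 0.2, 0.5], [0.3, 0.4, 0.6]]
```

The fused vectors would then be passed unchanged to the transformer-based detector, which is what makes this "plain fashion" fusion attractive: no architectural change is needed.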
Human behavior understanding for worker-centered intelligent manufacturing
“In a worker-centered intelligent manufacturing system, sensing and understanding of the worker’s behavior are the primary tasks, which are essential for automatic performance evaluation & optimization, intelligent training & assistance, and human-robot collaboration. In this study, a worker-centered training & assistant system is proposed for intelligent manufacturing, which features self-awareness and active guidance. To understand hand behavior, a method is proposed for complex hand gesture recognition using Convolutional Neural Networks (CNN) with multi-view augmentation and inference fusion, from depth images captured by a Microsoft Kinect. To sense and understand the worker in a more comprehensive way, a multi-modal approach is proposed for worker activity recognition using Inertial Measurement Unit (IMU) signals obtained from a Myo armband and videos from a visual camera. To automatically learn the importance of different sensors, a novel attention-based approach is proposed for human activity recognition using multiple IMU sensors worn at different body locations. To deploy the developed algorithms on the factory floor, a real-time assembly operation recognition system is proposed with fog computing and transfer learning. The proposed worker-centered training & assistant system has been validated and has demonstrated its feasibility and great potential for application in the manufacturing industry for frontline workers.
Our developed approaches have been evaluated: 1) the multi-view approach outperforms the state of the art on two public benchmark datasets, 2) the multi-modal approach achieves an accuracy of 97% on a worker activity dataset including 6 activities and achieves the best performance on a public dataset, 3) the attention-based method outperforms the state-of-the-art methods on five publicly available datasets, and 4) the developed transfer learning model achieves a real-time recognition accuracy of 95% on a dataset including 10 worker operations”--Abstract, page iv
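The attention-based weighting of multiple body-worn IMUs can be sketched as follows: each sensor's feature vector is weighted by a softmax-normalised score and the weighted vectors are summed into one fused representation. In the actual method the scores come from a learned scoring network; here they are supplied directly, and the sensor names and dimensions are placeholders.

```python
import math

def softmax(scores):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(sensor_feats, scores):
    """Fuse per-sensor feature vectors by attention-weighted summation."""
    weights = softmax(scores)
    dim = len(sensor_feats[0])
    return [sum(w * feats[d] for w, feats in zip(weights, sensor_feats))
            for d in range(dim)]

# Two sensors (e.g. wrist and ankle IMUs), 2-dim toy features each.
fused = attend([[1.0, 0.0], [0.0, 1.0]], scores=[0.0, 0.0])
print(fused)  # equal scores -> simple average: [0.5, 0.5]
```

Unequal scores shift the fused vector toward the more informative sensor, which is exactly what lets the model "automatically learn the importance of different sensors".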
Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition
Recently, wearable emotion recognition based on peripheral physiological
signals has drawn massive attention due to its less invasive nature and its
applicability in real-life scenarios. However, how to effectively fuse
multimodal data remains a challenging problem. Moreover, traditional
fully-supervised based approaches suffer from overfitting given limited labeled
data. To address the above issues, we propose a novel self-supervised learning
(SSL) framework for wearable emotion recognition, where efficient multimodal
fusion is realized with temporal convolution-based modality-specific encoders
and a transformer-based shared encoder, capturing both intra-modal and
inter-modal correlations. Extensive unlabeled data is automatically assigned
labels by five signal transforms, and the proposed SSL model is pre-trained
with signal transformation recognition as a pretext task, allowing the
extraction of generalized multimodal representations for emotion-related
downstream tasks. For evaluation, the proposed SSL model was first pre-trained
on a large-scale self-collected physiological dataset and the resulting encoder
was subsequently frozen or fine-tuned on three public supervised emotion
recognition datasets. Ultimately, our SSL-based method achieved
state-of-the-art results in various emotion classification tasks. Meanwhile,
the proposed model proved to be more accurate and robust than
fully-supervised methods in low-data regimes.
Comment: Accepted to IEEE Transactions on Affective Computing
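The pretext-task labelling described above can be sketched simply: each unlabeled signal window is passed through a set of transforms, and the transform's index becomes a free pseudo-label for pre-training. The paper's five specific transforms are not reproduced here; the three below (identity, magnitude scaling, temporal flip) are common examples used purely for illustration.

```python
TRANSFORMS = [
    lambda x: list(x),               # identity (unchanged window)
    lambda x: [2.0 * v for v in x],  # magnitude scaling
    lambda x: list(reversed(x)),     # temporal flip
]

def make_pretext_dataset(windows):
    """Pair each transformed copy of a window with its transform index."""
    data = []
    for w in windows:
        for label, transform in enumerate(TRANSFORMS):
            data.append((transform(w), label))
    return data

pairs = make_pretext_dataset([[1.0, 2.0, 3.0]])
print(pairs)
# [([1.0, 2.0, 3.0], 0), ([2.0, 4.0, 6.0], 1), ([3.0, 2.0, 1.0], 2)]
```

A model pre-trained to recognise which transform was applied must learn signal structure, which is why the frozen encoder transfers to emotion classification.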
Low-Cost Indoor Localisation Based on Inertial Sensors, Wi-Fi and Sound
The average life expectancy has been increasing in the last decades, creating the need for
new technologies to improve the quality of life of the elderly. In the Ambient Assisted
Living scope, indoor location systems emerged as a promising technology capable of supporting the elderly, providing them a safer environment to live in, and promoting their
autonomy. Current indoor location technologies are divided into two categories, depending on their need for additional infrastructure. Infrastructure-based solutions require
expensive deployment and maintenance. On the other hand, most infrastructure-free
systems rely on a single source of information, being highly dependent on its availability.
Such systems will hardly be deployed in real-life scenarios, as they cannot handle the
absence of their source of information. An efficient solution must, thus, guarantee the
continuous indoor positioning of the elderly.
This work proposes a new room-level low-cost indoor location algorithm. It relies
on three information sources: inertial sensors, to reconstruct users’ trajectories; environmental sound, to exploit the unique characteristics of each home division; and Wi-Fi,
to estimate the distance to the Access Point in the neighbourhood. Two data collection
protocols were designed to resemble a real living scenario, and a data processing stage
was applied to the collected data. Then, each source was used to train individual Machine Learning (including Deep Learning) algorithms to identify room-level positions.
As each source provides different information to the classification, the data were merged
to produce a more robust localization. Three data fusion approaches (input-level, early,
and late fusion) were implemented for this goal, providing a final output containing
complementary contributions from all data sources.
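The late-fusion variant described above can be sketched as follows: each source-specific classifier outputs a probability distribution over rooms, and the distributions are averaged into a final prediction. The probabilities and the three-room setup below are made up for illustration (the thesis distinguishes seven home divisions).

```python
def late_fusion(prob_dists):
    """Average per-source class probabilities and pick the winning room."""
    n_rooms = len(prob_dists[0])
    avg = [sum(p[r] for p in prob_dists) / len(prob_dists)
           for r in range(n_rooms)]
    return avg.index(max(avg)), avg

inertial = [0.6, 0.3, 0.1]   # P(room) from the inertial classifier (toy)
sound    = [0.2, 0.7, 0.1]   # P(room) from the sound classifier (toy)
wifi     = [0.1, 0.8, 0.1]   # P(room) from the Wi-Fi classifier (toy)
room, avg = late_fusion([inertial, sound, wifi])
print(room)  # 1 -> the second room wins after averaging
```

Averaging is the simplest late-fusion rule; it also degrades gracefully when one source is uninformative, which matches the thesis's goal of not depending on any single source's availability.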
Experimental results show that the performance improved when more than one source
was used, attaining a weighted F1-score of 81.8% in the localization between seven home
divisions. In conclusion, the evaluation of the developed algorithm shows that it can
achieve accurate room-level indoor localization, being, thus, suitable to be applied in
Ambient Assisted Living scenarios.