
    Robot Learning from Human Demonstration: Interpretation, Adaptation, and Interaction

    Robot Learning from Demonstration (LfD) is a research area that focuses on how robots can learn new skills by observing how people perform various activities. As humans, we have a remarkable ability to imitate other humans' behaviors and adapt to new situations. Endowing robots with these critical capabilities is a significant yet challenging problem, given the complexity and variation of human activities in highly dynamic environments. This research focuses on how robots can learn new skills by interpreting human activities, adapting the learned skills to new situations, and naturally interacting with humans. This dissertation begins with a discussion of the challenges in each of these three problems. A new unified representation approach is introduced to enable robots to simultaneously interpret the high-level semantic meanings and generalize the low-level trajectories of a broad range of human activities. An adaptive framework based on feature space decomposition is then presented for robots to not only reproduce skills, but also autonomously and efficiently adjust the learned skills to new environments that differ significantly from the demonstrations. To achieve natural Human-Robot Interaction (HRI), this dissertation presents a Recurrent Neural Network (RNN)-based deep perceptual control approach, which is capable of integrating multi-modal perception sequences with actions for robots to interact with humans in long-term tasks. Overall, by combining the above approaches, an autonomous system is created for robots to acquire important skills that can be applied to human-centered applications. Finally, this dissertation concludes with a discussion of future directions that could accelerate the upcoming technological revolution of robot learning from human demonstration.
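    To make the perceptual control idea concrete, below is a minimal sketch of how an RNN policy might map multi-modal perception sequences to actions while carrying state across a long-term task. All module names, modalities, and dimensions are illustrative assumptions, not the dissertation's actual architecture.

```python
# Hedged sketch (PyTorch): an RNN that fuses two perception streams by
# concatenation and emits a per-timestep action. Dimensions are assumed.
import torch
import torch.nn as nn

class PerceptualControlRNN(nn.Module):
    def __init__(self, vision_dim=512, force_dim=6, action_dim=7, hidden_dim=256):
        super().__init__()
        # Recurrent core over concatenated multi-modal features.
        self.rnn = nn.LSTM(vision_dim + force_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, action_dim)

    def forward(self, vision_seq, force_seq, state=None):
        # vision_seq: (batch, time, vision_dim); force_seq: (batch, time, force_dim)
        x = torch.cat([vision_seq, force_seq], dim=-1)
        h, state = self.rnn(x, state)        # returned state supports long-term tasks
        return self.action_head(h), state   # actions for every timestep

policy = PerceptualControlRNN()
actions, hidden = policy(torch.randn(1, 50, 512), torch.randn(1, 50, 6))
```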

    An Outlook into the Future of Egocentric Vision

    What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward-facing cameras and digital overlays, is expected to be integrated in our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to the future of always-on, personalised and life-enhancing egocentric vision.
    Comment: We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1

    WEAR: A Multimodal Dataset for Wearable and Egocentric Video Activity Recognition

    Though research has shown the complementarity of camera- and inertial-based data, datasets which offer both modalities remain scarce. In this paper we introduce WEAR, a multimodal benchmark dataset for both vision- and wearable-based Human Activity Recognition (HAR). The dataset comprises data from 18 participants performing a total of 18 different workout activities, with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outdoor locations. WEAR features a diverse set of activities which are low in inter-class similarity and which, unlike previous egocentric datasets, are neither defined by human-object interactions nor drawn from inherently distinct activity categories. The provided benchmark results reveal that single-modality architectures have different strengths and weaknesses in their prediction performance. Further, in light of the recent success of transformer-based video action detection models, we demonstrate their versatility by applying them in a plain fashion using vision, inertial and combined (vision + inertial) features as input. Results show that vision transformers are not only able to produce competitive results using only inertial data, but can also function as an architecture to fuse both modalities by means of simple concatenation, with the multimodal approach producing the highest average mAP, precision and close-to-best F1-scores. Until now, vision-based transformers had been explored in neither inertial nor multimodal human activity recognition, making our approach the first to do so. The dataset and code to reproduce our experiments are publicly available at mariusbock.github.io/wear
    Comment: 12 pages, 2 figures, 2 tables
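    The fusion-by-concatenation idea the abstract describes can be sketched in a few lines: per-frame vision and inertial features are concatenated and fed to a transformer encoder. Feature sizes, the projection layer, and the clip-level classification head below are illustrative assumptions, not the WEAR authors' exact model.

```python
# Hedged sketch (PyTorch): multimodal fusion via simple concatenation
# of vision and inertial features before a transformer encoder.
import torch
import torch.nn as nn

class ConcatFusionTransformer(nn.Module):
    def __init__(self, vision_dim=768, inertial_dim=128, d_model=256, n_classes=18):
        super().__init__()
        self.proj = nn.Linear(vision_dim + inertial_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_classes)  # 18 workout activities

    def forward(self, vision_feats, inertial_feats):
        # Both inputs: (batch, time, dim); fuse by concatenating per timestep.
        x = torch.cat([vision_feats, inertial_feats], dim=-1)
        x = self.encoder(self.proj(x))
        return self.head(x.mean(dim=1))  # pooled clip-level activity logits

model = ConcatFusionTransformer()
logits = model(torch.randn(2, 100, 768), torch.randn(2, 100, 128))
```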

    Human behavior understanding for worker-centered intelligent manufacturing

    In a worker-centered intelligent manufacturing system, sensing and understanding the worker's behavior are the primary tasks, essential for automatic performance evaluation and optimization, intelligent training and assistance, and human-robot collaboration. In this study, a worker-centered training and assistant system is proposed for intelligent manufacturing, featuring self-awareness and active guidance. To understand hand behavior, a method is proposed for complex hand gesture recognition using Convolutional Neural Networks (CNN) with multi-view augmentation and inference fusion, from depth images captured by a Microsoft Kinect. To sense and understand the worker more comprehensively, a multi-modal approach is proposed for worker activity recognition using Inertial Measurement Unit (IMU) signals obtained from a Myo armband and videos from a visual camera. To automatically learn the importance of different sensors, a novel attention-based approach is proposed for human activity recognition using multiple IMU sensors worn at different body locations. To deploy the developed algorithms on the factory floor, a real-time assembly operation recognition system is proposed with fog computing and transfer learning. The proposed worker-centered training and assistant system has been validated and has demonstrated feasibility and great potential for application in the manufacturing industry for frontline workers. Our developed approaches have been evaluated: 1) the multi-view approach outperforms the state of the art on two public benchmark datasets; 2) the multi-modal approach achieves an accuracy of 97% on a worker activity dataset of 6 activities and achieves the best performance on a public dataset; 3) the attention-based method outperforms state-of-the-art methods on five publicly available datasets; and 4) the developed transfer learning model achieves a real-time recognition accuracy of 95% on a dataset of 10 worker operations.
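    The attention-based fusion of multiple body-worn IMUs can be illustrated with a short sketch: the network scores each sensor's embedding and pools them by the learned weights, so sensor importance is learned rather than fixed. This is a generic reconstruction under assumed shapes, not the authors' exact model.

```python
# Hedged sketch (PyTorch): learned attention weights over per-sensor
# IMU embeddings, pooled into a single activity prediction.
import torch
import torch.nn as nn

class SensorAttentionHAR(nn.Module):
    def __init__(self, feat_dim=64, n_classes=6):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)          # importance score per sensor
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, sensor_embeddings):
        # sensor_embeddings: (batch, n_sensors, feat_dim), one embedding per IMU
        weights = torch.softmax(self.score(sensor_embeddings), dim=1)
        pooled = (weights * sensor_embeddings).sum(dim=1)  # attention-weighted fusion
        return self.classifier(pooled), weights.squeeze(-1)

model = SensorAttentionHAR()
logits, attn = model(torch.randn(8, 5, 64))  # attn exposes learned sensor importance
```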

    Transformer-based Self-supervised Multimodal Representation Learning for Wearable Emotion Recognition

    Recently, wearable emotion recognition based on peripheral physiological signals has drawn massive attention due to its less invasive nature and its applicability in real-life scenarios. However, how to effectively fuse multimodal data remains a challenging problem. Moreover, traditional fully-supervised approaches suffer from overfitting given limited labeled data. To address these issues, we propose a novel self-supervised learning (SSL) framework for wearable emotion recognition, where efficient multimodal fusion is realized with temporal convolution-based modality-specific encoders and a transformer-based shared encoder, capturing both intra-modal and inter-modal correlations. Extensive unlabeled data is automatically assigned labels by five signal transforms, and the proposed SSL model is pre-trained with signal transformation recognition as a pretext task, allowing the extraction of generalized multimodal representations for emotion-related downstream tasks. For evaluation, the proposed SSL model was first pre-trained on a large-scale self-collected physiological dataset, and the resulting encoder was subsequently frozen or fine-tuned on three public supervised emotion recognition datasets. Ultimately, our SSL-based method achieved state-of-the-art results on various emotion classification tasks. Meanwhile, the proposed model proved more accurate and robust than fully-supervised methods in low-data regimes.
    Comment: Accepted to IEEE Transactions on Affective Computing
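    The pretext task is easy to see in miniature: apply one of several signal transforms to an unlabeled window, and train the model to recognise which transform was applied, so labels come for free. The five transforms below are common choices in the SSL-for-physiological-signals literature and are assumptions for illustration; the paper's exact set may differ.

```python
# Hedged sketch: generating (augmented_window, transform_id) training pairs
# for signal-transformation recognition as a self-supervised pretext task.
import numpy as np

def transform(x, label):
    # x: (time, channels) physiological window; label selects the transform.
    if label == 0:
        return x + np.random.normal(0, 0.05, x.shape)   # jitter (noise)
    if label == 1:
        return x * np.random.uniform(0.7, 1.3)          # magnitude scaling
    if label == 2:
        return x[::-1].copy()                           # time reversal
    if label == 3:
        return -x                                       # negation
    segs = np.array_split(x, 4)                         # label == 4: permutation
    np.random.shuffle(segs)
    return np.concatenate(segs)

window = np.random.randn(256, 4)        # unlabeled 4-channel window
label = np.random.randint(5)            # pseudo-label comes for free
augmented = transform(window, label)    # (augmented, label) is a training pair
```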

    Low-Cost Indoor Localisation Based on Inertial Sensors, Wi-Fi and Sound

    Average life expectancy has been increasing in recent decades, creating the need for new technologies that improve the quality of life of the elderly. Within the scope of Ambient Assisted Living, indoor location systems have emerged as a promising technology capable of supporting the elderly, providing them a safer environment to live in and promoting their autonomy. Current indoor location technologies are divided into two categories, depending on their need for additional infrastructure. Infrastructure-based solutions require expensive deployment and maintenance. On the other hand, most infrastructure-free systems rely on a single source of information and are highly dependent on its availability. Such systems will hardly be deployed in real-life scenarios, as they cannot handle the absence of their source of information. An efficient solution must, thus, guarantee continuous indoor positioning of the elderly. This work proposes a new room-level, low-cost indoor location algorithm. It relies on three information sources: inertial sensors, to reconstruct users' trajectories; environmental sound, to exploit the unique characteristics of each home division; and Wi-Fi, to estimate the distance to the Access Point in the neighbourhood. Two data collection protocols were designed to resemble a real living scenario, and a data processing stage was applied to the collected data. Then, each source was used to train individual Machine Learning (including Deep Learning) algorithms to identify room-level positions. As each source provides different information to the classification, the data were merged to produce a more robust localisation. Three data fusion approaches (input-level, early, and late fusion) were implemented for this goal, providing a final output containing complementary contributions from all data sources. Experimental results show that performance improved when more than one source was used, attaining a weighted F1-score of 81.8% in localisation between seven home divisions. In conclusion, the evaluation of the developed algorithm shows that it can achieve accurate room-level indoor localisation and is thus suitable for Ambient Assisted Living scenarios.
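    Of the three fusion strategies, late fusion is the simplest to sketch: each source (inertial, sound, Wi-Fi) gets its own classifier, and their per-room probability estimates are averaged. The classifier choice, feature shapes, and random stand-in data below are assumptions for illustration, not the thesis pipeline.

```python
# Hedged sketch (scikit-learn): late fusion of per-source room classifiers
# by averaging predicted class probabilities over seven home divisions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
sources = {}
for name in ["inertial", "sound", "wifi"]:
    X_train = rng.normal(size=(200, 16))     # stand-in features per source
    y_train = rng.integers(0, 7, size=200)   # 7 home divisions (rooms)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
    X_test = rng.normal(size=(10, 16))       # held-out windows, same source
    sources[name] = (clf, X_test)

# Late fusion: average the per-source probability estimates, then decide.
probs = np.mean([clf.predict_proba(Xt) for clf, Xt in sources.values()], axis=0)
rooms = probs.argmax(axis=1)                 # fused room-level predictions
```

    A practical upside of this design, consistent with the thesis's motivation, is graceful degradation: if one source is unavailable, the remaining classifiers can still be averaged.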