CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction
Human-robot interaction (HRI) is a rapidly growing field that encompasses
social and industrial applications. Machine learning plays a vital role in
industrial HRI by enhancing the adaptability and autonomy of robots in complex
environments. However, data privacy is a crucial concern in the interaction
between humans and robots, as companies need to protect sensitive data while
machine learning algorithms require access to large datasets. Federated
Learning (FL) offers a solution by enabling the distributed training of models
without sharing raw data. Despite extensive research on FL
for tasks such as natural language processing (NLP) and image classification,
the question of how to use FL for HRI remains an open research problem. The
traditional FL approach involves transmitting large neural network parameter
matrices between the server and clients, which can lead to high communication
costs and often becomes a bottleneck in FL. This paper proposes a
communication-efficient FL framework for human-robot interaction (CEFHRI) to
address the challenges of data heterogeneity and communication costs. The
framework leverages pre-trained models and introduces a trainable
spatiotemporal adapter for video understanding tasks in HRI. Experimental
results on three human-robot interaction benchmark datasets (HRI30, InHARD, and
COIN) demonstrate the superiority of CEFHRI over full fine-tuning in terms of
communication costs. The proposed methodology provides a secure and efficient
approach to HRI federated learning, particularly in industrial environments
with data privacy concerns and limited communication bandwidth. Our code is
available at
https://github.com/umarkhalidAI/CEFHRI-Efficient-Federated-Learning.
Comment: Accepted in IROS 2023
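
The sketch below illustrates the general idea the abstract describes: communication-efficient federated learning in which a large pre-trained backbone stays frozen and clients train and transmit only small adapter and head weights, which the server averages. All module names, shapes, and the training loop are illustrative assumptions (a single linear layer stands in for the pre-trained video encoder); this is not the CEFHRI implementation from the linked repository.

```python
# Minimal sketch: federated averaging over adapter weights only.
import copy
import torch
import torch.nn as nn

class AdapterModel(nn.Module):
    def __init__(self, feat_dim=256, bottleneck=16, num_classes=10):
        super().__init__()
        self.backbone = nn.Linear(feat_dim, feat_dim)   # stand-in for a frozen pre-trained video encoder
        for p in self.backbone.parameters():
            p.requires_grad = False
        # lightweight adapter: the only part that is trained and communicated
        self.adapter = nn.Sequential(nn.Linear(feat_dim, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, feat_dim))
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        h = self.backbone(x)
        return self.head(h + self.adapter(h))           # residual adapter on top of frozen features

def trainable_state(model):
    # only adapter + head tensors are exchanged with the server
    return {k: v.detach().clone() for k, v in model.state_dict().items()
            if k.startswith(("adapter", "head"))}

def local_update(model, data, labels, epochs=1, lr=1e-2):
    opt = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(data), labels)
        loss.backward()
        opt.step()
    return trainable_state(model)

def fed_avg(states):
    # element-wise average of the (small) trainable tensors from all clients
    return {k: torch.stack([s[k] for s in states]).mean(dim=0) for k in states[0]}

if __name__ == "__main__":
    torch.manual_seed(0)
    global_model = AdapterModel()
    clients = [(torch.randn(32, 256), torch.randint(0, 10, (32,))) for _ in range(3)]
    for rnd in range(2):                                 # two federated rounds
        client_states = [local_update(copy.deepcopy(global_model), x, y) for x, y in clients]
        global_model.load_state_dict(fed_avg(client_states), strict=False)
        sent = sum(v.numel() for v in client_states[0].values())
        total = sum(p.numel() for p in global_model.parameters())
        print(f"round {rnd}: transmitted {sent} of {total} parameters")
```

Printing the transmitted-versus-total parameter count makes the communication saving explicit: only the adapter and head tensors cross the network each round, while the frozen backbone never leaves the client.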
GraphVid: It Only Takes a Few Nodes to Understand a Video
We propose a concise representation of videos that encodes perceptually
meaningful features into graphs. With this representation, we aim to leverage
the large amount of redundancies in videos and save computations. First, we
construct superpixel-based graph representations of videos by considering
superpixels as graph nodes and creating spatial and temporal connections between
adjacent superpixels. Then, we leverage Graph Convolutional Networks to process
this representation and predict the desired output. As a result, we are able to
train models with far fewer parameters, which translates into shorter training
times and lower computational resource requirements. A comprehensive
experimental study on the publicly available datasets Kinetics-400 and Charades
shows that the proposed method is highly cost-effective and uses limited
commodity hardware during training and inference. It reduces the computational
requirements 10-fold while achieving results that are comparable to
state-of-the-art methods. We believe that the proposed approach is a promising
direction that could open the door to solving video understanding more
efficiently and enable more resource-limited users to thrive in this research
field.
Comment: Accepted to ECCV 2022 (Oral)
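
The sketch below illustrates the video-as-graph idea in minimal form: superpixel-like regions become graph nodes carrying mean-colour features, edges link spatially adjacent regions within a frame and corresponding regions across consecutive frames, and a single graph-convolution step processes the result. A coarse grid partition stands in for a real superpixel algorithm such as SLIC, and the hand-rolled layer stands in for a full GCN; none of this is the GraphVid implementation, and all sizes are illustrative.

```python
# Minimal sketch: superpixel nodes, spatial + temporal edges, one GCN step (A_hat @ X @ W).
import numpy as np

def grid_superpixels(frame, grid=4):
    """Split an HxWx3 frame into grid*grid cells; return per-cell mean-colour node features."""
    h, w, _ = frame.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            cell = frame[i * h // grid:(i + 1) * h // grid, j * w // grid:(j + 1) * w // grid]
            feats.append(cell.reshape(-1, 3).mean(axis=0))   # mean colour as node feature
    return np.stack(feats)

def build_video_graph(video, grid=4):
    """Nodes: superpixels of every frame. Edges: spatial 4-neighbours plus temporal links."""
    T = video.shape[0]
    n_per_frame = grid * grid
    X = np.concatenate([grid_superpixels(f, grid) for f in video])   # (T * grid^2, 3)
    N = T * n_per_frame
    A = np.eye(N)                                                    # self-loops
    for t in range(T):
        base = t * n_per_frame
        for i in range(grid):
            for j in range(grid):
                u = base + i * grid + j
                if j + 1 < grid:                                     # spatial edge to right neighbour
                    A[u, u + 1] = A[u + 1, u] = 1
                if i + 1 < grid:                                     # spatial edge to bottom neighbour
                    A[u, u + grid] = A[u + grid, u] = 1
                if t + 1 < T:                                        # temporal edge to same cell, next frame
                    A[u, u + n_per_frame] = A[u + n_per_frame, u] = 1
    return X, A

def gcn_layer(X, A, W):
    """One graph convolution: symmetrically normalised adjacency times features times weights."""
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))
    return np.maximum(A_hat @ X @ W, 0)                              # ReLU activation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((8, 64, 64, 3))        # 8 random frames stand in for a real clip
    X, A = build_video_graph(video)
    W = rng.standard_normal((3, 16)) * 0.1
    H = gcn_layer(X, A, W)
    print("nodes:", X.shape[0], "edges:", int((A.sum() - len(A)) // 2), "hidden:", H.shape)
```

The point of the toy example is the size of the representation: an 8-frame clip collapses to a few dozen nodes with 3-dimensional features, so the downstream graph network has far fewer inputs to process than a dense pixel grid would.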