An FPGA-Based On-Device Reinforcement Learning Approach using Online Sequential Learning
DQN (Deep Q-Network) is a method to perform Q-learning for reinforcement
learning using deep neural networks. DQNs require a large buffer and batch
processing for an experience replay and rely on a backpropagation based
iterative optimization, making them difficult to implement on
resource-limited edge devices. In this paper, we propose a lightweight
on-device reinforcement learning approach for low-cost FPGA devices. It
exploits a recently proposed neural-network based on-device learning approach
that does not rely on backpropagation but instead uses an OS-ELM (Online
Sequential Extreme Learning Machine) based training algorithm. In addition, we
propose a combination of L2 regularization and spectral normalization for the
on-device reinforcement learning so that the output values of the neural network
stay within a bounded range and reinforcement learning becomes stable.
The proposed reinforcement learning approach is designed for the PYNQ-Z1 board as a
low-cost FPGA platform. The evaluation results using OpenAI Gym demonstrate
that the proposed algorithm and its FPGA implementation complete a CartPole-v0
task 29.77x and 89.40x faster, respectively, than a conventional DQN-based approach when the
number of hidden-layer nodes is 64.
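The backpropagation-free training mentioned in this abstract can be illustrated with a minimal OS-ELM sketch. It assumes a single output, one sample per update, a tanh hidden layer, and an L2 term `lam`; the class and function names are hypothetical. It shows only the sequential least-squares update, not the Q-learning loop, the spectral normalization, or the FPGA design from the paper.

```python
import math
import random

def hidden(x, W, b):
    """Fixed random hidden layer: h_j = tanh(w_j . x + b_j). It is never trained."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

class OSELM:
    """Hypothetical single-output OS-ELM trained one sample at a time."""

    def __init__(self, n_in, n_hidden, lam=1e-3, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        # P starts as (lam * I)^-1, which acts as L2 regularization on beta.
        self.P = [[1.0 / lam if i == j else 0.0 for j in range(n_hidden)]
                  for i in range(n_hidden)]
        self.beta = [0.0] * n_hidden  # output weights, the only trained part

    def predict(self, x):
        h = hidden(x, self.W, self.b)
        return sum(hj * bj for hj, bj in zip(h, self.beta))

    def update(self, x, t):
        """Rank-1 recursive least-squares step; no gradients, no replay buffer."""
        h = hidden(x, self.W, self.b)
        Ph = [sum(p * hj for p, hj in zip(row, h)) for row in self.P]
        denom = 1.0 + sum(hj * v for hj, v in zip(h, Ph))
        err = t - sum(hj * bj for hj, bj in zip(h, self.beta))
        for i in range(len(h)):
            for j in range(len(h)):
                self.P[i][j] -= Ph[i] * Ph[j] / denom  # P <- P - P h h^T P / (1 + h^T P h)
            self.beta[i] += Ph[i] / denom * err        # beta <- beta + P_new h * err
```

Each update costs a fixed number of multiply-adds on an `n_hidden` by `n_hidden` matrix, which is one reason this style of training is attractive for small devices.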
Communication Size Reduction of Federated Learning using Neural ODE Models
Federated learning is a machine learning approach in which data is not
aggregated on a server; instead, models are trained locally at clients for the sake of
security and privacy. ResNet is a classic but representative neural network
that succeeds in deepening the neural network by learning a residual function
that adds the inputs and outputs together. In federated learning, communication
is performed between the server and clients to exchange weight parameters.
Since ResNet has deep layers and a large number of parameters, the
communication size becomes large. In this paper, we use Neural ODE as a
lightweight model of ResNet to reduce communication size in federated learning.
In addition, we introduce a flexible federated learning scheme using Neural ODE
models with different numbers of iterations, which correspond to ResNet models
with different depths. Evaluation results using the CIFAR-10 dataset show that the
use of Neural ODE reduces communication size by up to 92.4% compared to ResNet.
We also show that the proposed flexible federated learning can merge models
with different iteration counts or depths.
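The idea in this abstract can be sketched with a toy model, under several assumptions: the ODE block is integrated with forward Euler, its function is a two-parameter `tanh(w * x + b)`, and merging is plain parameter averaging. All names are hypothetical, and the toy does not reproduce the 92.4% figure reported in the paper. It only shows why a block that reuses one parameter set sends fewer values than a stack of distinct blocks, and why clients with different iteration counts still produce parameters of the same shape.

```python
import math

def ode_block(x, params, n_iters):
    """Forward Euler: x <- x + (1/n_iters) * f(x), reusing the same params each step."""
    w, b = params
    step = 1.0 / n_iters
    for _ in range(n_iters):
        x = [xi + step * math.tanh(w * xi + b) for xi in x]
    return x

def resnet_payload(depth, params_per_block):
    """A ResNet-style stack has separate parameters for every block."""
    return depth * params_per_block

def ode_payload(params_per_block):
    """An ODE block sends one parameter set regardless of iteration count."""
    return params_per_block

def fed_avg(client_params):
    """Element-wise mean of the clients' flat parameter lists."""
    n = len(client_params)
    return [sum(vals) / n for vals in zip(*client_params)]

# Two clients that would run the block with different iteration counts
# still hold parameter lists of identical shape, so they can be averaged.
merged = fed_avg([[1.0, 2.0], [3.0, 4.0]])
```

In this toy setting, replacing 10 distinct blocks with one shared block cuts the payload by 90%; real savings depend on how much of the network the ODE blocks replace.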
FPGA-Accelerated Correspondence-free Point Cloud Registration with PointNet Features
Point cloud registration serves as a basis for vision and robotic
applications including 3D reconstruction and mapping. Despite significant
improvements in the quality of results, recent deep learning approaches are
computationally expensive and power-hungry, making them difficult to deploy on
resource-constrained edge devices. To tackle this problem, in this paper, we
propose a fast, accurate, and robust registration method for low-cost embedded FPGAs.
Based on a parallel and pipelined PointNet feature extractor, we develop custom
accelerator cores, namely PointLKCore and ReAgentCore, for two different
learning-based methods. They are both correspondence-free and computationally
efficient as they avoid the costly feature matching step involving
nearest-neighbor search. The proposed cores are implemented on the Xilinx
ZCU104 board and evaluated using both synthetic and real-world datasets,
showing substantial improvements in the trade-offs between runtime and
registration quality. They run 44.08-45.75x faster than an ARM Cortex-A53 CPU and
offer 1.98-11.13x speedups over an Intel Xeon CPU and Nvidia Jetson boards, while
consuming less than 1W and achieving 163.11-213.58x better energy efficiency than
an Nvidia GeForce GPU. The proposed cores are more robust to noise and large
initial misalignments than the classical methods and quickly find reasonable
solutions in less than 15ms, demonstrating real-time performance.
Comment: 27 pages, 19 figures
An FPGA Acceleration and Optimization Techniques for 2D LiDAR SLAM Algorithm
An efficient hardware implementation of Simultaneous Localization and
Mapping (SLAM) methods is necessary for mobile autonomous robots with
limited computational resources. In this paper, we propose a resource-efficient
FPGA implementation for accelerating scan matching computations, which
typically cause a major bottleneck in 2D LiDAR SLAM methods. Scan matching is a
process of correcting a robot pose by aligning the latest LiDAR measurements
with an occupancy grid map, which encodes the information about the surrounding
environment. We exploit the inherent parallelism in the Rao-Blackwellized
Particle Filter (RBPF) based algorithms to perform scan matching computations
for multiple particles in parallel. In the proposed design, several techniques
are employed to reduce the resource utilization and to achieve the maximum
throughput. Experimental results using benchmark datasets show that
scan matching is accelerated by 5.31-8.75x and the overall throughput is
improved by 3.72-5.10x without seriously degrading the quality of the final
outputs. Furthermore, our proposed IP core requires only 44% of the total
resources available in the TUL Pynq-Z2 FPGA board, thus facilitating the
realization of SLAM applications on indoor mobile robots.
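Scan matching as described in this abstract can be sketched as scoring candidate poses against an occupancy grid. The version below is a simplification under stated assumptions: the grid is a set of occupied cell indices, the score is a plain hit count, and the search is a brute-force window over translation only (the heading is kept fixed). The function names are hypothetical and the paper's actual matching method and hardware pipeline are not shown.

```python
import math

def score(pose, scan, occupied, cell=1.0):
    """Count scan endpoints that land in occupied cells when placed at `pose`."""
    x, y, th = pose
    c, s = math.cos(th), math.sin(th)
    hits = 0
    for px, py in scan:  # endpoints in the robot frame
        wx = x + c * px - s * py
        wy = y + s * px + c * py
        if (math.floor(wx / cell), math.floor(wy / cell)) in occupied:
            hits += 1
    return hits

def match(guess, scan, occupied, window=2, cell=1.0):
    """Try translations within +/- `window` cells of the guess; keep the best score."""
    best_pose, best_score = guess, score(guess, scan, occupied, cell)
    for dx in range(-window, window + 1):
        for dy in range(-window, window + 1):
            cand = (guess[0] + dx * cell, guess[1] + dy * cell, guess[2])
            sc = score(cand, scan, occupied, cell)
            if sc > best_score:
                best_pose, best_score = cand, sc
    return best_pose, best_score

def match_all(particles, scan, occupied):
    """Each particle is matched independently, so this loop can run in parallel."""
    return [match(p, scan, occupied) for p in particles]
```

Because no particle's result depends on another's, the loop in `match_all` is the part that maps naturally onto parallel hardware units.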
An On-Device Federated Learning Approach for Cooperative Anomaly Detection
Most edge AI focuses on prediction tasks on resource-limited edge devices
while the training is done at server machines. However, retraining or
customizing a model is required at edge devices as the model becomes
outdated due to environmental changes over time. To follow such concept
drift, a neural-network based on-device learning approach was recently proposed,
so that edge devices train on incoming data at runtime to update their model. In
this case, since training is done at distributed edge devices, the issue is
that only a limited amount of training data can be used for each edge device.
To address this issue, one approach is cooperative learning or federated
learning, where edge devices exchange their trained results and update their
model by using those collected from the other devices. In this paper, as an
on-device learning algorithm, we focus on OS-ELM (Online Sequential Extreme
Learning Machine) to sequentially train a model based on recent samples and
combine it with an autoencoder for anomaly detection. We extend it to
on-device federated learning so that edge devices can exchange their trained
results and update their model by using those collected from the other edge
devices. This cooperative model update is one-shot, but it can be applied
repeatedly to synchronize their models. Our approach is evaluated with anomaly
detection tasks generated from a driving dataset of cars, a human activity
dataset, and the MNIST dataset. The results demonstrate that the proposed on-device
federated learning can produce a merged model by integrating trained results
from multiple edge devices as accurately as traditional backpropagation based
neural networks and a traditional federated learning approach, with lower
computation or communication cost.
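A one-shot merge of least-squares models can be sketched as follows. The abstract does not say what the devices exchange, so this sketch assumes they share the sufficient statistics U = H^T H and V = H^T t, which can simply be added across devices. The feature map `[1, x, x*x]` stands in for a shared hidden layer, the autoencoder is omitted, and all names are hypothetical.

```python
def features(x):
    """Stand-in for a hidden layer shared by all devices (an assumption)."""
    return [1.0, x, x * x]

def local_stats(samples):
    """Accumulate U = H^T H and V = H^T t from one device's (x, t) samples."""
    n = 3
    U = [[0.0] * n for _ in range(n)]
    V = [0.0] * n
    for x, t in samples:
        h = features(x)
        for i in range(n):
            V[i] += h[i] * t
            for j in range(n):
                U[i][j] += h[i] * h[j]
    return U, V

def merge(stats):
    """One-shot merge: add the statistics collected from every device."""
    n = len(stats[0][1])
    U = [[sum(s[0][i][j] for s in stats) for j in range(n)] for i in range(n)]
    V = [sum(s[1][i] for s in stats) for i in range(n)]
    return U, V

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x
```

Under this assumption the merged weights equal those obtained by training on the pooled data, while only two small arrays per device cross the network instead of the raw samples.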
LaKe: The Power of In-Network Computing
In-network computing accelerates applications natively running on the host by executing them within network devices. While in-network computing offers significant performance improvements, its limitations and design trade-offs have not been explored. To usefully and efficiently run applications within the network, we first need to understand the implications of their design. In this work we introduce LaKe, a Layered Key-Value Store design, running as an in-network application. LaKe is a scalable design, enabling the exploration of design decisions and their effect on throughput, latency and power efficiency. LaKe achieves full line rate throughput, while maintaining a latency of 1.1μs and better power efficiency than existing hardware-based memcached designs.
This work was supported by JSPS Research Fellowship and Keio University Research Grant for Young Researcher’s Program. This work was supported by JST CREST Grant Number JPMJCR1785, Japan. We acknowledge the support of the Leverhulme Trust (ECF-2016-289) and the Isaac Newton Trust.
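The general idea of a layered key-value lookup can be sketched in software. This is only an illustration under assumptions: a small fast layer with LRU eviction in front of a larger backing layer. The class name, the layer names, and the eviction policy are hypothetical and say nothing about how LaKe organizes its layers in hardware.

```python
from collections import OrderedDict

class LayeredStore:
    """Hypothetical two-layer store: small fast layer, larger backing layer."""

    def __init__(self, fast_capacity, backing):
        self.fast = OrderedDict()       # most recently used keys live here
        self.capacity = fast_capacity
        self.backing = dict(backing)    # slower layer holding everything

    def _promote(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        if len(self.fast) > self.capacity:
            self.fast.popitem(last=False)  # evict the least recently used key

    def get(self, key):
        """Return (value, layer) where layer names which level answered."""
        if key in self.fast:
            self.fast.move_to_end(key)
            return self.fast[key], "fast"
        if key in self.backing:
            value = self.backing[key]
            self._promote(key, value)
            return value, "backing"
        return None, "miss"

    def put(self, key, value):
        self.backing[key] = value
        self._promote(key, value)
```

A request answered by the fast layer never touches the slower one, which is the property a layered design relies on to keep latency low for hot keys.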