LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing
LEGaTO is a three-year EU H2020 project that started in December 2017. The project leverages task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs, and dataflow engines. The aim is to attain one order of magnitude in energy savings from the edge to the converged cloud/HPC.
Accelerating Neural Network Inference with Processing-in-DRAM: From the Edge to the Cloud
Neural networks (NNs) are growing in importance and complexity. A neural
network's performance (and energy efficiency) can be bound either by
computation or memory resources. The processing-in-memory (PIM) paradigm, where
computation is placed near or within memory arrays, is a viable solution to
accelerate memory-bound NNs. However, PIM architectures vary in form, where
different PIM approaches lead to different trade-offs. Our goal is to analyze,
discuss, and contrast DRAM-based PIM architectures for NN performance and
energy efficiency. To do so, we analyze three state-of-the-art PIM
architectures: (1) UPMEM, which integrates processors and DRAM arrays into a
single 2D chip; (2) Mensa, a 3D-stack-based PIM architecture tailored for edge
devices; and (3) SIMDRAM, which uses the analog principles of DRAM to execute
bit-serial operations. Our analysis reveals that PIM greatly benefits
memory-bound NNs: (1) UPMEM provides 23x the performance of a high-end GPU when
the GPU requires memory oversubscription for a general matrix-vector
multiplication kernel; (2) Mensa improves energy efficiency and throughput by
3.0x and 3.1x over the Google Edge TPU for 24 Google edge NN models; and (3)
SIMDRAM outperforms a CPU/GPU by 16.7x/1.4x for three binary NNs. We conclude
that the ideal PIM architecture for NN models depends on a model's distinct
attributes, due to the inherent architectural design choices.
Comment: This is an extended and updated version of a paper published in IEEE Micro, pp. 1-14, 29 Aug. 2022.
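The abstract's distinction between compute-bound and memory-bound NNs comes down to arithmetic intensity: how many FLOPs a kernel performs per byte it moves. A minimal sketch of this reasoning for the general matrix-vector multiplication (GEMV) kernel mentioned above, using an illustrative (hypothetical) machine balance rather than any specific GPU's figures:

```python
def gemv_arithmetic_intensity(m, n, dtype_bytes=4):
    """Arithmetic intensity (FLOPs per byte) of y = A @ x.

    GEMV performs 2*m*n FLOPs (one multiply and one add per matrix
    element) while reading the m x n matrix A and the length-n vector x,
    and writing the length-m vector y.
    """
    flops = 2 * m * n
    bytes_moved = (m * n + n + m) * dtype_bytes
    return flops / bytes_moved

# Hypothetical machine balance for illustration: a device with
# 10 TFLOP/s of compute and 1 TB/s of memory bandwidth needs about
# 10 FLOPs per byte to keep its compute units busy.
machine_balance = 10e12 / 1e12  # FLOPs per byte

ai = gemv_arithmetic_intensity(4096, 4096)
print(f"GEMV arithmetic intensity: {ai:.3f} FLOPs/byte")
print("memory-bound" if ai < machine_balance else "compute-bound")
```

For large matrices the intensity approaches 2/dtype_bytes (here ~0.5 FLOPs/byte), far below typical machine balance, which is why GEMV-heavy NN inference is a natural fit for PIM.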
A Locality-based Neural Solver for Optical Motion Capture
We present a novel locality-based learning method for cleaning and solving
optical motion capture data. Given noisy marker data, we propose a new
heterogeneous graph neural network which treats markers and joints as different
types of nodes, and uses graph convolution operations to extract the local
features of markers and joints and transform them into clean motions. To handle
anomalous markers (e.g., occluded or with large tracking errors), the key
insight is that a marker's motion correlates strongly with the motions of its
immediate neighboring markers but much less with distant ones, i.e., locality,
which enables us to efficiently fill in missing markers (e.g., due to
occlusion). We also identify marker outliers caused by tracking errors by
examining their acceleration profiles. Furthermore, we propose a training
regime based on representation learning and data augmentation that trains the
model on masked data, with masking schemes designed to mimic the occluded and
noisy markers often observed in real data. Finally, we show
that our method achieves high accuracy on multiple metrics across various
datasets. Extensive comparison shows our method outperforms state-of-the-art
methods in terms of prediction accuracy of occluded marker position error by
approximately 20%, which leads to a further error reduction on the
reconstructed joint rotations and positions by 30%. The code and data for this
paper are available at https://github.com/non-void/LocalMoCap.
Comment: SIGGRAPH Asia 2023 Conference Paper
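The masking-based augmentation described above can be sketched in a few lines. The function below is an illustrative assumption, not the paper's exact scheme: it zeroes randomly chosen markers to mimic occlusion (returning a visibility mask so a model can learn to fill them) and jitters a few markers to mimic tracking outliers.

```python
import numpy as np

def mask_markers(markers, occlusion_prob=0.1, outlier_prob=0.02,
                 outlier_scale=0.1, rng=None):
    """Augmentation sketch: simulate occluded and noisy mocap markers.

    markers: (frames, num_markers, 3) array of marker positions.
    Returns (corrupted, visibility) where visibility is False at
    simulated occlusions; the clean input serves as the training target.
    """
    rng = np.random.default_rng() if rng is None else rng
    frames, num_markers, _ = markers.shape
    corrupted = markers.copy()

    # Simulated tracking outliers: add large Gaussian jitter first.
    outliers = rng.random((frames, num_markers)) < outlier_prob
    corrupted[outliers] += rng.normal(0.0, outlier_scale,
                                      size=(int(outliers.sum()), 3))

    # Simulated occlusions: zero the marker and mark it invisible.
    occluded = rng.random((frames, num_markers)) < occlusion_prob
    corrupted[occluded] = 0.0
    visibility = ~occluded
    return corrupted, visibility
```

A training pair would then be ((corrupted, visibility) as input, clean markers as target), so the network learns to reconstruct occluded positions from visible neighbors.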
Transformer-based models and hardware acceleration analysis in autonomous driving: A survey
Transformer architectures have exhibited promising performance in various
autonomous driving applications in recent years. Meanwhile, dedicated hardware
acceleration on portable computational platforms has become the next critical
step for practical deployment in real autonomous vehicles.
This survey paper provides a comprehensive overview, benchmark, and analysis of
Transformer-based models specifically tailored for autonomous driving tasks
such as lane detection, segmentation, tracking, planning, and decision-making.
We review different architectures for organizing Transformer inputs and
outputs, such as encoder-decoder and encoder-only structures, and explore their
respective advantages and disadvantages. Furthermore, we discuss
Transformer-related operators and their hardware acceleration schemes in depth,
taking into account key factors such as quantization and runtime. We
specifically illustrate the operator-level comparison between layers from
convolutional neural networks, the Swin-Transformer, and the Transformer with
4D encoder. The paper also highlights the challenges, trends, and current
insights in Transformer-based models, addressing their hardware deployment and
acceleration issues within the context of long-term autonomous driving
applications.
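Quantization, one of the key acceleration factors the survey discusses, can be illustrated with a minimal sketch. The snippet below shows symmetric per-tensor int8 post-training quantization of a weight matrix; it is a generic illustration under stated assumptions, not any specific accelerator's or the survey's scheme.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: map max |w| to 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"max quantization error: {err:.4f}")  # bounded by scale / 2
```

Per-channel scales and quantization-aware training refine this basic recipe, trading implementation complexity for accuracy on hardware with int8 matrix units.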