2,120 research outputs found
Multi-Tenant Cloud FPGA: A Survey on Security
With the exponentially increasing demand for performance and scalability in
cloud applications and systems, data center architectures evolved to integrate
heterogeneous computing fabrics that leverage CPUs, GPUs, and FPGAs. FPGAs
differ from traditional processing platforms such as CPUs and GPUs in that they
are reconfigurable at run-time, providing increased and customized performance,
flexibility, and acceleration. FPGAs can perform large-scale search
optimization, acceleration, and signal processing tasks compared with power,
latency, and processing speed. Many public cloud provider giants, including
Amazon, Huawei, Microsoft, Alibaba, etc., have already started integrating
FPGA-based cloud acceleration services. While FPGAs in cloud applications
enable customized acceleration with low power consumption, it also incurs new
security challenges that still need to be reviewed. Allowing cloud users to
reconfigure the hardware design after deployment could open the backdoors for
malicious attackers, potentially putting the cloud platform at risk.
Considering security risks, public cloud providers still don't offer
multi-tenant FPGA services. This paper analyzes the security concerns of
multi-tenant cloud FPGAs, gives a thorough description of the security problems
associated with them, and discusses upcoming future challenges in this field of
study
Will SDN be part of 5G?
For many, this is no longer a valid question and the case is considered
settled with SDN/NFV (Software Defined Networking/Network Function
Virtualization) providing the inevitable innovation enablers solving many
outstanding management issues regarding 5G. However, given the monumental task
of softwarization of radio access network (RAN) while 5G is just around the
corner and some companies have started unveiling their 5G equipment already,
the concern is very realistic that we may only see some point solutions
involving SDN technology instead of a fully SDN-enabled RAN. This survey paper
identifies all important obstacles in the way and looks at the state of the art
of the relevant solutions. This survey is different from the previous surveys
on SDN-based RAN as it focuses on the salient problems and discusses solutions
proposed within and outside SDN literature. Our main focus is on fronthaul,
backward compatibility, supposedly disruptive nature of SDN deployment,
business cases and monetization of SDN related upgrades, latency of general
purpose processors (GPP), and additional security vulnerabilities,
softwarization brings along to the RAN. We have also provided a summary of the
architectural developments in SDN-based RAN landscape as not all work can be
covered under the focused issues. This paper provides a comprehensive survey on
the state of the art of SDN-based RAN and clearly points out the gaps in the
technology.Comment: 33 pages, 10 figure
The future of computing beyond Moore's Law.
Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'
TinyVers: A Tiny Versatile System-on-chip with State-Retentive eMRAM for ML Inference at the Extreme Edge
Extreme edge devices or Internet-of-thing nodes require both ultra-low power
always-on processing as well as the ability to do on-demand sampling and
processing. Moreover, support for IoT applications like voice recognition,
machine monitoring, etc., requires the ability to execute a wide range of ML
workloads. This brings challenges in hardware design to build flexible
processors operating in ultra-low power regime. This paper presents TinyVers, a
tiny versatile ultra-low power ML system-on-chip to enable enhanced
intelligence at the Extreme Edge. TinyVers exploits dataflow reconfiguration to
enable multi-modal support and aggressive on-chip power management for
duty-cycling to enable smart sensing applications. The SoC combines a RISC-V
host processor, a 17 TOPS/W dataflow reconfigurable ML accelerator, a 1.7
W deep sleep wake-up controller, and an eMRAM for boot code and ML
parameter retention. The SoC can perform up to 17.6 GOPS while achieving a
power consumption range from 1.7 W-20 mW. Multiple ML workloads aimed for
diverse applications are mapped on the SoC to showcase its flexibility and
efficiency. All the models achieve 1-2 TOPS/W of energy efficiency with power
consumption below 230 W in continuous operation. In a duty-cycling use
case for machine monitoring, this power is reduced to below 10 W.Comment: Accepted in IEEE Journal of Solid-State Circuit
DLAS: An Exploration and Assessment of the Deep Learning Acceleration Stack
Deep Neural Networks (DNNs) are extremely computationally demanding, which
presents a large barrier to their deployment on resource-constrained devices.
Since such devices are where many emerging deep learning applications lie
(e.g., drones, vision-based medical technology), significant bodies of work
from both the machine learning and systems communities have attempted to
provide optimizations to accelerate DNNs. To help unify these two perspectives,
in this paper we combine machine learning and systems techniques within the
Deep Learning Acceleration Stack (DLAS), and demonstrate how these layers can
be tightly dependent on each other with an across-stack perturbation study. We
evaluate the impact on accuracy and inference time when varying different
parameters of DLAS across two datasets, seven popular DNN architectures, four
DNN compression techniques, three algorithmic primitives with sparse and dense
variants, untuned and auto-scheduled code generation, and four hardware
platforms. Our evaluation highlights how perturbations across DLAS parameters
can cause significant variation and across-stack interactions. The highest
level observation from our evaluation is that the model size, accuracy, and
inference time are not guaranteed to be correlated. Overall we make 13 key
observations, including that speedups provided by compression techniques are
very hardware dependent, and that compiler auto-tuning can significantly alter
what the best algorithm to use for a given configuration is. With DLAS, we aim
to provide a reference framework to aid machine learning and systems
practitioners in reasoning about the context in which their respective DNN
acceleration solutions exist in. With our evaluation strongly motivating the
need for co-design, we believe that DLAS can be a valuable concept for
exploring the next generation of co-designed accelerated deep learning
solutions
Deployment of Deep Neural Networks on Dedicated Hardware Accelerators
Deep Neural Networks (DNNs) have established themselves as powerful tools for
a wide range of complex tasks, for example computer vision or natural language
processing. DNNs are notoriously demanding on compute resources and as a
result, dedicated hardware accelerators for all use cases are developed. Different
accelerators provide solutions from hyper scaling cloud environments for the
training of DNNs to inference devices in embedded systems. They implement
intrinsics for complex operations directly in hardware. A common example
are intrinsics for matrix multiplication. However, there exists a gap between
the ecosystems of applications for deep learning practitioners and hardware
accelerators. HowDNNs can efficiently utilize the specialized hardware intrinsics
is still mainly defined by human hardware and software experts.
Methods to automatically utilize hardware intrinsics in DNN operators are a
subject of active research. Existing literature often works with transformationdriven
approaches, which aim to establish a sequence of program rewrites and
data-layout transformations such that the hardware intrinsic can be used to
compute the operator. However, the complexity this of task has not yet been
explored, especially for less frequently used operators like Capsule Routing. And
not only the implementation of DNN operators with intrinsics is challenging,
also their optimization on the target device is difficult. Hardware-in-the-loop
tools are often used for this problem. They use latency measurements of implementations
candidates to find the fastest one. However, specialized accelerators
can have memory and programming limitations, so that not every arithmetically
correct implementation is a valid program for the accelerator. These invalid
implementations can lead to unnecessary long the optimization time.
This work investigates the complexity of transformation-driven processes to
automatically embed hardware intrinsics into DNN operators. It is explored
with a custom, graph-based intermediate representation (IR). While operators
like Fully Connected Layers can be handled with reasonable effort, increasing
operator complexity or advanced data-layout transformation can lead to scaling issues.
Building on these insights, this work proposes a novel method to embed
hardware intrinsics into DNN operators. It is based on a dataflow analysis.
The dataflow embedding method allows the exploration of how intrinsics and
operators match without explicit transformations. From the results it can derive
the data layout and program structure necessary to compute the operator with
the intrinsic. A prototype implementation for a dedicated hardware accelerator
demonstrates state-of-the art performance for a wide range of convolutions, while
being agnostic to the data layout. For some operators in the benchmark, the
presented method can also generate alternative implementation strategies to
improve hardware utilization, resulting in a geo-mean speed-up of Ă—2.813 while
reducing the memory footprint. Lastly, by curating the initial set of possible
implementations for the hardware-in-the-loop optimization, the median timeto-
solution is reduced by a factor of Ă—2.40. At the same time, the possibility to
have prolonged searches due a bad initial set of implementations is reduced,
improving the optimization’s robustness by ×2.35
- …