38 research outputs found
Multi-Tenant Cloud FPGA: A Survey on Security
With the exponentially increasing demand for performance and scalability in
cloud applications and systems, data center architectures evolved to integrate
heterogeneous computing fabrics that leverage CPUs, GPUs, and FPGAs. FPGAs
differ from traditional processing platforms such as CPUs and GPUs in that they
are reconfigurable at run-time, providing increased and customized performance,
flexibility, and acceleration. FPGAs can perform large-scale search
optimization, acceleration, and signal processing tasks compared with power,
latency, and processing speed. Many public cloud provider giants, including
Amazon, Huawei, Microsoft, Alibaba, etc., have already started integrating
FPGA-based cloud acceleration services. While FPGAs in cloud applications
enable customized acceleration with low power consumption, it also incurs new
security challenges that still need to be reviewed. Allowing cloud users to
reconfigure the hardware design after deployment could open the backdoors for
malicious attackers, potentially putting the cloud platform at risk.
Considering security risks, public cloud providers still don't offer
multi-tenant FPGA services. This paper analyzes the security concerns of
multi-tenant cloud FPGAs, gives a thorough description of the security problems
associated with them, and discusses upcoming future challenges in this field of
study
Distributed Processing in FPGA Accelerated Cloud
Motivated by the need of cost reduction, better energy efficiency and agile update and deployment of new services, telecommunication industry is moving towards virtualization, which lead to Network Function Virtualization (NFV) standard. NFV leverages cloud technologies to deploy network functions that are traditionally implemented using dedicated proprietary hardware. Still, the performance provided by current cloud infrastructure does not fulfill the requirements for demanding NFV's use cases. Thus, hardware acceleration should be deployed.
The hardware programmability of FPGAs allows them to adapt well to many type of workloads, placing them as good candidates to be used as hardware accelerators in virtualized environments. In this thesis, the CRUN framework is proposed to provide FPGA as hardware accelerator resources in cloud, abstracting the integration complexity while enabling sharable and scalable use of such devices.
CRUN architecture allow user's acceleration hardware to be accessed locally and through the datacenter's network. The latter provide flexible connectivity by following the Software-defined Networking (SDN) principles. The architecture enables the same sharable FPGA to be used simultaneously as a co-processor, a network accelerator or as a distributed accelerator in a scalable scenario over several FPGAs.
In its current development state, CRUN was leveraged for inference of a machine learning application composed of a fully connected neural network. The main performance target was to achieve ultra-low latency, less than 40μs, for each inference at software level. Only CRUN fulfilled the requirement among the analyzed alternatives, where the architecture is capable of providing latency in the 30μs range in average. For context, high-end General-Purpose Processor (GPP) and Graphics Processing Unit (GPU) provided latency values of 798μs and 1 897μs respectively for the same application
VNF-AAPC : accelerator-aware VNF placement and chaining
In recent years, telecom operators have been migrating towards network architectures based on Network Function Virtualization in order to reduce their high Capital Expenditure (CAPEX) and Operational Expenditure (OPEX). However, virtualization of some network functions is accompanied by a significant degradation of Virtual Network Function (VNF) performance in terms of their throughput or energy consumption. To address these challenges, use of hardware-accelerators, e.g. FPGAs, GPUs, to offload CPU-intensive operations from performance-critical VNFs has been proposed. Allocation of NFV infrastructure (NFVi) resources for VNF placement and chaining (VNF-PC) has been a major area of research recently. A variety of resources allocation models have been proposed to achieve various operator's objectives i.e. minimizing CAPEX, OPEX, latency, etc. However, the VNF-PC resource allocation problem for the case when NFVi incorporates hardware-accelerators remains unaddressed. Ignoring hardware-accelerators in NFVi while performing resource allocation for VNF-chains can nullify the advantages resulting from the use of hardware-accelerators. Therefore, accurate models and techniques for the accelerator-aware VNF-PC (VNF-AAPC) are needed in order to achieve the overall efficient utilization of all NFVi resources including hardware-accelerators. This paper investigates the problem of VNF-AAPC, i.e., how to allocate usual NFVi resources along-with hardware-accelerators to VNF-chains in a cost-efficient manner. Particularly, we propose two methods to tackle the VNF-AAPC problem. The first approach is based on Integer Linear Programming (ILP) which jointly optimizes VNF placement, chaining and accelerator allocation while concurring to all NFVi constraints. The second approach is a heuristic-based method that addresses the scalability issue of the ILP approach. The heuristic addresses the VNF-AAPC problem by following a two-step algorithm. The experimental evaluations indicate that incorporating accelerator-awareness in VNF-PC strategies can help operators to achieve additional cost-savings from the efficient allocation of hardware-accelerator resources
Modelling and characterisation of distributed hardware acceleration
Hardware acceleration has become more commonly utilised in networked computing systems. The growing complexity of applications mean that traditional CPU architectures can no longer meet stringent latency constraints. Alternative computing
architectures such as GPUs and FPGAs are increasingly available, along with simpler, more software-like development flows. The work presented in this thesis characterises the overheads associated with these accelerator architectures. A holistic view encompassing both computation and communication latency must be considered. Experimental results obtained through this work show that networkattached accelerators scale better than server-hosted deployments, and that host ingestion overheads are comparable to network traversal times in some cases. Along with the choice of processing platforms, it is becoming more important to consider
how workloads are partitioned and where in the network tasks are being performed. Manual allocation and evaluation of tasks to network nodes does not scale with network and workload complexity. A mathematical formulation of this problem is presented within this thesis that takes into account all relevant performance metrics.
Unlike other works, this model takes into account growing hardware heterogeneity and workload complexity, and is generalisable to a range of scenarios. This model can be used in an optimisation that generates lower cost results with latency performance close to theoretical maximums compared to naive placement approaches.
With the mathematical formulation and experimental results that characterise hardware accelerator overheads, the work presented in this thesis can be used to make informed design decisions about both where to allocate tasks and deploy accelerators in the network, and the associated costs
Distribution of Low Latency Machine Learning Algorithm
Mobile networks are evolving towards centralization and cloudification while bringing computing power to the edge, opening its scope to a new range of applications. Ultra-low latency is one of the requirements of such applications in the next generation of mobile networks (5G), where deep learning is expected to play a big role. Hence, to enable the usage of deep learning solutions on the edge cloud, ultra-low latency inference must be investigated.
The study presented here relies on the usage of an in-house framework (CRUN) that enables the distribution of acceleration on data center environment. The objective of this thesis is to leverage the best solution for the inference of a machine learning algorithm for an anomaly detection application using neural networks in the edge cloud context. To evaluate the obtained results with CRUN a comparison work is also carried out. Five inference solutions were compared using CPU, GPU and FPGA.
The results show a superior performance in terms of latency for all CRUN experiments, that basically comprehends three cases. The first one utilizing the RTL anomaly detection neural network as a baseline solution, the second using the same baseline code but unrolling the biggest layer for obtaining reduced latency and the third by distributing the neural network in two FPGAs. The requirements for this solution were to obtain latency between 20 μs to 40 μs for inference time and at least 20000 inferences per second. These goals were categorically fulfilled for all CRUN experiments, providing 30 μs latency in average, while the second best solution provided 272 μs
DRAM Bender: An Extensible and Versatile FPGA-based Infrastructure to Easily Test State-of-the-art DRAM Chips
To understand and improve DRAM performance, reliability, security and energy
efficiency, prior works study characteristics of commodity DRAM chips.
Unfortunately, state-of-the-art open source infrastructures capable of
conducting such studies are obsolete, poorly supported, or difficult to use, or
their inflexibility limit the types of studies they can conduct.
We propose DRAM Bender, a new FPGA-based infrastructure that enables
experimental studies on state-of-the-art DRAM chips. DRAM Bender offers three
key features at the same time. First, DRAM Bender enables directly interfacing
with a DRAM chip through its low-level interface. This allows users to issue
DRAM commands in arbitrary order and with finer-grained time intervals compared
to other open source infrastructures. Second, DRAM Bender exposes easy-to-use
C++ and Python programming interfaces, allowing users to quickly and easily
develop different types of DRAM experiments. Third, DRAM Bender is easily
extensible. The modular design of DRAM Bender allows extending it to (i)
support existing and emerging DRAM interfaces, and (ii) run on new commercial
or custom FPGA boards with little effort.
To demonstrate that DRAM Bender is a versatile infrastructure, we conduct
three case studies, two of which lead to new observations about the DRAM
RowHammer vulnerability. In particular, we show that data patterns supported by
DRAM Bender uncovers a larger set of bit-flips on a victim row compared to the
data patterns commonly used by prior work. We demonstrate the extensibility of
DRAM Bender by implementing it on five different FPGAs with DDR4 and DDR3
support. DRAM Bender is freely and openly available at
https://github.com/CMU-SAFARI/DRAM-Bender.Comment: To appear in TCAD 202