ABSTRACT About a decade ago, people were concerned about the risks of adopting cloud computing. It was an unproven technology that raised more questions than it answered. Nowadays, we hear more about the risks of not adopting the cloud. Three of the leading cloud players, Amazon Web Services, Microsoft Azure, and Google Cloud Platform, together with other participants, have developed complex cloud platforms that are driving the cloud agenda and launching innovative new products to meet the needs of modern businesses. Looking at processors, the core components of the cloud, the trend in hyperscale data centers is to move beyond CPUs and turn to dedicated chips such as graphics processing units, field-programmable gate arrays, and application-specific integrated circuits. We regard this as a process of realizing artificial intelligence (AI) and provide a detailed survey of hardware server design in this process. After discussing and summarizing various disclosed techniques and platforms, we conceive a hybrid hardware structure for efficient AI applications.
I. INTRODUCTION
Back in 2006, when Amazon released the first ever public cloud platform, few could predict the impact cloud computing would have on the IT industry. The market has expanded rapidly, and more and more new customers have entered it. The original idea was to save costs by avoiding hardware purchases, ongoing maintenance and software upgrades. But looking beyond the operational concerns, it is interesting to see that the cloud really begins to shine when it becomes an enabler for increased business opportunities and expanded horizons [1]. With the rising popularity of cloud computing, some related fields have been studied, such as energy-aware resource allocation [2] - [4] and intelligent security protection [5] - [7] in cloud computing.
In operating at an enormous scale and cutting hardware costs, cloud players can improve their bottom lines. But during the process of cloud formation, when a data center reaches a certain size, power and space eat up enormous amounts of money. This is the same problem every cloud player has faced. Take Amazon as an example: Servers always make up the bulk of the data center spending. In the early days of its cloud platform, Amazon bought servers from major vendors. But as its business grew, Amazon followed Google's lead [8] and started to create custom hardware for its data centers [9] . This allows Amazon to fine-tune its servers, storage and network devices with greater control over both performance and cost. Amazon worked with Intel to make the household processor run faster to support very specific workloads [10] . Different chip configurations offered optimizations for compute intensive and memory intensive applications.
Like Google, Amazon did not reveal the specific designs of its custom gear because those are trade secrets that give the company a competitive edge. But there was a little show and tell, which always disclosed some useful things. Facebook now designs its own data centers and servers [11] , and as a direct response to Google's approach, the social-networking outfit has ''open sourced'' its designs, hoping to encourage collaboration on designs across the industry. Several companies have already embraced this effort, but others still prefer to keep their secret hardware secret.
This paper aims to mine information about open and disclosed server hardware designs, and to analyze the nature of current workloads in order to predict future server characteristics. To the best of our knowledge, no previous work has treated this topic as specifically and completely as this paper. The main contributions of this paper include the following:
1. We explored signs that the development of AI has affected server design, and combined these two fields into a dedicated study. We investigated the mainstream servers in the cloud and compared existing designs.
2. We researched and answered two important questions: ''How is better performance extracted from existing chipsets?'' and ''How is energy consumption affected by advancements in technology?''
3. We offered an insight into the relationship between processors and workloads to predict the evolution of future servers, and proposed a hybrid hardware structure. We conducted a series of experiments to confirm our proposal.
The rest of this paper is organized as follows: Section II describes AI-related technologies that can be used in hardware; Section III illustrates the current computing chipsets including CPU, GPU, FPGA, and ASIC; Section IV introduces popular high-speed bus standards for interconnecting hardware components; Section V provides a survey about hardware servers of the leading cloud players; Section VI discusses applications run in the hardware servers with some examples; Section VII provides some experiments and discusses the experimental results; Section VIII demonstrates the conceived hardware server architecture; Section IX gives an overview of similar work; Section X summarizes this paper in conclusion.
II. BACKGROUND TECHNOLOGIES
A. GEMM
GEMM (GEneral Matrix Multiplication) is part of the BLAS (Basic Linear Algebra Subprograms) library, which was first created in 1979. Due to the ubiquity of matrix multiplications in many scientific applications, GEMM is a prime target of optimization for BLAS implementers. In a typical deep convolutional neural network, CPU and GPU platforms spend most of their computation time on the dense convolution and fully connected layers. GEMM provides an efficient implementation for both kinds of layers [12].
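As a rough illustration of why both layer types reduce to GEMM, the following sketch expresses a fully connected layer as a single matrix product and a single-channel convolution as a patch-unrolling step followed by the same product. The helper names `gemm`, `im2col` and `conv2d_as_gemm` are hypothetical, the convolution uses the cross-correlation form with stride 1 and no padding, and the plain-Python loops stand in for an optimized BLAS routine; none of this is taken from [12].

```python
def gemm(a, b):
    """Multiply matrix a (m x k) by matrix b (k x n), both given as lists of rows."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    n = len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]


def im2col(image, kh, kw):
    """Unroll each kh x kw patch (stride 1, no padding) into one row."""
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(h - kh + 1)
        for j in range(w - kw + 1)
    ]


def conv2d_as_gemm(image, kernel):
    """Single-channel convolution layer (cross-correlation form) via one GEMM."""
    kh, kw = len(kernel), len(kernel[0])
    patches = im2col(image, kh, kw)  # (out_h * out_w) x (kh * kw)
    weights = [[kernel[di][dj]] for di in range(kh) for dj in range(kw)]  # (kh * kw) x 1
    flat = [row[0] for row in gemm(patches, weights)]
    out_w = len(image[0]) - kw + 1
    return [flat[r:r + out_w] for r in range(0, len(flat), out_w)]


# Fully connected layer: one input row times a 2 x 2 weight matrix.
fc_out = gemm([[1, 2]], [[3, 4], [5, 6]])
# Convolution layer: 3 x 3 image with a 2 x 2 kernel.
conv_out = conv2d_as_gemm([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
```

In this form, optimizing the one `gemm` routine speeds up both layer types, which is the reason GEMM is treated as the central kernel.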
B. QUANTIZATION
Quantization is an umbrella term that covers many different techniques for storing numbers and calculating with them in formats more compact than 32-bit floating point [13] - [16]. The first motivation for quantization is to reduce the data size: the minimum and maximum of each layer are stored, and each floating-point value is converted to an 8-bit integer by mapping the real number range linearly onto the range 0-255. For example, with a range of −6.0 to 6.0, the byte value 0 represents −6.0, 255 represents 6.0, and 128 represents approximately 0.
Another reason to quantize is to reduce computing resources needed to do the inference calculations, by using the 8-bit integer as inputs and outputs.
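The linear 8-bit mapping described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `quantize` and `dequantize` are hypothetical, rounding is to the nearest code, and out-of-range values are clamped; the schemes in [13] - [16] may differ in these details.

```python
import math


def quantize(values, lo, hi):
    """Map floats in [lo, hi] linearly onto the integer codes 0..255."""
    span = hi - lo
    codes = []
    for v in values:
        code = math.floor((v - lo) * 255 / span + 0.5)  # round to nearest code
        codes.append(max(0, min(255, code)))            # clamp out-of-range inputs
    return codes


def dequantize(codes, lo, hi):
    """Recover approximate floats from 8-bit codes."""
    span = hi - lo
    return [lo + c * span / 255 for c in codes]


codes = quantize([-6.0, 0.0, 6.0], -6.0, 6.0)
restored = dequantize(codes, -6.0, 6.0)
```

With the range of −6.0 to 6.0, one code step is 12/255, or about 0.047, so the value recovered from code 128 is close to, but not exactly, zero.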
Running inference with pre-trained models is very different from neural network training, in which many tiny adjustments are applied to the weights, and these small changes need floating-point precision to work. A remarkable feature of deep networks is that they tend to cope well with high levels of noise in their inputs. When recognizing an object in a photo that has just been taken, the network must ignore all the CCD (Charge-Coupled Device) noise, lighting changes and other trivial differences between the photo and the training examples it has seen before, and focus on the important similarities. This ability means that deep networks seem to treat low-precision computing as just another source of noise, so accurate results can still be achieved even with numerical formats that hold less information.
C. EMERGING DNNS
With the rapid evolution of DNN (Deep Neural Network) algorithms, recent developments point to next-generation DNNs [17] - [27] that use compact data types and exploit network sparsity. Compared with classic DNNs, which rely on dense GEMM with the FP32 (single-precision floating-point) data type, these emerging DNNs offer improved efficiency through irregular parallelism and custom data types. Owing to their extreme customizability, FPGAs are better suited than GPUs to handling such irregular parallelism and custom data types.
In BNNs (Binarized DNNs), both weight and neuron values are restricted to +1 or −1. Therefore, the key operation in BNNs is 1-bit matrix multiplication. In TNNs (Ternarized DNNs), weights are restricted to 0, +1, or −1, but neurons still use N-bit precision. FPGAs are flexible enough to implement such customized N-bit data types. When some weights are zero, the computation becomes sparse matrix multiplication, which requires fewer operations than dense matrix multiplication. Furthermore, it is feasible to prune weights that are deemed unimportant for certain layers with only a tiny degradation in accuracy.
III. CHIPSETS
Driving the evolution of AI, deep learning technology has been one of the hottest topics of discussion within the technology world and beyond. In the context of rapid technological development, the current market climate is ripe for innovation in general hardware and specific chipsets. The chipset market has been led by CPUs and GPUs, but in the coming years other chipset types, including FPGAs, ASICs, and other emerging chipsets, may play an important role [28].
A. CPU AND GPU
A CPU performs all types of data processing operations with three components: the Memory or Storage Unit, the Control Unit and the ALU (Arithmetic Logic Unit). A multi-core processor is a single computing component with two or more independent processing units (cores), each of which reads and executes program instructions. The CPU is latency-optimized for general-purpose workloads. A multi-core processor can improve performance, but the gain depends very much on the software algorithms used and their implementation. In particular, the possible gain is limited by the fraction of the software that can run in parallel on multiple cores; the theoretical speedup is described by Amdahl's law. For AI workloads, the primary weakness of a CPU is that it usually has no more than tens of cores.
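Amdahl's law states that if a fraction p of a program can be parallelized across n cores, the theoretical speedup is 1 / ((1 − p) + p / n). The short sketch below evaluates this bound; the function name `amdahl_speedup` and the sample fractions are illustrative assumptions, not measurements from this survey.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Theoretical speedup when `parallel_fraction` of the work runs on `cores` cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# With 90 percent parallel code, adding cores gives diminishing returns:
# the speedup can never exceed 1 / (1 - 0.9) = 10.
speedups = {n: amdahl_speedup(0.9, n) for n in (2, 10, 100, 1000)}
```

The bound explains why tens of CPU cores help only when nearly all of the workload is parallel, which is the case for the uniform layer computations of a DNN.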
A GPU is used together with a CPU to accelerate deep learning, analytics, and engineering applications. Designed for parallel computing, the GPU plays a huge role in compute-intensive applications [29], [30]. For example, a DNN is structured in a very uniform manner, such that at each layer of the network up to thousands of identical artificial neurons perform the same kind of computation. The efficient parallel computation of a GPU therefore fits the structure of a DNN quite well.
An NVIDIA GPU contains a number of largely independent processors called SMs (Streaming Multiprocessors), as shown in Fig. 1(a). Each SM hosts several SP (Streaming Processor) ''cores'', and each SP runs a thread. For instance, Fermi has up to 16 SMs with 32 cores per SM, so up to 512 threads can run in parallel. GPU architectures use SIMD (Single-Instruction, Multiple-Data) hardware to enhance computational efficiency. Rather than exposing the SIMD hardware directly to the programmer, GPUs employ a SIMT (Single-Instruction, Multiple-Threads) execution model to improve flexibility. SIMT groups the scalar threads specified by the programmer into SIMD execution warps. Threads in a warp execute in lockstep on the SIMD hardware. Several warps, making up a ''block'', are mapped to an SM, and an SM instantaneously switches between the warps of a block.
B. FPGA
FPGAs consist of an array of programmable logic blocks of potentially different types, including general logic, memory and multiplier blocks, accompanied with programmable routing fabrics that allow customized blocks to be interconnected. The array is surrounded by programmable input/output blocks that connect the chip to the outside circuit.
An FPGA can be electrically programmed to become almost any kind of digital circuit. A PE (Processing Element) design that takes advantage of neuron value sparsity is proposed in [31] and [32]. It stores data in a dense format but checks and tracks zeros dynamically and skips computations on them. Specifically, before data is fed to the GEMM unit, an on-chip data manager checks for zero values to determine their locations in the block of data. Each PE inside the GEMM unit reads a set of matrix elements. Zero elements are not scheduled into the multiply-accumulate units inside the PE, which reduces the number of cycles required to complete the matrix operations and improves overall performance. The design of the PE to support sparsity is shown in Fig. 1(b).
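The zero-skipping idea can be sketched in software as a dot product that counts how many multiply-accumulate operations are actually issued. This is only a behavioral model under simplifying assumptions: the function name `sparse_dot` is hypothetical, one issued multiply-accumulate is treated as one cycle, and the hardware data manager of [31] and [32] is reduced to a simple zero check.

```python
def sparse_dot(activations, weights):
    """Dot product that skips zero activations and reports the MACs issued."""
    if len(activations) != len(weights):
        raise ValueError("vectors must have the same length")
    accumulator = 0
    macs_issued = 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue  # zero element: never scheduled into the MAC unit
        accumulator += a * w
        macs_issued += 1
    return accumulator, macs_issued


# Half of the activations are zero, so only 2 of the 4 MACs are issued.
result, macs = sparse_dot([0, 2, 0, 3], [5, 6, 7, 8])
```

In this model the saved work is proportional to the fraction of zero elements, which matches the cycle reduction described above.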
An engine for binarized GEMM consists of an XNOR unit as well as a set of lookup tables and adders to calculate the popcount (the number of 1s) for binarized multiplication. In the engine, −1 can be represented as zeros for improving computation efficiency, and computation can be done using an XNOR followed by a bit counting operation. The bit counting itself can further be implemented using a lookup table. N-bit dot products can be accomplished as a customization option for a PE using the aforesaid approach.
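A software sketch of the XNOR-and-popcount approach is given below. It assumes that +1 is encoded as bit 1 and −1 as bit 0, so the dot product of two n-element vectors equals 2 × popcount(XNOR(a, b)) − n. The names `pack`, `popcount` and `binary_dot`, and the 8-bit table size, are illustrative choices rather than details of any specific FPGA engine.

```python
# 256-entry lookup table: number of 1 bits in each possible byte value.
POPCOUNT_LUT = [bin(i).count("1") for i in range(256)]


def popcount(x):
    """Count the 1 bits of a non-negative integer, one byte at a time."""
    count = 0
    while x:
        count += POPCOUNT_LUT[x & 0xFF]
        x >>= 8
    return count


def pack(values):
    """Pack a list of +1/-1 values into an integer (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
        elif v != -1:
            raise ValueError("binarized values must be +1 or -1")
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed n-element +1/-1 vectors via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask  # XNOR: 1 where the two signs match
    return 2 * popcount(agree) - n     # matches contribute +1, mismatches -1


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
dot = binary_dot(pack(a), pack(b), len(a))
```

Each matching bit pair contributes +1 and each mismatch contributes −1, so no multiplier is needed at all.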
C. ASIC
An ASIC is designed for a specific use or application rather than for general-purpose use. The cost of an ASIC design is high, so ASICs tend to be reserved for high-volume products. However, ASICs have advantages in area, delay, and power consumption, which is why Google built the TPU (Tensor Processing Unit) [33] when it realized that the fast-growing computational demands of neural networks could require it to double the number of data centers it operated.
At the heart of the TPU, as shown in Fig. 1(c), the matrix multiplier unit implements an architecture known as a systolic array, which is drastically different from typical CPUs and GPUs. The systolic array is not well suited to general-purpose computation but is optimized for power and area efficiency in performing matrix multiplications. It makes an engineering tradeoff: limiting registers, control and operational flexibility in return for efficiency and much higher operation density.
Quantization is a powerful tool for reducing the cost of neural network inference. The ability to use integer instead of floating-point operations greatly reduces the hardware footprint and energy consumption of the TPU. The TPU matrix multiplication unit is a systolic array that contains a total of 65,536 (256 × 256) ALUs. That means a TPU can process 65,536 multiply-and-adds for 8-bit integers every clock cycle. With its 700 MHz clock frequency, a TPU can compute 65,536 × 700,000,000 ≈ 46 × 10^12 multiply-and-add operations per second in the matrix unit, or about 92 Teraops per second when each multiply and each add is counted as a separate operation.
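The peak-throughput figure can be checked with a few lines of arithmetic. The sketch assumes that every ALU completes one multiply-and-add per cycle and that a multiply-and-add counts as two operations; the function name `tpu_peak_throughput` is hypothetical.

```python
def tpu_peak_throughput(rows=256, cols=256, clock_hz=700_000_000):
    """Return (MACs per cycle, MACs per second, operations per second)."""
    macs_per_cycle = rows * cols                  # one MAC per ALU per cycle
    macs_per_second = macs_per_cycle * clock_hz
    ops_per_second = 2 * macs_per_second          # multiply and add counted separately
    return macs_per_cycle, macs_per_second, ops_per_second


macs_per_cycle, macs_per_second, ops_per_second = tpu_peak_throughput()
teraops = ops_per_second / 1e12
```

This is a peak figure for the matrix unit only; sustained throughput depends on how well a workload keeps the array fed.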
IV. ADVANCED SERVER BUSES
A. PCIE
As applications become more complex, they demand more computing resources. If key algorithms are accelerated, applications can run faster. PCIe-based accelerators are popular because they provide great flexibility for adding capability to existing systems, allowing many more options in terms of accelerator types, algorithms, performance characteristics and implementation devices (GPUs, FPGAs, and ASICs). However, as an I/O device, a PCIe-based accelerator communicates with the processor and main memory through an I/O subsystem. OS device drivers for the accelerator are therefore needed, and data must be copied between main memory and accelerator memory, which limits data transfer performance.
GPU-accelerated computing offers unprecedented application performance by offloading compute-intensive portions of the application from CPU to GPU, while the remainder of the code still runs on the CPU. From a user's perspective, applications simply run significantly faster. In order to reduce the chances of delaying deployment, the TPU was designed to be a coprocessor on the PCIe I/O bus rather than be tightly integrated with a CPU, allowing it to plug into existing servers just as a GPU does. FPGAs are also integrated into mainstream compute systems as a ''GPU form factor'' PCIe card along with a server CPU.
B. CAPI
CAPI (Coherent Accelerator Processor Interface) [34] is delivered by IBM as a scale-out solution around the POWER8 platform. CAPI resides directly on the POWER8 board and works with the same memory addresses that the processor uses, so the accelerator becomes part of a coherent memory fabric, making it easy to exchange data at very high speed. In effect, CAPI removes OS and device driver overhead by presenting an efficient, robust, durable and, most importantly, direct interface.
Prior to CAPI, an application called a device driver to utilize an FPGA accelerator, and the device driver performed a memory mapping operation. With CAPI, the FPGA shares memory with the cores, which greatly simplifies the programming model. As shown in Fig. 2, CAPI can reduce the typical seven-step I/O model flow (1-Device Driver Call, 2-Copy or Pin Source Data, 3-MMIO Notify Accelerator, 4-Acceleration, 5-Poll/Int Completion, 6-Copy or Unpin Result Data, 7-Return From Device Driver Completion) to just three steps (1-shared memory/notify accelerator, 2-acceleration, and 3-shared memory completion).
C. NVLINK
NVLink [35] is NVIDIA's new high-speed interconnect technology, which uses NVHS (NVIDIA's High-Speed Signaling interconnect) for GPU-accelerated computing. NVHS transmits data at up to 20 Gb/s over a differential pair. Eight of these pairs form a ''Sub-Link'' that sends data in one direction, and two sub-links, one for each direction, form a ''Link'' that connects two processors in GPU-to-GPU or GPU-to-CPU connections. A single link supports up to 40 GB/s of bidirectional bandwidth between the endpoints.
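The 40 GB/s figure follows directly from the per-pair signaling rate, as the following back-of-the-envelope sketch shows. It assumes 8 bits per byte and ignores any encoding or protocol overhead; the function name `nvlink_bandwidth_gbytes` is hypothetical.

```python
def nvlink_bandwidth_gbytes(pair_gbits=20, pairs_per_sublink=8, sublinks_per_link=2):
    """Return (GB/s per sub-link, GB/s per bidirectional link)."""
    sublink_gbits = pair_gbits * pairs_per_sublink    # 8 pairs x 20 Gb/s = 160 Gb/s
    sublink_gbytes = sublink_gbits / 8                # 160 Gb/s = 20 GB/s one way
    link_gbytes = sublink_gbytes * sublinks_per_link  # both directions together
    return sublink_gbytes, link_gbytes


per_direction, bidirectional = nvlink_bandwidth_gbytes()
```

Each direction therefore carries up to 20 GB/s, and the two sub-links together give the quoted 40 GB/s.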
The GPU instruction set architecture allows programs running on NVLink-connected GPUs to operate directly on data in the memory of another GPU as well as in local memory. GPUs can also perform atomic memory operations on remote GPU memory addresses, enabling much tighter data sharing and improved application scaling. However, because the high-speed NVLink connection is so new, there is only one server on the market with both GPU-to-CPU and GPU-to-GPU NVLink connectivity. This system, leveraging IBM's POWER8 CPUs and innovation from the OpenPOWER Foundation (including NVIDIA and Mellanox), began shipping in fall 2016. Therefore, on systems with x86 CPUs, such as Intel Xeon, the CPU connects to the GPUs only through PCIe, although the GPUs connect to each other through NVLink. On systems with POWER8 CPUs, the CPU connects to the GPUs through NVLink, in addition to the NVLink connections between GPUs.
D. CCIX
CCIX (Cache Coherent Interconnect for Accelerators) [36] was founded to enable a new class of interconnect focused on emerging acceleration applications such as machine learning, network processing, storage off-load, in-memory database and 4G/5G wireless technology. The standard allows processors based on different instruction set architectures to extend the benefits of cache coherent, peer processing to a number of acceleration devices including FPGAs, GPUs, network/storage adapters, intelligent networks and custom ASICs.
CCIX simplifies the development and adoption by extending well-established data center hardware and software infrastructure. This ultimately allows system designers to seamlessly integrate the right combination of heterogeneous components to meet their specific system requirements. One of the biggest advantages of the CCIX specification is that it builds on the PCIe specifications. CCIX's coherence protocol can be carried across PCIe links with little or no modification.
V. SERVERS
A. TWO SOCKET SERVER
The typical two-socket server architecture is composed of CPUs and a peripheral chipset, the PCH (Platform Controller Hub), which supports a variety of features and capabilities such as USB, GPIO, and SPI. Fig. 3(a) provides a representation of the typical computing complex referred to as a dual-processor or two-socket server. In this case, the CPU provides both the integrated memory controller and integrated I/O capability. In addition, a mechanism for inter-processor communication for shared cache and memory is provided via a high-speed, highly reliable interconnect, QPI (QuickPath Interconnect).
B. MICROSOFT'S SERVERS
In January 2014, Microsoft announced that it would join the Open Compute Project launched by Facebook and contribute its intellectual property for the Open Cloud Server design to the open hardware community. The Open Cloud Server has a 12U-high rack that supports two dozen half-width trays, which can be configured for either computing or storage. The basic computing element is a two-socket Xeon E5 server with four 3.5-inch SATA disk drives and an M.2 flash memory for local operating system image support. An alternative sled with ten 3.5-inch drives is used to extend the storage of the local nodes. The server supports Microsoft's FPGA accelerator and its GPU accelerator. Catapult, which uses FPGAs as ''programmable silicon,'' can be integrated into Microsoft's Open Cloud Servers.
Microsoft's Project Olympus [37], the next-generation cloud hardware design and a new model for open source hardware development, was introduced in November 2016. In this solution, Microsoft created a free-standing, rack-based server node to which storage extension shelves can be added for a much higher storage-to-compute ratio where necessary. The Olympus server is presumably designed to have Xeon E5 processors from Intel as its centerpiece, but with a new universal motherboard design (shown in Fig. 3(b)) that can support other kinds of processors. Moreover, because the server node is not clogged up with disk drives or flash SSDs, it has sufficient space to add various kinds of accelerators with standard PCIe form factors. This is in contrast to the Open Cloud Servers, which use a mezzanine card in each two-socket sled for a fairly modestly powered GPU or FPGA accelerator.
The universal motherboard [38] supports up to 2 CPUs, up to 32 DIMMs, up to 12 SATA devices, up to 3 FHHL (Full Height Half Length) PCIe x16 slots, x8 PCIe cabling (OCuLink), up to two PCIe x8 slots each capable of supporting up to two M.2 modules through an interposer board, up to four M.2 modules via direct-attach to the motherboard.
The universal motherboard will support the latest server chips, including Intel's Skylake and AMD's Naples. Project Olympus represents something that is rarely seen in servers: a span from x86 to ARM, with support for Qualcomm's Centriq 2400 or Cavium's Thunder X2 chips. Microsoft also announced a GPU accelerator, developed with Nvidia and Ingrasys, called HGX-1, which can be scaled up to link 32 GPUs together. The Project Olympus HGX-1 supports eight Nvidia Pascal GPUs via the NVLink interconnect technology. Four HGX-1 AI accelerators can be linked together to create a massive machine learning cluster of 32 GPUs. Some features are summarized in Table 1, which also covers the servers described in the following subsections.
C. FACEBOOK'S SERVERS
The Leopard two-socket server is used for a variety of compute services at Facebook. Its successor, Tioga Pass, has a dual-socket motherboard similar to Microsoft's universal motherboard but uses a 6.5-inch by 20-inch form factor, which allows Facebook to slide three servers into the Open Rack enclosure side by side. Tioga Pass supports both single-sided and double-sided designs, the latter with DIMMs on both PCB sides. Tioga Pass upgrades the PCIe slot from x24 to x32, which allows for two x16 slots, or one x16 slot and two x8 slots, to make the server more flexible. This doubles the available PCIe bandwidth when accessing either GPUs or flash. This is also Facebook's first dual-CPU server to use the OpenBMC framework for management.
Yosemite is the sled that incorporates four 1S (single-socket) boards and the associated NIC. As shown in Fig. 3(c), the Yosemite V2 Platform [39], a refresh of the Yosemite Platform, can have two configurations. In the first, it hosts four 1S server cards in all four slots, as in the Yosemite Platform. In the second, two 1S servers are configured in slots 2 and 4 with device cards in slots 1 and 3, which are connected to the 1S servers in slots 2 and 4 through 6 x4 PCIe Gen3 links. Each device card is linked over PCIe to the corresponding 1S server and can be a flash card, a GPU card, an FPGA card, and so on.
Big Basin is the successor to Facebook's Big Sur GPU server, the first widely deployed, high-performance compute platform used to train larger and deeper neural networks at Facebook. Big Basin can train models that are 30 percent larger because of greater arithmetic throughput and a memory size increase from 12 GB to 16 GB. Big Basin is designed as a JBOG (just a bunch of GPUs): its 8 GPUs are flexibly interconnected internally and connected externally to a host, which allows the complete disaggregation of the CPU compute from the GPUs. It does not have compute and networking built in, so it requires an external server head node. With this design, Open Compute servers can be connected as a separate block from the Big Basin unit, and each block can be scaled independently as new CPUs and GPUs are released.
D. AMAZON'S SERVERS
Amazon builds its own servers because servers bought off the shelf are very expensive and designed for general-purpose use. Amazon worked together with Intel to make household processors run at a much higher clock rate, which suits custom server types built to support very specific workloads.
At the re:Invent 2016 conference, James Hamilton, VP and distinguished engineer at AWS, showed one type of the company's custom compute server, each occupying one slot on a rack, sparsely populated, but optimized for thermal and power efficiency. The custom compute server's power supplies and voltage regulators operate at greater than 90 percent efficiency, which is much better than the dense servers sold to customers by OEMs. Given that AWS spends hundreds of millions of dollars on electricity, a power supply that is 1 percent more efficient adds up to a significant saving. Amazon also has a hyperscale, minimalist design. A slide presented by Amazon revealed a pair of server racks in which the server enclosure appears to have room for ten server nodes, five on the left and five on the right, split by what are presumed to be two power supplies or peripheral nodes in the center.
In F1 FPGA instance types on EC2 (Elastic Compute Cloud) [40] , the instance size of f1.16xlarge provides 8 FPGAs, each with over 2 million customer-accessible FPGA programmable logic cells and over 5000 programmable DSP blocks. Instead of throwing a bunch of FPGA cards into a PCIe server, Amazon has designed a custom server with a fabric of pooled accelerators that interconnects up to 8 FPGAs. This allows the chips to share memory and improve the bandwidth and latency of inter-chip communication. This tells us that Amazon is likely to see large scale requirements for applications such as inference engines for deep learning and other workloads.
The P2 GPU instance types of EC2 offer up to 16 NVIDIA K80 GPUs (8 K80 cards) in a single instance. The p2.16xlarge size provides a combined 192 GB of GPU memory and 40 thousand CUDA cores, supporting 70 teraflops of single-precision floating-point performance and over 23 teraflops of double-precision floating-point performance. P2 GPU instances are clearly aimed at high-performance deep learning training and inference.
E. GOOGLE'S SERVERS
In the early days of Google, the company's data center servers were stripped-down machines slotted into extremely tight spaces. They did not even have cases. This allowed the company to buy only the components it needed and nothing more [41]. More recently, Google and cloud builder Rackspace Hosting agreed to partner on a future server design based on IBM's future Power9 processor. The new machine will have two server chips with both NVLink interconnects and PCIe 4.0 peripheral controllers, which will support IBM's CAPI overlay to hook accelerators and other devices directly into the Power processor complex and allow them to share memory.
AlphaGo is powered by the TPU, which is built on a 28 nm process, runs at 700 MHz and consumes 40 W when running. Because the TPU needed to be deployed to Google's existing servers as fast as possible, the company chose to package the processor as an external accelerator card that fits into a SATA hard disk slot for drop-in installation. The TPU is connected to its host via a PCIe Gen3 x16 bus that provides 12.5 GB/s of effective bandwidth. The secret of the TPU's outstanding performance is its dedication to neural network inference. The quantization choices, CISC instruction set, matrix processor and minimal design all became possible when it was decided to focus on neural network inference. Google also announced that its second-generation TPUs were coming to Google Cloud to accelerate a wide range of machine learning workloads, including both training and inference.
Google announced in 2016 that Google Cloud Platform will offer GPUs worldwide in 2017 for Google Compute Engine and Google Cloud Machine Learning users. Google Cloud will offer AMD FirePro S9300 x2 that supports powerful, GPU-based remote workstations. Google will also offer NVIDIA Tesla P100 and K80 GPUs for deep learning, AI and HPC applications that require powerful computation and analysis. GPUs are offered in passthrough mode to provide bare metal performance. Up to 8 GPU dies can be attached per VM instance including custom machine types.
F. OTHERS
Besides the aforementioned companies, Baidu, Tencent, Alibaba, and a handful of others have all reacted to server evolution in their own ways. Deploying white boxes is normal for them: Tencent, Baidu, and Alibaba, for example, are all members of the Facebook-led Open Compute Project for designing web-scale hardware. Tencent, Baidu, and Alibaba also launched their own rack-design specification, Project Scorpio, now called ODCC (Open Data Center Committee) [42].
There was industry speculation in 2015 that servers compliant with the ODCC's ''Scorpio'' specs would gain a significant share of Chinese server purchases in 2016 and that Alibaba would have migrated all its servers to Scorpio-compliant units by the end of 2017. We could not find confirmation of either prediction, but we do know that these companies are using GPUs and FPGAs in their clouds [43], [44].
The Baidu FPGA Cloud Server, a new service in Baidu Cloud, features highly efficient Xilinx Kintex FPGAs, tools, and the software needed to develop and deploy hardware-accelerated data center applications such as machine learning and data security. The new Baidu Cloud offers the latest GPU computing technology, including Pascal architecture-based NVIDIA Tesla P40 GPUs and NVIDIA deep learning software. It provides both training and inference acceleration for open-source deep learning frameworks, such as TensorFlow and PaddlePaddle. Alibaba's Aliyun and Tencent can also provide similar services.
VI. APPLICATIONS
Amazon has GPU-Accelerated Computing on AWS with a marketplace for GPU-related applications. NVIDIA maintains AMI (Amazon Machine Image) with CUDA Toolkit 7.5 on Amazon Linux 2016.03 (64-bit architecture) operating system. The CUDA Toolkit provides a development environment for C/C++ developers building GPU-accelerated applications. It includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of users' applications. Users will also find programming guides, user manuals and API references to help get started quickly with GPUs. They can use this free AMI to prototype, test and deploy their algorithms on single and multi-GPU configurations.
Amazon also has FPGA-Accelerated Computing on AWS for FPGA-related applications, providing more choice and easy access for all AWS customers. F1 is the accelerated instance for FPGA, meaning it offers FPGA devices to help developers accelerate specific types of applications using FPGA-based computing methods. Different AFIs (Amazon FPGA Images) delivered by Amazon's partners can be loaded and reloaded on an F1 instance. FPGA developers get all needed FPGA design and programming software through AMI provided by AWS.
One of those AMIs is Zebra [45] from Mipsology, which accelerates neural network inference using FPGAs. User-defined neural networks are computed by Zebra just as they would be by a GPU or a CPU. Zebra is fully integrated with traditional deep learning frameworks such as Caffe, MXNet or TensorFlow. No FPGA knowledge is required and not a single line of code needs to be written to use Zebra: users simply link to the Zebra library to switch from CPU or GPU to FPGA in minutes. Because Zebra includes the FPGA image and the software stack, there is no FPGA compilation to run and there are no FPGA tools to use.
The two above examples are customer-oriented applications provided by Amazon. Other applications from the others include but is not limited to the following: Facebook's servers use machine learning technologies to improve services, with one visible example being image recognition. Facebook also uses artificial intelligence to power services like speech and text translations [46] , photo classifiers, and real-time video classification. Microsoft uses FPGAs to deliver faster Bing results [47] , and its data centers apply machine learning for natural language services like Cortana. AlphaGo is famous for the matches against Go world champion, Lee Sedol. It is powered by Google's TPU ASIC.
VII. EXPERIMENTS AND DISCUSSION
The past two years have undoubtedly been remarkable for the awareness of AI. Since machine learning is a main sub-field within AI, the volume of applications being built using machine learning has grown explosively. Machine learning has two aspects: training the neural network with massive amounts of sample data, and using the trained network to infer some attribute of a new data sample. In the cloud, CPUs, GPUs, ASICs and FPGAs each have advantages for a specific type of application or a specific environment. The application complexity and data velocity determine how much and which kind of processing is needed.
While the training portion of machine learning has benefited enormously from GPUs, FPGAs and ASICs are expanding in the inference portion for better performance and power budget.
To evaluate the performance and power consumption of different kinds of chipsets, we designed three experiments. Experiment I compares power consumption between FPGA and GPU. Experiment II compares training time between CPU and FPGA. Experiment III compares inference time between GPU and FPGA.
In experiment I, we deployed a simple 6-layer DNN to classify handwritten digits from the MNIST dataset. The inference could run on both FPGA and GPU. For the FPGA, we used a Xilinx test board with a Zynq-7000 XC7Z010CLG400-1 FPGA onboard. This board was compared with a low-end GPU card, the GeForce GT 430, whose power is about 50W. The experimental result is shown in Fig. 4(a): the power consumed by the GPU and the FPGA is 50W and 2W respectively.
In experiment II, we trained AlexNet to do image classification on the CIFAR-10 dataset. The training ran on both CPU (Pentium G2020 2.9GHz) and FPGA. For FPGA training, we trained the AlexNet model on the Xilinx PCIe accelerator card KCU1500, which features a Kintex UltraScale XCKU115-2FLVB2104E FPGA. With the number of iterations set to 12,000, the experimental result is shown in Fig. 4(b): the running times of the CPU and the FPGA are 150 minutes and 110 minutes respectively.
In experiment III, we deployed the VGG-16 model to classify an image of 224 by 224 pixels with 3 color channels from the ImageNet dataset. The inference ran on both GPU and FPGA. For the GPU, we used a GeForce GTX 1080 Ti accelerator card. For the FPGA, we used the test result from [48]. The experimental result is shown in Fig. 4(c): the running times of the GPU and the FPGA are 150ms and 263ms respectively. Regarding power consumption, the FPGA consumed below 100W, whereas the GPU consumed above 300W.
In the above experiments, we evaluated the power consumption and performance of different chipsets. We can observe the following:
1. When simple DNNs are deployed for inference, an FPGA can achieve performance with no significant difference from that of a GPU while consuming as little as a few watts (shown in Fig. 4(a)).
2. For training tasks, there is not much difference in performance between FPGA and CPU (shown in Fig. 4(b)). Therefore FPGA is not suitable for training, at least at present.
3. When complex DNNs are used for inference on a standard dataset, a large-scale FPGA can achieve performance close to that of a GPU (shown in Fig. 4(c)), while consuming considerably less power.
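The ratios behind these observations can be checked with a few lines of arithmetic. The sketch below uses only the figures reported for Experiments I to III; the variable and function names are illustrative, and the per-image energy values are bounds derived from the stated power limits (FPGA below 100W, GPU above 300W), not measurements.

```python
# Figures as reported in Experiments I-III. The Experiment III power values
# are bounds stated in the text (FPGA below 100 W, GPU above 300 W).
POWER_W = {"gpu": 50.0, "fpga": 2.0}           # Experiment I
TRAIN_MIN = {"cpu": 150.0, "fpga": 110.0}      # Experiment II
INFER_MS = {"gpu": 150.0, "fpga": 263.0}       # Experiment III
INFER_POWER_BOUND_W = {"gpu": 300.0, "fpga": 100.0}


def energy_joules(power_w: float, time_ms: float) -> float:
    """Energy = power x time, with time converted from milliseconds to seconds."""
    return power_w * time_ms / 1000.0


power_ratio = POWER_W["gpu"] / POWER_W["fpga"]
train_speedup = TRAIN_MIN["cpu"] / TRAIN_MIN["fpga"]
latency_ratio = INFER_MS["fpga"] / INFER_MS["gpu"]

# Lower bound on GPU energy per image and upper bound on FPGA energy per image.
gpu_energy_floor = energy_joules(INFER_POWER_BOUND_W["gpu"], INFER_MS["gpu"])
fpga_energy_ceiling = energy_joules(INFER_POWER_BOUND_W["fpga"], INFER_MS["fpga"])

print(f"Exp. I   GPU/FPGA power ratio:      {power_ratio:.1f}x")
print(f"Exp. II  FPGA speedup over CPU:     {train_speedup:.2f}x")
print(f"Exp. III FPGA/GPU latency ratio:    {latency_ratio:.2f}x")
print(f"Exp. III GPU energy per image:   >= {gpu_energy_floor:.1f} J")
print(f"Exp. III FPGA energy per image:  <= {fpga_energy_ceiling:.1f} J")
```

Under these figures the FPGA is about 1.75 times slower than the GPU per image in Experiment III, yet its energy per image is at most 26.3 J against at least 45 J for the GPU, which is the sense in which its power consumption is more advantageous.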
VIII. FUTURE ARCHITECTURE
Next-generation servers are expected to keep pace with the seemingly insatiable demand for more performance, more resiliency, and better power usage effectiveness. In our conceived cloud server architecture, as shown in Fig. 5, there are two planes, one for control and one for data. The control plane dispatches various applications and configures the corresponding processors, including GPUs, FPGAs, and ASICs. The data plane fabric interconnects the processors within a chassis or a rack, providing high-bandwidth, high-integrity data paths between processors in a peer-to-peer network. Because the Application BCD (Balancer, Classifier, Dispatcher) splits task and data flows between the CPU and the FPGA, the system benefits from having separate control and data planes. The CPU controls the configuration and indicates the target processor for each workload, so that workloads reach the processors directly without burdening the CPU, guaranteeing maximum performance for both.
Analyzing applications is an important first step towards choosing the appropriate processor(s) for each workload.
Based on this analysis, the application dispatcher can assign the workload to the most suitable processor(s). The combination of GPUs, FPGAs and ASICs is nimble and adaptable enough to react quickly and maintain efficient delivery of compute-critical services, making it possible to extract the maximum amount of compute capacity. The balancer improves the distribution of workloads across multiple processors of the same type.
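As an illustration of the balancer's role, the following minimal sketch picks the least-loaded processor among those of the requested type. It is not the implementation behind Fig. 5: the `Processor` class, the `queued` load metric, and the least-loaded policy are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Processor:
    name: str
    kind: str          # "GPU", "FPGA", or "ASIC"
    queued: int = 0    # hypothetical load metric: workloads waiting on this device


def balance(pool: List[Processor], kind: str) -> Processor:
    """Assign one workload to the least-loaded processor of the given kind."""
    candidates = [p for p in pool if p.kind == kind]
    if not candidates:
        raise LookupError(f"no processor of kind {kind!r} in the pool")
    # min() returns the first candidate on ties, so earlier devices win ties.
    target = min(candidates, key=lambda p: p.queued)
    target.queued += 1
    return target


pool = [
    Processor("gpu0", "GPU", queued=2),
    Processor("gpu1", "GPU", queued=0),
    Processor("fpga0", "FPGA", queued=0),
]
chosen = [balance(pool, "GPU").name for _ in range(3)]
print(chosen)
```

A real balancer would likely weigh queue depth against device memory, link bandwidth, and workload size; a single counter is used here only to keep the policy visible.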
Because different applications have different processing requirements, a cloud infrastructure provider should rely on the classifier and dispatcher to decide what kind of application it is dealing with and how resources are to be used. Deep learning training is often limited to a single choice, GPUs, since their market domination appears to be stable, at least for the time being. In deep learning inference applications, ASICs like the TPU are preferred when workloads are dominated by 8-bit integer operations. FPGAs are suitable for emerging DNNs, such as sparse DNNs, BNNs, TNNs, and others built on flexible customized N-bit operations. GPUs are also good at classic DNNs that rely on dense GEMM using the FP32 data type.
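These selection rules can be written down as a small decision function. The sketch below is one possible reading of the paragraph above, not a definitive policy: the `Workload` fields, the data type labels, the rule ordering (sparsity checked before data type), and the FPGA fallback for any other bit width are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    phase: str            # "training" or "inference"
    dtype: str            # e.g. "fp32", "int8", "binary", "ternary", "int4"
    sparse: bool = False  # True for sparse DNNs


def classify(w: Workload) -> str:
    """Return the processor type suggested for a workload."""
    if w.phase == "training":
        return "GPU"                      # training: GPUs are the usual choice
    if w.sparse or w.dtype in {"binary", "ternary"}:
        return "FPGA"                     # sparse DNN, BNN, TNN
    if w.dtype == "int8":
        return "ASIC"                     # 8-bit integer inference, TPU-like
    if w.dtype == "fp32":
        return "GPU"                      # dense FP32 GEMM
    return "FPGA"                         # assumed: other customized N-bit types


examples = [
    Workload("training", "fp32"),
    Workload("inference", "int8"),
    Workload("inference", "binary"),
    Workload("inference", "fp32", sparse=True),
    Workload("inference", "fp32"),
    Workload("inference", "int4"),
]
for w in examples:
    print(w, "->", classify(w))
```

Keeping the rules in one function makes the ordering explicit, which matters for workloads that match more than one rule, such as a sparse 8-bit network.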
In this design, storage is disaggregated from the compute elements. The advantages of disaggregating the storage are significant. By separating storage from compute, we gain the flexibility to upgrade, replace, or add individual resources instead of the entire system. This also enables us to plan better for future growth, adding storage only when necessary, and to make better use of available storage space.
Disaggregation of storage from the compute elements considerably reduces the total cost of investment in the cloud in various ways, increases the efficiency of storage utilization, improves the resiliency of the storage stacks, and allows for pay-as-you-grow planning for the future of the cloud infrastructure. The type of storage used can be adjusted according to the data types. SSDs store and retrieve data faster than SAS or SATA drives, but SAS and SATA drives are advantageous for infrequently accessed data because they are cheaper. It is worth noting that, with the availability of high-speed interconnect technologies, storage disaggregation causes no penalty to performance.
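The choice between faster and cheaper storage can be sketched as a simple tiering rule driven by access frequency. The function name, the accesses-per-day metric, and the threshold value below are hypothetical; an actual deployment would derive them from its own access statistics and pricing.

```python
def choose_tier(accesses_per_day: float, hot_threshold: float = 100.0) -> str:
    """Pick a storage tier from access frequency.

    Frequently accessed ("hot") data goes to SSD for speed; infrequently
    accessed data goes to cheaper SAS/SATA drives. The threshold is an
    assumed tuning parameter, not a value taken from the design.
    """
    if accesses_per_day < 0:
        raise ValueError("accesses_per_day must be non-negative")
    return "SSD" if accesses_per_day >= hot_threshold else "SAS/SATA"


for name, rate in [("model-weights", 5000.0), ("monthly-logs", 0.2)]:
    print(name, "->", choose_tier(rate))
```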
Considering the trade-offs and relative differences between GPUs, FPGAs, and ASICs in terms of energy consumption, performance density, and other factors, we think it is possible to balance the framework to favor the better device for each workload. ASICs and FPGAs provide superior energy efficiency (performance per watt) compared to high-end GPUs: they can achieve at least satisfactory performance with lower power consumption.
The system is designed and configured with modular components, which make it possible to quickly add processing power or replace damaged hardware without affecting services. Each processor dynamically adjusts its energy consumption based on real-time load, allowing it to operate with the lowest possible power consumption. In particular, processors can be put into a standby state to save power when workloads are light.
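One way to picture load-dependent power management is a model in which a processor enters standby below a utilization threshold and otherwise draws power that scales with load. The sketch below assumes a linear model between idle and peak power; the function name, the threshold, and the wattages in the example are illustrative values, not measurements of any device discussed here.

```python
def power_draw(utilization: float, idle_w: float, peak_w: float,
               standby_w: float, standby_below: float = 0.05) -> float:
    """Estimate power draw (watts) of one processor at a given utilization.

    Assumed model: below `standby_below` the processor is put in standby;
    otherwise power rises linearly from idle_w (0% load) to peak_w (100% load).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be within [0, 1]")
    if utilization < standby_below:
        return standby_w
    return idle_w + (peak_w - idle_w) * utilization


# Hypothetical accelerator: 5 W standby, 30 W idle, 250 W peak.
for load in (0.0, 0.5, 1.0):
    print(f"load {load:.0%}: {power_draw(load, 30.0, 250.0, 5.0):.1f} W")
```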
The proposed design is expected to address the ever-increasing demand for artificial intelligence and machine learning, aiming to provide a more effective platform that consumes less power. One concern may be that the scale of the design is too large for a small enterprise or a private cloud. However, through appropriate modular configuration, it could be adopted by many cloud platforms to improve the capabilities of their AI-oriented cloud services, such as image recognition, text translation and fraud detection.
IX. SIMILAR WORK
Sze et al. [49] provided a comprehensive overview of the basic components of DNNs, popular DNN models currently in use, and a number of current AI applications, and surveyed the techniques that enable efficient processing of DNNs. Among those techniques, they described the various resources used for DNN research and development, the hardware platforms used to process DNNs, and the optimizations that improve throughput and energy without impacting accuracy. They also highlighted the joint algorithm and hardware optimizations that can be performed on DNNs to improve both throughput and energy while trying to minimize the impact on accuracy. They focused on computer vision applications, the associated algorithms, and the data used to drive the algorithms, and described the key metrics that should be considered when comparing DNN designs.
Kachris and Soudris [50] provided a survey of frameworks for the efficient utilization of FPGAs in data centers. They first introduced cloud applications and their main characteristics, described the frameworks for the efficient deployment and virtualization of hardware accelerators in data centers, and then presented hardware accelerators for the most widely used cloud computing applications such as MapReduce, Spark, and Memcached. Furthermore, the paper provided a qualitative categorization and comparison of the proposed schemes based on their main features, such as speedup and energy efficiency.
Deng [51] presented a brief history of deep learning from the perspective of signal and information processing, and developed a categorization scheme to analyze the existing deep architectures in the literature into generative, discriminative, and hybrid classes. For each of the three categories, a tutorial example was chosen to provide more detailed treatment. The paper discussed the applications of deep learning for information processing in five broad areas, which are 1) Speech and audio, 2) Image, video, and multimodality, 3) Language modeling, 4) Natural language processing, 5) Information Retrieval.
X. CONCLUSION
When it comes to cloud servers, different players have their own custom designs. We have done a thorough survey of these designs through open materials spread over papers, documents and websites. We believe that, to expound the design points of the servers, some background knowledge should be explained in advance. We therefore begin by sorting out the key technologies related to deep learning: GEMM, quantization, and some emerging DNNs are described in order. The chipset market was once led by CPUs and GPUs, but FPGAs and ASICs are getting more and more attention. We describe all of them and compare their features. The effectiveness of the chipsets is inseparable from advanced server buses, so some typical buses are introduced. As shown in Table 2, through this investigation, we can answer the two questions raised in the introduction section.
Although there are differences between servers, the direction of development is clear: all players have their own GPU and FPGA servers, and Google even builds ASICs into its servers.
After delving into the servers and confirming our observations with experiments, we envision a new framework for next-generation cloud servers, which has flexibility in using GPUs, FPGAs or ASICs, making it possible to extract the maximum performance. At the same time, power consumption is taken into consideration for better power usage effectiveness. In conclusion, this survey can broaden the horizon and light the way for future AI-capable server designs. It also provides a thorough server tutorial for cloud-related researchers to understand cloud infrastructure and services more deeply.
