Innovating the Delivery of Server Technology with Kaleao KMAX
John Goodacre | University of Manchester
Wherever we turn these days, we find another end of the world as we know it in computing, typically followed by why some neurological brain-inspired or physical quantum effect will save the day. But what about finding a new way to assemble the components of a computing system that can still leverage the colossal investment in existing software? It's easy to find silicon technologists claiming 2.5D and 3D ICs can extend Moore's law into the third dimension (www.ieee.org/conferences_events/conferences/conferencedetails/index.html?Conf_ID=37452), and software architects are constantly creating new abstractions on top of virtualized systems, but where's the guide for how to use 3D to deliver software abstraction in hardware?
A Growing Problem
As markets that require high-investment engineering continue to grow and expand, various forces are driving both consolidation among suppliers and solution generalization to maximize return on investment. The semiconductor market has converged around two major suppliers: ARM, with its licensed IP suitable for embedding in a system on a chip (SoC), and Intel, with its roadmap of end devices born from the PC revolution. The legacy of the desktop architecture, in which a processor controls an extension bus, can still be seen across today's servers, with the processor at the center of control, whether it's delivering storage services or network content. Any specialization or scalability is limited by the general-purpose peripheral bus (such as PCIe) and its plugin adaptors and accelerators. The integration of all of this in a single die as a SoC is becoming common across the server/desktop market, albeit still following the same system architecture. There are various quoted and estimated levels of investment required to deliver a competitive processor device; for example, Gartner suggests more than US$270 million for a next-generation design (https://semiengineering.com/10nm-versus-7nm/). However, with only around 10 to 15 million servers sold in 2016 (www.gartner.com/newsroom/id/3626117), even accounting for multisocket servers, there isn't an easy return on investment to be had, all of which further entrenches servers into consolidation and generalization.
Design isolation among a computer's various components has now reached a point of significant imbalance: for example, the GPGPU's accelerator capability is starved of host data, while flash drive bandwidth and network packet rates exceed a single processor's processing capability. In addition to the debilitating disadvantage that comes with scaling, each node in such a system is also independent, without the ability to share and partition more capable components with less capable ones.
In addition, computers and servers now extend across myriad application scenarios, with each scenario requiring a different configuration or balance of compute resources. A one-size-fits-all generalization doesn't always work and, in fact, causes significant losses in cost and energy efficiency. The concept of being able to compose a set of computing resources for a specific application has been an industry goal, with various vendors looking for a mechanism to disaggregate resources into separated pools only to converge at the point of use. Initially, this meant physically separating resources, but today it means using additional layers of software above a virtualized abstraction of a computer, providing the illusion of resource composability through the execution of software and the evolution of software-defined everything. This article describes a solution to this problem: Kaleao.
The Foundation of a New System Architecture
The key foundation of the Kaleao technology is a new system architecture that allows a computer system's different resources to exist independently of other resources. Whether it's focused on storage, networking, or processing, each unit is capable of operating as a peer for the others, while also sharing common components and providing the opportunity to increase the density of one aspect of the platform to match the shared capability of another; one example is high-speed networking sharing its bandwidth among multiple processors or directly with a more capable storage system. While the system architecture is physically converged on the circuit board and SoC, with all components of a system physically close together, the architecture presents each one logically as if it were disaggregated. Thanks to the use of a common, globally addressable interconnection fabric, the components are composed only at the point of use. In effect, a Kaleao compute node no longer follows a von Neumann model of a central processing unit that owns a memory unit and adapts through an external bus to other resources through an input/output device. Each resource acts as an independent compute unit that can be interconnected across any topology or transport capable of carrying the common global address scheme. Existing components can be bridged and shared, while new components can be developed independently of a specific bus standard.
As Figure 1 shows, the hardware interconnects to form pools of compute resources. Physically converged at the board and SoC levels to ensure the benefits of resource locality's lower latencies and higher bandwidths, the compute node is defined by a collection of compute units and resource elements. Each of these components is physically interconnected across what Kaleao calls an architectural fabric. A small software module within a thin virtualization layer can then compose a machine's memory map, using the hardware-level addresses of the architectural fabric for the resources required for the instantiation of a specific class of machine.
Unlike the software methods deployed by abstractions running on commodity hardware, the microvisor layer on Kaleao physically allocates the hardware resources required to create mappings in the fabric and build a physicalized machine. After this point, the microvisor has little if any interaction between the hardware and the host machine: no virtual soft-switching of network traffic, no logical-to-physical mapping of storage blocks, no share forwarding, and no noisy neighbors, all of which are significant limitations in delivering the expected host performance to an application. The first Kaleao product, KMAX, has realized both network and storage composable resource pools, with R&D developments demonstrating the formation of composable pools of processors, memory, and FPGA accelerators (http://pure.qub.ac.uk/portal/files/17613806/ecoscale_date_2016_v4.1_151127.pdf). As is typical with today's plugin adaptors, there's no cache coherence requirement between the processor and an interface resource: any coherence requirement is managed explicitly by the software driver or the protocols used to support communication.
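To make the idea of composing a physicalized machine more concrete, here is a minimal Python sketch. It is not Kaleao's microvisor: the class names, the resource kinds, and every address are hypothetical, chosen only to illustrate exclusive allocation of fabric-addressed resources into a machine's memory map.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FabricResource:
    """A hardware resource reachable through a fabric address window (hypothetical)."""
    name: str
    kind: str          # e.g. "nic" or "nvme" (illustrative kinds only)
    fabric_base: int   # hypothetical fabric-level base address
    size: int          # window size in bytes


@dataclass
class ResourcePool:
    """Free resources that a composing module can hand out."""
    free: list = field(default_factory=list)

    def allocate(self, kind: str) -> FabricResource:
        # Physical allocation: the resource leaves the pool, so no other
        # machine can be given the same piece of hardware.
        for i, res in enumerate(self.free):
            if res.kind == kind:
                return self.free.pop(i)
        raise LookupError(f"no free resource of kind {kind!r}")


def compose_machine(pool: ResourcePool, kinds: list, map_base: int) -> dict:
    """Build a memory map {machine address: resource} for one machine."""
    memory_map = {}
    cursor = map_base
    for kind in kinds:
        res = pool.allocate(kind)
        memory_map[cursor] = res
        cursor += res.size
    return memory_map


pool = ResourcePool([
    FabricResource("nic0", "nic", 0x8000_0000, 0x1000),
    FabricResource("nic1", "nic", 0x8000_1000, 0x1000),
    FabricResource("nvme0", "nvme", 0x9000_0000, 0x4000),
])
machine_a = compose_machine(pool, ["nic", "nvme"], map_base=0x4000_0000)
for machine_addr, res in machine_a.items():
    print(f"{machine_addr:#x} -> {res.name} @ fabric {res.fabric_base:#x}")
```

Once the map is built, nothing in this sketch sits on the data path, which mirrors the point that the composing layer steps aside after allocation.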
Units of Compute
The concept of a unit of compute (Figure 2) was discussed as early as 2013 (www.techdesignforums.com/blog/2013/03/22/date-arm-unit-of-compute-energy-efficient-systems/). There have been various innovations around the "backside bus" (the side a processor uses to create a multicore) and the extended interconnect of a small number of memory-coherent processors so as to provide a coherent nonuniform memory access (cNUMA) processing unit, but the architectural concept that such a unit is still connected to the outside through a "front-side bus" is well ingrained in almost all of today's computers. The Kaleao compute unit, however, can encapsulate any structure for compute, from a single CPU with a local memory to a multisocket, multicore, fully heterogeneous, coherent processor. The only requirement is that any compute unit must be able to address both local memory and a global address space. (This can be accomplished with existing processors simply by using the most significant bit of the memory address or, more flexibly, through future interconnect and memory segmentation schemes.) Likewise, it must accept the ability of another compute unit to address its local memory, with coherence if it wants to share its local resources, or without coherence if it's only going to present its resources to others. Various configurations of global address virtualization, remapping, and memory coherence also exist, but their description is outside this article's scope.
Given one or more compute units, a system can share and use various resources owned by each specific compute unit. Although this might raise concerns about circular paths and deadlocks, Kaleao avoids such issues by ensuring that each unit honors certain rules for free-flowing data. As Figure 3 shows, a system effectively has two address spaces: each unit has a local address space, and all units share a global address space. Because the fabric used to implement the global address space is directly addressable by each compute unit, it's easy to share resources between units by placing the addressable resources in the global address map (contrast this with the local address map of a single unit's processor, as is traditional today). Transporting transactions in the global address space can happen through an on-chip interconnect, wires between units within a package, or between boards using a high-speed backplane link or an associated network protocol.
The first-generation Kaleao KMAX compute node was defined with six separate compute units: four to provide the general-purpose compute and the other two to provide the networking and storage resources to be physicalized. There's no cache coherence between the units, so each unit only presents its resources to the others rather than sharing them coherently. Kaleao uses FPGA technology in the storage and network units to deliver hardware-class performance; it removes the management and data movement costs that currently limit the performance of virtualized and hyperconverged platforms, where compute resources are brought together in a single physical box and exposed to the virtual machine as a per-node partition of resources through software.
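The split between a local and a global address space can be sketched with the most-significant-bit scheme mentioned earlier. The following Python fragment is illustrative only: the 40-bit address width, the contents of the global map, and the unit names are assumptions made for the example, not details of the Kaleao fabric.

```python
ADDRESS_BITS = 40                       # assumed address width
GLOBAL_FLAG = 1 << (ADDRESS_BITS - 1)   # most significant bit marks "global"

# Hypothetical global address map: (base offset, size, owning unit).
GLOBAL_MAP = [
    (0x0000_0000, 0x1000_0000, "network-unit"),
    (0x1000_0000, 0x1000_0000, "storage-unit"),
]


def route(address: int) -> tuple:
    """Return (target, offset) for an address issued by a compute unit."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address outside the assumed address width")
    if not address & GLOBAL_FLAG:
        # Top bit clear: the access stays inside the unit's own memory.
        return ("local", address)
    # Top bit set: strip the flag and look the offset up in the global map.
    offset = address & (GLOBAL_FLAG - 1)
    for base, size, owner in GLOBAL_MAP:
        if base <= offset < base + size:
            return (owner, offset - base)
    raise LookupError(f"unmapped global offset {offset:#x}")


print(route(0x1234))                      # stays local
print(route(GLOBAL_FLAG | 0x1000_0040))   # lands in the storage unit's window
```

The transport that carries a global access (on-chip interconnect, package wires, or a backplane link) is deliberately absent from the sketch; only the address decision is modeled.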
Compute Node
On first inspection, the KMAX compute node doesn't look like a typical computer motherboard. It uses manufacturing methodologies that provide the high reliability and robustness of an embedded computer, as opposed to the less robust adaptor-based, plugin, moving-parts approaches of traditional server platforms. This tradeoff can be made because the KMAX node provides inherent resource composability, rather than requiring a different adaptor to be fitted to obtain, for example, a different networking capability. Also, the core-to-DRAM capacity ratio has been set with a per-thread capacity generally accepted by target market applications, offering additional robustness and significant cost savings while also increasing the core-to-byte bandwidth ratio significantly. Figure 4 shows the first KMAX node. On the left are the four general-purpose compute units, the DRAM, and a local nonvolatile caching drive. These units are then interconnected using the Kaleao architectural fabric to the two devices on the right of the board. The top device is a unit that manages network resources; the lower device handles storage resources. The actual storage capacity is provided by an NVMe SSD mounted directly on the rear of the board. The storage processing unit provides the capability to deliver a fully distributed and resilient storage array when multiple nodes are installed in a single system. Because each storage unit on each compute node creates the distributed storage provision directly, the KMAX Appliance Edition is able to deliver a system with distributed storage without needing to run a storage application on the host processors. Also, because the architectural fabric connects the network and storage units, distributed storage can be delivered without host processors being powered on.
Blades
Having created the physically converged but logically disaggregated compute node, Kaleao can now extend the networking capability among four compute nodes to form a physical blade. Future designs can also extend the architectural fabric across this network, allowing composability of resources using pools larger than those of a single board.
The KMAX blade embeds a multiport 10/40 Gbit/s switch, which provides each blade with dual 40 Gbit/s Ethernet ports at the front of the blade. The rear provides the interface for an independent 1 Gbit/s management network and the distribution of the 48 Vdc supplied at chassis level to the point of load. Using a 48 V power distribution strategy also provides power savings associated with the removal of various DC/AC and DC/DC conversions, enabling the delivery of high currents in less space and at lower cost. Each blade provides four SSD slots, each capable of hosting up to a 7.68 Tbyte NVMe drive.
Chassis Scalability
Within the standard 19-inch 3-RU (rack unit) KMAX chassis, 12 blades deliver 960 Gbit/s of aggregate Ethernet bandwidth through 24 QSFP ports, 48 SSD slots supplying up to 360 Tbyte of NVMe flash storage, and 192 processors. Each processor supports 128 Gbyte of local NV cache, 4 Gbyte of LPDDR4, and 8 ARM Cortex cores arranged as an 8-core symmetric multiprocessor (SMP) that consists of 4 Cortex-A57 and 4 Cortex-A53 in the ARM big.LITTLE arrangement (https://www.arm.com/products/processors/technologies/biglittleprocessing.php), thereby effectively providing each main application thread with 1 Gbyte of memory, a typical configuration for web-scale applications.
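The chassis totals follow directly from the per-blade and per-node figures. The short Python calculation below reproduces them from numbers quoted in this article; the variable names are illustrative, and the QSFP ports are taken as 40 Gbit/s each, the rate given for the front-panel ports in the Software Model section.

```python
BLADES_PER_CHASSIS = 12
NODES_PER_BLADE = 4
PROCESSORS_PER_NODE = 4        # general-purpose compute units per node
QSFP_PORTS_PER_BLADE = 2
QSFP_GBIT_S = 40               # per-port Ethernet rate
SSD_SLOTS_PER_BLADE = 4
CORES_PER_PROCESSOR = 8        # 4 Cortex-A57 + 4 Cortex-A53
BIG_CORES_PER_PROCESSOR = 4    # the "main application" threads
DRAM_GBYTE_PER_PROCESSOR = 4

nodes = BLADES_PER_CHASSIS * NODES_PER_BLADE
processors = nodes * PROCESSORS_PER_NODE
qsfp_ports = BLADES_PER_CHASSIS * QSFP_PORTS_PER_BLADE
ethernet_gbit_s = qsfp_ports * QSFP_GBIT_S
ssd_slots = BLADES_PER_CHASSIS * SSD_SLOTS_PER_BLADE
cores = processors * CORES_PER_PROCESSOR
dram_per_main_thread = DRAM_GBYTE_PER_PROCESSOR / BIG_CORES_PER_PROCESSOR

print(f"nodes per chassis:        {nodes}")
print(f"processors per chassis:   {processors}")
print(f"QSFP ports:               {qsfp_ports}")
print(f"aggregate Ethernet:       {ethernet_gbit_s} Gbit/s")
print(f"SSD slots:                {ssd_slots}")
print(f"cores per chassis:        {cores}")
print(f"DRAM per main app thread: {dram_per_main_thread} Gbyte")
```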
Each QSFP port can be lagged and configured with different network routes, with a blade also able to support integrated port-stacking and create a single multiport switch across the chassis, offering up to a 12:1 port-blocking ratio without needing to supply an additional rack-mounted switch.
The chassis includes additional functionality in terms of management, offering local screen/keyboard access to the management console in addition to remote network access from an independent 1 Gbit/s management network. This chassis manager also provides control over the PSU rectifiers and power state of chassis components, thereby controlling the overall power consumed by the system during inactive periods. The unobstructed paneling on each blade also allows for high air flows through the chassis, which in turn significantly reduces demands on inlet air temperature.
The Metric of Scale-Out
The concept of scale-out processing (https://infoscience.epfl.ch/record/176330/files/sop_isca12.pdf) and its applicability and performance for cloud server workloads is well researched (https://scholar.google.com/scholar?oi=bibs&hl=en&cites=668093864912588690). The KMAX approach further optimizes the concept through its system architecture by balancing resources and removing unnecessary cost and inefficiencies from the duplication of common infrastructure such as power delivery, boot, and management.
Because the KMAX architecture views each processor as a compute unit, only the core, memory, and high-speed I/O links to the other node units are powered and active. In common with the Samsung Galaxy S6, the KMAX compute unit utilizes the Samsung Exynos 7420. Anandtech provides a thorough deep-dive on this processor, showing both core performance and power consumption (http://www.anandtech.com/show/9330/exynos-7420-deep-dive). The KMAX node has four Exynos subsystems, each supporting the advanced power management unit, a 4 Gbyte PoP LPDDR4 memory, and a 128 Gbyte UFS2 local flash storage device; these are the unit of scale-out. The KMAX design delivers a nonblocking network to each processor, while splitting the available NVMe bandwidth between the processors. There are two 5 Gbit/s links from each processor to the network and storage units, and two 10 Gbit/s links from the network processing unit to the blade-level switch, which connects the four nodes per blade to the front panel at two 40 Gbit/s. This arrangement allows each compute unit to scale out its performance in a linear manner.
The general-purpose Exynos 7420's performance can be seen through the Geekbench benchmark, with a multicore score of around 4,200 (https://browser.primatelabs.com/v4/cpu/search?q=Galaxy+S6). The advanced KMAX cooling approach enables the processor to run without the thermal throttling seen in a phone, maintaining 2 GHz at 5 W consumption. Together, the TDP of the entire KMAX node is 60 W, with each node providing a scale-out aggregate Geekmark of 16,800, or 280 "Geeks" per watt. A major benefit of this benchmark is that it can be used to compare across platforms, including traditional server platforms. To compare performance, we can take a high-performance quad-socket Xeon E5-4669 server with a total of 88 cores (https://www.supermicro.nl/products/system/2U/2048/SYS-2048U-RTR4.cfm), providing a multicore Geekmark of 69,256 and a node power of 800 W, leading to 86 Geeks per watt (https://browser.primatelabs.com/v4/cpu/2092496). For performance density, KMAX provides 48 nodes, for an aggregate Geekmark of 806,400 in 3 rack units, or 268,800 per rack unit; the Xeon system is 2 U and therefore 34,628 per rack unit.
The scale-out approach also scales the number of channels to memory: each Exynos processor has an LPDDR4 device with a theoretical peak of 24.88 Gbyte/s, which scales out to 2 Tbyte/s per rack unit. By comparison, a typical quad-channel Xeon E5 provides 68 Gbyte/s, which amounts to 136 Gbyte/s per rack unit even in the quad-socket design.
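The nonblocking claim can be checked with a simple oversubscription ratio: offered downstream bandwidth divided by available upstream bandwidth, where 1.0 or less means the uplinks can carry everything below them. The Python sketch below assumes one reading of the link counts above, namely that each processor has one 5-unit link toward the network unit (the other going to storage). The check depends only on ratios, so it holds whichever rate unit is used, as long as it is the same throughout.

```python
def oversubscription(downlinks: int, downlink_rate: float,
                     uplinks: int, uplink_rate: float) -> float:
    """Offered downstream bandwidth divided by available upstream bandwidth."""
    return (downlinks * downlink_rate) / (uplinks * uplink_rate)


# Node level: 4 processors, each with one link of rate 5 to the network unit
# (assumed reading), feeding 2 node uplinks of rate 10.
node_ratio = oversubscription(downlinks=4, downlink_rate=5,
                              uplinks=2, uplink_rate=10)

# Blade level: 4 nodes with 2 uplinks of rate 10 each, feeding the
# 2 front-panel ports of rate 40.
blade_ratio = oversubscription(downlinks=4 * 2, downlink_rate=10,
                               uplinks=2, uplink_rate=40)

print(f"node-level oversubscription:  {node_ratio:.2f}")
print(f"blade-level oversubscription: {blade_ratio:.2f}")
```

Under these assumptions both ratios come out at exactly 1, which is consistent with a nonblocking path from each processor to the front panel.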
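The efficiency and density figures above are straightforward arithmetic on the quoted scores, power, and rack heights. The following Python snippet reproduces them; all inputs are the numbers given in this article, and the variable names are illustrative.

```python
# KMAX figures quoted in the article
EXYNOS_MULTICORE_SCORE = 4200
PROCESSORS_PER_NODE = 4
KMAX_NODE_WATTS = 60
KMAX_NODES_PER_CHASSIS = 48
KMAX_CHASSIS_RU = 3

# Quad-socket Xeon E5-4669 server figures quoted in the article
XEON_SCORE = 69256
XEON_WATTS = 800
XEON_RU = 2

kmax_node_score = EXYNOS_MULTICORE_SCORE * PROCESSORS_PER_NODE
kmax_per_watt = kmax_node_score / KMAX_NODE_WATTS
kmax_chassis_score = kmax_node_score * KMAX_NODES_PER_CHASSIS
kmax_per_ru = kmax_chassis_score / KMAX_CHASSIS_RU

xeon_per_watt = XEON_SCORE / XEON_WATTS   # about 86.6; quoted as 86 in the text
xeon_per_ru = XEON_SCORE / XEON_RU

print(f"KMAX node score:      {kmax_node_score}")
print(f"KMAX score per watt:  {kmax_per_watt:.1f}")
print(f"KMAX score per RU:    {kmax_per_ru:.0f}")
print(f"Xeon score per watt:  {xeon_per_watt:.1f}")
print(f"Xeon score per RU:    {xeon_per_ru:.0f}")
print(f"per-watt advantage:   {kmax_per_watt / xeon_per_watt:.1f}x")
print(f"per-RU advantage:     {kmax_per_ru / xeon_per_ru:.1f}x")
```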
Software Model
For a software engineer, the easiest way to think of a KMAX node is to see it as four separate operating system (OS) targets. Each target in that regard is an 8-core ARMv8 64-bit processor providing four main application hardware threads and four threads for any management or less critical task; scheduling is automatic through the big.LITTLE scheduler extensions in the Linux kernel (https://www.arm.com/files/pdf/big_LITTLE_technology_moves_towards_fully_heterogeneous_Global_Task_Scheduling.pdf).
Each OS target is presented with network and storage resources dependent on the Kaleao product edition. In the server edition, each OS is supported by a platform network driver capable of being configured with a flexible number of network adaptor interfaces. Each network interface shares the two 10 Gbit/s node network uplinks, with the associated networks able to switch to a flexible topology exposed through the two QSFP ports (2 × 4 × 10 Gbit/s or 2 × 40 Gbit/s) at the front of each blade. For storage, each of the processor compute units has direct block access to the local SSD.
For software developers, the KMAX server platform is therefore simply a standard Linux target. You have access to network interfaces, a dedicated local flash disk, and the node-level NVMe SSD. You can install any of the associated Linux distribution packages, and you can build and debug directly on the server platform by using GNU build tools or across platforms by installing tools such as Linaro cross-toolchain binaries.
In the KMAX Appliance Edition, the platform orchestration software delivers software-defined network and software-defined storage capabilities. Both are provided from across the architectural fabric and are backed by independent hardware resources, which allows those resources to be pinned directly into the associated virtual machine. In addition, the storage devices are now exposed as a portion of the distributed capacity of the SSDs across the system, with replication, cloning, and locality capabilities.
For software applications and user-space Linux libraries, nothing changes. Each compute unit is a single SMP operating system target with local storage and networking devices. Each target then has the full catalogue of a Linux distribution with ARM64 support available. The Kaleao KMAX platform can be considered either as a bare metal ARM64 server target or an appliance able to deliver fully orchestrated services. The new system platform architecture thus enables KMAX to deliver 10 times the compute density at a quarter of the power and a third the cost of today's alternative server solutions.
John Goodacre holds a professorship in computer architectures in the School of Computer Science at the University of Manchester and is the director of technology and systems in the research group at ARM. He's also Kaleao's co-founder and CTO. Contact him at john.goodacre@manchester.ac.uk.
