• Networks → Physical links; Error detection and error correction; Network control algorithms;
ABSTRACT
Rack-scale systems contain thousands of densely packed, connected components. While a data center may accommodate a fully provisioned network, rack-scale systems demand a more compact and versatile network that can fit within a heavily populated system. Unless the critical path between communicating hosts is made faster, distributed rack-scale applications cannot scale. We present adaptive rack-scale fabrics, an architecture that uses Physical Layer Primitives coupled with a Closed Ring Control. The resulting fabric uses pre-fetching techniques, but at the physical layer of the interconnect, to optimize performance within strict power-budget limitations.
MOTIVATION
Rack-scale systems do not necessarily follow the CPU-board-centric architecture that traditional racks use [4]. Instead of using regular server blades, we strip down the components and redesign according to the relevant metric: NVMe for fast storage, a significant amount of DRAM for caching, etc. This leads to a layout of hundreds and even thousands of interconnected nodes in a single rack. As a result, within a single rack we find a network as sophisticated and complex as that of a data center, only much more constrained. In particular, two problems arise: latency and power consumption. Figure 1 shows the latency a packet experiences when traversing multiple hops through layer 2 cut-through switches. It also shows that the delay due to the media (e.g., fiber) is negligible relative to the delay due to packet switching. The conclusion is that, at the scale of a rack, it is packet switching that prevents distributed rack-scale applications from scaling. As an example, consider a MapReduce operation that requires transmission from all nodes. Since a reducer has to wait for data from all mappers, the slowest link pulls down the performance of the entire system.
Figure 1: The latency due to propagation of packets in the media vs. the latency due to a packet traversing a state-of-the-art layer 2 cut-through switch. We assume a switch every 2 meters. At the scale of a rack, the latency due to packet switching is dominant and hence bottlenecks scalability.
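The comparison between media delay and switching delay can be sketched with simple arithmetic. The constants below are assumptions for illustration and are not taken from the measurements behind Figure 1: roughly 5 ns per meter of fiber propagation, and a hypothetical 300 ns per cut-through switch traversal. Only the spacing of one switch every 2 meters comes from the text.

```python
from typing import Tuple

# Assumed, illustrative constants (not values reported in the paper).
FIBER_DELAY_NS_PER_M = 5.0   # light in fiber travels at roughly 2e8 m/s
SWITCH_LATENCY_NS = 300.0    # hypothetical cut-through switch latency
SWITCH_SPACING_M = 2.0       # one switch every 2 meters, as stated in the text


def path_latency_ns(distance_m: float) -> Tuple[float, float]:
    """Return (media_ns, switching_ns) for a path of the given length."""
    media = distance_m * FIBER_DELAY_NS_PER_M
    hops = int(distance_m // SWITCH_SPACING_M)  # switches crossed on the way
    switching = hops * SWITCH_LATENCY_NS
    return media, switching


if __name__ == "__main__":
    for distance in (2, 4, 8, 16):
        media, switching = path_latency_ns(distance)
        print(f"{distance:>2} m: media {media:6.1f} ns, "
              f"switching {switching:7.1f} ns")
```

Under these assumed numbers the switching term exceeds the media term at every distance, because both grow linearly with path length but the per-hop switching cost is far larger than the propagation delay across a 2-meter segment.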
Power budget is also a constraint, since rack-scale systems inherit the power budget of a traditional rack; it is factored into our proposed architecture, as shown in Figure 2. Three key points of the architecture are:
• Backwards compatibility: No restructuring of the network layer is needed. In particular, existing applications benefit from the architecture with no required change.
• Media agnostic: The specific underlying media is irrelevant. We only expect it to provide some subset of the Physical Layer Primitives that we define.
• Forward compatibility and fast adoption: Novel physical layer advancements can be easily integrated into a system already running our CRC.
SIGCOMM Posters and Demos '18, August 20-25, 2018, Budapest, Hungary. O.S. Sella et al.
PROPOSED ARCHITECTURE
Configurable interconnects have seen many advances in recent years, both on the optics side, as in [2], and on the electrical side, as in [3]. While these solutions differ in the underlying media (optical vs. electrical) as well as in configuration times, they can be treated as functionalities added to the (already existing) physical layer. We place these extensions to the physical layer under a single framework, which we call Physical Layer Primitives (PLP). In turn, these PLPs are orchestrated by a control mechanism that also schedules flows according to the availability of PLPs. The control part of the architecture, called Closed Ring Control (CRC), uses feedback from the interconnect, such as latency and power consumption, to tag each link with a cost function. In this way, both routing and changes to the topology are subject to the tools of control theory. By decoupling the development of PLPs from innovation in the CRC, we achieve two goals: (1) allowing new physical layer improvements to be coupled immediately with a control algorithm, and (2) enabling faster data center adoption of high-cost disruptive technologies. A system that already uses our CRC will seamlessly absorb any physical layer advancement that can be characterized as a PLP.
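The idea of tagging each link with a cost derived from interconnect feedback, and then routing over those costs, can be sketched as follows. This is a minimal illustration and not the CRC algorithm itself: the weighted-sum cost, the weights, the names `LinkStats`, `link_cost`, and `cheapest_path`, and the use of Dijkstra's shortest-path algorithm are all assumptions made for the example.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LinkStats:
    """Per-link feedback reported by the interconnect (hypothetical fields)."""
    latency_ns: float
    power_w: float


def link_cost(stats: LinkStats, w_latency: float = 1.0,
              w_power: float = 10.0) -> float:
    """Assumed cost function: a weighted sum of latency and power."""
    return w_latency * stats.latency_ns + w_power * stats.power_w


def cheapest_path(links: Dict[Tuple[str, str], LinkStats],
                  src: str, dst: str) -> Tuple[float, List[str]]:
    """Dijkstra over cost-tagged, bidirectional links."""
    graph: Dict[str, List[Tuple[str, float]]] = {}
    for (a, b), stats in links.items():
        cost = link_cost(stats)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {src: 0.0}          # cheapest known cost to reach each node
    prev: Dict[str, str] = {}  # predecessor on the cheapest known path
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))

    if dst not in best:
        raise ValueError(f"no path from {src} to {dst}")
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return best[dst], path
```

Because the cost is recomputed from fresh statistics on every call, a change in measured latency or power on one link is enough to shift traffic to another path, which is the feedback behavior the control loop relies on.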
Physical Layer Primitives (PLP)
We assume that a physical link is made up of physical lanes. The canonical example is a 100 Gbps link made of four 25 Gbps physical lanes; different wavelengths under wavelength division multiplexing are an equivalent example. Looking at [3] and [2], we can identify several Physical Layer Primitives and, in addition, derive new ones:
( 
Closed Ring Control (CRC)
The Closed Ring Control, or CRC, uses per-link price tags, based on metrics such as latency, congestion, and link health, to allocate PLPs and schedule flows. The problem that arises in all reconfigurable fabrics is finding the minimum flow size for which reconfiguration is worth the cost. This can be formulated as an optimization problem and solved in a distributed manner by the CRC. Further insights on rapid provisioning and reconfiguration, as well as traffic engineering for virtual switching, can be found in Andromeda [1]. Figure 2 shows a CRC embedded in the rack. Upon receiving per-link statistics, the CRC issues PLP instructions to improve the target metric, e.g., latency, by reducing the amount of switching logic that a packet has to go through.
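The minimum flow size for which reconfiguration pays off can be illustrated with a deliberately simple break-even model. It assumes, purely for illustration, that reconfiguration has a fixed one-time delay and raises the bandwidth available to the flow (for example by assigning it more lanes); the function names and the model itself are hypothetical and are not the optimization problem solved by the CRC.

```python
def break_even_flow_bytes(reconfig_s: float, old_gbps: float,
                          new_gbps: float) -> float:
    """Flow size at which reconfiguring and not reconfiguring take equal time.

    Without reconfiguration: t = size / old_rate
    With reconfiguration:    t = reconfig_s + size / new_rate
    Setting both equal gives size = reconfig_s / (1/old_rate - 1/new_rate).
    """
    if new_gbps <= old_gbps:
        raise ValueError("reconfiguration must increase bandwidth")
    old_rate = old_gbps * 1e9 / 8  # bytes per second
    new_rate = new_gbps * 1e9 / 8  # bytes per second
    return reconfig_s / (1.0 / old_rate - 1.0 / new_rate)


def should_reconfigure(flow_bytes: float, reconfig_s: float,
                       old_gbps: float, new_gbps: float) -> bool:
    """True if the flow is large enough to amortize the reconfiguration."""
    return flow_bytes > break_even_flow_bytes(reconfig_s, old_gbps, new_gbps)


if __name__ == "__main__":
    # Assumed example values: 1 ms reconfiguration, 25 Gbps to 100 Gbps.
    threshold = break_even_flow_bytes(1e-3, 25.0, 100.0)
    print(f"break-even flow size: {threshold / 1e6:.2f} MB")
```

In this model the threshold scales linearly with reconfiguration time, so a primitive that reconfigures faster makes reconfiguration worthwhile for proportionally smaller flows.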
EVALUATION
Since rack-scale systems contain hundreds to thousands of connected nodes, we use simulation to evaluate the solution. We chose OMNeT++ as our simulation framework. To ensure that the large-scale simulation is sound and credible, we begin with a small-scale simulation verified by a hardware proof of concept (POC). We intend to use the NetFPGA SUME platform [5] for the hardware POC. Once the small-scale simulation is validated, the POC will be integrated into the large-scale simulation.
