Emerging applications, such as cloud computing, the internet of things, and augmented/virtual reality, need responsive, available, secure, ubiquitous, and scalable datacenter networks. Network management currently uses simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under an intelligent, millisecond-latency control plane that runs data-driven performance and security policies. However, to meet users' quality-of-service expectations in a modern data center, networks must operate intelligently at line rate.
INTRODUCTION
The tremendous scale of modern data centers (tens of thousands of servers, connected by elaborate networks [45, 78, 95]) causes many logistical and technical challenges [10, 38, 88]. Moreover, the high throughput and low latency requirements of emerging workloads (e.g., cloud computing, the internet of things, and augmented/virtual reality) make managing such large, complex networks challenging [10, 45, 88]. When implementing management policies (e.g., for performance or security), network operators face a stark trade-off: they must choose between line-rate execution and computational complexity.
Data-plane devices (e.g., switches and NICs) can react in nanoseconds to network conditions, but have a limited programming model designed to forward packets at line rate (e.g., flow tables [13]). This restricts network operations to simple heuristics [5, 60, 65] in data-plane devices and purpose-built tasks in fixed-function hardware (e.g., middleboxes [22, 68]). A security policy for anomaly detection, for example, would reuse flow tables, intended as L3 routing or L2 forwarding tables, to implement blacklists or Access Control Lists (ACLs). Such policies, therefore, operate within the constraints of current data-plane abstractions, which set forth a binary world: packets matching a blacklist are dropped, and all others are forwarded. Nevertheless, data-plane devices process every packet, so they can capture fine-grained statistics (using counters and sketches) and make a new decision for each packet.
Control-plane servers can make complicated, data-driven decisions, but only for a few packets (e.g., the first of every flow). Later packets match the cached decisions, installed in the data plane as flow rules, and are forwarded directly by the data plane. By using more data, a centralized control plane can make better decisions, providing better performance and security. For example, servers (possibly with accelerators [58, 79]) can implement learning anomaly-detection algorithms like clustering, support-vector machines, and neural networks [71, 81, 102]; these algorithms can automatically find latent non-linear correlations between features.
Ideally, network processing would be data-driven and react to every packet-all packets could be sent through the control plane, or data-plane devices could be more flexible. Caching data-driven, per-packet decisions would provide per-packet reactivity, but header instability would effectively result in all packets being processed in the control plane. This approach would decrease performance by about three orders of magnitude, precluding data-driven performance tuning and restricting data-driven anomaly detection to the most hardened networks. The better approach is a more flexible data plane: by adding a new abstraction designed for decision-making, not packet forwarding, switches and NICs can improve their functionality with minimal hardware (compared to intelligent decision-making with flow tables).
Data planes, today, use only three abstractions to bridge the programmer-hardware gap: packet parsing maps to Finite State Machines (FSMs) [37], flow rules map to Match-Action Tables (MATs) [13], and scheduling maps to Push-In First-Out (PIFO) queues [98]. Any new abstraction must therefore also be ubiquitous, general-purpose, and provide a coherent high- and low-level interface. Machine Learning (ML) provides a broad high-level interface suitable for many applications, including supervised and reinforcement learning. Anomaly detection would use supervised learning: operators identify anomalous packets after the fact, which lets a model learn to predict other anomalies. Reinforcement Learning (RL) is more useful for automatic performance tuning: by automatically trying small tweaks to a running model and seeing which ones improve performance, the system adapts itself.
Most ML algorithms are built around linear algebra, which uses a significant amount of repetitive computation, performed on a small number of weights, with regular communication. Unnecessary flexibility, such as the all-to-all VLIW communication [120], large memories, and ternary CAMs in MATs [13], consumes chip area without benefiting ML; prior attempts at ML in switches have failed due to this inefficiency [97]. Map-reduce, on the other hand, is a good low-level abstraction for ML because it provides the necessary computations (large numbers of multiplies and adds) and no unnecessary flexibility.
Although ML can make data-driven decisions, it cannot handle all network functionality. ML is suited for decisions currently made by (approximate) heuristics, such as congestion control, ECMP, and anomaly detection; these decisions impact only networks' performance and security, not their core packet-forwarding behavior. Networks built using ML will therefore use flow rules to express a range of valid decisions (e.g., output ports), and the ML model will optimize best-case while bounding worst-case performance by selecting from a pre-determined set of decisions. An intelligent control plane will thus compile user programs phrased as constrained optimization problems: for example, minimizing congestion while ensuring a certain bandwidth for high-priority flows.
In this paper, we present Taurus, a data plane augmented with a new ML abstraction and programmable map-reduce hardware for intelligent (data-driven) packet forwarding. The control plane receives telemetry data from the entire network (e.g., via In-Band Network Telemetry, or INT [61]), trains new switch weights, and installs them in Taurus alongside traditional flow rules for packet forwarding. To operate at line rate, Taurus's map-reduce block implements only the multiply and add operations needed for ML. The map-reduce block works alongside parsers, MATs, and the scheduler to forward packets, with MATs connecting map-reduce to the pipeline: pre-processing MATs extract numeric input features from packets, the map-reduce block uses these features and an ML model to generate a numeric result, and post-processing MATs transform this output into a packet-forwarding decision.
Recent coarse-grained accelerators for data analytics [44, 86, 104] underpin Taurus's map-reduce block: a user-defined program graph is spatially mapped to a reconfigurable array, through which data flows. Taurus's map-reduce block is tailored for line-rate inference: unnecessary operations, including DRAM access and floating-point arithmetic, are eliminated, while a gridded organization of compute and memory units is maintained, with 16 SIMD lanes per compute unit and 16 banks per memory unit. We evaluate the overhead of Taurus's map-reduce block as an addition to a programmable switch, and we demonstrate that it adds an average latency of 178 ns and 24% more area to implement a range of proposed algorithms.
By enabling data-plane ML with low overhead and a clear abstraction, Taurus moves data-driven processing from a per-flow to a per-packet level and lets complex performance and security policies run at line rate.
In summary, we make the following contributions:
• A Taurus logical pipeline using a map-reduce abstraction for line-rate, per-packet inference (§3).
• A hardware design of a Taurus-enabled switch with a reconfigurable SIMD dataflow engine [86] for map-reduce (§4).
• Analysis of the design using ASIC synthesis and a 28 nm generic library [40] to determine area and power overheads relative to commercially available switches (§5.1).
• Evaluation with real ML networking applications (§5.2) and microbenchmarks (§5.3), showing that Taurus supports the functions common in modern ML at line rate (1 GPkt/s).
We begin by motivating the need for an intelligent data plane (§2) and highlighting both the importance of per-packet ML and the limitations of existing data-plane (§2.1) and control-plane (§2.2) ML approaches.
Ethics: This work does not raise any ethical issues. This research involves no human subjects, and formal institutional review is not required.
WHY AN INTELLIGENT DATA PLANE?
Taurus is an intelligent data plane that runs ML at line rate for every packet and uses ML's output to optimize forwarding decisions. Machine learning provides significant improvements in traffic engineering [114], scheduling [112, 115], and security [4, 26, 56, 71, 102]. SIMON [36] reconstructs queuing delays in network switches with higher accuracy than edge-based methods [46, 76]. Furthermore, decision trees (Remy [112]) and recurrent neural networks (Pantheon [115]) for congestion control have a throughput-latency frontier beyond that of many human-designed algorithms [27, 48, 116]. Many of these algorithms make use of sub-flow features. For example, Remy uses RTTs and ACKs, and anomaly detection (e.g., KDD Cup entries [103]) uses packet-level features like connection duration. Boutaba et al. [15] survey ML-based networking applications and find that tasks like traffic classification make liberal use of packet-level features [7, 29-32, 67, 75, 82, 101, 117-119]. Even for encrypted traffic, packet-level features like inter-arrival times and packet sizes allow classification [6, 11].
Per-Packet ML: A Case Study.
To highlight the importance of packet-level features, we build a simple example using an anomaly-detection DNN [102] and the updated NSL-KDD [103] intrusion-detection data set. The DNN uses TCP-level features available in the data set (e.g., the current connection duration or the number of observed packets with an urgent TCP flag set [42]), but we exclude features only available after a flow ends (e.g., total source and destination bytes transferred). We measure the DNN's performance with and without packet-level features after training for ten epochs. Packet-level features improve our model's accuracy by over 25% and reduce the number of missed anomalies (i.e., false negatives) by a factor of two, as shown in Table 1. In short, packet-level events let ML models better understand network behavior and make more accurate decisions.
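To make the setup concrete, the sketch below captures the shape of this experiment in Python with Keras. It is a minimal stand-in rather than our evaluation harness: the random arrays and feature counts are placeholders for the NSL-KDD features, while the DNN topology (12, 6, and 3 hidden units with two softmax outputs [102]) and the ten training epochs follow the text.

    # Illustrative sketch only: random placeholder data stands in for the
    # NSL-KDD features; the feature counts are hypothetical.
    import numpy as np
    from tensorflow import keras

    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=2000)                  # 0 = normal, 1 = anomalous
    X_flow = rng.random((2000, 4), dtype=np.float32)   # flow-level features only
    X_full = np.hstack([X_flow,
                        rng.random((2000, 2), dtype=np.float32)])  # + packet-level

    def build_dnn(n_features):
        # Topology of the anomaly-detection DNN [102]: 12, 6, and 3 hidden
        # units, with two outputs (P(malicious), P(safe)).
        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),
            keras.layers.Dense(12, activation="relu"),
            keras.layers.Dense(6, activation="relu"),
            keras.layers.Dense(3, activation="relu"),
            keras.layers.Dense(2, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    for name, X in (("flow-only", X_flow), ("flow + packet-level", X_full)):
        model = build_dnn(X.shape[1])
        model.fit(X, y, epochs=10, verbose=0)          # ten epochs, as in the text
        _, acc = model.evaluate(X, y, verbose=0)
        print(f"{name}: accuracy = {acc:.3f}")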
Limitations of Data-Plane ML
There have been a number of recent attempts to use current switch abstractions (i.e., MATs) and specialized hardware [36] to support per-packet ML models.
Inference on MATs.
The match-action abstraction is insufficient for line-rate ML in modern data-plane devices, due to both missing operations (especially loops and multiplication) and inefficient MAT pipelines [13]. Binary neural networks have been implemented, using tens of MATs each, but they lack the precision needed for practical deployments [97]. Likewise, an SVM for IoT classification [113] consumes most of the memory of a NetFPGA switch (an experimental research platform [2, 69]) and has not been mapped onto a real switch ASIC. As these techniques use switches' VLIW pipelines to implement simpler, SIMD programs (with lower memory requirements), they use only a small fraction of the MAT hardware while rendering the entire stage unavailable.
VLIW Parallelism. The difference in communication requirements between a VLIW model and a SIMD model is illustrated in Figure 1. VLIW models, used in current switch MATs [13], have multiple logically-independent instructions per stage operating in parallel, reading from and writing to arbitrary locations. This all-to-multiple input communication and multiple-to-all output communication requires large crossbars. For example, a 16-issue VLIW processor has 20× as much control logic as an equally-powerful cluster of eight dual-issue processors [120]. VLIW's overhead thus limits the number of instructions per stage. Barefoot's Tofino chip executes only 12 operations per stage: four of each of 8, 16, and 32 bits [47]. A typical DNN layer may require 72 multiplications and 144 additions [102]; even if multiplication were added to MATs, this would occupy (72 + 144)/12 = 18 stages (most of the pipeline).
Inference on Accelerators.
Traditional accelerators, like TPUs [58], GPUs [79], and FPGAs [33], could extend the data-plane pipeline as bump-in-the-wire inference engines, connected via PCIe or Ethernet. In most accelerators, inputs are batched to increase parallelism: larger batch sizes boost throughput by enabling more-efficient matrix-matrix multiplication. However, to provide reliably-low per-packet latency, unbatched (matrix-vector) execution is needed; otherwise, packets would be delayed while waiting for a batch to fill. Moreover, adding another physically-separate accelerator would either consume switch ports (wasting transceivers) or replicate switch functions like packet parsing and match-action rules for feature extraction; separate accelerators would add area, decrease throughput, and consume power.
Limitations of Control-Plane ML
An alternative is to cache inference results in MATs [73]. In a caching scheme, ML training and inference run in the control plane, while inference results are stored in the data plane as flow-table rules. However, ML models with frequently-changing inputs, like packet size (which provide greater accuracy), would experience excessive cache misses.

Cache Miss Rates. To demonstrate this, we build a simple model to predict the cache miss rate as a function of header entropy (i.e., how frequently a header's value changes across packets); matching on high-entropy fields results in more misses than matching on low-entropy fields [94]. We use the five-tuple and a variable number of unstable headers (e.g., packet sizes) as input features and sample flow lengths from an empirical traffic distribution [59]. We assume infinite switch memory to eliminate capacity-driven cache misses (i.e., all rules remain in the cache once installed). Figure 2 shows the cache miss rates for different numbers of header fields and levels of entropy. The miss rate increases linearly for a single header field but grows super-linearly as more fields are added. When using eight fields (corresponding to a small ML model [102]), almost all packets traverse the control plane (a cache hit rate of zero). Cache-based inference, therefore, would be limited to only a few low-entropy headers [84, 92]. This effectively prevents ML models from using per-packet features and decreases their accuracy.
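The sketch below shows the structure of this model (a simplified stand-in: a heavy-tailed Pareto distribution replaces the empirical flow-length distribution [59], and a single per-packet change probability p stands in for a field's entropy):

    # Hedged sketch of the cache-miss model: rules are cached on exact
    # matches over the five-tuple plus k unstable header fields; memory is
    # infinite, so misses are purely entropy-driven.
    import random

    def miss_rate(num_flows=10_000, k=3, p=0.5):
        cache, misses, packets = set(), 0, 0
        for flow_id in range(num_flows):
            length = max(1, int(random.paretovariate(1.5)))  # flow-length stand-in
            unstable = [0] * k
            for _ in range(length):
                for i in range(k):             # each unstable field (e.g., packet
                    if random.random() < p:    # size) changes with probability p
                        unstable[i] += 1
                key = (flow_id, tuple(unstable))  # five-tuple is stable per flow
                packets += 1
                if key not in cache:              # miss: decision made in the
                    misses += 1                   # control plane, then cached
                    cache.add(key)                # infinite memory: never evicted
        return misses / packets

    for k in (0, 1, 4, 8):
        print(f"{k} unstable fields: miss rate = {miss_rate(k=k):.3f}")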
Rule Insertion Time. Per-packet ML with caching systems would also suffer from high installation latencies for match-action rules, which grow with flow-table sizes [63]. Given a limit on table sizes, flow insertion completes in several milliseconds (e.g., 3 ms for TCAMs [18]). However, because per-packet ML would generate multiple decisions per flow, installation times would increase and interrupt each flow repeatedly. For packet-level decisions, frequent installations taking milliseconds would be prohibitive: to meet end-to-end Service Level Objectives (SLOs), switches must process packets in hundreds of nanoseconds.
Inference Compute Time. Control-plane inference, even using ML accelerators, would increase latency; accelerators use batched processing and have software overheads. Table 2 benchmarks the latency of the anomaly-detection DNN [102] on an Intel Broadwell Xeon CPU running vectorized TensorFlow [3], an NVIDIA Tesla T4 GPU with ML-optimized Tensor cores [79], and a Google Cloud TPU v2.8 [58] for unbatched inference. This latency comes from the hardware and from the software (e.g., TensorFlow [3]) necessary to set up these throughput-oriented devices; the lowest-latency design, a vectorized CPU, takes 0.67 ms.
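A measurement along the following lines illustrates the unbatched setting (a simplified sketch: the model mirrors the DNN's topology, and the absolute numbers in Table 2 depend on hardware, software versions, and warm-up discipline):

    # Hedged sketch: time unbatched (batch size 1) inference, as a switch
    # would require for per-packet decisions.
    import time
    import numpy as np
    import tensorflow as tf
    from tensorflow import keras

    model = keras.Sequential([
        keras.layers.Input(shape=(6,)),          # six KDD-derived features
        keras.layers.Dense(12, activation="relu"),
        keras.layers.Dense(6, activation="relu"),
        keras.layers.Dense(3, activation="relu"),
        keras.layers.Dense(2, activation="softmax"),
    ])

    x = tf.constant(np.random.rand(1, 6), dtype=tf.float32)  # one packet's features
    for _ in range(100):                         # warm up: amortize tracing/JIT
        model(x, training=False)

    n = 1000
    start = time.perf_counter()
    for _ in range(n):
        model(x, training=False)                 # batch size 1: no batching delay
    elapsed_ms = (time.perf_counter() - start) / n * 1e3
    print(f"mean unbatched latency: {elapsed_ms:.3f} ms")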
Effect on End-to-End Latency.
We now study the impact of caching control-plane ML decisions on end-to-end flow-completion times for the anomaly-detection DNN [102]. In our simulation, a host sends packets, drawn from an empirical flow distribution [105], to another host through a switch. In both schemes, the first packet of each flow is sent to the control plane for a forwarding decision. For data-plane ML, no further packets traverse the control plane, but the caching scheme must process virtually all packets in the control plane. The cache miss rate, rule-insertion time, and compute-inference time of a control-plane ML scheme increase the end-to-end completion time of long flows by 1500× (Figure 3). This simulation is run at near-zero load, so no delays occur due to queueing; as more flows are added, queues would build and control-plane performance would decrease further.
Taurus: An Intelligent Data Plane
To achieve network flexibility and reactivity, we design Taurus to run line-rate inference entirely in the data plane, while training, a non-critical-path operation, remains in the control plane. This split is similar to Software-Defined Networking (SDN): the control plane gathers a global view of the network and trains ML models to optimize QoS metrics, while the data plane uses these models to make line-rate, data-driven decisions. Unlike traditional SDN, the control plane now installs both weights and flow rules into switches (Figure 4). Weights are more space-efficient than flow rules: for example, matching the behavior of our anomaly-detection DNN would require 12 MB of flow rules (the full dataset), but only 5.6 kB of weights, a 2135× reduction in memory usage. Furthermore, using monitoring frameworks like Deep Insight [1], the control plane can collect fine-grained performance statistics and use them to identify the impact of ML decisions and optimize weights.
TAURUS ARCHITECTURE
We now describe the logical components of the Taurus data-plane pipeline, as shown in Figure 5. As packets enter a switch, FSMs parse them into Packet Header Vectors (PHVs) [13], a fixed-layout, structured format. Then, switches use the match-action abstraction, looking up each header field in a match-action table to determine how the packet should be processed.
Parsing & Preprocessing
Taurus preprocesses raw packet headers into a canonical form before inference: additional data may be added to augment the packet, and some fields may need repair to correct abnormal values. Furthermore, data preprocessing uses rules (implemented with MATs) to convert header fields into features for the ML model. For our anomaly-detection example, IP addresses could be matched against autonomous-system subnets and replaced with features indicating ownership or geographic location. The anomaly-detection network would then evaluate the relationships between numeric features to provide an anomaly score. Taurus replaces categorical relationships with simpler numeric relationships using lookup tables; e.g., a table transforms port numbers into a linear likelihood value, which is easier to infer from [24]. Moreover, preprocessing can invert the probability distribution underlying a sampled value: taking the logarithm of an exponentially distributed variable yields a far more uniform distribution, which an ML model can process with fewer layers [91]. Such feature engineering transfers load from an ML model to its designer; using better features increases models' accuracy without increasing their size [14, 91]. Lastly, In-Band Network Telemetry (INT), local state embedded into packets, provides switches with a view of global network state [61], which they can process using MATs. Taurus devices are therefore not limited to inference using switches' local state: instead, models can use the packet's entire history (using INT) and the flow's entire history (using stateful registers), greatly increasing their predictive power.
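As an illustration, the following sketch mimics two of these preprocessing steps in Python (the table contents and the default likelihood are hypothetical, not values from a trained model):

    # Hedged sketch of MAT-style preprocessing: a lookup table converts a
    # categorical field (destination port) into a numeric likelihood, and
    # a log transform flattens an exponentially distributed feature.
    import math

    PORT_LIKELIHOOD = {22: 0.10, 53: 0.02, 80: 0.05, 443: 0.04}  # hypothetical LUT

    def preprocess(dst_port, inter_arrival_s):
        # LUT: categorical -> numeric (unlisted ports get a default value)
        port_feature = PORT_LIKELIHOOD.get(dst_port, 0.50)
        # the log of an exponentially distributed value is spread far more
        # evenly, so the model needs fewer layers to use it [91]
        timing_feature = math.log(inter_arrival_s + 1e-9)
        return [port_feature, timing_feature]

    print(preprocess(443, 0.0031))   # -> [0.04, -5.77...]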
Postprocessing & Scheduling
MATs can also interpret ML decisions. For example, if our anomaly-detection model outputs 0.9, indicating a likely-anomalous packet, MATs decide how the packet should be handled: it can be dropped, flagged, or quarantined. In Taurus, these postprocessing MATs connect inference to scheduling, which uses an abstraction (e.g., PIFO [98]) to support a variety of scheduling algorithms.
Map-Reduce for ML Inference
For each packet, inference combines cleaned features and model weights to make a decision. Traditional ML algorithms, like Support Vector Machines (SVMs) and neural networks, use matrix-vector linear algebra operations and element-wise non-linear operations [43, 53]. Non-linear operations let models learn non-linear semantics; otherwise, the output would be a linear combination of the inputs. Unlike header processing, ML computation is very regular, using many multiply-add operations. In the more computationally-taxing linear portion of a single DNN neuron, input features are each multiplied by a weight, then added to yield a scalar value. Generalizing this operation, vector-to-vector (map) and vector-to-scalar (reduce) operations suffice for the computationally-intensive linear portions of a neuron. This motivates a new data-plane abstraction, map-reduce, that is flexible enough to express a variety of ML models but specific enough to allow efficient hardware development.
Figure 6: The compute graph of a single perceptron, with the breakdown between map, reduce, and activation functions (outer-loop map) shown.
The Map-Reduce Abstraction.
Our design uses map-reduce SIMD parallelism to provide high computational throughput cheaply. Map operations are element-wise vector operations, such as addition, multiplication, or non-linear operations. Reduce operations combine a vector of elements into a scalar value using associative operations like addition and multiplication. Figure 6 shows how map and reduce are used to compute a single neuron (a dot product); such neurons can be combined into large neural networks. Map-reduce is a popular form for ML models: map-reduce can accelerate ML both in distributed systems [39] and at a finer granularity [21]. By supporting common primitives, we support a set of applications broader than ML, including Virtual Network Functions (VNFs) at the switch and NIC [85]. For example, Elastic RSS (eRSS) uses map-reduce for consistent hashing to schedule packets and cores: map is used to evaluate cores' suitability, and reduce selects the closest core [89]. Map-reduce also supports sketching algorithms, including Count-Min Sketches (CMS) [23] for flow-size estimation. Furthermore, recent research shows that Bloom filters can also benefit from, or be replaced by, neural networks [87].
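A small Python sketch makes the decomposition of Figure 6 concrete (illustrative only; in hardware, the map runs across SIMD lanes and the reduce across a tree of adders):

    # Hedged sketch of one neuron as map-reduce (Figure 6): map multiplies
    # features by weights element-wise, reduce sums the products, and an
    # outer-loop map applies the activation per neuron.
    from functools import reduce

    def neuron(features, weights, bias):
        products = [f * w for f, w in zip(features, weights)]  # map: vector -> vector
        total = reduce(lambda a, b: a + b, products, bias)     # reduce: vector -> scalar
        return max(0.0, total)                                 # activation (ReLU)

    def layer(features, weight_rows, biases):
        # a layer applies the same map-reduce once per neuron (outer map)
        return [neuron(features, w, b) for w, b in zip(weight_rows, biases)]

    print(layer([0.2, 0.7], [[0.5, -1.0], [1.5, 0.25]], [0.1, -0.2]))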
Integrating Map-Reduce with P4. To program Taurus, we propose a dedicated P4 control block (like the ones used for checksums and egress computations [12]). P4 already expresses three logically-separate abstractions: parsing, match-action, and scheduling. By adding a fourth block programmed using a map-reduce abstraction (e.g., Spatial [62]), we extend SDN's flexibility to a new class of applications. The only additional primitives needed are arrays, map, and reduce (as well as loading weights from the control plane). Our proposed syntax is shown in Figure 7.
Target-Independent Optimizations.
Map-reduce is general enough to support target-independent optimizations: optimizations that consider available execution resources (parallelization factors, bandwidth, and more) without considering hardware-specific design details. Parallelizing map-reduce programs by unrolling loops in space speeds up execution: if sufficient hardware resources are available, a model can have all map and reduce loops laid out spatially for maximum throughput. Because parallelization factors are compile-time constants, Taurus has deterministic throughput: a static profile of the whole network accounting for the decreased throughput can be created, allowing operators to easily analyze performance. This static line-rate reduction is not new: it occurs in RMT recirculation [13] , link oversubscription [45, 78] , and elsewhere.
As packet latencies in switches must be low (on the order of hundreds of nanoseconds), latency, not just area, limits switch-level neural networks. Latency increases with depth, so a switch-level ML accelerator can handle a limited number of layers; thus, datacenters' SLOs essentially force small models in switches, regardless of resource constraints. By preprocessing features with MATs, we provide high performance with low latency: the model only has to learn relationships between features, not the mapping from header fields to features.
Avoiding Pathologies
ML models only provide probabilistic guarantees; therefore, we must constrain their behavior with deterministic bounds to ensure robust network operation. In a Taurus system, the user specifies high-level safety (no incorrect behavior) and liveness (correct behavior happens eventually) properties to the control plane. The control plane then compiles these high-level constraints into per-switch constraints, which are used as part of post-processing. By constraining the ML model's decision boundary, we ensure correct network behavior without complicated model verification.
Starvation. Congestion control is a promising feature for in-network ML. However, if an ML model were given free rein over per-flow scheduling decisions, it may (erroneously) decide that some flows should receive a zero or near-zero bandwidth allocation, effectively blocking them from the network. The simplest solution to starvation is guaranteeing each flow a fixed minimum bandwidth, but setting the wrong minimum could be problematic: too small, and flows may be starved; too large, and ML's optimization potential is limited. A better option is blending ML with an existing queueing algorithm, like earliest deadline first or least attained service, which are already supported by the PIFO scheduler [98]. By operating in a range set using heuristics, ML can optimize bandwidth while providing a reliable worst case from low loads to high loads.
Incorrect Decisions. Anomaly detection using ML has a potentially catastrophic pathology: allowing an anomalous packet that compromises a system. Network operators currently define anomalous packets using Access Control Lists (ACLs), which explicitly specify forbidden packets; if ML were used to approximate an ACL, forbidden packets might be forwarded. Instead, the ACL can be used as a safety guarantee, in addition to labeling packets for ML training. Incoming packets first run through an ML model and are then compared against the ACL: they are considered anomalous if flagged by either, making the network more secure than using an ACL alone.
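A minimal sketch of this either-flags policy (with a hypothetical ACL entry and the 0.9 anomaly threshold from §3.2):

    # Hedged sketch: a packet is anomalous if flagged by EITHER the ML
    # model or the deterministic ACL, so the ML model can only add
    # protection relative to the ACL alone. Entries are hypothetical.
    BLOCKED = {("10.0.0.5", 23)}   # ACL entries: (source IP, destination port)

    def is_anomalous(src_ip, dst_port, ml_score, threshold=0.9):
        acl_hit = (src_ip, dst_port) in BLOCKED   # deterministic safety bound
        ml_hit = ml_score >= threshold            # probabilistic detector
        return acl_hit or ml_hit                  # at least as strict as the ACL

    print(is_anomalous("10.0.0.5", 23, 0.10))  # True: the ACL catches it anyway
    print(is_anomalous("10.0.0.7", 80, 0.95))  # True: ML flags a novel anomaly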
Oscillation. A flow may frequently cross a model's decision boundary. For example, if ML is used to select between upstream ports for ECMP, a flow may be sent over several ports in quick succession, increasing the burden on end hosts to reorder packets. The simplest option is a timeout, which guarantees a minimum number of packets per decision and decreases flow breaks. Hysteresis is a better option: once the ML model has made a decision, the decision boundary is shifted slightly using post-processing to make that decision more likely. Then, if the flow's decision is oscillating immaterially around the original decision boundary, the new decision boundary will ensure that the switch's output never changes. However, if the ML model's output changes significantly, hysteresis lets the switch's output change immediately.
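The sketch below illustrates hysteresis for a two-port decision (the 0.5 boundary and 0.05 margin are illustrative):

    # Hedged sketch of hysteresis: after a decision, the boundary shifts by
    # a margin in that decision's favor, so immaterial jitter around the
    # original boundary cannot flip the output, while a large swing in the
    # model's output still changes it immediately.
    def choose_port(score, last_port, margin=0.05):
        # nominally, score > 0.5 selects port 1; otherwise port 0
        boundary = 0.5 - margin if last_port == 1 else 0.5 + margin
        return 1 if score > boundary else 0

    port = 0
    for score in (0.49, 0.51, 0.49, 0.52, 0.80):  # jitter, then a real shift
        port = choose_port(score, port)
        print(f"score {score:.2f} -> port {port}")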
TAURUS IMPLEMENTATION
The complete physical data-plane pipeline of a Taurus device is shown in Figure 8, consisting of blocks for packet parsing, ML with map-reduce, packet forwarding with MATs, and scheduling. Taurus's packet parser, pre- and post-processing MATs, and scheduler use existing hardware implementations [13, 37, 98]. We base Taurus's map-reduce block on Plasticine, a Coarse-Grained Reconfigurable Array (CGRA) composed of a sea of compute and memory units, which are reconfigurable to match applications' dataflow graphs [86]. The fraction of the PHV containing features enters the map-reduce block, while other headers bypass directly to the postprocessing MATs, as shown in Figure 9. Each Compute Unit (CU, Figure 10) is composed of Functional Units (FUs) organized in lanes and stages and performs a map, a reduction, or both. Within a CU stage, all lanes execute the same instruction and read the same relative location. CUs have pipeline registers between stages, so every FU is active on every cycle; pipelining also occurs at a higher level between CUs. We use Memory Units (MUs), which are interspersed with CUs in a checkerboard pattern for locality, to store the weights of ML models (Figure 9). This also allows coarse-grained pipelining, where CUs perform operations and MUs act as pipeline registers. However, as models in network applications have a low memory footprint, the sizes of the MUs are negligible (less than 0.02% overhead for our largest application benchmark, §5). Multiple levels of pipelining within each CU allow our design to run at a 1 GHz fixed clock, a crucial factor for matching the line rate of high-end switch hardware [13, 98]. By using MATs (VLIW) for data cleaning and map-reduce (SIMD) for inference, Taurus uses different models of parallelism to build a fast and flexible data-plane pipeline.

Table 3: Area and power of a CU (16 lanes, 2 stages) for different precisions. Scaling is shown relative to the 8-bit design.
Target-Dependent Compilation.
A variety of programming languages natively support map-reduce [52, 74, 80, 106]. To support our Plasticine-based fabric, we implement Taurus with Spatial, a map-reduce DSL for fast and efficient hardware [62]. Spatial supports target-dependent optimizations for Plasticine as well as target-independent optimizations (discussed in §3.3.2). In Spatial, map-reduce patterns are represented as nested loops and use per-loop controllers to sequence execution. Programs are compiled from this loop hierarchy to a streaming dataflow graph: innermost loops become SIMD operations within a CU, and outer loops are mapped over multiple CUs. Then, overly-large patterns (those requiring too many compute stages, inputs, or memory banks) are split into smaller patterns that fit in CUs and MUs; this is necessary to map some activation functions with long basic blocks. Finally, the resulting graph is placed and routed on the map-reduce block's interconnect.
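The loop nest below (Python standing in for Spatial's nested-loop patterns; the layer dimensions are illustrative) marks how the compiler would map each level:

    # Hedged sketch of the loop hierarchy the compiler sees for one dense
    # layer; comments mark where each loop lands in the map-reduce block.
    import numpy as np

    W = np.random.rand(12, 16)       # weights: 12 neurons x 16 inputs (illustrative)
    x = np.random.rand(16)           # one packet's feature vector

    out = np.empty(12)
    for n in range(12):              # outer loop: unrolled in space over several CUs
        acc = 0.0
        for i in range(16):          # innermost loop: one SIMD map across 16 lanes,
            acc += W[n, i] * x[i]    # followed by an in-CU tree reduction
        out[n] = acc
    # with enough CUs, both loops are fully unrolled for line-rate throughput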
EVALUATION
We first justify our map-reduce block's configuration by analyzing its power and area overheads. Next, we evaluate Taurus's performance by running several recently-proposed networking ML applications [71, 102, 113, 115] . Finally, we demonstrate Taurus's flexibility by evaluating common ML components, which can be composed to express a variety of ML algorithms. 
Figure 11: Area and power consumption per-FU for various CU configurations (lanes and stages).

ASIC Design Space Exploration
Taurus's map-reduce block is parameterized by precision, lane count, and stage count; these parameters are selected to optimize line-rate inference. To quantify Taurus's area, we use ASIC synthesis with a 28 nm standard cell library [40].
We first study the impact of arithmetic precisions ranging from 8 to 32 bits on area and power; as floating point support is expensive and nonessential for inference, we restrict Taurus to fixed-point arithmetic. We investigate differing lane (4-32) and stage (2-6) counts, and determine that an 8-bit data path with 16 lanes and 2 stages is the ideal configuration to support today's network-inference applications.
Precision Selection.
For ML inference, fixed-point arithmetic is faster than floating point with equivalent accuracy [49, 58]. We believe that 8-bit precision suffices for ML (compressed models use even fewer bits [110]) and use 8-bit precision for Taurus; however, several industrial designs, such as Google's TPU, use 16-bit data paths [58]. We therefore evaluate alternate designs with greater precision and show that precision has a roughly linear cost: going from 8-bit to 16-bit data widths corresponds to a proportional (2×) increase in area and power (Table 3).
Lane Count Selection.
Increasing CU lane and stage counts increases the number of FUs, and therefore area; but simply adding more CUs would also increase FU count and area. Therefore, we normalize CU area and power by FU count to investigate the relative efficiency of different CU designs. Figure 11 shows that per-FU area and dynamic power decrease with lane count, because adding lanes or stages decreases the amount of control logic and overhead per FU. However, small models cannot be efficiently mapped to large CUs: if less application-level parallelism is available than there are CU lanes, some lanes will be unused. Likewise, stages in a CU beyond those needed for a basic block will also be unused: each basic block has its own controller, and the CU only has hardware support for one control hierarchy. The anomaly-detection DNN is our largest model requiring line-rate operation, so we use it to set the ideal lane count. The DNN's largest layer has 12 hidden units, so the largest dot-product calculations involve 12 elements; the 16-lane configuration fully unrolls the dot product within a single CU while minimizing underutilization. Currently, the 16-lane configuration balances area overhead, power, and mapping efficiency, but optimal lane counts may change as data-plane ML models evolve. Because map-reduce programs are hardware-agnostic, programs can run on new configurations unmodified; the compiler will handle the differences in unrolling factors as needed.
Stage Count Selection.
We perform a scaling study to quantify the impact of CU stage counts on area (Figure 12). For this study, we use activation functions, as they have the deepest compute graphs; we sweep the CU stage count and report the area of the smallest array that maps each function. For Taylor-series approximations (Sigmoid-Exp and Tanh-Exp), stages added to CUs are used to map computation, but overall area remains flat: adding stages is roughly equivalent to adding CUs. Furthermore, for activation functions with shallow compute graphs (e.g., ReLU), adding stages decreases efficiency: the later stages are not mapped. Dot products require only two stages: one for the map/multiply, and one for the reduce/add. As theoretical area- and energy-efficiency increase with stage count, we would like to increase the stage count; however, more stages are not useful for functions like LUTs, ReLU, and linear algebra, so we use two stages.
Final Prototype ASIC Configuration.
We end up with a CU configuration that has 16 lanes, 2 stages, and an 8-bit fixed-point data path; each CU takes 0.124 mm², with a single FU taking 3877 µm². Our Taurus parameters are based on the applications and functions in use today: as ML for networking grows, we may need to revisit these parameters. Regardless, our parameterized design shows that map-reduce can be supported with a small amount of additional hardware.
Application Benchmarks
We evaluate Taurus using four ML models: an IoT traffic-classification model [113], two anomaly-detection models [71, 102], and a model that learns congestion-control windows [115]. The IoT traffic classifier implements KMeans clustering, using 11 packet- and flow-level features, to classify IoT traffic into five categories. The first anomaly-detection algorithm is an SVM [71] that uses offline dimensionality reduction to select eight key features of the 41 in the KDD intrusion-detection data set [4, 26]. The SVM uses a radial basis function to model nonlinear relationships. Our second anomaly-detection algorithm is a DNN that takes six input features (also a subset of KDD features) and produces two outputs: the probability of a malicious packet and the probability of a safe packet. The DNN has three intermediate layers with 12, 6, and 3 hidden units, respectively [102]. Finally, the online congestion-control algorithm (Indigo [115]) is an LSTM-based network. Indigo uses 32 LSTM units followed by a softmax layer and is designed to run at an endpoint.

Table 4 shows the performance, area overheads, and power requirements of our benchmarks on Taurus, compared against a 300 mm² [37] programmable switch chip with an RMT-based pipeline [13]. By mapping traffic classification and anomaly detection to Taurus, we show that real models can run at line rate in switches. Both anomaly-detection applications learn to detect malicious packets with accuracies better than non-ML solutions when running on Taurus, with each using a different ML algorithm. The ability to run multiple ML models for one problem shows Taurus's generality: after network-specific pre- and postprocessing, any map-reduce model can be used, allowing network operators to select an optimal model. With the congestion-control model, we investigate a neural network running at short intervals, instead of per-packet: operating at only a small fraction of line rate still yields major improvements over the Indigo software implementation.
Area & Power.
We examine overall area and power with respect to an existing programmable switch ASIC to show the additional cost of implementation. Table 4 reports the area of only the CUs needed to implement each benchmark; therefore, the actual area of a prototype for these benchmarks is the area of the largest benchmark, with unused CUs disabled for smaller benchmarks. Simple models, like SVM-based anomaly detection, have as little as 6.1% area overhead and 1.1% power overhead. Indigo, our largest model, consumes an additional 23.6% area and 4.1% power because it is not fully unrolled. Therefore, we provision Taurus's map-reduce block at 17.73 mm². If switch designers choose to support only smaller models, KMeans, SVMs, and DNNs add only 12% more area and 2% more power.
Latency & Throughput.
KMeans, the SVM, and the DNN process one packet's headers per cycle (line rate); they do not affect throughput, and latency remains in the nanosecond range (Table 4). Assuming a datacenter switch latency of 1 µs [28], KMeans, the SVM, and the DNN add 7.6%, 6.8%, and 18.8% more latency, respectively. We also use Indigo to estimate the performance of models doing periodic, not per-packet, control updates within a network; these models provide more detailed updates for real-time events, like link congestion. In software, the Indigo LSTM network significantly improves application-level throughput and latency [115], operating in 10 ms intervals, likely slowed by the LSTM's computational requirements. In Taurus, Indigo can produce a decision every 12.5 ns, with each step taking 380 ns: this allows the LSTM network to react more quickly to changes in load and better control tail latency. Overall, Taurus can run per-packet models with minimal performance impact and allows periodic models to make decisions orders of magnitude faster than software.

Figure 13: A small DNN, broken down into independent microbenchmarks.
Microbenchmarks
Finally, we evaluate Taurus on a variety of microbenchmarks to investigate the key hardware features driving application performance. Smaller dataflow programs can be composed into a single, large program: for example, Figure 13 shows a DNN built from several perceptron layers fused with nonlinear activation functions. Taurus is spatially reconfigurable; hence, the area overhead of any model is the sum of its constituent parts, and these parts define the hardware needed to implement the model. By evaluating these building blocks of ML applications, we provide general results that can be adapted to a variety of design points. We divide microbenchmarks into two categories, linear and nonlinear functions, which play different roles in a model and have different implementation characteristics. Linear functions are notable because they are not perfectly parallel: they include a reduction network that limits the degree of communication-free parallelism. Conversely, nonlinear functions can be perfectly SIMD-parallelized because there is no interaction between adjacent data elements. For example, if the output of 16 different perceptrons is input to a ReLU, we simply map the ReLU over the 16 outputs, which are then computed in parallel. If fully unrolled, the latency of this operation is the sum of the perceptron and ReLU execution times.

Linear Operations. Our primary linear microbenchmarks are a one-dimensional convolution, a perceptron kernel, and a linear SVM. We also evaluate the linear components of LSTM and GRU cells, which have an underlying computation similar to the perceptron. The convolution kernel captures position-invariant features and is frequently used to find spatial or temporal correlations [64]. Table 5 shows the area required for each microbenchmark when unrolled to run at line rate. Because the convolution does not map well to vectorized map-reduce (there are multiple small inner reductions), it requires 8× unrolling and considerable chip area. However, the SVM and perceptron run at line rate with less than 2 mm² of additional chip area; they can be efficiently composed into high-performance deep neural networks.
The latencies imposed by each microbenchmark are also shown in Table 5. The convolution and SVM kernels have the highest latencies: each has a small loop that is unrolled across CUs. This adds another stage of inter-CU communication, and therefore latency; because the perceptron runs entirely within a single CU, it has the lowest latency. The minimum latency for a 16-lane CU to perform a map-reduce is five cycles: one cycle for map and four cycles for reduce, using different fractions of a single stage for each reduction cycle (Figure 10). The remaining latency comes from data movement from the input to the CU and then to the output; Taurus takes roughly five cycles for each data movement, a result of its spatially-distributed dataflow layout.
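The cycle count follows from the reduction tree, as the sketch below shows (illustrative; it models only the in-CU map and reduce cycles, not the roughly five cycles per data movement):

    # Hedged sketch of the five-cycle map-reduce latency of a 16-lane CU:
    # one multiply (map) cycle, then a log2(16) = 4-level adder tree.
    import math

    def cu_cycles(lanes):
        return 1 + int(math.log2(lanes))    # 1 map cycle + log2(lanes) reduce cycles

    def tree_reduce(values):
        levels = 0
        while len(values) > 1:              # each level halves the vector
            values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
            levels += 1
        return values[0], levels

    total, levels = tree_reduce([1.0] * 16)
    print(cu_cycles(16), "cycles;", levels, "reduce levels; sum =", total)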
Unrolling. Table 6 shows the area and throughput impact of outer-loop unrolling on a selection of linear microbenchmarks. Not all benchmarks can have their outer loop unrolled: for example, our untiled perceptron has no outer loop. The iterative (i.e., loop-based) versions of the SVM and the convolution run at one-half and one-eighth of line rate, respectively; this corresponds to two and eight loop iterations per packet. By unrolling the SVM twice, throughput improves to line rate at the cost of a 40% area increase; however, unrolling the convolution to meet line rate results in a 6.3× area increase. Using map-reduce's target-independent optimization, large ML models can loop, thus running over multiple cycles with a corresponding reduction in the number of packets forwarded per second.
Nonlinear Operations. Activation functions are necessary to learn nonlinear behavior; otherwise, the entire neural network would collapse into a single linear function. Different activation functions are used for different purposes: tanh is used in LSTMs to implement gating [54] , while DNNs use ReLU and Leaky ReLU, which are easier to implement [77] . The area and latency results for nonlinear operations are also shown in Table 5 .
The most efficient functions, including ReLU and Leaky ReLU, take under 1 mm²; they do not use lookup tables. More complicated functions, including sigmoid and tanh, have several versions: Taylor series, piecewise approximations, and lookup tables [49, 111]. Taylor series and piecewise approximations require 2-5 times as many resources as other activation functions (Figure 12). LUT-based functions need memory, but only a small amount: each table stores 1024 8-bit entries; even when replicated for parallel lookups, this is a trivial fraction of a switch chip's total memory. Therefore, we present microbenchmark results for LUT-based sigmoid and tanh. Latencies of nonlinear kernels are lower than those of linear kernels because there is no inter-CU communication generated by loop unrolling.
FUTURE WORK
In this paper, we show that Taurus enables inference for per-packet ML algorithms in the data plane; with it, a wide variety of network ML research directions become available.
Dimensionality Reduction for Data Augmentation. Data augmentation, joining input data with statically-known relationships to aid ML, becomes challenging as the number of new features grows. For example, a network operator using IP-correlated data to precisely model a packet's source may add dozens of derived features. Storing these in MATs would be too expensive; however, dimensionality reduction can reduce feature counts while maintaining the underlying information [108].
The benefits of dimensionality reduction are twofold: it reduces the amount of preprocessing data and decreases model sizes. However, dimensionality reduction cannot replace ML, due to the cross-product explosion: multiple fields could be reduced at once, but with exact/wildcard matching (binary or ternary), flow-table sizes grow exponentially with the number of fields. Therefore, in Taurus, dimensionality reduction is best suited to providing additional information about input fields (e.g., IP address or port number), while ML identifies relationships between fields.
Shrinking Models.
A major Taurus application will be network control and coordination. Neural networks can solve a variety of control problems [9, 99, 107] and are getting smaller. For example, structured control nets [99] for non-linear control perform almost as well as 512-neuron DNNs using as few as four neurons per layer. With such small networks, Taurus can run multiple models simultaneously (e.g., one model for intrusion detection and another for traffic optimization). In addition, techniques like quantization, pruning, and distillation can further reduce a model's size [8, 51, 57, 109].
Learned Traffic Management. For Taurus to run ML models accurately, training models properly is paramount. Simultaneous training for learned congestion control lets devices make decisions using knowledge of other devices' policies [112] . Using data-plane ML models, we can force all data-plane functions to take a global view. For example, a learned scheduling algorithm could be bootstrapped using a simulated (or emulated) data center (like CrystalNet [66] ): realistic traffic would drive the simulation, while switch weights are trained to route more optimally and improve throughput. Further improvements would occur online, using sampled traces from switches to gradually adapt to changes in traffic. Effective network training must optimize globally: if devices are trained in isolation, they will behave greedily, lowering efficiency and quality of experience.
RELATED WORK
Architectures for ML. While Taurus uses Plasticine, a vectorized CGRA [86] , as the basis of its map-reduce block, it could feasibly be implemented with other fabrics. The most widely-available reconfigurable architectures are Field Programmable Gate Arrays (FPGAs), which are used as both custom accelerators [93] and prototyping tools (e.g., the NetFPGA [2, 69] ). However, FPGAs' on-chip interconnects consume up to 70% of the total chip power [16] , and their variable, slow clock frequencies complicate interfacing and operating at network switch speeds (multi-terabits per second). CGRAs are optimized for arithmetic to support ML better and typically have a fast, fixed clock frequency that allows seamless integration between the map-reduce block and MATs in Taurus [25, 35, 41, 70, 72, 96] . Other architectures, like Eyeriss [19] , Brainwave [34] , and EIE [50] , achieve high efficiency by focusing on specific algorithms. These could be used for in-switch ML, but are too rigid: if a specific accelerator were standardized, networks would be unable to benefit from future ML research due to the lack of a flexible abstraction (like map-reduce).
ML For Networking. Many networking applications can benefit from ML. For example, learned algorithms for congestion control [112, 115] have been shown to outperform their human-designed counterparts [27, 48, 116] . In addition, Boutaba et al. [15] identify ML use cases for network tasks such as traffic classification [29, 31] , traffic prediction [17, 20] , active queue management [100, 121] , and security [83] . All of these applications could immediately be deployed in networks today using Taurus.
Networking For ML. Specialized networks can also accelerate ML algorithms themselves. With minor enhancements to modern data-plane hardware, switches can aggregate gradients in-network, accelerating training by up to 300% [90]. Gaia, a system for distributed ML [55], also accounts for wide-area network bandwidth and regulates the movement of gradients during the training process. While Taurus is not explicitly designed to accelerate distributed training, map-reduce supports aggregating numeric weights contained in packets more efficiently than MATs.
CONCLUSION
Self-driving networks, which make observations about their performance and improve themselves, would increase efficiency and users' quality of experience in modern and future data centers; but neither the programming abstraction nor the hardware exists, today, to realize such a network. Bridging this gap, Taurus is the equivalent of adaptive cruise control: automatically adjusting parameters in response to changing network conditions. We demonstrate that Taurus operates at line rate and adds minimal overhead to a programmable switch pipeline (e.g., RMT): 24% more area and 178 ns of average latency, while accelerating several recently-proposed networking benchmarks. Taurus replaces data-plane heuristics with learned functions and can interoperate with existing data-plane devices. Given a mixture of Taurus and traditional networking hardware (e.g., using only Taurus NICs or ToR switches), Taurus's ML models will make optimal decisions accounting for existing heuristics.
We hope that Taurus will eventually enable full network automation, beyond just performance tuning and learned network security. Operators could use a fully-autonomous ML model for packet forwarding-with tight bounds on its output, like training wheels. The bounds would allow the autonomous model to make decisions and serve as the initial labeling function for bad decisions. As the model becomes more reliable, the bounds could be relaxed until ML is making virtually all packet-forwarding decisions.
To build a self-driving network, hardware must be deployed before large-scale training can begin: Taurus provides a foothold for in-network ML, with hardware that can be installed, and improve performance and security, in next-generation data planes.
