Taurus: An Intelligent Data Plane by Swamy, Tushar et al.
Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, and Kunle Olukotun
Stanford University
ABSTRACT
Emerging applications—cloud computing, the internet of
things, and augmented/virtual reality—need responsive, avail-
able, secure, ubiquitous, and scalable datacenter networks.
Network management currently uses simple, per-packet,
data-plane heuristics (e.g., ECMP and sketches) under an
intelligent, millisecond-latency control plane that runs data-
driven performance and security policies. However, to meet
users’ quality-of-service expectations in a modern data cen-
ter, networks must operate intelligently at line rate.
In this paper, we present Taurus, an intelligent data plane
capable of machine-learning inference at line rate. Taurus
adds custom hardware based on a map-reduce abstraction to
programmable network devices, such as switches and NICs;
this new hardware uses pipelined and SIMD parallelism for
fast inference. Our evaluation of a Taurus-enabled switch
ASIC—supporting several real-world benchmarks—shows
that Taurus operates three orders of magnitude faster than
a server-based control plane, while increasing area by 24%
and latency, on average, by 178 ns.
On the long road to self-driving networks, Taurus is the
equivalent of adaptive cruise control: deterministic rules
steer flows, while machine learning tunes performance and
heightens security.
1 INTRODUCTION
The tremendous scale of modern data centers—tens of thou-
sands of servers, connected by elaborate networks [45, 78,
95]—causes many logistical and technical challenges [10,
38, 88]. Moreover, the high throughput and low latency
requirements of emerging workloads (e.g., cloud comput-
ing, the internet of things, and augmented/virtual reality)
make managing such large, complex networks challeng-
ing [10, 45, 88]. When implementing management policies
(e.g., for performance or security), network operators face a
dichotomy: they must choose between line-rate execution
and computational complexity.
Data-plane devices (e.g., switches and NICs) can react
in nanoseconds to network conditions, but have a limited
programming model designed to forward packets at line
rate (e.g., flow tables [13]). This restricts network opera-
tions to simple heuristics [5, 60, 65] in data-plane devices
and purpose-built tasks in fixed-function hardware (e.g.,
middle-boxes [22, 68]). A security policy for anomaly de-
tection, for example, would re-use flow tables—intended
as L3 routing or L2 forwarding tables—to implement black-
lists or Access Control Lists (ACLs). Such policies, therefore,
operate within the constraints of current data-plane abstrac-
tions, which set forth a binary world: packets matching a
blacklist are dropped, with all others forwarded. Neverthe-
less, data-plane devices process every packet, so they can
capture fine-grained statistics (using counters and sketches)
and make a new decision for each packet.
Control-plane servers can make complicated, data-driven
decisions, but only for a few packets (e.g., the first of
every flow). Later packets match the cached decisions—
installed in the data plane as flow rules—and are forwarded
directly by the data plane. By using more data, a cen-
tralized control plane can make better decisions, provid-
ing better performance and security. For example, servers
(possibly with accelerators [58, 79]) can implement learn-
ing anomaly-detection algorithms like clustering, support-
vector machines, and neural networks [71, 81, 102]; these
algorithms can automatically find latent non-linear correla-
tions between features.
Ideally, network processing would be data-driven and re-
act to every packet—all packets could be sent through the
control plane, or data-plane devices could be more flexible.
Caching data-driven, per-packet decisions would provide
per-packet reactivity, but header instability would effec-
tively result in all packets being processed in the control
plane. This approach would decrease performance by about
three orders of magnitude, precluding data-driven perfor-
mance tuning and restricting data-driven anomaly detec-
tion to the most hardened networks. The better approach
is a more flexible data plane: by adding a new abstraction
designed for decision-making, not packet forwarding, swit-
ches and NICs can improve their functionality with minimal
hardware (compared to intelligent decision-making with
flow tables).
Data planes, today, use only three abstractions to bridge
the programmer-hardware gap—packet parsing maps to Fi-
nite State Machines (FSMs) [37], flow rules map to Match-
Action Tables (MATs) [13], and scheduling maps to Push-In
First-Out (PIFO) [98]—so any new abstraction must also be
ubiquitous, general-purpose, and provide a coherent high-
and low-level interface. Machine Learning (ML) provides a
broad high-level interface suitable for many applications,
including supervised and reinforcement learning. Anomaly
detection would use supervised learning: operators identify
anomalous packets after the fact, which lets a model learn
to predict other anomalies. Reinforcement Learning (RL) is
more useful for automatic performance tuning: by automat-
ically trying small tweaks to a running model and seeing
which ones improve performance, the system adapts itself.
Most ML algorithms are built around linear algebra,
which uses a significant amount of repetitive computation,
performed on a small number of weights, with regular com-
munication. Unnecessary flexibility, such as the all-to-all
VLIW communication [120], large memories, and ternary
CAMs in MATs [13], consumes chip area without benefit-
ing ML; prior attempts at ML in switches have failed due
to this inefficiency [97]. Map-reduce, on the other hand, is
a good low-level abstraction for ML because it provides the
necessary computations (large numbers of multiplies and
adds) and no unnecessary flexibility.
Although ML can make data-driven decisions, it cannot
handle all network functionality. ML is suited for decisions
currently made by (approximate) heuristics, such as conges-
tion control, ECMP, and anomaly detection; these decisions
impact only networks’ performance and security, not their
core packet-forwarding behavior. Networks built using ML
will therefore use flow rules to express a range of valid
decisions (e.g., output ports), and the ML model will opti-
mize best-case while bounding worst-case performance by
selecting from a pre-determined set of decisions. An intelli-
gent control plane will thus compile user programs phrased
as constrained optimization problems: for example, mini-
mizing congestion while ensuring a certain bandwidth for
high-priority flows.
In this paper, we present Taurus, a data plane augmented
with a new ML abstraction and programmable map-reduce
hardware for intelligent (data-driven) packet forwarding.
The control plane receives telemetry data from the entire
network (e.g., via In-Band Network Telemetry, or INT [61]),
trains new switch weights, and installs them in Taurus
alongside traditional flow rules for packet forwarding. To
operate at line rate, Taurus’s map-reduce block implements
only the multiply and add operations needed for ML. The
map-reduce block works alongside parsers, MATs, and the
scheduler to forward packets, with MATs connecting map-
reduce to the pipeline: pre-processing MATs extract nu-
meric input features from packets, the map-reduce block
uses these features and an ML model to generate a numeric
result, and post-processing MATs transform this output into
a packet-forwarding decision.
Recent coarse-grained accelerators for data analytics [44,
86, 104] underpin Taurus’s map-reduce block: a user-defined
program graph is spatially mapped to a reconfigurable ar-
ray, where data flows through the array. Taurus’s map-
reduce block is tailored for line-rate inference: unnecessary
operations, including DRAM access and floating-point op-
erations, are eliminated; a gridded organization of compute
and memory units is maintained, with 16 SIMD lanes per
compute unit and 16 banks per memory unit. We evaluate
the overhead of Taurus’s map-reduce block as an addition
to a programmable switch, and we demonstrate that the
average added latency is 178 ns and added area is 24% to
implement a range of proposed algorithms.
By enabling data-plane ML with low overhead and a clear
abstraction, Taurus moves data-driven processing from a
per-flow to a per-packet level and lets complex performance
and security policies run at line rate.
In summary, we make the following contributions:
• A Taurus logical pipeline using a map-reduce abstrac-
tion for line-rate, per-packet inference (§3).
• A hardware design of a Taurus-enabled switch with
a reconfigurable SIMD dataflow engine [86] for map-
reduce (§4).
• Analysis of the design using ASIC synthesis and a
28 nm generic library [40] to determine area and power
overheads relative to commercially available switches
(§5.1).
• Evaluation with real ML networking applications (§5.2)
and microbenchmarks (§5.3), showing that Taurus sup-
ports the functions common in modern ML at line rate
(1 GPkt/s).
We begin by motivating the need for an intelligent data
plane (§2) and highlighting both the importance of per-
packet ML and the limitations of existing data- (§2.1) and
control-plane ML approaches (§2.2).
Ethics: This work does not raise any ethical issues. This
research has no human subjects and formal institutional
review is not required.
2 WHY AN INTELLIGENT DATA PLANE?
Taurus is an intelligent data plane that runs ML at line-
rate for every packet and uses ML’s output to optimize
forwarding decisions. Machine learning provides signifi-
cant improvements in traffic engineering [114], schedul-
ing [112, 115], and security [4, 26, 56, 71, 102]. SIMON [36]
also reconstructs queuing delays in network switches with
higher accuracy than edge-based methods [46, 76]. Fur-
thermore, decision trees (Remy [112]) and recurrent neural
networks (Pantheon [115]) for congestion control have a
throughput-latency frontier beyond that of many human-
designed algorithms [27, 48, 116]. Many of these algorithms
make use of sub-flow features. For example, Remy uses
RTTs and ACKs, and anomaly detection (e.g., KDD Cup
entries [103? ]) uses packet-level features like connection
duration. Boutaba et al. [15] survey ML-based network-
ing applications and find that tasks like traffic classifi-
cation make liberal use of packet-level features [7, 29–
32, 67, 75, 82, 101, 117–119]. Even for encrypted traffic,
Level    Accuracy (%)   F1 Score   Missed Anomalies
Flow     48.8           58.1       11 392
Packet   75.0           78.3       5 273
Table 1: A comparison of flow- and packet-level anomaly-detection DNNs.
packet-level features like inter-arrival times and packet
sizes allow classification [6, 11].
Per-Packet ML: A Case Study. To highlight the impor-
tance of packet-level features, we build a simple example
using an anomaly-detection DNN [102] and the updated
NSL-KDD [103] intrusion-detection data set. The DNN uses
TCP-level features available in the data set (e.g., the current
connection duration or the number of observed packets
with an urgent TCP flag set [42]), but we exclude features
only available after a flow (e.g., total source and destination
bytes transferred). We measure the DNN’s performance
with and without packet-level features after training for ten
epochs. Packet-level features improve our model’s accuracy
by over 25 percentage points and reduce the number of missed anomalies (i.e.,
false negatives) by a factor of two, as shown in Table 1. In
short, packet-level events let ML models better understand
network behavior and make more accurate decisions.
2.1 Limitations of Data-Plane ML
There have been a number of recent attempts to use cur-
rent switch abstractions (i.e., MATs) and specialized hard-
ware [36] to support per-packet ML models.
2.1.1 Inference on MATs. The match-action abstraction
is insufficient for line-rate ML in modern data-plane devices,
due to both missing operations (especially loops and mul-
tiplication) and inefficient MAT pipelines [13]. Binary neu-
ral networks have been implemented—using tens of MATs
for each—but they lack the precision needed for practi-
cal deployments [97]. Likewise, an SVM for IoT classifica-
tion [113] consumes most of the memory of a NetFPGA
switch—an experimental research platform [2, 69]—and has
not been mapped onto a real switch ASIC. As these tech-
niques use switches’ VLIW pipelines to implement simpler,
SIMD programs (with lower memory requirements), they
use only a small fraction of the MAT hardware while ren-
dering the entire stage unavailable.
VLIW Parallelism. The difference in communication
requirements between a VLIW model and a SIMD model is
visually described in Figure 1. VLIW models, used in cur-
rent switch MATs [13], have multiple logically-independent
instructions per stage operating in parallel, reading from
and writing to arbitrary locations. This all-to-multiple input
communication and multiple-to-all output communication
[Figure 1: A comparison between (a) VLIW and (b) SIMD
communication from stage i to stage i + 1, with lines showing
possible paths. The additional communication possible with
VLIW increases overhead and is unnecessary for ML.]
requires large crossbars. For example, a 16-issue VLIW pro-
cessor has 20× as much control logic as an equally-powerful
cluster of eight dual-issue processors [120]. VLIW’s over-
head thus limits the number of instructions per stage. Bare-
foot’s Tofino chip only executes 12 operations per stage:
four of each of 8, 16, and 32 bits [47]. A typical DNN layer
may require 72 multiplications and 144 additions [102]; even
if multiplication were added to MATs, this would be 18
stages (most of the pipeline).
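The stage count above follows from simple arithmetic. The sketch below reproduces it in Python under one simplifying assumption that is not stated in the text: every multiply and every add occupies exactly one operation slot, regardless of bit width or data dependencies. The function name `mat_stages_needed` is hypothetical.

```python
import math

# Figures quoted in the text.
OPS_PER_STAGE = 12   # Tofino: four operations each at 8, 16, and 32 bits
MULTIPLIES = 72      # multiplications in a typical DNN layer
ADDITIONS = 144      # additions in the same layer

def mat_stages_needed(multiplies, additions, ops_per_stage=OPS_PER_STAGE):
    """Stages required if each multiply or add fills one operation slot (assumption)."""
    return math.ceil((multiplies + additions) / ops_per_stage)

stages = mat_stages_needed(MULTIPLIES, ADDITIONS)
print(stages)  # 18
```

Dependencies between the adds of a reduction would, if anything, push the real stage count higher than this slot-counting estimate.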
2.1.2 Inference on Accelerators. Traditional accelera-
tors, like TPUs [58], GPUs [79], and FPGAs [33] could ex-
tend the data-plane pipeline as bump-in-the-wire inference
engines, connected via PCIe or Ethernet. In most accelera-
tors, inputs are batched to increase parallelism: larger batch
sizes boost throughput by enabling more-efficient matrix-
matrix multiplication. However, to provide reliably-low
per-packet latency, unbatched (matrix-vector) execution is
needed; otherwise, packets would be delayed while waiting
for a batch to fill. Moreover, adding another physically-
separate accelerator would either consume switch ports
(wasting transceivers) or replicate switch functions like
packet parsing and match-action rules for feature extraction;
separate accelerators would add area, decrease throughput,
and consume power.
2.2 Limitations of Control-Plane ML
An alternative is to cache inference results in MATs [73].
In a caching scheme, ML training and inference run in
the control plane, while inference results are stored in the
data plane as flow-table rules. However, ML models with
frequently-changing inputs, like packet size—which provide
greater accuracy—would experience excessive cache misses.
Cache Miss Rates. To demonstrate this, we build a
simple model to predict the cache miss rate as a func-
tion of header entropy (i.e., how frequently a header’s
value changes across packets); matching on high-entropy
[Figure 2: Cache miss rates with increasing number of
header fields and infinite rule-space. x-axis: entropy of a
header field (0 to 0.4); y-axis: miss rate (0 to 1); curves
for 1, 2, 4, and 8 fields.]
Accelerator       Latency (ms)
Broadwell Xeon    0.67
Tesla T4 GPU      1.15
Cloud TPU v2-8    3.51
Table 2: Inference time for control-plane accelerators.
fields results in more misses than matching on low-entropy
fields [94]. We use the five-tuple and a variable number
of unstable headers (e.g., packet sizes) as input features
and sample flow lengths from an empirical traffic distribu-
tion [59]. We assume infinite switch memory to eliminate
capacity-driven cache misses (i.e., all rules remain in cache
once installed). Figure 2 shows the cache miss rates for dif-
ferent numbers of header fields and levels of entropy. The
miss rate increases linearly for a single header field but
grows super-linearly as more fields are added. When using
eight fields (corresponding to a small ML model [102]), al-
most all packets traverse the control plane (a cache hit rate
of zero).
Cache-based inference, therefore, would be limited to
only a few low-entropy headers [84, 92]. This effectively
prevents ML models from using per-packet features and
decreases their accuracy.
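To make the trend concrete, the following Python sketch simulates an unbounded rule cache for a single flow. It is not the authors' model: here `change_prob` (the chance that a field takes a fresh value on each packet) stands in for header entropy, and `values_per_field`, `flow_len`, and the function name `miss_rate` are assumptions chosen for illustration.

```python
import random

def miss_rate(num_fields, change_prob, flow_len=1000, values_per_field=256, seed=0):
    """Fraction of packets whose header tuple has not been seen before.

    The cache is unbounded, so a miss occurs only on a never-seen tuple.
    """
    rng = random.Random(seed)
    cache = set()
    fields = [0] * num_fields
    misses = 0
    for _ in range(flow_len):
        # Each unstable field independently changes with probability change_prob.
        for i in range(num_fields):
            if rng.random() < change_prob:
                fields[i] = rng.randrange(values_per_field)
        key = tuple(fields)
        if key not in cache:
            misses += 1
            cache.add(key)
    return misses / flow_len

for k in (1, 2, 4, 8):
    print(k, miss_rate(k, change_prob=0.4))
```

With a single field the cache eventually holds every value the field can take, so misses taper off; with eight fields nearly every packet presents a new tuple, which mirrors the qualitative behavior described above.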
Rule Insertion Time. Per-packet ML with caching sys-
tems would also suffer from high installation latencies for
match-action rules, which grow with flow-table sizes [63].
Given a limit on table sizes, flow insertion completes in
several milliseconds (e.g., 3ms for TCAMs [18]). However,
because per-packet ML would generate multiple decisions
per flow, installation times would increase and interrupt
each flow repeatedly. For packet-level decisions, frequent
installations taking milliseconds would be prohibitive—to
meet end-to-end Service Level Objectives (SLOs), switches
must process packets in hundreds of nanoseconds.
Inference Compute Time. Control-plane inference, even
using ML accelerators, would increase latency; accelerators
[Figure 3: The impact of control- and data-plane ML on
flow-completion times in a minimally-loaded system. x-axis:
flow length (packets), log scale; y-axis: flow comp. time
(ms), log scale; series: Control Plane, Taurus Data Plane.]
use batched processing and have software overheads. Ta-
ble 2 benchmarks the latency for the anomaly-detection
DNN [102] on an Intel Broadwell Xeon CPU running vec-
torized TensorFlow [3], an NVIDIA Tesla T4 GPU with
ML-optimized Tensor cores [79], and a Google Cloud TPU
v2-8 [58] for unbatched inference. This latency comes from
hardware and software (e.g., TensorFlow [3]), which is necessary
to set up these throughput-oriented devices; the lowest-latency
design, a vectorized CPU, takes 0.67 ms.
2.2.1 Effect on End-to-End Latency. We now study
the impact of caching control-plane ML decisions on end-
to-end flow-completion times for the anomaly-detection
DNN [102]. In our simulation, a host sends packets drawn
from an empirical flow distribution [105], to another host
over a switch. In both schemes, the first packet of each flow
is sent to the control plane for a forwarding decision. For
data-plane ML, no further packets traverse the control plane,
but the caching scheme must process virtually all packets
in the control plane. The cache miss rate, rule-insertion
time, and compute-inference time of a control-plane ML
scheme increase the end-to-end completion time of long
flows by 1500× (Figure 3). This simulation is run at near-
zero load, so no delays occur due to queueing; as more
flows are added, queues would build and the control-plane
performance would decrease.
2.3 Taurus: An Intelligent Data Plane
To achieve network flexibility and reactivity, we design
Taurus to run line-rate inference entirely in the data plane,
while training—a non-critical-path operation—remains in
the control plane. This is similar to Software-Defined Net-
working (SDN): the control plane gathers a global view of
the network and trains ML models to optimize QoS metrics,
while the data plane uses these models to make line-rate,
data-driven decisions. Unlike traditional SDN, the control
plane now installs both weights and flow rules into
switches (Figure 4). Weights are more space-efficient than
flow rules: for example, matching the behavior of our anom-
aly detection DNN would require 12MB of flow rules (the
[Figure 4: Training Taurus: hosts randomly mark packets to
trace in the network, and traced switch decisions and
measured QoS are used to update weights. Diagram labels:
Control Plane (Training); Host; Switch (Inference); Host;
Traced Packets; Features & Decisions; Weight Updates;
Traced QoS.]
full dataset), but only 5.6 kB of weights—a 2135× reduction
in memory usage. Furthermore, using monitoring frame-
works like Deep Insight [1], the control plane can collect
fine-grained performance statistics and use them to identify
the impact of ML decisions and optimize weights.
3 TAURUS ARCHITECTURE
We now describe the logical components of the Taurus
data-plane pipeline as shown in Figure 5. As packets en-
ter a switch, FSMs parse them into Packet Header Vec-
tors (PHVs) [13], a fixed-layout, structured format. Then,
switches use the match-action abstraction, looking up each
header field in a table and performing a corresponding oper-
ation; we allocate several MATs for Taurus’s preprocessing.
Taurus then uses map-reduce to evaluate an ML model on
the extracted features, and postprocessing MATs transform
the model’s output into a forwarding decision. Finally, the
packet is scheduled based on the collective decisions of the
match-action pipeline and map-reduce block.
3.1 Parsing & Preprocessing
Taurus preprocesses raw packet headers into a canonical
form before inference: additional data may be added to
augment the packet, and some fields may need repair to cor-
rect abnormal values. Furthermore, data preprocessing uses
rules (implemented with MATs) to convert header fields to
features for the ML model. For our anomaly-detection ex-
ample, IP addresses could be matched against autonomous
system subnets and replaced with features indicating own-
ership or geographic location. The anomaly-detection net-
work would then evaluate the relationships between nu-
meric features to provide an anomaly score.
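As a software analogy for this preprocessing step, the Python sketch below maps an IP address to numeric features by longest-prefix match. The subnet table, the `(owner_id, region_id)` encoding, and the function name `ip_to_features` are all hypothetical; in Taurus this lookup would be performed by MATs rather than by a loop.

```python
import ipaddress

# Hypothetical subnet table: prefix -> (owner_id, region_id).
SUBNET_FEATURES = {
    ipaddress.ip_network("10.0.0.0/8"): (1, 0),
    ipaddress.ip_network("10.1.0.0/16"): (2, 0),
    ipaddress.ip_network("192.0.2.0/24"): (3, 1),
}
DEFAULT_FEATURES = (0, 0)  # used when no prefix matches

def ip_to_features(address):
    """Return the features of the longest matching prefix."""
    addr = ipaddress.ip_address(address)
    best = None
    for net in SUBNET_FEATURES:
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return SUBNET_FEATURES[best] if best is not None else DEFAULT_FEATURES

print(ip_to_features("10.1.2.3"))      # (2, 0)
print(ip_to_features("198.51.100.7"))  # (0, 0)
```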
Taurus replaces categorical relationships with simpler nu-
meric relationships using lookup tables; e.g., a table trans-
forms port numbers into a linear likelihood value, which is
easier to infer from [24]. Moreover, preprocessing can in-
vert the probability distribution underlying a sampled value.
Taking the logarithm of an exponentially distributed vari-
able results in a uniform distribution, which an ML model
[Figure 5: The logical steps for data-plane ML in Taurus.
Stages: Parse, Preprocess, Infer, Postprocess, Schedule.
Steps: Interpret Packet, Integrate Data, Augment Data,
Mat.-Vec. Multiply, Nonlinear Functions, Interpret
Prediction, Send to Destination.]
can process with fewer layers [91]. Such feature engineering
transfers load from an ML model to its designer; using bet-
ter features increases models’ accuracy without increasing
their size [14, 91].
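The two transformations described above can be sketched in a few lines of Python. The port table and its likelihood values are invented for illustration, and the use of `log(1 + x)` rather than a plain logarithm is an assumption made here to keep zero-valued inputs well defined.

```python
import math

# Hypothetical lookup table: destination port -> likelihood score in [0, 1].
PORT_LIKELIHOOD = {22: 0.30, 80: 0.05, 443: 0.02}
UNKNOWN_PORT_LIKELIHOOD = 0.50

def port_feature(port):
    """Replace a categorical port number with a numeric likelihood."""
    return PORT_LIKELIHOOD.get(port, UNKNOWN_PORT_LIKELIHOOD)

def log_feature(value):
    """Compress a heavy-tailed, non-negative value with log(1 + x)."""
    return math.log1p(value)

features = [port_feature(443), log_feature(1500)]
print(features)
```

Both functions are pure table lookups or fixed arithmetic, which is why they fit in preprocessing rather than in the model itself.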
Lastly, In-Band Network Telemetry (INT)—local state em-
bedded into packets—provides switches with a view of
global network state [61], which they can process using
MATs. Taurus devices are therefore not limited to infer-
ence using switches’ local state: instead, models can use
the packet’s entire history (using INT) and the flow’s entire
history (using stateful registers), greatly increasing their
predictive power.
3.2 Postprocessing & Scheduling
MATs can also interpret ML decisions. For example, if our
anomaly-detection model outputs 0.9, indicating a likely-
anomalous packet, MATs decide how the packet should
be handled: it can be dropped, flagged, or quarantined. In
Taurus, these postprocessing MATs connect inference to
scheduling, which uses an abstraction (e.g., PIFO [98]) to
support a variety of scheduling algorithms.
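A minimal Python sketch of such a postprocessing table follows. The thresholds and the mapping from score ranges to actions are assumptions for illustration only; the text names the possible actions but not where the boundaries lie.

```python
# Hypothetical thresholds, checked from highest to lowest.
ACTIONS = [(0.9, "drop"), (0.7, "quarantine"), (0.5, "flag")]

def postprocess(score):
    """Map an anomaly score in [0, 1] to a packet-handling action."""
    for threshold, action in ACTIONS:
        if score >= threshold:
            return action
    return "forward"

print(postprocess(0.9))  # drop
print(postprocess(0.1))  # forward
```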
3.3 Map-Reduce for ML Inference
For each packet, inference combines cleaned features and
model weights to make a decision. Traditional ML algo-
rithms, like Support Vector Machines (SVM) and neural
networks, use matrix-vector linear algebra operations and
element-wise non-linear operations [43, 53]. Non-linear op-
erations let models learn non-linear semantics; otherwise,
the output would be a linear combination of the inputs.
Unlike header processing, ML computation is very reg-
ular, using many multiply-add operations. In the more
computationally-taxing linear portion of a single DNN neu-
ron, input features are each multiplied by a weight, then
added to yield a scalar value. Generalizing this operation,
vector-to-vector (map) and vector-to-scalar (reduce) opera-
tions suffice for the computationally-intensive linear por-
tions of a neuron. This motivates the need for a new data-
plane abstraction, map-reduce, that is flexible enough to
express a variety of ML models but specific enough to allow
efficient hardware development.
[Figure 6: The compute graph of a single perceptron with
the breakdown between map, reduce, and activation functions
(outer-loop map) shown. Map: inputs x0 to x3 are multiplied
by weights W0 to W3. Reduce: the products and the bias B are
summed. Activation: G(z).]
3.3.1 The Map-Reduce Abstraction. Our design uses
map-reduce SIMD parallelism to provide high computa-
tional throughput cheaply. Map operations are element-
wise vector operations, such as addition, multiplication, or
non-linear operations. Reduce operations combine a vector
of elements to a scalar value using associative operations
like addition and multiplication. Figure 6 shows how map
and reduce are used to compute a single neuron (dot prod-
uct), which can be combined into large neural networks.
Map-reduce is a popular form for ML models: map-reduce
can accelerate ML both in distributed systems [39] and at a
finer granularity [21].
By supporting common primitives, we support a set of
applications broader than ML, including Virtual Network
Functions (VNFs) at the switch and NIC [85]. For example,
Elastic RSS (eRSS) uses map-reduce for consistent hashing
to schedule packets and cores: map is used to evaluate
cores’ suitability, and reduce selects the closest core [89].
Map-reduce also supports sketching algorithms, including
Count-Min-Sketches (CMS) [23] for flow-size estimation.
Furthermore, recent research shows that Bloom filters can
also benefit from, or be replaced by, neural networks [87].
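To illustrate how a sketching algorithm fits the same pattern, the Python example below phrases a count-min sketch query as a map over rows followed by a min reduction. The dimensions, the SHA-256-based row hash, and all function names are assumptions for illustration; a hardware implementation would use much cheaper hash functions.

```python
import hashlib
from functools import reduce

DEPTH, WIDTH = 4, 64  # hypothetical sketch dimensions

def row_index(row, key):
    """Per-row hash of the key into [0, WIDTH)."""
    digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % WIDTH

def update(table, key, count=1):
    # Each row independently increments the counter its hash selects.
    for row in range(DEPTH):
        table[row][row_index(row, key)] += count

def estimate(table, key):
    # Map: read the counter that each row assigns to this key.
    candidates = map(lambda row: table[row][row_index(row, key)], range(DEPTH))
    # Reduce: min is associative, so it fits a reduction tree.
    return reduce(min, candidates)

table = [[0] * WIDTH for _ in range(DEPTH)]
for _ in range(5):
    update(table, "flow-a")
update(table, "flow-b", 2)
print(estimate(table, "flow-a"))
```

Hash collisions can only inflate counters, so the estimate is never below the true count.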
Integrating Map-Reduce with P4. To program Tau-
rus, we propose a dedicated P4 control block (like the
ones used for checksums and egress computations [12]).
P4 already expresses three logically-separate abstractions:
parsing, match-action, and scheduling. By adding a fourth
block programmed using a map-reduce abstraction (e.g.,
Spatial [62]), we extend SDN’s flexibility for a new class of
applications. The only additional primitives needed are ar-
rays, map, and reduce (as well as loading weights from the
control plane). Our proposed syntax is shown in Figure 7,
which describes a single DNN layer. The outermost map
iterates over all the layer’s neurons, while the inner map-
reduce pair performs the linear operation for each neuron.
A final map instruction applies an activation function.
Control MapReduce( inout metadata FeatureSet,
                   inout metadata Output ) {
  Weights = loadModelFromFile(Model.csv)
  LinearResults = Map(sizeOf(Weights[0])) { i =>
    Mult_Results = Map(sizeOf(Weights[1])) { j =>
      Weights[i,j] * FeatureSet[j]
    }
    Reduce(Mult_Results) { (x,y) => x + y }
  }
  Output = Map(sizeOf(LinearResults)) { k =>
    ReLU(LinearResults[k])
  }
}
Figure 7: Map-reduce syntax for a DNN layer based on Spatial [62].
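For readers unfamiliar with the proposed syntax, the following Python sketch mirrors the structure of Figure 7 using the built-in `map` and `functools.reduce`. It is a software analogy, not the Taurus toolchain: the nested-list weight layout and the names `dnn_layer` and `relu` are assumptions, and like Figure 7 it omits a bias term.

```python
from functools import reduce

def relu(z):
    return max(0.0, z)

def dnn_layer(weights, features):
    """weights[i][j] is the weight from input feature j to neuron i."""
    def neuron(row):
        products = map(lambda w, x: w * x, row, features)  # inner map
        return reduce(lambda x, y: x + y, products)        # reduce
    linear = map(neuron, weights)    # outer map over neurons
    return list(map(relu, linear))   # final activation map

weights = [[1.0, -2.0], [0.5, 0.5]]
features = [3.0, 1.0]
print(dnn_layer(weights, features))  # [1.0, 2.0]
```

Because the inner map has no dependencies between elements and the reduction is associative, both can be laid out in parallel hardware, which is the property the abstraction is meant to expose.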
3.3.2 Target-Independent Optimizations. Map-reduce
is general enough to support target-independent optimizations:
optimizations that consider available execution resources
(parallelization factors, bandwidth, and more) without
considering hardware-specific design details. Parallelizing
map-reduce programs by unrolling loops in space speeds
up execution: if sufficient hardware resources are available,
a model can have all map and reduce loops laid out spa-
tially for maximum throughput. Because parallelization fac-
tors are compile-time constants, Taurus has deterministic
throughput: a static profile of the whole network account-
ing for the decreased throughput can be created, allowing
operators to easily analyze performance. This static line-rate
reduction is not new: it occurs in RMT recirculation [13],
link oversubscription [45, 78], and elsewhere.
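One simple way to picture this static throughput is the Python sketch below. It rests on an assumption that goes beyond the text: a loop of N iterations unrolled by a factor P occupies the hardware for ceil(N / P) passes per packet, so throughput drops by that factor. The function name `static_throughput` is hypothetical; the 1 GPkt/s line rate is the figure quoted earlier.

```python
import math

LINE_RATE_PKTS_PER_S = 1_000_000_000  # 1 GPkt/s, as quoted for Taurus

def static_throughput(loop_iterations, par_factor, line_rate=LINE_RATE_PKTS_PER_S):
    """Packets per second under the ceil(N / P) passes-per-packet assumption."""
    passes = math.ceil(loop_iterations / par_factor)
    return line_rate // passes

print(static_throughput(16, 16))  # 1000000000 (fully unrolled)
print(static_throughput(16, 4))   # 250000000
```

Since both arguments are compile-time constants, the result is known before deployment, which is what makes a static network-wide profile possible.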
As packet latencies in switches must be low (on the or-
der of hundreds of nanoseconds), latency, not just area,
limits switch-level neural networks. Latency increases with
depth, so a switch-level ML accelerator can handle a limited
number of layers; thus, datacenters’ SLOs essentially force
small models in switches, regardless of the resource con-
straint. By preprocessing features with MATs, we provide
high performance with low latency: the model only has to
learn relationships between features, not the mapping from
header fields to features.
3.4 Avoiding Pathologies
ML models only provide probabilistic guarantees; therefore,
we must constrain their behavior with deterministic bounds
to ensure robust network operation. In a Taurus system, the
user specifies high-level safety (no incorrect behavior) and
liveness (correct behavior happens eventually) properties to
the control plane. The control plane then compiles these
high-level constraints into per-switch constraints, which
are used as part of post-processing. By constraining the
ML model’s decision boundary, we ensure correct network
behavior without complicated model verification.
Starvation. Congestion control is a promising feature
for in-network ML. However, if an ML model were given
free rein over per-flow scheduling decisions, it might (erroneously)
decide that some flows should receive a zero or
near-zero bandwidth allocation, effectively blocking them
from the network. The simplest solution to starvation is
guaranteeing each flow a fixed minimum bandwidth, but
setting the wrong minimum could be problematic: too small,
and flows may be starved; too large, and ML’s optimization
potential is limited. A better option is blending ML and
an existing queueing algorithm, like earliest deadline first
or least attained service, which are already supported by
the PIFO scheduler [98]. By operating in a range set using
heuristics, ML can optimize bandwidth while providing a
reliable worst-case from low loads to high loads.
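The idea of operating within a heuristic-defined range can be sketched as a clamp. In the Python example below, the symmetric `slack` band around the heuristic's share and the function name `bounded_share` are assumptions for illustration; the text does not specify how the range is derived.

```python
def bounded_share(ml_share, heuristic_share, slack=0.5):
    """Clamp the ML-chosen bandwidth share to a band around the heuristic's share.

    slack=0.5 allows ML to move the share up or down by at most 50% (assumption).
    """
    low = heuristic_share * (1.0 - slack)
    high = heuristic_share * (1.0 + slack)
    return min(max(ml_share, low), high)

print(bounded_share(0.0, 0.2))   # 0.1: a starved flow is raised to the floor
print(bounded_share(0.25, 0.2))  # 0.25: within range, ML's choice stands
```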
Incorrect Decisions. Anomaly detection using ML has a
potentially catastrophic pathology: allowing an anomalous
packet that compromises a system. Network operators cur-
rently define anomalous packets using Access Control Lists
(ACLs), which explicitly specify forbidden packets; if ML
were used to approximate an ACL, forbidden packets might
be forwarded. Instead, the ACL can be used as a safety
guarantee, in addition to labeling packets for ML training.
Incoming packets first run through an ML model and are
then compared against the ACL: they are considered anoma-
lous if flagged by either, making the network more secure
than using an ACL alone.
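The combination rule above amounts to a logical OR of the two checks. The Python sketch below shows it with a hypothetical one-entry ACL and an assumed score threshold of 0.5; neither value comes from the text.

```python
import ipaddress

# Hypothetical ACL of forbidden source prefixes and an assumed ML threshold.
ACL_BLOCKED = [ipaddress.ip_network("203.0.113.0/24")]
ML_THRESHOLD = 0.5

def is_anomalous(src_ip, ml_score):
    """A packet is anomalous if either the ACL or the ML model flags it."""
    addr = ipaddress.ip_address(src_ip)
    blocked_by_acl = any(addr in net for net in ACL_BLOCKED)
    flagged_by_ml = ml_score >= ML_THRESHOLD
    return blocked_by_acl or flagged_by_ml

print(is_anomalous("203.0.113.9", 0.0))   # True: the ACL alone suffices
print(is_anomalous("198.51.100.1", 0.1))  # False
```

The ACL check never depends on the model's output, so a mispredicting model cannot admit a packet the ACL forbids.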
Oscillation. A flow may frequently cross a model’s deci-
sion boundary. For example, if ML is used to select between
upstream ports for ECMP, a flow may be sent over sev-
eral ports in quick succession, increasing the burden on
end hosts to reorder packets. The simplest option is a time-
out, which guarantees a minimum number of packets per
decision and decreases flow breaks. Hysteresis is a better
option: once the ML model has made a decision, the deci-
sion boundary is shifted slightly using post-processing to
make that decision more likely. Then, if the flow’s deci-
sion is oscillating immaterially around the original decision
boundary, the new decision boundary will ensure that the
switch’s output never changes. However, if the ML model’s
output changes significantly, hysteresis lets the switch’s
output change immediately.
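Hysteresis of this kind can be sketched for a binary decision as follows. The boundary of 0.5, the margin of 0.1, and the class name `HysteresisDecision` are assumptions for illustration; a real deployment would apply the shifted boundary in postprocessing MATs with per-flow state.

```python
class HysteresisDecision:
    """Binary decision whose boundary shifts in favor of the current choice."""

    def __init__(self, boundary=0.5, margin=0.1):
        self.boundary = boundary
        self.margin = margin
        self.current = None  # no decision made yet

    def decide(self, score):
        if self.current is None:
            self.current = score >= self.boundary
        elif self.current:
            # Currently high: flip only if the score falls clearly below.
            if score < self.boundary - self.margin:
                self.current = False
        else:
            # Currently low: flip only if the score rises clearly above.
            if score >= self.boundary + self.margin:
                self.current = True
        return self.current

h = HysteresisDecision()
print([h.decide(s) for s in (0.55, 0.48, 0.52, 0.45)])  # [True, True, True, True]
print(h.decide(0.2))  # False: a significant change flips at once
```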
4 TAURUS IMPLEMENTATION
The complete physical data-plane pipeline of a Taurus de-
vice is shown in Figure 8, consisting of blocks for packet
parsing, ML with map-reduce, packet forwarding with MATs,
and scheduling. Taurus’s packet parser, pre- and post-processing
[Figure 8: Taurus’s modified data-plane pipeline: Parse,
MAT, Map-Reduce, MAT, Scheduler.]
[Figure 9: Taurus’s map-reduce block and its interface
to the rest of the pipeline. A grid of alternating CUs and
MUs; PHV in: headers and features; PHV out: headers and
output. Non-feature headers bypass map-reduce.]
MATs, and scheduler use existing hardware implementa-
tions [13, 37, 98]. We base Taurus’s map-reduce block on
Plasticine, a Coarse-Grained Reconfigurable Array (CGRA)
composed of a sea of compute and memory units, which are
reconfigurable to match applications’ dataflow graphs [86].
The fraction of the PHV containing features enters the map-
reduce block, while other headers are bypassed directly to
the postprocessing MATs as shown in Figure 9.
Each Compute Unit (CU, Figure 10) is composed of Func-
tional Units (FUs) organized in lanes and stages and per-
forms a map, a reduction, or both. Within a CU stage, all
lanes execute the same instruction and read the same rela-
tive location. CUs have pipeline registers between stages, so
every FU is active on every cycle; pipelining also occurs at
a higher level between CUs. We use Memory Units (MUs),
which are interspersed with CUs in a checkerboard pattern
for locality, to store the weights of ML models (Figure 9).
This also allows coarse-grained pipelining, where CUs perform
operations and MUs act as pipeline registers. However,
as models in network applications have a low memory foot-
print, the sizes of the MUs are negligible (less than 0.02%
overhead for our largest application benchmark, §5). Mul-
tiple levels of pipelining within each CU allow our design
to run at a 1 GHz fixed clock, a crucial factor for matching
the line rate of high-end switch hardware [13, 98]. By using
MATs (VLIW) for data cleaning and map-reduce (SIMD) for
inference, Taurus uses different models of parallelism to
build a fast and flexible data-plane pipeline.
[Figure 10 diagram: four lanes (Lane 0 to Lane 3) by three
stages (Stage 0 to Stage 2); each lane-stage cell is an FU
followed by a PR.]
Figure 10: A three-stage CU pipeline, composed of
Functional Units (FUs) and Pipeline Registers (PRs).
The third stage supports map and sparse reductions.
Precision   Area (µm²)   Power (µW)
fix8        3,877        223
fix16       8,108        393
fix32       20,203       759
Table 3: Area and power scaling (per-FU) at the target
design (16 lanes, 2 stages) for different precisions. Scal-
ing is shown relative to the 8-bit design.
Target-Dependent Compilation. A variety of program-
ming languages natively support map-reduce [52, 74, 80,
106]. To support our Plasticine-based fabric, we implement
Taurus with Spatial, a map-reduce DSL for fast and efficient
hardware [62]. Spatial supports target-dependent optimizations
for Plasticine as well as target-independent optimizations
(discussed in §3.3.2). In Spatial, map-reduce patterns
are represented as nested loops and use per-loop controllers
to sequence execution. Programs are compiled to a stream-
ing dataflow graph from this hierarchy: innermost loops
become SIMD operations within a CU, and outer loops
are mapped over multiple CUs. Then, overly-large patterns
(those requiring too many compute stages, inputs, or mem-
ory banks) are split into smaller patterns that fit in CUs and
MUs; this is necessary to map some activation functions
with long basic blocks. Finally, the resulting graph is placed
and routed on the map-reduce block’s interconnect.
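The loop hierarchy described above can be illustrated with a dense layer written as nested map-reduce patterns. This is plain Python rather than Spatial, and the function names are hypothetical; it only shows the shape of the loop nest, where the inner pattern is the part that would become SIMD operations within a CU and the outer pattern is the part that would be spread over multiple CUs.

```python
from functools import reduce
from operator import add, mul


def dot(weights, features):
    """Inner pattern: elementwise map (multiply), then a reduce (add)."""
    products = map(mul, weights, features)  # map: one multiply per element
    return reduce(add, products, 0)         # reduce: sum the products


def relu(x):
    """Simple activation applied to each neuron's output."""
    return x if x > 0 else 0


def dense_layer(weight_rows, features, activation=relu):
    """Outer pattern: map the inner dot product over every neuron."""
    return [activation(dot(row, features)) for row in weight_rows]
```

For example, `dense_layer([[1, -2], [3, 4]], [5, 6])` evaluates two neurons over the same two input features.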
5 EVALUATION
We first justify our map-reduce block’s configuration by
analyzing its power and area overheads. Next, we evaluate
Taurus’s performance by running several recently-proposed
networking ML applications [71, 102, 113, 115]. Finally, we
demonstrate Taurus’s flexibility by evaluating common ML
components, which can be composed to express a variety
of ML algorithms.
[Figure 11 plots: (a) Area per FU (µm²) versus number of lanes
(4, 8, 16, 32), one series per stage count (2, 3, 4, 6).
(b) Power per FU (mW) versus number of lanes, at 10% switching.]
Figure 11: Area and power consumption per-FU for
various CU configurations (lanes and stages).
5.1 ASIC Design Space Exploration
Taurus’s map-reduce block is parameterized, including pre-
cision, lane count, and stage count; these parameters are
selected to optimize line-rate inference. To quantify Tau-
rus’s area, we use ASIC synthesis with a 28 nm standard
cell library [40].
We first study the impact of arithmetic precisions ranging
from 8 to 32 bits on area and power; as floating point
support is expensive and nonessential for inference, we
restrict Taurus to fixed-point arithmetic. We investigate
differing lane (4–32) and stage (2–6) counts, and determine
that an 8-bit data path with 16 lanes and 2 stages is the
ideal configuration to support today’s network-inference
applications.
5.1.1 Precision Selection. For ML inference, fixed-point
arithmetic is faster than floating point with equivalent accu-
racy [49, 58]. We believe that 8-bit precision suffices for ML
(compressed models use even fewer bits [110]) and use 8-bit
precision for Taurus; however, several industrial designs,
such as Google’s TPU, use 16-bit data paths [58]. We there-
fore evaluate alternate designs with greater precision and
show that precision has a roughly linear cost: going from
8-bit to 16-bit data widths corresponds to a proportional
(2×) increase in area and power (Table 3).
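The scaling claim can be checked directly against the per-FU numbers in Table 3. The short script below is only a worked calculation over those published values; the dictionary layout and function name are illustrative.

```python
# Per-FU figures copied from Table 3 (area in um^2, power in uW).
TABLE_3 = {
    "fix8": (3877, 223),
    "fix16": (8108, 393),
    "fix32": (20203, 759),
}


def scaling_vs_fix8(precision: str):
    """Return (area ratio, power ratio) of `precision` relative to fix8."""
    base_area, base_power = TABLE_3["fix8"]
    area, power = TABLE_3[precision]
    return area / base_area, power / base_power


for name in TABLE_3:
    area_x, power_x = scaling_vs_fix8(name)
    print(f"{name}: {area_x:.2f}x area, {power_x:.2f}x power")
```

Doubling the width from 8 to 16 bits roughly doubles area and raises power by a somewhat smaller factor, which is the "roughly linear" behavior described in the text.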
5.1.2 Lane Count Selection. As CU lane and stage
counts increase, the number of FUs, and therefore area,
will increase; however, if we were to simply add CUs, area
would also increase. Therefore, we normalize CU area and
power by FU count to investigate the relative efficiency
of different CU designs. Figure 11 shows that the per-FU
area and dynamic power decrease with lane count, because
adding lanes or stages amortizes the CU's control logic
and overhead across more FUs. However, small models cannot
be efficiently mapped to large CUs: if there is less
application-level parallelism available than CU lanes, some
lanes will be unused. Likewise, stages in a CU beyond those
needed for a basic block will also be unused, because each
basic block has its own controller and the CU only has
hardware support for one control hierarchy.
[Figure 12 plot: area (mm²) for LeakyReLU, ReLU, SigmoidExp,
SigmoidPW, TanhExp, TanhPW, SigmoidLUT, and TanhLUT, with one
series per stage count (2, 3, 4, 6).]
Figure 12: Area needed for activation functions as the
number of stages varies from 2 to 6. All functions run
at line-rate, except sigmoid and tanh that operate at
half the speed.
The anomaly-detection DNN is our largest model requir-
ing line-rate operation, so we use it to set the ideal lane
count. The DNN’s largest layer has 12 hidden units, so the
largest dot-product calculations involve 12 elements; the
16-lane configuration fully unrolls the dot product within a
single CU while minimizing underutilization. Currently, the
16-lane configuration balances area overhead, power, and
mapping efficiency, but optimal lane counts may change
as data-plane ML models evolve. Because map-reduce pro-
grams are hardware agnostic, programs can run on new
configurations unmodified; the compiler will handle the
differences in unrolling factors as needed.
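The utilization argument can be made concrete with a deliberately simplified model: one vector element per lane per pass, with unused lanes idle. The function below is an illustrative assumption, not the compiler's actual mapping algorithm.

```python
import math


def lane_utilization(vector_len: int, lanes: int):
    """Return (passes, utilization) for a dot product of `vector_len` elements.

    Assumes one element per lane per pass, with unused lanes left idle.
    """
    passes = math.ceil(vector_len / lanes)
    utilization = vector_len / (passes * lanes)
    return passes, utilization
```

Under this model, the 12-element dot product fits in one pass on 16 lanes at 75% utilization. A 32-lane CU also needs one pass but leaves most lanes idle, and an 8-lane CU needs two passes, so the dot product is no longer fully unrolled within a single CU.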
5.1.3 Stage Count Selection. We perform a scaling
study to quantify the impact of CU stage counts on area
(Figure 12). For this study, we use activation functions as
they have the deepest compute graphs; we sweep CU stage
count and report the area of the smallest array that maps
each function. For Taylor series approximations (SigmoidExp
and TanhExp), stages added to CUs are used to map
computation, but overall area remains flat: adding stages
is roughly equivalent to adding CUs. Furthermore, for
activation functions with shallow compute graphs (e.g., ReLU),
adding stages decreases efficiency: the later stages are not
mapped. Dot products require only two stages: one for the
map/multiply, and one for the reduce/add. Because theoretical
area- and energy-efficiency per FU increase with stage count,
deeper CUs are attractive in principle. However, more stages
are not useful for functions like LUTs, ReLU, and linear
algebra, so we use two stages.
                  Perf.            Area           Power
App      Model    GPkt/s   ns      mm²     +%     mW     +%
IoT      KMeans   1.00     76      2.48    3.3    142    0.56
Anomaly  SVM      1.00     68      4.59    6.1    263    1.1
Anomaly  DNN      1.00     188     8.80    11.7   506    2.0
Indigo   LSTM     0.08     380     17.73   23.6   1018   4.1
Table 4: Performance and resource overheads of several
application models. Overheads are calculated relative to a
300 mm² chip with 4 reconfigurable pipelines [47], each
drawing an estimated 25 W.
5.1.4 Final Prototype ASIC Configuration. We end up
with a CU configuration that has 16 lanes, 2 stages, and uses
an 8-bit fixed-point data path; each CU takes 0.124 mm²,
with a single FU taking 3,877 µm². Our Taurus parameters
are based on the applications and functions in use today:
as ML for networking grows, we may need to revisit these
parameters. Regardless, our parameterized design shows
that map-reduce can be supported with a small amount of
additional hardware.
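The per-CU area follows from the per-FU figure and the chosen lane and stage counts. The calculation below is a simple consistency check on the numbers quoted above; the variable names are illustrative.

```python
FU_AREA_UM2 = 3877      # per-FU area at fix8 (Table 3)
LANES, STAGES = 16, 2   # final CU configuration

fus_per_cu = LANES * STAGES
cu_area_mm2 = fus_per_cu * FU_AREA_UM2 / 1e6  # 1 mm^2 = 1e6 um^2
print(f"{fus_per_cu} FUs -> {cu_area_mm2:.3f} mm^2")
```

The product of 32 FUs and 3,877 µm² matches the quoted CU area to three decimal places, which is consistent with the per-FU numbers being CU area normalized by FU count.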
5.2 Application Benchmarks
We evaluate Taurus using four ML models: an IoT traf-
fic classification model [113], two anomaly-detection mod-
els [71, 102], and a model that learns congestion-control
windows [115]. The IoT traffic classification implements
KMeans clustering, using 11 packet- and flow-level fea-
tures, to classify IoT traffic into five categories. The first
anomaly-detection algorithm is an SVM [71] that uses of-
fline dimensionality reduction to select eight key features
of the 41 in the KDD intrusion-detection data set [4, 26].
The SVM uses a radial basis function to model nonlinear
relationships. Our second anomaly-detection algorithm is
a DNN that takes six input features (also a subset of KDD
features) and produces two outputs: the probability of a ma-
licious packet and the probability of a safe packet. The DNN
has three intermediate layers with 12, 6, and 3 hidden units,
respectively [102]. Finally, the online congestion-control al-
gorithm (Indigo [115]) is an LSTM-based network. Indigo
uses 32 LSTM units followed by a softmax layer and is
designed to run at an endpoint.
Table 4 shows the performance, area overheads, and
power requirements of our benchmarks on Taurus, com-
pared against a 300mm2 [37] programmable switch chip
with an RMT-based pipeline [13]. By mapping traffic clas-
sification and anomaly detection to Taurus, we show that
real models can run at line rate in switches. Both anomaly-
detection applications learn to detect malicious packets with
accuracies better than non-ML solutions when running on
Taurus, with each using a different ML algorithm. The abil-
ity to run multiple ML models for one problem shows Tau-
rus’s generality: after network-specific pre/postprocessing,
any map-reduce model can be used, allowing network op-
erators to select an optimal model. With the congestion-
control model, we investigate a neural network running at
short intervals, instead of per-packet—operating only at a
small fraction of line rate still yields major improvements
over the Indigo software.
5.2.1 Area & Power. We examine overall area and power
with respect to an existing programmable switch ASIC to
show the additional cost of implementation. Table 4 reports
the area of only the CUs needed to implement each oper-
ation; therefore, the actual area of a prototype for these
benchmarks is the area of the largest benchmark, with un-
used CUs disabled for smaller benchmarks. Simple models,
like SVM-based anomaly detection, have as little as 6.1%
area overhead and 1.1% power overhead. Indigo, our largest
model, consumes an additional 23.6% area and 4.1% power
because it is not fully unrolled. Therefore, we choose Tau-
rus’s map-reduce block area as 17.73mm2. If switch design-
ers choose to only support smaller models, KMeans, SVMs,
and DNNs add only 12% more area and 2% more power.
5.2.2 Latency & Throughput. KMeans, the SVM, and
the DNN process one packet’s headers per cycle (line-
rate); they do not affect throughput, and latency remains
in the nanosecond range (Table 4). Assuming a datacen-
ter switch latency of 1 µs [28], KMeans, the SVM, and the
DNN add 7.6%, 6.8%, and 18.8% more latency, respectively.
We also use Indigo to estimate the performance of mod-
els doing periodic—not per-packet—control updates within
a network; these models provide more detailed updates
for real-time events, like link congestion. In software, the
Indigo LSTM network significantly improves application-
level throughput and latency [115], operating in 10ms
intervals—likely slowed due to the LSTM’s computational
requirements. In Taurus, Indigo can produce a decision ev-
ery 12.5 ns with each step taking 380 ns: this allows the
LSTM network to react more quickly to changes in load
and better control tail latency. Overall, Taurus can run per-
packet models with minimal performance impact, and al-
low periodic models to make decisions orders of magnitude
faster than software.
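The percentages and the decision interval quoted in this section can be recomputed from Table 4. The sketch below assumes the baseline for area and power is one of the four pipelines (75 mm² and 25 W), an interpretation that reproduces the table's overhead columns; the names are illustrative.

```python
PIPELINE_AREA_MM2 = 300 / 4   # 300 mm^2 chip, 4 pipelines (Table 4 caption)
PIPELINE_POWER_MW = 25_000    # estimated 25 W per pipeline
SWITCH_LATENCY_NS = 1_000     # assumed 1 us datacenter switch latency

# (throughput GPkt/s, latency ns, area mm^2, power mW) from Table 4.
TABLE_4 = {
    "KMeans": (1.00, 76, 2.48, 142),
    "SVM": (1.00, 68, 4.59, 263),
    "DNN": (1.00, 188, 8.80, 506),
    "LSTM": (0.08, 380, 17.73, 1018),
}


def overheads(model: str):
    """Return area %, power %, latency %, and ns between decisions."""
    gpkts, latency_ns, area, power = TABLE_4[model]
    return {
        "area_pct": 100 * area / PIPELINE_AREA_MM2,
        "power_pct": 100 * power / PIPELINE_POWER_MW,
        "latency_pct": 100 * latency_ns / SWITCH_LATENCY_NS,
        "interval_ns": 1 / gpkts,  # 1 / (GPkt/s) is ns per packet
    }
```

For the LSTM, a throughput of 0.08 GPkt/s corresponds to one decision every 12.5 ns, and the mean of the four latencies in Table 4 is 178 ns.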
µbmark         Area (mm²)   Lat. (ns)
Linear
  Conv1D       4.93         47
  Percept      0.78         16
  SVMLin       1.82         30
  LSTMLin      2.34         29
  GRULin       2.34         29
Nonlinear
  LeakyReLU    0.78         21
  ReLU         0.52         20
  SigmoidLUT   0.52         27
  TanhLUT      0.52         27
Table 5: Area and latency of each microbenchmark,
running at line rate in a 16-lane, 2-stage CU.
[Figure 13 diagram: a DNN drawn as Percept blocks, ReLU blocks,
and a final Sigmoid (σ) block.]
Figure 13: A small DNN, broken down into indepen-
dent microbenchmarks.
5.3 Microbenchmarks
Finally, we evaluate Taurus on a variety of microbench-
marks to investigate the key hardware features driving
application performance. Smaller dataflow programs can
be composed into a single, large program: for example,
Figure 13 shows a DNN built from several perceptron layers
fused with nonlinear activation functions. Because Taurus is
spatially reconfigurable, the area overhead of any
model is the sum of its constituent parts; these parts define
the hardware needed to implement the model. By evaluat-
ing these building blocks of ML applications, we provide
general results that can be adapted to a variety of design
points.
We divide microbenchmarks into two categories: linear
and nonlinear functions, which play different roles in a
model and have different implementation characteristics.
Linear functions are notable because they are not perfectly
parallel: they include a reduction network that limits the
degree of communication-free parallelism. Conversely, non-
linear functions can be perfectly SIMD-parallelized because
there is no interaction between adjacent data elements. For
example, if the output of 16 different perceptrons is input
to a ReLU, we simply map the ReLU over the 16 outputs,
which are then computed in parallel. If fully unrolled, the
latency of this operation is the sum of the perceptron and
ReLU execution time.
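Following the additive rule stated above, a composed model can be estimated by summing its building blocks. The sketch below uses three rows of Table 5 and a hypothetical `compose` helper. Since each Table 5 latency already includes data movement, the summed latency should be read as a rough estimate rather than a measured result.

```python
# (area mm^2, latency ns) per microbenchmark, copied from Table 5.
TABLE_5 = {
    "Percept": (0.78, 16),
    "ReLU": (0.52, 20),
    "SigmoidLUT": (0.52, 27),
}


def compose(*blocks: str):
    """Estimate a fully unrolled chain by summing per-block area and latency."""
    area = sum(TABLE_5[b][0] for b in blocks)
    latency = sum(TABLE_5[b][1] for b in blocks)
    return round(area, 2), latency
```

For instance, `compose("Percept", "ReLU")` estimates a perceptron layer followed by a ReLU.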
µbmark       Unroll   Line Rate   Area (mm²)
Conv1D       1        1/8         0.78
             2        1/4         1.30
             4        1/2         2.34
             8        1           4.93
SVMLin       1        1/2         1.30
             2        1           1.82
Perceptron   –        1           0.78
Table 6: Throughput and area scaling of microbench-
marks with unrolling factors from 1 to 8.
Linear Operations. Our primary linear microbench-
marks are a one-dimensional convolution, a perceptron
kernel, and linear SVM. We also evaluate the linear
components of LSTM and GRU cells, which have an underlying
computation similar to the perceptron. The convolution
kernel captures position-invariant features and is frequently
used to find spatial or temporal correlations [64]. Table 5
shows the area required for each microbenchmark when
unrolled to run at line rate. Because the convolution does
not map well to vectorized map-reduce (there are multiple
small inner reductions), it requires 8× unrolling and much
chip area. However, the SVM and perceptron run at line
rate with less than 2mm2 of additional chip area; they can
be efficiently composed into high-performance deep neural
networks.
The latencies imposed by each microbenchmark are also
shown in Table 5. The convolution and SVM kernels have
the highest latencies—each has a small loop that is unrolled
across CUs. This adds another stage of inter-CU communi-
cation, and therefore latency; because the perceptron runs
entirely within a single CU, it has the lowest latency. The
minimum latency for a 16-lane CU to perform a map-reduce
is five cycles: one cycle for map and four cycles for reduce,
using different fractions of a single stage for each reduction
cycle (Figure 10). The remaining latency comes from data
movement from the input to the CU and then to the output;
Taurus takes roughly five cycles for each data movement—a
result of its spatially-distributed dataflow layout.
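The five-cycle minimum corresponds to one map cycle plus a pairwise reduction tree over the lanes. The model below is a simplified sketch that assumes each reduce cycle halves the number of live partial sums; the function name is illustrative.

```python
def map_reduce_cycles(lanes: int) -> int:
    """Cycles for one map followed by a pairwise (tree) reduction."""
    cycles = 1            # one cycle for the elementwise map
    active = lanes
    while active > 1:     # each reduce cycle halves the live partial sums
        active = (active + 1) // 2
        cycles += 1
    return cycles
```

For 16 lanes this gives 1 + 4 = 5 cycles. Adding roughly five cycles for each of the two data movements gives about 15 cycles, close to the 16 ns reported for the perceptron in Table 5 at a 1 GHz clock.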
Unrolling. Table 6 shows the area and throughput im-
pact of outer-loop unrolling on a selection of linear micro-
benchmarks. Not all benchmarks can have their outer-loop
unrolled: for example, our untiled perceptron has no outer
loop. The iterative (i.e., loop-based) versions of the SVM
and the convolution run at one-half and one-eighth of line
rate, respectively; this corresponds to two loop iterations
and eight iterations per packet. By unrolling the SVM twice,
throughput improves to line rate at the cost of a 40% area in-
crease; however, unrolling the convolution to meet line rate
results in a 6.3× area increase. Using map-reduce’s target-
independent optimization, large ML models can loop, thus
running over multiple cycles with a corresponding reduc-
tion in the number of packets forwarded per second.
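The throughput and area trade-off in Table 6 can be tabulated programmatically. In this sketch the fraction of line rate is taken to be the unroll factor divided by the iterations per packet, which matches every row of the table; the helper name is hypothetical.

```python
from fractions import Fraction

# (iterations per packet when not unrolled, {unroll: area mm^2}) from Table 6.
TABLE_6 = {
    "Conv1D": (8, {1: 0.78, 2: 1.30, 4: 2.34, 8: 4.93}),
    "SVMLin": (2, {1: 1.30, 2: 1.82}),
}


def unroll_tradeoff(kernel: str, unroll: int):
    """Return (fraction of line rate, area relative to the iterative version)."""
    iterations, areas = TABLE_6[kernel]
    line_rate = Fraction(unroll, iterations)
    return line_rate, areas[unroll] / areas[1]
```

Unrolling the SVM twice reaches line rate for about 1.4× the area, whereas the convolution needs 8× unrolling and about 6.3× the area to do the same.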
Nonlinear Operations. Activation functions are nec-
essary to learn nonlinear behavior; otherwise, the entire
neural network would collapse into a single linear func-
tion. Different activation functions are used for different
purposes: tanh is used in LSTMs to implement gating [54],
while DNNs use ReLU and Leaky ReLU, which are easier to
implement [77]. The area and latency results for nonlinear
operations are also shown in Table 5.
The most efficient functions, including ReLU and Leaky
ReLU, take under 1mm2; they do not use lookup tables.
More complicated functions, including sigmoid and tanh,
have several versions: Taylor series, piecewise approxima-
tions, and lookup tables [49, 111]. Taylor series and piece-
wise approximations require 2–5 times as many resources
as other activation functions (Figure 12). LUT-based func-
tions need memory, but only a small amount: each table
stores 1024 8-bit entries; even when replicated for parallel
lookups, this is a trivial fraction of a switch chip’s total
memory. Therefore, we present microbenchmark results for
LUT-based sigmoid and tanh. Latencies of nonlinear kernels
are lower than linear kernels because there is no inter-CU
communication generated by loop unrolling.
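A LUT-based sigmoid with 1024 8-bit entries can be sketched as follows. The table size comes from the text; the input range of [-8, 8), the linear indexing, and the scaling of outputs to 0..255 are assumptions made for illustration only.

```python
import math

ENTRIES = 1024            # table size stated in the text
X_MIN, X_MAX = -8.0, 8.0  # assumed input range; not given in the source


def build_sigmoid_lut():
    """Precompute sigmoid at 1024 evenly spaced inputs, quantized to 8 bits."""
    step = (X_MAX - X_MIN) / ENTRIES
    return [
        round(255 / (1 + math.exp(-(X_MIN + i * step))))
        for i in range(ENTRIES)
    ]


def sigmoid_lut(x: float, lut) -> float:
    """Approximate sigmoid(x) with one table lookup; inputs are clamped."""
    idx = int((x - X_MIN) / (X_MAX - X_MIN) * ENTRIES)
    idx = max(0, min(ENTRIES - 1, idx))
    return lut[idx] / 255
```

Each such table occupies 1024 bytes, so even several replicated copies for parallel lookups remain small, in line with the memory argument above.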
6 FUTURE WORK
In this paper, we show that Taurus enables inference for
per-packet ML algorithms in the data plane; with it, a wide
variety of network ML research directions become available.
Dimensionality Reduction for Data Augmentation.
Data augmentation (joining input data with statically-known
relationships to aid ML) becomes challenging as the number
of new features grows. For example, a network operator
using IP-correlated data to precisely model a packet's source
may add dozens of derived features. Storing these in MATs
would be too expensive; however, dimensionality reduction
can reduce feature counts while maintaining the underlying
information [108].
The benefits of dimensionality reduction are twofold: it
reduces the amount of preprocessing data and decreases
model sizes. However, dimensionality reduction cannot
replace ML due to the cross-product explosion: multiple fields
could be reduced at once, but due to exact/wildcard matching
(binary or ternary), the flow-table sizes grow exponentially
with the number of fields. Therefore, in Taurus,
dimensionality reduction is best suited to provide additional information
about input fields (e.g., IP address or port number), while
ML identifies relationships between fields.
Shrinking Models. A major Taurus application will be
network control and coordination. Neural networks can
solve a variety of control problems [9, 99, 107] and are get-
ting smaller. For example, structured control nets [99] for
non-linear control perform almost as well as 512-neuron
DNNs using as few as four neurons per layer. With such
small networks, Taurus can run multiple models simultane-
ously (e.g., one model for intrusion detection and another
for traffic optimization). In addition, techniques like
quantization, pruning, and distillation can further reduce a
model's size [8, 51, 57, 109].
Learned Traffic Management. For Taurus to run ML
models accurately, training models properly is paramount.
Simultaneous training for learned congestion control lets
devices make decisions using knowledge of other devices’
policies [112]. Using data-plane ML models, we can force all
data-plane functions to take a global view. For example, a
learned scheduling algorithm could be bootstrapped using
a simulated (or emulated) data center (like CrystalNet [66]):
realistic traffic would drive the simulation, while switch
weights are trained to route more optimally and improve
throughput. Further improvements would occur online, us-
ing sampled traces from switches to gradually adapt to
changes in traffic. Effective network training must optimize
globally: if devices are trained in isolation, they will behave
greedily, lowering efficiency and quality of experience.
7 RELATED WORK
Architectures for ML. While Taurus uses Plasticine,
a vectorized CGRA [86], as the basis of its map-reduce
block, it could feasibly be implemented with other fabrics.
The most widely-available reconfigurable architectures are
Field Programmable Gate Arrays (FPGAs), which are used
as both custom accelerators [93] and prototyping tools (e.g.,
the NetFPGA [2, 69]). However, FPGAs’ on-chip intercon-
nects consume up to 70% of the total chip power [16], and
their variable, slow clock frequencies complicate interfac-
ing and operating at network switch speeds (multi-terabits
per second). CGRAs are optimized for arithmetic to support
ML better and typically have a fast, fixed clock frequency
that allows seamless integration between the map-reduce
block and MATs in Taurus [25, 35, 41, 70, 72, 96]. Other ar-
chitectures, like Eyeriss [19], Brainwave [34], and EIE [50],
achieve high efficiency by focusing on specific algorithms.
These could be used for in-switch ML, but are too rigid: if a
specific accelerator were standardized, networks would be
unable to benefit from future ML research due to the lack
of a flexible abstraction (like map-reduce).
ML For Networking. Many networking applications can
benefit from ML. For example, learned algorithms for con-
gestion control [112, 115] have been shown to outperform
their human-designed counterparts [27, 48, 116]. In addi-
tion, Boutaba et al. [15] identify ML use cases for network
tasks such as traffic classification [29, 31], traffic predic-
tion [17, 20], active queue management [100, 121], and
security [83]. All of these applications could immediately
be deployed in networks today using Taurus.
Networking For ML. Specialized networks can also
accelerate ML algorithms themselves. With minor enhancements
to modern data-plane hardware, switches can aggregate gra-
dients in-network, accelerating training by up to 300% [90].
Gaia, a system for distributed ML [55], also accounts for
wide-area network bandwidth and regulates the movements
of gradients during the training process. While Taurus is not
explicitly designed to accelerate distributed training, map-
reduce supports aggregating numeric weights contained in
packets more efficiently than MATs.
8 CONCLUSION
Self-driving networks—networks that make observations
about their performance and improve themselves—would
increase efficiency and users’ quality of experience in mod-
ern and future data centers, but neither the programming
abstraction nor the hardware exists, today, to realize such a
network. Bridging this gap, Taurus is the equivalent of adap-
tive cruise control: automatically adjusting parameters in
response to changing network conditions. We demonstrate
that Taurus operates at line-rate and adds minimal over-
head to a programmable switch pipeline (e.g., RMT)—24%
more area and 178 ns average latency—while accelerating
several recently-proposed networking benchmarks. Taurus
replaces data-plane heuristics with learned functions and
can inter-operate with existing data-plane devices. Given
a mixture of Taurus and traditional networking hardware
(e.g., using only Taurus NICs or ToR switches), Taurus’s ML
models will make optimal decisions accounting for existing
heuristics.
We hope that Taurus will eventually enable full network
automation, beyond just performance tuning and learned
network security. Operators could use a fully-autonomous
ML model for packet forwarding—with tight bounds on
its output, like training wheels. The bounds would allow
the autonomous model to make decisions and serve as the
initial labeling function for bad decisions. As the model
becomes more reliable, the bounds could be relaxed until
ML is making virtually all packet-forwarding decisions.
To build a self-driving network, hardware must be de-
ployed before large-scale training can begin: Taurus gives
a foothold for in-network ML with hardware that can be
installed—and improve performance and security—in next-
generation data-planes.
REFERENCES
[1] Barefoot Networks: Deep Insight. https://www.barefootnetworks.
com/static/app/pdf/DI-UG42-003ea-ProdBrief.pdf. Accessed on
02/07/2020.
[2] NetFPGA: A Line-rate, Flexible, and Open Platform for Research
and Classroom Experimentation. https://netfpga.org/. Accessed on
02/07/2020.
[3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J.,
Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow:
A System For Large-Scale Machine Learning. In USENIX OSDI (2016).
[4] Aggarwal, P., and Sharma, S. K. Analysis of KDD Dataset
Attributes-Class Wise For Intrusion Detection. Computer Science 57
(2015), 842–851.
[5] Alizadeh, M., Edsall, T., Dharmapurikar, S., Vaidyanathan, R.,
Chu, K., Fingerhut, A., Lam, V. T., Matus, F., Pan, R., Yadav, N.,
and Varghese, G. CONGA: Distributed Congestion-aware Load
Balancing for Datacenters. In ACM SIGCOMM (2014).
[6] Alshammari, R., and Zincir-Heywood, A. N. Machine Learning
Based Encrypted Traffic Classification: Identifying SSH and Skype.
In IEEE CISDA (2009).
[7] Auld, T., Moore, A. W., and Gull, S. F. Bayesian Neural Networks
For Internet Traffic Classification. IEEE Transactions on Neural Net-
works 18, 1 (2007), 223–239.
[8] Ba, J., and Caruana, R. Do Deep Nets Really Need to be Deep? In
NeurIPS (2014).
[9] Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The
Arcade Learning Environment: An Evaluation Platform for General
Agents. Journal of Artificial Intelligence Research (JAIR) 47 (2013),
253–279.
[10] Benson, T., Akella, A., and Maltz, D. A. Network Traffic Charac-
teristics of Data Centers in the Wild. In ACM IMC (2010).
[11] Bernaille, L., Teixeira, R., Akodkenou, I., Soule, A., and Salama-
tian, K. Traffic Classification on the Fly. ACM SIGCOMM Computer
Communication Review (CCR) 36, 2 (2006), 23–26.
[12] Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown, N., Rex-
ford, J., Schlesinger, C., Talayco, D., Vahdat, A., Varghese, G.,
et al. P4: Programming protocol-independent packet processors.
ACM SIGCOMM Computer Communication Review 44, 3 (2014), 87–
95.
[13] Bosshart, P., Gibb, G., Kim, H.-S., Varghese, G., McKeown, N., Iz-
zard, M., Mujica, F., and Horowitz, M. Forwarding Metamorpho-
sis: Fast Programmable Match-Action Processing in Hardware for
SDN. In ACM SIGCOMM (2013).
[14] Bottou, L. Feature Engineering. https://www.cs.princeton.edu/
courses/archive/spring10/cos424/slides/18-feat.pdf, 2010. Accessed
on 02/07/2020.
[15] Boutaba, R., Salahuddin, M. A., Limam, N., Ayoubi, S., Shahriar,
N., Estrada-Solano, F., and Caicedo, O. M. A Comprehensive Sur-
vey on Machine Learning for Networking: Evolution, Applications
and Research Opportunities. Journal of Internet Services and Applica-
tions (JISA) 9, 1 (2018), 16.
[16] Calhoun, B. H., Ryan, J. F., Khanna, S., Putic, M., and Lach, J.
Flexible Circuits and Architectures for Ultralow Power. Proceedings
of the IEEE 98, 2 (2010), 267–282.
[17] Chabaa, S., Zeroual, A., and Antari, J. Identification and Predic-
tion of Internet Traffic Using Artificial Neural Networks. Journal
of Intelligent Learning Systems and Applications (JILSA) 2, 03 (2010),
147.
[18] Chen, H., and Benson, T. The Case for Making Tight Control Plane
Latency Guarantees in SDN Switches. In ACM SOSR (2017).
[19] Chen, Y.-H., Krishna, T., Emer, J. S., and Sze, V. Eyeriss: An Energy-
Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks. IEEE Journal of Solid-State Circuits 52, 1 (2016), 127–138.
[20] Chen, Z., Wen, J., and Geng, Y. Predicting Future Traffic Using
Hidden Markov Models. In IEEE ICNP (2016).
[21] Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G., Olukotun, K.,
and Ng, A. Y. Map-Reduce for Machine Learning on Multicore. In
NeurIPS (2007), pp. 281–288.
[22] Cisco Systems, I. Cisco Meraki (MX450): Powerful Security and SD-
WAN for the Branch & Campus. https://meraki.cisco.com/products/
appliances/mx450. Accessed on 02/07/2020.
[23] Cormode, G., and Muthukrishnan, S. An Improved Data Stream
Summary: The Count-Min Sketch and its Applications. Journal of
Algorithms 55, 1 (2005), 58–75.
[24] Covington, P., Adams, J., and Sargin, E. Deep Neural Networks
for Youtube Recommendations. In ACM RecSys (2016).
[25] Cronquist, D. C., Fisher, C., Figueroa, M., Franklin, P., and
Ebeling, C. Architecture Design of Reconfigurable Pipelined Datapaths.
In IEEE ARVLSI (1999).
[26] Dhanabal, L., and Shantharajah, S. A Study on NSL-KDD
Dataset for Intrusion Detection System Based on Classification Al-
gorithms. International Journal of Advanced Research in Computer
and Communication Engineering (IJARCCE) 4, 6 (2015), 446–452.
[27] Dong, M., Li, Q., Zarchy, D., Godfrey, P. B., and Schapira, M.
PCC: Re-architecting Congestion Control for Consistent High Per-
formance. In USENIX NSDI (2015).
[28] EMC, D. Data Center Switching Quick Reference Guide.
https://i.dell.com/sites/doccontent/shared-content/data-
sheets/en/Documents/Dell-Networking-Data-Center-Quick-
Reference-Guide.pdf. Accessed on 02/07/2020.
[29] Erman, J., Arlitt, M., and Mahanti, A. Traffic Classification Using
Clustering Algorithms. In ACM MineNet (2006).
[30] Erman, J., Mahanti, A., and Arlitt, M. Qrp05-4: Internet Traffic
Identification Using Machine Learning. In IEEE Globecom (2006).
[31] Erman, J., Mahanti, A., Arlitt, M., and Williamson, C. Identi-
fying and Discriminating Between Web and Peer-to-Peer Traffic in
the Network Core. In WWW (2007).
[32] Este, A., Gringoli, F., and Salgarelli, L. Support Vector Machines
For TCP Traffic Classification. Computer Networks 53, 14 (2009),
2476–2490.
[33] Firestone, D., Putnam, A., Mundkur, S., Chiou, D., Dabagh, A.,
Andrewartha, M., Angepat, H., Bhanu, V., Caulfield, A., Chung,
E., et al. Azure Accelerated Networking: SmartNICs in the Public
Cloud. In USENIX NSDI (2018).
[34] Fowers, J., Ovtcharov, K., Papamichael, M., Massengill, T., Liu,
M., Lo, D., Alkalay, S., Haselman, M., Adams, L., Ghandi, M., et al.
A Configurable Cloud-Scale DNN Processor for Real-Time AI. In
IEEE ISCA (2018).
[35] Gao, M., and Kozyrakis, C. HRL: Efficient and Flexible Reconfig-
urable Logic for Near-Data Processing. In IEEE HPCA (2016).
[36] Geng, Y., Liu, S., Yin, Z., Naik, A., Prabhakar, B., Rosenblum, M.,
and Vahdat, A. SIMON: A Simple and Scalable Method for Sensing,
Inference and Measurement in Data Center Networks. In USENIX
NSDI (2019).
[37] Gibb, G., Varghese, G., Horowitz, M., and McKeown, N. Design
principles for packet parsers. In ACM/IEEE ANCS (2013).
[38] Gill, P., Jain, N., and Nagappan, N. Understanding Network Fail-
ures in Data Centers: Measurement, Analysis, and Implications. In
ACM SIGCOMM (2011).
[39] Gillick, D., Faria, A., and DeNero, J. Mapreduce: Distributed
Computing for Machine Learning. Berkeley, Dec 18 (2006).
[40] Goldman, R., Bartleson, K., Wood, T., Kranen, K., Melikyan, V.,
and Babayan, E. 32/28nm Educational Design Kit: Capabilities, De-
ployment and Future. In IEEE PrimeAsia (2013).
[41] Goldstein, S. C., Schmit, H., Budiu, M., Cadambi, S., Moe, M., and
Taylor, R. R. PipeRench: A Reconfigurable Architecture and Com-
piler. Computer 33, 4 (2000), 70–77.
[42] Gont, F., and Yourtchenko, A. On the Implementation of the TCP
Urgent Mechanism. RFC 6093, Jan. 2011.
[43] Goodfellow, I., Bengio, Y., and Courville, A. Deep
Learning, vol. 1. MIT Press, Cambridge, 2016.
[44] Govindaraju, V., Ho, C.-H., Nowatzki, T., Chhugani, J., Satish,
N., Sankaralingam, K., and Kim, C. DySER: Unifying Functionality
and Parallelism Specialization for Energy-Efficient Computing. IEEE
Micro 32, 5 (2012), 38–51.
[45] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C.,
Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A Scal-
able and Flexible Data Center Network. In ACM SIGCOMM (2009).
[46] Guo, C., Yuan, L., Xiang, D., Dang, Y., Huang, R., Maltz, D., Liu,
Z., Wang, V., Pang, B., Chen, H., et al. Pingmesh: A Large-Scale
System for Data Center Network Latency Measurement and Analy-
sis. In ACM SIGCOMM (2015).
[47] Gurevich, V. Programmable Data Plane at Terabit Speeds, May 2017.
Accessed on 02/07/2020.
[48] Ha, S., Rhee, I., and Xu, L. CUBIC: A New TCP-Friendly High-Speed
TCP Variant. ACM SIGOPS Operating Systems Review 42, 5 (2008), 64–
74.
[49] Han, S., Kang, J., Mao, H., Hu, Y., Li, X., Li, Y., Xie, D., Luo, H., Yao,
S., Wang, Y., et al. ESE: Efficient Speech Recognition Engine with
Sparse LSTM on FPGA. In ACM/SIGDA FPGA (2017).
[50] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., and
Dally, W. J. EIE: Efficient Inference Engine on Compressed Deep
Neural Network. In ACM/IEEE ISCA (2016).
[51] Han, S., Pool, J., Tran, J., and Dally, W. Learning Both Weights
and Connections for Efficient Neural Network. In NeurIPS (2015).
[52] Harper, R., MacQueen, D., and Milner, R. Standard ML. Depart-
ment of Computer Science, University of Edinburgh, 1986.
[53] Hearst, M. A., Dumais, S. T., Osuna, E., Platt, J., and Scholkopf,
B. Support Vector Machines (SVMs). IEEE Intelligent Systems and
their Applications 13, 4 (1998), 18–28.
[54] Hochreiter, S., and Schmidhuber, J. Long Short-Term Memory
(LSTM). Neural Computation 9, 8 (1997), 1735–1780.
[55] Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G. R.,
Gibbons, P. B., and Mutlu, O. Gaia: Geo-Distributed Machine
Learning Approaching LAN Speeds. In USENIX NSDI (2017).
[56] Ingre, B., and Yadav, A. Performance Analysis of NSL-KDD Dataset
Using ANN. In International Conference on Signal Processing and
Communication Engineering Systems (2015), IEEE, pp. 92–96.
[57] Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A.,
Adam, H., and Kalenichenko, D. Quantization and Training of
Neural Networks for Efficient Integer-Arithmetic-Only Inference. In
IEEE CVPR (2018).
[58] Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G.,
Bajwa, R., Bates, S., Bhatia, S., Boden, N., Borchers, A., et al. In-
Datacenter Performance Analysis of a Tensor Processing Unit. In
IEEE ISCA (2017).
[59] Jurkiewicz, P., Rzym, G., and Boryło, P. How Many Mice Make an
Elephant? Modelling Flow Length and Size Distribution of Internet
Traffic. arXiv:1809.03486 (2019).
[60] Katta, N., Hira, M., Kim, C., Sivaraman, A., and Rexford, J. HULA:
Scalable Load Balancing Using Programmable Data Planes. In ACM
SOSR (2016).
[61] Kim, C., Sivaraman, A., Katta, N., Bas, A., Dixit, A., and Wobker,
L. J. In-Band Network Telemetry via Programmable Dataplanes. In
ACM SIGCOMM (Demo) (2015).
[62] Koeplinger, D., Feldman, M., Prabhakar, R., Zhang, Y., Hadjis,
S., Fiszel, R., Zhao, T., Nardi, L., Pedram, A., Kozyrakis, C., and
Olukotun, K. Spatial: A Language and Compiler for Application
Accelerators. In ACM/SIGPLAN PLDI (2018).
[63] Kuźniar, M., Perešíni, P., and Kostić, D. What You Need to Know
About SDN Flow Tables. In PAM (2015), Springer.
[64] LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker,
J., Drucker, H., Guyon, I., Muller, U., Sackinger, E., Simard, P.,
and Vapnik, V. Comparison of Learning Algorithms for Handwrit-
ten Digit Recognition. In ICANN (1995).
[65] Li, Y., Miao, R., Liu, H. H., Zhuang, Y., Feng, F., Tang, L., Cao, Z.,
Zhang, M., Kelly, F., Alizadeh, M., and Yu, M. HPCC: High Preci-
sion Congestion Control. In ACM SIGCOMM (2019).
[66] Liu, H. H., Zhu, Y., Padhye, J., Cao, J., Tallapragada, S., Lopes,
N. P., Rybalchenko, A., Lu, G., and Yuan, L. CrystalNet: Faithfully
Emulating Large Production Networks. In ACM SOSP (2017).
[67] Liu, Y., Li, W., and Li, Y.-C. Network Traffic Classification Using
K-Means Clustering. In IEEE IMSCCS (2007).
[68] Loadbalancer.org. Hardware ADC: Enterprise Ultra.
https://www.loadbalancer.org/products/hardware/enterprise-ultra/. Accessed on
02/07/2020.
[69] Lockwood, J. W., McKeown, N., Watson, G., Gibb, G., Hartke, P.,
Naous, J., Raghuraman, R., and Luo, J. NetFPGA: An Open Plat-
form for Gigabit-Rate Network Switching and Routing. In IEEE MSE
(2007).
[70] Marshall, A., Stansfield, T., Kostarnov, I., Vuillemin, J., and
Hutchings, B. A Reconfigurable Arithmetic Array for Multimedia
Applications. In IEEE FPGA (1999).
[71] Mehmood, T., and Rais, H. B. M. SVM for Network Anomaly De-
tection using ACO Feature Subset. In IEEE iSMSC (2015).
[72] Mei, B., Vernalde, S., Verkest, D., De Man, H., and Lauwereins,
R. DRESC: A Retargetable Compiler for Coarse-Grained Reconfig-
urable Architectures. In IEEE FPT (2002).
[73] Mestres, A., Rodriguez-Natal, A., Carner, J., Barlet-Ros, P.,
Alarcón, E., Solé, M., Muntés-Mulero, V., Meyer, D., Barkai, S.,
Hibbett, M. J., et al. Knowledge-Defined Networking. ACM SIG-
COMM Computer Communication Review (CCR) 47, 3 (2017), 2–10.
[74] Minsky, Y., Madhavapeddy, A., and Hickey, J. Real World OCaml:
Functional Programming for the Masses. O’Reilly Media, Inc., 2013.
[75] Moore, A. W., and Zuev, D. Internet Traffic Classification using
Bayesian Analysis Techniques. In ACM SIGMETRICS (2005).
[76] Moshref, M., Yu, M., Govindan, R., and Vahdat, A. Trumpet:
Timely and Precise Triggers in Data Centers. In ACM SIGCOMM
(2016).
[77] Nair, V., and Hinton, G. E. Rectified Linear Units Improve Re-
stricted Boltzmann Machines. In ICML (2010).
[78] Niranjan Mysore, R., Pamboris, A., Farrington, N., Huang, N.,
Miri, P., Radhakrishnan, S., Subramanya, V., and Vahdat, A.
PortLand: A Scalable Fault-tolerant Layer 2 Data Center Network
Fabric. In ACM SIGCOMM (2009).
[79] NVIDIA. Tesla T4. https://www.nvidia.com/en-us/data-center/tesla-t4/.
Accessed on 02/07/2020.
[80] Odersky, M., Spoon, L., and Venners, B. Programming in Scala.
Artima Inc, 2008.
[81] Papalexakis, E. E., Beutel, A., and Steenkiste, P. Network Anom-
aly Detection using Co-Clustering. In IEEE/ACM ASONAM (2012).
[82] Park, J., Tyan, H.-R., and Kuo, C.-C. J. Internet Traffic Classification
for Scalable QoS Provision. In IEEE ICME (2006).
[83] Perdisci, R., Ariu, D., Fogla, P., Giacinto, G., and Lee, W. McPAD:
A Multiple Classifier System for Accurate Payload-Based Anomaly
Detection. Computer Networks 53, 6 (2009), 864–881.
[84] Pfaff, B., Pettit, J., Koponen, T., Jackson, E., Zhou, A., Raja-
halme, J., Gross, J., Wang, A., Stringer, J., Shelar, P., Amidon, K.,
and Casado, M. The Design and Implementation of Open vSwitch.
In USENIX NSDI (2015).
[85] Ports, D. R., and Nelson, J. When Should The Network Be The
Computer? In ACM HotOS (2019).
[86] Prabhakar, R., Zhang, Y., Koeplinger, D., Feldman, M., Zhao, T.,
Hadjis, S., Pedram, A., Kozyrakis, C., and Olukotun, K. Plasticine:
A Reconfigurable Architecture for Parallel Patterns. In ACM/IEEE
ISCA (2017).
[87] Rae, J. W., Bartunov, S., and Lillicrap, T. P. Meta-Learning Neural
Bloom Filters. arXiv:1906.04304 (2019).
[88] Roy, A., Zeng, H., Bagga, J., Porter, G., and Snoeren, A. C. In-
side the Social Network’s (Datacenter) Network. In ACM SIGCOMM
(2015).
[89] Rucker, A., Swamy, T., Shahbaz, M., and Olukotun, K. Elastic
RSS: Co-Scheduling Packets and Cores Using Programmable NICs.
In ACM APNet (2019).
[90] Sapio, A., Canini, M., Ho, C.-Y., Nelson, J., Kalnis, P., Kim, C.,
Krishnamurthy, A., Moshref, M., Ports, D. R., and Richtárik,
P. Scaling Distributed Machine Learning with In-Network Aggrega-
tion. arXiv:1903.06701 (2019).
[91] Sarkar, D. Continuous Numeric Data – Strategies for Working with
Continuous, Numerical Data.
https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b, 2018.
Accessed on 02/07/2020.
[92] Shahbaz, M., Choi, S., Pfaff, B., Kim, C., Feamster, N., McKeown,
N., and Rexford, J. Pisces: A Programmable, Protocol-Independent
Software Switch. In ACM SIGCOMM (2016).
[93] Shawahna, A., Sait, S. M., and El-Maleh, A. FPGA-Based Accelerators
of Deep Learning Networks for Learning and Classification: A
Review. IEEE Access 7 (2018), 7823–7859.
[94] Shelly, N., Jackson, E. J., Koponen, T., McKeown, N., and Raja-
halme, J. Flow Caching for High Entropy Packet Fields. In ACM
HotSDN (2014).
[95] Singh, A., Ong, J., Agarwal, A., Anderson, G., Armistead, A.,
Bannon, R., Boving, S., Desai, G., Felderman, B., Germano, P.,
Kanagala, A., Provost, J., Simmons, J., Tanda, E., Wanderer, J.,
Hölzle, U., Stuart, S., and Vahdat, A. Jupiter Rising: A Decade
of Clos Topologies and Centralized Control in Google’s Datacenter
Network. In ACM SIGCOMM (2015).
[96] Singh, H., Lee, M.-H., Lu, G., Kurdahi, F. J., Bagherzadeh, N., and
Chaves Filho, E. M. MorphoSys: An Integrated Reconfigurable Sys-
tem for Data-Parallel and Computation-Intensive Applications. IEEE
Transactions on Computers 49, 5 (2000), 465–481.
[97] Siracusano, G., and Bifulco, R. In-Network Neural Networks.
arXiv:1801.05731 (2018).
[98] Sivaraman, A., Subramanian, S., Alizadeh, M., Chole, S.,
Chuang, S.-T., Agrawal, A., Balakrishnan, H., Edsall, T., Katti,
S., and McKeown, N. Programmable Packet Scheduling at Line Rate.
In ACM SIGCOMM (2016).
[99] Srouji, M., Zhang, J., and Salakhutdinov, R. Structured Control
Nets for Deep Reinforcement Learning. arXiv:1802.08311 (2018).
[100] Sun, J., and Zukerman, M. An Adaptive Neuron AQM for a Sta-
ble Internet. In International Conference on Research in Networking
(2007), Springer.
[101] Sun, R., Yang, B., Peng, L., Chen, Z., Zhang, L., and Jing, S. Traffic
Classification Using Probabilistic Neural Networks. In IEEE ICNC
(2010).
[102] Tang, T. A., Mhamdi, L., McLernon, D., Zaidi, S. A. R., and
Ghogho, M. Deep Learning Approach for Network Intrusion De-
tection in Software Defined Networking. In IEEE WINCOM (2016).
[103] Tavallaee, M., Bagheri, E., Lu, W., and Ghorbani, A. A. A Detailed
Analysis of the KDD CUP 99 Data Set. In IEEE CISDA (2009).
[104] Taylor, M. B., Kim, J., Miller, J., Wentzlaff, D., Ghodrat, F.,
Greenwald, B., Hoffman, H., Johnson, P., Lee, J.-W., Lee, W., et al.
The Raw Microprocessor: A Computational Fabric for Software Cir-
cuits and General-Purpose Programs. IEEE Micro 22, 2 (2002), 25–35.
[105] Technologies, A. The Journal of Internet Test Methodologies:
Mixed Packet Size Throughput. https://s3.amazonaws.com/zanran_-
storage/www.ixiacom.com/ContentPages/109218067.pdf. Accessed
on 02/07/2020.
[106] Thompson, S. Haskell: The Craft of Functional Programming, vol. 2.
Addison-Wesley, 2011.
[107] Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A Physics Engine for
Model-Based Control. In IEEE IROS (2012).
[108] Van Der Maaten, L., Postma, E., and Van den Herik, J. Dimensionality
Reduction: A Comparative Review. Journal of Machine Learning
Research (JMLR) 10 (2009), 66–71.
[109] Wang, E., Davis, J. J., Zhao, R., Ng, H.-C., Niu, X., Luk, W., Cheung,
P. Y., and Constantinides, G. A. Deep Neural Network Approxima-
tion for Custom Hardware: Where We’ve Been, Where We’re Going.
ACM Computing Surveys (CSUR) 52, 2 (2019), 1–39.
[110] Wang, N., Choi, J., Brand, D., Chen, C.-Y., and Gopalakrishnan,
K. Training Deep Neural Networks with 8-bit Floating Point Num-
bers. In NeurIPS (2018).
[111] Wang, S., Li, Z., Ding, C., Yuan, B., Qiu, Q., Wang, Y., and Liang,
Y. C-LSTM: Enabling Efficient LSTM using Structured Compression
Techniques on FPGAs. In ACM/SIGDA FPGA (2018).
[112] Winstein, K., and Balakrishnan, H. TCP Ex Machina: Computer-
Generated Congestion Control. In ACM SIGCOMM (2013).
[113] Xiong, Z., and Zilberman, N. Do Switches Dream of Machine
Learning? Toward In-Network Classification. In ACM HotNets
(2019).
[114] Xu, Z., Tang, J., Meng, J., Zhang, W., Wang, Y., Liu, C. H., and
Yang, D. Experience-Driven Networking: A Deep Reinforcement
Learning Based Approach. In IEEE INFOCOM (2018).
[115] Yan, F. Y., Ma, J., Hill, G. D., Raghavan, D., Wahby, R. S., Levis,
P., and Winstein, K. Pantheon: The Training Ground for Internet
Congestion-Control Research. In USENIX ATC (2018).
[116] Zaki, Y., Pötsch, T., Chen, J., Subramanian, L., and Görg, C. Adap-
tive Congestion Control for Unpredictable Cellular Networks. In
ACM SIGCOMM (2015).
[117] Zander, S., Nguyen, T., and Armitage, G. Automated Traffic Classification
and Application Identification Using Machine Learning. In
IEEE LCN (2005).
[118] Zhang, J., Chen, C., Xiang, Y., Zhou, W., and Xiang, Y. Internet
Traffic Classification by Aggregating Correlated Naïve Bayes Predic-
tions. IEEE Transactions on Information Forensics and Security 8, 1
(2012), 5–15.
[119] Zhang, J., Chen, X., Xiang, Y., Zhou, W., and Wu, J. Robust Net-
work Traffic Classification. IEEE/ACM Transactions on Networking
23, 4 (2014), 1257–1270.
[120] Zhong, H., Fan, K., Mahlke, S., and Schlansker, M. A Distributed
Control Path Architecture for VLIW Processors. In IEEE PACT
(2005).
[121] Zhou, C., Di, D., Chen, Q., and Guo, J. An Adaptive AQM Algorithm
Based on Neuron Reinforcement Learning. In IEEE ICCA (2009).
