Year after year, network traffic keeps reaching new highs as unprecedented volumes of data flow between an ever-increasing number of sources and destinations. As part of the Analysis on the Wire project, we aim to develop a network-centric approach for streaming data processing to facilitate scientific data analysis and reduce the overhead in sending big data to a data center. This work discusses our design for programmable network devices to augment network capabilities for streaming data processing, including efforts in progress at Brookhaven National Laboratory and the challenges faced to date.
INTRODUCTION
Fast processing of big data has become a fundamental requirement for advancing academic and industrial pursuits in areas such as scientific data analysis, machine learning, and artificial intelligence. The volume of data collected or generated and then transported over the network to data centers for processing grows rapidly each year, driven by the ever-increasing number of data sources and the demand for sophisticated offline machine learning algorithms, e.g., neural networks and deep learning. This trend strains the capacity and energy efficiency of networks and data centers and adversely affects user waiting times. Today, more data are in transit in the network than ever before, and Cisco estimates that global Internet traffic will reach 4.8 zettabytes by 2022 [12]. Thus, adding processing capabilities to analyze and manipulate data in transit is an approach worth exploring, one with the potential to provide early insight into data and reduce network traffic and data center loads.
This work aims to achieve in-network, large-bandwidth data processing through a lightweight augmentation of network capabilities. It has two main aspects: 1) performing general-purpose, lightweight computations on intermediate and/or edge network devices before the data reach the data center and 2) developing streaming algorithms suitable for in-network processing. The work will entail upgrading a network by designing a data forwarding plane in hardware to accelerate large-scale data processing in transit. The goal is to reduce the decision-making time at the data destination by performing data processing, such as encryption/decryption, compression/decompression, sampling, lightweight transformations, and supervised/unsupervised learning, in the transmission phase.
NETWORK UPGRADE FOR STREAMING DATA PROCESSING
Herein, we primarily consider two points when upgrading a network. First, we must efficiently control the network's heterogeneous structure. Typically, the general-purpose computing capabilities of current network devices (e.g., routers and switches) are limited or nonexistent. Although more sophisticated devices are slowly entering the market, such as those using Mellanox's SHArP technology [2], it is unrealistic to expect that they will immediately replace older generations. Therefore, a generic infrastructure design is needed that allows old and new devices to coexist while also considering optimal device deployment, cooperation, security, resource sharing and allocation, and overall network performance. Second, a flexible, scalable design is needed to accommodate the growth of future data applications, e.g., increasing demands for large-scale scientific data computations. Thus, the design should be flexible enough to evolve with changes to the control strategy and/or device functionality.
2.1 Designing a smart hybrid software-defined network
2.1.1 Network architecture. We adopt the concept of software-defined networking (SDN) [5] to manage heterogeneous devices and upgrade control strategies. SDN is a network architecture designed to flexibly manage heterogeneous devices by separating the control and data planes while using centralized, software-based control logic. In our design, the control logic is organized as a tree-based hierarchy. Because the network is hybrid, cooperation among switches must be implemented through the high-level controllers. We therefore will adopt distributed network control mechanisms that adapt to network dynamics. The general idea is that switches automatically collaborate and exchange cooperation and network status information with neighbors one or several hops away in a distributed way. The exchanged information includes topology information, flow statistics, TCP or UDP latency, and/or neighbors' actions. To reduce overhead and flooding, switches exchange only the differences from their previously reported status. After several exchange steps, as in a gossip protocol, local network information propagates through network links to the network edges. As a result, an optimal or sub-optimal global consensus is reached.
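The difference-only neighbor exchange described above can be sketched as follows. This is a minimal illustration, not the project's protocol: the `Switch` class, the per-key version counters, and the "highest version wins" merge rule are all assumptions introduced here, and each status key is assumed to have a single writer (hence the switch-name prefix in the keys).

```python
class Switch:
    """Hypothetical switch holding versioned status entries (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []
        self.state = {}  # key -> (version, value), e.g. "s1/latency_ms" -> (3, 0.8)
        self.sent = {}   # neighbor name -> {key: newest version that neighbor knows}

    def update_local(self, key, value):
        """Record a new local measurement under a bumped version number."""
        version = self.state.get(key, (0, None))[0] + 1
        self.state[key] = (version, value)

    def delta_for(self, neighbor):
        """Return only the entries this neighbor has not seen yet."""
        seen = self.sent.setdefault(neighbor.name, {})
        delta = {k: v for k, v in self.state.items() if seen.get(k, 0) < v[0]}
        for key, (version, _) in delta.items():
            seen[key] = version
        return delta

    def merge(self, delta, sender):
        """Adopt newer entries and remember that the sender already has them."""
        seen = self.sent.setdefault(sender.name, {})
        for key, (version, value) in delta.items():
            seen[key] = max(seen.get(key, 0), version)  # never echo it back
            if self.state.get(key, (0, None))[0] < version:
                self.state[key] = (version, value)


def gossip_round(switches):
    """One round: every switch pushes its delta to each neighbor.

    Returns the number of non-empty messages sent in the round.
    """
    messages = 0
    for switch in switches:
        for neighbor in switch.neighbors:
            delta = switch.delta_for(neighbor)
            if delta:
                neighbor.merge(delta, switch)
                messages += 1
    return messages
```

With three switches in a line (s1, s2, s3), a measurement recorded on s1 reaches s3 by way of s2, and a following round with no new measurements sends no messages, which is the overhead reduction the difference-only exchange is meant to provide.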
Work in progress.
We have implemented simple control strategies on the NetFPGA platform [14] , such as real-time populating of routing tables, tunneling support based on labels, and simple linear packet stream processing at the L3 level. The current implementation is distributed without collaboration and is responsible for local traffic forwarding, screening, and monitoring.
We design our data plane to provide both forwarding and data processing functions. For simple data processing, such as sampling and labeling, we can avoid sending big data streams to a data center. An intermediate SDN switch/router/node can process buffered data before forwarding them. This processing can be performed on a switch or on a computational accelerator attached to one. Moreover, we can quickly deploy a load-balancing scheduler in hardware to split big data, which achieves spatial parallelism and reduces the computational burden for each intermediate device/node. Figure 2 shows the data flowchart for a single switch. Generally, the data plane has a pipeline structure, which matches most hardware design architectures. For implementation, we use the NetFPGA-SUME board [14], based on the Xilinx Inc. Virtex 7 FPGA, and the SimpleSumeSwitch model developed by Stanford University and the University of Cambridge. We also use the hardware-independent P4 language. Thus, by engaging the appropriate compiler, we can move the developed package to different hardware.
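One way to picture how a load-balancing scheduler might split a large stream for spatial parallelism is to hash each flow to a processing unit, so that all packets of a flow land on the same unit. The text does not specify the splitting rule; hashing the flow 5-tuple with CRC32 is an assumption made for this sketch, and `pick_unit` and `split_stream` are hypothetical names. It is written in Python for readability rather than in P4.

```python
import zlib


def pick_unit(flow, num_units):
    """Map a flow 5-tuple to a processing-unit index with a stable hash."""
    key = "|".join(str(part) for part in flow).encode()
    return zlib.crc32(key) % num_units


def split_stream(packets, num_units):
    """Partition (flow, payload) pairs into one list per processing unit."""
    lanes = [[] for _ in range(num_units)]
    for flow, payload in packets:
        lanes[pick_unit(flow, num_units)].append((flow, payload))
    return lanes


if __name__ == "__main__":
    packets = [
        (("10.0.0.1", "10.0.0.9", 5000, 80, "tcp"), b"a"),
        (("10.0.0.2", "10.0.0.9", 5001, 80, "tcp"), b"b"),
        (("10.0.0.1", "10.0.0.9", 5000, 80, "tcp"), b"c"),
    ]
    for index, lane in enumerate(split_stream(packets, 2)):
        print(index, [payload for _, payload in lane])
```

Keeping a flow on one unit preserves per-flow packet order, at the cost of uneven load when a few flows dominate the traffic.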
In Figure 2, large-bandwidth data flow into a switch from the left. The switch first parses packet headers and then examines them to decide how to handle each packet, for example, whether to perform deep payload inspection, label packets for security purposes, or process data when the packet headers contain a mark indicating a certain processing type. If the switch has a corresponding processing unit, it either executes the computation on board or pushes the packet through the Peripheral Component Interconnect Express (PCIe) bus to devices or servers attached to it. Depending on the computation type, data reassembly may be needed. For fast processing, the NetFPGA board may perform computations at the packet level without reassembly. After processing packets, the computation unit sends them to the scheduler. The scheduler decides whether to drop any packets or forward them to the next node. The next node is the packets' destination if no additional calculations are needed, or the next processing unit otherwise.
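The single-switch flow just described (parse, decide, process on board or on an attached device, then schedule) can be sketched in software as below. Everything concrete here is an assumption for illustration: the one-byte processing mark, the mark values, the rule that unknown marks go to the attached device, and the rule that the scheduler drops empty results. The actual data plane is implemented on NetFPGA hardware, not in Python.

```python
# Hypothetical wire format for this sketch only: byte 0 is a processing mark,
# and the remaining bytes are the payload.
MARK_NONE = 0     # forward untouched
MARK_SAMPLE = 1   # on-board unit: keep every 4th payload byte
MARK_OFFLOAD = 2  # no on-board unit: hand to an attached device


def parse(packet):
    """Split a packet into (mark, payload)."""
    if not packet:
        raise ValueError("empty packet")
    return packet[0], packet[1:]


def sample_every_4th(payload):
    """Toy on-board processing unit: simple systematic sampling."""
    return payload[::4]


# Processing units are looked up by mark, so new units can be plugged in.
ON_BOARD_UNITS = {MARK_SAMPLE: sample_every_4th}


def switch_pipeline(packet, offload):
    """Run one packet through parse -> decide -> process -> schedule.

    `offload(mark, payload)` stands in for a device reached over PCIe.
    Returns (action, outgoing_packet); outgoing_packet is None on drop.
    """
    mark, payload = parse(packet)
    if mark == MARK_NONE:
        action = "forward"
    elif mark in ON_BOARD_UNITS:
        payload = ON_BOARD_UNITS[mark](payload)
        action = "processed"
    else:
        payload = offload(mark, payload)
        action = "offloaded"
    # Scheduler stage: drop packets with nothing left to send; otherwise
    # clear the mark so downstream nodes do not process the data again.
    if not payload:
        return "drop", None
    return action, bytes([MARK_NONE]) + payload
```

A new processing type can be added by registering another function in `ON_BOARD_UNITS`, which mirrors the idea of plugging algorithms in after header parsing.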
The scheduler can re-encapsulate packets to steer them onto different paths for computation. It also may reorder packet sending times based on congestion, quality of service (QoS) or security priority, link status, total throughput, and end-to-end latency. The scheduler updates these policies automatically or, when needed, follows commands from the control plane. The control plane can dispatch policies in real time by updating the contents of the board's registers. Thus, the control and data planes exchange information iteratively to guarantee network performance.
This work also considers lightweight scheduling algorithms for improved fast processing, scalability, and cooperation. The control and management plane computes the best strategy based on local data sent from neighbors to achieve global objectives, such as best-effort QoS, latency, reliable network connectivity, fault tolerance, and fairness. These control strategies include access control applied in the processing decision component and scheduler; resource allocation, such as channel sharing in the scheduler; routing information in the routing table; and overall system upgrade information. The control plane also can adopt smart control strategies from artificial intelligence to dynamically update policy based on its own and the neighbors' data.
Along with traditional forwarding, the proposed SDN switch also will implement two basic functions to enable in-network computation:
• Encapsulation and decapsulation for tunneling packets to computational units
• Lightweight and fast preprocessing of streaming data.
Our switch includes a Network Service Header (NSH) for in-network computation based on RFC 8300 [9]. NSHs are used to realize encapsulation and decapsulation, split data flows, and reroute packets through network computational units. Based on the NSH, switches route packets through an ordered sequence of network service functions (NSFs) that are distributed locally to achieve sophisticated data processing in transit. The NSF service path is known as a service function chain (SFC).
Figure 3 shows our testbed, which includes two switches, a conventional switch (PICA8 hardware) and a NetFPGA-based one, along with four servers, aow1 through aow4. The testbed can be configured to run several different experiments and simulations. In a conventional scientific data computation example, raw data on a storage node need to be processed and then sent to another node for further processing, analysis, and decision making. This is depicted with aow4 first transmitting packets to a remote data center, which, after processing, forwards the results to aow3 (blue line). In our proposed approach, aow4 can first mark data packets that need computation and send them through the NetFPGA switch to their destination, aow3. The NetFPGA switch recognizes the mark and either processes the data packets locally or redirects them using NSHs to NSFs implemented in nearby aow1, which is equipped with GPUs and/or FPGAs for accelerated computation. The NSF-hosting device, typically a powerful but lightweight server (here, aow1), can be easily deployed/redeployed and upgraded. Compared to applications that traditionally run in data centers, our framework is particularly suited to time-sensitive and/or high-security data applications. It can be used to provide early insight into data for various scientific or cybersecurity purposes.
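To make the NSH-based steering more concrete, the sketch below packs and parses the two fixed NSH words (base header and service path header) as we read the layout in RFC 8300, using MD Type 2 with no metadata. It is an illustration only, not the switch's hardware implementation; the function names are hypothetical, and the field positions and constant values should be checked against the RFC before any real use.

```python
import struct

# Values as we read them in RFC 8300; verify against the RFC before relying on them.
NSH_MD_TYPE_2 = 0x2              # variable-length metadata (none used here)
NSH_NEXT_PROTO_ETHERNET = 0x3
NSH_DEFAULT_TTL = 63


def build_nsh(spi, si, next_proto=NSH_NEXT_PROTO_ETHERNET, ttl=NSH_DEFAULT_TTL):
    """Build an 8-byte NSH: base header plus service path header."""
    if not 0 <= spi < (1 << 24):
        raise ValueError("service path identifier must fit in 24 bits")
    if not 0 <= si <= 255:
        raise ValueError("service index must fit in 8 bits")
    if not 0 <= ttl <= 63:
        raise ValueError("TTL must fit in 6 bits")
    if not 0 <= next_proto <= 255:
        raise ValueError("next protocol must fit in 8 bits")
    length_words = 2  # total header length in 4-byte words
    # Version (0) and the O bit are left at zero in this sketch.
    base = (ttl << 22) | (length_words << 16) | (NSH_MD_TYPE_2 << 8) | next_proto
    path = (spi << 8) | si
    return struct.pack("!II", base, path)


def parse_nsh(header):
    """Decode the fields written by build_nsh."""
    base, path = struct.unpack("!II", header[:8])
    return {
        "ttl": (base >> 22) & 0x3F,
        "length": (base >> 16) & 0x3F,
        "md_type": (base >> 8) & 0xF,
        "next_proto": base & 0xFF,
        "spi": path >> 8,
        "si": path & 0xFF,
    }


def after_service_function(header):
    """Decrement the service index once a service function has been applied."""
    fields = parse_nsh(header)
    if fields["si"] == 0:
        raise ValueError("service index exhausted")
    return build_nsh(fields["spi"], fields["si"] - 1,
                     fields["next_proto"], fields["ttl"])
```

In this picture, the service path identifier selects the chain of NSFs (for example, a chain that passes through aow1), and the service index tracks how far along the chain a packet has progressed.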
Choosing a streaming data processing algorithm
Streaming algorithms must be able to process big data flows, i.e., flows with high volume, variety, and velocity. Stonebraker et al. [10] proposed eight requirements for real-time stream processing, listed in no specific order of importance. Of those eight, the most important requirements for designing algorithms on a switch are, in our opinion, the following:
• Keep data moving
• Assemble packets to handle truncated or out-of-order data
• Handle stream imperfections, primarily missing and inconsistent data.
In some cases, to keep data moving at high speed, not all bytes from a packet's payload can be selected and accessed for processing. Thus, streaming algorithms must be robust and capable of producing coherent results even with partial input data. We seek to design a generalized architecture for scalable stream processing that adapts to dynamically changing traffic patterns and does not discriminate among use cases. Thus, the optimal solution is to design the switch as a plug-in architecture. Streaming algorithms would be plugged in after packet headers are parsed and would process payloads at the per-packet level or after reassembly. Of note, several basic streaming algorithms can be adapted for specific applications:
• Dimensionality reduction, including sampling (random, Monte Carlo, priority, weighted, and Fourier sampling), sketching, data compression/decompression, and data cleaning (noise elimination).
• Fast time-series processing, such as streaming data encryption and decryption, linear regression, time-series prediction, and linear numerical analysis.
• Flow monitoring; screening; and high security alerts, such as measuring packet inter-arrival times and time between peaks and detecting big jumps and falls among data values, which indicate unusual and emergent events with high priority.
• Big data clustering, splitting, aggregation (including counting, e.g., high-frequency statistics; empirical statistics evaluation), feature selection, flow division based on labels for supervised learning, flow labeling, and interpolation based on neighborhood data. After clustering, flows can be reordered or redirected to an NSF for further processing. MapReduce is a prominent example: it groups data by mapping and counting and then reduces the results to support a decision.
• Adaptive decision-making algorithms, such as reinforcement learning, to make dynamic decisions based on rewards from timeline data. 
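Two of the algorithm families listed above can be sketched in a few lines each: uniform random sampling of a stream of unknown length (reservoir sampling, Algorithm R) and detection of big jumps among data values (a running mean and variance kept with Welford's update). These are standard textbook techniques chosen here for illustration; the source does not say which specific algorithms the project uses, and the threshold and warm-up defaults are arbitrary assumptions.

```python
import math
import random


def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from a stream (Algorithm R)."""
    if rng is None:
        rng = random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


class JumpDetector:
    """Flags values far from the running mean, in one pass and O(1) memory."""

    def __init__(self, threshold=4.0, warmup=10):
        self.threshold = threshold  # allowed distance in standard deviations
        self.warmup = warmup        # observations needed before flagging
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def observe(self, x):
        """Return True if x is a big jump relative to the values seen so far."""
        is_jump = False
        if self.n >= max(self.warmup, 2):
            std = math.sqrt(self.m2 / (self.n - 1))
            is_jump = std > 0 and abs(x - self.mean) > self.threshold * std
        # Welford's online update of the mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_jump
```

Both routines touch each value once and keep a fixed amount of state, which is the property that makes this kind of algorithm a candidate for per-packet processing. Note that the detector also folds flagged values into its statistics; excluding them would be an equally reasonable choice.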
RELATED WORK
Amazon has developed Kinesis [11] for streaming data processing in its Amazon Web Services (AWS) data center. Microsoft Catapult [8] deploys FPGAs (no forwarding) between top-of-rack (ToR) switches and servers in the Azure data center to accelerate streaming data processing. Because they use FPGAs, the processing function is reprogrammable. Microsoft AccelNet [1] also uses FPGAs as smart NICs to offload processing tasks from the network stack. As mentioned earlier, Mellanox has developed a scalable hierarchical aggregation and reduction protocol (SHArP) [2] to aggregate data flows in ToR switches, known as Switch-IB 2. Switch-IB 2 is not reprogrammable and is deployed in a data center. However, stream processing in data centers remains less prevalent than offline learning algorithms. In-network computation has the potential to reduce network traffic and data center computation load by shifting some of the processing that normally would take place in a data center to upstream network devices. An SDN architecture can be used to achieve in-network computation through an SFC without involving a data center. An SFC can realize complicated computations and achieve computational parallelism among application flows. The cooperation among these flows can be made intelligent through a control plane with a global view. Some research efforts have examined the general architecture of SDN and its design [4], [5]. RFC 7665 [3] describes the SFC architecture, including how network protocols, devices, and network functions are organized. OpenFlow [6] is a widely supported protocol for open-source SDN platforms and is featured in many projects, including Big Switch, Switch Light, PICA8, Stratum Project, OpenDaylight, and Microsoft SONiC. There also has been work investigating possibilities to combine OpenFlow with data center techniques [13].
Another approach to achieving in-network computation involves processing data within a network device. Barefoot Networks has developed Tofino and Deep Insight [7] for data plane transmission. Besides the ordinary function of forwarding packets, Deep Insight aims to monitor network flows for packet inspection and guarantee their security. These products indicate that implementing packet processing in a programmable switch is achievable. Therefore, we can reference their methods of collecting flow statistics and use them in our effort to develop a switch that can preprocess packets for large-scale scientific data computation.
CONCLUSION
As part of this Analysis on the Wire project, we have discussed network upgrades for large-scale, high-speed data processing in terms of system architecture, especially for a data plane design suitable for processing high-speed data streams. We also have been studying the problems encountered in developing stream-processing algorithms and their deployment on an in-network processing system. We are focused on developing an infrastructure for applying high-bandwidth streaming algorithms to scientific data processing and are examining algorithm robustness when processing partial data sets (with missing data). We also are considering topics such as smart switch cooperative network control; network resource sharing (with parity); high-bandwidth data security; and, in general, the scalability of our stream processing network design.
