Abstract. Network processors are programmable devices that can process packets at high speed. A network processor is typified by multi-threading and heterogeneous multiprocessing, which usually requires programmers to manually create multiple tasks and map these tasks onto different processing elements. This paper addresses the problem of automating task creation and mapping of network applications onto the underlying hardware to maximize their throughput. We propose a throughput cost model to guide task creation and mapping, with the dual objective of minimizing the number of stages in the processing pipeline and maximizing the average throughput of the slowest task. The average throughput is modeled by taking communication cost, computation cost, memory access latency and synchronization cost into account. We envision that programmers write small functions for network applications, so we use grouping and duplication to construct tasks from these functions. Finding the optimal way to create tasks from m functions and map them to n processors is NP-hard. Therefore, we present a practical and efficient heuristic algorithm with O((n + m)m) complexity and show that the solutions it obtains perform well for typical network applications. The entire framework has been implemented in the Open Research Compiler (ORC), adapted to compile network applications written in a domain-specific dataflow language. Experimental results show that the code produced by our compiler can sustain 100% throughput at the OC-48 input line rate. OC-48 is a fiber-optic connection that carries 2.488 Gbps, the rate our target hardware was designed for. We also demonstrate the importance of good task creation and mapping choices in achieving high throughput. Furthermore, we show that reducing communication cost and managing resources efficiently are the most important factors for maximizing throughput on the Intel IXP network processors.
Introduction
As demand for high-throughput network applications grows, network processors, with their programmability and high processing rates, have emerged alongside ASICs (application-specific integrated circuits) as important devices for packet processing. Network processors typically incorporate multiple heterogeneous, multi-threaded cores, and programmers of network applications are often required to manually partition applications into tasks at design time and map the tasks onto different processing elements. However, most programmers find it challenging to produce efficient software for such a complex architecture. Because the type and number of processing elements that tasks are mapped to greatly influence overall performance, carefully creating and mapping tasks to achieve maximal throughput is an important but tedious effort. It is also rather difficult to port these applications from one generation of an architecture to the next while still achieving high performance.
This paper addresses the automatic task creation and mapping of packet processing applications on network processors, with the goal of maximizing the throughput of these applications. We envision programmers writing small functions for modularity in our programming model, and we focus on mapping coarse-grained task parallelism in this work. Hence, we apply grouping and duplication to construct tasks from functions, as opposed to splitting functions into smaller tasks. Note that the cost model itself is applicable both to different task granularities and to function splitting. Our approach has been implemented in the Open Research Compiler (ORC) [4] [6] and evaluated on Intel IXP2400 network processors. We also conjecture that our approach is applicable to other processor architectures that support multi-threading and chip multiprocessing (CMP).
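To make the effect of grouping and duplication concrete, the following is a minimal sketch in Python. It is not the throughput cost model of Section 4: it is a toy model that assumes each function has a fixed per-packet cycle cost, each stage pays a fixed hand-off cost, duplicates of a stage process different packets in parallel, and the slowest stage bounds the pipeline. The names Stage and pipeline_throughput and all cycle counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One pipeline stage: a group of functions, possibly duplicated."""
    func_costs: list   # assumed per-packet cycle cost of each grouped function
    copies: int = 1    # processing elements running a duplicate of this stage

    def throughput(self, comm_cost):
        # Packets per cycle; duplicates handle different packets in parallel.
        return self.copies / (sum(self.func_costs) + comm_cost)


def pipeline_throughput(stages, comm_cost=0):
    # The slowest stage bounds the throughput of the whole pipeline.
    return min(stage.throughput(comm_cost) for stage in stages)


# Three functions costing 100, 300 and 100 cycles; four processing elements;
# an assumed hand-off cost of 50 cycles per stage.

# Option A: one function per stage, with the heavy function duplicated.
split = [Stage([100]), Stage([300], copies=2), Stage([100])]

# Option B: all functions grouped into one task, duplicated four times.
grouped = [Stage([100, 300, 100], copies=4)]

split_rate = pipeline_throughput(split, comm_cost=50)
grouped_rate = pipeline_throughput(grouped, comm_cost=50)
print(f"split:   {split_rate:.5f} packets/cycle")
print(f"grouped: {grouped_rate:.5f} packets/cycle")
```

Under these assumed numbers the grouped configuration is faster, because it pays the hand-off cost once and keeps all four processing elements equally loaded. The toy model ignores memory access latency, synchronization cost and the limited control store, any of which could change the ranking on real hardware.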
The primary contributions of this paper are as follows. First, we develop a throughput cost model that captures the critical factors affecting the throughput of a CMP system, and we demonstrate that it is more accurate and effective than simpler models. Second, we develop a practical and efficient heuristic algorithm, guided by the throughput cost model, to partition and map applications onto network processors automatically. This algorithm also manages hardware resources (e.g., processors and threads) and handles special hardware constraints (e.g., the limited size of the control store, which restricts the number of instructions in each task). Third, we have implemented and evaluated our partitioning and mapping approach on a real network processor system.
The rest of this paper is organized as follows. Section 2 introduces background on the Intel IXP network processors and the features of our domain-specific programming language, Baker. Section 3 states the task creation and mapping problems. Section 4 presents our throughput cost model and describes a practical heuristic algorithm for task creation and mapping. Section 5 evaluates the performance of three network applications, comparing our proposed heuristic against alternative heuristics. Section 6 covers related work, and Section 7 concludes the paper.
