40 research outputs found

    Optimizing energy-efficiency for multi-core packet processing systems in a compiler framework

    Get PDF
    Network applications are becoming increasingly computation-intensive, and traffic volumes are growing at an unprecedented rate. Multi-core and multi-threaded techniques are therefore widely employed in packet processing systems to meet these changing requirements. However, the processing power cannot be fully utilized without a suitable programming environment. The compilation procedure is decisive for the quality of the code and can largely determine overall system performance in terms of packet throughput, individual packet latency, core utilization and energy efficiency. This thesis first investigates compilation issues in the networking domain, with a particular focus on energy consumption. As a cornerstone for any compiler optimization, a code analysis module for collecting program dependency information is presented and incorporated into a compiler framework. Using that dependency information, a strategy based on graph bi-partitioning and mapping is proposed to search for an optimal parallel-pipeline configuration. The energy-aware extension is particularly effective in enhancing the energy efficiency of the whole system. Finally, a generic evaluation framework for simulating the performance and energy consumption of a packet processing system is presented. It accepts flexible architectural configurations, is capable of performing arbitrary code mappings, and its simulation time is extremely short compared to full-fledged simulators. A set of our optimization results is gathered using this framework.
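
    The abstract does not detail the bi-partitioning algorithm, so the Go sketch below only illustrates the general idea: tasks of a packet processing program, annotated with profiled costs, are split into two roughly balanced halves, each of which becomes one stage of a parallel pipeline mapped to its own set of cores. The task names, cost values, and greedy balancing rule are illustrative assumptions (dependency edges are ignored for brevity), not the strategy proposed in the thesis.

        // Illustrative sketch only: greedy bi-partitioning of profiled tasks
        // into two pipeline stages balanced by cost. The cost model and the
        // task names are hypothetical, not taken from the thesis.
        package main

        import (
            "fmt"
            "sort"
        )

        type Task struct {
            Name string
            Cost int // profiled cycles or energy estimate (assumed)
        }

        // bipartition greedily splits tasks into two halves with roughly
        // equal total cost; each half becomes one pipeline stage.
        func bipartition(tasks []Task) (a, b []Task) {
            sorted := append([]Task(nil), tasks...)
            sort.Slice(sorted, func(i, j int) bool { return sorted[i].Cost > sorted[j].Cost })
            sumA, sumB := 0, 0
            for _, t := range sorted {
                if sumA <= sumB {
                    a, sumA = append(a, t), sumA+t.Cost
                } else {
                    b, sumB = append(b, t), sumB+t.Cost
                }
            }
            return a, b
        }

        func main() {
            tasks := []Task{{"parse", 40}, {"classify", 70}, {"lookup", 90}, {"modify", 30}, {"queue", 20}}
            stage1, stage2 := bipartition(tasks)
            fmt.Println("stage 1 (first half of the cores): ", stage1)
            fmt.Println("stage 2 (second half of the cores):", stage2)
        }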

    Transformations for polyhedral process networks

    Get PDF
    We use the polyhedral process network (PPN) model of computation to program and map streaming applications onto embedded Multi-Processor System-on-Chip (MPSoC) platforms. PPNs, which can be automatically derived from sequential program applications, do not necessarily meet the performance and resource constraints. A designer can therefore apply process splitting transformations to increase program performance, and the process merging transformation to reduce the number of processes in a PPN. Although these transformations have been defined, a designer has many options for applying a particular transformation, and the transformations can also be ordered in many different ways. In this dissertation, we define compile-time solution approaches that assist the designer in evaluating and applying process splitting and merging transformations in the most effective way.
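
    As a rough illustration of what a process splitting transformation does, the Go sketch below replaces a single producer process with k copies that partition its iteration domain round-robin, while the consumer is left unchanged. The split factor, the round-robin policy, and the function being computed are assumptions made for the example; the dissertation's transformations operate on polyhedral process networks rather than on goroutines.

        // Minimal sketch of process splitting: one producer process is
        // replaced by k copies that partition the iteration domain
        // (round-robin by index here). Names and policy are assumptions.
        package main

        import (
            "fmt"
            "sync"
        )

        // produce is the original process body: compute f(i) for iteration i.
        func produce(i int) int { return i * i }

        func main() {
            const iterations = 12
            const k = 3 // split factor chosen by the designer

            out := make(chan [2]int, iterations)
            var wg sync.WaitGroup

            // k split copies, each covering iterations i with i % k == id.
            for id := 0; id < k; id++ {
                wg.Add(1)
                go func(id int) {
                    defer wg.Done()
                    for i := id; i < iterations; i += k {
                        out <- [2]int{i, produce(i)}
                    }
                }(id)
            }
            wg.Wait()
            close(out)

            // Consumer process: unchanged by the transformation.
            for v := range out {
                fmt.Printf("iteration %d -> %d\n", v[0], v[1])
            }
        }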

    High Speed Networking In The Multi-Core Era

    Get PDF
    High speed networking is a demanding task that has traditionally been performed in dedicated, purpose-built hardware or specialized network processors. These platforms sacrifice flexibility or programmability in favor of performance. Recently, there has been much interest in using multi-core general purpose processors for this task, which have the advantage of being easily programmable and upgradeable. The best way to exploit these new architectures for networking is an open question that has been the subject of much recent research. In this dissertation, I explore the best way to exploit multi-core general purpose processors for packet processing applications. This includes both new architectural organizations for the processors as well as changes to the systems software. I intend to demonstrate the efficacy of these techniques by using them to build an open and extensible network security and monitoring platform that can outperform existing solutions.
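
    As a generic illustration of the multi-core approach, not of this dissertation's specific architecture or systems-software changes, the Go sketch below fans incoming packets out to one worker per available core, each running the same stand-in monitoring function. The packet type, the inspect function, and the channel size are assumptions made for the example.

        // Generic sketch (not the dissertation's design): packets are fanned
        // out to one worker goroutine per core, each running a stand-in
        // monitoring function, to exploit a multi-core general purpose CPU.
        package main

        import (
            "fmt"
            "runtime"
            "sync"
            "sync/atomic"
        )

        type Packet struct{ Payload []byte }

        // inspect stands in for an arbitrary security/monitoring function.
        func inspect(p Packet) bool { return len(p.Payload) > 0 }

        func main() {
            workers := runtime.NumCPU() // one worker per available core
            in := make(chan Packet, 1024)
            var accepted int64
            var wg sync.WaitGroup

            for w := 0; w < workers; w++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    for p := range in {
                        if inspect(p) {
                            atomic.AddInt64(&accepted, 1)
                        }
                    }
                }()
            }

            for i := 0; i < 10000; i++ {
                in <- Packet{Payload: []byte{byte(i)}}
            }
            close(in)
            wg.Wait()
            fmt.Println("packets accepted:", atomic.LoadInt64(&accepted))
        }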

    Branch Prediction For Network Processors

    Get PDF
    Originally designed to favour flexibility over packet processing performance, the programmable network processor now faces the challenge of both meeting increasing line rates and providing additional processing capabilities. To meet these requirements, trends within networking research have tended to focus on techniques such as offloading computation-intensive tasks to dedicated hardware logic or increasing parallelism. While parallelism retains flexibility, challenges such as load balancing limit its scope. Hardware offloading, on the other hand, allows complex algorithms to be implemented at high speed but sacrifices flexibility. To this end, the work in this thesis is focused on a more fundamental aspect of a network processor: the data-plane processing engine. Through both system modelling and analysis of packet processing functions, this thesis aims to identify and extract salient information regarding the performance of multi-processor workloads. Following on from a traditional software-based analysis of programme workloads, we develop a method of modelling and analysing hardware accelerators when applied to network processors. Using this quantitative information, this thesis proposes an architecture which allows deeply pipelined micro-architectures to be implemented on the data-plane while reducing the branch penalty associated with these architectures.
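
    The thesis's particular prediction mechanism is not described in the abstract; the Go sketch below shows the textbook two-bit saturating-counter branch predictor purely to illustrate how prediction reduces the branch penalty of a deep pipeline. The table size and the branch trace are arbitrary example values.

        // Textbook two-bit saturating-counter branch predictor, shown only
        // to illustrate the kind of mechanism that reduces branch penalties
        // in a deeply pipelined engine; not the architecture proposed here.
        package main

        import "fmt"

        type predictor struct {
            table []uint8 // 2-bit counters: 0,1 = not taken; 2,3 = taken
            mask  uint32
        }

        func newPredictor(entries int) *predictor {
            // entries must be a power of two so the mask works as an index.
            return &predictor{table: make([]uint8, entries), mask: uint32(entries - 1)}
        }

        func (p *predictor) predict(pc uint32) bool { return p.table[pc&p.mask] >= 2 }

        func (p *predictor) update(pc uint32, taken bool) {
            i := pc & p.mask
            if taken && p.table[i] < 3 {
                p.table[i]++
            } else if !taken && p.table[i] > 0 {
                p.table[i]--
            }
        }

        func main() {
            bp := newPredictor(1024) // table size is an arbitrary choice
            hits, trace := 0, []bool{true, true, false, true, true, true, false, true}
            for _, taken := range trace {
                if bp.predict(0x40) == taken {
                    hits++
                }
                bp.update(0x40, taken)
            }
            fmt.Printf("correct predictions: %d/%d\n", hits, len(trace))
        }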

    Parallel and distributed processing in high speed traffic monitoring

    Get PDF
    This thesis presents a parallel and distributed approach for processing network traffic at high speeds. The proposed architecture provides the processing power required to run one or more traffic processing applications at line rate by processing full packets at multi-gigabit speeds in a parallel and distributed processing environment. Moreover, the architecture is flexible and scalable to future needs by supporting heterogeneous processing nodes, such as different hardware architectures or different generations of the same hardware architecture. In addition to its processing, flexibility, and scalability features, our architecture provides an easy-to-use environment with the help of a new programming language, called FPL, for traffic processing in a distributed environment. The language and its compiler hide architecture-specific programming details when using heterogeneous systems in a distributed environment.
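
    FPL itself is not shown in the abstract, so the Go sketch below only illustrates the distribution step such an architecture needs: packets are hashed by flow and dispatched to heterogeneous processing nodes, with a per-node weight standing in for differing hardware capacity. The node names, weights, and hash policy are assumptions made for the example, not part of FPL or its compiler.

        // Illustrative sketch of flow distribution across heterogeneous
        // nodes. Node names, weights, and the hash policy are assumptions.
        package main

        import (
            "fmt"
            "hash/fnv"
        )

        type Node struct {
            Name   string
            Weight int // relative capacity of this hardware generation
        }

        // pick maps a flow key to a node index, proportionally to weight,
        // so packets of the same flow always reach the same node.
        func pick(nodes []Node, flowKey string) int {
            total := 0
            for _, n := range nodes {
                total += n.Weight
            }
            h := fnv.New32a()
            h.Write([]byte(flowKey))
            slot := int(h.Sum32()) % total
            if slot < 0 {
                slot += total
            }
            for i, n := range nodes {
                if slot < n.Weight {
                    return i
                }
                slot -= n.Weight
            }
            return 0
        }

        func main() {
            nodes := []Node{{"x86-node", 4}, {"fpga-node", 2}, {"older-x86", 1}}
            for _, flow := range []string{"10.0.0.1:80", "10.0.0.2:443", "10.0.0.3:53"} {
                fmt.Printf("flow %s -> %s\n", flow, nodes[pick(nodes, flow)].Name)
            }
        }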

    Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies

    Get PDF
    Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol, which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP, and performance and power consumption advantages approaching those of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA – targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options for configuring the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile, a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile, a design space exploration methodology is developed to search the design space of multi-core systems for suitable solutions under given system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system, Micro C/OS-II, to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with operating system support.
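
    The mapping methodology itself is not spelled out in the abstract; as a rough illustration of the profiling-driven idea, the Go sketch below assigns profiled tasks to slave cores with a longest-processing-time-first heuristic so the most loaded core stays as light as possible. The task names, cycle counts, and heuristic are assumptions for the example, not the methodology developed in the thesis.

        // Sketch of one common profiling-driven heuristic (longest
        // processing time first) for mapping profiled tasks onto cores;
        // not necessarily the thesis's methodology.
        package main

        import (
            "fmt"
            "sort"
        )

        type Task struct {
            Name   string
            Cycles int // execution time from profiling (assumed numbers)
        }

        func mapTasks(tasks []Task, cores int) ([][]Task, []int) {
            sort.Slice(tasks, func(i, j int) bool { return tasks[i].Cycles > tasks[j].Cycles })
            assign := make([][]Task, cores)
            load := make([]int, cores)
            for _, t := range tasks {
                best := 0 // pick the currently least loaded core
                for c := 1; c < cores; c++ {
                    if load[c] < load[best] {
                        best = c
                    }
                }
                assign[best] = append(assign[best], t)
                load[best] += t.Cycles
            }
            return assign, load
        }

        func main() {
            tasks := []Task{{"fft", 900}, {"interleave", 200}, {"fec", 700}, {"mapmod", 400}, {"crc", 100}}
            assign, load := mapTasks(tasks, 3)
            for c := range assign {
                fmt.Printf("core %d (load %d): %v\n", c, load[c], assign[c])
            }
        }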

    Abstractions and Algorithms for Control of Extensible and Heterogeneous Virtualized Network Infrastructures

    Get PDF
    Virtualized network infrastructures are currently deployed in both research and commercial contexts. The complexity of the virtualization layer varies greatly in different deployments, ranging from cloud computing environments, to carrier Ethernet applications using stacked VLANs, to networking testbeds. In all of these cases, many users share the resources of one provider, and each user expects their resources to be isolated from all other users. There are many challenges associated with the control and management of these systems, including resource allocation and sharing, resource isolation, system security, and usability. Among the different types of virtualized infrastructures, network testbeds are of particular interest due to their widespread use in education and in the networking research community. Networking researchers rely extensively on testbeds when evaluating new protocols and ideas. Indeed, a substantial percentage of top research papers include results gathered from testbeds. Network emulation testbeds in particular are often used to conduct innovative research because they allow users to emulate diverse network topologies in a controlled environment. That is, researchers run experiments with a collection of resources that can be reconfigured to represent many different network scenarios. The user typically has control over most of the resources in their experiment, which results in a high level of reproducibility. As such, these types of testbeds provide an excellent bridge between simulation and deployment of new ideas. Unfortunately, most testbeds suffer from a general lack of resource extensibility and diversity. This dissertation extends the current state of the art by designing a new, more general testbed infrastructure that expands and enhances the capabilities of modern testbeds. This includes pertinent abstractions, software design, and related algorithms. The design has also been prototyped in the form of the Open Network Laboratory network testbed, which has been successfully used in educational and research pursuits. While the focus is on network testbeds, the results of this research will also be applicable to the broader class of virtualized system infrastructures.

    Streamroller: A Unified Compilation and Synthesis System for Streaming Applications

    Full text link
    The growing complexity of applications has increased the need for higher processing power. In the embedded domain, the convergence of audio, video, and networking on a handheld device has prompted the need for low cost, low power, and high performance implementations of these applications in the form of custom hardware. In a more mainstream domain like gaming consoles, the move towards more realism in physics simulations and graphics has forced the industry towards multicore systems. Many of the applications in these domains are streaming in nature. The key challenge is to get efficient implementations of custom hardware from these applications and to map these applications efficiently onto multicore architectures. This dissertation presents a unified methodology, referred to as Streamroller, that can be applied to the problem of scheduling stream programs onto multicore architectures and to the problem of automatic synthesis of custom hardware for stream applications. Firstly, a method called stream-graph modulo scheduling is presented, which maps stream programs effectively onto a multicore architecture. Many aspects of a real system, like limited memory and explicit DMAs, are modeled in the scheduler. The scheduler is evaluated for a set of stream programs on IBM's Cell processor. Secondly, an automated high-level synthesis system for creating custom hardware for stream applications is presented. The template for the custom hardware is a pipeline of accelerators. The synthesis involves designing loop accelerators for individual kernels, instantiating buffers to store data passed between kernels, and linking these building blocks to form a pipeline. A unique aspect of this system is the use of multifunction accelerators, which reduce cost by efficiently sharing hardware between multiple kernels. Finally, a method is presented that improves the integer linear program formulations used in the schedulers by exploiting symmetry in the solution space. Symmetry-breaking constraints are added to the formulation, and the performance of the solver is evaluated.
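
    To give a feel for why symmetry-breaking constraints help, the Go sketch below brute-force enumerates assignments of identical kernels to identical cores, with and without a canonical-ordering rule (a kernel may open at most one previously unused core). The counts show how many permuted-but-equivalent assignments the rule prunes; this is a toy search under assumed problem sizes, not the integer linear program formulation used in Streamroller.

        // Toy illustration of symmetry breaking for identical cores: the
        // canonical-ordering rule prunes assignments that only differ by a
        // permutation of core labels. Not the thesis's ILP formulation.
        package main

        import "fmt"

        func count(kernels, cores int, breakSym bool) int {
            var rec func(i, maxUsed int) int
            rec = func(i, maxUsed int) int {
                if i == kernels {
                    return 1
                }
                total := 0
                limit := cores
                if breakSym && maxUsed+1 < cores {
                    limit = maxUsed + 2 // cores 0..maxUsed, plus one fresh core
                }
                for c := 0; c < limit; c++ {
                    next := maxUsed
                    if c > next {
                        next = c
                    }
                    total += rec(i+1, next)
                }
                return total
            }
            return rec(0, -1)
        }

        func main() {
            k, m := 6, 4 // example sizes: 6 kernels, 4 identical cores
            fmt.Println("assignments without symmetry breaking:", count(k, m, false))
            fmt.Println("assignments with symmetry breaking:   ", count(k, m, true))
        }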

    NFComms: A synchronous communication framework for the CPU-NFP heterogeneous system

    Get PDF
    This work explores the viability of using a Network Flow Processor (NFP), developed by Netronome, as a coprocessor for the construction of a CPU-NFP heterogeneous platform in the domain of general processing. When considering heterogeneous platforms involving architectures like the NFP, the communication framework provided is typically exposed as virtual network interfaces and is thus not suitable for generic communication. To enable a CPU-NFP heterogeneous platform for use in the domain of general computing, a suitable generic communication framework is required. A feasibility study of suitable communication media between the two candidate architectures showed that a generic framework conforming to the mechanisms dictated by Communicating Sequential Processes is achievable. The resulting NFComms framework, which facilitates inter- and intra-architecture communication through synchronous message passing, supports up to 16 unidirectional channels and includes queuing mechanisms for transparently supporting concurrent streams exceeding the channel count. The framework has a minimum latency of between 15.5 μs and 18 μs per synchronous transaction and can sustain a peak throughput of up to 30 Gbit/s. The framework also supports a runtime for interacting with the Go programming language, allowing user-space processes to subscribe channels to the framework for interacting with processes executing on the NFP. The viability of utilising a heterogeneous CPU-NFP system in the domains of general and network computing was explored by introducing a set of problems and applications spanning general computing and network processing. These were implemented on the heterogeneous architecture and benchmarked against equivalent CPU-only and CPU/GPU solutions. The results recorded were used to form an opinion on the viability of using an NFP for general processing. It is the author's opinion that, beyond very specific use cases, the NFP-400 is not currently a viable solution as a coprocessor in the field of general computing. This does not mean that the proposed framework or the concept of a heterogeneous CPU-NFP system should be discarded, as such a system does have acceptable uses in the fields of network and stream processing. Additionally, when comparing the recorded limitations to those seen during the early stages of general purpose GPU development, it is clear that general processing on the NFP is currently in a similar state.
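
    The abstract does not give the runtime's actual API, so the Go sketch below mocks a hypothetical Subscribe call to show the shape of the interaction: a user-space process binds an unbuffered Go channel to one of the framework's unidirectional channels, and each send then blocks until the (here simulated) NFP side receives it, mirroring the CSP-style synchronous message passing described above. The Subscribe function, the message type, and the consumer goroutine are assumptions, not the NFComms API.

        // Hypothetical sketch of subscribing a Go channel to an
        // NFComms-like runtime. Only the unbuffered-channel rendezvous
        // mirrors the synchronous message passing described above.
        package main

        import "fmt"

        const maxChannels = 16 // unidirectional channels supported by the framework

        type message []byte

        // Subscribe mocks a runtime call: it returns an unbuffered channel
        // bound to one framework channel ID, so every send blocks until the
        // (simulated) NFP side receives it -- a synchronous transaction.
        func Subscribe(id int) (chan<- message, <-chan struct{}, error) {
            if id < 0 || id >= maxChannels {
                return nil, nil, fmt.Errorf("channel %d out of range", id)
            }
            ch := make(chan message) // unbuffered: sender and receiver rendezvous
            done := make(chan struct{})
            go func() { // stand-in for the NFP-side consumer
                defer close(done)
                for m := range ch {
                    fmt.Printf("channel %d received %d bytes\n", id, len(m))
                }
            }()
            return ch, done, nil
        }

        func main() {
            ch, done, err := Subscribe(3)
            if err != nil {
                panic(err)
            }
            ch <- message("hello nfp") // blocks until the consumer takes it
            close(ch)
            <-done
        }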