
    Application-specific asynchronous microengines for efficient high-level control

    Technical report: Despite the growing interest in asynchronous circuits, programmable asynchronous controllers based on the idea of microprogramming have not been actively pursued. Since programmable control is widely used in many commercial ASICs to allow late correction of design errors, to easily upgrade product families, to meet time-to-market demands, and even to effect run-time modifications to control in adaptive systems, we consider it crucial that self-timed techniques support efficient programmable control. This is especially true given that asynchronous (self-timed) circuits are well suited for realizing reactive and control-intensive designs. We offer a practical solution to programmable asynchronous control in the form of application-specific microprogrammed asynchronous controllers (or microengines). The features of our solution include a modular and easily extensible datapath structure, support for the two main styles of handshaking (namely two-phase and four-phase), and many efficiency measures based on exploiting concurrency between operations and employing efficient circuit structures. Our results demonstrate that the proposed microengine can yield high performance; in fact, performance close to that offered by automated high-level synthesis tools targeting custom hard-wired burst-mode machines.
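
    To make the two handshaking styles mentioned in the abstract concrete, the sketch below models a four-phase (return-to-zero) handshake in software, with Go channels standing in for the req and ack wires. It illustrates only the protocol itself; it is not code from the paper's microengine.

```go
// Minimal sketch of a four-phase (return-to-zero) handshake, with Go channels
// standing in for the req/ack wires of a self-timed interface. Illustrative
// only; not taken from the paper's microengine design.
package main

import "fmt"

func sender(req chan<- bool, ack <-chan bool, data chan<- int, v int) {
	data <- v    // drive the data wires (bundled-data assumption)
	req <- true  // phase 1: raise request
	<-ack        // phase 2: wait for acknowledge to rise
	req <- false // phase 3: lower request
	<-ack        // phase 4: wait for acknowledge to fall (back to idle)
}

func receiver(req <-chan bool, ack chan<- bool, data <-chan int) {
	v := <-data
	<-req // request rises
	fmt.Println("received", v)
	ack <- true  // acknowledge rises
	<-req        // request falls
	ack <- false // acknowledge falls: the link is idle again
}

func main() {
	req, ack, data := make(chan bool), make(chan bool), make(chan int)
	go receiver(req, ack, data)
	sender(req, ack, data, 42)
}
```

    A two-phase interface would instead signal with a single transition on req and a single transition on ack per transfer, halving the signalling activity at the cost of logic that responds to transitions rather than levels.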

    High-level asynchronous system design using the ACK framework

    Journal article: Designing asynchronous circuits is becoming easier as a number of design styles are making the transition from research projects to real, usable tools. However, designing asynchronous "systems" is still a difficult problem. We define asynchronous systems to be medium to large digital systems whose descriptions include both datapath and control, that may involve non-trivial interface requirements, and whose control is too large to be synthesized in one large controller. ACK is a framework for designing high-performance asynchronous systems of this type. In ACK we advocate an approach that begins with procedural-level descriptions of control and datapath and results in a hybrid system that mixes a variety of hardware implementation styles, including burst-mode AFSMs, macromodule circuits, and programmable control. We present our views on what makes asynchronous high-level system design different from lower-level circuit design, motivate our ACK approach, and demonstrate the approach using an example system design.

    A Massively Parallel MIMD Implemented by SIMD Hardware?

    Both conventional wisdom and engineering practice hold that a massively parallel MIMD machine should be constructed using a large number of independent processors and an asynchronous interconnection network. In this paper, we suggest that it may be beneficial to implement a massively parallel MIMD using microcode on a massively parallel SIMD microengine; the synchronous nature of the system allows much higher performance to be obtained with simpler hardware. The primary disadvantage is simply that the SIMD microengine must serialize execution of different types of instructions - but again the static nature of the machine allows various optimizations that can minimize this detrimental effect. In addition to presenting the theory behind construction of efficient MIMD machines using SIMD microengines, this paper discusses how the techniques were applied to create a 16,384-processor shared memory barrier MIMD using a SIMD MasPar MP-1. Both the MIMD structure and benchmark results are presented. Even though the MasPar hardware is not ideal for implementing a MIMD and our microinterpreter was written in a high-level language (MPL), peak MIMD performance was 280 MFLOPS as compared to 1.2 GFLOPS for the native SIMD instruction set. Of course, comparing peak speeds is of dubious value; hence, we have also included a number of more realistic benchmark results.
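
    The core mechanism described above can be sketched in a few lines: every virtual MIMD processor keeps its own program counter, and the SIMD controller serializes over instruction types, enabling only the processing elements whose next instruction matches the opcode being broadcast. The toy interpreter below is in that spirit; the instruction set and structure are invented for illustration and are not the MasPar microcode.

```go
// Toy illustration of running MIMD programs on a SIMD microengine: each
// processing element (PE) keeps its own program counter, and the controller
// serializes over instruction types, enabling only the PEs whose current
// instruction matches the opcode being broadcast. Instruction set and
// structure are invented; this is not the MasPar microcode.
package main

import "fmt"

type Op int

const (
	ADD Op = iota
	MUL
	HALT
)

type Instr struct {
	Op  Op
	Arg int
}

type PE struct {
	pc   int
	acc  int
	prog []Instr
}

func main() {
	pes := []*PE{
		{prog: []Instr{{ADD, 1}, {MUL, 3}, {HALT, 0}}},
		{prog: []Instr{{MUL, 2}, {ADD, 5}, {HALT, 0}}},
	}
	for running := true; running; {
		running = false
		stepped := make([]bool, len(pes))
		// One macro-cycle: broadcast each opcode in turn; a PE is masked off
		// unless its next instruction uses that opcode, and it advances at
		// most one instruction per macro-cycle.
		for _, op := range []Op{ADD, MUL} {
			for i, pe := range pes {
				if stepped[i] || pe.prog[pe.pc].Op != op {
					continue
				}
				switch op {
				case ADD:
					pe.acc += pe.prog[pe.pc].Arg
				case MUL:
					pe.acc *= pe.prog[pe.pc].Arg
				}
				pe.pc++
				stepped[i] = true
			}
		}
		for _, pe := range pes {
			if pe.prog[pe.pc].Op != HALT {
				running = true
			}
		}
	}
	for i, pe := range pes {
		fmt.Printf("PE %d acc = %d\n", i, pe.acc)
	}
}
```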

    Fast Packet Processing on High Performance Architectures

    The rapid growth of the Internet and the fast emergence of new network applications have brought great challenges and complex issues in deploying high-speed and QoS-guaranteed IP networks. For this reason, packet classification and network intrusion detection have assumed a key role in modern communication networks in order to provide QoS and security. In this thesis we describe a number of the most advanced solutions to these tasks. We introduce NetFPGA and Network Processors as reference platforms both for the design and the implementation of the solutions and algorithms described in this thesis. The rise in link capacity reduces the time available to network devices for packet processing. For this reason, we show different solutions which, either by heuristics and randomization or by smart construction of state machines, allow IP lookup, packet classification and deep packet inspection to be fast in real devices based on high-speed platforms such as NetFPGA or Network Processors.
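
    As one concrete example of the lookup problems mentioned above, longest-prefix matching for IP lookup is classically built on trie structures. The Go sketch below shows the plain binary-trie idea; hardware and network-processor implementations use compressed or multi-bit tries and are only loosely related to this toy, whose prefixes and names are invented.

```go
// Sketch of IP longest-prefix matching with a binary trie, one of the classic
// data structures behind fast IP lookup. Illustrative only.
package main

import "fmt"

type node struct {
	child   [2]*node
	nextHop string // empty means no prefix ends at this node
}

// insert adds a prefix of the given length (in bits) with its next hop.
func (n *node) insert(addr uint32, length int, hop string) {
	for i := 0; i < length; i++ {
		bit := (addr >> (31 - i)) & 1
		if n.child[bit] == nil {
			n.child[bit] = &node{}
		}
		n = n.child[bit]
	}
	n.nextHop = hop
}

// lookup walks the trie and remembers the last (longest) matching prefix.
func (n *node) lookup(addr uint32) string {
	best := ""
	for i := 0; i < 32 && n != nil; i++ {
		if n.nextHop != "" {
			best = n.nextHop
		}
		n = n.child[(addr>>(31-i))&1]
	}
	if n != nil && n.nextHop != "" {
		best = n.nextHop
	}
	return best
}

func main() {
	root := &node{}
	root.insert(0x0A000000, 8, "if0")    // 10.0.0.0/8
	root.insert(0x0A010000, 16, "if1")   // 10.1.0.0/16
	fmt.Println(root.lookup(0x0A010203)) // 10.1.2.3   -> if1
	fmt.Println(root.lookup(0x0A7F0001)) // 10.127.0.1 -> if0
}
```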

    NFComms: A synchronous communication framework for the CPU-NFP heterogeneous system

    This work explores the viability of using a Network Flow Processor (NFP), developed by Netronome, as a coprocessor for the construction of a CPU-NFP heterogeneous platform in the domain of general processing. When considering heterogeneous platforms involving architectures like the NFP, the communication framework provided is typically represented as virtual network interfaces and is thus not suitable for generic communication. To enable a CPU-NFP heterogeneous platform for use in the domain of general computing, a suitable generic communication framework is required. A feasibility study for a suitable communication medium between the two candidate architectures showed that a generic framework that conforms to the mechanisms dictated by Communicating Sequential Processes is achievable. The resulting NFComms framework, which facilitates inter- and intra-architecture communication through the use of synchronous message passing, supports up to 16 unidirectional channels and includes queuing mechanisms for transparently supporting concurrent streams exceeding the channel count. The framework has a minimum latency of between 15.5 μs and 18 μs per synchronous transaction and can sustain a peak throughput of up to 30 Gbit/s. The framework also supports a runtime for interacting with the Go programming language, allowing user-space processes to subscribe channels to the framework for interacting with processes executing on the NFP. The viability of utilising a heterogeneous CPU-NFP system for use in the domain of general and network computing was explored by introducing a set of problems or applications spanning general computing and network processing. These were implemented on the heterogeneous architecture and benchmarked against equivalent CPU-only and CPU/GPU solutions. The results recorded were used to form an opinion on the viability of using an NFP for general processing. It is the author’s opinion that, beyond very specific use cases, the NFP-400 is not currently a viable solution as a coprocessor in the field of general computing. This does not mean that the proposed framework or the concept of a heterogeneous CPU-NFP system should be discarded, as such a system does have acceptable use in the fields of network and stream processing. Additionally, when comparing the recorded limitations to those seen during the early stages of general purpose GPU development, it is clear that general processing on the NFP is currently in a similar state.
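
    The CSP-style synchronous message passing the abstract refers to maps naturally onto unbuffered Go channels, where a send and a receive form a rendezvous. The sketch below shows only that general semantics; it does not use or model the NFComms API itself, and the "device" side is just a goroutine.

```go
// General CSP rendezvous semantics with an ordinary unbuffered Go channel.
// Illustrative only; this is not the NFComms API.
package main

import "fmt"

func main() {
	ch := make(chan []byte) // unbuffered: a send only completes when a receiver is ready
	done := make(chan struct{})

	// Stand-in for a process running on the offload device.
	go func() {
		msg := <-ch // rendezvous point: sender and receiver proceed together
		fmt.Printf("device side received %d bytes\n", len(msg))
		close(done)
	}()

	ch <- []byte("packet payload") // blocks until the device side takes the message
	<-done
	fmt.Println("host side knows the transfer completed")
}
```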

    Revisiting the Classics: Online RL in the Programmable Dataplane

    Data-driven networking is becoming more capable and widely researched, partly driven by the efficacy of Deep Reinforcement Learning (DRL) algorithms. Yet the complexity of both DRL inference and learning forces these tasks to be pushed away from the dataplane to hosts, harming latency-sensitive applications. Online learning of such policies cannot occur in the dataplane, despite being a useful technique when problems evolve or are hard to model. We present OPaL (On Path Learning), the first work to bring online reinforcement learning to the dataplane. OPaL makes online learning possible in constrained SmartNIC hardware by returning to classical RL techniques, avoiding neural networks. Our design allows weak yet highly parallel SmartNIC NPUs to be competitive against commodity x86 hosts, despite having fewer features and slower cores. Compared to hosts, we achieve a 21× reduction in 99.99th-percentile tail inference times to 34 µs, and a 9.9× improvement in online throughput for real-world policy designs. In-NIC execution eliminates PCIe transfers, and our asynchronous compute model ensures minimal impact on traffic carried by a co-hosted P4 dataplane. OPaL’s design scales with additional resources at compile time to improve upon both decision latency and throughput, and is quickly reconfigurable at runtime compared to reinstalling device firmware.
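
    "Classical RL techniques" here means table-based methods whose per-decision cost is a handful of reads and compares rather than a neural-network forward pass. As a reminder of how small such a step is, the sketch below shows a minimal tabular Q-learning update in Go; it is illustrative only and is not OPaL's actual algorithm, state encoding, or arithmetic.

```go
// Minimal tabular Q-learning: one greedy action choice and one online update.
// Sizes, rates, and the example experience are invented for illustration.
package main

import "fmt"

const (
	nStates  = 4
	nActions = 2
	alpha    = 0.1 // learning rate
	gamma    = 0.9 // discount factor
)

var q [nStates][nActions]float64

// greedy returns the best-known action for a state: a few table reads and
// compares, which is what keeps per-decision cost low.
func greedy(s int) int {
	best := 0
	for a := 1; a < nActions; a++ {
		if q[s][a] > q[s][best] {
			best = a
		}
	}
	return best
}

// update applies one online Q-learning step after observing (s, a, r, s').
func update(s, a int, r float64, sNext int) {
	target := r + gamma*q[sNext][greedy(sNext)]
	q[s][a] += alpha * (target - q[s][a])
}

func main() {
	// One hypothetical experience: in state 1, action 0 earned reward 1.0
	// and led to state 2.
	update(1, 0, 1.0, 2)
	fmt.Println(q[1][0]) // 0.1 after a single update from an all-zero table
}
```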

    Partial Program Admission by Path Enumeration

    Real-time systems on non-preemptive platforms require a means of bounding the execution time of programs for admission purposes. Worst-Case Execution Time (WCET) is most commonly used to bound program execution time. While bounding a program's WCET statically is possible, computing its true WCET is difficult without significant semantic knowledge. We present an algorithm for partial program admission, suited for non-preemptive platforms, using dynamic programming to perform explicit enumeration of program paths. Paths - possible or not - are bounded by the available execution time and admitted on a path-by-path basis without requiring semantic knowledge of the program beyond its Control Flow Graph (CFG).
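
    The admission idea can be made concrete with a toy version: walk every entry-to-exit path of an acyclic control flow graph, charging each basic block's worst-case cost against the available budget, and reject the program if any path overruns. The paper's algorithm uses dynamic programming over the CFG rather than this naive recursion, and the graph and costs below are invented.

```go
// Toy path-by-path admission over an acyclic CFG: a program is admissible if
// no entry-to-exit path exceeds the time budget. Illustrative only.
package main

import "fmt"

type cfg struct {
	cost  map[string]int      // worst-case cost of each basic block
	succs map[string][]string // CFG edges
}

// admissible reports whether every path from block b to "exit" stays within
// the remaining budget. Memoizing the worst remaining cost per block would
// turn this naive enumeration into a dynamic-programming version.
func (g cfg) admissible(b string, budget int) bool {
	budget -= g.cost[b]
	if budget < 0 {
		return false // this path alone exceeds the available execution time
	}
	if b == "exit" {
		return true
	}
	for _, next := range g.succs[b] {
		if !g.admissible(next, budget) {
			return false
		}
	}
	return true
}

func main() {
	g := cfg{
		cost:  map[string]int{"entry": 2, "then": 7, "else": 3, "exit": 1},
		succs: map[string][]string{"entry": {"then", "else"}, "then": {"exit"}, "else": {"exit"}},
	}
	fmt.Println(g.admissible("entry", 12)) // true: the worst path costs 2+7+1 = 10
	fmt.Println(g.admissible("entry", 8))  // false: the "then" path costs 10
}
```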

    HAIL: An Algorithm for the Hardware Accelerated Identification of Languages, Master's Thesis, May 2006

    This thesis examines in detail the Hardware-Accelerated Identification of Languages (HAIL) project. The goal of HAIL is to provide an accurate means to identify the language and encoding used in streaming content, such as documents passed over a high-speed network. HAIL has been implemented on the Field-programmable Port eXtender (FPX), an open hardware platform developed at Washington University in St. Louis. HAIL can accurately identify the primary languages and encodings used in text at rates much higher than what can be achieved by software algorithms running on microprocessors.
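
    The abstract does not describe HAIL's classification method, so the sketch below shows only a common software baseline for language identification: scoring a text against per-language n-gram frequency profiles and picking the best match. It is a generic illustration with tiny, invented profiles, not the FPX hardware algorithm.

```go
// Generic n-gram (trigram) profile scoring for language identification.
// Profiles and weights are invented; this is not HAIL's method.
package main

import "fmt"

// profiles maps a language label to a few characteristic trigrams and weights.
var profiles = map[string]map[string]float64{
	"english": {"the": 3.0, "ing": 2.0, "and": 2.0},
	"german":  {"der": 3.0, "sch": 2.0, "ein": 2.0},
}

func identify(text string) string {
	bestLang, bestScore := "unknown", 0.0
	for lang, grams := range profiles {
		score := 0.0
		for i := 0; i+3 <= len(text); i++ {
			score += grams[text[i:i+3]] // missing trigrams contribute zero
		}
		if score > bestScore {
			bestLang, bestScore = lang, score
		}
	}
	return bestLang
}

func main() {
	fmt.Println(identify("the quick brown fox is jumping over the lazy dog"))
	fmt.Println(identify("der schnelle braune fuchs springt"))
}
```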

    Performance, scalability, and flexibility in the RAW network router

    Thesis (M.Eng.) by Anthony M. DeGangi, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004. Includes bibliographical references (p. 46). Conventional high-speed Internet routers are built using custom-designed microprocessors, dubbed network processors, to efficiently handle the task of packet routing. While capable of meeting the performance demanded of them, these custom network processors generally lack the flexibility to incorporate new features and do not scale well beyond that for which they were designed. Furthermore, they tend to suffer from long and costly development cycles, since each new generation must be redesigned to support new features and fabricated anew in hardware. This thesis presents a new design for a network processor, one implemented entirely in software on a tiled, general-purpose microprocessor. The network processor is implemented on the Raw microprocessor, a general-purpose microchip developed by the Computer Architecture Group at MIT. The Raw chip consists of sixteen identical processing tiles arranged in a four-by-four matrix and connected by four inter-tile communication networks; the Raw chip is designed to scale up merely by adding more tiles to the matrix. By taking advantage of the parallelism inherent in the task of packet forwarding on this inherently parallel microprocessor, the Raw network processor is able to achieve performance that matches or exceeds that of commercially available custom-designed network processors. At the same time, it maintains the flexibility to incorporate new features, since it is implemented entirely in software, as well as the scalability to handle more ports by simply adding more tiles to the microprocessor.
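
    The thesis's argument rests on packet forwarding being naturally parallel across tiles. The Go sketch below mirrors that structure in software by fanning packets out to a fixed pool of workers standing in for tiles; it is a conceptual analogy only, since the real design maps work onto Raw's on-chip networks rather than threads, and the packet format is invented.

```go
// Conceptual analogy to per-tile packet forwarding: packets fan out to a
// fixed pool of workers, each making an independent forwarding decision.
package main

import (
	"fmt"
	"sync"
)

type packet struct{ id, dstPort int }

func forward(tile int, in <-chan packet, wg *sync.WaitGroup) {
	defer wg.Done()
	for p := range in {
		// A real tile would perform a route lookup here; we just report.
		fmt.Printf("tile %d forwarded packet %d to port %d\n", tile, p.id, p.dstPort)
	}
}

func main() {
	const tiles = 4
	in := make(chan packet)
	var wg sync.WaitGroup
	for t := 0; t < tiles; t++ {
		wg.Add(1)
		go forward(t, in, &wg)
	}
	for i := 0; i < 8; i++ {
		in <- packet{id: i, dstPort: i % 3}
	}
	close(in)
	wg.Wait() // adding more worker "tiles" scales throughput, mirroring the thesis's argument
}
```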