41 research outputs found

    FPGA-accelerated Heterogeneous Hyperscale Server Architecture for Next-Generation Compute Clusters

    Get PDF
    Griessl R, Peykanu M, Hagemeyer J, et al. FPGA-accelerated Heterogeneous Hyperscale Server Architecture for Next-Generation Compute Clusters. Presented at the First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC‘15), held in conjunction with Supercomputing 2015, Austin Texas, USA

    Improving the Scalability of High Performance Computer Systems

    Full text link
    Improving the performance of future computing systems will be based upon the ability of increasing the scalability of current technology. New paths need to be explored, as operating principles that were applied up to now are becoming irrelevant for upcoming computer architectures. It appears that scaling the number of cores, processors and nodes within an system represents the only feasible alternative to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems. The Tightly Coupled Cluster technique significantly improves the communication for inter node communication within compute clusters. By improving the latency by an order of magnitude over existing solutions the cost of communication is considerably reduced. This enables to exploit fine grain parallelism within applications, thereby, extending the scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel layer software utilizing off-the-shelf AMD processors. We present a proof-of-concept implementation and real world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes. The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed, that accelerates the design and implementation substantially. Due to its flexibility and modularity a large application space is covered ranging from Systems-on-chip, to high performance many-core processors. The Network-on-Chip compiler enables to generate complex networks in the form of synthesizable register transfer level code from an abstract design description. Our engine supports different target technologies including Field Programmable Gate Arrays and Application Specific Integrated Circuits. The framework enables to build large designs while minimizing development and verification efforts. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by the introduction of a protocol agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability. The third part of the dissertation addresses the Processor-Memory Interface within computer architectures. The increasing compute power of many-core processors, leads to an equally growing demand for more memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue we propose a memory extension technique that attaches large amounts of DRAM memory to the processor via a low pin count interface using high speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency. Therefore, applications can utilize the memory extension by applying regular shared memory programming techniques. By supporting daisy chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux based system. We show microbenchmarks that prove the feasibility of our design

    Real Time Control on Firewire

    Get PDF
    The goal of this project is to get insight into the use of Firewire as a field bus for real-time control. A characterization of Firewire's asynchronous transmission has been made by testing the point-to-point roundtrip in a 3-node Firewire network.\ud The results show Firewire's asynchronous transmission between 2 PC/104 stacks, using FCP (Functional Control Protocol) as the way of implementing, can transfer data in an average latency between 100¿s and 140μs, depending on the data payload. The maximum variation in this latency is 20μs.\ud During this project, it is also found that while the payload is increased, the roundtrip time does not show a significant increase: it rises only around 40μs from 1byte payload to 449 bytes payload. So the bus is used most efficiently when the packet is fully loaded. To get a complete insight of Firewire, it is recommended to implement Firewire¿s isochronous transmission and examine its applicability for real-time control also. And to use Firewire as a field bus in a real distributed control system with a plant is also advised to get a better understanding of Firewire

    Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002

    Get PDF
    The flexibility to adapt to new services and protocols without changes in the underlying hardware is and will increasingly be a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve to level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design

    A scalable packetised radio astronomy imager

    Get PDF
    Includes bibliographical referencesModern radio astronomy telescopes the world over require digital back-ends. The complexity of these systems depends on many site-specific factors, including the number of antennas, beams and frequency channels and the bandwidth to be processed. With the increasing popularity for ever larger interferometric arrays, the processing requirements for these back-ends have increased significantly. While the techniques for building these back-ends are well understood, every installation typically still takes many years to develop as the instruments use highly specialised, custom hardware in order to cope with the demanding engineering requirements. Modern technology has enabled reprogrammable FPGA-based processing boards, together with packet-based switching techniques, to perform all the digital signal processing requirements of a modern radio telescope array. The various instruments used by radio telescopes are functionally very different, but the component operations remain remarkably similar and many share core functionalities. Generic processing platforms are thus able to share signal processing libraries and can acquire different personalities to perform different functions simply by reprogramming them and rerouting the data appropriately. Furthermore, Ethernet-based packet-switched networks are highly flexible and scalable, enabling the same instrument design to be scaled to larger installations simply by adding additional processing nodes and larger network switches. The ability of a packetised network to transfer data to arbitrary processing nodes, along with these nodes' reconfigurability, allows for unrestrained partitioning of designs and resource allocation. This thesis describes the design and construction of the first working radio astronomy imaging instrument hosted on Ethernet-interconnected re- programmable FPGA hardware. I attempt to establish an optimal packetised architecture for the most popular instruments with particular attention to the core array functions of correlation and beamforming. Emphasis is placed on requirements for South Africa's MeerKAT array. A demonstration system is constructed and deployed on the KAT-7 array, MeerKAT's prototype. This research promises reduced instrument development time, lower costs, improved reliability and closer collaboration between telescope design teams

    Overlapping of Communication and Computation and Early Binding: Fundamental Mechanisms for Improving Parallel Performance on Clusters of Workstations

    Get PDF
    This study considers software techniques for improving performance on clusters of workstations and approaches for designing message-passing middleware that facilitate scalable, parallel processing. Early binding and overlapping of communication and computation are identified as fundamental approaches for improving parallel performance and scalability on clusters. Currently, cluster computers using the Message-Passing Interface for interprocess communication are the predominant choice for building high-performance computing facilities, which makes the findings of this work relevant to a wide audience from the areas of high-performance computing and parallel processing. The performance-enhancing techniques studied in this work are presently underutilized in practice because of the lack of adequate support by existing message-passing libraries and are also rarely considered by parallel algorithm designers. Furthermore, commonly accepted methods for performance analysis and evaluation of parallel systems omit these techniques and focus primarily on more obvious communication characteristics such as latency and bandwidth. This study provides a theoretical framework for describing early binding and overlapping of communication and computation in models for parallel programming. This framework defines four new performance metrics that facilitate new approaches for performance analysis of parallel systems and algorithms. This dissertation provides experimental data that validate the correctness and accuracy of the performance analysis based on the new framework. The theoretical results of this performance analysis can be used by designers of parallel system and application software for assessing the quality of their implementations and for predicting the effective performance benefits of early binding and overlapping. This work presents MPI/Pro, a new MPI implementation that is specifically optimized for clusters of workstations interconnected with high-speed networks. This MPI implementation emphasizes features such as persistent communication, asynchronous processing, low processor overhead, and independent message progress. These features are identified as critical for delivering maximum performance to applications. The experimental section of this dissertation demonstrates the capability of MPI/Pro to facilitate software techniques that result in significant application performance improvements. Specific demonstrations with Virtual Interface Architecture and TCP/IP over Ethernet are offered

    Control Software for Reconfigurable Coprocessors

    Get PDF
    On-line data processing at the ATLAS general purpose particle detector, which is currently under construction at Geneva, generates demands on computing power that are difficult to satisfy with commodity CPU-based computers. One of the most demanding applications is the recognition of particle tracks that originate from B-quark decays. However, this and many others applications can benefit from parallel execution on field programmable gate arrays (FPGA). After the demonstration of accelerated track recognition with big FPGA-based custom computers, the development of FPGA based coprocessors started in the late 1990's. Applications of FPGA coprocessors are usually partitioned between the host and the tightly coupled coprocessor. The objective of the research that I present in this thesis was the development of software that mediates to applications the access to FPGA coprocessors. I used a software process based on iterative prototyping to cope with the expected changing requirements. Also, I used a strict bottom-up design to create classes that model devices on the coprocessors. Using these low-level classes, I developed tools which were used for bootstrapping, debugging, and firmware update of the coprocessors during their development and maintenance. Measurements show that the software overhead introduced by object-oriented programming and software layering is small. The software-support for six different coprocessors was partitioned into corresponding independent packages, which reuse a set of packages that provide common and basic functions. The steady evolution and use of the software during more than four years shows that the software is maintainable, adaptable, and usable
    corecore