4 research outputs found

    A Hardware Verification Methodology for an Interconnection Network with fast Process Synchronization

    Shrinking process node sizes allow more and more functionality to be integrated into a single chip design. At the same time, the mask costs to manufacture a new chip increase steadily. Industry can absorb this cost increase by selling more chips, but innovative chip designs carry a higher risk, so industry changes only small parts of a chip design between generations to minimize that risk. Thus, truly innovative chip designs can only be realized by research institutes, which are not subject to the same cost restrictions and market pressure as industry. One such innovative research project is EXTOLL, developed by the Computer Architecture Group of the University of Heidelberg. It is a new interconnection network for high-performance computing and targets the shortcomings of commercially available interconnection networks. EXTOLL is optimized for high bandwidth, low latency, and a high message rate. Low latency and a high message rate in particular are becoming more important for modern interconnection networks: as networks grow, the same computational problem is distributed across more nodes, which leads to a finer data granularity and more small messages that the interconnection network has to transport. This thesis addresses the problem of small messages in the interconnection network. It develops a new network protocol that is optimized for small messages and reduces the protocol overhead required to send them. Furthermore, growing network sizes introduce a reliability problem, which the developed protocol also addresses. The finer data granularity also increases the need for efficient barrier synchronization. Such a hardware barrier synchronization is developed in this thesis, using a new approach that integrates the barrier functionality into the interconnection network. The mask costs of manufacturing an ASIC make it difficult for a research institute to build one: a research institute cannot afford a re-spin, so there is pressure to get the design right the first time. One way to avoid a re-spin is functional verification prior to submission. A complete and comprehensive verification methodology is therefore developed for the EXTOLL interconnection network. Thanks to its structured approach, the functional verification can be carried out with limited resources in a short time frame. Additionally, the developed verification methodology supports different target technologies for the design with very little overhead.
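
    To make the overhead argument concrete, the following C sketch contrasts a generic packet header with a compact header for small messages. It is purely illustrative: the field names, widths, and layouts are assumptions and do not reflect the actual EXTOLL protocol format.

    /* Illustrative sketch of how a protocol can shrink per-packet overhead for
     * small messages; all fields and sizes are assumptions, not EXTOLL's. */
    #include <stdint.h>
    #include <stdio.h>

    /* A "full" header carrying routing, sequencing, and end-to-end protection. */
    struct full_header {
        uint32_t dst_node;   /* destination node id            */
        uint32_t src_node;   /* source node id                 */
        uint16_t seq;        /* sequence number for retransmit */
        uint16_t length;     /* payload length in bytes        */
        uint32_t crc;        /* end-to-end checksum            */
    };

    /* A compact header for small messages: shorter ids and an implicit,
     * connection-derived sequence number reduce the fixed overhead. */
    struct small_header {
        uint16_t dst_node;
        uint8_t  length;     /* payloads up to 255 bytes       */
        uint8_t  flags;      /* e.g. a "barrier" bit           */
    };

    int main(void) {
        /* For an 8-byte payload the relative overhead drops considerably. */
        printf("full:  %zu header bytes\n", sizeof(struct full_header));
        printf("small: %zu header bytes\n", sizeof(struct small_header));
        return 0;
    }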

    Hardware Support for Efficient Packet Processing

    Scalability is the key ingredient for further increasing the performance of today's supercomputers. As other approaches such as frequency scaling reach their limits, parallelization is the only feasible way to further improve performance. To keep parallelizing such systems, the time required for communication must be kept as small as possible. In the first part of this thesis, ways to reduce the latency incurred in packet-based interconnection networks are analyzed, and several new architectural solutions are proposed. These solutions have been tested and proven in a field-programmable gate array (FPGA) environment. In addition, a hardware (HW) structure is presented that enables low-latency packet processing for financial markets. The second part, and the main contribution of this thesis, is a newly designed crossbar architecture. It introduces a novel way to integrate multicast capability into a crossbar design. Furthermore, an efficient implementation of adaptive routing to reduce the congestion vulnerability of packet-based interconnection networks is shown. The low latency of the design is demonstrated through simulation, and its scalability is proven with synthesis results. The third part concentrates on the improvements and modifications made to EXTOLL, a high-performance interconnection network specifically designed for low-latency and high-throughput applications. The contributions are modules that enable an efficient integration of multiple host interfaces as well as the integration of the on-chip interconnect. Additionally, some of the existing functionality has been revised and improved to achieve better performance and lower latency. Micro-benchmark results are presented to underline the contribution of these modifications.
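
    As a rough illustration of the two crossbar ideas mentioned above, the following C sketch models multicast as an output-port bitmask and adaptive routing as a choice of the least-congested admissible output port. The port count, data layout, and congestion metric are assumptions for illustration and do not describe the thesis's actual design.

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_PORTS 8

    /* Multicast: a packet carries a bitmask with one bit per crossbar output
     * port, so a single input transfer is replicated to all marked outputs. */
    static void forward(uint8_t port_mask, const char *payload) {
        for (int p = 0; p < NUM_PORTS; p++)
            if (port_mask & (1u << p))
                printf("replicating '%s' to output port %d\n", payload, p);
    }

    /* Adaptive routing: among the admissible ports, pick the one whose output
     * queue currently holds the fewest entries (a crude congestion estimate). */
    static int pick_port(uint8_t admissible_mask, const int queue_depth[NUM_PORTS]) {
        int best = -1;
        for (int p = 0; p < NUM_PORTS; p++) {
            if (!(admissible_mask & (1u << p)))
                continue;
            if (best < 0 || queue_depth[p] < queue_depth[best])
                best = p;
        }
        return best;
    }

    int main(void) {
        int depth[NUM_PORTS] = {3, 0, 7, 1, 4, 2, 5, 6};
        forward(0x0Au, "hello");                 /* multicast to ports 1 and 3 */
        printf("adaptive choice: port %d\n", pick_port(0x0Fu, depth));
        return 0;
    }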

    Acceleration of the hardware-software interface of a communication device for parallel systems

    During the last decades, the ever-growing need for computational power has fostered the development of parallel computer architectures. Applications need to be parallelized and optimized to exploit modern system architectures. Today, the scalability of applications is increasingly limited both by development resources, as programming complex parallel applications becomes ever more demanding, and by the fundamental scalability issues introduced by the cost of communication in distributed-memory systems. Lowering the latency of communication is mandatory to increase scalability; it serves as an enabling technology for programming distributed-memory systems at a higher abstraction layer with higher degrees of compiler-driven automation, and at the same time it can increase the performance of such systems in general. In this work, the software/hardware interface and the network interface controller functions of the EXTOLL network architecture, which is specifically designed to satisfy the needs of low-latency networking for high-performance computing, are presented. Several new architectural contributions are made in this thesis, namely a new, efficient method for virtual-to-physical address translation named ATU and a novel method to issue operations to a virtual device in an optimal way, termed Transactional I/O. This new method requires changes to the architecture of the host CPU the device is connected to. Two additional methods that emulate most of the characteristics of Transactional I/O are therefore developed and employed in the EXTOLL hardware to facilitate use with contemporary CPUs. These methods heavily leverage properties of the HyperTransport interface used to connect the device to the CPU. Finally, this thesis also introduces an optimized remote-memory-access architecture for efficient split-phase transactions and atomic operations. The complete architecture has been prototyped using FPGA technology, enabling a more precise analysis and verification than is possible using simulation alone. The resulting design utilizes 95 % of a 90 nm FPGA device and reaches speeds of 200 MHz and 156 MHz in the design's different clock domains. The EXTOLL software stack is developed, and a performance evaluation of the software on the EXTOLL hardware is performed. The evaluation shows an excellent start-up latency of 1.3 μs, which competes with the most advanced networks available, in spite of the performance handicap of FPGA technology. The resulting network is, to the best of the author's knowledge, the fastest FPGA-based interconnection network for commodity processors built to date.
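
    The following C sketch shows, in very simplified form, what a device-side virtual-to-physical translation step performed by an ATU-like unit might look like. The table layout, page size, and names are assumptions for illustration only and do not describe the actual ATU.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                       /* assume 4 KiB pages */
    #define PAGE_SIZE  (1u << PAGE_SHIFT)
    #define NUM_PAGES  16

    /* One entry per registered virtual page, filled in by the driver. */
    struct atu_entry {
        uint64_t phys_page;                     /* physical page frame number */
        uint8_t  valid;
    };

    static struct atu_entry atu_table[NUM_PAGES];

    /* Translate a device-visible virtual address to a physical address. */
    static int atu_translate(uint64_t vaddr, uint64_t *paddr) {
        uint64_t vpage  = vaddr >> PAGE_SHIFT;
        uint64_t offset = vaddr & (PAGE_SIZE - 1);
        if (vpage >= NUM_PAGES || !atu_table[vpage].valid)
            return -1;                          /* miss: signal a fault */
        *paddr = (atu_table[vpage].phys_page << PAGE_SHIFT) | offset;
        return 0;
    }

    int main(void) {
        atu_table[2] = (struct atu_entry){ .phys_page = 0x1234, .valid = 1 };
        uint64_t pa;
        if (atu_translate((2ull << PAGE_SHIFT) + 0x40, &pa) == 0)
            printf("translated to physical address 0x%llx\n", (unsigned long long)pa);
        return 0;
    }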

    ARCHITECTURAL IMPROVEMENTS OF INTERCONNECTION NETWORK INTERFACES

    The architecture of modern computing systems is becoming more and more parallel, in order to exploit more of the parallelism offered by applications and to increase the system's overall performance. The recent trend for single systems is to include multiple cores in one processor module. Another trend is the introduction of virtual machines, which allow several operating systems (OS) to run independently on one physical node. If the performance of a single system is not sufficient to meet the requirements, multiple single systems are combined into one parallel distributed system. Clusters are parallel distributed systems based on commodity computing parts and an interconnection network. They have an excellent cost-effectiveness and a high efficiency, as substantiated by their increasing use in high-performance computing. But they rely on a highly efficient interconnection network; otherwise computation is limited by the communication overhead. While the computing nodes and systems become more and more parallel due to architectural improvements, virtual machines, and parallel programming paradigms, the network interface is typically available only once. If the network interface cannot exploit the offered parallelism, it becomes a bottleneck limiting the system's overall performance. The goal of this work is to overcome this situation and to develop a network interface architecture that offers unconstrained, parallel access by multiple processes. Any available parallelism should be exploited without limitations, which holds in particular for virtual machine environments. Besides the network interface architecture, a set of communication and synchronization methods is developed which allows a close coupling of the computing nodes. For fine-grained communication in particular, such a tight coupling is indispensable. The developed network interface architecture has many similarities with the architecture of modern processors, but also introduces new techniques. It is based on Simultaneous Multi-Threading (SMT) and a memory hierarchy including an on-device Translation Look-aside Buffer. The SMT approach removes any partitioning, which allows any type of parallelism to be exploited without constraints. While the SMT architecture of main processors relies on the OS for context switching, the architecture presented here is self-switching: upon the issue of a work request, an available resource is switched to one of 2^16 contexts, each storing the configuration of the calling process. The main contribution of this work is a new technique for enqueuing work requests from multiple producers into a central shared work queue. The processes are the producers, which issue work by enqueuing requests. The scheduler of the network interface is the consumer, which forwards the work requests to available functional units. This enqueue technique is essential for an efficient virtualization of the network interface. It allows almost any number of processes to issue work requests simultaneously, is completely transparent to the processes, and provides security by separation. The key to achieving high efficiency is avoiding explicit mutual exclusion. This is accomplished by integrating the complete issue process into a single operation, including flow control to inform the process of the success or failure of the enqueue operation (see the sketch after this abstract). The newly developed techniques, virtualization and ULTRA, are successfully tested and evaluated. Both have been implemented on FPGA-based prototyping stations. It is noteworthy that both can be used with any I/O interface, because they do not require any special functionality. But ULTRA in particular benefits greatly from a closer coupling between the peripheral device and the main processor than traditional I/O standards like PCI or PCIe can offer. For this purpose a new rapid prototyping station, called HTX-Board, is developed. It is based on an FPGA and connects to the main system over a HyperTransport (HT) interface. The HT interface avoids any intermediate bridges between the device and the main processor, providing an excellent coupling. With the HTX-Board the full potential of ULTRA's communication technique is shown, resulting in as yet unmatched latencies for commodity systems. The insights gained during the development of the virtualization technique and ULTRA are used to analyze the requirements for a generic coprocessor interface. An instruction set extension is proposed to achieve a tight coupling between the main processor and the coprocessor. Its key component is the tagging of load and store operations, which allows additional information to be included. The efficiency and performance of the virtualization and of ULTRA can be improved even further by using the instructions proposed in this extension.
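
    The shared-work-queue idea described in the abstract can be sketched in software. The following C example uses a lock-free slot reservation on a shared ring as a stand-in for the hardware mechanism, so a producer learns success or failure without ever taking an explicit lock. All names, the queue layout, and the use of C11 atomics are assumptions for illustration; the actual hardware integrates the whole issue, including flow control, into a single device operation.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define QUEUE_SLOTS 8

    struct work_request { uint32_t context_id; uint32_t opcode; };

    static struct work_request queue[QUEUE_SLOTS];
    static atomic_uint tail;                    /* next free slot           */
    static atomic_uint head;                    /* next slot to be consumed */

    /* Issue a request: reserving a slot and learning whether the queue is
     * full happen in one lock-free step, so producers never block each other. */
    static int issue(struct work_request req) {
        unsigned t = atomic_load(&tail);
        do {
            if (t - atomic_load(&head) >= QUEUE_SLOTS)
                return -1;                      /* full: flow control reports failure */
        } while (!atomic_compare_exchange_weak(&tail, &t, t + 1));
        queue[t % QUEUE_SLOTS] = req;           /* slot t now belongs to this producer */
        return 0;
    }

    int main(void) {
        struct work_request r = { .context_id = 7, .opcode = 1 };
        printf("issue %s\n", issue(r) == 0 ? "accepted" : "rejected");
        return 0;
    }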