unknown

ARCHITECTURAL IMPROVEMENTS OF INTERCONNECTION NETWORK INTERFACES

Abstract

The architecture of modern computing systems is getting more and more parallel, in order to exploit more of the offered parallelism by applications and to increase the system's overall performance. The recent trend for single systems is to include multiple cores in one processor module. Another trend is the introduction of virtual machines, which allows to run several operating systems (O/S) independently on one physical node. If the performance of one single system is not sufficient to meet the requirements, multiple single systems are combined in one parallel distributed system. Clusters are parallel distributed systems based on commodity computing parts and an interconnection network. They have an excellent cost-effectiveness and a high efficiency. This is substantiated by their increasing use in high performance computing. But they rely on a highly efficient interconnection network; otherwise computation is limited by the communication overhead. While the computing nodes and systems become more and more parallel due to architectural improvements, virtual machines and parallel programming paradigms, the network interface is typically available only once. If the network interface is not able to exploit the offered parallelism, it becomes a bottleneck limiting the system's overall performance. Goal of this work is to overcome this situation and to develop a network interface architecture which offers unconstrained and parallel access by multiple processes. Any available parallelism should be exploited without limitations, which is in particular true for virtual machine environments. Beside the network interface architecture a set of communication and synchronization methods is developed, which allow a close coupling of the computing nodes. In particular for fine grain communication such a tight coupling is inevitable. The developed network interface architecture has many similarities with the architecture of modern processors, but also introduces new techniques. It is based on Simultaneous Multi-Threading (SMT) and a memory hierarchy including an on-device Translation Look-aside Buffer. The SMT approach removes any partitioning, which allows to exploit any type of parallelism without constraints. While the SMT architecture for main processors relies on the O/S for context switching, the architecture here is self-switching. Upon an issue of a work request an available resource is switched to one of 2^16 contexts, each storing the configuration of the calling process. The main contribution of this work is the introduction of a new technique to enqueue work requests by multiple producers into a central shared work queue. The processes are the producers, which issue work by enqueuing requests. The scheduler of the network interface is the consumer, which forwards the work requests to available functional units. This enqueue technique is essential for an efficient virtualization of the network interface. It allows almost any number of processes to issue simultaneously work requests, is completely transparent to the processes and provides security by separation. The key component to achieve a high efficiency is to avoid explicit mutual exclusion. This is achieved by integrating the complete issue process into a single operation, including flow control to inform the process of the success or failure of the enqueue operation. The new developed techniques like virtualization and ULTRA are successfully tested and evaluated. Both have been implemented on FPGA-based prototyping stations. It is noteworthy that both can be used with any I/O interface, because they do not require any special functionality. But in particular ULTRA can benefit a lot from a closer coupling between peripheral device and main processor than traditional I/O standards like PCI or PCIe can offer. For this purpose a new rapid prototyping station is developed, called HTX-Board. It is based on an FPGA and connects to the main system over a HyperTransport (HT) interface. The HT interface avoids any intermediate bridges between device and main processor, providing an excellent coupling. With the HTX-Board the full potential of ULTRA's communication technique is shown, resulting in yet unmatched latencies for commodity systems. The insights gained during the development of the virtualization technique and ULTRA are used to analyze the requirements for a generic coprocessor interface. An instruction set extension is proposed to achieve a tight coupling between main processor and coprocessor. Key component is the tagging of load and store operations, which allows to include additional information. The efficiency and performance of the virtualization and ULTRA can be even improved by using the proposed instructions of this extension

    Similar works