
    Tightly-Coupled and Fault-Tolerant Communication in Parallel Systems

    The demand for processing power is increasing steadily. In the past, single-processor architectures clearly dominated the market. Because instruction-level parallelism is limited in most applications, significant performance gains can in the future only be achieved by exploiting parallelism at the higher levels of threads and processes. As a consequence, modern “processors” incorporate multiple processor cores that form a single shared-memory multiprocessor. In such systems, high-performance devices like network interface controllers are attached to processors and memory like any other input/output device, over a hierarchy of peripheral interconnects. One goal must therefore be to couple coprocessors physically closer to the main memory and processors of a computing node, removing the overhead of today’s peripheral interconnect structures. Such a step is the direct connection of HyperTransport (HT) devices to Opteron processors, which is presented in this thesis. This work also analyzes how communication from a device to processors can be optimized at the protocol level. As today’s computing nodes are shared-memory systems, the cache coherence protocol is the central protocol for data exchange between processors and devices; consequently, the analysis extends to classes of devices that are aware of the cache coherence protocol. In addition, the concept of a transfer cache is proposed in this thesis, which reduces latency significantly even for non-coherent devices.

    The trend toward exploiting process- and thread-level parallelism leads to a steady increase in system sizes. The networks used in such large systems are very susceptible to both hard and transient faults. Most transient fault rates are constant per bit that is stored or transmitted, so with increasing system sizes and higher clock frequencies the number of faults in time grows drastically. Eventually, the error rate may rise to a level where high-level error recovery becomes too costly unless lower layers perform error correction that is transparent to the layers above. The second part of this thesis describes a direct interconnection network that provides a reliable transport service even without the use of end-to-end protocols. In addition, a novel hardware-based solution for intermediate routing is developed in this thesis, which allows efficient, deadlock-free routing around faulty links.
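    The claim that a constant per-bit fault rate translates into drastically more faults in time at larger system sizes can be made concrete with a small back-of-the-envelope calculation. The sketch below is not taken from the thesis; the per-bit FIT rate and the bit counts are invented purely for illustration. It multiplies an assumed constant per-bit rate by the number of bits a system stores or transmits and reports the resulting expected faults in time and mean time between faults.

    /* Hypothetical illustration: with a constant per-bit transient fault rate,
     * the expected number of faults per unit time scales linearly with the
     * number of bits stored or transmitted. All numbers below are invented
     * for illustration; they are not measurements from the thesis. */
    #include <stdio.h>

    int main(void)
    {
        /* Assumed per-bit fault rate in FIT (failures per 10^9 device-hours). */
        const double fit_per_bit   = 1e-4;   /* hypothetical value */
        const double hours_per_fit = 1e9;

        /* Sweep over system sizes: total bits held in memories and link buffers. */
        const double bits[] = { 1e9, 1e12, 1e15 };

        for (int i = 0; i < 3; ++i) {
            double total_fit  = fit_per_bit * bits[i];      /* faults per 10^9 hours   */
            double mtbf_hours = hours_per_fit / total_fit;  /* mean time between faults */
            printf("%.0e bits -> %.2e FIT, MTBF ~ %.2e hours\n",
                   bits[i], total_fit, mtbf_hours);
        }
        return 0;
    }

    Under these assumed numbers, growing the system from 10^9 to 10^15 bits shrinks the mean time between transient faults from roughly 10^4 hours to well under an hour, which is why transparent error correction in the lower layers becomes attractive.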

    Memory Scalability Evaluation of the Next-Generation Intel Bensley Platform with InfiniBand

    As multi-core systems gain popularity for their increased computing power at low cost, the rest of the architecture must be kept in balance, in particular the memory subsystem. Many existing memory subsystems suffer from scalability issues and show memory performance degradation with more than one process running. To address these scalability issues, Fully Buffered DIMMs (FB-DIMMs) have recently been introduced. In this paper we present an initial performance evaluation of the next-generation multi-core Intel platform by evaluating the FB-DIMM-based memory subsystem and the associated InfiniBand performance. To the best of our knowledge, this is the first such study of Intel multi-core platforms with multi-rail InfiniBand DDR configurations. We provide an evaluation of the current-generation Intel Lindenhurst platform as a reference point. We find that the Intel Bensley platform can provide memory scalability to support memory accesses by multiple processes on the same machine, as well as drastically improved inter-node throughput over InfiniBand. On the Bensley platform we observe a 1.85x increase in aggregate write bandwidth over the Lindenhurst platform. For inter-node MPI-level benchmarks we show a bi-directional bandwidth of over 4.55 GB/sec for the Bensley platform using two DDR InfiniBand Host Channel Adapters (HCAs), an improvement of 77% over the current-generation Lindenhurst platform. The Bensley system is also able to achieve a throughput of 3.12 million MPI messages/sec in this configuration.
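    As a rough illustration of the kind of inter-node MPI-level bi-directional bandwidth measurement referred to above, the sketch below keeps a window of non-blocking sends and receives in flight between two ranks and reports the combined two-way bandwidth. It is not the benchmark used in the paper; the message size, window size, and iteration count are arbitrary assumptions, and it only requires a standard MPI installation (compile with mpicc, run with two ranks on two nodes).

    /* Minimal two-rank bi-directional bandwidth sketch (illustrative only). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE  (1 << 20)   /* 1 MiB per message (arbitrary choice)      */
    #define WINDOW    64          /* messages kept in flight per direction     */
    #define ITERS     100         /* repetitions of the window (arbitrary)     */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        int peer = 1 - rank;
        char *sbuf = malloc(MSG_SIZE);
        char *rbuf = malloc((size_t)WINDOW * MSG_SIZE); /* one slot per in-flight receive */
        memset(sbuf, 0, MSG_SIZE);
        MPI_Request reqs[2 * WINDOW];

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int it = 0; it < ITERS; ++it) {
            /* Both ranks post a window of sends and receives, so the link
               is driven in both directions at the same time. */
            for (int w = 0; w < WINDOW; ++w) {
                MPI_Irecv(rbuf + (size_t)w * MSG_SIZE, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
                MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[WINDOW + w]);
            }
            MPI_Waitall(2 * WINDOW, reqs, MPI_STATUSES_IGNORE);
        }

        double elapsed = MPI_Wtime() - t0;
        if (rank == 0) {
            /* Each direction carries WINDOW * ITERS messages of MSG_SIZE bytes;
               the bi-directional total is twice that. */
            double bytes = 2.0 * (double)MSG_SIZE * WINDOW * ITERS;
            printf("bi-directional bandwidth: %.2f MB/s\n", bytes / elapsed / 1e6);
        }

        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }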