720 research outputs found

    Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

    Get PDF
    We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect

    Lattice-Boltzmann and finite-difference simulations for the permeability for three-dimensional porous media

    Full text link
    Numerical micropermeametry is performed on three dimensional porous samples having a linear size of approximately 3 mm and a resolution of 7.5 μ\mum. One of the samples is a microtomographic image of Fontainebleau sandstone. Two of the samples are stochastic reconstructions with the same porosity, specific surface area, and two-point correlation function as the Fontainebleau sample. The fourth sample is a physical model which mimics the processes of sedimentation, compaction and diagenesis of Fontainebleau sandstone. The permeabilities of these samples are determined by numerically solving at low Reynolds numbers the appropriate Stokes equations in the pore spaces of the samples. The physical diagenesis model appears to reproduce the permeability of the real sandstone sample quite accurately, while the permeabilities of the stochastic reconstructions deviate from the latter by at least an order of magnitude. This finding confirms earlier qualitative predictions based on local porosity theory. Two numerical algorithms were used in these simulations. One is based on the lattice-Boltzmann method, and the other on conventional finite-difference techniques. The accuracy of these two methods is discussed and compared, also with experiment.Comment: to appear in: Phys.Rev.E (2002), 32 pages, Latex, 1 Figur

    Static and Dynamic Scheduling for Effective Use of Multicore Systems

    Get PDF
    Multicore systems have increasingly gained importance in high performance computers. Compared to the traditional microarchitectures, multicore architectures have a simpler design, higher performance-to-area ratio, and improved power efficiency. Although the multicore architecture has various advantages, traditional parallel programming techniques do not apply to the new architecture efficiently. This dissertation addresses how to determine optimized thread schedules to improve data reuse on shared-memory multicore systems and how to seek a scalable solution to designing parallel software on both shared-memory and distributed-memory multicore systems. We propose an analytical cache model to predict the number of cache misses on the time-sharing L2 cache on a multicore processor. The model provides an insight into the impact of cache sharing and cache contention between threads. Inspired by the model, we build the framework of affinity based thread scheduling to determine optimized thread schedules to improve data reuse on all the levels in a complex memory hierarchy. The affinity based thread scheduling framework includes a model to estimate the cost of a thread schedule, which consists of three submodels: an affinity graph submodel, a memory hierarchy submodel, and a cost submodel. Based on the model, we design a hierarchical graph partitioning algorithm to determine near-optimal solutions. We have also extended the algorithm to support threads with data dependences. The algorithms are implemented and incorporated into a feedback directed optimization prototype system. The prototype system builds upon a binary instrumentation tool and can improve program performance greatly on shared-memory multicore architectures. We also study the dynamic data-availability driven scheduling approach to designing new parallel software on distributed-memory multicore architectures. We have implemented a decentralized dynamic runtime system. The design of the runtime system is focused on the scalability metric. At any time only a small portion of a task graph exists in memory. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. Our experimental results demonstrate the scalability and practicality of the approach for both shared-memory and distributed-memory multicore systems. Finally, we present a scalable nonblocking topology-aware multicast scheme for distributed DAG scheduling applications

    QuEST and High Performance Simulation of Quantum Computers

    Full text link
    We introduce QuEST, the Quantum Exact Simulation Toolkit, and compare it to ProjectQ, qHipster and a recent distributed implementation of Quantum++. QuEST is the first open source, OpenMP and MPI hybridised, GPU accelerated simulator of universal quantum circuits. Embodied as a C library, it is designed so that a user's code can be deployed seamlessly to any platform from a laptop to a supercomputer. QuEST is capable of simulating generic quantum circuits of general single-qubit gates and multi-qubit controlled gates, on pure and mixed states, represented as state-vectors and density matrices, and under the presence of decoherence. Using the ARCUS Phase-B and ARCHER supercomputers, we benchmark QuEST's simulation of random circuits of up to 38 qubits, distributed over up to 2048 compute nodes, each with up to 24 cores. We directly compare QuEST's performance to ProjectQ's on single machines, and discuss the differences in distribution strategies of QuEST, qHipster and Quantum++. QuEST shows excellent scaling, both strong and weak, on multicore and distributed architectures.Comment: 8 pages, 8 figures; fixed typos; updated QuEST URL and fixed typo in Fig. 4 caption where ProjectQ and QuEST were swapped in speedup subplot explanation; added explanation of simulation algorithm, updated bibliography; stressed technical novelty of QuEST; mentioned new density matrix suppor

    Adaptable register file organization for vector processors

    Get PDF
    Contemporary Vector Processors (VPs) are de-signed either for short vector lengths, e.g., Fujitsu A64FX with 512-bit ARM SVE vector support, or long vectors, e.g., NEC Aurora Tsubasa with 16Kbits Maximum Vector Length (MVL1). Unfortunately, both approaches have drawbacks. On the one hand, short vector length VP designs struggle to provide high efficiency for applications featuring long vectors with high Data Level Parallelism (DLP). On the other hand, long vector VP designs waste resources and underutilize the Vector Register File (VRF) when executing low DLP applications with short vector lengths. Therefore, those long vector VP implementations are limited to a specialized subset of applications, where relatively high DLP must be present to achieve excellent performance with high efficiency. Modern scientific applications are getting more diverse, and the vector lengths in those applications vary widely. To overcome these limitations, we propose an Adaptable Vector Architecture (AVA) that leads to having the best of both worlds. AVA is designed for short vectors (MVL=16 elements) and is thus area and energy-efficient. However, AVA has the functionality to reconfigure the MVL, thereby allowing to exploit the benefits of having a longer vector of up to 128 elements microarchitecture when abundant DLP is present. We model AVA on the gem5 simulator and evaluate AVA performance with six applications taken from the RiVEC Benchmark Suite. To obtain area and power consumption metrics, we model AVA on McPAT for 22nm technology. Our results show that by reconfiguring our small VRF (8KB) plus our novel issue queue scheme, AVA yields a 2X speedup over the default configuration for short vectors. Additionally, AVA shows competitive performance when compared to a long vector VP, while saving 50% of area.Research reported in this publication is partially supported by CONACyT Mexico under Grant No. 472106, the Spanish State Research Agency - Ministry of Science and Innovation (contract PID2019-107255GB), and the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of the total cost eligible, under the DRAC project [001-P-001723].Peer ReviewedPostprint (author's final draft
    • …
    corecore