260 research outputs found

    Loo.py: transformation-based code generation for GPUs and CPUs

    Today's highly heterogeneous computing landscape places a burden on programmers wanting to achieve high performance on a reasonably broad cross-section of machines. To do so, computations need to be expressed in many different but mathematically equivalent ways, with, in the worst case, one variant per target machine. Loo.py, a programming system embedded in Python, meets this challenge by defining a data model for array-style computations and a library of transformations that operate on this model. Offering transformations such as loop tiling, vectorization, storage management, unrolling, instruction-level parallelism, change of data layout, and many more, it provides a convenient way to capture, parametrize, and re-unify the growth among code variants. Optional, deep integration with numpy and PyOpenCL provides a convenient computing environment where the transition from prototype to high-performance implementation can occur in a gradual, machine-assisted form.
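
As an illustration of what "different but mathematically equivalent" variants means, the following minimal sketch writes a matrix-vector product twice in plain Python: once as a straightforward loop nest and once with loop tiling. It does not use Loo.py or its API; the function names and the `tile` parameter are hypothetical and chosen only for this example.

```python
from typing import List, Sequence


def matvec_reference(a: Sequence[Sequence[float]], x: Sequence[float]) -> List[float]:
    """Plain loop nest: y[i] = sum_j a[i][j] * x[j]."""
    n, m = len(a), len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * x[j]
    return y


def matvec_tiled(a: Sequence[Sequence[float]], x: Sequence[float], tile: int = 4) -> List[float]:
    """Same computation with both loops split into tiles of size `tile`.

    The outer loops walk over tile origins (i0, j0); the inner loops walk
    over the elements of one tile. The min() calls handle partial tiles at
    the edges when the sizes are not multiples of `tile`.
    """
    n, m = len(a), len(x)
    y = [0.0] * n
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, m)):
                    y[i] += a[i][j] * x[j]
    return y
```

Both functions return the same result for any positive tile size; only the iteration order over the index space differs, which is the kind of variation a transformation library can generate mechanically instead of by hand.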

    Distributed-Memory Breadth-First Search on Massive Graphs

    This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm and the recently discovered direction-optimizing algorithm. We analyze the performance and scalability trade-offs of using different local data structures such as CSR and DCSC, of enabling in-node multithreading, and of graph decompositions such as 1D and 2D decomposition. Comment: arXiv admin note: text overlap with arXiv:1104.451
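
For readers unfamiliar with the terms, here is a minimal single-node sketch of a level-synchronous top-down breadth-first search over a graph stored in CSR form. It is not the chapter's distributed implementation: there is no 1D or 2D decomposition, no multithreading, and no direction optimization. The names `bfs_levels_csr`, `row_ptr`, and `col_idx` are assumptions made for this example.

```python
from typing import List


def bfs_levels_csr(row_ptr: List[int], col_idx: List[int], source: int) -> List[int]:
    """Return the BFS level of every vertex, or -1 if it is unreachable.

    CSR layout: the neighbours of vertex u are
    col_idx[row_ptr[u] : row_ptr[u + 1]].
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    depth = 0
    # Level-synchronous: finish the whole current frontier before
    # starting on the next one.
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            # Top-down step: each frontier vertex inspects its out-edges.
            for k in range(row_ptr[u], row_ptr[u + 1]):
                v = col_idx[k]
                if level[v] == -1:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level


# Undirected example graph with edges 0-1, 0-2, 1-3, 2-3, 3-4; vertex 5 is isolated.
row_ptr = [0, 2, 4, 6, 9, 10, 10]
col_idx = [1, 2, 0, 3, 0, 3, 1, 2, 4, 3]
print(bfs_levels_csr(row_ptr, col_idx, 0))  # [0, 1, 1, 2, 3, -1]
```

In a distributed-memory setting the frontier and the CSR arrays would be partitioned across processes, and the swap to the next frontier would become a communication step.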

    RingScalar: A Complexity-Effective Out-of-Order Superscalar Microarchitecture

    RingScalar is a complexity-effective microarchitecture for out-of-order superscalar processors that reduces the area, latency, and power of all major structures in the instruction flow. The design divides an N-way superscalar into N columns connected in a unidirectional ring, where each column contains a portion of the instruction window, a bank of the register file, and an ALU. Because most decoded instructions are waiting on just one operand, the design uses only a single tag per issue window entry and restricts instruction wakeup and value bypass to communication with the neighboring column. Detailed simulations of four-issue single-threaded machines running SPECint2000 show that RingScalar has IPC only 13% lower than an idealized superscalar, while providing large reductions in area, power, and circuit latency.

    Scale Control Processor Test-Chip

    We are investigating vector-thread architectures, which provide competitive performance and efficiency across a broad class of application domains. Vector-thread architectures unify data-level, thread-level, and instruction-level parallelism, providing new ways of parallelizing codes that are difficult to vectorize or that incur excessive synchronization costs when multithreaded. To illustrate these ideas we have developed the Scale processor, which is an example of a vector-thread architecture designed for low-power and high-performance embedded systems. The prototype includes a single-issue 32-bit RISC control processor, a vector-thread unit that supports up to 128 virtual processor threads and can execute up to 16 instructions per cycle, and a 32 KB shared primary cache. Since the Scale Vector-Thread Processor is a large and complex design (especially for an academic project), we first designed and fabricated the Scale Test Chip (STC1). STC1 includes a simplified version of the Scale control processor, 8 KB of RAM, a host interface, and a custom clock generator. STC1 helped mitigate the risk involved in fabricating the full Scale chip in several ways. First, we were able to establish and test our CAD toolflow, which included several custom tools that had not previously been used in any tapeouts. Second, we were able to better characterize our target package and process. For example, STC1 enabled us to better correlate the static timing numbers from our CAD tools with actual silicon and also to characterize the expected rise/fall times of our external signal pins. Finally, STC1 allowed us to test our custom clock generator. We used our experiences with STC1 to help us implement the Scale vector-thread processor. Scale was taped out on October 15, 2006 and it is currently being fabricated through MOSIS. This report discusses the fabrication of STC1 and presents power and performance results.

    StochKit-FF: Efficient Systems Biology on Multicore Architectures

    The stochastic modelling of biological systems is an informative and, in some cases, very adequate technique, which may however be more expensive than other modelling approaches, such as differential equations. We present StochKit-FF, a parallel version of StochKit, a reference toolkit for stochastic simulations. StochKit-FF is based on the FastFlow programming toolkit for multicores and exploits the novel concept of selective memory. We evaluate StochKit-FF on a model of HIV infection dynamics, with the aim of extracting information from efficiently run experiments, here in terms of average and variance and, in the longer term, of more structured data. Comment: 14 pages + cover page

    Hardware Transactional Memory

    This work shows how hardware transactional memory (HTM) can be implemented to support transactions of arbitrarily large size, while ensuring that small transactions run efficiently. Our implementation handles small transactions similar to Herlihy and Moss's scheme in that it holds tentative updates in a cache. Unlike their scheme, which uses a special fully associative cache, ours augments the ordinary processor cache and provides a mechanism to handle cache spills of uncommitted transactional data. Consequently, our scheme runs faster for small transactions while correctly handling transactions of arbitrarily large size. Although transactions are small in the common case, we argue that HTM should not restrict the size of transactions, because it complicates the programmer/compiler model and precludes some important programs from exploiting transactional memory. We show that the Linux 2.4.19 kernel can be automatically and efficiently “transactified” if boundless transactions can be supported. Our experimental results show that the largest transaction touches over 7000 64-byte cache lines, whereas 99.94% of the transactions touch fewer than 64 cache lines. We further show that synchronized methods in Java can be easily compiled to our HTM scheme, thereby providing the advantages of nonblocking atomicity (including absence of deadlock) in a straightforward fashion. Our HTM scheme for boundless transactions uses an efficiently implementable hardware snapshot and the ordinary set-associative L2 cache extended with less than two bits per cache line. One of the bits tells whether the cached item is part of a transaction (as in the Herlihy-Moss scheme), and all the lines in an associative set share another bit telling whether a line has overflowed from the cache and is now stored in a special overflow area of main memory. We provide empirical results to show that our scheme does not adversely affect the processor pipeline or hinder speculative execution. Singapore-MIT Alliance (SMA)

    Compression and strength behaviour of viscose/polypropylene nonwoven fabrics

    Compression and strength properties of viscose/polypropylene nonwoven fabrics have been studied. The compression behavior of the nonwoven samples (sample compressibility, sample thickness loss, and sample compressive resilience) has been analyzed considering the magnitude of applied pressure, fabric weight, fabric thickness, and the porosity of the samples. Based on the calculated porosity of the samples, the pore compression behavior (pore compressibility, porosity loss, and pore compressive resilience) is determined. Equations for the determination of pore compressibility, porosity loss, and pore compressive resilience are established. Tensile strength and elongation as well as bursting strength and ball traverse elongation are also determined. The results show that the sample compression behavior as well as the pore compression behavior depend on the magnitude of applied pressure. At a high level of applied pressure, a sample with higher compressibility has lower sample compressive resilience. Differences in pore compressibility and porosity loss between the investigated samples have also been registered, but not in pore compressive resilience. The sample with the higher fabric weight, higher thickness, and lower porosity shows lower sample compressibility, pore compressibility, sample thickness loss, porosity loss, and tensile elongation, but higher tensile strength, bursting strength, and ball traverse elongation.

    Quality of clothing fabrics in terms of their comfort properties

    The quality of various woven clothing fabrics has been studied with respect to their comfort properties, such as electro-physical properties, air permeability, and compression properties. The fabrics are produced from cotton and cotton/polyester fibre blends in plain, twill, satin, and basket weave. Results show that cotton fabrics have lower values of volume resistivity, air permeability, and compressive resilience but higher values of effective relative dielectric permeability and compressibility than fabrics produced from cotton/PES fibre blends. Regression analysis shows a strong linear relationship between the air permeability and the porosity of the woven fabrics, with a very high coefficient of linear correlation (0.9807). It is also observed that comfort properties are determined by the structure of the woven fabrics (raw material composition, type of weave) as well as by the fabric surface condition. The findings have been used to estimate the quality of the woven fabrics in terms of their comfort properties by applying a ranking method. It is concluded that the group of cotton fabrics exhibits better comfort quality than the group of cotton/PES blend fabrics.