31 research outputs found

    A methodology for speeding up matrix vector multiplication for single/multi-core architectures

    In this paper, a new methodology for computing dense matrix-vector multiplication is presented, for both embedded processors (without a SIMD unit) and general-purpose processors (single- and multi-core, with a SIMD unit). The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups of 1.2 to 1.45). This is achieved by fully exploiting the combination of software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity), which are considered simultaneously as one problem and not separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of (i) the number of data cache levels; (ii) the data cache sizes; (iii) the data cache associativities; (iv) the data cache and main memory latencies; (v) the data array layout of the matrix; and (vi) the number of cores.
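    The data-reuse idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's schedule: the function name `matvec_blocked` and the `row_block` parameter are hypothetical, and the block size is picked by hand rather than derived from cache sizes, associativities, or latencies. The sketch only shows how processing several rows together lets each element of the vector be loaded once and reused across those rows.

```python
def matvec_blocked(A, x, row_block=4):
    """Compute y = A @ x, processing `row_block` rows at a time.

    Illustrative only: `row_block` is a hand-picked assumption, not a value
    derived from the hardware parameters the abstract lists.
    """
    n_rows = len(A)
    n_cols = len(x)
    y = [0.0] * n_rows
    for i0 in range(0, n_rows, row_block):
        i1 = min(i0 + row_block, n_rows)
        acc = [0.0] * (i1 - i0)          # one accumulator per row in the block
        for j in range(n_cols):
            xj = x[j]                    # loaded once, reused for every row in the block
            for k in range(i1 - i0):
                acc[k] += A[i0 + k][j] * xj
        y[i0:i1] = acc
    return y


if __name__ == "__main__":
    A = [[1, 2], [3, 4], [5, 6]]
    x = [1, 1]
    print(matvec_blocked(A, x, row_block=2))
```

    In a compiled implementation the accumulators would typically live in registers, and the block size would be one of the scheduling parameters the methodology selects per target.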

    Parallel edge detection using uni-directional multiring on spiral architecture

    Improving computational efficiency is the key issue in image processing, especially in edge detection, because edge detection is very computationally intensive. With the development of real-time image processing applications, fast processing response is becoming the major requirement in this area. In this paper, a parallel and distributed algorithm on Spiral Architecture for edge detection using a uni-directional MultiRing is proposed. The proposed algorithm is based on the Master-Slave model. It guarantees better load balancing among the processors (or nodes) and greatly reduces the traffic to and from the master node. It uses each MultiRing configuration more efficiently. It also improves the performance of message passing among the nodes on the MultiRing.

    A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures

    Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel like matrix-matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented in which the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem, according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
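    To make the role of the tile-size parameter concrete, here is a minimal sketch of single-level loop tiling for MMM. It is a generic textbook formulation, not the implementation produced by the paper's methodology: the function name `matmul_tiled` and the `tile` argument are hypothetical, and the tile size is supplied by the caller instead of being derived from cache sizes and associativities.

```python
def matmul_tiled(A, B, tile=2):
    """Compute C = A @ B with one level of loop tiling.

    Illustrative only: `tile` is an assumed, hand-chosen parameter. The abstract
    describes deriving such values from the hardware parameters and input size.
    """
    n, k = len(A), len(B)
    m = len(B[0]) if B else 0
    C = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile so the
    # touched parts of A, B and C can remain in cache while they are reused.
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for p in range(kk, min(kk + tile, k)):
                        a = A[i][p]              # reused across the j loop
                        for j in range(jj, min(jj + tile, m)):
                            C[i][j] += a * B[p][j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(A, B, tile=2))
```

    The result does not depend on `tile`; only the memory access order changes, which is why tile sizes and the number of tiling levels can be treated as pure scheduling parameters.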

    Scalable switch for bi-directional MultiRing network

    MultiRing is a network of 2n nodes which can be configured into different ring networks of 2n-1 different configurations when ring direction is taken into account. It supports a wide variety of algorithms, such as algorithms for parallel image processing. In this paper, a MultiRing is implemented using a star topology with a MultiRing switch at the centre. We present a hierarchical design of the MultiRing switch. We demonstrate that the construction of the switch is economical in terms of gate count and port numbers. Our design preserves the requirement that all nodes can communicate simultaneously and independently in a ring configuration. We list the advantages of our design, such as scalability and feasibility, in comparison with other switch designs. © 2004 IEEE