
    Parallel image restoration

    In this thesis, we are concerned with the image restoration problem, which has been formulated in the literature as a system of linear inequalities. With this formulation, the resulting constraint matrix is an unstructured sparse matrix, and even small images lead to huge matrices. To solve the restoration problem, we therefore use surrogate constraint methods, which work efficiently on large problems and are amenable to parallel implementation. Among the surrogate constraint methods, the basic method considers all of the violated constraints in the system and performs a single block projection in each step. The parallel method, on the other hand, considers a subset of the constraints and makes simultaneous block projections. Using several partitioning strategies and adopting different communication models, we realize several parallel implementations of the two methods. We use hypergraph-partitioning-based decomposition methods to minimize the communication costs while ensuring load balance among the processors. The implementations are evaluated on both per-iteration and overall performance. In addition, the effects of the different partitioning strategies on the speed of convergence are investigated. The experimental results reveal that the proposed parallelization schemes have practical use in the restoration problem and in many other real-world applications that can be modeled as a system of linear inequalities. (Malas, Tahir, M.S. thesis)
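    A minimal sketch of one step of the basic surrogate constraint method described above, for a system Ax <= b. The uniform surrogate weights, relaxation parameter, and toy data are illustrative assumptions, not the thesis' exact settings.

```python
import numpy as np

def surrogate_projection(A, b, x, lam=1.0):
    """One basic surrogate-constraint step for the system Ax <= b.

    All currently violated constraints are combined into a single
    surrogate constraint (uniform weights here) and x is projected onto
    the surrogate half-space, scaled by the relaxation parameter lam.
    """
    residual = A @ x - b
    violated = residual > 0
    if not np.any(violated):
        return x, True                      # feasible point reached
    w = np.zeros_like(residual)
    w[violated] = 1.0 / violated.sum()      # uniform surrogate weights
    a_s = A.T @ w                           # surrogate constraint row
    b_s = w @ b
    step = lam * (a_s @ x - b_s) / (a_s @ a_s)
    return x - step * a_s, False

# toy infeasible starting point; iterate until a feasible point is found
A = np.array([[1.0, 2.0], [-1.0, 1.0], [3.0, -1.0]])
b = np.array([-1.0, 1.0, 2.0])
x = np.zeros(2)
for _ in range(200):
    x, done = surrogate_projection(A, b, x)
    if done:
        break
```

    The parallel variant described in the abstract would apply the same kind of projection to disjoint constraint blocks on different processors and combine the resulting steps.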

    Web-site-based partitioning techniques for efficient parallelization of the PageRank computation

    Web search engines use ranking techniques to order Web pages in query results. PageRank is an important technique that orders Web pages according to the linkage structure of the Web. The efficiency of the PageRank computation matters because the constantly evolving nature of the Web requires this computation to be repeated many times. PageRank computation includes repeated iterative sparse matrix-vector multiplications. Due to the enormous size of the Web matrix to be multiplied, PageRank computations are usually carried out on parallel systems. However, efficiently parallelizing PageRank is not an easy task because of the irregular sparsity pattern of the Web matrix. Graph- and hypergraph-partitioning-based techniques are widely used for efficiently parallelizing matrix-vector multiplications. Recently, a hypergraph-partitioning-based decomposition technique for fast parallel computation of PageRank was proposed. This technique aims to minimize the communication overhead of the parallel matrix-vector multiplication, but it has a high preprocessing time, which makes it impractical. In this work, we propose 1D (rowwise and columnwise) and 2D (fine-grain and checkerboard) decomposition models using web-site-based graph- and hypergraph-partitioning techniques. The proposed models minimize the communication overhead of the parallel PageRank computations with a reasonable preprocessing time. The models encapsulate not only the matrix-vector multiplication but also the overall iterative algorithm. Conducted experiments show that the proposed models achieve fast PageRank computation with low preprocessing time compared with those in the literature. (Cevahir, Ali, M.S. thesis)
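    A serial sketch of the iterative kernel discussed above: power-iteration PageRank with a 1D rowwise split of the Web matrix. The row blocks stand in for per-processor partitions; the toy link matrix, damping factor, and uniform block boundaries are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix

def pagerank_rowwise(A, n_parts=4, alpha=0.85, tol=1e-8, max_iter=100):
    """Power iteration x <- alpha*A*x + (1-alpha)/n with rowwise blocks.

    In a real parallel run, each block's owner needs the x-vector entries
    touched by its nonzero columns; that communication volume is what the
    partitioning models aim to minimize.
    """
    n = A.shape[0]
    x = np.full(n, 1.0 / n)
    bounds = np.linspace(0, n, n_parts + 1, dtype=int)
    blocks = [A[bounds[p]:bounds[p + 1], :] for p in range(n_parts)]
    for _ in range(max_iter):
        # each "processor" multiplies its row block by the shared vector
        y = np.concatenate([blk @ x for blk in blocks])
        x_new = alpha * y + (1 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# toy 4-page, column-stochastic link matrix split into 2 row blocks
links = csr_matrix(np.array([[0.0, 0.0, 1.0, 0.0],
                             [0.5, 0.0, 0.0, 0.0],
                             [0.5, 1.0, 0.0, 1.0],
                             [0.0, 0.0, 0.0, 0.0]]))
print(pagerank_rowwise(links, n_parts=2))
```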

    Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors

    Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While high-performance implementations of the BLAS (and LAPACK) functionality exist for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM® big.LITTLE™ processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency.
    The researchers from Universidad Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2015-65277-R. The researcher from Universitat Politècnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Departament d'Innovació, Universitats i Empresa.
    Catalán, S.; Herrero, J. R.; Igual Peña, F. D.; Rodríguez-Sánchez, R.; Quintana Ortí, E. S.; Adeniyi-Jones, C. (2018). Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors. Journal of Computational Science, 25:140-151. https://doi.org/10.1016/j.jocs.2016.10.020
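    A rough sketch of how work might be divided on such an asymmetric processor. The proportional row split and the relative big-core speed are assumed values for illustration; the paper's coarse-grain strategy distributes work dynamically rather than with a fixed ratio.

```python
def split_rows_for_amp(m, n_big, n_little, big_speed=2.0):
    """Split m rows of a BLAS-3 panel between big and LITTLE core clusters.

    Rows are assigned to each cluster in proportion to its aggregate
    throughput (coarse grain), then divided evenly inside the cluster
    (fine grain). big_speed is the assumed relative throughput of a big
    core versus a LITTLE core.
    """
    total = n_big * big_speed + n_little
    rows_big = round(m * n_big * big_speed / total)
    rows_little = m - rows_big

    def even_chunks(count, parts):
        base, extra = divmod(count, parts)
        return [base + (1 if i < extra else 0) for i in range(parts)]

    return even_chunks(rows_big, n_big), even_chunks(rows_little, n_little)

# e.g. 4 big + 4 LITTLE cores, as on an Exynos 5422-class SoC
print(split_rows_for_amp(1000, 4, 4))
```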

    Multi-criteria optimization for energy-efficient multi-core systems-on-chip

    The steady down-scaling of transistor dimensions has made possible the evolutionary progress leading to today's high-performance multi-GHz microprocessors and core-based Systems-on-Chip (SoCs), which offer superior performance, dramatically reduced cost per function, and much-reduced physical size compared to their predecessors. On the negative side, this rapid scaling also translates into high power densities, higher operating temperatures, and reduced reliability, making it imperative to address the design issues that have cropped up in its wake. In particular, aggressive physical miniaturization has increased CMOS fault sensitivity to the extent that many reliability constraints pose a threat to normal device operation and accelerate the onset of wearout-based failures. Among the various wearout-based failure mechanisms, negative bias temperature instability (NBTI) has been recognized as the most critical source of device aging. The need for reliable, low-power circuits is driving the EDA community to develop new design techniques, circuit solutions, algorithms, and software that can address these critical issues. Unfortunately, this challenge is complicated by the fact that power and reliability are intrinsically conflicting metrics: traditional solutions to improve reliability, such as redundancy, increased voltage levels, and up-sizing of critical devices, conflict with traditional low-power solutions, which rely on compact architectures, scaled supply voltages, and small devices. This dissertation focuses on methodologies to bridge this gap and establishes an important link between low-power solutions and aging effects.
    More specifically, we propose new architectural solutions based on power management strategies to enable the design of low-power, aging-aware cache memories. Cache memories are one of the most critical components for ensuring reliable and timely operation, yet they are also highly susceptible to aging effects. Due to the symmetric structure of a memory cell, aging occurs regardless of whether a cell (or word) is accessed. Moreover, aging is a worst-case metric, and the line with the worst-case access pattern determines the aging of the entire cache. To stop the aging of a memory cell, it must be put into a proper idle state when it is not accessed, which requires proper management of the idleness of each atomic unit of power management. We propose several reliability management techniques based on the idea of cache partitioning to alleviate NBTI-induced aging and obtain joint energy and lifetime benefits. We introduce a graceful degradation mechanism that allows the different cache blocks into which a cache is partitioned to age at different rates. This implies that the various sub-blocks become unreliable at different times, while the cache keeps functioning with reduced efficiency. We extend the capabilities of this architecture by integrating the concept of reconfigurable caches to maintain the performance of the cache throughout its lifetime: whenever a block becomes unreliable, the remaining cache is reconfigured to work as a smaller cache with only a marginal degradation of performance.
    Many mission-critical applications require a guaranteed lifetime for their operation, and therefore for the hardware implementing their functionality. Such constraints are usually enforced by means of reliability-enhancing solutions, mostly based on redundancy, which are not energy-friendly. In our work, we propose a novel cache architecture in which a smart use of cache partitions for redundancy yields caches that meet a desired lifetime target with minimal energy consumption.
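    A toy model of the graceful degradation idea described above. Partition counts, stress rates, and the reliability threshold are invented for illustration; the dissertation's actual aging model and power-management policy are more detailed.

```python
class DegradingCache:
    """Partitioned cache whose partitions age and are retired over time.

    Powered partitions accumulate NBTI-like stress faster than idle
    (power-gated) ones; once a partition crosses the assumed reliability
    threshold it is retired and the cache keeps working at reduced size.
    """

    def __init__(self, n_partitions=4, lines_per_partition=256,
                 stress_limit=1.0):
        self.partitions = [{"lines": lines_per_partition, "stress": 0.0}
                           for _ in range(n_partitions)]
        self.stress_limit = stress_limit

    def tick(self, active_fraction=1.0, rate=1e-3):
        """Advance one epoch; idle partitions age ten times slower."""
        for i, p in enumerate(self.partitions):
            idle = i >= active_fraction * len(self.partitions)
            p["stress"] += rate * (0.1 if idle else 1.0)
        # retire partitions that have become unreliable
        self.partitions = [p for p in self.partitions
                           if p["stress"] < self.stress_limit]

    def capacity(self):
        return sum(p["lines"] for p in self.partitions)

cache = DegradingCache()
for _ in range(1200):
    cache.tick(active_fraction=0.5)
print(cache.capacity())   # usable lines left after the aged blocks retire
```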

    Parallelization of the H.261 video coding algorithm on the IBM SP2® multiprocessor system

    In this paper, the parallelization of the H.261 video coding algorithm on the IBM SP2 multiprocessor system is described. Using domain decomposition as a framework, data partitioning, data dependencies, and communication issues are carefully assessed. From these, two parallel algorithms were developed: the first maximizes processor utilization, and the second minimizes communication. Our analysis shows that the first algorithm exhibits poor scalability and high communication overhead, while the second exhibits good scalability and low communication overhead. A best median speedup of 13.72, or 11 frames/sec, was achieved on 24 processors.
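    A small sketch of the domain-decomposition step, dealing the macroblock rows of a CIF frame out to processors. The stripe-shaped partition is an assumption for illustration; the paper evaluates its own partitionings and communication schemes.

```python
def partition_macroblocks(width=352, height=288, n_procs=24, mb=16):
    """Assign macroblock rows of an H.261 CIF frame to processors.

    Returns (processor, macroblock-row indices, macroblocks per row) for
    each processor, spreading the macroblock rows as evenly as possible.
    """
    mb_rows, mb_cols = height // mb, width // mb
    base, extra = divmod(mb_rows, n_procs)
    assignment, row = [], 0
    for p in range(n_procs):
        rows = base + (1 if p < extra else 0)
        assignment.append((p, list(range(row, row + rows)), mb_cols))
        row += rows
    return assignment

# with 24 processors but only 18 macroblock rows in a CIF frame, some
# processors receive no rows under this naive striping
print(partition_macroblocks()[:4])
```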

    SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation

    Graph neural networks (GNNs) have emerged as a powerful tool to process graph-based data in fields like communication networks, molecular interactions, chemistry, social networks, and neuroscience. GNNs are characterized by the ultra-sparse nature of their adjacency matrix that necessitates the development of dedicated hardware beyond general-purpose sparse matrix multipliers. While there has been extensive research on designing dedicated hardware accelerators for GNNs, few have extensively explored the impact of the sparse storage format on the efficiency of the GNN accelerators. This paper proposes SCV-GNN with the novel sparse compressed vectors (SCV) format optimized for the aggregation operation. We use Z-Morton ordering to derive a data-locality-based computation ordering and partitioning scheme. The paper also presents how the proposed SCV-GNN is scalable on a vector processing system. Experimental results over various datasets show that the proposed method achieves a geometric mean speedup of 7.96× and 7.04× over CSC and CSR aggregation operations, respectively. The proposed method also reduces the memory traffic by a factor of 3.29× and 4.37× over compressed sparse column (CSC) and compressed sparse row (CSR), respectively. Thus, the proposed novel aggregation format reduces the latency and memory access for GNN inference.
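    A brief sketch of the Z-Morton (Z-order) idea mentioned above: interleaving row and column bits gives a key whose sort order groups nonzeros that are close in both dimensions. The full SCV format in the paper involves more than this ordering; the snippet only illustrates the locality principle.

```python
def morton_key(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((col >> b) & 1) << (2 * b)       # column bits in even positions
        key |= ((row >> b) & 1) << (2 * b + 1)   # row bits in odd positions
    return key

# order COO nonzeros (row, col, value) along the Z-curve before aggregation
nonzeros = [(3, 1, 1.0), (0, 2, 1.0), (1, 0, 1.0), (0, 0, 1.0)]
nonzeros.sort(key=lambda e: morton_key(e[0], e[1]))
```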

    GPGPU: Hardware/Software Co-Design for the Masses

    With the recent development of high-performance graphics processing units (GPUs) capable of performing general-purpose computation (GPGPU: general-purpose computation on the GPU), a new platform is emerging. It consists of a central processing unit (CPU), which is very fast in sequential execution, and a GPU, which exhibits a high degree of parallelism and thus very high performance on certain types of computations. Optimally leveraging the advantages of this platform is challenging in practice. We spotlight the analogy between GPGPU and hardware/software co-design (HSCD), a more mature design paradigm, to derive a design process for GPGPU. This process, with appropriate tool support and automation, will ease GPGPU design significantly. Identifying the challenges associated with establishing this process can serve as a roadmap for the future development of the GPGPU field.

    Hardware Resource Allocation for Hardware/Software Partitioning in the LYCOS System

    This paper presents a novel hardware resource allocation technique for hardware/software partitioning. It allocates hardware resources to the hardware data-path using information such as data dependencies between operations in the application and profiling information. The algorithm is useful as a designer's/design tool's aid to generate good hardware allocations for use in hardware/software partitioning. The algorithm has been implemented in a tool under the LYCOS system [9]. The results show that the allocations produced by the algorithm come close to the best allocations obtained by exhaustive search.
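    A hedged illustration of profile-driven allocation in this spirit: functional units are added greedily where the profiled-work-per-area payoff is highest until an area budget is exhausted. The cost model, the gain heuristic, and the numbers are invented for the example; this is not the LYCOS allocation algorithm itself.

```python
def allocate_units(profile, areas, area_budget):
    """Greedily allocate functional units under an area budget.

    profile: operation type -> profiled execution count
    areas:   operation type -> area cost of one unit of that type
    Starts with one unit per operation type, then repeatedly adds the
    unit with the best marginal (profiled work enabled) / area ratio.
    """
    alloc = {op: 1 for op in profile}               # one unit per op type
    used = sum(areas[op] for op in alloc)
    while True:
        best, best_gain = None, 0.0
        for op in profile:
            if used + areas[op] > area_budget:
                continue
            # rough marginal benefit of one more unit of this type
            gain = profile[op] / (alloc[op] * (alloc[op] + 1)) / areas[op]
            if gain > best_gain:
                best, best_gain = op, gain
        if best is None:
            return alloc
        alloc[best] += 1
        used += areas[best]

# e.g. multiplies dominate the profile but a multiplier costs more area
print(allocate_units({"mul": 900, "add": 300}, {"mul": 4, "add": 1}, 12))
```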

    Cell Transformations and Physical Design Techniques for 3D Monolithic Integrated Circuits

    3D monolithic integration (3DMI), also termed sequential integration, is a potential technology for future gigascale circuits. In 3DMI technology, the 3D contacts connecting different active layers are on the order of a few hundred nanometers. Given the advantage of such small contacts, 3DMI enables fine-grain (gate-level) partitioning of circuits. In this work, we present three cell transformation techniques for standard-cell-based ICs with 3DMI technology. As a major contribution of this work, we propose a design flow comprising a cell transformation technique, cell-on-cell stacking, and a physical design technique (CELONCELPD) aimed at placing cells transformed with cell-on-cell stacking. We analyze and compare various cell transformation techniques for 3DMI technology without disrupting the regularity of the IC design flow. Our experiments demonstrate the effectiveness of the CELONCEL design technique, yielding an area reduction of 37.5%, a 16.2% average reduction in wirelength, and a 6.2% average improvement in overall delay compared with the 2D case, when benchmarked across various designs at the 45 nm technology node.