Parallel image restoration
In this thesis, we are concerned with the image restoration problem, which has been formulated in the literature as a system of linear inequalities. With this formulation, the resulting constraint matrix is an unstructured sparse matrix, and even with small images we end up with huge matrices. To solve the restoration problem, we therefore use surrogate constraint methods, which work efficiently for large problems and are amenable to parallel implementation.
Among the surrogate constraint methods, the basic method considers all
of the violated constraints in the system and performs a single block projection
in each step. The parallel method, on the other hand, considers a subset of the constraints and makes simultaneous block projections. Using several partitioning
strategies and adopting different communication models, we have realized several parallel implementations of the two methods. We use hypergraph-partitioning-based decomposition methods to minimize communication costs while ensuring load balance among the processors. The implementations are evaluated in terms of both per-iteration and overall performance. In addition, the effects of different partitioning strategies on the speed of
convergence are investigated. The experimental results reveal that the proposed
parallelization schemes have practical usage in the restoration problem and in
many other real-world applications which can be modeled as a system of linear
inequalities.
Malas, Tahir (M.S. thesis)
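For readers unfamiliar with the method, the following is a minimal sketch of the basic surrogate constraint step for a system Ax <= b: all violated constraints are combined with positive weights into a single surrogate constraint, and the current iterate is projected onto the surrogate hyperplane. The uniform weights, relaxation parameter, and toy data are illustrative assumptions, not the thesis's exact parametrization or its parallel variants.

```python
import numpy as np

def surrogate_constraint_step(A, b, x, relaxation=1.0):
    """One iteration of the basic surrogate constraint method for Ax <= b.

    The violated constraints are combined with uniform positive weights into a
    single surrogate constraint, and x is projected onto its hyperplane."""
    residual = A @ x - b
    violated = residual > 0                 # indices of violated constraints
    if not np.any(violated):
        return x, True                      # x is already feasible
    Av, rv = A[violated], residual[violated]
    w = np.full(Av.shape[0], 1.0 / Av.shape[0])
    surrogate_normal = Av.T @ w             # normal of the surrogate hyperplane
    step = (w @ rv) / (surrogate_normal @ surrogate_normal)
    return x - relaxation * step * surrogate_normal, False

# Tiny illustrative system (hypothetical data, not an image-restoration instance).
A = np.array([[1.0, 2.0], [-1.0, 1.0], [3.0, -1.0]])
b = np.array([2.0, 1.0, 3.0])
x = np.array([5.0, 5.0])                    # infeasible starting point
for _ in range(100):
    x, feasible = surrogate_constraint_step(A, b, x)
    if feasible:
        break
print(x, "max violation:", np.max(A @ x - b))
```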
Web-site-based partitioning techniques for efficient parallelization of the PageRank computation
Web search engines use ranking techniques to order Web pages in query results.
PageRank is an important technique, which orders Web pages according to the
linkage structure of the Web. The efficiency of the PageRank computation is important
since the constantly evolving nature of the Web requires this computation
to be repeated many times. The PageRank computation consists of repeated sparse matrix-vector multiplications. Due to the enormous size of the Web matrix
to be multiplied, PageRank computations are usually carried out on parallel
systems. However, efficiently parallelizing PageRank is not an easy task, because
of the irregular sparsity pattern of the Web matrix. Graph- and hypergraph-partitioning-based techniques are widely used for efficiently parallelizing matrix-vector multiplications. Recently, a hypergraph-partitioning-based decomposition technique for fast parallel computation of PageRank was proposed. This technique aims to minimize the communication overhead of the parallel matrix-vector multiplication. However, it has a high preprocessing time,
which makes the technique impractical. In this work, we propose 1D (rowwise
and columnwise) and 2D (fine-grain and checkerboard) decomposition models
using web-site-based graph- and hypergraph-partitioning techniques. The proposed models minimize the communication overhead of the parallel PageRank computations with a reasonable preprocessing time. The models encapsulate not only the matrix-vector multiplication but also the overall iterative algorithm. The
experiments show that the proposed models achieve fast PageRank computation
with low preprocessing time, compared with those in the literature.
Cevahir, Ali (M.S. thesis)
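As context, the sequential kernel that the parallel models distribute is the PageRank power iteration, whose dominant cost is a sparse matrix-vector multiplication per step. The sketch below shows only that kernel; the web-site-based partitioning and communication schemes of the thesis are not reproduced, and the damping factor, tolerance, and toy graph are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags

def pagerank(adj, alpha=0.85, tol=1e-10, max_iter=100):
    """Sequential power-iteration sketch. `adj[i, j] = 1` encodes a link i -> j."""
    n = adj.shape[0]
    out_deg = np.asarray(adj.sum(axis=1)).ravel()
    inv_deg = np.divide(1.0, out_deg, out=np.zeros(n), where=out_deg > 0)
    P = diags(inv_deg) @ adj                 # row-stochastic transition matrix
    dangling = out_deg == 0                  # pages with no outgoing links
    x = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # Repeated sparse matrix-vector multiplication: the dominant cost.
        x_new = alpha * (P.T @ x) + (alpha * x[dangling].sum() + 1.0 - alpha) / n
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

# Toy 4-page Web graph (hypothetical): links 0->1, 0->2, 1->2, 2->0, 3->2.
adj = csr_matrix((np.ones(5), ([0, 0, 1, 2, 3], [1, 2, 2, 0, 2])), shape=(4, 4))
print(pagerank(adj))
```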
Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors
Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high-performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type.
Our results on an ARM big.LITTLE processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency.
The researchers from Universidad Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2015-65277-R. The researcher from Universitat Politècnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. d'Innovació, Universitats i Empresa.
Catalán, S.; Herrero, J. R.; Igual Peña, F. D.; Rodríguez-Sánchez, R.; Quintana Ortí, E. S.; Adeniyi-Jones, C. (2018). Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors. Journal of Computational Science, 25:140-151. https://doi.org/10.1016/j.jocs.2016.10.020
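As an illustration of the asymmetry-aware idea described above, the sketch below splits the iteration space of a GEMM-like routine between the big and LITTLE clusters in proportion to their aggregate throughput. The per-core GFLOPS figures are hypothetical placeholders, and the actual BLIS-based library distributes work dynamically rather than with this static split; the sketch only conveys the proportional-partitioning intuition.

```python
def split_columns(n_cols, big_cores, little_cores, big_flops, little_flops):
    """Return (cols_for_big, cols_for_little): the n-dimension of a GEMM-like
    loop is divided in proportion to the aggregate throughput of each cluster."""
    big_cap = big_cores * big_flops
    little_cap = little_cores * little_flops
    cols_big = round(n_cols * big_cap / (big_cap + little_cap))
    return cols_big, n_cols - cols_big

# Exynos 5422: 4 Cortex-A15 ("big") and 4 Cortex-A7 ("LITTLE") cores.
# The per-core GFLOPS values below are placeholders for illustration only.
cols_big, cols_little = split_columns(4096, 4, 4, big_flops=3.0, little_flops=1.0)
print(cols_big, cols_little)   # 3072 columns to the big cluster, 1024 to LITTLE
```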
Multi-criteria optimization for energy-efficient multi-core systems-on-chip
The steady down-scaling of transistor dimensions has made possible the evolutionary progress leading to today's high-performance multi-GHz microprocessors and core-based Systems-on-Chip (SoCs) that offer superior performance, dramatically reduced cost per function, and much-reduced physical size compared to their predecessors. On the negative side, this rapid scaling also translates into high power densities, higher operating temperatures, and reduced reliability, making it imperative to address the design issues that have cropped up in its wake. In particular, aggressive physical miniaturization has increased CMOS fault sensitivity to the extent that many reliability constraints pose a threat to normal device operation and accelerate the onset of wearout-based failures. Among the various wearout-based failure mechanisms, negative bias temperature instability (NBTI) has been recognized as the most critical source of device aging.
The need for reliable, low-power circuits is driving the EDA community to develop new design techniques, circuit solutions, algorithms, and software that can address these critical issues. Unfortunately, this challenge is complicated by the fact that power and reliability are known to be intrinsically conflicting metrics: traditional solutions to improve reliability, such as redundancy, increased voltage levels, and up-sizing of critical devices, contrast with traditional low-power solutions, which rely on compact architectures, scaled supply voltages, and small devices.
This dissertation focuses on methodologies to bridge this gap and establishes an important link between low-power solutions and aging effects. More specifically, we propose new architectural solutions based on power management strategies to enable the design of low-power, aging-aware cache memories.
Cache memories are one of the most critical components for guaranteeing reliable and timely operation. However, they are also among the components most susceptible to aging effects. Due to the symmetric structure of a memory cell, aging occurs regardless of whether a cell (or word) is accessed or not. Moreover, aging is a worst-case metric, and the line with the worst-case access pattern determines the aging of the entire cache. In order to stop the aging of a memory cell, it must be put into a proper idle state when it is not accessed, which requires proper management of the idleness of each atomic unit of power management.
We have proposed several reliability management techniques based on the idea of cache partitioning to alleviate NBTI-induced aging and obtain joint energy and lifetime benefits. We introduce a graceful degradation mechanism that allows the different cache blocks into which a cache is partitioned to age at different rates. This implies that the various sub-blocks become unreliable at different times, while the cache keeps functioning with reduced efficiency. We extended the capabilities of this architecture by integrating the concept of reconfigurable caches to maintain the performance of the cache throughout its lifetime. With this strategy, whenever a block becomes unreliable, the remaining cache is reconfigured to work as a smaller cache with only a marginal degradation in performance.
Many mission-critical applications require a guaranteed lifetime for their operation, and therefore for the hardware implementing their functionality. Such constraints are usually enforced by means of various reliability-enhancing solutions, mostly based on redundancy, which are not energy-friendly. In our work, we have proposed a novel cache architecture in which a smart use of cache partitions for redundancy allows us to obtain caches that meet a desired lifetime target with minimal energy consumption.
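A toy model of the graceful-degradation idea is sketched below: the cache is split into partitions that can be retired as they age, after which lookups are remapped to the remaining partitions so the cache keeps operating at reduced capacity. The partition count, set sizes, address mapping, and fill policy are hypothetical simplifications, not the proposed architecture.

```python
class GracefullyDegradingCache:
    """Toy model: equally sized partitions may become unreliable at different
    times; accesses are then remapped to the still-reliable partitions, so the
    cache works as a smaller cache instead of failing outright."""

    def __init__(self, num_partitions=4, sets_per_partition=256):
        self.sets_per_partition = sets_per_partition
        self.partitions = [dict() for _ in range(num_partitions)]  # set index -> tag
        self.alive = list(range(num_partitions))                   # reliable partitions

    def retire_partition(self, idx):
        """Mark a partition unreliable; its contents are simply dropped."""
        self.alive.remove(idx)

    def access(self, address):
        """Return True on hit, False on miss (with fill). Only reliable
        partitions participate in the address mapping."""
        set_idx = address % self.sets_per_partition
        part = self.alive[(address // self.sets_per_partition) % len(self.alive)]
        tag = address // (self.sets_per_partition * len(self.alive))
        if self.partitions[part].get(set_idx) == tag:
            return True
        self.partitions[part][set_idx] = tag   # fill on miss
        return False

cache = GracefullyDegradingCache()
cache.access(0x1234)          # cold miss, line is filled
print(cache.access(0x1234))   # True: hit
cache.retire_partition(0)     # one block ages out; the cache shrinks but keeps working
print(cache.access(0x1234))   # miss after reconfiguration, then refilled
```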
Parallelization of the H.261 video coding algorithm on the IBM SP2(R) multiprocessor system
In this paper, the parallelization of the H.261 video coding algorithm on the IBM SP2 multiprocessor system is described. Using domain decomposition as a framework, data partitioning, data dependencies, and communication issues are carefully assessed. From these, two parallel algorithms were developed: the first maximizes processor utilization and the second minimizes communication. Our analysis shows that the first algorithm exhibits poor scalability and high communication overhead, while the second exhibits good scalability and low communication overhead. A best median speedup of 13.72, or 11 frames/sec, was achieved on 24 processors.
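The sketch below illustrates the domain-decomposition framework mentioned above by dividing a frame's macroblocks as evenly as possible among processors. It is not the paper's exact partitioning or scheduling of the H.261 pipeline, and the CIF frame size is an assumption.

```python
def partition_macroblocks(width, height, num_procs, mb_size=16):
    """Assign contiguous ranges of macroblock indices to processors as evenly
    as possible (a simple static domain decomposition of one frame)."""
    total_mbs = (width // mb_size) * (height // mb_size)
    base, extra = divmod(total_mbs, num_procs)
    shares, start = [], 0
    for p in range(num_procs):
        count = base + (1 if p < extra else 0)
        shares.append(range(start, start + count))   # macroblock indices for processor p
        start += count
    return shares

# A CIF frame (352x288 pixels = 396 macroblocks) split across 24 processors.
shares = partition_macroblocks(352, 288, 24)
print([len(s) for s in shares])   # 12 processors receive 17 macroblocks, 12 receive 16
```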
SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation
Graph neural networks (GNNs) have emerged as a powerful tool to process
graph-based data in fields like communication networks, molecular interactions,
chemistry, social networks, and neuroscience. GNNs are characterized by the
ultra-sparse nature of their adjacency matrix that necessitates the development
of dedicated hardware beyond general-purpose sparse matrix multipliers. While
there has been extensive research on designing dedicated hardware accelerators
for GNNs, few have extensively explored the impact of the sparse storage format
on the efficiency of the GNN accelerators. This paper proposes SCV-GNN with the
novel sparse compressed vectors (SCV) format optimized for the aggregation
operation. We use Z-Morton ordering to derive a data-locality-based computation
ordering and partitioning scheme. The paper also presents how the proposed
SCV-GNN is scalable on a vector processing system. Experimental results over
various datasets show that the proposed method achieves a geometric mean
speedup of and over CSC and CSR aggregation
operations, respectively. The proposed method also reduces the memory traffic
by a factor of and over compressed sparse column
(CSC) and compressed sparse row (CSR), respectively. Thus, the proposed novel
aggregation format reduces the latency and memory accesses of GNN inference.
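For reference, the CSR baseline that the reported results compare against can be written as a sparse matrix product: sum-aggregation gathers each node's in-neighbor features through the adjacency matrix. The sketch below shows only this baseline; the proposed SCV format and its Z-Morton-based computation ordering are not reproduced, and the toy graph and feature matrix are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

def aggregate_csr(adj, features):
    """Baseline GNN sum-aggregation over a CSR adjacency matrix:
    out[v] = sum of the feature vectors of v's in-neighbors."""
    return adj @ features

# Toy graph (hypothetical): 4 nodes, edges 0->1, 0->2, 1->2, 3->2, feature dim 3.
rows, cols = [1, 2, 2, 2], [0, 0, 1, 3]      # adj[v, u] = 1 if there is an edge u -> v
adj = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))
features = np.arange(12, dtype=float).reshape(4, 3)
print(aggregate_csr(adj, features))          # node 2 sums the features of nodes 0, 1, 3
```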
GPGPU: Hardware/Software Co-Design for the Masses
With the recent development of high-performance graphical processing units (GPUs) capable of performing general-purpose computation (GPGPU: general-purpose computation on the GPU), a new platform is emerging. It consists of a central processing unit (CPU), which is very fast in sequential execution, and a GPU, which exhibits a high degree of parallelism and thus very high performance on certain types of computations. Optimally leveraging the advantages of this platform is challenging in practice. We spotlight the analogy between GPGPU and hardware/software co-design (HSCD), a more mature design paradigm, to derive a design process for GPGPU. This process, with appropriate tool support and automation, will ease GPGPU design significantly. Identifying the challenges associated with establishing this process can serve as a roadmap for the future development of the GPGPU field.
Hardware Resource Allocation for Hardware/Software Partitioning in the LYCOS System
This paper presents a novel hardware resource allocation technique for hardware/software partitioning. It allocates hardware resources to the hardware data-path using information such as data dependencies between operations in the application, and profiling information. The algorithm is useful as a designer's/design tool's aid to generate good hardware allocations for use in hardware/software partitioning. The algorithm has been implemented in a tool under the LYCOS system [9]. The results show that the allocations produced by the algorithm come close to the best allocations obtained by exhaustive search.
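To convey the flavor of profile-driven allocation, the sketch below greedily grants functional-unit instances to operation types in order of execution-count-per-area benefit until an area budget is exhausted. This only illustrates the general idea; it is not the LYCOS algorithm, whose cost model and data-dependence analysis are more involved, and the profile and area numbers are hypothetical.

```python
def allocate_units(op_profile, unit_area, area_budget):
    """Greedy profile-driven allocation: start with one unit per used operation
    type, then repeatedly add the unit with the best count-per-area ratio."""
    allocation = {op: 1 for op in op_profile}                    # one unit per used op type
    area_left = area_budget - sum(unit_area[op] for op in allocation)
    while True:
        candidates = [(op_profile[op] / (unit_area[op] * (allocation[op] + 1)), op)
                      for op in op_profile if unit_area[op] <= area_left]
        if not candidates:
            return allocation
        _, best = max(candidates)
        allocation[best] += 1
        area_left -= unit_area[best]

# Hypothetical operation profile (execution counts) and per-unit areas.
profile = {"mul": 1200, "add": 900, "cmp": 150}
areas = {"mul": 8.0, "add": 2.0, "cmp": 1.0}
print(allocate_units(profile, areas, area_budget=25.0))
```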
Cell Transformations and Physical Design Techniques for 3D Monolithic Integrated Circuits
3D monolithic integration (3DMI), also termed sequential integration, is a potential technology for future gigascale circuits. In 3DMI technology, the 3D contacts connecting different active layers are on the order of a few hundred nanometers. Given the advantage of such small contacts, 3DMI enables fine-grain (gate-level) partitioning of circuits. In this work we present three cell transformation techniques for standard-cell-based ICs with 3DMI technology. As a major contribution of this work, we propose a design flow comprising a cell transformation technique, cell-on-cell stacking, and a physical design technique (CELONCELPD) aimed at placing cells transformed with cell-on-cell stacking. We analyze and compare various cell transformation techniques for 3DMI technology without disrupting the regularity of the IC design flow. Our experiments demonstrate the effectiveness of the CELONCEL design technique, yielding an area reduction of 37.5%, a 16.2% average reduction in wirelength, and a 6.2% average improvement in overall delay compared with the 2D case, when benchmarked across various designs in the 45 nm technology node.