
    High performance numerical modeling of ultra-short laser pulse propagation based on multithreaded parallel hardware

    The focus of this study is the development of parallelised versions of severely sequential and iterative numerical algorithms on a multi-threaded parallel platform such as a graphics processing unit. This requires the design and development of a platform-specific numerical solution that can benefit from the parallel capabilities of the chosen platform. A graphics processing unit was chosen as the parallel platform for the design and development of a numerical solution for a specific physical model in non-linear optics. This problem appears in describing ultra-short pulse propagation in bulk transparent media, which has recently been the subject of several theoretical and numerical studies. The mathematical model describing this phenomenon is a challenging and complex problem, and its numerical modeling is limited on current workstations. Numerical modeling of this problem requires the parallelisation of essentially serial algorithms and the elimination of numerical bottlenecks. The main challenge to overcome is the parallelisation of the globally non-local mathematical model. This thesis presents a numerical solution for the elimination of the numerical bottleneck associated with the non-local nature of the mathematical model. The accuracy and performance of the parallel code are established by back-to-back testing against a similar serial version.

    A Practical Hierarchical Model of Parallel Computation: The Model

    We introduce a model of parallel computation that retains the ideal properties of the PRAM by using it as a sub-model, while simultaneously being more reflective of realistic parallel architectures by accounting for and providing abstract control over communication and synchronization costs. The Hierarchical PRAM (H-PRAM) model controls conceptual complexity in the face of asynchrony in two ways. First, it provides the simplifying assumption of synchronization to the design of algorithms while allowing the algorithms to work asynchronously with each other, and it organizes this control asynchrony via an implicit hierarchy relation. Second, it allows communication asynchrony to be restricted in order to obtain determinate algorithms (thus greatly simplifying proofs of correctness). It is shown that the model is reflective of a variety of existing and proposed parallel architectures, particularly ones that can support massive parallelism. Relationships to programming languages are discussed. Since the PRAM is a sub-model, we can use PRAM algorithms as sub-algorithms in algorithms for the H-PRAM; thus results that have been established with respect to the PRAM are potentially transferable to this new model. The H-PRAM can be used as a flexible tool to investigate general degrees of locality (“neighborhoods of activity”) in problems, considering communication and synchronization simultaneously. This gives the potential of obtaining algorithms that map more efficiently to architectures, and of increasing the number of processors that can efficiently be used on a problem (in comparison to a PRAM that charges for communication and synchronization). The model presents a framework in which to study the extent to which general locality can be exploited in parallel computing. A companion paper demonstrates the usage of the H-PRAM via the design and analysis of various algorithms for computing the complete binary tree and the FFT/butterfly graph.
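As a rough illustration of the kind of complete-binary-tree computation mentioned in the abstract above, the sketch below simulates a pairwise sum in synchronous rounds, in the style of a plain PRAM algorithm. It is not taken from the paper and does not model the H-PRAM hierarchy or its communication costs; the function name `tree_sum_rounds` is hypothetical.

```python
def tree_sum_rounds(values):
    """Sum `values` by pairwise combination and count the synchronous rounds.

    Each pass of the while loop stands in for one synchronous step in which
    every pair would be combined by its own processor (simulated serially).
    """
    if not values:
        raise ValueError("values must be non-empty")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Combine neighbours (0,1), (2,3), ...; these additions are independent.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired last element is carried up to the next level unchanged.
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_sum_rounds([1, 2, 3, 4, 5, 6, 7, 8])
print(total, rounds)  # 36 3
```

With eight inputs the tree has three levels, so the sum completes in three rounds instead of seven sequential additions.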

    The development of a multi-layer architecture for image processing

    The extraction of useful information from an image involves a series of operations, which can be functionally divided into low-level, intermediate-level and high-level processing. Because different amounts of computing power may be demanded by each level, a system which can simultaneously carry out operations at different levels is desirable. A multi-layer system which embodies both functional and spatial parallelism is envisioned. This thesis describes the development of a three-layer architecture which is designed to tackle vision problems embodying operations in each processing level. A survey of various multi-layer and multi-processor systems is carried out and a set of guidelines for the design of a multi-layer image processing system is established. The linear array is proposed as a possible basis for multi-layer systems and a significant part of the thesis is concerned with a study of this structure. The CLIP7A system, which is a linear array with 256 processing elements, is examined in depth. The CLIP7A system operates under SIMD control, enhanced by local autonomy. In order to examine the possible benefits of this arrangement, image processing algorithms which exploit the autonomous functions are implemented. Additionally, the structural properties of linear arrays are studied. Information regarding typical computing requirements in each layer and the communication networks between elements in different layers is obtained by applying the CLIP7A system to solve an integrated vision problem. From the results obtained, a three-layer architecture is proposed. The system has 256, 16 and 4 processing elements in the low-, intermediate- and high-level layers, respectively. Each processing element will employ a 16-bit microprocessor, selected from off-the-shelf components, as its computing unit. Communication between elements in consecutive layers is via two different networks, which are designed so that efficient data transfer is achieved. Additionally, the networks enable the system to maintain fault tolerance and to permit expansion in the second and third layers.

    Parallelization and Optimization of Iterative Solvers on High Performance Architectures

    The main objective of this thesis is to develop an optimal sparse matrix storage format and implement efficient computing kernels that accelerate the execution of the sparse matrix-vector (SpMV) product on modern computer architectures. The SpMV product is an essential building block for a myriad of numerical application codes, especially iterative solvers and numerical simulators. Improving the performance of the SpMV product is of special interest to researchers, because it is the major bottleneck in codes that require it. Optimizing this product on modern computer architectures requires knowledge of parallel programming paradigms, efficient parallel algorithms and a basic idea of the device architecture being targeted.
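To make the SpMV product concrete, here is a minimal serial sketch using the widely used compressed sparse row (CSR) layout. The abstract does not say which storage format the thesis develops, so CSR is only an assumed baseline, and the function name `spmv_csr` is hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : non-zero entries of A, stored row by row
    col_idx : column index of each entry in `values`
    row_ptr : row i occupies values[row_ptr[i]:row_ptr[i + 1]]
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Only the stored non-zeros of row i contribute to y[i].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
print(spmv_csr(values, col_idx, row_ptr, x))  # [7.0, 6.0, 17.0]
```

The rows are independent of one another, which is what parallel kernels typically exploit; the indirect access `x[col_idx[k]]` is irregular, which is one common reason SpMV tends to be limited by memory traffic.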

    Highly parallel sparse Cholesky factorization

    Several fine-grained parallel algorithms were developed and compared to compute the Cholesky factorization of a sparse matrix. The experimental implementations are on the Connection Machine, a distributed-memory SIMD machine whose programming model conceptually supplies one processor per data element. In contrast to special-purpose algorithms in which the matrix structure conforms to the connection structure of the machine, the focus is on matrices with arbitrary sparsity structure. The most promising algorithm is one whose inner loop performs several dense factorizations simultaneously on a 2-D grid of processors. Virtually any massively parallel dense factorization algorithm can be used as the key subroutine. The sparse code attains execution rates comparable to those of the dense subroutine. Although at present architectural limitations prevent the dense factorization from realizing its potential efficiency, it is concluded that a regular data-parallel architecture can be used efficiently to solve arbitrarily structured sparse problems. A performance model is also presented and is used to analyze the algorithms.
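For reference, the dense factorization that the abstract treats as the key subroutine can be sketched serially as below. This is a textbook column-by-column Cholesky, not the Connection Machine algorithm from the paper; the function name `cholesky` is hypothetical.

```python
import math


def cholesky(a):
    """Return lower-triangular L with a = L @ L^T for a symmetric
    positive-definite matrix `a` given as a list of rows."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: subtract the squares already placed in row j.
        s = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if s <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(s)
        # Entries below the diagonal in column j are mutually independent.
        for i in range(j + 1, n):
            t = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = t / L[j][j]
    return L


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
print(cholesky(A))
# [[2.0, 0.0, 0.0], [6.0, 1.0, 0.0], [-8.0, 5.0, 3.0]]
```

The inner loop over `i` has no dependencies between iterations, which is the kind of regular work a data-parallel machine can spread across processors.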