757 research outputs found

    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
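    The hashing motif mentioned above can be illustrated with k-mer counting, a core step in genomic profiling. The sketch below is a minimal serial version; in a distributed setting each k-mer would be hashed to an owner process and updated asynchronously. The sequence and k are illustrative:

```python
from collections import Counter

def count_kmers(sequence, k):
    """Count all length-k substrings (k-mers) of a DNA sequence.

    Hash-based k-mer counting is one instance of the hashing motif:
    each distinct k-mer is a key in a hash table whose value is
    incremented as the sequence is scanned.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

counts = count_kmers("ACGTACGTAC", 3)  # each of the 4 distinct 3-mers occurs twice
```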

    Analysis of microstrip antennas by multilevel matrix decomposition algorithm

    Integral equation (IE) methods are widely used in conjunction with Method of Moments (MoM) discretization for the numerical analysis of microstrip antennas. However, their application to large antenna arrays is difficult because the computational requirements increase rapidly with the number of unknowns N. Several techniques have been proposed to reduce the computational cost of IE-MoM. The Multilevel Matrix Decomposition Algorithm (MLMDA) has been implemented in 3D for arbitrary perfectly conducting surfaces discretized with Rao-Wilton-Glisson linear triangle basis functions. This algorithm requires an operation count proportional to N·log²N. Its performance is much better for planar or piecewise planar objects than for general 3D problems, which makes it particularly well suited to the analysis of microstrip antennas. The memory requirements are proportional to N·logN and very low. The main advantage of the MLMDA over other efficient techniques for solving integral equations is that it does not rely on specific mathematical properties of the Green's functions being used. Thus, the method can be applied to interesting configurations governed by special Green's functions, such as multilayered media. In fact, the MDA-MLMDA method can be used on top of any existing MoM code. In this paper we present its application to the analysis of large printed antenna arrays. Peer Reviewed. Postprint (published version).

    A 2D algorithm with asymmetric workload for the UPC conjugate gradient method

    This is a post-peer-review, pre-copyedit version of an article published in the Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-014-1300-0 [Abstract] This paper examines four different strategies, each with its own data distribution, for implementing the parallel conjugate gradient (CG) method, and how they impact communication and overall performance. First, typical 1D and 2D distributions of the matrix involved in CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload, based on leaving some threads idle during part of the computation to reduce communication, is proposed. The four strategies are independent of sparse storage schemes and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms using a set of matrices that exhibit distinct sparsity patterns, demonstrating that our asymmetric proposal outperforms the others except for one matrix on one platform. Ministerio de Economía y Competitividad; TIN2013-42148-P. Xunta de Galicia; GRC2013/055. United States Department of Energy; DEAC03-76SF0009
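    As a serial reference point for the parallel variants discussed above, the CG iteration can be sketched as follows (dense matrix, pure Python). The parallel versions in the paper distribute the matrix-vector product and the dot products across UPC threads; this sketch only shows the numerical skeleton:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive-definite matrix A,
    given as a row-major list of lists. Serial reference version
    of the conjugate gradient iteration."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                     # residual r = b - A x, with x = 0
    p = r[:]                     # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new ** 0.5 < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```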

    Performance Evaluation of Sparse Matrix Products in UPC

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-012-0796-4 [Abstract] Unified Parallel C (UPC) is a Partitioned Global Address Space (PGAS) language whose popularity has increased in recent years owing to its high programmability and reasonable performance through an efficient exploitation of data locality, especially on hierarchical architectures such as multicore clusters. However, the performance issues that arise in this language due to the irregular structure of sparse matrix operations have not yet been studied. Among them, the selection of an adequate storage format for the sparse matrices can significantly improve the efficiency of the parallel codes. This paper presents an evaluation, using UPC, of the most common sparse storage formats with different implementations of the matrix-vector and matrix-matrix products, which are key kernels in many scientific applications. Ministerio de Ciencia e Innovación; TIN2010-16735. Ministerio de Educación; AP2008-01578. Ministerio de Ciencia e Innovación; CAPAP-H3; TIN2010-12011-
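    One of the most common storage formats evaluated in such studies is Compressed Sparse Row (CSR). The sketch below shows the matrix-vector product kernel over the three CSR arrays; the sample matrix is illustrative:

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A x with A in Compressed Sparse
    Row (CSR) format: values holds the nonzeros row by row, col_idx the
    column index of each nonzero, and row_ptr[i]:row_ptr[i+1] delimits
    the nonzeros of row i."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# A = [[5, 0, 0],
#      [0, 8, 3],
#      [0, 0, 6]]
y = csr_matvec([5.0, 8.0, 3.0, 6.0], [0, 1, 2, 2], [0, 1, 3, 4], [1.0, 2.0, 3.0])
```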

    A Flexible NTT-based multiplier for Post-Quantum Cryptography

    In this work, an NTT-based (Number Theoretic Transform) multiplier for code-based Post-Quantum Cryptography (PQC) is presented, supporting Quasi-Cyclic Low/Moderate-Density Parity-Check (QC-LDPC/MDPC) codes. The cyclic matrix product, which is the fundamental operation required in this application, is treated as a polynomial product and adapted to the specific case of the QC-MDPC codes proposed for Rounds 3 and 4 of the National Institute of Standards and Technology (NIST) competition for PQC. The multiplier is a fundamental component in both encryption and decryption, and the proposed solution leads to a flexible NTT-based multiplier that can efficiently handle all types of required products, where the vectors have a length of ≈10⁴ and can be moderately sparse. The proposed architecture is implemented in both Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) technologies and, when compared with the best published results, it features a 10-fold reduction in encryption time with a 3-fold increase in area. The proposed multiplier, incorporated in the encryption and decryption stages of a code-based PQC cryptosystem, leads to an improvement over the best published results of between 3 and 10 times in terms of the LC product (LUT count times latency).
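    The core idea of NTT-based polynomial multiplication can be sketched as follows. This is a generic radix-2 NTT over an illustrative NTT-friendly prime, not the paper's architecture; a QC-MDPC cyclic product would additionally reduce the result modulo x^n − 1:

```python
MOD = 998244353          # illustrative NTT-friendly prime: 119 * 2**23 + 1
ROOT = 3                 # a primitive root of MOD

def ntt(a, invert=False):
    """In-place iterative radix-2 Number Theoretic Transform."""
    n = len(a)
    j = 0
    for i in range(1, n):            # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:               # butterfly passes
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        for i in range(0, n, length):
            w = 1
            for k in range(i, i + length // 2):
                u = a[k]
                v = a[k + length // 2] * w % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD

def poly_mul(f, g):
    """Multiply polynomials f and g (coefficient lists, low order first)
    by transforming, multiplying pointwise, and transforming back."""
    n = 1
    while n < len(f) + len(g) - 1:
        n <<= 1
    fa, fb = f + [0] * (n - len(f)), g + [0] * (n - len(g))
    ntt(fa)
    ntt(fb)
    prod = [x * y % MOD for x, y in zip(fa, fb)]
    ntt(prod, invert=True)
    return prod[:len(f) + len(g) - 1]
```

Pointwise multiplication in the transform domain turns an O(n²) polynomial product into O(n·log n), which is what makes the approach attractive for vectors of length ≈10⁴.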

    Dynamic Multiple Work Stealing Strategy for Flexible Load Balancing

    Lazy task creation is an efficient method of overcoming the overhead of the grain-size problem in parallel computing, and work stealing is an effective load-balancing strategy. In this paper, we present dynamic work stealing strategies within a lazy-task-creation technique for efficient fine-grain task scheduling. The basic idea is to control load-balancing granularity depending on the number of task parents in a stack. The dynamic-length work stealing strategy uses run-time information, namely the load of the victim, to determine the number of tasks that a thief is allowed to steal. We compare it with the bottommost-first work stealing strategy used in StackThreads/MP, and with the fixed-length work stealing strategy, where a thief requests a fixed number of tasks, as well as with other multithreaded frameworks such as Cilk and OpenMP task implementations. The experiments show that the dynamic-length strategy performs well on irregular workloads such as the UTS benchmarks, as well as on regular workloads such as Fibonacci, Strassen's matrix multiplication, FFT, and sparse LU factorization. The dynamic-length strategy works better than the fixed-length strategy because it is more flexible: it can avoid the load imbalance caused by overstealing.
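    The difference between the fixed-length and dynamic-length strategies can be sketched with a per-worker deque. This is an illustrative model, not the StackThreads/MP implementation; the half-of-queue steal size below stands in for the victim-load heuristic:

```python
from collections import deque

class WorkerQueue:
    """Per-worker task deque: the owner pushes and pops at the bottom,
    while thieves steal from the top (the oldest tasks)."""

    def __init__(self, tasks=()):
        self.tasks = deque(tasks)

    def pop_bottom(self):
        """Owner takes its own most recently pushed task."""
        return self.tasks.pop() if self.tasks else None

    def steal_fixed(self, k):
        """Fixed-length strategy: take up to k tasks from the top,
        regardless of how loaded the victim is."""
        return [self.tasks.popleft() for _ in range(min(k, len(self.tasks)))]

    def steal_dynamic(self):
        """Dynamic-length strategy (sketch): size the steal from the
        victim's current load, here half of its queue, so a lightly
        loaded victim is not overstolen."""
        return [self.tasks.popleft() for _ in range(len(self.tasks) // 2)]

victim = WorkerQueue(range(8))
stolen = victim.steal_dynamic()     # thief takes half of the 8 tasks
```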

    Method of moments enhancement technique for the analysis of Sierpinski pre-fractal antennas

    The numerical analysis of highly iterated Sierpinski microstrip patch antennas by the method of moments (MoM) involves many tiny subdomain basis functions, resulting in a very large number of unknowns. The Sierpinski pre-fractal can be defined by an iterated function system (IFS); as a consequence, the geometry has a multilevel structure with many identical subdomains. This property, together with a multilevel matrix decomposition algorithm (MLMDA) implementation in which the MLMDA blocks are equal to the IFS generating shape, is used to reduce the computational cost of the frequency analysis of a Sierpinski-based structure. Peer Reviewed
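    The IFS definition of the Sierpinski geometry can be illustrated with the classic chaos game: three contractions, each halving the plane toward one vertex of a triangle. The vertex coordinates are illustrative:

```python
import random

# Sierpinski triangle IFS: three maps, each moving a point halfway
# toward one vertex of the (illustrative) bounding triangle.
VERTICES = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.866)]

def chaos_game(n_points, seed=0):
    """Approximate the Sierpinski attractor by repeatedly applying a
    randomly chosen IFS map to the current point."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    points = []
    for _ in range(n_points):
        vx, vy = rng.choice(VERTICES)
        x, y = (x + vx) / 2.0, (y + vy) / 2.0
        points.append((x, y))
    return points

pts = chaos_game(1000)
```

Because every generated point is a convex combination of the three vertices, the cloud stays inside the triangle; the self-similarity of this construction is exactly what lets the MLMDA blocks mirror the IFS generating shape.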

    Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments

    Generalized sparse matrix-matrix multiplication (SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. Here we show that SpGEMM also yields efficient algorithms for general sparse-matrix indexing in distributed memory, provided that the underlying SpGEMM implementation is sufficiently flexible and scalable. We demonstrate that our parallel SpGEMM methods, which use two-dimensional block data distributions with serial hypersparse kernels, are indeed highly flexible, scalable, and memory-efficient in the general case. This algorithm is the first to yield increasing speedup on an unbounded number of processors; our experiments show scaling up to thousands of processors in a variety of test scenarios.
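    The serial kernel underlying row-wise SpGEMM can be sketched with Gustavson's algorithm, accumulating each output row in a hash table. The dict-of-dicts representation is an illustrative stand-in for the hypersparse formats used in distributed implementations:

```python
def spgemm(A, B):
    """Row-wise (Gustavson) sparse matrix-matrix product C = A * B,
    with each matrix stored as {row: {col: value}}. For every nonzero
    A[i][k], row k of B is scaled by A[i][k] and accumulated into row i
    of C through a hash-table accumulator."""
    C = {}
    for i, row in A.items():
        acc = {}
        for k, a_ik in row.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1, 1: 2}, 1: {1: 3}}
B = {0: {0: 4}, 1: {0: 5, 1: 6}}
C = spgemm(A, B)   # {0: {0: 14, 1: 12}, 1: {0: 15, 1: 18}}
```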

    Architectures for Code-based Post-Quantum Cryptography

    The abstract is in the attachment.