51 research outputs found

    Exact Sparse Matrix-Vector Multiplication on GPU's and Multicore Architectures

    We propose different implementations of the sparse matrix-dense vector multiplication (SpMV) for finite fields and rings Z/mZ. We take advantage of graphics processing units (GPUs) and multi-core architectures. Our aim is to improve the speed of SpMV in the LinBox library, and hence the speed of its black box algorithms. In addition, we use this and a new parallelization of the sigma-basis algorithm in a parallel block Wiedemann rank implementation over finite fields.
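As a minimal sketch of the operation this abstract refers to, the following computes y = A·x over Z/mZ with A stored in compressed sparse row (CSR) form. The function name `spmv_mod` and the CSR layout are illustrative assumptions, not the LinBox interface. Python integers do not overflow, so a single reduction per row is exact here; a fixed-width implementation would have to reduce before the accumulator can overflow.

```python
def spmv_mod(indptr, indices, data, x, m):
    """Return y = A*x mod m, with A given in CSR form (hypothetical helper)."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0
        # Nonzeros of this row live in data[indptr[row]:indptr[row + 1]].
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        # One delayed reduction per row (exact with Python's big integers).
        y.append(acc % m)
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]] over Z/7Z
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [1, 2, 3, 4, 5]
x = [3, 4, 6]

y = spmv_mod(indptr, indices, data, x, 7)
print(y)  # [1, 5, 0]
```

Each row is independent of the others, which is what makes the product a natural candidate for GPU or multi-core parallelization.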

    Elements of Design for Containers and Solutions in the LinBox Library

    We describe in this paper new design techniques used in the C++ exact linear algebra library LinBox, intended to make the library safer and easier to use, while keeping it generic and efficient. First, we review the new simplified structure for containers, based on our "founding scope allocation" model. We explain design choices and their impact on coding: unification of our matrix classes, a clearer model for matrices and submatrices, etc. Then we present a variation of the "strategy" design pattern that consists of a controller-plugin system: the controller (solution) chooses among plug-ins (algorithms) that always call back the controllers for subtasks. We give examples using the solution mul. Finally, we present a benchmark architecture that serves two purposes: providing the user with easier ways to produce graphs, and creating a framework for automatically tuning the library and supporting regression testing. Comment: 8 pages, 4th International Congress on Mathematical Software, Seoul, Republic of Korea (2014).

    Improving MPI Applications Performance on Multicore Clusters with Rank Reordering

    Modern hardware architectures featuring multicores and a complex memory hierarchy raise challenges that need to be addressed by parallel application programmers. It is therefore tempting to adapt an application's communication pattern to the characteristics of the underlying hardware. The MPI standard features several functions that allow the ranks of MPI processes to be reordered according to a graph attached to a newly created communicator. In this paper, we explain how the MPICH2 implementation of the MPI_Dist_graph_create function was modified to reorder the MPI process ranks to create a match between the application communication pattern and the hardware topology. The experimental results on a multicore cluster show that improvements can be achieved as long as the application communication pattern is expressed by a relevant metric.

    Solution of Large Sparse System of Linear Equations over GF(2) on a Multi Node Multi GPU Platform

    We provide an efficient multi-node, multi-GPU implementation of the Block Wiedemann Algorithm (BWA) to find the solution of a large sparse system of linear equations over GF(2). One of the important applications of solving such systems arises in most integer factorization algorithms, such as the Number Field Sieve. In this paper, we describe how hybrid parallelization can be adapted to speed up the most time-consuming sequence generation stage of BWA. This stage involves generating a sequence of matrix-matrix products and matrix transpose-matrix products where the matrices are very large, highly sparse, and have entries over GF(2). We describe a GPU-accelerated parallel method for the computation of these matrix-matrix products using techniques such as row-wise parallel distribution of the first matrix over a multi-node, multi-GPU platform using MPI and CUDA, and word-wise XORing of rows of the second matrix. We also describe the hybrid parallelization of the matrix transpose-matrix product computation, where we divide both matrices row-wise into equal-sized blocks using MPI. Then, after a GPU-accelerated matrix transpose-matrix product generation, we combine all those blocks using the MPI_BXOR operation in MPI_Reduce to obtain the result. The performance of hybrid parallelization of the sequence generation step on a hybrid cluster using multiple GPUs has been compared with parallelization on multiple MPI processes only. We have used this hybrid parallel sequence generation tool for the benchmarking of an HPC cluster. Detailed timings of the complete solution of number field sieve matrices of RSA-130, RSA-140, and RSA-170 are also compared in this paper using up to 4 NVIDIA V100 GPUs of a DGX station. We obtained a speedup of 2.8 on 4 V100 GPUs compared to 1 GPU.
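The word-wise XOR idea mentioned in this abstract can be sketched as follows, under the assumption that the sparse first matrix is stored as lists of nonzero column indices and each row of the second matrix is packed into a single integer (bit j holds column j). Over GF(2), row i of the product is then the XOR of the packed rows selected by row i of the sparse matrix. The names `gf2_sparse_times_packed`, `a_rows`, and `b_rows` are hypothetical; this is a serial illustration, not the paper's MPI/CUDA implementation.

```python
from functools import reduce
from operator import xor


def gf2_sparse_times_packed(a_rows, b_rows):
    """Compute C = A*B over GF(2).

    a_rows[i] lists the column indices of the nonzeros in row i of A.
    b_rows[k] is row k of B packed into an int (bit j = entry in column j).
    """
    # Addition over GF(2) is XOR, so a whole packed row is added per operation.
    return [reduce(xor, (b_rows[k] for k in cols), 0) for cols in a_rows]


a_rows = [[0, 2], [1], [0, 1, 2]]      # sparse 3x3 matrix A
b_rows = [0b1010, 0b0110, 0b0011]      # 3x4 matrix B, one packed word per row

c_rows = gf2_sparse_times_packed(a_rows, b_rows)
print([format(r, "04b") for r in c_rows])  # ['1001', '0110', '1111']
```

Because the rows of the result are computed independently, the rows of the sparse matrix can be split across workers, which matches the row-wise distribution the abstract describes.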

    Recursion based parallelization of exact dense linear algebra routines for Gaussian elimination

    We present block algorithms and their implementation for the parallelization of sub-cubic Gaussian elimination on shared memory architectures. In contrast to the classical cubic algorithms in parallel numerical linear algebra, we focus here on recursive algorithms and coarse grain parallelization. Indeed, sub-cubic matrix arithmetic can only be achieved through recursive algorithms, making coarse grain block algorithms perform more efficiently than fine grain ones. This work is motivated by the design and implementation of dense linear algebra over a finite field, where fast matrix multiplication is used extensively and where costly modular reductions also advocate for coarse grain block decomposition. We incrementally build efficient kernels, for matrix multiplication first, then triangular system solving, on top of which a recursive PLUQ decomposition algorithm is built. We study the parallelization of these kernels using several algorithmic variants: either iterative or recursive, and using different splitting strategies. Experiments show that recursive adaptive methods for matrix multiplication, hybrid recursive-iterative methods for triangular system solving, and tile recursive versions of the PLUQ decomposition, together with various data mapping policies, provide the best performance on a 32-core NUMA architecture. Overall, we show that the overhead of modular reductions is more than compensated by the fast linear algebra algorithms and that exact dense linear algebra matches the performance of full rank reference numerical software even in the presence of rank deficiencies.

    Automated HW/SW co-design for edge AI : State, challenges and steps ahead

    Gigantic rates of data production in the era of Big Data, the Internet of Things (IoT), and Smart Cyber Physical Systems (CPS) pose incessantly escalating demands for massive data processing, storage, and transmission while continuously interacting with the physical world using edge sensors and actuators. For IoT systems, there is now a strong trend to move the intelligence from the cloud to the edge or the extreme edge (known as TinyML). Yet, this shift to edge AI systems requires designing powerful machine learning systems under very strict resource constraints. This poses a difficult design task that needs to take the complete system stack into account, from machine learning algorithm, to model optimization and compression, to software implementation, to hardware platform and ML accelerator design. This paper discusses the open research challenges to achieve such a holistic Design Space Exploration for a HW/SW Co-design for Edge AI Systems and discusses the current state with three currently developed flows: one design flow for systems with tightly-coupled accelerator architectures based on RISC-V, one approach using loosely-coupled, application-specific accelerators, and one framework that integrates software and hardware optimization techniques to build efficient Deep Neural Network (DNN) systems.

    Modular SIMD arithmetic in Mathemagix

    Modular integer arithmetic occurs in many algorithms for computer algebra, cryptography, and error-correcting codes. Although recent microprocessors typically offer a wide range of highly optimized arithmetic functions, modular integer operations still require dedicated implementations. In this article, we survey existing algorithms for modular integer arithmetic, and present detailed vectorized counterparts. We also present several applications, such as fast modular Fourier transforms and multiplication of integer polynomials and matrices. The vectorized algorithms have been implemented in C++ inside the free computer algebra and analysis system Mathemagix. The performance of our implementation is illustrated by various benchmarks.
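To illustrate why modular operations call for dedicated implementations, here is a minimal scalar sketch of Montgomery multiplication, one well-known way to replace the division in a modular product by shifts and masks. It is offered as a general example and is not claimed to be the variant used in Mathemagix; the function names and the choice R = 2^32 are assumptions, and the modular inverse via `pow(n, -1, r)` requires Python 3.8 or later.

```python
def montgomery_setup(n, k=32):
    """Precompute -n^{-1} mod 2^k for an odd modulus n < 2^k."""
    r = 1 << k
    return (-pow(n, -1, r)) % r


def redc(t, n, k, n_inv_neg):
    """Montgomery reduction: return t * 2^{-k} mod n, for 0 <= t < n * 2^k."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_inv_neg) & mask   # makes t + m*n divisible by 2^k
    u = (t + m * n) >> k                  # exact division by 2^k
    return u - n if u >= n else u


def mont_mul(a, b, n, k=32):
    """Return a*b mod n using Montgomery form (illustrative, not optimized)."""
    n_inv_neg = montgomery_setup(n, k)
    a_bar = (a << k) % n                  # a * R mod n
    b_bar = (b << k) % n                  # b * R mod n
    prod_bar = redc(a_bar * b_bar, n, k, n_inv_neg)  # a*b*R mod n
    return redc(prod_bar, n, k, n_inv_neg)           # back to a*b mod n


n = 1_000_000_007
a, b = 123_456_789, 987_654_321
result = mont_mul(a, b, n)
print(result == (a * b) % n)  # True
```

In practice the conversion into and out of Montgomery form is amortized over many multiplications, so only `redc` sits in the inner loop; that loop body uses only multiplications, additions, masks, and shifts, which is the kind of operation mix that vectorizes well.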

    Optical and digital signal processing in transmission systems with space-division multiplexing (Processamento ótico e digital de sinal em sistemas de transmissão com multiplexagem por divisão espacial)

    The present thesis focuses on the development of optical and digital signal processing techniques for coherent optical transmission systems with space-division multiplexing (SDM). According to the level of spatial crosstalk, these systems can be grouped into those with and those without spatial selectivity, which drastically changes their operating principle. In systems with spatial selectivity, the mode coupling is negligible and therefore an arbitrary spatial channel can be independently routed through the optical network and post-processed at the optical coherent receiver. In systems without spatial selectivity, mode coupling plays a key role, so that spatial channels are jointly transmitted and post-processed at the optical coherent receiver. With this in mind, optical switching techniques are developed for SDM transmission systems with spatial selectivity, whereas digital techniques for space-demultiplexing are developed for SDM systems without spatial selectivity. With the purpose of developing switching techniques, the acousto-optic effect is analyzed in few-mode fibers (FMFs) and in multicore fibers (MCFs). In FMFs, signal switching between two arbitrary modes using flexural or longitudinal acoustic waves is numerically and experimentally demonstrated. In MCFs, it is shown that a double resonant coupling, induced by flexural acoustic waves, allows for signal switching between two arbitrary cores. Still in the context of signal switching, signal propagation in the multimodal nonlinear regime is analyzed. The nonlinear Schrödinger equation is derived in the presence of mode coupling, allowing a detailed analysis of the multimodal four-wave mixing process. Under the right conditions, it is shown that this process allows for signal switching between distinguishable optical modes. The signal representation in higher-order Poincaré spheres is introduced and analyzed in order to develop digital signal processing techniques.
In this representation, an arbitrary pair of tributary signals is represented in a Poincaré sphere, where the samples appear symmetrically distributed around a symmetry plane. Based on this property, spatial-demultiplexing and mode-dependent loss compensation techniques are developed, which are independent of the modulation format, are free of training sequences, and tend to be robust to frequency offsets and phase fluctuations. These techniques are numerically validated, and their performance is assessed through the calculation of the remaining penalty in the signal-to-noise ratio of the post-processed signal. Finally, the complexity of these techniques is analytically described in terms of real multiplications per sample. Doctoral Programme in Electrical Engineering (Programa Doutoral em Engenharia Eletrotécnica).