95 research outputs found

    Multi-GPU design and performance evaluation of homomorphic encryption on GPU clusters

    We present a multi-GPU design, implementation and performance evaluation of the Halevi-Polyakov-Shoup (HPS) variant of the Fan-Vercauteren (FV) levelled Fully Homomorphic Encryption (FHE) scheme. Our design follows a data parallelism approach and uses partitioning methods to distribute the workload in FV primitives evenly across available GPUs. The design is intended to address the space and runtime requirements of FHE computations. It is also suitable for distributed-memory architectures, and includes efficient GPU-to-GPU data exchange protocols. Moreover, it is user-friendly, as user intervention is not required for task decomposition, scheduling or load balancing. We implement and evaluate the performance of our design on two homogeneous and heterogeneous NVIDIA GPU clusters: K80 and a customized P100. We also provide a comparison with a recent shared-memory-based multi-core CPU implementation using two homomorphic circuits as workloads: vector addition and multiplication. Moreover, we use our multi-GPU levelled FHE to implement the inference circuit of two Convolutional Neural Networks (CNNs) to perform homomorphic image classification on encrypted images from the MNIST and CIFAR-10 datasets. Our implementation provides 1 to 3 orders of magnitude speedup compared with the CPU implementation on vector operations. In terms of scalability, our design shows reasonable scalability curves when the GPUs are fully connected. This work is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Programme (Award A19E3b0099). Peer Reviewed. Postprint (author's final draft).

    Local time stepping on high performance computing architectures: mitigating CFL bottlenecks for large-scale wave propagation

    Modeling problems that require the simulation of hyperbolic PDEs (wave equations) on large heterogeneous domains have potentially many bottlenecks. We attack this problem through two techniques: the massively parallel capabilities of graphics processors (GPUs) and local time stepping (LTS) to mitigate any CFL bottlenecks on a multiscale mesh. Many modern supercomputing centers are installing GPUs due to their high performance, and extending existing seismic wave-propagation software to use GPUs is vitally important to give application scientists the highest possible performance. In addition to this architectural optimization, LTS schemes avoid performance losses in meshes with localized areas of refinement. Coupled with the GPU performance optimizations, the derivation and implementation of a Newmark LTS scheme enables next-generation performance for real-world applications. Included in this implementation is work addressing the load-balancing problem inherent to multi-level LTS schemes, enabling scalability to hundreds and thousands of CPUs and GPUs. These GPU, LTS, and scaling optimizations accelerate the performance of existing applications by a factor of 30 or more, and enable future modeling scenarios previously made infeasible by the cost of standard explicit time-stepping schemes.
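To illustrate why local time stepping helps on a mesh with localized refinement, here is a minimal sketch of how elements could be grouped into power-of-two time-step levels from their local CFL limit. It is not the Newmark LTS scheme from the work above: the function names, the Courant number of 0.5, and the cost model (one unit of work per element update, no overhead at level boundaries) are all assumptions made for illustration.

```python
def lts_levels(h, c, courant=0.5):
    """Assign each element an LTS level k so that it steps with dt_coarse / 2**k.

    h: element sizes, c: wave speeds (hypothetical inputs for illustration).
    The local CFL limit is taken as courant * h / c.
    """
    dt = [courant * hi / ci for hi, ci in zip(h, c)]
    dt_coarse = max(dt)  # the largest stable step sets the coarsest level
    levels = []
    for d in dt:
        k = 0
        # halve the coarse step until it satisfies this element's CFL limit
        while dt_coarse / 2 ** k > d:
            k += 1
        levels.append(k)
    return dt_coarse, levels


def updates_per_coarse_step(levels):
    """Element updates needed to advance the whole mesh by one coarse step.

    Returns (global_cost, lts_cost): global stepping forces every element to
    use the finest step, while LTS lets each element use its own level.
    """
    finest = max(levels)
    global_cost = len(levels) * 2 ** finest
    lts_cost = sum(2 ** k for k in levels)
    return global_cost, lts_cost


# Eight coarse elements plus two small ones in a refined region.
dt_coarse, levels = lts_levels([1.0] * 8 + [0.25, 0.125], [1.0] * 10)
print(levels, updates_per_coarse_step(levels))
```

In this toy mesh the two refined elements would otherwise force all ten elements onto the finest step. The uneven work per level is also where the load-balancing problem mentioned in the abstract comes from.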

    Exploiting spatial symmetries for solving Poisson's equation

    This paper presents a strategy to accelerate virtually any Poisson solver by taking advantage of s spatial reflection symmetries. More precisely, we have proved the existence of an inexpensive block diagonalisation that transforms the original Poisson equation into a set of 2^s fully decoupled subsystems that are then solved concurrently. This block diagonalisation is identical regardless of the mesh connectivity (structured or unstructured) and the geometric complexity of the problem, and therefore applies to a wide range of academic and industrial configurations. In fact, it simplifies the task of discretising complex geometries, since it only requires meshing a portion of the domain that is then mirrored implicitly by the symmetries' hyperplanes. Thus, the resulting meshes naturally inherit the exploited symmetries, and their memory footprint becomes 2^s times smaller. Thanks to the subsystems' better spectral properties, iterative solvers converge significantly faster. Additionally, imposing an adequate ordering of the grid points allows reducing the operators' footprint and replacing the standard sparse matrix-vector products with the sparse matrix-matrix product, a kernel with higher arithmetic intensity. As a result, matrix multiplications are accelerated, and massive simulations become more affordable. Finally, we include numerical experiments based on a turbulent flow simulation in which state-of-the-art solvers exploit a varying number of symmetries. On the one hand, algebraic multigrid and preconditioned Krylov subspace methods require up to 23% and 72% fewer iterations, resulting in up to 1.7x and 5.6x overall speedups, respectively. On the other, sparse direct solvers' memory footprint, setup and solution costs are reduced by up to 48%, 58% and 46%, respectively. This work has been financially supported by two competitive R+D projects: RETOtwin (PDC2021-120970-I00), given by MCIN/AEI/10.13039/501100011033 and European Union Next GenerationEU/PRTR, and FusionCAT (001-P-001722), given by Generalitat de Catalunya RIS3CAT-FEDER. Àdel Alsalti-Baldellou has also been supported by the predoctoral grants DIN2018-010061 and 2019-DI-90, given by MCIN/AEI/10.13039/501100011033 and the Catalan Agency for Management of University and Research Grants (AGAUR), respectively. Peer Reviewed. Postprint (published version).
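The decoupling described above can be sketched for the simplest case of a single reflection symmetry (s = 1), which yields two subsystems. The example below is an illustration under stated assumptions, not the paper's implementation: it uses a 1D Poisson matrix with Dirichlet ends, numbers the right half as the mirror image of the left half so that the matrix takes the form [[A0, A1], [A1, A0]], and relies on a small dense solver written only for this demonstration. All function names are hypothetical.

```python
def solve(a, b):
    """Dense Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def half_blocks(n):
    """A0: couplings inside one half; A1: couplings across the mirror plane."""
    a0 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a0[i][i] = 2.0
        if i > 0:
            a0[i][i - 1] = a0[i - 1][i] = -1.0
    a1 = [[0.0] * n for _ in range(n)]
    a1[n - 1][n - 1] = -1.0  # the node next to the plane touches its own mirror image
    return a0, a1


def solve_full(n, b1, b2):
    """Reference: solve the coupled 2n x 2n system [[A0, A1], [A1, A0]]."""
    a0, a1 = half_blocks(n)
    full = [r0 + r1 for r0, r1 in zip(a0, a1)] + [r1 + r0 for r0, r1 in zip(a0, a1)]
    x = solve(full, b1 + b2)
    return x[:n], x[n:]


def solve_decoupled(n, b1, b2):
    """Solve two independent n x n systems: (A0 + A1) u = b1 + b2, (A0 - A1) v = b1 - b2."""
    a0, a1 = half_blocks(n)
    plus = [[p + q for p, q in zip(r0, r1)] for r0, r1 in zip(a0, a1)]
    minus = [[p - q for p, q in zip(r0, r1)] for r0, r1 in zip(a0, a1)]
    u = solve(plus, [p + q for p, q in zip(b1, b2)])   # u = x1 + x2
    v = solve(minus, [p - q for p, q in zip(b1, b2)])  # v = x1 - x2
    x1 = [(p + q) / 2 for p, q in zip(u, v)]
    x2 = [(p - q) / 2 for p, q in zip(u, v)]
    return x1, x2


b1, b2 = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
print(solve_full(4, b1, b2))
print(solve_decoupled(4, b1, b2))
```

Adding and subtracting the two block rows is what separates the symmetric and antisymmetric parts of the solution. The two half-size systems share no unknowns, so they can be solved concurrently, and only the blocks A0 and A1 for one half of the domain need to be stored.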

    Scientific Application Acceleration Utilizing Heterogeneous Architectures

    Within the past decade, there have been substantial leaps in computer architectures to exploit the parallelism that is inherently present in many applications. The scientific community has benefited from the emergence of not only multi-core processors, but also other, less traditional architectures including general purpose graphical processing units (GPGPUs), field programmable gate arrays (FPGAs), and Intel's Many Integrated Core (MIC) architecture (i.e. Xeon Phi). The popularity of the GPGPU has increased rapidly because of its ability to perform massive amounts of parallel computation quickly and at low cost with an ease of programmability. Also, with the addition of high-level programming interfaces for these devices, technical and non-technical individuals can interface with the device and rapidly obtain improved performance for many algorithms. Many applications can take advantage of the parallelism present in distributed computing and multithreading to achieve higher levels of performance for the computationally intensive parts of the application. The work presented in this thesis implements three applications for use in a performance study of the GPGPU architecture and multi-GPGPU systems. The first application study in this research is a K-Means clustering algorithm that categorizes each data point into the closest cluster. The second algorithm implemented is a spiking neural network algorithm that is used as a computational model for machine learning. The third, and final, study is the longest common subsequences problem, which attempts to enumerate comparisons between sequences (namely, DNA sequences). The results for the aforementioned applications with varying problem sizes and architectural configurations are presented and discussed in this thesis. The K-Means clustering algorithm achieved approximately 97x speedup when utilizing an architecture consisting of 32 CPU/GPGPU pairs. To achieve this substantial speedup, up to 750,000 data points were used with up to 30,000 centroids (means). The spiking neural network algorithm resulted in speedups of about 33x for the entire algorithm and 160x for each iteration with a two-level network with 1000 total neurons (800 excitatory and 200 inhibitory neurons). The longest common subsequences problem achieved a speedup of greater than 10x with 100 random sequences up to 500 characters in length. The maximum speedup values for each application were achieved by utilizing the GPGPU as well as multi-core devices simultaneously. The computations were scattered over multiple CPU/GPGPU pairs with the computationally intensive pieces of the algorithms offloaded onto the GPGPU device. The research in this thesis illustrates the ability to scale a heterogeneous cluster (i.e. CPUs and GPUs working collaboratively) for large-scale scientific application performance improvements. Each algorithm demonstrates slightly different types of computations and communications, which can be compared to other algorithms to predict how they would perform on an accelerator. The results show that substantial speedups can be achieved for scientific applications when utilizing the GPGPU and multi-core architectures.
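As a rough illustration of the K-Means step described above, where each data point is categorized into the closest cluster, here is a minimal CPU-only sketch of one iteration. It is not the thesis's GPGPU implementation: the function names, the squared Euclidean distance, and the handling of empty clusters are assumptions made for this example.

```python
def assign_clusters(points, centroids):
    """Label each point with the index of its nearest centroid.

    Each point is handled independently of the others, which is the kind of
    data parallelism that lends itself to offloading onto an accelerator.
    """
    labels = []
    for p in points:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(d2.index(min(d2)))
    return labels


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its members (kept as-is if empty)."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new.append(tuple(old))
            continue
        new.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return new


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0), (10, 10)]
labels = assign_clusters(points, centroids)
print(labels, update_centroids(points, labels, centroids))
```

The assignment step costs on the order of (number of points) x (number of centroids) distance evaluations, so at the problem sizes quoted above it dominates the work per iteration.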