
    A CPU-GPU Hybrid Approach for Accelerating Cross-correlation Based Strain Elastography

    Elastography is a non-invasive imaging modality that uses ultrasound to estimate the elasticity of soft tissues. The resulting images are called 'elastograms'. Elastography techniques are promising as cost-effective tools for the early detection of pathological changes in soft tissues. The quality of elastographic images depends on the accuracy of the local displacement estimates. Cross-correlation based displacement estimators are precise and sensitive. However, cross-correlation based techniques are computationally intensive and may limit the use of elastography as a real-time diagnostic tool. This study investigates the use of parallel general purpose graphics processing unit (GPGPU) engines for speeding up the generation of elastograms to real-time frame rates while preserving elastographic image quality. To achieve this goal, a cross-correlation based time-delay estimation algorithm was developed in the C programming language and profiled to locate performance bottlenecks. The hotspots were addressed by employing software pipelining and read-ahead and by eliminating redundant computations. The algorithm was then analyzed for parallelization on the GPGPU, and the stages that would map well to the GPGPU hardware were identified. By employing optimization principles for efficient memory access and efficient execution, a net improvement of 67x with respect to the original optimized C version of the estimator was achieved. For typical diagnostic depths of 3-4 cm and typical elastographic processing parameters, this implementation can yield elastographic frame rates on the order of 50 fps. It was also observed that not all of the stages in elastography can be offloaded to the GPGPU for computation, because some stages have sub-optimal memory access patterns. Additionally, data transfer from graphics card memory to system memory can be efficiently overlapped with concurrent CPU execution. Therefore, a hybrid model of computation, in which the computational load is distributed between the CPU and the GPGPU, was identified as the best approach to the speed-quality problem in real-time imaging. The results of this research suggest that using the GPGPU as a co-processor to the CPU may allow generation of elastograms at real-time frame rates without significant compromise in image quality, a scenario that could be very favorable in real-time clinical elastography.
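
The displacement-estimation step named in the abstract above can be illustrated with a minimal sketch. It is not the thesis's C or GPGPU implementation: the function names, the window length, the search range, and the synthetic signals are all assumptions chosen for illustration. The sketch slides a window of the post-compression line past a reference window of the pre-compression line and keeps the integer lag with the highest normalized cross-correlation.

```python
import math

def normalized_cross_correlation(a, b):
    """Zero-lag normalized cross-correlation of two equal-length windows."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def estimate_delay(pre, post, start, window, max_lag):
    """Return (lag, score): the integer lag in samples that best aligns the
    reference window pre[start:start+window] with the post signal."""
    ref = pre[start:start + window]
    best_lag, best_score = 0, -2.0
    for lag in range(-max_lag, max_lag + 1):
        lo = start + lag
        if lo < 0 or lo + window > len(post):
            continue  # candidate window falls outside the signal
        score = normalized_cross_correlation(ref, post[lo:lo + window])
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Synthetic data (an assumption): "post" is "pre" delayed by 3 samples.
pre = [math.sin(0.3 * n) * math.exp(-((n - 40) / 15.0) ** 2) for n in range(80)]
shift = 3
post = [0.0] * shift + pre[:-shift]
lag, score = estimate_delay(pre, post, start=30, window=20, max_lag=8)
```

Every window position repeats the same independent search, which is one reason this kind of estimator is a natural candidate for parallel hardware.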

    Real-time Shadows for Gigapixel Displacement Maps

    Shadows convey helpful information in scenes. From a scientific visualization standpoint, they add data without unnecessary clutter. In video games they add realism and depth. In common graphics pipelines, shadows are difficult to achieve because geometric primitives are rendered independently and in parallel. Objects require knowledge of each other, and therefore multiple renders are needed to collect the necessary data. Collecting this data comes with its own set of trade-offs. Our research involves adding shadows to a lunar rendering framework developed by Dr. Robert Kooima. The NASA-collected data contains a multi-gigapixel displacement map describing the lunar topography. This map does not fit entirely into main memory, and therefore out-of-core paging is used to achieve real-time speeds. Current shadow techniques do not attempt to generate occluder data on such a scale, and therefore we have developed a novel approach to fit this situation. Using a chain of pre-processing steps, we analyze the structure of the displacement map and calculate horizon lines at each vertex. This information is saved into several images and used to generate shadows in a single pass, maintaining real-time speeds. The algorithm is even capable of generating soft shadows without extra information or loss of speed. We compare our algorithm with common approaches in the field as well as with two forms of ground truth: one from ray tracing and the other from the gigapixel lunar texture data, which shows the real shadows at the time it was collected.
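
The idea of precomputing a horizon per vertex and then shading in a single pass can be sketched on a one-dimensional height profile. This is only an illustration under stated assumptions, not the thesis's algorithm: the function names, the brute-force horizon search, the light travelling along +x, and the linear soft-shadow ramp are all hypothetical choices.

```python
import math

def horizon_tangents(heights, spacing=1.0):
    """Pre-processing: for each sample, the largest slope (rise over run)
    to any sample further along +x, i.e. the tangent of its horizon angle.
    The last sample sees no terrain, so its value is -inf."""
    n = len(heights)
    result = []
    for i in range(n):
        best = -math.inf
        for j in range(i + 1, n):
            slope = (heights[j] - heights[i]) / ((j - i) * spacing)
            best = max(best, slope)
        result.append(best)
    return result

def light_factor(horizon_tan, sun_elevation, sun_radius=0.0):
    """Shading: 1.0 is fully lit, 0.0 fully shadowed. Angles in radians.
    With sun_radius > 0 the result ramps linearly while the horizon
    crosses the sun's disc, giving a soft shadow from the same data."""
    horizon_angle = math.atan(horizon_tan)  # atan(-inf) is -pi/2
    if sun_radius <= 0.0:
        return 1.0 if sun_elevation > horizon_angle else 0.0
    t = (sun_elevation - horizon_angle + sun_radius) / (2.0 * sun_radius)
    return min(1.0, max(0.0, t))

# A flat profile with a single peak of height 3 at index 3.
heights = [0.0, 0.0, 0.0, 3.0, 0.0, 0.0]
tangents = horizon_tangents(heights)
sun = math.radians(50.0)
hard = [light_factor(t, sun) for t in tangents]
soft = [light_factor(t, sun, math.radians(10.0)) for t in tangents]
```

Here `tangents` plays the role of the stored horizon images: once it exists, the shading step needs only one lookup and one comparison per sample, whatever the size of the terrain.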

    Efficient algorithms for the realistic simulation of fluids

    There is currently great demand for realistic simulations in the computer graphics field. Physically-based animations are commonly used, and one of the more complex problems in this field is fluid simulation, more so if real-time applications are the goal. Videogames, in particular, resort to techniques that simulate only the consequence and not the cause, using procedural or parametric methods and often disregarding the physical solution. This need motivates the present thesis: the interactive simulation of free-surface flows, usually liquids, which are the feature of interest in most common applications. Due to the complexity of fluid simulation, we have resorted to the high parallelism provided by current consumer-level GPUs in order to achieve real-time framerates. The simulation algorithm, the Lattice Boltzmann Method (LBM), was chosen for its efficiency and because its local operations map directly to the hardware architecture. We have created two free-surface simulations on the GPU: one fully in 3D and another restricted to the upper surface of a large bulk of fluid, limiting the simulation domain to 2D. We have extended the latter to track dry regions and coupled it with obstacles in a geometry-independent fashion. As it is restricted to 2D, the simulation loses some features because vertical separation of the fluid cannot be simulated. To account for this, we have coupled the surface simulation to a generic particle system with breaking wave conditions; the two simulations are fully independent, and only the coupling binds the LBM to the chosen particle system. Furthermore, both systems are visualized realistically at interactive framerates; raycasting techniques provide the expected light-related effects, such as refractions, reflections and caustics. Other techniques that improve the overall detail are also applied, such as low-level detail ripples and surface foam.

    Accelerating Number Theoretic Transformations for Bootstrappable Homomorphic Encryption on GPUs

    Homomorphic encryption (HE) draws huge attention as it provides a way of privacy-preserving computations on encrypted messages. Number Theoretic Transform (NTT), a specialized form of Discrete Fourier Transform (DFT) in the finite field of integers, is the key algorithm that enables fast computation on encrypted ciphertexts in HE. Prior works have accelerated NTT and its inverse transformation on a popular parallel processing platform, GPU, by leveraging DFT optimization techniques. However, these GPU-based studies lack a comprehensive analysis of the primary differences between NTT and DFT or only consider small HE parameters that have tight constraints in the number of arithmetic operations that can be performed without decryption. In this paper, we analyze the algorithmic characteristics of NTT and DFT and assess the performance of NTT when we apply the optimizations that are commonly applicable to both DFT and NTT on modern GPUs. From the analysis, we identify that NTT suffers from severe main-memory bandwidth bottleneck on large HE parameter sets. To tackle the main-memory bandwidth issue, we propose a novel NTT-specific on-the-fly root generation scheme dubbed on-the-fly twiddling (OT). Compared to the baseline radix-2 NTT implementation, after applying all the optimizations, including OT, we achieve 4.2x speedup on a modern GPU.
    Comment: 12 pages, 13 figures, to appear in IISWC 202
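
For readers unfamiliar with the transform, a textbook radix-2 NTT can be sketched in a few lines. This is a plain CPU illustration, not the paper's GPU kernel and not its OT scheme. The modulus 998244353 with primitive root 3 is a common choice assumed here, and the function names are hypothetical. The sketch derives each twiddle factor by repeated modular multiplication instead of reading it from a precomputed table, which only loosely hints at the trade of extra arithmetic for less memory traffic.

```python
MOD = 998244353        # prime, MOD - 1 = 119 * 2**23 (assumed parameter)
PRIMITIVE_ROOT = 3     # a primitive root modulo MOD

def ntt(values, invert=False):
    """Iterative radix-2 NTT over the integers modulo MOD.
    len(values) must be a power of two no larger than 2**23."""
    a = list(values)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        # Root of unity of order `length`; inverted for the inverse NTT.
        w_len = pow(PRIMITIVE_ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1  # twiddle factor, generated as we go rather than stored
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w % MOD
                a[start + k] = (u + v) % MOD
                a[start + k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length *= 2
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a

def multiply(p, q):
    """Multiply two polynomials with coefficients modulo MOD via the NTT."""
    out_len = len(p) + len(q) - 1
    size = 1
    while size < out_len:
        size *= 2
    fp = ntt(p + [0] * (size - len(p)))
    fq = ntt(q + [0] * (size - len(q)))
    product = ntt([x * y % MOD for x, y in zip(fp, fq)], invert=True)
    return product[:out_len]

# (1 + 2x + 3x^2) * (4 + 5x) = 4 + 13x + 22x^2 + 15x^3
coefficients = multiply([1, 2, 3], [4, 5])
```

Unlike a floating-point DFT, every butterfly here ends in a modular reduction, and the roots of unity are integers with no trigonometric shortcut.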

    Simple Hardware-Efficient Long Convolutions for Sequence Modeling

    State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2×, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2× faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.
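
A long convolution in this sense uses a kernel as long as the input sequence. The sketch below shows one plausible reading of "squashing the kernel weights", namely soft-thresholding each weight toward zero. The abstract does not define the operator, so that choice, the threshold value, and the function names are assumptions. The convolution is written as a direct quadratic-time loop for clarity; it is not FlashButterfly and does not use an FFT.

```python
import math

def squash(kernel, lam):
    """Soft-threshold (assumed squash operator): shrink every weight
    toward zero by lam and zero out weights smaller than lam."""
    return [math.copysign(max(abs(w) - lam, 0.0), w) for w in kernel]

def causal_long_conv(u, kernel):
    """y[t] = sum over s = 0..t of kernel[s] * u[t - s].
    The kernel must be at least as long as the input sequence."""
    n = len(u)
    assert len(kernel) >= n, "kernel must cover the whole sequence"
    return [sum(kernel[s] * u[t - s] for s in range(t + 1)) for t in range(n)]

kernel = [1.0, 0.5, -0.05, 0.02]         # one weight per time step
smooth_kernel = squash(kernel, lam=0.1)  # small weights become zero
impulse = [1.0, 0.0, 0.0, 0.0]
response = causal_long_conv(impulse, smooth_kernel)
```

Feeding in an impulse returns the squashed kernel itself, which makes it easy to inspect what the thresholding did to the weights.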

    U-DeepONet: U-Net Enhanced Deep Operator Network for Geologic Carbon Sequestration

    FNO and DeepONet are by far the most popular neural operator learning algorithms. FNO seems to enjoy an edge in popularity due to its ease of use, especially with high dimensional data. However, a lesser-acknowledged feature of DeepONet is its modularity. This feature gives the user the flexibility to choose the kind of neural network used in the trunk and/or branch of the DeepONet. This is beneficial because it has been shown many times that different types of problems require different kinds of network architectures for effective learning. In this work, we take advantage of this feature by carefully designing a more efficient neural operator based on the DeepONet architecture. We introduce U-Net enhanced DeepONet (U-DeepONet) for learning the solution operator of highly complex CO2-water two-phase flow in heterogeneous porous media. The U-DeepONet is more accurate in predicting gas saturation and pressure buildup than the state-of-the-art U-Net based Fourier Neural Operator (U-FNO) and the Fourier-enhanced Multiple-Input Operator (Fourier-MIONet) trained on the same dataset. In addition, the proposed U-DeepONet is significantly more efficient in training time than both the U-FNO (more than 18 times faster) and the Fourier-MIONet (more than 5 times faster), while consuming fewer computational resources. We also show that the U-DeepONet is more data efficient and better at generalization than both the U-FNO and the Fourier-MIONet.

    Acceleration of image processing algorithms for single particle analysis with electron microscopy

    Unpublished doctoral thesis jointly supervised by Masaryk University (Czech Republic) and the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 24-10-2022.
    Cryogenic Electron Microscopy (Cryo-EM) is a vital field in current structural biology. Unlike X-ray crystallography and Nuclear Magnetic Resonance, it can be used to analyze membrane proteins and other samples with overlapping spectral peaks. However, one of the significant limitations of Cryo-EM is its computational complexity. Modern electron microscopes can produce terabytes of data per session, from which hundreds of thousands of particles must be extracted and processed to obtain a near-atomic resolution of the original sample. Many existing software solutions use High-Performance Computing (HPC) techniques to bring these computations into the realm of practical usability. The common approach to acceleration is parallelization of the processing, but in practice we face many complications, such as problem decomposition, data distribution, load scheduling, balancing, and synchronization. The use of various accelerators further complicates the situation, as heterogeneous hardware brings additional caveats, for example limited portability, under-utilization due to synchronization, and sub-optimal code performance due to missing specialization. This dissertation, structured as a compendium of articles, aims to improve the algorithms used in Cryo-EM, especially in Single Particle Analysis (SPA). We focus on single-node performance optimizations, using techniques either available or developed in the HPC field, such as heterogeneous computing or autotuning, which may require the formulation of novel algorithms. The secondary goal of the dissertation is to identify the limitations of state-of-the-art HPC techniques. Since the Cryo-EM pipeline consists of multiple distinct steps targeting different types of data, there is no single bottleneck to be solved. As such, the presented articles show a holistic approach to performance optimization. First, we give details on the GPU acceleration of specific programs. The achieved speedup is due to the higher performance of the GPU, adjustments of the original algorithm to it, and the application of novel algorithms. More specifically, we provide implementation details of programs for movie alignment, 2D classification, and 3D reconstruction that have been sped up by an order of magnitude compared to their original multi-CPU implementation, or sufficiently to be used on-the-fly. In addition to these three programs, multiple other programs from the actively used, open-source software package XMIPP have been accelerated and improved. Second, we discuss our contribution to HPC in the form of autotuning. Autotuning is the ability of software to adapt to a changing environment, i.e., input or executing hardware. Towards that goal, we present cuFFTAdvisor, a tool that proposes and, through autotuning, finds the best configuration of the cuFFT library for given constraints of input size and plan settings. We also introduce a benchmark set of ten autotunable kernels for important computational problems implemented in OpenCL or CUDA, together with the introduction of complex dynamic autotuning to the KTT tool. Third, we propose an image processing framework, Umpalumpa, which combines a task-based runtime system, data-centric architecture, and dynamic autotuning. The proposed framework allows for writing complex workflows that automatically use available HW resources and adjust to different HW and data but at the same time are easy to maintain.
    The project that gave rise to these results received the support of a fellowship from the “la Caixa” Foundation (ID 100010434). The fellowship code is LCF/BQ/DI18/11660021. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 71367