36 research outputs found

    CGiS : high-level data-parallel GPU programming

    Get PDF
    In the last few years, PC technology underwent a paradigm shift. The current trend leads aways from raising sequential performance to enhancing the available parallelism. The rapid performance increase of Graphics Processing Units (GPUs) is a part of this trend. However, it is difficult to harness the computational potential because for the longest time GPUs could be directed only through graphics APIs and in low-level code. The language CGiS has been developed to remedy this situation. CGiS is a data-parallel programming language, which offers a high-level abstraction of GPUs, letting programmers use GPUs as co-processors for massively parallel algorithms. This work presents the language and the compiler for CGiS in the context of general purpose programming on GPUs (GPGPU).Seit einigen Jahren zeichnet sich bei handelsüblichen PCs ein Trend weg von der Erhöhung der sequentiellen Leistung hin zur Parallelverarbeitung ab. Ein Bestandteil dieses Trends ist die rasche Leistungsentwicklung der Grafikkarten (GPUs), deren Rechenleistung die aktueller CPUs mittlerweile übertrifft. Es ist jedoch schwierig, diese Leistung auch abzurufen, da diese Geräte lange Zeit nur hardwarenah und über Grafik-APIs ansteuerbar waren. Um dies zu ändern, ist CGiS entwickelt worden, eine datenparallele Programmiersprache, die die GPUs abstrahiert und ihre Benutzung als Co-Prozessoren für massiv-datenparallele Algorithmen ermöglicht. Diese Arbeit stellt die Sprache und den Compiler im Kontext dieser Entwicklung vor

    SIMD code generation in data-parallel programming

    Get PDF
    Today';s desktop PCs feature a variety of parallel processing units. Developing applications that exploit this parallelism is a demanding task, and a programmer has to obtain detailed knowledge about the hardware for efficient implementation. CGiS is a data-parallel programming language providing a unified abstraction for two parallel processing units: graphics processing units (GPUs) and the vector processing units of CPUs. The CGiS compiler framework fully virtualizes the differences in capability and accessibility by mapping an abstract data-parallel programming model on those targets. The applicability of CGiS for GPUs has been shown in previous work; this work focuses on applying the abstract programming model of CGiS to CPUs with SIMD (Single Instruction Multiple Data) instruction sets. We have identified, adapted and implemented a set of program analyses to expose and access the available parallelism. The code generation phase is based on selected optimization algorithms tailored to SIMD code generation. Via code generation profiles, it is possible to adapt the code generation strategy to different target architectures. To assess the effectiveness of our approach, we have implemented backends for the two most widespread SIMD instruction sets, namely Intel';s Streaming SIMD Extensions and Freescale';s AltiVec. Additionally, we integrated a prototypical backend for the Cell Broadband Engine as an example for a multi-core architecture. Our experimental results show excellent average performance gains by a factor of 3 compared to standard scalar C++ implementations and underline the viability of this approach: real-world applications can be implemented easily with CGiS and result in efficient code.Parallelverarbeitung wird heutzutage in handelsüblichen PCs von einer Reihe verschiedener Komponenten unterstützt. Grafikprozessoren (GPUs) und Vektoreinheiten in CPUs sind zwei dieser Komponenten. Da die Entwicklung von Anwendungen, die diese Parallelität nutzen, eine anspruchsvolle Aufgabe ist, muss sich ein Programmierer detaillierte Kenntnisse der internen Hardwarestruktur aneignen. Mit CGiS stellen wir eine datenparallele Programmiersprache vor, die eine gemeinsame Abstraktion für Grafikprozessoren und Vektoreinheiten in CPUs bietet und ein einheitliches Programmiermodell für beide bereitstellt. In vorherigen Arbeiten haben wir bereits die Nutzbarkeit von CGiS für GPUs gezeigt. In der vorliegenden Arbeit bilden wir das abstrakte Programmiermodel von CGiS auf CPUs mit SIMD (Single Instruction Multiple Data) Instruktionssatz ab. Wir haben eine Reihe relevanter Programmanalysen angepasst und implementiert um Parallelität aufzudecken und zu nutzen. Die Codegenerierungsphase basiert auf ausgewählten Optimierungsalgorithmen, die speziell auf die Generierung von SIMD-Code zugeschnitten sind. Durch Profile für verschiedene Architekturen ist es uns möglich, die Codegenierung zu steuern. Um die Effektivität unseres Ansatzes unter Beweis zu stellen, haben wir Backends für die beiden am weitesten verbreiteten SIMD-Instruktionssätze implementiert: Die "Streaming SIMD Extensions" von Intel und AltiVec von Freescale. Zusätzlich haben wir ein prototypisches Backend für den Cell Prozessor von IBM, als Beispiel für eine Multi-Core-Architektur, integriert. Die Ergebnisse unserer Experimente belegen eine ausgezeichnete durchschnittliche Beschleunigung um einen Faktor von 3 im Vergleich zu handgeschriebenen C++-Implementierungen. Diese Resultate untermauern unseren Ansatz: Mittels CGiS lässt sich leistungsstarker Code für SIMD- und Multi-Core-Applikationen generieren

    Implementation of digital pheromones in PSO accelerated by commodity Graphics Hardware

    Get PDF
    In this paper, a model for Graphics Processing Unit (GPU) implementation of Particle Swarm Optimization (PSO) using digital pheromones to coordinate swarms within ndimensional design spaces is presented. Previous work by the authors demonstrated the capability of digital pheromones within PSO for searching n-dimensional design spaces with improved accuracy, efficiency and reliability in both serial and parallel computing environments using traditional CPUs. Modern GPUs have proven to outperform the number of floating point operations when compared to CPUs through inherent data parallel architecture and higher bandwidth capabilities. The advent of programmable graphics hardware in the recent times further provided a suitable platform for scientific computing particularly in the field of design optimization. However, the data parallel architecture of GPUs requires a specialized formulation for leveraging its computational capabilities. When the objective function computations are appropriately formulated for GPUs, it is theorized that the solution efficiency (speed) can be significantly increased while maintaining solution accuracy. The development of this method together with a number of multi-modal unconstrained test problems are tested and presented in this paper

    Digital Pheromone Implementation of PSO with Velocity Vector Accelerated by Commodity Graphics Hardware

    Get PDF
    In this paper, a model for Graphics Processing Unit (GPU) implementation of Particle Swarm Optimization (PSO) using digital pheromones to coordinate swarms within ndimensional design spaces is presented. Particularly, the velocity vector computations are carried out on graphics hardware. Previous work by the authors demonstrated the capability of digital pheromones within PSO for searching n-dimensional design spaces with improved accuracy, efficiency and reliability in serial, parallel and GPU computing environments. The GPU implementation was limited to computing the objective function values alone. Modern GPUs have proven to outperform the number of floating point operations when compared to CPUs through inherent data parallel architecture and higher bandwidth capabilities. This paper presents a method to implement velocity vector computations on a GPU along with objective function evaluations. Three different modes of implementation are studied and presented - First, CPU-CPU where objective function and velocity vector are calculated on CPU alone. Second, GPU-CPU where objective function is computed on the GPU and velocity vector is computed on GPU. Third, GPU-GPU where objective function and velocity vector are both evaluated on the GPU. The results from these three implementations are presented followed by conclusions and recommendations on the best approach for utilizing the full potential of GPUs for PSO

    Whole-function vectorization

    Full text link
    Abstract—Data-parallel programming languages are an impor-tant component in today’s parallel computing landscape. Among those are domain-specific languages like shading languages in graphics (HLSL, GLSL, RenderMan, etc.) and “general-purpose” languages like CUDA or OpenCL. Current implementations of those languages on CPUs solely rely on multi-threading to imple-ment parallelism and ignore the additional intra-core parallelism provided by the SIMD instruction set of those processors (like Intel’s SSE and the upcoming AVX or Larrabee instruction sets). In this paper, we discuss several aspects of implementing data-parallel languages on machines with SIMD instruction sets. Our main contribution is a language- and platform-independent code transformation that performs whole-function vectorization on low-level intermediate code given by a control flow graph in SSA form. We evaluate our technique in two scenarios: First, incorpo-rated in a compiler for a domain-specific language used in real-time ray tracing. Second, in a stand-alone OpenCL driver. We observe average speedup factors of 3.9 for the ray tracer and factors between 0.6 and 5.2 for different OpenCL kernels. I

    Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems

    Get PDF
    [Resumen]El interés en Java para computación paralela está motivado por sus interesantes características, tales como su soporte multithread, portabilidad, facilidad de aprendizaje,alta productividad y el aumento significativo en su rendimiento omputacional. No obstante, las aplicaciones paralelas en Java carecen generalmente de mecanismos de comunicación eficientes, los cuales utilizan a menudo protocolos basados en sockets incapaces de obtener el máximo provecho de las redes de baja latencia, obstaculizando la adopción de Java en computación de altas prestaciones (High Per- formance Computing, HPC). Esta Tesis Doctoral presenta el diseño, implementación y evaluación de soluciones de comunicación en Java que superan esta limitación. En consecuencia, se desarrollaron múltiples dispositivos de comunicación a bajo nivel para paso de mensajes en Java (Message-Passing in Java, MPJ) que aprovechan al máximo el hardware de red subyacente mediante operaciones de acceso directo a memoria remota que proporcionan comunicaciones de baja latencia. También se incluye una biblioteca de paso de mensajes en Java totalmente funcional, FastMPJ, en la cual se integraron los dispositivos de comunicación. La evaluación experimental ha mostrado que las primitivas de comunicación de FastMPJ son competitivas en comparación con bibliotecas nativas, aumentando significativamente la escalabilidad de aplicaciones MPJ. Por otro lado, esta Tesis analiza el potencial de la computación en la nube (cloud computing) para HPC, donde el modelo de distribución de infraestructura como servicio (Infrastructure as a Service, IaaS) emerge como una alternativa viable a los sistemas HPC tradicionales. La evaluación del rendimiento de recursos cloud específicos para HPC del proveedor líder, Amazon EC2, ha puesto de manifiesto el impacto significativo que la virtualización impone en la red, impidiendo mover las aplicaciones intensivas en comunicaciones a la nube. La clave reside en un soporte de virtualización apropiado, como el acceso directo al hardware de red, junto con las directrices para la optimización del rendimiento sugeridas en esta Tesis.[Resumo]O interese en Java para computación paralela está motivado polas súas interesantes características, tales como o seu apoio multithread, portabilidade, facilidade de aprendizaxe, alta produtividade e o aumento signi cativo no seu rendemento computacional. No entanto, as aplicacións paralelas en Java carecen xeralmente de mecanismos de comunicación e cientes, os cales adoitan usar protocolos baseados en sockets que son incapaces de obter o máximo proveito das redes de baixa latencia, obstaculizando a adopción de Java na computación de altas prestacións (High Performance Computing, HPC). Esta Tese de Doutoramento presenta o deseño, implementaci ón e avaliación de solucións de comunicación en Java que superan esta limitación. En consecuencia, desenvolvéronse múltiples dispositivos de comunicación a baixo nivel para paso de mensaxes en Java (Message-Passing in Java, MPJ) que aproveitan ao máaximo o hardware de rede subxacente mediante operacións de acceso directo a memoria remota que proporcionan comunicacións de baixa latencia. Tamén se inclúe unha biblioteca de paso de mensaxes en Java totalmente funcional, FastMPJ, na cal foron integrados os dispositivos de comunicación. A avaliación experimental amosou que as primitivas de comunicación de FastMPJ son competitivas en comparación con bibliotecas nativas, aumentando signi cativamente a escalabilidade de aplicacións MPJ. Por outra banda, esta Tese analiza o potencial da computación na nube (cloud computing) para HPC, onde o modelo de distribución de infraestrutura como servizo (Infrastructure as a Service, IaaS) xorde como unha alternativa viable aos sistemas HPC tradicionais. A ampla avaliación do rendemento de recursos cloud específi cos para HPC do proveedor líder, Amazon EC2, puxo de manifesto o impacto signi ficativo que a virtualización impón na rede, impedindo mover as aplicacións intensivas en comunicacións á nube. A clave atópase no soporte de virtualización apropiado, como o acceso directo ao hardware de rede, xunto coas directrices para a optimización do rendemento suxeridas nesta Tese.[Abstract]The use of Java for parallel computing is becoming more promising owing to its appealing features, particularly its multithreading support, portability, easy-tolearn properties, high programming productivity and the noticeable improvement in its computational performance. However, parallel Java applications generally su er from inefficient communication middleware, most of which use socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD Thesis presents the design, development and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several lowlevel message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a productionquality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, improving signi cantly the scalability of MPJ applications. Furthermore, this Thesis has analyzed the potential of cloud computing towards spreading the outreach of HPC, where Infrastructure as a Service (IaaS) o erings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which speci cally target HPC workloads, have been thoroughly assessed. The experimental results have shown the signi cant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of the proper virtualization support, such as the direct access to the network hardware, along with the guidelines for performance optimization suggested in this Thesis

    Methods for Epigenetic Analyses from Long-Read Sequencing Data

    Get PDF
    Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease. DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity. Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have evolved the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads. With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to another. Basecalling and methylation calling of Nanopore reads is a computationally expensive task which requires complex machine learning architectures. Read-level methylation calls require different approaches to data management and analysis than ones developed for methylation frequencies measured from short-read technologies or array data. The 2-dimensional nature of read and genome associated DNA methylation calls, including methylation caller uncertainties, are much more storage costly than 1-dimensional methylation frequencies. Methods for storage, retrieval, and analysis of such data therefore require careful consideration. Downstream analysis tasks, such as methylation segmentation or differential methylation calling, have the potential of benefiting from read information and allow uncertainty propagation. These avenues had not been considered in existing tools. In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state of the art software architecture and machine learning methods. I defined a storage standard for reference anchored and read assigned DNA methylation calls, including methylation calling uncertainties and read annotations such as haplotype or sample information. This storage container is defined as a schema for the hierarchical data format version 5, includes an index for rapid access to genomic coordinates, and is optimized for parallel computing with even load balancing. It further includes a python API for creation, modification, and data access, including convenience functions for the extraction of important quality statistics via a command line interface. Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing. This implementation takes advantage of the performance benefits provided by my high performance storage container. It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample and/or haplotype assigned DNA methylation profiles, while considering methylation calling uncertainties. Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction. I benchmarked all tools on both simulated and publicly available real data, and show the performance benefits compared to previously existing and concurrently developed solutions. Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma. I here report regulatory genomic regions differentially methylated before and after treatment, allele-specific methylation in the tumor, as well as methylation on chromothriptic structures. Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation. These callers can be used to measure chromatin accessibility in a NOMe-seq like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding. In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing

    Modern Computing Techniques for Solving Genomic Problems

    Get PDF
    With the advent of high-throughput genomics, biological big data brings challenges to scientists in handling, analyzing, processing and mining this massive data. In this new interdisciplinary field, diverse theories, methods, tools and knowledge are utilized to solve a wide variety of problems. As an exploration, this dissertation project is designed to combine concepts and principles in multiple areas, including signal processing, information-coding theory, artificial intelligence and cloud computing, in order to solve the following problems in computational biology: (1) comparative gene structure detection, (2) DNA sequence annotation, (3) investigation of CpG islands (CGIs) for epigenetic studies. Briefly, in problem #1, sequences are transformed into signal series or binary codes. Similar to the speech/voice recognition, similarity is calculated between two signal series and subsequently signals are stitched/matched into a temporal sequence. In the nature of binary operation, all calculations/steps can be performed in an efficient and accurate way. Improving performance in terms of accuracy and specificity is the key for a comparative method. In problem #2, DNA sequences are encoded and transformed into numeric representations for deep learning methods. Encoding schemes greatly influence the performance of deep learning algorithms. Finding the best encoding scheme for a particular application of deep learning is significant. Three applications (detection of protein-coding splicing sites, detection of lincRNA splicing sites and improvement of comparative gene structure identification) are used to show the computing power of deep neural networks. In problem #3, CpG sites are assigned certain energy and a Gaussian filter is applied to detection of CpG islands. By using the CpG box and Markov model, we investigate the properties of CGIs and redefine the CGIs using the emerging epigenetic data. In summary, these three problems and their solutions are not isolated; they are linked to modern techniques in such diverse areas as signal processing, information-coding theory, artificial intelligence and cloud computing. These novel methods are expected to improve the efficiency and accuracy of computational tools and bridge the gap between biology and scientific computing

    Shader optimization and specialization

    Get PDF
    In the field of real-time graphics for computer games, performance has a significant effect on the player’s enjoyment and immersion. Graphics processing units (GPUs) are hardware accelerators that run small parallelized shader programs to speed up computationally expensive rendering calculations. This thesis examines optimizing shader programs and explores ways in which data patterns on both the CPU and GPU can be analyzed to automatically speed up rendering in games. Initially, the effect of traditional compiler optimizations on shader source-code was explored. Techniques such as loop unrolling or arithmetic reassociation provided speed-ups on several devices, but different GPU hardware responded differently to each set of optimizations. Analyzing execution traces from numerous popular PC games revealed that much of the data passed from CPU-based API calls to GPU-based shaders is either unused, or remains constant. A system was developed to capture this constant data and fold it into the shaders’ source-code. Re-running the game’s rendering code using these specialized shader variants resulted in performance improvements in several commercial games without impacting their visual quality

    Jump flooding algorithm on graphics hardware and its applications

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH
    corecore