
    Gunrock: GPU Graph Analytics

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs, have presented two significant challenges to developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We characterize the performance of various optimization strategies and evaluate Gunrock's overall performance on different GPU architectures on a wide range of graph primitives that span from traversal-based and ranking algorithms to triangle counting and bipartite-graph-based algorithms. The results show that on a single GPU, Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives and CPU shared-memory graph libraries such as Ligra and Galois, and better performance than any other GPU high-level graph library. Comment: 52 pages; invited paper to ACM Transactions on Parallel Computing (TOPC); an extended version of the PPoPP'16 paper "Gunrock: A High-Performance Graph Processing Library on the GPU".
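    The frontier-centric model described above can be illustrated with a small sketch. The following is not Gunrock's actual API; it is a minimal, hypothetical CPU-side illustration of one bulk-synchronous "advance" step of a BFS-like traversal, in which every edge leaving the current frontier is visited and previously unseen neighbors form the next frontier.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// CSR adjacency: edges of vertex v are col_indices[row_offsets[v] .. row_offsets[v+1]).
struct CsrGraph {
    std::vector<uint32_t> row_offsets;
    std::vector<uint32_t> col_indices;
};

// One bulk-synchronous "advance" over a vertex frontier: visit every edge
// leaving the frontier and emit previously unseen neighbors as the next
// frontier (here for BFS; Gunrock runs this step as load-balanced GPU kernels).
std::vector<uint32_t> advance(const CsrGraph& g, const std::vector<uint32_t>& frontier,
                              std::vector<int>& depth, int d) {
    std::vector<uint32_t> next;
    for (uint32_t v : frontier)
        for (uint32_t e = g.row_offsets[v]; e < g.row_offsets[v + 1]; ++e) {
            uint32_t u = g.col_indices[e];
            if (depth[u] < 0) {            // not visited yet
                depth[u] = d + 1;
                next.push_back(u);          // becomes part of the next frontier
            }
        }
    return next;
}

int main() {
    // Small directed graph: 0->1, 0->2, 1->3, 2->3.
    CsrGraph g{{0, 2, 3, 4, 4}, {1, 2, 3, 3}};
    std::vector<int> depth(4, -1);
    depth[0] = 0;
    std::vector<uint32_t> frontier{0};
    for (int d = 0; !frontier.empty(); ++d)   // iterate until the frontier is empty
        frontier = advance(g, frontier, depth, d);
    for (int v = 0; v < 4; ++v) std::cout << "depth[" << v << "] = " << depth[v] << "\n";
}
```

    Other Gunrock primitives follow the same pattern: a traversal is expressed as a sequence of such frontier operations rather than as hand-written GPU kernels.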

    Gunrock: A High-Performance Graph Processing Library on the GPU

    For large-scale graph analytics on the GPU, the irregularity of data access and control flow, and the complexity of programming GPUs have been two significant challenges for developing a programmable high-performance graph library. "Gunrock", our graph-processing system designed specifically for the GPU, uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge. We evaluate Gunrock on five key graph primitives and show that Gunrock has on average at least an order of magnitude speedup over Boost and PowerGraph, comparable performance to the fastest GPU hardwired primitives, and better performance than any other GPU high-level graph library. Comment: 14 pages; accepted by PPoPP'16 (removed the text repetition in the previous version v5).

    A GPU-based algorithm for fast node label learning in large and unbalanced biomolecular networks

    Background: Several problems in network biology and medicine can be cast into a framework where entities are represented through partially labeled networks, and the aim is to infer the labels (usually binary) of the unlabeled part. Connections represent functional or genetic similarity between entities, while the labelings are often highly unbalanced, that is, one class is largely under-represented: for instance, in automated protein function prediction (AFP) only a few proteins are annotated for most Gene Ontology terms, and in the disease-gene prioritization problem only a few genes are actually known to be involved in the etiology of a given disease. Imbalance-aware approaches to accurately predict node labels in biological networks are thereby required. Furthermore, such methods must be scalable, since input data can be large, as, for instance, in the context of multi-species protein networks. Results: We propose a novel semi-supervised parallel enhancement of COSNet, an imbalance-aware algorithm built on the Hopfield neural model and recently proposed to solve the AFP problem. By adopting an efficient representation of the graph and assuming a sparse network topology, we empirically show that it can be efficiently applied to networks with millions of nodes. The key strategy to speed up the computation is to partition nodes into independent sets so as to process each set in parallel by exploiting the power of GPU accelerators. This parallel technique ensures convergence to asymptotically stable attractors while preserving the asynchronous dynamics of the original model. Detailed experiments on real data and artificial big instances of the problem highlight the scalability and efficiency of the proposed method. Conclusions: By parallelizing COSNet we achieved on average a speed-up of 180x in solving the AFP problem in the S. cerevisiae, Mus musculus and Homo sapiens organisms, while lowering memory requirements. In addition, to show the potential applicability of the method to huge biomolecular networks, we predicted node labels in artificially generated sparse networks involving hundreds of thousands to millions of nodes.
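    The independent-set strategy mentioned above can be sketched compactly: if no two nodes in a set are connected, updating them concurrently reads only states that no thread in the set is writing, so the result matches an asynchronous sweep. The following is a minimal CPU-side illustration (OpenMP standing in for the GPU kernel; names and the bipolar activation are illustrative, not COSNet's actual code).

```cpp
#include <utility>
#include <vector>

// Sparse undirected network in adjacency-list form; w[i] holds (neighbor, weight) pairs.
using AdjList = std::vector<std::vector<std::pair<int, double>>>;

// Update every node of one independent set. Because no two nodes in the set
// are connected, they only read states of nodes outside the set, so they can
// be updated concurrently (e.g. one GPU thread per node) without races.
void update_independent_set(const AdjList& w, const std::vector<int>& set,
                            const std::vector<double>& threshold, std::vector<double>& state) {
    #pragma omp parallel for   // compile with -fopenmp; safe, see comment above
    for (int k = 0; k < static_cast<int>(set.size()); ++k) {
        int i = set[k];
        double input = 0.0;
        for (auto [j, wij] : w[i]) input += wij * state[j];
        state[i] = (input > threshold[i]) ? 1.0 : -1.0;   // bipolar Hopfield activation
    }
}

// One full epoch of the asynchronous dynamics: sweep the independent sets in turn.
void hopfield_epoch(const AdjList& w, const std::vector<std::vector<int>>& independent_sets,
                    const std::vector<double>& threshold, std::vector<double>& state) {
    for (const auto& set : independent_sets)
        update_independent_set(w, set, threshold, state);
}

int main() {
    // Path graph 0-1-2: {0, 2} is an independent set, {1} another.
    AdjList w{{{1, 1.0}}, {{0, 1.0}, {2, 1.0}}, {{1, 1.0}}};
    std::vector<double> theta(3, 0.0), state{1.0, -1.0, 1.0};
    hopfield_epoch(w, {{0, 2}, {1}}, theta, state);
}
```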

    Massively parallel hybrid search for the partial Latin square extension problem

    The partial Latin square extension problem is to fill as many empty cells as possible in a partially filled Latin square. This problem is a useful model for a wide range of relevant applications in diverse domains. This paper presents the first massively parallel hybrid search algorithm for this computationally challenging problem, based on a transformation of the problem to partial graph coloring. The algorithm features the following original elements. Based on a very large population (with more than 10^4 individuals) and modern graphics processing units, the algorithm performs many local searches in parallel to ensure an intensified exploitation of the search space. It employs a dedicated crossover with a specific parent-matching strategy to create a large number of diversified and information-preserving offspring at each generation. Extensive experiments on 1800 benchmark instances show the high competitiveness of the algorithm compared with the current best performing methods. Competitive results are also reported on the related Latin square completion problem. Analyses are performed to shed light on the roles of the main algorithmic components. The code of the algorithm will be made publicly available.
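    The reduction to partial graph coloring that the algorithm builds on is standard and can be sketched directly: each cell of the n x n square becomes a vertex, two cells in the same row or column are connected, the n symbols act as colors, and pre-filled cells become pre-colored vertices, so filling the square is exactly extending a partial coloring. A minimal sketch (names are illustrative, not the paper's code):

```cpp
#include <vector>

// A partial Latin square: grid[r][c] in {1..n} if filled, 0 if empty.
struct PartialLatinSquare {
    int n;
    std::vector<std::vector<int>> grid;
};

// Result of the reduction: a graph whose vertices are the n*n cells, with an
// edge between two cells sharing a row or a column, plus a pre-assigned color
// (symbol) for every filled cell (0 = uncolored).
struct ColoringInstance {
    int num_colors;                        // = n symbols
    std::vector<std::vector<int>> adj;     // adjacency lists over cell ids
    std::vector<int> precolor;             // precolor[cell] in {0..n}
};

// Transform a partial Latin square into a partial graph coloring instance.
ColoringInstance to_coloring(const PartialLatinSquare& s) {
    int n = s.n, cells = n * n;
    ColoringInstance g{n, std::vector<std::vector<int>>(cells), std::vector<int>(cells)};
    auto id = [n](int r, int c) { return r * n + c; };
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            g.precolor[id(r, c)] = s.grid[r][c];
            for (int c2 = c + 1; c2 < n; ++c2) {      // same row -> conflict edge
                g.adj[id(r, c)].push_back(id(r, c2));
                g.adj[id(r, c2)].push_back(id(r, c));
            }
            for (int r2 = r + 1; r2 < n; ++r2) {      // same column -> conflict edge
                g.adj[id(r, c)].push_back(id(r2, c));
                g.adj[id(r2, c)].push_back(id(r, c));
            }
        }
    return g;
}
```

    The parallel local searches and the crossover then operate on colorings of this conflict graph rather than on the square itself.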

    GPU accelerating distributed succinct de Bruijn graph construction

    Research and methods in the field of computational biology have grown in the last decades, thanks to the availability of biological data. One of the applications in computational biology is genome sequencing or sequence alignment, a method to arrange sequences of, for example, DNA or RNA, to determine regions of similarity between these sequences. Sequence alignment applications include public health purposes, such as monitoring antimicrobial resistance. Demand for fast sequence alignment has led to the usage of data structures, such as the de Bruijn graph, to store a large amount of information efficiently. De Bruijn graphs are currently one of the top data structures used in indexing genome sequences, and different methods to represent them have been explored. One of these methods is the BOSS data structure, a special case of the Wheeler graph index, which uses succinct data structures to represent a de Bruijn graph. As genomes can take a large amount of space, the construction of succinct de Bruijn graphs is slow. This has led to experimental research on using large-scale cluster engines such as Apache Spark and graphics processing units (GPUs) in genome data processing. This thesis explores the use of Apache Spark and Spark RAPIDS, a GPU computing library for Apache Spark, in the construction of a succinct de Bruijn graph index from genome sequences. The experimental results indicate that Spark RAPIDS can provide up to 8x speedups for specific operations, but for other operations it has severe limitations that restrict its usefulness for succinct de Bruijn graph index construction.
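    The underlying construction can be summarized compactly: every k-mer of a read contributes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and the succinct (BOSS) representation is then built over the sorted edge set. A minimal single-node sketch of the edge-extraction step (illustrative only; the thesis performs it distributed with Spark and on GPUs via Spark RAPIDS):

```cpp
#include <iostream>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Extract de Bruijn graph edges from reads: each k-mer yields an edge from
// its (k-1)-mer prefix to its (k-1)-mer suffix. Duplicate edges are collapsed.
std::set<std::pair<std::string, std::string>>
debruijn_edges(const std::vector<std::string>& reads, std::size_t k) {
    std::set<std::pair<std::string, std::string>> edges;
    for (const auto& read : reads) {
        if (read.size() < k) continue;
        for (std::size_t i = 0; i + k <= read.size(); ++i) {
            std::string kmer = read.substr(i, k);
            edges.emplace(kmer.substr(0, k - 1), kmer.substr(1));
        }
    }
    return edges;
}

int main() {
    for (const auto& [u, v] : debruijn_edges({"ACGTACG", "CGTACGT"}, 4))
        std::cout << u << " -> " << v << "\n";
}
```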

    GPU parallelization strategies for metaheuristics: a survey

    Metaheuristics have been showing interesting results in solving hard optimization problems. However, they become limited in terms of effectiveness and runtime for high-dimensional problems. Thanks to the independence of metaheuristic components, parallel computing appears as an attractive choice to reduce the execution time and to improve solution quality. By exploiting the increasing performance and programmability of graphics processing units (GPUs) to this aim, GPU-based parallel metaheuristics have been implemented using different designs. Recent results in this area show that GPUs tend to be effective co-processors for tackling complex optimization problems. In this survey, mechanisms involved in GPU programming for implementing parallel metaheuristics are presented and discussed through a study of relevant research papers. Metaheuristics can obtain satisfying results when solving optimization problems in a reasonable time. However, they suffer from a lack of scalability and become limited when facing complex high-dimensional optimization problems. To overcome this limitation, GPU-based parallel computing appears as a strong alternative. Thanks to GPUs, parallel metaheuristics have achieved better results in terms of computation time and even solution quality.
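    The component independence the survey refers to is what makes data-parallel designs attractive: candidate solutions can be evaluated without any interaction, so the evaluation step maps directly onto one GPU thread (or block) per candidate. A minimal CPU-side sketch of that pattern, with OpenMP standing in for a device kernel (names are illustrative):

```cpp
#include <functional>
#include <vector>

// Evaluate the fitness of every candidate solution independently.
// On a GPU the same pattern becomes "one thread (or block) per candidate";
// here OpenMP stands in for the device kernel (compile with -fopenmp).
std::vector<double> evaluate_population(
        const std::vector<std::vector<double>>& population,
        const std::function<double(const std::vector<double>&)>& fitness) {
    std::vector<double> scores(population.size());
    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(population.size()); ++i)
        scores[i] = fitness(population[i]);   // candidates never interact
    return scores;
}
```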

    High Performance Computing for DNA Sequence Alignment and Assembly

    Recent advances in DNA sequencing technology have dramatically increased the scale and scope of DNA sequencing. These data are used for a wide variety of important biological analyses, including genome sequencing, comparative genomics, transcriptome analysis, and personalized medicine, but are complicated by the volume and complexity of the data involved. Given the massive size of these datasets, computational biology must draw on the advances of high performance computing. Two fundamental computations in computational biology are read alignment and genome assembly. Read alignment maps short DNA sequences to a reference genome to discover conserved and polymorphic regions of the genome. Genome assembly computes the sequence of a genome from many short DNA sequences. Both computations benefit from recent advances in high performance computing to efficiently process the huge datasets involved, including using highly parallel graphics processing units (GPUs) as high-performance desktop processors, and using the MapReduce framework coupled with cloud computing to parallelize computation across large compute grids. This dissertation demonstrates how these technologies can be used to accelerate these computations by orders of magnitude, and have the potential to make otherwise infeasible computations practical.
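    Read alignment as described here can be illustrated with the classic seed-and-extend idea: index every k-mer of the reference, use a read's first k-mer as a seed to find candidate positions, then verify the rest of the read allowing a few mismatches. The following is a minimal sketch under those assumptions, not the dissertation's actual tools, which apply GPUs and MapReduce to the same problem at far larger scale.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Candidate positions of every k-mer in the reference (the "seed index").
std::unordered_map<std::string, std::vector<std::size_t>>
build_index(const std::string& ref, std::size_t k) {
    std::unordered_map<std::string, std::vector<std::size_t>> index;
    for (std::size_t i = 0; i + k <= ref.size(); ++i)
        index[ref.substr(i, k)].push_back(i);
    return index;
}

// Seed-and-extend: seed with the read's first k-mer, then count mismatches
// over the full read at each candidate position; report hits within the budget.
std::vector<std::size_t> align(const std::string& ref, const std::string& read,
                               const std::unordered_map<std::string, std::vector<std::size_t>>& index,
                               std::size_t k, std::size_t max_mismatches) {
    std::vector<std::size_t> hits;
    auto it = index.find(read.substr(0, k));
    if (it == index.end()) return hits;
    for (std::size_t pos : it->second) {
        if (pos + read.size() > ref.size()) continue;
        std::size_t mismatches = 0;
        for (std::size_t i = 0; i < read.size(); ++i)
            if (ref[pos + i] != read[i]) ++mismatches;
        if (mismatches <= max_mismatches) hits.push_back(pos);
    }
    return hits;
}

int main() {
    std::string ref = "ACGTTGCAACGTA";
    auto index = build_index(ref, 4);
    for (std::size_t pos : align(ref, "TGCAAC", index, 4, 1))
        std::cout << "read aligns at reference position " << pos << "\n";
}
```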

    A modified genetic algorithm for determining the chromatic number of a graph

    The qualifying work includes an explanatory note (73 pages, 15 figures, 21 pseudocode listings, 11 tables, 5 charts, 4 appendices). The purpose of this diploma project is to create a modification of the genetic algorithm for determining the chromatic number of a graph. The work reviews and analyzes the stages of the classical genetic algorithm. Different types of selection and crossover are examined, along with the most common data-encoding methods and the existing termination conditions of the algorithm. Of all possible options, those best suited to the problem of finding a chromatic number are selected, and the choice is justified. The program provides the ability to: set the number of vertices and edges in the graph; create random graphs; set the size of the initial population, the number of parents mutated at each iteration, and the number of parents for crossover; change the penalty for an incorrectly colored vertex and the penalty for each new color; select the mutation and crossover functions; and specify the desired chromatic number and check it for conflicts. Test models most suitable for demonstrating the functions of the algorithm are developed. Speed and accuracy are analyzed depending on the parameters. The optimal set of mutation and crossover functions, their order, and the criterion for when each function takes effect are selected. The Python programming language was chosen to implement the genetic algorithm. The algorithm can be used for scheduling, frequency assignment, register allocation in microprocessors, computation of derivatives, and parallelization of computations in numerical methods.
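    The penalty scheme described above, a penalty for every edge whose endpoints share a color plus a penalty for every distinct color used, translates directly into a fitness function to minimize. A minimal C++ sketch of such a function (the thesis itself is implemented in Python; the names and penalty weights here are illustrative):

```cpp
#include <unordered_set>
#include <utility>
#include <vector>

// A candidate solution assigns a color to every vertex.
// Fitness to minimize: conflict_penalty per edge whose endpoints share a color,
// plus color_penalty per distinct color used (fewer colors = better).
double coloring_fitness(const std::vector<std::pair<int, int>>& edges,
                        const std::vector<int>& color,
                        double conflict_penalty, double color_penalty) {
    double fitness = 0.0;
    for (auto [u, v] : edges)
        if (color[u] == color[v]) fitness += conflict_penalty;    // invalid coloring
    std::unordered_set<int> used(color.begin(), color.end());
    fitness += color_penalty * static_cast<double>(used.size());  // pressure toward fewer colors
    return fitness;
}
```

    A conflict-free individual whose number of colors matches the desired chromatic number then confirms that the candidate value is achievable for the graph.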

    Automatic Code Generation for Massively Parallel Applications in Computational Fluid Dynamics

    Solving partial differential equations (PDEs) is a fundamental challenge in many application domains in industry and academia alike. With increasingly large problems, efficient and highly scalable implementations become more and more crucial. Today, facing this challenge is more difficult than ever due to the increasingly heterogeneous hardware landscape. One promising approach is developing domain-specific languages (DSLs) for a set of applications. Using code generation techniques then allows targeting a range of hardware platforms while concurrently applying domain-specific optimizations in an automated fashion. The present work aims to further the state of the art in this field. As domain, we choose PDE solvers and, in particular, those from the group of geometric multigrid methods. To avoid having too broad a focus, we restrict ourselves to methods working on structured and patch-structured grids. We face the challenge of handling a domain as complex as ours, while providing different abstractions for diverse user groups, by splitting our external DSL ExaSlang into multiple layers, each specifying different aspects of the final application. Layer 1 is designed to resemble LaTeX and allows inputting continuous equations and functions. Their discretization is expressed on layer 2. It is complemented by algorithmic components which can be implemented in a Matlab-like syntax on layer 3. All information provided to this point is summarized on layer 4, enriched with particulars about data structures and the employed parallelization. Additionally, we support automated progression between the different layers. All ExaSlang input is processed by our jointly developed Scala code generation framework to ultimately emit C++ code. We particularly focus on how to generate applications parallelized with, e.g., MPI and OpenMP that are able to run on workstations and large-scale clusters alike. We showcase the applicability of our approach by implementing simple test problems, like Poisson's equation, as well as relevant applications from the field of computational fluid dynamics (CFD). In particular, we implement scalable solvers for the Stokes, Navier-Stokes and shallow water equations (SWE) discretized using finite differences (FD) and finite volumes (FV). For the case of Navier-Stokes, we also extend our implementation towards non-uniform grids, thereby enabling static mesh refinement, and advanced effects such as the simulated fluid being non-Newtonian and non-isothermal.
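    The end product of such a pipeline is ordinary C++ with, e.g., OpenMP parallelization. The following is not actual ExaSlang output; it is a minimal hand-written sketch of the kind of kernel a geometric multigrid code generator ultimately emits from the layer-4 description: one Jacobi smoothing sweep for a 2D Poisson discretization on a structured grid.

```cpp
#include <vector>

// One Jacobi smoothing sweep for the 2D Poisson equation -Laplace(u) = f on a
// structured (n+2) x (n+2) grid with spacing h and a fixed boundary layer.
// This is the kind of kernel a multigrid code generator would emit and
// parallelize (here with OpenMP; compile with -fopenmp).
void jacobi_sweep(const std::vector<double>& u, std::vector<double>& u_new,
                  const std::vector<double>& f, int n, double h) {
    auto idx = [n](int i, int j) { return i * (n + 2) + j; };
    #pragma omp parallel for
    for (int i = 1; i <= n; ++i)
        for (int j = 1; j <= n; ++j)
            u_new[idx(i, j)] = 0.25 * (u[idx(i - 1, j)] + u[idx(i + 1, j)] +
                                       u[idx(i, j - 1)] + u[idx(i, j + 1)] +
                                       h * h * f[idx(i, j)]);
}
```

    In the layered approach described above, a user never writes this loop nest by hand: the continuous equation, its discretization, and the smoother choice are specified on the upper layers, and the generator derives the data layout and parallelization.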

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving a high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation in a concise way independently of hardware architecture-specific details. Thus, it increases the programmer's productivity by disburdening her or him of low-level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning, which essentially amounts to solving an integer programming problem in which the objective function is the code's performance as a function of the parameter configuration, the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.
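    A stencil of the kind Patus targets, together with the sort of implementation parameter an auto-tuner searches over, can be sketched as follows. This is not the Patus DSL or its generated code; it is a minimal illustration of a 2D 5-point averaging stencil swept in cache blocks, where the block sizes BX and BY are the tunable parameters.

```cpp
#include <algorithm>
#include <vector>

// One sweep of a 2D 5-point averaging stencil over the interior of an
// (n+2) x (n+2) grid, processed in BX x BY cache blocks. BX and BY are the
// kind of implementation parameters an auto-tuner searches over: the stencil
// itself stays the same while the blocking (and thus performance) changes.
void stencil_sweep_blocked(const std::vector<double>& in, std::vector<double>& out,
                           int n, int BX, int BY) {
    auto idx = [n](int i, int j) { return i * (n + 2) + j; };
    for (int ii = 1; ii <= n; ii += BX)
        for (int jj = 1; jj <= n; jj += BY)
            for (int i = ii; i <= std::min(ii + BX - 1, n); ++i)
                for (int j = jj; j <= std::min(jj + BY - 1, n); ++j)
                    out[idx(i, j)] = 0.2 * (in[idx(i, j)] +
                                            in[idx(i - 1, j)] + in[idx(i + 1, j)] +
                                            in[idx(i, j - 1)] + in[idx(i, j + 1)]);
}
```

    An auto-tuner in the spirit described above then simply benchmarks the sweep for a set of candidate (BX, BY) configurations and keeps the fastest one, which is exactly the search over parameter configurations the abstract characterizes as an integer optimization problem.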