    Simulating the behavior of the human brain on GPUs

    Simulating the behavior of the human brain is one of the most important challenges in computing today. The main problem is finding efficient ways, with current technology, to manipulate and process the huge volume of data that this kind of simulation needs. This work focuses on one of the main steps of such a simulation: computing the voltage on the neurons' morphology. This is done with the Hines algorithm. Although this algorithm is optimal in terms of number of operations, it needs non-trivial modifications to be parallelized efficiently on GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both the method and the architecture, in order to solve a large number of Hines systems (neurons) efficiently. Each optimization is analyzed and described in depth. Two approaches are studied: one for mono-morphology simulations (a batch of neurons with the same shape) and one for multi-morphology simulations (a batch in which every neuron has a different shape). In mono-morphology simulations, a single kernel that computes all the neurons gives good performance. This turns out to be inefficient for multi-morphology simulations, which need a much more complex implementation to perform well: more than one GPU kernel must be executed, and each kernel call solves one specific part of the batch of neurons. These parts can be seen as multiple independent tridiagonal systems. Although this paper focuses on simulating the behavior of the human brain, some of these techniques, in particular those related to solving tridiagonal systems, can also be used in various oil and gas simulations. Our experiments show that the proposed optimizations achieve high performance on computations with a large number of neurons: our GPU implementations are about 4× and 8× faster than the OpenMP multicore implementation (16 cores) when using one and two NVIDIA K80 GPUs, respectively. These optimizations also continue to scale when dealing with a very large number of neurons.
This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under Grant Agreement No. 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P), and from the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). We thank NVIDIA for its support through the BSC/UPC NVIDIA GPU Center of Excellence, and the European Union’s Horizon 2020 Research and Innovation Programme for funding under the Marie Sklodowska-Curie Grant Agreement No. 749516. Peer Reviewed. Postprint (published version).
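The Hines solve mentioned in the abstract above can be illustrated with a minimal serial sketch. It is not the authors' GPU implementation. It assumes the usual Hines numbering, in which every node's parent has a smaller index than the node itself, and all names (`hines_solve`, `tree_matvec`, `parent`, `lower`, `upper`, `diag`, `rhs`) are hypothetical. When the tree has no branches (`parent[i] == i - 1`), the same loops reduce to the Thomas algorithm for a tridiagonal system.

```python
from typing import List, Sequence


def hines_solve(parent: Sequence[int], lower: Sequence[float],
                upper: Sequence[float], diag: Sequence[float],
                rhs: Sequence[float]) -> List[float]:
    """Solve a tree-structured (Hines) linear system A x = rhs.

    Assumed layout (hypothetical, for illustration only):
      parent[i] < i for every i > 0, and node 0 is the root;
      lower[i] = A[i][parent[i]], upper[i] = A[parent[i]][i];
      entries at index 0 of lower/upper are unused.
    """
    n = len(diag)
    d = list(diag)  # work on copies so the inputs stay untouched
    b = list(rhs)

    # Backward sweep: eliminate each node from its parent's row,
    # visiting children before parents (high index to low index).
    for i in range(n - 1, 0, -1):
        p = parent[i]
        factor = upper[i] / d[i]
        d[p] -= factor * lower[i]
        b[p] -= factor * b[i]

    # Forward sweep: the root is now a 1x1 system; substitute downwards.
    b[0] /= d[0]
    for i in range(1, n):
        b[i] = (b[i] - lower[i] * b[parent[i]]) / d[i]
    return b


def tree_matvec(parent: Sequence[int], lower: Sequence[float],
                upper: Sequence[float], diag: Sequence[float],
                x: Sequence[float]) -> List[float]:
    """Multiply the same tree-structured matrix by x (used to check a solution)."""
    n = len(diag)
    out = [diag[i] * x[i] for i in range(n)]
    for i in range(1, n):
        p = parent[i]
        out[i] += lower[i] * x[p]
        out[p] += upper[i] * x[i]
    return out


if __name__ == "__main__":
    # A small branching "neuron": node 1 has two children (2 and 3).
    parent = [-1, 0, 1, 1, 0]
    off = [0.0, -1.0, -1.0, -1.0, -1.0]
    diag = [4.0] * 5
    rhs = [1.0, 2.0, 3.0, 4.0, 5.0]
    x = hines_solve(parent, off, off, diag, rhs)
    print(x)
```

Both sweeps touch each node once, so the cost is linear in the number of nodes. The dependency of each node on its parent is what makes a single system hard to parallelize, which is why batching many independent systems is attractive on a GPU.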

    Towards a GPU-based implementation of interaction nets

    We present ingpu, a GPU-based evaluator for interaction nets that makes heavy use of their potential for parallel evaluation. We discuss the advantages and challenges of the ongoing implementation of ingpu and compare its performance with that of existing interaction net evaluators. Comment: In Proceedings DCM 2012, arXiv:1403.757

    Faster Concurrent Range Queries with Contention Adapting Search Trees Using Immutable Data

    The need for scalable concurrent ordered set data structures with linearizable range query support is increasing due to the rise of multicore computers, data processing platforms and in-memory databases. This paper presents a new concurrent ordered set with linearizable range query support. The new data structure is based on the contention adapting search tree and an immutable data structure. Experimental results show that the new data structure is up to three times faster than related data structures. It scales well because it adapts the sizes of its immutable parts to the contention level and to the sizes of the range queries.
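The role that immutable data plays in range queries can be shown with a deliberately simplified sketch. This is not the contention adapting search tree from the abstract: it swaps in a single immutable sorted tuple behind one lock, and the class and method names (`SnapshotSet`, `insert`, `remove`, `range_query`) are hypothetical. The point it illustrates is that a reader which obtains a reference to an immutable part holds a consistent snapshot and can scan it without blocking writers.

```python
import bisect
import threading
from typing import List, Tuple


class SnapshotSet:
    """Ordered set whose contents live in one immutable sorted tuple.

    Writers build a new tuple under a lock and publish it by replacing
    the reference; range queries read the reference once and then work
    on that snapshot only.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._keys: Tuple[int, ...] = ()

    def insert(self, key: int) -> bool:
        with self._lock:
            keys = self._keys
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return False  # already present
            # Publish a fresh tuple; existing snapshots are unaffected.
            self._keys = keys[:i] + (key,) + keys[i:]
            return True

    def remove(self, key: int) -> bool:
        with self._lock:
            keys = self._keys
            i = bisect.bisect_left(keys, key)
            if i == len(keys) or keys[i] != key:
                return False  # not present
            self._keys = keys[:i] + keys[i + 1:]
            return True

    def range_query(self, lo: int, hi: int) -> List[int]:
        """Return all keys k with lo <= k <= hi from one consistent snapshot."""
        keys = self._keys  # single reference read: the snapshot
        start = bisect.bisect_left(keys, lo)
        end = bisect.bisect_right(keys, hi)
        return list(keys[start:end])


if __name__ == "__main__":
    s = SnapshotSet()
    for k in (5, 1, 9, 3):
        s.insert(k)
    print(s.range_query(2, 9))
```

Copying the whole tuple on every update is what limits this sketch. Splitting the set into many smaller immutable parts, and adapting their sizes, is the kind of trade-off the abstract attributes to the proposed data structure.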

    Order out of Chaos: Proving Linearizability Using Local Views

    Proving the linearizability of highly concurrent data structures, such as those using optimistic concurrency control, is a challenging task. The main difficulty is in reasoning about the view of the memory obtained by the threads, because as they execute, threads observe different fragments of memory from different points in time. Until now, every linearizability proof has tackled this challenge from scratch. We present a unifying proof argument for the correctness of unsynchronized traversals, and apply it to prove the linearizability of several highly concurrent search data structures, including an optimistic self-balancing binary search tree, the Lazy List and a lock-free skip list. Our framework harnesses sequential reasoning about the view of a thread, considering the thread as if it traverses the data structure without interference from other operations. Our key contribution is showing that properties of reachability along search paths can be deduced for concurrent traversals from such interference-free traversals, when certain intuitive conditions are met. Basing the correctness of traversals on such local view arguments greatly simplifies linearizability proofs. At the heart of our result lies a notion of order on the memory, corresponding to the order in which locations in memory are read by the threads, which guarantees a certain notion of consistency between the view of the thread and the actual memory. To apply our framework, the user proves that the data structure satisfies two conditions: (1) acyclicity of the order on memory, even when it is considered across intermediate memory states, and (2) preservation of search paths to locations modified by interfering writes. Establishing the conditions, as well as the full linearizability proof utilizing our proof argument, reduces to simple concurrent reasoning. The result is a clear and comprehensible correctness proof that also elucidates common patterns underlying several existing data structures.

    Computer simulation of the fractional-differential dynamics of some filtration-consolidation processes

    Within the framework of models based on the Caputo-Fabrizio fractional derivative, we carry out a computer simulation of the fractional-differential filtration-consolidation dynamics of soil massifs saturated with salt solutions. A technique is developed for obtaining numerical solutions of the corresponding boundary value problems for systems of fractional differential equations of filtration and salt transport. An approach to parallelizing the computation is described, and results of numerical experiments on modeling the dynamics of the process are presented.

    Mapping parallel loops on multicore systems

    The compute nodes in contemporary HPC systems contain one or more multicore processors. As a result, these nodes constitute a shared-memory multiprocessor, often combining CMP and SMT concurrency technologies. This configuration introduces different levels of sharing in the cache hierarchy, resulting in non-uniform data sharing overheads. In this paper we analyze the data-sharing patterns that a real multithreaded application exhibits when executing on a multicore system, with emphasis on the use of the shared last-level cache (LLC) by the concurrent threads. Building on this study, we explore the loop mapping problem in such systems with the aim of optimizing the shared use of the LLC by all parallel threads. We propose a three-phase loop mapping strategy that deals with workload imbalances, minimizes cache sharing interference, and maximizes intra-core and inter-core data reuse in the cache hierarchy. Preliminary results show some benefits of our approach. However, this is a work in progress and much more research is being done. Postprint (author’s final draft).

    cuHinesBatch: solving multiple Hines systems on GPUs Human Brain Project

    Simulating the behavior of the human brain is one of the most important challenges in computing today. The main problem is finding efficient ways, with current technology, to manipulate and process the huge volume of data that this kind of simulation needs. This work focuses on one of the main steps of such a simulation: computing the voltage on the neurons' morphology. This is done with the Hines algorithm. Although this algorithm is optimal in terms of number of operations, it needs non-trivial modifications to be parallelized efficiently on NVIDIA GPUs. We propose several optimizations to accelerate this algorithm on GPU-based architectures, exploring the limitations of both the method and the architecture, in order to solve a large number of Hines systems (neurons) efficiently. Each optimization is analyzed and described in depth. To evaluate the impact of the optimizations on real inputs, we use six morphologies that differ in size and number of branches. Our experiments show that the proposed optimizations achieve high performance on computations with a large number of neurons: our GPU implementations are about 4× and 8× faster than the OpenMP multicore implementation (16 cores) when using one and two NVIDIA K80 GPUs, respectively. These optimizations also continue to scale when dealing with a very large number of neurons.
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). We thank NVIDIA for its support through the BSC/UPC NVIDIA GPU Center of Excellence. Antonio J. Peña is cofinanced by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva fellowship number IJCI-2015-23266. Peer Reviewed. Postprint (published version).
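Solving a batch of same-shaped Hines systems, as the abstract above describes, can be sketched as follows. This is a serial illustration under stated assumptions, not the cuHinesBatch code: the node-major (interleaved) storage, where the value of node `i` in system `k` sits at index `i * batch + k`, is an assumption made here, and the name `hines_solve_batch` is hypothetical. The inner loop over `k` is the part that a GPU would naturally spread across threads, one system per thread.

```python
from typing import List, Sequence


def hines_solve_batch(parent: Sequence[int], lower: Sequence[float],
                      upper: Sequence[float], diag: Sequence[float],
                      rhs: Sequence[float], batch: int) -> List[float]:
    """Solve `batch` Hines systems that share one tree shape.

    Assumptions (illustrative only): parent[i] < i for i > 0, and all
    coefficient arrays are stored node-major, so entry [i * batch + k]
    belongs to node i of system k.
    """
    n = len(parent)
    d = list(diag)  # copies, so the caller's data is not modified
    b = list(rhs)

    # Backward sweep: children are folded into their parents.
    for i in range(n - 1, 0, -1):
        ii = i * batch
        pi = parent[i] * batch
        for k in range(batch):  # independent systems: one GPU thread each
            factor = upper[ii + k] / d[ii + k]
            d[pi + k] -= factor * lower[ii + k]
            b[pi + k] -= factor * b[ii + k]

    # Forward sweep: solve the roots, then substitute towards the leaves.
    for k in range(batch):
        b[k] /= d[k]
    for i in range(1, n):
        ii = i * batch
        pi = parent[i] * batch
        for k in range(batch):
            b[ii + k] = (b[ii + k] - lower[ii + k] * b[pi + k]) / d[ii + k]
    return b


if __name__ == "__main__":
    # Two systems on a root with two children; both have the all-ones solution.
    parent = [-1, 0, 0]
    off = [0.0, 0.0, -1.0, -2.0, -1.0, -2.0]
    diag = [3.0, 5.0, 3.0, 5.0, 3.0, 5.0]
    rhs = [1.0, 1.0, 2.0, 3.0, 2.0, 3.0]
    print(hines_solve_batch(parent, off, off, diag, rhs, batch=2))
```

With this layout, neighbouring systems read neighbouring memory locations at every step of the sweep, which is the access pattern GPUs generally favour. When every neuron has a different shape, the shared `parent` array no longer exists, which is one reason the multi-morphology case is harder.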

    MPI+OpenMP tasking scalability for the simulation of the human brain

    Simulating the behavior of the human brain is one of the most ambitious challenges today, with an endless range of important applications. Many initiatives in the USA, Europe and Japan are attempting to reach this challenging target. In this work we focus on the most important European initiative (the Human Brain Project) and on one of its tools (Arbor). This tool simulates the spikes triggered in a neuronal network by computing the voltage capacitance on the neurons' morphology, and is one of the most precise simulators available today. We evaluate the use of MPI+OpenMP tasking on top of the Arbor simulator. We present the main characteristics of the Arbor tool and how they can be managed efficiently with MPI+OpenMP tasking. We show that this approach scales well on up to 32 nodes, even with a relatively low workload (number of neurons) per node. Our goal is not only to achieve a highly scalable MPI-based implementation, but also to develop a tool with a high degree of abstraction, without losing control or performance, by using MPI+OpenMP tasking.
We would like to acknowledge the valuable feedback and help provided by Benjamin Cumming and Alexander Peyser. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 720270 (HBP SGA1 and HBP SGA2), from the Spanish Ministry of Economy and Competitiveness under the project Computación de Altas Prestaciones VII (TIN2015-65316-P) and the Departament d’Innovació, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051). This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 749516. Peer Reviewed. Postprint (author version).