64 research outputs found

    Acceleration strategies for large-scale sequential simulations using parallel neighbour search: Non-LVA and LVA scenarios

    This paper describes the application of acceleration techniques to existing implementations of Sequential Gaussian Simulation and Sequential Indicator Simulation. These implementations may incorporate Locally Varying Anisotropy (LVA) to capture non-linear features of the underlying physical phenomena. The implementation focuses on a novel parallel neighbour search algorithm, which can be used in both non-LVA and LVA codes. Additionally, parallel shortest path executions and optimized linear algebra libraries are applied, with a focus on LVA codes. Execution time, speedup and accuracy results are presented. Non-LVA codes are benchmarked using two scenarios with approximately 50 million domain points each. Speedups of 2× and 4× were obtained on SGS and SISIM respectively, where each scenario is compared against a baseline code published in Peredo et al. (2018). The aggregated contribution to speedup of both works is 12× and 50× respectively. LVA codes are benchmarked using two scenarios with approximately 1.7 million domain points each. Speedups of 56× and 1822× were obtained on SGS and SISIM respectively, where each scenario is compared against the original baseline sequential codes. The authors acknowledge the donated resources from project PID2019-107255GB of the Spanish Ministerio de Economía y Competitividad, and project 2017-SGR-1414 from the Generalitat de Catalunya, Spain. Peer Reviewed. Postprint (published version)

    A path-level exact parallelization strategy for sequential simulation

    Sequential Simulation is a well-known method in geostatistical modelling. Following the Bayesian approach for simulation of conditionally dependent random events, the Sequential Indicator Simulation (SIS) method draws simulated values for K categories (categorical case) or classes defined by K different thresholds (continuous case). Similarly, the Sequential Gaussian Simulation (SGS) method draws simulated values from a multivariate Gaussian field. In this work, a path-level approach to parallelize the SIS and SGS methods is presented. A first stage re-arranges the simulation path, followed by a second stage of parallel simulation for non-conflicting nodes. A key advantage of the proposed parallelization method is that it generates realizations identical to those of the original non-parallelized methods. Case studies are presented using two sequential simulation codes from GSLIB: SISIM and SGSIM. Execution time and speedup results are shown for large-scale domains, with many categories and maximum kriging neighbours in each case, achieving high speedup results in the best scenarios using 16 threads of execution on a single machine. Peer Reviewed. Postprint (author's final draft)
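
    The two stages named in this abstract (re-arranging the path, then simulating non-conflicting nodes in parallel) can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a hypothetical 1-D grid, and it assumes two nodes conflict whenever they lie within a fixed search `radius` of each other. The function name `group_path_into_levels` is made up for this example.

```python
def group_path_into_levels(path, radius):
    """Group a simulation path into levels of mutually non-conflicting nodes.

    Assumption (illustrative only): nodes are integer positions on a 1-D grid,
    and a node depends on every earlier node of the path within `radius`.
    """
    level_of = {}
    for k, node in enumerate(path):
        # Levels of earlier path nodes that fall inside this node's neighbourhood.
        deps = [level_of[p] for p in path[:k] if abs(p - node) <= radius]
        # A node must run strictly after all of its dependencies.
        level_of[node] = 1 + max(deps, default=0)

    grouped = {}
    for node in path:  # keep the original path order inside each level
        grouped.setdefault(level_of[node], []).append(node)
    return [grouped[lvl] for lvl in sorted(grouped)]


# Nodes 0 and 5 are far apart, so they share a level; node 2 must wait for node 1.
levels = group_path_into_levels([0, 5, 1, 6, 2], radius=1)
```

    Nodes within one level have no dependency on each other, so they could be handed to separate threads. Reproducing the sequential result exactly would additionally require that each node draws the same random number as in the sequential run, which this sketch does not model.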

    Operación stencil en CUDA

    The problems arising from energy dissipation in sequential computing are making machines and systems with larger numbers of processing cores increasingly popular, ranging from small processors with a few cores, through clusters of several distributed sequential machines, to graphics coprocessing devices with several hundred cores to which general-purpose tasks can be assigned. Many algorithms are being adapted to these parallelization models. Preprint

    Low-latency multi-threaded ensemble learning for dynamic big data streams

    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Real-time mining of evolving data streams involves new challenges when targeting today's application domains such as the Internet of Things: increasing volume, velocity and volatility require data to be processed on the fly, with fast reaction and adaptation to changes. This paper presents a high-performance scalable design for decision trees and ensemble combinations that makes use of the vector SIMD and multicore capabilities available in modern processors to provide the required throughput and accuracy. The proposed design offers very low latency and good scalability with the number of cores on commodity hardware when compared to other state-of-the-art implementations. On an Intel i7-based system, processing a single decision tree is 6x faster than MOA (Java) and 7x faster than StreamDM (C++), two well-known reference implementations. On the same system, using the 6 cores (and 12 hardware threads) available allows an ensemble of 100 learners to be processed 85x faster than MOA while providing the same accuracy. Furthermore, our solution is highly scalable: on an Intel Xeon socket with large core counts, the proposed ensemble design achieves up to 16x speed-up when employing 24 cores with respect to a single-threaded execution. This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316-P project, by the Generalitat de Catalunya (contract 2014-SGR-1051), by the Universitat Politècnica de Catalunya through an FPI/UPC scholarship and by NVIDIA through the UPC/BSC GPU Center of Excellence. Peer Reviewed. Postprint (author's final draft)

    A new generation of task-parallel algorithms for matrix inversion in many-threaded CPUs

    We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms partition the matrix operand into two levels of tasks. The matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally by row blocks (tiles) or vertically by µ-panels (of columns), in order to extract sufficient task parallelism to feed a many-threaded general-purpose processor (CPU). The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores. This research was sponsored by projects RTI2018-093684-B-I00 and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; and project PR65/19-22445 of Universidad Complutense de Madrid. Peer Reviewed. Postprint (author's final draft)
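
    For readers unfamiliar with the underlying method, the following is a minimal sequential sketch of matrix inversion by Gauss-Jordan elimination with partial pivoting. It is an unblocked textbook version in plain Python, not the blocked, task-parallel OpenMP algorithms of the paper; the function name `invert_gauss_jordan` and the singularity tolerance are assumptions made for illustration.

```python
def invert_gauss_jordan(a, tol=1e-12):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination.

    Unblocked, sequential illustration; `tol` is an assumed singularity threshold.
    """
    n = len(a)
    # Build the augmented matrix [A | I].
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]

    for col in range(n):
        # Partial pivoting: choose the row with the largest magnitude in this column.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < tol:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]

        # Scale the pivot row so that the pivot element becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]

        # Eliminate this column from every other row (above and below the pivot).
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]

    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


inverse = invert_gauss_jordan([[4, 7], [2, 6]])  # approximately [[0.6, -0.7], [-0.2, 0.4]]
```

    The pivot search spans a whole column, which is why the abstract describes partitioning the matrix into column panels first: a panel keeps each column contiguous so that pivoting can proceed as in the unblocked version.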

    Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

    We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming. This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union's Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF "A way of making Europe". Peer Reviewed. Postprint (published version)

    Evaluation and assessment of professional skills in the Final Year Project

    In this paper, we present a methodology for Final Year Project (FYP) monitoring and assessment that considers the inclusion of the professional skills required in the particular engineering degree. This monitoring and clear evaluation framework provides the student with valuable support for the project implementation as well as for improving the quality of the projects, thereby reducing the academic drop-out rate. The proposed methodology has been implemented at the Barcelona School of Informatics at the Universitat Politècnica de Catalunya - BarcelonaTech. The FYP is structured around three milestones: project definition, project monitoring and project completion. Skills are assigned to each milestone according to the tasks required in that phase, and a list of indicators is defined for each phase. The evaluation criteria for each indicator at each phase are specified in a rubric, and are made public to both students and teachers. Thus, the FYP includes an exhaustive evaluation method distributed throughout the whole project implementation, thereby facilitating project organization for the student as well as providing a clear and homogeneous assessment framework. The methodology for the FYP organization, assessment and evaluation was launched and piloted over two semesters. Although the experience has been conducted as part of an ICT engineering degree, we believe it to be general, as it may easily be extended to any other engineering degree. Postprint (author's final draft)

    Elaboració d’un pla de tutories per a la FIB

    Most students who enter university, and the FIB in particular, are not aware of what pursuing a university degree entails. University studies require a great deal of dedication. Students often have gaps in knowledge, but also unsuitable habits and attitudes. In this respect the figure of the tutor is valuable, especially at the start of the degree, to guide and advise students with the main aim of easing their adaptation to university life and improving their academic performance. Tutoring is currently voluntary at the FIB: few students request a tutor and few lecturers get involved in this task. On the one hand, many students do not see the need for a tutor. On the other hand, some lecturers have neither the material nor the training that would be desirable for this task, and others do not believe in its usefulness. The aim of this project is to draw up a Tutoring Plan for the FIB with the guidelines and material needed to carry out this task. In this way we hope to attract more tutors and bring about an improvement in academic performance. Peer Reviewed

    Tareador: a tool to unveil parallelization strategies at undergraduate level

    This paper presents a methodology and framework designed to assist students in the process of finding appropriate task decomposition strategies for their sequential program, as well as identifying bottlenecks in the later execution of the parallel program. One of the main components of this framework is Tareador, which provides a simple API to specify potential task decomposition strategies for a sequential program. Once the student proposes how to break the sequential code into tasks, Tareador 1) provides information about the dependences between tasks that should be honored when implementing that task decomposition using a parallel programming model; 2) estimates the potential parallelism that could be achieved in an ideal parallel architecture with infinite processors; and 3) simulates the parallel execution on an ideal architecture, estimating the potential speed-up that could be achieved on a given number of processors. The pedagogical style of the methodology is currently applied to teach parallelism in a third-year compulsory subject in the Bachelor Degree in Informatics Engineering at the Barcelona School of Informatics of the Universitat Politècnica de Catalunya (UPC) - BarcelonaTech. Peer Reviewed. Postprint (published version)
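
    The "potential parallelism on an ideal architecture with infinite processors" mentioned in this abstract is commonly computed as total work divided by the length of the critical path of the task dependence graph. The sketch below shows that calculation under stated assumptions: it does not use Tareador's API, and the function name `ideal_parallelism`, the task names and the cost values are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def ideal_parallelism(cost, deps):
    """Return (T1, T_inf, T1 / T_inf) for a task graph.

    cost: task name -> execution time.
    deps: task name -> list of tasks that must finish first.
    """
    # Give every task an entry so that independent tasks are not dropped.
    graph = {task: deps.get(task, ()) for task in cost}

    finish = {}
    for task in TopologicalSorter(graph).static_order():
        # With unlimited processors a task starts as soon as its last dependence ends.
        start = max((finish[d] for d in graph[task]), default=0.0)
        finish[task] = start + cost[task]

    t1 = sum(cost.values())       # time on one processor (total work)
    t_inf = max(finish.values())  # time on infinitely many processors (critical path)
    return t1, t_inf, t1 / t_inf


# Hypothetical decomposition: two independent tasks between a setup and a merge step.
cost = {"init": 2, "a": 4, "b": 4, "merge": 2}
deps = {"a": ["init"], "b": ["init"], "merge": ["a", "b"]}
t1, t_inf, parallelism = ideal_parallelism(cost, deps)  # 12, 8, 1.5
```

    A parallelism of 1.5 here means that no number of processors can speed this decomposition up by more than 1.5x, which is the kind of bound a student can use to compare alternative task decompositions before writing any parallel code.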