26 research outputs found

    Tracing and profiling machine learning dataflow applications on GPU

    In this paper, we propose a profiling and tracing method for dataflow applications with GPU acceleration. Dataflow models can be represented by graphs and are widely used in many domains such as signal processing and machine learning. Within the graph, the data flows along the edges, and the nodes correspond to the computing units that process the data. To accelerate the execution, co-processing units such as GPUs are often used for compute-intensive nodes. The work in this paper aims at providing useful information about the execution of the dataflow graph on the available hardware, in order to understand and possibly improve the performance. The collected traces include low-level information about the CPU from the Linux kernel (system calls), as well as mid-level and high-level information about intermediate libraries such as CUDA, HIP or HSA, and about the dataflow model itself. This is followed by post-mortem analysis and visualization steps to enhance the trace and present useful information to the user. To demonstrate the effectiveness of the method, it was evaluated for TensorFlow, a well-known machine learning library that uses a dataflow computational graph to represent its algorithms. We present a few examples of machine learning applications that can be optimized with the help of the information provided by our method. For example, we reduce the execution time of a face recognition application by a factor of five, we suggest a better placement of the computation nodes on the available hardware components for a distributed application, and we improve the memory management of an application to speed up its execution.
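    As a point of reference for the high-level layer of such a trace, the sketch below collects a timeline of a toy TensorFlow dataflow graph with TensorFlow's own profiler (TF 2.x API). This is not the instrumentation proposed in the paper, which additionally correlates kernel-, driver- and library-level events (LTTng, CUDA/HIP/HSA); the log directory and workload are placeholders.

```python
import tensorflow as tf

tf.profiler.experimental.start("logdir")      # placeholder trace directory

@tf.function
def step(x, w):
    # Toy dataflow graph: a matrix multiplication followed by an activation.
    return tf.nn.relu(tf.matmul(x, w))

x = tf.random.normal([1024, 1024])
w = tf.random.normal([1024, 1024])
for _ in range(10):
    step(x, w)                                # runs on the GPU if one is visible

tf.profiler.experimental.stop()               # timeline can be inspected in TensorBoard
```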

    Harnessing the Power of Digital Transformation, Artificial Intelligence and Big Data Analytics with Parallel Computing

    Traditionally, 2D and especially 3D forward modeling and inversion of large geophysical datasets have been performed on supercomputing clusters, because the computing time required on a PC was prohibitive. With the introduction of parallel computing, attempts have been made to perform these computationally intensive tasks on PCs, or on clusters of personal computers, where the computing power was provided by the Central Processing Unit (CPU). This is further enhanced by the Graphical Processing Unit (GPU), which has become affordable with the launch of GPU-based computing devices. This paper therefore presents a didactic approach to learning and applying parallel computing with General Purpose Graphical Processing Units (GPGPU), together with preliminary tests on migrating existing sequential codes, initially for 2D forward modeling of geophysical datasets. There are many challenges in performing these tasks, mainly due to the lack of some necessary software development tools, but the preliminary findings are promising.
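    To make the migration concrete, the sketch below shows a single explicit 2D finite-difference update written as an OpenCL kernel and launched from Python with PyOpenCL, one work-item per interior grid point. It is purely illustrative and not taken from the paper; the grid size, coefficient and kernel are placeholder choices standing in for one step of a 2D forward-modeling code.

```python
import numpy as np
import pyopencl as cl

nx, ny = 512, 512
u = np.random.rand(nx, ny).astype(np.float32)     # placeholder 2D field

kernel_src = """
__kernel void laplacian_step(__global const float *u, __global float *out,
                             const int nx, const int ny, const float alpha)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
        int c = i * ny + j;    /* row-major index of this grid point */
        out[c] = u[c] + alpha * (u[c - ny] + u[c + ny] + u[c - 1] + u[c + 1] - 4.0f * u[c]);
    }
}
"""

ctx = cl.create_some_context()                    # picks an available OpenCL device
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, kernel_src).build()

mf = cl.mem_flags
u_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=u)
out_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=u)

# Launch one work-item per grid point; boundary points keep their initial values.
prg.laplacian_step(queue, u.shape, None, u_buf, out_buf,
                   np.int32(nx), np.int32(ny), np.float32(0.1))
result = np.empty_like(u)
cl.enqueue_copy(queue, result, out_buf)
```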

    Tracing and profiling of dataflow machine learning applications using a graphics processing unit

    Computing requirements have been increasing rapidly in areas such as scientific computing, video games, graphics rendering and artificial intelligence, which typically involve processing large amounts of data as fast as possible. Unfortunately, hardware improvements have recently slowed down: CPU clock speeds are no longer rising much, constrained by physical limits such as heat dissipation and transistor feature size. Consequently, parallel processing on heterogeneous architectures, where traditional processors are supported by other computing units such as graphics processors, has become popular. These architectures combine several computing units, possibly of different types, and offer highly parallel operation, but exploiting all of the hardware efficiently remains difficult and programming them is challenging. Different models have therefore emerged, notably dataflow approaches, which are inherently parallel and make it easier to program the various computing units so as to benefit as much as possible from the available hardware. In this context, guaranteeing optimal performance is another major concern, and tracing and profiling both the central and the graphics processing units are two useful techniques for diagnosing potential problems. Several tools exist, such as LTTng and Ftrace, that trace the operating system and focus on the central processor. In addition, the proprietary, closed-source tools offered by hardware vendors can help analyze and monitor the graphics processor, and are generally the most complete and the ones preferred by programmers. However, these tools are tied to one vendor's hardware and offer limited flexibility, with fixed analyses and visualizations that cannot be modified or extended to match a user's needs. Moreover, none of them specifically target dataflow applications executed on a heterogeneous platform.
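    The post-mortem analysis step described in this line of work can be as simple as aggregating time per operation from an exported trace. The hedged sketch below assumes a trace in the Chrome trace event format, which TensorFlow timelines and several GPU profilers can emit; the file name trace.json is hypothetical, and this is not the thesis' own analysis pipeline.

```python
import json
from collections import defaultdict

with open("trace.json") as f:
    events = json.load(f)["traceEvents"]

total_us = defaultdict(float)
for ev in events:
    if ev.get("ph") == "X":                    # "complete" events carry a duration
        total_us[ev.get("name", "?")] += ev.get("dur", 0.0)

# Print the 20 most expensive operations by accumulated duration.
for name, dur in sorted(total_us.items(), key=lambda kv: kv[1], reverse=True)[:20]:
    print(f"{dur / 1000.0:10.3f} ms  {name}")
```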

    Optimization of performance and energy efficiency in massively parallel systems

    Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, and are present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means that they are usually programmed under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, and makes it difficult to adapt applications. Co-execution allows all devices to compute the same problem simultaneously, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, which significantly complicates programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address generally conflicting objectives: usability and programmability are improved, while ensuring greater system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions and molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that adds co-execution support to the oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.
    Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The work of Chapter 4, Integration II: Hybrid programming models, has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
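    The co-execution idea can be illustrated with a small, purely conceptual sketch: two workers standing in for a CPU and a GPU pull chunks of a single iteration space from a shared queue, so the faster device naturally ends up with a larger share of the work. This is not the EngineCL or CoexecutorRuntime API; the chunk size and the placeholder "kernels" are invented for illustration.

```python
import queue
import threading

N = 1_000_000
CHUNK = 50_000
work = queue.Queue()
for start in range(0, N, CHUNK):
    work.put((start, min(start + CHUNK, N)))   # split the iteration space into chunks

done = {"cpu": 0, "gpu": 0}                    # how much each "device" processed

def device_worker(name, compute_chunk):
    # Each device keeps pulling chunks until the shared queue is empty.
    while True:
        try:
            start, end = work.get_nowait()
        except queue.Empty:
            return
        compute_chunk(start, end)              # stand-in for a real kernel launch
        done[name] += end - start

# Placeholder kernels: in a real runtime these would be OpenCL/SYCL submissions.
cpu = threading.Thread(target=device_worker, args=("cpu", lambda s, e: sum(range(s, e))))
gpu = threading.Thread(target=device_worker, args=("gpu", lambda s, e: None))
cpu.start(); gpu.start(); cpu.join(); gpu.join()
print(done)                                    # share of the work taken by each device
```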

    Comparison of Four Bar Linkage Angle Computation Speed Between CPU and GPU

    The four-bar linkage is one of the oldest mechanisms studied by humankind. Among its advantages are its relative simplicity and its versatility. Because of its large number of parameter combinations, it is important to compute as many positions as possible in order to study the movement of a four-bar linkage. This paper compares the speed of an Intel i7-4790 CPU and an AMD Radeon R9-280 GPU in computing the coupler angle of a four-bar linkage. The study found that the R9-280 computes the coupler angle 5 times faster than the i7-4790 for a relatively small number of points (hundreds of thousands) and 8 times faster for a large number of points (tens of millions). Keywords: Four Bar Linkage, Computation, Mechanism, OpenCL
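    For reference, one common closed-form solution for the coupler angle (open configuration) is sketched below with NumPy on the CPU, evaluated for a large batch of crank angles at once. The link lengths and point count are arbitrary example values and this is not the paper's code; the same per-point arithmetic is what an OpenCL kernel would evaluate on the GPU, one work-item per crank angle.

```python
import numpy as np

# Example link lengths (not from the paper): ground, crank, coupler, rocker.
r1, r2, r3, r4 = 4.0, 1.5, 3.5, 3.0
theta2 = np.linspace(0.0, 2.0 * np.pi, 1_000_000, dtype=np.float32)  # crank angles

# Length of the diagonal from the crank pin to the rocker pivot (law of cosines).
s = np.sqrt(r1 ** 2 + r2 ** 2 - 2.0 * r1 * r2 * np.cos(theta2))
# Angle of that diagonal, measured at the crank pin.
psi = np.arctan2(-r2 * np.sin(theta2), r1 - r2 * np.cos(theta2))
# Interior angle between the coupler and the diagonal (law of cosines again).
beta = np.arccos(np.clip((r3 ** 2 + s ** 2 - r4 ** 2) / (2.0 * r3 * s), -1.0, 1.0))
theta3 = psi + beta        # coupler angle, open-configuration branch
```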

    Parallelization of the training of neural networks in heterogeneous systems

    In recent years, the use of artificial neural networks has increased significantly, due to their flexibility and adaptability to a myriad of tasks. However, for a neural network to work correctly, a prior training process is necessary, in which the model learns to identify patterns in the input data it receives. This process is long and computationally expensive, due to the complexity of its operations and the sheer amount of training data required. To mitigate this, several techniques have appeared over the years that attempt to reduce training time, complexity and energy consumption. One of the most common is the use of dedicated accelerators, such as GPUs, instead of conventional processors, because of their higher speed and better energy efficiency on certain tasks relative to general-purpose processors. However, due to the rising complexity of neural networks and of the problems they attempt to solve, a single accelerator has become insufficient, making it necessary to parallelize training across several devices and to distribute the workload between them so that performance is optimized. Many parallelization techniques exist and almost every ML framework implements its own distribution strategies, so knowing which one is best for each situation, and what its effects are, has become difficult. In this project, the impact of these parallelization techniques on the training process of a neural network is studied. Several neural network frameworks and their parallelization strategies were evaluated, and the most suitable one, PyTorch, was chosen for the rest of the project. A benchmark was then developed that trains a ResNet-34 model on an image classification dataset, recording metrics such as the end-to-end training time, the duration of each training phase, the evolution of the model's accuracy over time, and the final accuracy obtained. To gain more insight into these metrics, experiments were designed and conducted around this benchmark, running it in different execution environments (CPU only, a single GPU, several GPUs in parallel, etc.) and recording both its outputs and its energy consumption. A hybrid parallelization model is also proposed and evaluated, in which the available GPUs are used in conjunction with the CPU to train the network, giving each device a copy of the model and a subset of the training data, in order to determine whether the approach is viable. The results obtained from these experiments are positive: the scalability of the parallelized part is almost linear, and energy consumption has not increased significantly as a result of the parallelization, so the energy efficiency of this paradigm is almost double that of non-distributed training.
    Grado en Ingeniería Informática
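    As a rough sketch of the kind of data-parallel setup such a benchmark builds on (not the thesis benchmark itself), the following trains ResNet-34 with PyTorch DistributedDataParallel, one process per GPU. The dataset (CIFAR-10), batch size, optimizer settings and epoch count are illustrative placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import transforms

def worker(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
    dataset = torchvision.datasets.CIFAR10("data", train=True, download=True,
                                           transform=transform)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler, num_workers=4)

    model = torchvision.models.resnet34(num_classes=10).cuda(rank)
    model = DDP(model, device_ids=[rank])            # replicate the model per process
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(2):                           # placeholder epoch count
        sampler.set_epoch(epoch)                     # reshuffle the shards each epoch
        for images, labels in loader:
            images, labels = images.cuda(rank), labels.cuda(rank)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()                          # gradients are all-reduced here
            optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```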