10 research outputs found

    On the Enhancement of Remote GPU Virtualization in High Performance Clusters

    Full text link
    Graphics Processing Units (GPUs) are being adopted in many computing facilities given their extraordinary computing power, which makes it possible to accelerate many general-purpose applications from different domains. However, GPUs also present several drawbacks, such as increased acquisition costs and larger space requirements. They also require more powerful power supplies. Furthermore, GPUs still consume some energy while idle, and their utilization is usually low for most workloads. In a similar way to virtual machines, the use of virtual GPUs may address these concerns. In this regard, the remote GPU virtualization mechanism allows an application being executed on one node of a cluster to transparently use the GPUs installed at other nodes. Moreover, this technique allows the GPUs present in the computing facility to be shared among the applications being executed in the cluster. In this way, several applications running on different (or the same) cluster nodes can share one or more GPUs located in other nodes of the cluster. Sharing GPUs should increase overall GPU utilization, thus reducing the negative impact of the drawbacks mentioned before, and may also make it possible to reduce the total number of GPUs installed in the cluster. In this dissertation we enhance a framework offering remote GPU virtualization capabilities, referred to as rCUDA, for use in high-performance clusters. While the initial prototype version of rCUDA demonstrated its functionality, it also revealed concerns with respect to usability, performance, and support for new GPU features, which prevented its use in production environments. These issues motivated this thesis, in which all the research is primarily conducted with the aim of turning rCUDA into a production-ready solution that can eventually be transferred to industry. The new version of rCUDA resulting from this work reduces the execution time of the applications analyzed by up to 35% with respect to the initial version. Compared to the use of local GPUs, the overhead of this new version of rCUDA is below 5% for the applications studied when using the latest high-performance computing networks available.
    Reaño González, C. (2017). On the Enhancement of Remote GPU Virtualization in High Performance Clusters [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/86219
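    The transparency described above comes from the way rCUDA replaces the CUDA runtime library with a client that forwards each API call over the network to a server running on the GPU node. The following C sketch is my own illustration of that idea, not code from the thesis: it is an ordinary CUDA runtime program that needs no changes to run against a remote GPU, and the RCUDA_DEVICE_COUNT / RCUDA_DEVICE_0 environment variables in the comments are quoted from rCUDA's documentation as I recall it, so treat them as assumptions.

```c
/* Plain CUDA runtime API usage in C; no kernel launch is needed to
   illustrate interception of device queries, allocation, and transfers. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    /* Under rCUDA this call is answered by the remote rCUDA server,
       e.g. after (assumed names):
         export RCUDA_DEVICE_COUNT=1
         export RCUDA_DEVICE_0=gpunode:0                               */
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no (local or remote) GPU visible\n");
        return 1;
    }

    char host[16] = "hello";
    char back[16] = { 0 };
    void *dev = NULL;

    /* Each runtime call below becomes a request/response over the
       cluster network when rCUDA's replacement library is in place. */
    cudaMalloc(&dev, sizeof host);
    cudaMemcpy(dev, host, sizeof host, cudaMemcpyHostToDevice);
    cudaMemcpy(back, dev, sizeof back, cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("round trip through (possibly remote) GPU memory: %s\n", back);
    return 0;
}
```

    Compiled with gcc and linked against libcudart, the binary uses a local GPU; pointed at rCUDA's replacement library instead, the same unmodified binary drives a GPU on another node, which is the transparency the abstract claims.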

    Resource provisioning in Science Clouds: Requirements and challenges

    Full text link
    Cloud computing has permeated the information technology industry in the last few years and is now emerging in scientific environments. Science user communities demand a broad range of computing power to satisfy the needs of high-performance applications, drawing on local clusters, high-performance computing systems, and computing grids. Different computational models generate different workloads, and the cloud is already considered a promising paradigm for serving them. The scheduling and allocation of resources is always a challenging matter in any form of computation, and clouds are no exception. Scientific applications have unique features that differentiate their workloads; hence, their requirements have to be taken into consideration when building a Science Cloud. This paper discusses the main scheduling and resource allocation challenges for any Infrastructure-as-a-Service provider supporting scientific applications.

    An application-level solution for the dynamic reconfiguration of MPI applications

    Get PDF
    Current parallel environments aggregate large numbers of computational resources with a high rate of change in their availability and load conditions. To obtain the best performance on this type of infrastructure, parallel applications must be able to adapt to these changing conditions. This paper presents an application-level proposal to automatically and transparently adapt MPI applications to the available resources. Experimental results show a good degree of adaptability and good performance in different availability scenarios.
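    The paper's own adaptation machinery is not reproduced here, but the MPI primitive that application-level reconfiguration typically builds on is dynamic process creation. Below is a minimal sketch under stated assumptions: ./worker is a hypothetical worker binary that calls MPI_Comm_get_parent and MPI_Comm_disconnect on its side, and available_extra_slots() stands in for whatever resource probe a real system would use.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Hypothetical placeholder: a real system would query the scheduler or
   a load monitor; here it just reads an environment variable. */
static int available_extra_slots(void)
{
    const char *s = getenv("EXTRA_SLOTS");
    return s ? atoi(s) : 0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int extra = available_extra_slots();
    if (extra > 0) {
        MPI_Comm workers; /* intercommunicator to the spawned processes */
        /* Collective over MPI_COMM_WORLD: every existing rank calls it. */
        MPI_Comm_spawn("./worker", MPI_ARGV_NULL, extra, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &workers, MPI_ERRCODES_IGNORE);
        if (rank == 0)
            printf("grew the computation by %d worker(s)\n", extra);
        /* The assumed ./worker disconnects from its parent communicator,
           so both sides of the intercommunicator tear down cleanly. */
        MPI_Comm_disconnect(&workers);
    }

    MPI_Finalize();
    return 0;
}
```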

    PCIe Device Lending

    Get PDF
    We have developed a proof of concept for allowing a PCI Express device attached to one computer to be used by another computer, without any software intermediary on the data path. The device driver runs on a physically separate machine from the device, but our implementation allows the device driver and device to communicate as if they were in the same machine, without modifying either the driver or the device. The kernel and higher-level software can utilize the device as if it were a local device. A device will not be used by two separate machines at the same time, but a machine can transfer control of a local device to a remote machine. We have named this concept "device lending". We envision that machines will have, in addition to local PCIe devices, access to a pool of remote PCIe devices. When a machine needs more device resources, additional devices can be dynamically borrowed from other machines with devices to spare. These devices can be located in a dedicated external cabinet or be inserted into internal slots in a normal computer. Device lending is implemented using a Non-Transparent Bridge (NTB), a native PCIe interconnect that should offer performance close to that of a locally connected device. Devices that are not currently being lent to another host are not affected in any way. NTBs are available as add-ons for any PCIe-based computer and are included in newer Intel Xeon CPUs. Our proof of concept was implemented for Linux, on top of the APIs provided by our NTB vendor, Dolphin. The host borrowing a device runs a kernel module that provides the necessary software support, and the other host runs a user-space daemon. No additional software modifications or hardware are required, nor is special support from the devices. The current implementation works with some devices but has problems with others. We believe, however, that we have identified the problems and how to resolve them. In a later implementation, we believe that all the devices we have tested can be made to work correctly and with very high performance.
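    As a rough mental model of the mechanism (my own self-contained illustration; real device lending goes through the NTB vendor's API, e.g. Dolphin's, which is not shown), the non-transparent bridge exposes an aperture in the borrowing host's address space and translates every access through it onto the lending host's device BAR:

```c
#include <stdio.h>
#include <stdint.h>

/* Toy model of a non-transparent bridge window: an aperture in the
   borrowing host's address space mapped onto the lending host's device
   BAR. All addresses below are made up for illustration. */
struct nt_window {
    uint64_t aperture_base; /* where the window appears on the borrower */
    uint64_t aperture_size;
    uint64_t remote_base;   /* device BAR base on the lending host */
};

/* Translate a borrower-side address to the lender-side address: the
   fixed-offset arithmetic the bridge hardware applies to each access. */
static int nt_translate(const struct nt_window *w, uint64_t addr,
                        uint64_t *out)
{
    if (addr < w->aperture_base ||
        addr >= w->aperture_base + w->aperture_size)
        return -1; /* outside the window: not forwarded */
    *out = w->remote_base + (addr - w->aperture_base);
    return 0;
}

int main(void)
{
    struct nt_window w = { 0xf0000000ull, 0x100000ull, 0xd8000000ull };
    uint64_t remote;
    if (nt_translate(&w, 0xf0001040ull, &remote) == 0)
        printf("borrower 0xf0001040 -> lender 0x%llx\n",
               (unsigned long long)remote);
    return 0;
}
```

    Because the translation is done by the bridge itself, neither the driver nor the device sees anything other than ordinary PCIe reads and writes, which is why no software sits on the data path.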

    Design and Evaluation of Low-Latency Communication Middleware on High Performance Computing Systems

    Get PDF
    The use of Java for parallel computing is becoming more promising owing to its appealing features, particularly its multithreading support, portability, easy-to-learn properties, high programming productivity, and the noticeable improvement in its computational performance. However, parallel Java applications generally suffer from inefficient communication middleware, most of which uses socket-based protocols that are unable to take full advantage of high-speed networks, hindering the adoption of Java in the High Performance Computing (HPC) area. This PhD thesis presents the design, development, and evaluation of scalable Java communication solutions that overcome these constraints. Hence, we have implemented several low-level message-passing devices that fully exploit the underlying network hardware while taking advantage of Remote Direct Memory Access (RDMA) operations to provide low-latency communications. Moreover, we have developed a production-quality Java message-passing middleware, FastMPJ, in which the devices have been integrated seamlessly, thus allowing the productive development of Message-Passing in Java (MPJ) applications. The performance evaluation has shown that FastMPJ communication primitives are competitive with native message-passing libraries, significantly improving the scalability of MPJ applications. Furthermore, this thesis has analyzed the potential of cloud computing for spreading the outreach of HPC, where Infrastructure as a Service (IaaS) offerings have emerged as a feasible alternative to traditional HPC systems. Several cloud resources from the leading IaaS provider, Amazon EC2, which specifically target HPC workloads, have been thoroughly assessed. The experimental results have shown the significant impact that virtualized environments still have on network performance, which hampers porting communication-intensive codes to the cloud. The key is the availability of proper virtualization support, such as direct access to the network hardware, along with the guidelines for performance optimization suggested in this thesis.
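    FastMPJ itself is Java, but the RDMA capability its low-level devices exploit is what the native verbs API exposes. The C sketch below (illustrative only, with error handling mostly trimmed) shows the memory-registration step that lets the NIC read and write a buffer directly, which is what removes the socket-layer copies from the critical path:

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA-capable device found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Pin and register a buffer with the NIC: afterwards the hardware
       can read/write it directly (RDMA), with no kernel or socket copy. */
    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        fprintf(stderr, "memory registration failed\n");
        return 1;
    }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           len, mr->lkey, mr->rkey);

    /* Queue pair setup and the actual RDMA read/write verbs would
       follow; omitted to keep the sketch minimal. */
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```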

    Towards Modern, Accessible and Dynamic HPC Using Container-based Virtual Clusters

    Get PDF
    In this thesis, a novel Virtual Container Cluster (VCC) framework is presented. Despite the growing popularity of container virtualisation as a way to increase the flexibility of the software stack, runtime environment virtualisation still poses significant portability challenges; by depending on the underlying cluster execution paradigm, a niche class of HPC-only containers has emerged. This trend is detrimental to reusability and reproducibility, and discourages new communities from entering HPC. Traditional virtualisation techniques have a rich history within HPC and have been demonstrated to offer much more than software flexibility. A Virtual Machine by nature requires an OS and a full-stack environment akin to a physical machine, which allows it to be instantiated regardless of the underlying machine and the services it provides. This capability is essential for implementing job forwarding and spanning, where the burden of an entire job can be transferred or shared between heterogeneous cluster systems, with a high level of confidence that the environments will be compatible. In turn, this brings improvements to global resource performance, reducing job turnaround time and increasing cluster utilisation. The VCC is an innovative solution that combines the full-stack and container virtualisation approaches. It therefore offers both the flexibility of containers and the improved portability, performance, and scalability of the full-stack approach. To maintain the same accessibility and low barrier to entry as the runtime environment approach, the design incorporates an autonomous configuration and contextualisation mechanism, along with a Software Defined Networking technology, to ensure the full-stack container does not place an additional burden on the user. Its usefulness and performance are validated through benchmarking and two case studies: virtual clusters in the classroom and inter-institutional spanning.

    Enabling Fairness in Cloud Computing Infrastructures

    Full text link
    Cloud computing has emerged as a key technology in many ways over the past few years, evidenced by the fact that 93% of organizations are either running applications or experimenting with Infrastructure-as-a-Service (IaaS) clouds. Hence, to meet the demands of a large target audience, IaaS cloud service providers consolidate applications belonging to multiple tenants. However, consolidation leads to applications interfering with one another, as they end up competing for shared resources and violating the QoS of the executing tenants. This dissertation investigates the implications of interference in consolidated cloud computing environments in order to enable fairness in the execution of applications across tenants. In this context, this dissertation identifies three key issues in cloud computing infrastructures. We observe that tenants using IaaS public clouds share multi-core datacenter servers. In such a situation, we identify that applications belonging to different tenants might compete for shared architectural resources like the Last Level Cache (LLC) and memory bandwidth, slowing down the execution of applications. This necessitates a technique that can accurately estimate the slowdown in execution time caused by multi-tenant execution. Such slowdown estimates can be used to bill tenants appropriately, enabling fairness among tenants. For private datacenters, where performance degradation cannot be tolerated, it becomes critical to detect interference and investigate its root cause. Under such circumstances, there is a need for a real-time, lightweight, and scalable mechanism that can detect performance degradation and identify the root-cause resource that applications are contending for (I/O, network, CPU, shared cache). Finally, the advent of microservice computing environments calls for rethinking resource management strategies in multi-tenant execution scenarios. Specifically, we observe that the visibility enabled by a microservices execution framework can be exploited to achieve high throughput and resource utilization while still meeting Service Level Agreements (SLAs) in multi-tenant execution scenarios. To enable this, we propose techniques that dynamically batch and reorder requests propagating through individual microservice stages within an application.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/149844/1/ramsri_1.pd
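    The batching-and-reordering idea mentioned above can be sketched independently of any microservice framework: hold the requests pending at a stage, order them by remaining SLA slack (earliest deadline first), and release them as one batch. A minimal, self-contained C illustration follows; the request structure and its deadline field are my assumptions, not interfaces from the dissertation:

```c
#include <stdio.h>
#include <stdlib.h>

/* A pending request at one microservice stage; deadline_ms encodes the
   remaining SLA slack (smaller = more urgent). Field names are invented. */
struct request {
    int id;
    int deadline_ms;
};

static int by_deadline(const void *a, const void *b)
{
    return ((const struct request *)a)->deadline_ms -
           ((const struct request *)b)->deadline_ms;
}

/* Reorder the pending requests earliest-deadline-first and dispatch them
   as one batch, amortizing per-request overhead at the next stage. */
static void dispatch_batch(struct request *pending, int n)
{
    qsort(pending, n, sizeof *pending, by_deadline);
    printf("batch of %d:", n);
    for (int i = 0; i < n; i++)
        printf(" #%d(d=%dms)", pending[i].id, pending[i].deadline_ms);
    printf("\n");
}

int main(void)
{
    struct request pending[] = { { 1, 80 }, { 2, 15 }, { 3, 40 }, { 4, 15 } };
    dispatch_batch(pending, 4);
    return 0;
}
```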