9 research outputs found

    OpenCL Actors - Adding Data Parallelism to Actor-based Programming with CAF

    Full text link
    The actor model of computation was designed for seamless support of concurrency and distribution. However, it remains unspecific about data-parallel program flows, while the available processing power of modern many-core hardware such as graphics processing units (GPUs) or coprocessors increases the relevance of data parallelism for general-purpose computation. In this work, we introduce OpenCL-enabled actors to the C++ Actor Framework (CAF). This offers a high-level interface for accessing any OpenCL device without leaving the actor paradigm. The new type of actor is integrated into the runtime environment of CAF and gives rise to transparent message passing in distributed systems on heterogeneous hardware. Following the actor logic in CAF, OpenCL kernels can be composed while encapsulated in C++ actors, and hence operate in a multi-stage fashion on data resident on the GPU. Developers are thus enabled to build complex data-parallel programs from primitives without leaving the actor paradigm or sacrificing performance. Our evaluations on commodity GPUs, an Nvidia TESLA, and an Intel PHI reveal the expected linear scaling behavior when offloading larger workloads. For sub-second duties, the efficiency of offloading was found to differ largely between devices. Moreover, our findings indicate a negligible overhead over programming with the native OpenCL API. (28 pages.)
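
    To make concrete the boilerplate that such an OpenCL actor hides behind ordinary message passing, the following minimal host-side sketch uses the raw OpenCL C API from C++ (standard OpenCL 1.2 calls; the kernel and all names are illustrative and not taken from CAF, and error handling is elided):

        // Minimal raw OpenCL host-side sketch (C API, error handling elided).
        #include <CL/cl.h>
        #include <vector>
        #include <cstdio>

        static const char* kSource =
            "__kernel void square(__global const float* in, __global float* out) {"
            "  size_t i = get_global_id(0);"
            "  out[i] = in[i] * in[i];"
            "}";

        int main() {
            const size_t n = 1024;
            std::vector<float> host_in(n, 2.0f), host_out(n);

            cl_platform_id platform; cl_device_id device;
            clGetPlatformIDs(1, &platform, nullptr);
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

            cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
            cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);

            // Compile the kernel source at runtime and create the kernel object.
            cl_program prog = clCreateProgramWithSource(ctx, 1, &kSource, nullptr, nullptr);
            clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
            cl_kernel kernel = clCreateKernel(prog, "square", nullptr);

            // Allocate device buffers and bind them as kernel arguments.
            cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                          n * sizeof(float), host_in.data(), nullptr);
            cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float),
                                          nullptr, nullptr);
            clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
            clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);

            // Launch over n work items and read the result back synchronously.
            clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
            clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n * sizeof(float),
                                host_out.data(), 0, nullptr, nullptr);

            std::printf("out[0] = %f\n", host_out[0]);
            // Cleanup calls (clReleaseMemObject, clReleaseKernel, ...) omitted.
            return 0;
        }

    In CAF's model, as described in the abstract, this setup collapses into spawning an actor that owns the kernel and receives its inputs as messages, so kernels can be chained without the host-side plumbing above.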

    CUDA implementation of integration rules within an hp-finite element code

    Get PDF
    With the introduction in 2006 of the CUDA architecture for Nvidia GPUs, a new programming model was born. A large number of articles indicate that this new programming model on a new architecture achieves better performance than previous implementations in traditional languages for CPUs. In this work the author tries to show the capabilities of GPU computing. To perform such a task, an hp finite element integration method is implemented both in CUDA and in the C language. After implementation, parallel executions on the CPU and GPU are compared to determine whether it is worthwhile to create new algorithms under this architecture
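
    The integration rules in question are numerical quadratures applied element by element; as a point of reference for the kind of computation being offloaded, here is a minimal C++ sketch of 1D Gauss-Legendre quadrature on a single element (my own illustration, not code from the thesis):

        #include <array>
        #include <cmath>
        #include <cstdio>
        #include <functional>

        // 3-point Gauss-Legendre rule on the reference interval [-1, 1].
        // Nodes and weights are standard; exact for polynomials up to degree 5.
        double gauss3(const std::function<double(double)>& f, double a, double b) {
            static const std::array<double, 3> x = {-std::sqrt(3.0 / 5.0), 0.0,
                                                    std::sqrt(3.0 / 5.0)};
            static const std::array<double, 3> w = {5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0};
            const double mid = 0.5 * (a + b), half = 0.5 * (b - a); // map to [a, b]
            double sum = 0.0;
            for (int i = 0; i < 3; ++i)
                sum += w[i] * f(mid + half * x[i]);
            return half * sum;  // scale by the Jacobian of the affine mapping
        }

        int main() {
            // Integrate x^2 over [0, 2]; the exact value is 8/3.
            double result = gauss3([](double x) { return x * x; }, 0.0, 2.0);
            std::printf("integral = %.12f (exact = %.12f)\n", result, 8.0 / 3.0);
        }

    In the hp-setting such rules are applied independently per element, with the rule order tied to the local polynomial degree p, which is exactly the kind of large, regular workload that maps well to GPU threads.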

    Exploiting Heterogeneous Parallelism With the Heterogeneous Programming Library

    Get PDF
    [Abstract] While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper, we present the Heterogeneous Programming Library (HPL), which permits the development of heterogeneous applications addressing both portability and programmability while not sacrificing high performance. This is achieved by means of an embedded language and data types provided by the library, with which generic computations to be run on heterogeneous devices can be expressed. A comparison with OpenCL in terms of programmability and performance shows that both approaches offer very similar performance, while outlining the programmability advantages of HPL. This work was funded by the Xunta de Galicia under the project "Consolidación e Estructuración de Unidades de Investigación Competitivas" 2010/06 and the MICINN, cofunded by FEDER funds, under grant TIN2010-16735. Zeki Bozkus is funded by the Scientific and Technological Research Council of Turkey (TUBITAK; 112E191)
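
    A schematic example of HPL's embedded language, modeled on the SAXPY samples in the published HPL papers; the names Array, Float, idx, and eval appear there, but the exact signatures and header name are assumptions, so treat this as a sketch rather than verified HPL code:

        #include "HPL.h"   // assumed header name
        using namespace HPL;

        // Kernel written in HPL's embedded language: ordinary C++ that the
        // library captures and translates to OpenCL behind the scenes.
        // `idx` is HPL's predefined global work-item index.
        void saxpy(Array<float, 1> y, Array<float, 1> x, Float a) {
            y[idx] = a * x[idx] + y[idx];
        }

        int main() {
            Array<float, 1> x(1000), y(1000);
            Float a = 2.0f;
            for (int i = 0; i < 1000; ++i) {  // host-side element access
                x(i) = static_cast<float>(i);
                y(i) = 0.0f;
            }
            // Run on an accelerator; host-device transfers are handled by HPL.
            eval(saxpy)(y, x, a);
        }

    The contrast with the raw OpenCL sketch shown earlier is the point: device discovery, buffer management, and kernel compilation disappear behind the library's data types.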

    Graphics processing unit utilization in circuit simulation

    Get PDF
    Graphics processing units (GPUs) of today include hundreds of multi-threaded, multi-core processors and a complex, high-bandwidth memory architecture, making them a good alternative for speeding up general-purpose parallel computation where large data quantities are processed with the same functions. Some successful applications of GPU computation have also been introduced in the field of circuit simulation. The objective of this thesis is to examine the GPU's computing potential in the APLAC circuit simulation software. The realization of a diode model on a GPU device is also presented. The nonlinear diode model was implemented on NVIDIA's Compute Unified Device Architecture (CUDA), which is a single-instruction, multiple-thread (SIMT) architecture. The CUDA device was programmed using the CUDA C application programming interface, an extension of the standard C language. The test results revealed that due to the diode's simple nonlinearity, its evaluation is computationally too light to gain any speed benefit from the GPU's computation power. The required modifications to the circuit analysis structure and data handling resulted in a marginally longer total simulation time than initially. However, when the diode model is made more complex by multiplying its evaluation, the CUDA implementation is faster than the original model. This gives a rough estimate of how complex a model must be to benefit from GPU computation. Although the diode model evaluation was not faster on the GPU, this implementation is a good foundation for future CUDA applications in APLAC. The next of these will be the computationally more complex BSIM3 transistor model, which will most likely benefit from the computing power of GPU devices
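
    The per-device work discussed here is the standard Shockley diode equation together with its Newton linearization (companion model); the following C++ sketch is my own illustration of that textbook evaluation, not APLAC code:

        #include <cmath>
        #include <cstdio>
        #include <vector>

        // Shockley diode model with Newton linearization (companion model):
        //   I(V) = Is * (exp(V / (n * Vt)) - 1)
        //   Geq  = dI/dV = Is / (n * Vt) * exp(V / (n * Vt))
        //   Ieq  = I(V) - Geq * V   (equivalent current source)
        struct DiodeLin { double Geq, Ieq; };

        DiodeLin evaluate_diode(double V, double Is = 1e-14,
                                double n = 1.0, double Vt = 0.02585) {
            const double e = std::exp(V / (n * Vt));
            const double I = Is * (e - 1.0);
            const double G = Is / (n * Vt) * e;
            return {G, I - G * V};
        }

        int main() {
            // On a GPU this loop becomes one thread per diode; as the thesis
            // notes, the body is too light for offloading to pay off by itself.
            std::vector<double> voltages = {0.5, 0.6, 0.7};
            for (double v : voltages) {
                DiodeLin d = evaluate_diode(v);
                std::printf("V=%.2f  Geq=%.3e  Ieq=%.3e\n", v, d.Geq, d.Ieq);
            }
        }

    The thesis's observation follows directly from this shape: a handful of floating-point operations per device cannot amortize the cost of moving voltages to the GPU and stamped conductances back.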

    Visualization challenges in distributed heterogeneous computing environments

    Get PDF
    Large-scale computing environments are important for many aspects of modern life. They drive scientific research in biology and physics, facilitate industrial rapid prototyping, and provide information relevant to everyday life such as weather forecasts. Their computational power grows steadily to provide faster response times and to satisfy the demand for higher complexity in simulation models as well as more details and higher resolutions in visualizations. For some years now, the prevailing trend for these large systems has been the utilization of additional processors, like graphics processing units. These heterogeneous systems, which employ more than one kind of processor, are becoming increasingly widespread since they provide many benefits, like higher performance or increased energy efficiency. At the same time, they are more challenging and complex to use because the various processing units differ in their architecture and programming model. This heterogeneity is often addressed by abstraction, but existing approaches frequently entail restrictions or are not universally applicable. As these systems also grow in size and complexity, they become more prone to errors and failures. Therefore, developers and users become more interested in resilience besides traditional aspects like performance and usability. While fault tolerance is well researched in general, it is mostly dismissed in distributed visualization or not adapted to its special requirements. Finally, analysis and tuning of these systems and their software are required to assess their status and to improve their performance. The available tools and methods to capture and evaluate the necessary information are often isolated from the context or not designed for interactive use cases. These problems are amplified in heterogeneous computing environments, since more data is available and required for the analysis. Additionally, real-time feedback is required in distributed visualization to correlate user interactions with performance characteristics and to decide on the validity and correctness of the data and its visualization. This thesis presents contributions to all of these aspects. Two approaches to abstraction are explored for general-purpose computing on graphics processing units and for visualization in heterogeneous computing environments. The first approach hides details of different processing units and allows using them in a unified manner. The second approach employs per-pixel linked lists as a generic framework for compositing and for simplifying order-independent transparency in distributed visualization. Traditional methods for fault tolerance in high performance computing systems are discussed in the context of distributed visualization. On this basis, strategies for fault-tolerant distributed visualization are derived and organized in a taxonomy. Example implementations of these strategies, their trade-offs, and the resulting implications are discussed. For analysis, local graph exploration and tuning of volume visualization are evaluated. Challenges in dense graphs, like visual clutter, ambiguity, and the inclusion of additional attributes, are tackled in node-link diagrams using a lens metaphor as well as supplementary views. An exploratory approach for performance analysis and tuning of parallel volume visualization on a large, high-resolution display is evaluated. This thesis takes a broader look at the issues of distributed visualization on large displays and in heterogeneous computing environments for the first time.
    While the presented approaches each solve individual challenges and have been successfully employed in this context, their joint utility forms a solid basis for future research in this young field. In its entirety, this thesis presents building blocks for robust distributed visualization on current and future heterogeneous visualization environments
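
    The per-pixel linked lists mentioned above store, for every pixel, an unordered chain of transparent fragments that a resolve pass sorts and blends; the following compact C++ sketch illustrates the data structure and the resolve step (GPU implementations append nodes with atomic counters in shaders; all names here are illustrative):

        #include <algorithm>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        // One transparent fragment: premultiplied RGBA color, depth, and the
        // index of the next fragment in this pixel's list (-1 terminates).
        struct Fragment {
            float rgba[4];
            float depth;
            int32_t next;
        };

        struct PPLLBuffers {
            std::vector<int32_t> head;    // one head index per pixel, -1 = empty
            std::vector<Fragment> nodes;  // shared node pool

            explicit PPLLBuffers(size_t pixels) : head(pixels, -1) {}

            // Rendering: push a node and link it at the head of the pixel's list.
            void insert(size_t pixel, const Fragment& f) {
                Fragment node = f;
                node.next = head[pixel];
                nodes.push_back(node);
                head[pixel] = static_cast<int32_t>(nodes.size() - 1);
            }

            // Resolve: gather fragments, sort back to front, blend with "over".
            void resolve(size_t pixel, float out[4]) const {
                std::vector<Fragment> frags;
                for (int32_t i = head[pixel]; i != -1; i = nodes[i].next)
                    frags.push_back(nodes[i]);
                std::sort(frags.begin(), frags.end(),
                          [](const Fragment& a, const Fragment& b) {
                              return a.depth > b.depth;  // larger depth = farther
                          });
                out[0] = out[1] = out[2] = out[3] = 0.f;
                for (const Fragment& f : frags)
                    for (int c = 0; c < 4; ++c)  // premultiplied-alpha "over"
                        out[c] = f.rgba[c] + (1.f - f.rgba[3]) * out[c];
            }
        };

        int main() {
            PPLLBuffers ppll(1);
            ppll.insert(0, {{0.5f, 0.f, 0.f, 0.5f}, 0.3f, -1});  // near, red
            ppll.insert(0, {{0.f, 0.f, 0.8f, 0.8f}, 0.9f, -1});  // far, blue
            float out[4];
            ppll.resolve(0, out);
            std::printf("rgba = %.2f %.2f %.2f %.2f\n", out[0], out[1], out[2], out[3]);
        }

    Because insertion order does not matter until the resolve pass, this structure lends itself to compositing partial images from multiple render nodes, which is how the thesis uses it for distributed order-independent transparency.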

    Improving the programmability of heterogeneous systems by means of libraries

    Get PDF
    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
    [Abstract] The usage of heterogeneous devices as co-processors in high performance computing (HPC) environments has steadily grown during the last years due to their excellent properties in terms of performance and energy consumption. The larger availability of hybrid HPC systems naturally led to the need to develop suitable programming tools for them, the most widely used nowadays being CUDA and OpenCL. Unfortunately, these tools are relatively low level, which, coupled with the large number of details that must be managed when programming accelerators, makes the programming of these systems much more complex than that of traditional CPUs. This has led to the proposal of higher level alternatives that facilitate the programming of heterogeneous devices. This thesis contributes to this field by presenting two libraries that largely improve the programmability of heterogeneous systems in C++, helping users to focus on what to do rather than on low level tasks. These two libraries, the Heterogeneous Programming Library (HPL) and the Heterogeneous Hierarchically Tiled Arrays (H2TA), are well suited to nodes with one or more accelerators and to heterogeneous clusters, respectively. Both libraries have proven able to increase the productivity of users, improving the programmability of their codes while at the same time achieving performance similar to that of lower level solutions

    Application acceleration: an investigation of automatic porting methods for application accelerators

    Get PDF
    Future HPC systems will contain both large collections of multi-core processors and specialist many-core co-processors. These specialised many-core co-processors are typically classified as application accelerators: devices such as GPUs, Cell processors, FPGAs, and custom application-specific integrated circuits (ASICs). These devices present new challenges, including their programming difficulties, their diversity, the lack of a common programming approach between them, and the issue of selecting the most appropriate device for an application. This thesis attempts to tackle these problems by examining the suitability of automatic porting methods. In the course of this research, relevant software, both academic and commercial, has been analysed to determine how it attempts to solve the problems relating to the use of application acceleration devices. A new approach is then constructed: an Automatic Self-Modifying Application Porting system that is able not only to port code to an acceleration device but also, using performance data, to predict the appropriate device for the code being ported. Additionally, this system is able to use the performance data it gathers to modify its own decision-making model and improve its future predictions. Once the system has been developed, a series of applications are trialled and their performance, both in terms of execution time and the accuracy of the system's predictions, is analysed. This analysis has shown that, although the system is not able to flawlessly predict the correct device for an unseen application, it achieves an accuracy of over 80% and, just as importantly, the code it produces is within 15% of that produced by an experienced human programmer. The analysis has also shown that while automatically ported code performs favourably in nearly all cases when compared to a single-core CPU, it outperforms a quad-core CPU in only three out of seven application case studies. From these results, it is also shown that the system is able to use this performance data to build a decision model allowing users to determine whether an automatically ported version of their application will provide a performance improvement compared to both CPU types considered. The availability of such a system may prove valuable in allowing a diverse range of users to exploit the performance supplied by many-core devices within next-generation HPC systems
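
    The decision model described is, at its core, a predictor trained on measured runtimes and updated as new measurements arrive. A toy C++ sketch of one plausible form, a nearest-neighbour selector over application features, follows; it is entirely illustrative and not the thesis's actual model:

        #include <array>
        #include <cstdio>
        #include <limits>
        #include <string>
        #include <vector>

        // Toy device selector: remember (feature vector -> best device) pairs
        // from past runs and predict by nearest neighbour. The three features
        // (e.g. trip count, arithmetic intensity, data size) are hypothetical.
        struct Sample {
            std::array<double, 3> features;
            std::string best_device;  // e.g. "gpu", "quad-core-cpu"
        };

        class DeviceSelector {
            std::vector<Sample> history_;
        public:
            // Feed back a measured result; this is how the model "self-modifies".
            void record(const Sample& s) { history_.push_back(s); }

            std::string predict(const std::array<double, 3>& f) const {
                double best = std::numeric_limits<double>::max();
                std::string device = "cpu";  // default with no history
                for (const Sample& s : history_) {
                    double d = 0.0;
                    for (int i = 0; i < 3; ++i)
                        d += (s.features[i] - f[i]) * (s.features[i] - f[i]);
                    if (d < best) { best = d; device = s.best_device; }
                }
                return device;
            }
        };

        int main() {
            DeviceSelector sel;
            sel.record({{1e6, 8.0, 512.0}, "gpu"});
            sel.record({{1e3, 0.5, 4.0}, "quad-core-cpu"});
            std::printf("predicted: %s\n", sel.predict({5e5, 6.0, 256.0}).c_str());
        }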