
    Modelling Performance Loss due to Thread Imbalance in Stochastic Variable-Length SIMT Workloads

    When designing algorithms for single-instruction multiple-thread (SIMT) devices such as general-purpose graphics processing units (GPGPUs), thread imbalance is an important performance consideration. Thread imbalance can emerge in iterative applications where workloads are of variable length, because threads processing larger amounts of work will cause threads with less work to idle. This form of thread imbalance influences the design space of algorithms, particularly in terms of processing granularity, but we lack models to quantify its impact on application performance. In this paper, we present a statistical model for quantifying the performance loss due to thread imbalance for iterative SIMT applications with stochastic, variable-length workloads. Our model is designed to operate with minimal knowledge of the implementation details of the algorithm, relying solely on an understanding of the probability distribution of the lengths of the workloads. We validate our model against a synthetic benchmark based on a Monte Carlo simulation of matrix exponentiation, and show that our model achieves nearly perfect accuracy. Compared to empirical data extracted from real hardware, our model maintains a high degree of accuracy, predicting mean performance loss within a margin of 2%.
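
    The idling effect described in this abstract can be illustrated with a small Monte Carlo estimate. This is only a sketch under stated assumptions, not the statistical model from the paper: the function names `imbalance_loss` and `mean_loss`, the group size of 32 threads, and the choice of workload distribution are all hypothetical and chosen for illustration.

```python
import random


def imbalance_loss(lengths):
    """Fraction of thread-iterations spent idle in one group of threads.

    Assumption: every thread in the group stays occupied until the longest
    workload finishes, so the group pays max(lengths) iterations per thread.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    longest = max(lengths)
    if longest == 0:
        return 0.0
    useful = sum(lengths)            # iterations that do real work
    total = longest * len(lengths)   # iterations the group is occupied for
    return 1.0 - useful / total


def mean_loss(sample_length, group_size=32, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean loss for a workload-length distribution.

    sample_length is a hypothetical callable that takes a random.Random
    instance and returns one non-negative workload length.
    """
    rng = random.Random(seed)
    losses = [
        imbalance_loss([sample_length(rng) for _ in range(group_size)])
        for _ in range(trials)
    ]
    return sum(losses) / trials
```

    For example, `mean_loss(lambda rng: rng.randint(1, 100))` estimates the mean loss when workload lengths are drawn uniformly from 1 to 100. The only input is the length distribution, which mirrors the abstract's point that no implementation details of the algorithm are needed.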

    Otimização do framework PSkel para o processador manycore MPPA-256 (Optimization of the PSkel framework for the MPPA-256 manycore processor)

    Undergraduate final project (TCC), Universidade Federal de Santa Catarina, Centro Tecnológico, Ciências da Computação. A new class of highly parallel, low-power chips has been developed in response to energy constraints. The Sunway SW26010 and Kalray MPPA-256 processors are examples, offering more than two hundred processing cores in a single low-power chip. Although they are more energy efficient than general-purpose multicore processors, architectural features such as the limited amount of distributed on-chip memory make the development of efficient parallel scientific applications a challenging task. This project proposes optimizations to the PSkel MPPA framework, which provides a single, high-level abstraction for stencil programming on the MPPA-256 processor, relieving programmers of the need to explicitly handle communication and the hybrid parallel programming model of the MPPA-256.

    On the Energy Efficiency and Performance of Irregular Application Executions on Multicore, NUMA and Manycore Platforms

    Until the last decade, performance of HPC architectures has been almost exclusively quantified by their processing power. However, energy efficiency is being recently considered as important as raw performance and has become a critical aspect to the development of scalable systems. These strict energy constraints guided the development of a new class of so-called light-weight manycore processors. This study evaluates the computing and energy performance of two well-known irregular NP-hard problems, the Traveling-Salesman Problem (TSP) and K-Means clustering, and a numerical seismic wave propagation simulation kernel, Ondes3D, on multicore, NUMA, and manycore platforms. First, we concentrate on the nontrivial task of adapting these applications to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze their performance and energy consumption on those different machines. Our results show that applications able to fully use the resources of a manycore can have better performance and may consume from 3.8x to 13x less energy when compared to low-power and general-purpose multicore processors, respectively.

    Corporate influence and the academic computer science discipline. [4: CMU]

    Prosopographical work on the four major centers for computer research in the United States has now been conducted, raising significant questions about the independence of so-called computer science.

    Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures


    Multi-tasking scheduling for heterogeneous systems

    Heterogeneous platforms play an increasingly important role in modern computer systems. They combine high performance with low power consumption. From mobiles to supercomputers, we see an increasing number of computer systems that are heterogeneous. The most well-known heterogeneous systems, CPU+GPU platforms, have been widely used in recent years. As they become more mainstream, serving multiple tasks from multiple users is an emerging challenge. A good scheduler can greatly improve performance. However, indiscriminately allocating tasks based on availability leads to poor performance. As modern GPUs have a large number of hardware resources, most tasks cannot efficiently utilize all of them. Concurrent task execution on the GPU is a promising solution; however, indiscriminately running tasks in parallel causes a slowdown. This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed to determine where to schedule OpenCL kernels. It predicts the best-fit device by using a machine learning-based classifier, then schedules the kernels accordingly to either the CPU or the GPU. To improve GPU utilization, a kernel merging approach is proposed. Kernels are merged if their predicted co-execution can provide better performance than sequential execution. A machine learning-based classifier is developed to find the best kernel pairs for co-execution on the GPU. Finally, a runtime framework is developed to schedule kernels separately on either the CPU or the GPU, and to run kernels in pairs if their co-execution can improve performance. The approaches developed in this thesis significantly improve system performance and outperform all existing techniques.

    Automatic performance optimisation of parallel programs for GPUs via rewrite rules

    Graphics Processing Units (GPUs) are now commonplace in computing systems and are the most successful parallel accelerators. Their performance is orders of magnitude higher than traditional Central Processing Units (CPUs) making them attractive for many application domains with high computational demands. However, achieving their full performance potential is extremely hard, even for experienced programmers, as it requires specialised software tailored for specific devices written in low-level languages such as OpenCL. Differences in device characteristics between manufacturers and even hardware generations often lead to large performance variations when different optimisations are applied. This inevitably leads to code that is not performance portable across different hardware. This thesis demonstrates that achieving performance portability is possible using LIFT, a functional data-parallel language which allows programs to be expressed at a high-level in a hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation space using a set of well-defined rewrite rules to transform programs seamlessly between different high-level algorithmic forms before translating them to a low-level OpenCL-specific form. The first contribution of this thesis is the development of techniques to compile functional LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL code. Producing efficient code is non-trivial as many performance sensitive details such as memory allocation, array accesses or synchronisation are not explicitly represented in the functional LIFT language. The thesis shows that the newly developed techniques are essential for achieving performance on par with manually optimised code for GPU programs with the exact same complex optimisations applied. 
    The second contribution of this thesis is the presentation of techniques that enable the LIFT compiler to perform complex optimisations that usually require from tens to hundreds of individual rule applications by grouping them as macro-rules that cut through the optimisation space. Using matrix multiplication as an example, starting from a single high-level program, the compiler automatically generates highly optimised and specialised implementations for desktop and mobile GPUs with very different architectures, achieving performance portability. The final contribution of this thesis is the demonstration of how low-level and GPU-specific features are extracted directly from the high-level functional LIFT program, which makes it possible to build a statistical performance model that accurately predicts the performance of differently optimised program variants. This performance model is then used to drastically reduce the time taken by the optimisation space exploration by ranking the different variants based on their predicted performance. Overall, this thesis demonstrates that performance portability is achievable using LIFT.

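    El capitalismo y las religiones de China: revisión de los postulados de Max Weber en la China del nuevo siglo (Capitalism and the religions of China: a review of Max Weber's postulates in the China of the new century)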

    The purpose of this doctoral research is to determine to what extent we can apply Max Weber's interpretations, analysis, and conclusions about Imperial China to the country's economic rise in the last decades. In order to find an answer to this problem, thesis director Dr. Josetxo Beriain Razquin and I designed a research methodology aimed at testing the applicability of Weber's theories and explanations to some of the most crucial aspects of Chinese economic and social development, such as education, entrepreneurship, and the relative strategies of the state and the local government. The study of the educational aspect started in September 2011 and resulted in interviews with 20 students and surveys of 100 students, all of them from Wuhan University, one of the top-10 universities in China. For the question of governmental strategies, I researched mainly through observant participation in the “2012 Gansu International Fellowship Program”, a 2-month experience of international cooperation in which I took part as a delegate of the Spanish province of Navarre. Regarding entrepreneurship, I completed more than 10 collaborations with Spanish and Chinese companies operating in China in sectors such as renewable energies, real estate, tourism, food and high technology. The interpretation and analysis phases of this doctoral thesis were completed over a year and a half of study at Jilin University, under the guidance of Professor Tian Yipeng, who helped me achieve a better understanding of the deep changes faced by Chinese society since the decline of Work Unit-oriented communities and the rise of the State-owned enterprises.
    The conclusions of the research can be summarized in three main points: 1) The emphasis of Max Weber on the tension between the intra-mundane and ultra-mundane spheres has led to much confusion among researchers, especially those who use this factor to elaborate anti-materialist explanations of the so-called “Chinese miracle” and, thus, ignore “external” factors of pressure such as governmental policies, inter-familiar competition for a higher economic and social status, and a “technical” puritanism that coerces Chinese people to focus on training and work. 2) According to the evidence collected, Weber's plural concept of “the religions of China” appears far more accurate than the concept of “Confucian capitalism”, since it offers the possibility of including some of the most popular business-related cults and ethical perspectives, the majority of which can't be considered an exclusive heritage of Confucianism. 3) The results of this research strongly suggest the presence of the “adaptive” rationality that Weber considered as opposed to the one dominant in the West, although it is possible that a good part of the nation's successful efforts towards the so-called “Chinese dream” have been driven by this mentality. Programa Oficial de Doctorado en Dinámicas de Cambio en las Sociedades Modernas Avanzadas (RD 1393/2007) / Gizarte Moderno Aurreratuen Aldaketa Dinamiketako Doktoretza Programa Ofiziala (ED 1393/2007)