Modelling Performance Loss due to Thread Imbalance in Stochastic Variable-Length SIMT Workloads
When designing algorithms for single-instruction multiple-thread (SIMT) devices such as general-purpose graphics processing units (GPGPUs), thread imbalance is an important performance consideration. Thread imbalance can emerge in iterative applications where workloads are of variable length, because threads processing larger amounts of work will cause threads with less work to idle. This form of thread imbalance influences the design space of algorithms, particularly in terms of processing granularity, but we lack models to quantify its impact on application performance. In this paper, we present a statistical model for quantifying the performance loss due to thread imbalance for iterative SIMT applications with stochastic, variable-length workloads. Our model is designed to operate with minimal knowledge of the implementation details of the algorithm, relying solely on an understanding of the probability distribution of the lengths of the workloads. We validate our model against a synthetic benchmark based on a Monte Carlo simulation of matrix exponentiation, and show that our model achieves nearly perfect accuracy. Compared to empirical data extracted from real hardware, our model maintains a high degree of accuracy, predicting mean performance loss within a margin of 2%.
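The idling effect described in this abstract can be illustrated with a small Monte Carlo estimate. This is a minimal sketch and not the paper's statistical model: it assumes a group of threads that all finish only when the longest workload finishes, and the names `imbalance_loss`, `mean_loss`, `geometric_length`, the group size of 32, and the geometric length distribution are all illustrative assumptions.

```python
import random


def imbalance_loss(lengths):
    """Fraction of thread-steps spent idle when every thread in the
    group must wait for the thread with the longest workload."""
    if not lengths:
        raise ValueError("need at least one workload length")
    longest = max(lengths)
    if longest == 0:
        return 0.0
    useful = sum(lengths)              # steps that do real work
    occupied = len(lengths) * longest  # steps the group is held for
    return 1.0 - useful / occupied


def geometric_length(rng, p=0.1):
    """Hypothetical workload length: iterations until first success (>= 1)."""
    n = 1
    while rng.random() >= p:
        n += 1
    return n


def mean_loss(sample_length, group_size=32, trials=10_000, seed=0):
    """Average idle fraction over many randomly drawn thread groups."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        lengths = [sample_length(rng) for _ in range(group_size)]
        total += imbalance_loss(lengths)
    return total / trials
```

Calling `mean_loss(geometric_length)` gives an estimate for the assumed distribution; a constant-length sampler such as `lambda rng: 7` yields zero loss, since no thread ever waits.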
Otimização do framework PSkel para o processador manycore MPPA-256
Undergraduate thesis (TCC), Universidade Federal de Santa Catarina, Centro Tecnológico, Ciências da Computação. A new class of highly parallel, low-power chips has been developed in response to energy constraints. The Sunway SW26010 and Kalray MPPA-256 processors are examples, offering more than two hundred processing cores in a single low-power chip. Although they are more energy efficient than general-purpose multicore processors, architectural features such as the limited amount of distributed on-chip memory make developing efficient parallel scientific applications a challenging task. This work proposes optimizations to the PSkel MPPA framework, which provides a single, high-level abstraction for stencil programming on the MPPA-256 processor, relieving programmers of the task of explicitly handling communication and the hybrid parallel programming model of the MPPA-256.
On the Energy Efficiency and Performance of Irregular Application Executions on Multicore, NUMA and Manycore Platforms
Until the last decade, the performance of HPC architectures was quantified almost exclusively by their processing power. However, energy efficiency has recently come to be considered as important as raw performance and has become a critical aspect of the development of scalable systems. These strict energy constraints guided the development of a new class of so-called light-weight manycore processors. This study evaluates the computing and energy performance of two well-known irregular NP-hard problems, the Traveling-Salesman Problem (TSP) and K-Means clustering, and a numerical seismic wave propagation simulation kernel, Ondes3D, on multicore, NUMA, and manycore platforms. First, we concentrate on the nontrivial task of adapting these applications to a manycore, specifically the novel MPPA-256 manycore processor. Then, we analyze their performance and energy consumption on those different machines. Our results show that applications able to fully use the resources of a manycore can have better performance and may consume from 3.8x to 13x less energy when compared to low-power and general-purpose multicore processors, respectively.
Corporate influence and the academic computer science discipline. [4: CMU]
Prosopographical work on the four major centers for computer
research in the United States has now been conducted, resulting in big
questions about the independence of so-called computer science.
Multi-tasking scheduling for heterogeneous systems
Heterogeneous platforms play an increasingly important role in modern computer
systems. They combine high performance with low power consumption. From mobiles
to supercomputers, we see an increasing number of computer systems that are
heterogeneous.
CPU+GPU platforms, the best-known heterogeneous systems, have been widely
used in recent years. As they become more mainstream, serving multiple tasks from
multiple users is an emerging challenge. A good scheduler can greatly improve performance.
However, indiscriminately allocating tasks based on availability leads to poor
performance. As modern GPUs have a large number of hardware resources, most tasks
cannot efficiently utilize all of them. Concurrent task execution on GPU is a promising
solution; however, indiscriminately running tasks in parallel causes a slowdown.
This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed
to determine where to schedule OpenCL kernels. It predicts the best-fit device by
using a machine learning-based classifier, then schedules the kernels accordingly to either
CPU or GPU. To improve GPU utilization, a kernel merging approach is proposed.
Kernels are merged if their predicted co-execution can provide better performance than
sequential execution. A machine learning based classifier is developed to find the best
kernel pairs for co-execution on GPU. Finally, a runtime framework is developed to
schedule kernels separately on either CPU or GPU, and run kernels in pairs if their
co-execution can improve performance. The approaches developed in this thesis significantly
improve system performance and outperform all existing techniques.
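The merging criterion in this abstract (co-execute two kernels only when that is predicted to beat running them one after the other) can be sketched as follows. This is an illustration under stated assumptions, not the thesis's framework: the predicted times stand in for the output of the machine-learning classifier, the greedy pairing strategy is an assumption, and `should_merge`, `plan_gpu_queue`, `solo_time`, and `co_time` are hypothetical names.

```python
def should_merge(solo_a, solo_b, co_exec):
    """Merge only if predicted co-execution beats sequential execution."""
    return co_exec < solo_a + solo_b


def plan_gpu_queue(kernels, solo_time, co_time):
    """Greedily pair kernels whose predicted co-execution saves the most time.

    kernels   -- kernel names in arrival order
    solo_time -- dict: kernel name -> predicted standalone time
    co_time   -- function (a, b) -> predicted time when a and b run together
    Returns a list of tuples: pairs to co-execute, singletons to run alone.
    """
    remaining = list(kernels)
    plan = []
    while remaining:
        first = remaining.pop(0)
        best, best_gain = None, 0.0
        for other in remaining:
            # Gain is the time saved relative to sequential execution.
            gain = solo_time[first] + solo_time[other] - co_time(first, other)
            if gain > best_gain:
                best, best_gain = other, gain
        if best is None:
            plan.append((first,))       # no partner improves on running alone
        else:
            remaining.remove(best)
            plan.append((first, best))  # co-execute this pair
    return plan
```

A kernel with no beneficial partner is scheduled on its own, which mirrors the abstract's point that indiscriminate co-execution causes a slowdown.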
Automatic performance optimisation of parallel programs for GPUs via rewrite rules
Graphics Processing Units (GPUs) are now commonplace in computing systems and are the
most successful parallel accelerators. Their performance is orders of magnitude higher than
traditional Central Processing Units (CPUs) making them attractive for many application domains
with high computational demands. However, achieving their full performance potential
is extremely hard, even for experienced programmers, as it requires specialised software tailored
for specific devices written in low-level languages such as OpenCL. Differences in device
characteristics between manufacturers and even hardware generations often lead to large performance
variations when different optimisations are applied. This inevitably leads to code that
is not performance portable across different hardware.
This thesis demonstrates that achieving performance portability is possible using LIFT, a
functional data-parallel language which allows programs to be expressed at a high level in a
hardware-agnostic way. The LIFT compiler is empowered to automatically explore the optimisation
space using a set of well-defined rewrite rules to transform programs seamlessly between
different high-level algorithmic forms before translating them to a low-level OpenCL-specific
form.
The first contribution of this thesis is the development of techniques to compile functional
LIFT programs that have optimisations explicitly encoded into efficient imperative OpenCL
code. Producing efficient code is non-trivial as many performance sensitive details such as
memory allocation, array accesses or synchronisation are not explicitly represented in the functional
LIFT language. The thesis shows that the newly developed techniques are essential for
achieving performance on par with manually optimised code for GPU programs with the exact
same complex optimisations applied.
The second contribution of this thesis is the presentation of techniques that enable the
LIFT compiler to perform complex optimisations that usually require from tens to hundreds of
individual rule applications by grouping them as macro-rules that cut through the optimisation
space. Using matrix multiplication as an example, starting from a single high-level program
the compiler automatically generates highly optimised and specialised implementations for
desktop and mobile GPUs with very different architectures achieving performance portability.
The final contribution of this thesis is the demonstration of how low-level and GPU-specific
features are extracted directly from the high-level functional LIFT program, enabling building
a statistical performance model that makes accurate predictions about the performance of differently
optimised program variants. This performance model is then used to drastically reduce
the time taken by the optimisation space exploration by ranking the different variants based
on their predicted performance.
Overall, this thesis demonstrates that performance portability is achievable using LIFT.
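The idea of transforming a program between equivalent algorithmic forms with rewrite rules can be illustrated with the textbook map-fusion rule, map f (map g xs) = map (f . g) xs. The sketch below is not LIFT's implementation or rule set: the tuple-based expression encoding and the names `is_map`, `fuse_maps`, and `evaluate` are assumptions made purely for illustration.

```python
def is_map(expr):
    """An expression is either a list of data or a ("map", f, inner) tuple."""
    return isinstance(expr, tuple) and len(expr) == 3 and expr[0] == "map"


def compose(f, g):
    """Return the function x -> f(g(x))."""
    return lambda x: f(g(x))


def fuse_maps(expr):
    """Apply map fusion bottom-up: nested maps collapse into a single map."""
    if not is_map(expr):
        return expr
    _, f, inner = expr
    inner = fuse_maps(inner)            # rewrite the inner expression first
    if is_map(inner):
        _, g, xs = inner
        return ("map", compose(f, g), xs)
    return ("map", f, inner)


def evaluate(expr):
    """Reference interpreter, used to check that rewriting preserves meaning."""
    if is_map(expr):
        _, f, inner = expr
        return [f(x) for x in evaluate(inner)]
    return list(expr)
```

The fused form traverses the data once instead of twice while producing the same result, which is the kind of semantics-preserving transformation a rule-based optimiser can explore.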
El capitalismo y las religiones de China: revisión de los postulados de Max Weber en la China del nuevo siglo
The purpose of this doctoral research is to determine to what extent we can apply Max Weber's
interpretations, analysis, and conclusions about Imperial China to the country's economic rise in the
last decades.
In order to find an answer to this problem, thesis director Dr. Josetxo Beriain Razquin and I
designed a research methodology aimed at testing the applicability of Weber's theories and
explanations on some of the most crucial aspects of Chinese economic and social development,
such as education, entrepreneurship, and the relative strategies of the state and the local
government.
The study of the educational aspect started in September 2011, and resulted in 20 interviewed and
100 surveyed students, all of them from Wuhan University, one of the top-10 universities in China.
For the question of governmental strategies, I researched mainly through observant participation in
the “2012 Gansu International Fellowship Program”, a 2-month experience of international
cooperation in which I took part as a delegate of the Spanish province of Navarre. And regarding
entrepreneurship, I completed more than 10 collaborations with Spanish and Chinese companies
operating in China in sectors such as renewable energy, real estate, tourism, food and
high technology.
The interpretation and analysis phases of this doctoral thesis were completed over a year and a
half of study at Jilin University, under the guidance of Professor Tian Yipeng, who helped me
achieve a better understanding of the deep changes faced by Chinese society since the decline of
Work Unit-oriented communities, and the rise of the State-owned enterprises.
Regarding the conclusions of the research, they can be summarized in three main points:
1) The emphasis of Max Weber on the tension between the intra-mundane and ultra-mundane
spheres has led to much confusion among researchers, especially those who use this factor
to elaborate anti-materialist explanations of the so-called “Chinese miracle” and, thus, ignore
“external” factors of pressure such as governmental policies, inter-family competition for a
higher economic and social status, and a “technical” puritanism that coerces Chinese people
to focus on training and work.
2) According to the evidence collected, Weber's plural concept of “the religions of China”
appears far more accurate than the concept of “Confucian capitalism”, since it offers the
possibility of including some of the most popular business-related cults and ethical
perspectives, the majority of which can't be considered as an exclusive heritage of
Confucianism.
3) The results of this research strongly suggest the presence of the “adaptive” rationality that
Weber considered as opposed to the one dominant in the West, although it is possible that a
good part of the nation's successful efforts towards the so-called “Chinese dream” have been
driven by this mentality.
Programa Oficial de Doctorado en Dinámicas de Cambio en las Sociedades Modernas Avanzadas (RD 1393/2007) / Gizarte Moderno Aurreratuen Aldaketa Dinamiketako Doktoretza Programa Ofiziala (ED 1393/2007)