5 research outputs found
Using data dependencies to improve task-based scheduling strategies on NUMA architectures
The recent addition of data dependencies to the OpenMP 4.0 standard provides the application programmer with a more flexible way of synchronizing tasks. Using such an approach allows both the compiler and the runtime system to know exactly which data are read or written by a given task, and how these data will be used throughout the program's lifetime. Data placement and task scheduling strategies have a significant impact on performance on NUMA architectures. While numerous papers focus on these topics, none of them has made extensive use of the information available through dependencies. One can use this information to modify the behavior of the application at several levels: during initialization, to control data placement, and during execution, to dynamically control both the task placement and the task-stealing strategy, depending on the topology. This paper introduces several heuristics for these strategies and their implementations in our OpenMP runtime XKAAPI. We also evaluate their performance on linear algebra applications executed on a 192-core NUMA machine, reporting noticeable performance improvements when both the architecture topology and the tasks' data dependencies are taken into account. Finally, we compare them to strategies presented previously in related work.
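As an illustration of how declared dependencies could drive task placement, the following is a minimal sketch and not one of the XKAAPI heuristics from the paper. It assumes that the runtime knows the size of each buffer a task depends on and the NUMA node where that buffer currently resides; the function name `pick_numa_node` and both data structures are hypothetical.

```python
from collections import defaultdict

def pick_numa_node(dependencies, buffer_location):
    """Return the NUMA node that holds the most bytes touched by a task.

    dependencies    -- list of (buffer_name, size_in_bytes, access_mode) tuples
                       (hypothetical description of a task's declared dependencies)
    buffer_location -- dict mapping buffer_name to the NUMA node id holding it
    Returns None when none of the buffers has a known location.
    """
    bytes_per_node = defaultdict(int)
    for name, size, _mode in dependencies:
        node = buffer_location.get(name)
        if node is not None:
            bytes_per_node[node] += size
    if not bytes_per_node:
        return None
    # Iterate in ascending node order so that ties go to the lowest node id.
    return max(sorted(bytes_per_node), key=bytes_per_node.get)

# Hypothetical task reading A and C and updating B.
deps = [("A", 4096, "r"), ("B", 16384, "rw"), ("C", 4096, "r")]
location = {"A": 0, "B": 1, "C": 0}
preferred_node = pick_numa_node(deps, location)
```

Here node 1 holds 16384 of the bytes the task touches and node 0 holds 8192, so the sketch prefers node 1. A real runtime would also have to weigh load balance and the machine topology.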
Performance analysis and acceleration of nuclear physics application on high-performance computing platforms using GPGPUs and topology-aware mapping techniques
The number of nodes in the current generation of high performance computing (HPC) platforms increases at a steady rate, and the nodes of these platforms support multi-core and many-core hardware designs. As the number of cores per node increases, whether CPU or accelerator based, we need to make use of all those cores. Thus, one has to use the accelerators as much as possible inside scientific applications. Furthermore, as the number of nodes increases, the communication time between nodes is likely to increase, which necessitates application-specific, network topology-aware mapping techniques for efficient utilization of these platforms. In addition, one also needs to construct network models in order to study the benefits of a specific network mapping. Topology-aware mapping techniques help to distribute the computational tasks so that the communication patterns make optimal use of the underlying network hardware. This research focuses mainly on the Many Fermion Dynamics nuclear (MFDn) application developed at Iowa State University, a computational tool for low-energy nuclear physics. MFDn uses the Lanczos algorithm (LA), an algorithm for the diagonalization of sparse matrices that is widely used in scientific parallel computing. We present techniques applied to this application that enhance its performance through the use of general purpose graphics processing units (GPGPUs). Additionally, we compare the performance of the sparse matrix vector multiplication (SpMVM), the main computationally intensive kernel in the LA, with other efficient approaches presented in the literature. We compare results for the total HPC platforms' resources needed for different SpMVM implementations, present and analyze the implementation of a method for overlapping communication and computation, and extend a model for the analysis of network topology presented in the literature. Finally, we present network topology-aware mapping techniques, focused on the LA stage, for IBM Blue Gene/Q (BG/Q) supercomputers, which enhance the performance as compared to the default mapping, and validate the results of our tests using the network model.
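To make the SpMVM kernel named in the abstract above concrete, here is a minimal serial sketch. It assumes the matrix is stored in compressed sparse row (CSR) form, which the abstract does not specify, and it is not the MFDn implementation. The function name `csr_spmv` and the small test matrix are illustrative only.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    values  -- nonzero entries of A, stored row by row
    col_idx -- column index of each entry in `values`
    row_ptr -- row_ptr[i]:row_ptr[i + 1] delimits the entries of row i
    x       -- dense input vector
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# A = [[4, 0, 1],
#      [0, 3, 0],
#      [1, 0, 2]]
values = [4.0, 1.0, 3.0, 1.0, 2.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

Each Lanczos iteration performs one such product. Because the accesses to `x` are irregular and driven by `col_idx`, the kernel tends to be memory-bound, and in a distributed setting communication-bound.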
Achieving reliability and fairness in online task computing environments
International Mention in the doctoral degree. We consider online task computing environments such as volunteer computing platforms running on BOINC (e.g., SETI@home) and crowdsourcing platforms such as Amazon Mechanical Turk. We model the computations as an Internet-based task computing system under the master-worker paradigm. A master entity sends tasks across the Internet to worker entities willing to perform a computational task. Workers execute the tasks and report back the results, completing the computational round. Unfortunately, workers are untrustworthy and might report an incorrect result. Thus, the first research question we answer in this work is how to design a reliable master-worker task computing system. We capture the workers’ behavior through two realistic models: (1) the “error probability model”, which assumes the presence of altruistic workers willing to provide correct results and the presence of troll workers aiming to provide random incorrect results, where both types of workers suffer from an error probability altering their intended response; and (2) the “rationality model”, which assumes the presence of altruistic workers always reporting a correct result, the presence of malicious workers always reporting an incorrect result, and the presence of rational workers following a strategy that will maximize their utility (benefit). The rational workers can choose between two strategies: either be honest and report a correct result, or cheat and report an incorrect result. Our two modeling assumptions on the workers’ behavior are supported by an experimental evaluation we have performed on Amazon Mechanical Turk. Given the error probability model, we evaluate two reliability techniques, (1) “voting” and (2) “auditing”, in terms of the task assignments required and the time invested for correctly computing a set of tasks with high probability. Considering the rationality model, we take an evolutionary game theoretic approach and design mechanisms that eventually achieve a reliable computational platform where the master receives the correct task result with probability one and with minimal auditing cost. The designed mechanisms provide incentives to the rational workers, reinforcing their strategy toward correct behavior, and are complemented by four reputation schemes that cope with malice. Finally, we also design a mechanism that deals with unresponsive workers by keeping a reputation related to the workers’ response rate. The designed mechanism selects the most reliable and active workers in each computational round. Among other things, simulations depict the trade-off between the master’s cost and the time the system needs to reach a state where the master always receives the correct task result. The second research question we answer in this work concerns the fair and efficient distribution of workers among the masters over multiple computational rounds. Masters with similar tasks compete for the same set of workers in each computational round. Workers must be assigned to the masters in a fair manner, that is, when the master values a worker’s contribution the most. We consider that a master might behave strategically, declaring a dishonest valuation of a worker in each round in an attempt to increase its benefit. This strategic behavior on the side of the masters might lead to unfair and inefficient assignments of workers. Applying renowned auction mechanisms to solve the problem at hand can be infeasible, since monetary payments are required on the side of the masters. Hence, we present an alternative mechanism for the fair and efficient distribution of the workers in the presence of strategic masters, without the use of monetary incentives. We show analytically that our designed mechanism guarantees fairness, is socially efficient, and is truthful. Simulations favourably compare our designed mechanism with two benchmark auction mechanisms. This work has been supported by IMDEA Networks Institute and the Spanish Ministry of Education grant FPU2013-03792. Official Doctoral Program in Mathematical Engineering. Chair: Alberto Tarable. Secretary: José Antonio Cuesta Ruiz. Member: Juan Julián Merelo Guervó
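The “voting” technique mentioned in the abstract above can be illustrated with a small calculation. This is a simplified sketch, not the thesis's analysis: it assumes an odd number of workers, that each worker independently returns the correct result with the same probability, and that the master accepts a result only when a strict majority of replies is correct (the worst case in which all incorrect replies coincide). The function name is hypothetical.

```python
from math import comb

def majority_correct_probability(n_workers, p_correct):
    """Probability that a strict majority of n_workers replies is correct.

    Assumes independent workers, each correct with probability p_correct,
    and an odd n_workers so that ties cannot occur.
    """
    if n_workers < 1 or n_workers % 2 == 0:
        raise ValueError("n_workers must be a positive odd integer")
    needed = n_workers // 2 + 1
    return sum(
        comb(n_workers, k) * p_correct**k * (1 - p_correct) ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )

# Reliability of a single task for increasing replication, with p_correct = 0.9.
reliability = {n: majority_correct_probability(n, 0.9) for n in (1, 3, 5)}
```

Under these assumptions, replicating a task over three workers raises the probability of accepting the correct result from 0.9 to 0.972, at three times the number of task assignments. This is the kind of trade-off between assignments and reliability that the abstract refers to.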
Optimal program variant generation for hybrid manycore systems
Field Programmable Gate Arrays (FPGAs) promise to deliver superior energy efficiency in heterogeneous high performance computing, as compared to multicore CPUs and GPUs. The rate of adoption is, however, hampered by the relative difficulty of programming FPGAs. High-level synthesis (HLS) tools such as Xilinx Vivado, Altera OpenCL or Intel's HLS address a large part of the programmability issue by synthesizing a Hardware Description Language representation from a high-level specification of the application, given in programming languages such as OpenCL C, typically used to program CPUs and GPUs. Although HLS solutions make programming easier, they do not lighten the burden of optimization. Application developers must rely on expert knowledge to manually optimize their applications for each target device, meaning that traditional HLS solutions do not solve the issue of performance portability. This state of affairs prompted the development of compiler frameworks such as TyTra, which operate at an even higher level of abstraction that is amenable to the use of Design Space Exploration (DSE). With DSE, the initial program specification can be seen as the starting location in a search space of correct-by-construction program transformations. In TyTra, the search space is generated from the transitive closure of term-level transformations derived from type-level transformations. Compiler frameworks such as TyTra theoretically solve the issue of performance portability by providing a way to automatically generate alternative correct program variants. They suffer, however, from the very practical issue that the generated space is often too large to explore fully. As a consequence, the globally optimal solution may be overlooked.
In this work we provide a novel solution to the issue of performance portability by deriving an efficient yet effective DSE strategy for the TyTra compiler framework. We make use of categorical data types to derive categorical semantics for the formal languages that describe the terms, types, cost-performance estimates and their transformations. From these we define a category of interpretations for TyTra applications, from which we derive a DSE strategy that finds the globally optimal transformation sequence in polynomial time. This is achieved by reducing the size of the generated search space. We formally state and prove a theorem for this claim and then show that the polynomial run-time of our DSE strategy has practically negligible coefficients, leading to sub-second exploration times for realistic applications.