    Using data dependencies to improve task-based scheduling strategies on NUMA architectures

    The recent addition of data dependencies to the OpenMP 4.0 standard provides the application programmer with a more flexible way of synchronizing tasks. Using such an approach allows both the compiler and the runtime system to know exactly which data are read or written by a given task, and how these data will be used through the program lifetime. Data placement and task scheduling strategies have a significant impact on performance on NUMA architectures. While numerous papers focus on these topics, none of them has made extensive use of the information available through dependencies. One can use this information to modify the behavior of the application at several levels: during initialization, to control data placement, and during execution, to dynamically control both the task placement and the task-stealing strategy, depending on the topology. This paper introduces several heuristics for these strategies and their implementations in our OpenMP runtime XKAAPI. We also evaluate their performance on linear algebra applications executed on a 192-core NUMA machine, reporting noticeable performance improvements when considering both the architecture topology and the tasks' data dependencies. Finally, we compare them to strategies presented in related work.
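    As a rough illustration of how dependency information can guide task placement, the sketch below picks the NUMA node that already holds most of the data a task declares. It is not the heuristic implemented in XKAAPI: the function name `pick_numa_node`, the `page_home` mapping, and the double weight given to written data are all assumptions made for this example.

```python
from collections import defaultdict


def pick_numa_node(deps, page_home, num_nodes):
    """Return the NUMA node holding the largest weighted share of a task's data.

    deps      -- list of (buffer_name, size_in_bytes, mode), mode in {"in", "out", "inout"}
    page_home -- dict mapping buffer_name to the NUMA node where it was allocated
    num_nodes -- number of NUMA nodes on the machine

    Hypothetical heuristic for illustration only.
    """
    weight = defaultdict(int)
    for name, size, mode in deps:
        node = page_home.get(name)
        if node is None:
            continue  # data not placed yet: it gives no locality hint
        # Assumption: written data counts double, since remote writes are costly.
        factor = 2 if mode in ("out", "inout") else 1
        weight[node] += factor * size
    if not weight:
        return 0  # no hint at all: fall back to node 0
    # Highest weight wins; ties go to the lowest node id.
    return max(range(num_nodes), key=lambda n: (weight[n], -n))


# A task reading A and B (both on node 0) and updating C (on node 1).
task_deps = [("A", 4096, "in"), ("B", 4096, "in"), ("C", 8192, "inout")]
homes = {"A": 0, "B": 0, "C": 1}
chosen = pick_numa_node(task_deps, homes, num_nodes=4)
```

    Here node 1 is chosen because the weighted size of the updated block C exceeds the combined size of the two blocks that are only read. A real runtime would also have to account for load balance and for the stealing strategy.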

    Performance analysis and acceleration of nuclear physics application on high-performance computing platforms using GPGPUs and topology-aware mapping techniques

    The number of nodes on the current generation of high-performance computing (HPC) platforms is increasing at a steady rate, and the nodes of these platforms support multi-core and many-core hardware designs. As the number of cores per node increases, whether CPU or accelerator based, we need to make use of all those cores. Thus, one has to use the accelerators as much as possible inside scientific applications. Furthermore, as the number of nodes increases, the communication time between nodes is likely to increase, which necessitates application-specific, network-topology-aware mapping techniques for efficient utilization of these platforms. In addition, one also needs to construct network models in order to study the benefits of a specific network mapping. These topology-aware mapping techniques help distribute the computational tasks so that the communication patterns make optimal use of the underlying network hardware. This research focuses mainly on the Many Fermion Dynamics nuclear (MFDn) application developed at Iowa State University, a computational tool for low-energy nuclear physics that uses the Lanczos algorithm (LA), an algorithm for the diagonalization of sparse matrices that is widely used in scientific parallel computing. We present techniques that enhance the performance of this application using general-purpose graphics processing units (GPGPUs). Additionally, we compare the performance of sparse matrix-vector multiplication (SpMVM), the main computationally intensive kernel in the LA, with other efficient approaches presented in the literature. We compare the total HPC platforms' resources needed by different SpMVM implementations, present and analyze an implementation of a method for overlapping communication and computation, and extend a model for the analysis of network topology presented in the literature. Finally, we present network-topology-aware mapping techniques, focused on the LA stage, for IBM Blue Gene/Q (BG/Q) supercomputers, which improve performance compared to the default mapping, and we validate the results of our tests using the network model.
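    For readers unfamiliar with the SpMVM kernel mentioned above, the following minimal sketch computes y = A x for a matrix stored in compressed sparse row (CSR) form. The CSR layout and the function name `csr_spmv` are assumptions chosen for illustration; the abstract does not state which storage format MFDn uses, and a production kernel would be parallel and GPU-accelerated rather than plain Python.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Multiply a CSR-stored sparse matrix by a dense vector x.

    values  -- nonzero entries, stored row by row
    col_idx -- column index of each entry in `values`
    row_ptr -- row_ptr[i]:row_ptr[i + 1] delimits the entries of row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of this row contribute to y[row].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# The 3x3 matrix [[2, 0, 1], [0, 3, 0], [4, 0, 5]] in CSR form.
values = [2.0, 1.0, 3.0, 4.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0])  # [5.0, 6.0, 19.0]
```

    Each Lanczos iteration applies one such product, which is why this kernel dominates the run time of the LA stage.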

    Achieving reliability and fairness in online task computing environments

    We consider online task computing environments such as volunteer computing platforms running on BOINC (e.g., SETI@home) and crowdsourcing platforms such as Amazon Mechanical Turk. We model the computations as an Internet-based task computing system under the master-worker paradigm. A master entity sends tasks across the Internet to worker entities willing to perform a computational task. Workers execute the tasks and report back the results, completing the computational round. Unfortunately, workers are untrustworthy and might report an incorrect result. Thus, the first research question we answer in this work is how to design a reliable master-worker task computing system. We capture the workers’ behavior through two realistic models. (1) The “error probability model” assumes the presence of altruistic workers willing to provide correct results and of troll workers aiming to provide random incorrect results. Both types of workers suffer from an error probability that alters their intended response. (2) The “rationality model” assumes the presence of altruistic workers, who always report a correct result; malicious workers, who always report an incorrect result; and rational workers, who follow the strategy that maximizes their utility (benefit). The rational workers can choose between two strategies: either be honest and report a correct result, or cheat and report an incorrect result. Our two modeling assumptions on the workers’ behavior are supported by an experimental evaluation we have performed on Amazon Mechanical Turk. Given the error probability model, we evaluate two reliability techniques, (1) “voting” and (2) “auditing”, in terms of the task assignments required and the time invested to correctly compute a set of tasks with high probability. Considering the rationality model, we take an evolutionary game-theoretic approach and design mechanisms that eventually achieve a reliable computational platform where the master receives the correct task result with probability one and with minimal auditing cost. The designed mechanisms provide incentives to the rational workers, reinforcing their strategy toward correct behavior, and are complemented by four reputation schemes that cope with malice. Finally, we also design a mechanism that deals with unresponsive workers by keeping a reputation related to the workers’ response rate. The designed mechanism selects the most reliable and active workers in each computational round. Simulations depict, among other results, the trade-off between the master’s cost and the time the system needs to reach a state where the master always receives the correct task result. The second research question we answer in this work concerns the fair and efficient distribution of workers among the masters over multiple computational rounds. Masters with similar tasks compete for the same set of workers at each computational round. Workers must be assigned to the masters in a fair manner, namely when the master values a worker’s contribution the most. We consider that a master might behave strategically, declaring a dishonest valuation of a worker in each round in an attempt to increase its benefit. This strategic behavior on the side of the masters might lead to unfair and inefficient assignments of workers. Applying renowned auction mechanisms to solve the problem at hand can be infeasible, since monetary payments are required on the side of the masters. Hence, we present an alternative mechanism for fair and efficient distribution of the workers in the presence of strategic masters, without the use of monetary incentives. We show analytically that our designed mechanism guarantees fairness, is socially efficient, and is truthful. Simulations favourably compare our designed mechanism with two benchmark auction mechanisms. This work has been supported by IMDEA Networks Institute and the Spanish Ministry of Education grant FPU2013-03792. Doctoral thesis with International Mention, Official Doctoral Program in Mathematical Engineering (Programa Oficial de Doctorado en Ingeniería Matemática). Committee: President: Alberto Tarable; Secretary: José Antonio Cuesta Ruiz; Member: Juan Julián Merelo Guervó
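    To make the "voting" technique concrete, the sketch below computes the probability that a simple majority of workers returns the correct result. It assumes a simplified binary setting that is not taken from the thesis: each worker errs independently with the same probability, any wrong answer counts against the correct one, and the number of workers is odd. The function name `majority_correct_probability` is hypothetical.

```python
from math import comb


def majority_correct_probability(n_workers, error_prob):
    """Probability that more than half of n_workers report the correct result.

    Simplifying assumptions (not from the source): independent workers,
    identical error probability, odd number of workers so ties cannot occur.
    """
    if n_workers < 1 or n_workers % 2 == 0:
        raise ValueError("n_workers must be a positive odd integer")
    if not 0.0 <= error_prob <= 1.0:
        raise ValueError("error_prob must lie in [0, 1]")
    p_ok = 1.0 - error_prob
    needed = n_workers // 2 + 1  # smallest strict majority
    # Sum the binomial probabilities of every outcome with a correct majority.
    return sum(
        comb(n_workers, k) * p_ok**k * error_prob ** (n_workers - k)
        for k in range(needed, n_workers + 1)
    )


p3 = majority_correct_probability(3, 0.1)  # about 0.972
p5 = majority_correct_probability(5, 0.1)
```

    Under these assumptions, adding workers raises reliability at the price of more task assignments per task, which is the cost dimension the abstract evaluates.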

    Optimal program variant generation for hybrid manycore systems

    Field Programmable Gate Arrays (FPGAs) promise to deliver superior energy efficiency in heterogeneous high-performance computing compared to multicore CPUs and GPUs. The rate of adoption is, however, hampered by the relative difficulty of programming FPGAs. High-level synthesis (HLS) tools such as Xilinx Vivado, Altera OpenCL or Intel's HLS address a large part of the programmability issue by synthesizing a Hardware Description Language (HDL) representation from a high-level specification of the application, given in programming languages such as OpenCL C that are typically used to program CPUs and GPUs. Although HLS solutions make programming easier, they do not lighten the burden of optimization. Application developers must rely on expert knowledge to manually optimize their applications for each target device, meaning that traditional HLS solutions do not solve the issue of performance portability. This state of affairs prompted the development of compiler frameworks such as TyTra, which operate at an even higher level of abstraction that is amenable to Design Space Exploration (DSE). With DSE, the initial program specification can be seen as the starting location in a search space of correct-by-construction program transformations. In TyTra, the search space is generated from the transitive closure of term-level transformations derived from type-level transformations. Compiler frameworks such as TyTra theoretically solve the issue of performance portability by providing a way to automatically generate alternative correct program variants. They suffer, however, from the very practical issue that the generated space is often too large to explore fully. As a consequence, the globally optimal solution may be overlooked. In this work we provide a novel solution to the issue of performance portability by deriving an efficient yet effective DSE strategy for the TyTra compiler framework. We make use of categorical data types to derive categorical semantics for the formal languages that describe the terms, types, cost-performance estimates and their transformations. From these we define a category of interpretations for TyTra applications, from which we derive a DSE strategy that finds the globally optimal transformation sequence in polynomial time. This is achieved by reducing the size of the generated search space. We formally state and prove a theorem for this claim, and then show that the polynomial run time of our DSE strategy has practically negligible coefficients, leading to sub-second exploration times for realistic applications.