    TaskPoint: sampled simulation of task-based programs

    Sampled simulation is a mature technique for reducing simulation time of single-threaded programs, but it is not directly applicable to simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1 at an average error of 1.8% and a maximum error of 15.0%. This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EU FP7 (contract 2013BP B 00243). T. Grass has been partially supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058). Peer Reviewed. Postprint (author's final draft).
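
    The idea of using task instances as sampling units can be illustrated with a minimal sketch. It is not the TaskPoint implementation: the abstract does not describe the fast-forward mechanism, so the sketch assumes a simple one in which the first few instances of each task type are simulated in detail and the remaining instances are charged the average IPC observed for their type. The names `estimate_total_cycles`, `detailed_cycles` and `samples_per_type` are hypothetical, and the result is the sum of per-task cycles, not the parallel execution time.

```python
from collections import defaultdict


def estimate_total_cycles(instances, detailed_cycles, samples_per_type=2):
    """Estimate the summed cycle count of a list of task instances.

    instances: iterable of (task_type, instruction_count) pairs.
    detailed_cycles: hypothetical callback standing in for a detailed
        simulator; returns the cycles needed by one task instance.
    samples_per_type: how many instances of each type are simulated in
        detail before the rest are fast-forwarded (an assumption).
    """
    if samples_per_type < 1:
        raise ValueError("at least one detailed sample per task type is needed")

    sampled = defaultdict(int)      # detailed samples taken per task type
    instr_sum = defaultdict(int)    # instructions covered by those samples
    cycle_sum = defaultdict(float)  # cycles measured for those samples
    total = 0.0

    for task_type, instrs in instances:
        if sampled[task_type] < samples_per_type:
            # Detailed interval: run the expensive simulation.
            cycles = detailed_cycles(task_type, instrs)
            sampled[task_type] += 1
            instr_sum[task_type] += instrs
            cycle_sum[task_type] += cycles
        else:
            # Fast-forward: reuse the mean IPC measured for this task type.
            ipc = instr_sum[task_type] / cycle_sum[task_type]
            cycles = instrs / ipc
        total += cycles
    return total
```

    With a stand-in simulator such as `lambda task_type, instrs: instrs / 2.0`, only the first two instances of each type trigger the callback; every later instance is estimated from the recorded IPC.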

    Power efficient job scheduling by predicting the impact of processor manufacturing variability

    Modern CPUs suffer from performance and power consumption variability due to the manufacturing process. As a result, systems that ignore this variability suffer performance degradation and waste power. To avoid this negative impact, users and system administrators must actively counteract manufacturing variability. In this work we show that parallel systems benefit from taking into account the consequences of manufacturing variability when making scheduling decisions at the job scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that they decrease job turnaround time by up to 31% compared to contemporary scheduling policies used on production clusters, while saving up to 5.5% energy. Postprint (author's final draft).
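
    A power-budget-aware placement step can be sketched as follows. This is not either of the two policies proposed in the paper, which the abstract does not detail; it is a simple greedy heuristic under stated assumptions: each job needs one node, a hypothetical table `predicted_watts[(job, node)]` holds the output of a variability-aware power model, and a job is started only if its predicted power fits under the remaining budget.

```python
def schedule(queue, free_nodes, predicted_watts, budget_watts, used_watts=0.0):
    """Greedily place queued jobs on nodes without exceeding a power budget.

    queue: job names in priority order.
    free_nodes: names of currently idle nodes.
    predicted_watts: hypothetical dict mapping (job, node) to predicted power.
    budget_watts: system-wide power budget.
    used_watts: power already drawn by running jobs.
    Returns (placements, used_watts) where placements maps job -> node.
    """
    placements = {}
    free = set(free_nodes)
    for job in queue:
        if not free:
            break
        # Pick the node on which this job is predicted to draw the least power.
        # Sorting first makes tie-breaking deterministic.
        node = min(sorted(free), key=lambda n: predicted_watts[(job, n)])
        cost = predicted_watts[(job, node)]
        if used_watts + cost <= budget_watts:
            placements[job] = node
            free.remove(node)
            used_watts += cost
        # Otherwise the job stays queued and later jobs may still be placed.
    return placements, used_watts
```

    Because a job that does not fit is skipped and not treated as blocking, smaller jobs further back in the queue can still start, which resembles backfilling. A production policy would also need to handle multi-node jobs and fairness.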

    Improving scalability of task-based programs

    In a multi-core era, parallel programming allows further performance improvements, but with an important programmability cost. We envision that the best approach to parallel programming that can overcome the programmability, parallelism, power, memory and reliability walls in Computer Architecture is a run-time approach. Many traditional computer architecture concepts can be revisited and applied at the runtime layer in a way that is completely transparent to the programmer. The goal of this work is to bring the computer architecture concepts of value prediction and data prefetching into a runtime environment such as OmpSs.

    The Mediterranean as a melting pot: phylogeography of Loxosceles rufescens (Sicariidae) in the Mediterranean Basin

    The species Loxosceles rufescens is native to the Mediterranean but considered cosmopolitan because it has been dispersed worldwide. A previous study revealed 11 evolutionary lineages across the Mediterranean, grouped into two main clades, without any clear phylogeographic pattern. The high genetic diversity within this species (p-distances of up to 7.8% in some Mediterranean lineages), together with the results obtained with different species delimitation methods (GMYC, TCS), could indicate the existence of cryptic species. Here we compare the mitochondrial and microsatellite diversity to elucidate whether the lineages of L. rufescens in the Mediterranean should be considered different species (cryptic species) or populations of the same species. To do so, we analyzed the cox1 diversity of 196 individuals, of which we genotyped 148, sampled from 19 localities across the Mediterranean. STRUCTURE analyses of microsatellite data identified two genetic clusters of L. rufescens. One cluster included individuals from Western Mediterranean localities (Iberian Peninsula, Morocco, Balearic Islands) and Israel, while the second one grouped individuals from Italian and Greek localities, including Sardinia, Sicily and Tunisia. These patterns suggest that geographic proximity is the more significant factor in the clustering with microsatellite data and show the existence of gene flow between the nearest geographic areas, even if the individuals belong to different mitochondrial lineages or clades. The lack of correspondence between both genetic markers suggests that the evolutionary lineages found within L. rufescens should not be considered different species. We conclude that these phylogenetic lineages and their distribution may be the result of the maternal evolutionary history of the species and human-mediated dispersion.

    Design trade-offs for emerging HPC processors based on mobile market technology

    This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: http://dx.doi.org/10.1007/s11227-019-02819-4 High-performance computing (HPC) is at the crossroads of a potential transition toward mobile market processor technology. Unlike in prior transitions, numerous hardware vendors and integrators will have access to state-of-the-art processor designs due to Arm’s licensing business model. This fact gives them greater flexibility to implement custom HPC-specific designs. In this paper, we undertake a study to quantify the different energy-performance trade-offs when architecting a processor based on mobile market technology. Through detailed simulations over a representative set of benchmarks, our results show that: (i) a modest amount of last-level cache per core is sufficient, leading to significant power and area savings; (ii) in-order cores offer favorable trade-offs when compared to out-of-order cores for a wide range of benchmarks; and (iii) heterogeneous configurations help to improve processor performance and energy efficiency. Peer Reviewed. Postprint (author's final draft).

    Graph partitioning applied to DAG scheduling to reduce NUMA effects

    The complexity of shared memory systems is becoming more relevant as the number of memory domains increases, with different access latencies and bandwidth rates depending on the proximity between the cores and the devices containing the data. In this context, techniques to manage and mitigate non-uniform memory access (NUMA) effects consist in migrating threads, memory pages or both and are typically applied by the system software. We propose techniques at the runtime system level to reduce NUMA effects on parallel applications. We leverage runtime system metadata in terms of a task dependency graph. Our approach, based on graph partitioning methods, is able to provide parallel performance improvements of 1.12x on average with respect to the state-of-the-art. This work has been partially supported by the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Spanish Government (contract TIN2015-65316-P). I. Sánchez Barrera has been supported by the Spanish Government under Formación del Profesorado Universitario fellowship number FPU15/03612. Peer Reviewed. Postprint (published version).
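
    To make the link between a task dependency graph and NUMA placement concrete, the sketch below assigns tasks to sockets so that dependent tasks tend to share a socket. It is a plain greedy heuristic swapped in for illustration, not the graph partitioning method used in the paper, which the abstract does not specify. The names `assign_sockets` and `cut_edges` are hypothetical, and the task list is assumed to be in topological order.

```python
from collections import Counter


def assign_sockets(tasks, deps, num_sockets):
    """Greedy, balanced assignment of tasks to NUMA sockets.

    tasks: task names in topological order (predecessors come first).
    deps: dict mapping a task to the list of tasks it depends on.
    num_sockets: number of NUMA sockets (graph partitions).
    """
    cap = -(-len(tasks) // num_sockets)  # ceiling division: max tasks per socket
    load = [0] * num_sockets
    socket_of = {}
    for task in tasks:
        # Count how many predecessors already live on each socket.
        votes = Counter(socket_of[p] for p in deps.get(task, []))
        candidates = [s for s in range(num_sockets) if load[s] < cap]
        # Prefer the socket holding most predecessors, then the least loaded,
        # then the lowest socket index.
        best = max(candidates, key=lambda s: (votes[s], -load[s], -s))
        socket_of[task] = best
        load[best] += 1
    return socket_of


def cut_edges(deps, socket_of):
    """Number of dependencies whose endpoints sit on different sockets."""
    return sum(
        1
        for task, preds in deps.items()
        for p in preds
        if socket_of[p] != socket_of[task]
    )
```

    The number of cut edges is a rough proxy for remote memory traffic: each cut dependency means a task may read data produced on another socket. A dedicated graph partitioner would minimise this quantity more effectively than the greedy pass shown here.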

    Runtime-guided mitigation of manufacturing variability in power-constrained multi-socket NUMA nodes

    This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243). This work was also partially performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-689878). Finally, the authors are grateful to the reviewers for their valuable comments, to the RoMoL team, to Xavier Teruel and Kallia Chronaki from the Programming Models group of BSC and the Computation Department of LLNL for their technical support and useful feedback. Peer Reviewed. Postprint (published version).

    A vulnerability factor for ECC-protected memory

    Fault injection studies and vulnerability analyses have been used to estimate the reliability of data structures in memory. We survey these metrics and look at their adequacy to describe the data stored in ECC-protected memory. We also introduce FEA, a new metric improving on the memory derating factor by ignoring a class of false errors. We measure all metrics using simulations and compare them to the outcomes of injecting errors in real runs. This in-depth study reveals that FEA provides more accurate results than any state-of-the-art vulnerability metric. Furthermore, FEA gives an upper bound on the failure probability due to an error in memory, making this metric a tool of choice to quantify memory vulnerability. Finally, we show that ignoring these false errors reduces the failure rate on average by 12.75% and up to over 45%. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328), by the Spanish Government (Severo Ochoa grant SEV-2015-0493) and by the European Union’s Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). L. Jaulmes has been partially supported by the Spanish Ministry of Education, Culture and Sports under grant FPU2013/06982. M. Moreto and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowships RYC-2016-21104 and RYC-2017-23269. Peer Reviewed. Postprint (author's final draft).

    Architectural support for task dependence management with flexible software scheduling

    The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU while still performing task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x lower area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. Peer Reviewed. Postprint (author's final draft).
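
    The split between dependence tracking and scheduling can be illustrated with a small software model. It does not describe the DMU hardware or the ISA extensions, which the abstract does not specify; the class `DependenceTracker` and its methods are hypothetical. The tracker derives read-after-write, write-after-read and write-after-write dependences from the addresses each task declares as inputs and outputs, and only reports which tasks are ready. Choosing which ready task to run next is left to the caller, mirroring the idea of keeping scheduling in software.

```python
class DependenceTracker:
    """Software model of address-based task dependence tracking (a sketch)."""

    def __init__(self):
        self.last_writer = {}  # address -> id of the most recent writer task
        self.readers = {}      # address -> ids of tasks reading since that write
        self.pending = {}      # task id -> number of unfinished predecessors
        self.successors = {}   # task id -> ids of tasks waiting on it
        self.done = set()      # ids of finished tasks

    def add_task(self, tid, ins=(), outs=()):
        """Register a task; return True if it is ready to run immediately."""
        preds = set()
        for addr in ins:
            if addr in self.last_writer:          # read-after-write
                preds.add(self.last_writer[addr])
        for addr in outs:
            if addr in self.last_writer:          # write-after-write
                preds.add(self.last_writer[addr])
            preds |= self.readers.get(addr, set())  # write-after-read
        preds -= self.done  # finished tasks no longer block anything

        # Record this task's accesses for tasks registered later.
        for addr in ins:
            self.readers.setdefault(addr, set()).add(tid)
        for addr in outs:
            self.last_writer[addr] = tid
            self.readers[addr] = set()

        self.pending[tid] = len(preds)
        for p in preds:
            self.successors.setdefault(p, set()).add(tid)
        return not preds

    def finish(self, tid):
        """Mark a task as finished; return the tasks that became ready."""
        self.done.add(tid)
        ready = []
        for succ in sorted(self.successors.pop(tid, ())):
            self.pending[succ] -= 1
            if self.pending[succ] == 0:
                ready.append(succ)
        return ready
```

    A software scheduler would keep the ids returned by `add_task` and `finish` in its own ready queue and apply whatever policy it prefers, for example FIFO or locality-aware ordering, without changing the tracker.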

    TD-NUCA: runtime driven management of NUCA caches in task dataflow programming models

    In high performance processors, the design of on-chip memory hierarchies is crucial for performance and energy efficiency. Current processors rely on large shared Non-Uniform Cache Architectures (NUCA) to improve performance and reduce data movement. Multiple solutions exploit information available at the microarchitecture level or in the operating system to optimize NUCA performance. However, existing methods have not taken advantage of the information captured by task dataflow programming models to guide the management of NUCA caches. In this paper we propose TD-NUCA, a hardware/software co-designed approach that leverages information present in the runtime system of task dataflow programming models to efficiently manage NUCA caches. TD-NUCA identifies the data access and reuse patterns of parallel applications in the runtime system and guides the operation of the NUCA caches in the hardware. As a result, TD-NUCA achieves a 1.18x average speedup over the baseline S-NUCA while requiring only 0.62x the data movement. This work has been supported by the Spanish Ministry of Science and Technology (contract PID2019-107255GB-C21) and the Generalitat de Catalunya (contract 2017-SGR-1414). M. Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF ‘Investing in your future’. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship No. RYC-2016-21104. Peer Reviewed. Postprint (published version).