Predicting Software Performance with Divide-and-Learn
Predicting the performance of highly configurable software systems is the
foundation for performance testing and quality assurance. To that end, recent
work has relied on machine/deep learning to model software performance.
However, a crucial yet unaddressed challenge is how to cater for the sparsity
inherent in the configuration landscape: both the influence of configuration
options (features) and the distribution of data samples are highly sparse.
In this paper, we propose an approach based on the concept of
'divide-and-learn', dubbed DaL. The basic idea is that, to handle sample
sparsity, we divide the samples from the configuration landscape into distant
divisions and, for each division, build a regularized deep neural network as the
local model to deal with feature sparsity. A newly given configuration
is then assigned to the right division's model for the final prediction.
Experimental results from eight real-world systems and five sets of training
data reveal that, compared with the state-of-the-art approaches, DaL performs
no worse than the best counterpart on 33 out of 40 cases (26 of which
significantly better), with considerable improvement in accuracy; it
requires fewer samples to reach the same or better accuracy; and it incurs
acceptable training overhead. Practically, DaL also considerably improves
different global models when they are used as the underlying local models, which
further strengthens its flexibility. To promote open science, all the data,
code, and supplementary figures of this work can be accessed at our repository:
https://github.com/ideas-labo/DaL.
Comment: This paper has been accepted by the ACM Joint European Software
Engineering Conference and Symposium on the Foundations of Software
Engineering (ESEC/FSE), 2023.
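No code accompanies the abstract here; the following is a minimal sketch of the divide-and-learn idea under assumed stand-ins (scikit-learn's KMeans for the paper's division step, and a small regularized MLP in place of its deep neural networks). It is an illustration only, not the authors' implementation, whose actual division scheme differs:

```python
# Minimal sketch of divide-and-learn, assuming scikit-learn is available.
# KMeans and MLPRegressor are stand-ins chosen for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor

def fit_divide_and_learn(X, y, n_divisions=4):
    """Split configuration samples into divisions; fit one local model each."""
    km = KMeans(n_clusters=n_divisions, n_init=10, random_state=0).fit(X)
    models = {}
    for d in range(n_divisions):
        mask = km.labels_ == d  # samples belonging to this division
        models[d] = MLPRegressor(hidden_layer_sizes=(64, 64),
                                 alpha=1e-3,  # L2 regularization for sparsity
                                 max_iter=2000, random_state=0).fit(X[mask], y[mask])
    return km, models

def predict(km, models, X_new):
    """Route each new configuration to its division's local model."""
    labels = km.predict(X_new)
    return np.array([models[d].predict(x.reshape(1, -1))[0]
                     for d, x in zip(labels, X_new)])
```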
POSE: A mathematical and visual modelling tool to guide energy-aware code optimisation
Performance engineers are beginning to explore software-level optimisation as a means to reduce the energy consumed when running their codes. This paper presents POSE, a mathematical and visual modelling tool which highlights the relationship between runtime and power consumption. POSE allows developers to assess whether power optimisation is worth pursuing for their codes. We demonstrate POSE by studying the power optimisation characteristics of applications from the Mantevo and Rodinia benchmark suites. We show that LavaMD has the most scope for CPU power optimisation, with improvements in Energy Delay Squared Product (ED2P) of up to 30.59%. Conversely, MiniMD offers the least scope, with improvements to the same metric limited to 7.60%. We also show that no power-optimised version of MiniMD operating below 2.3 GHz can match the ED2P performance of the original code running at 3.2 GHz. For LavaMD this limit is marginally less restrictive, at 2.2 GHz.
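For reference, a brief note on the metric (its standard definition, not anything specific to POSE): ED2P multiplies energy by the square of runtime, so since energy is power times runtime, the metric weights runtime cubically. A lower clock frequency only pays off if the power saving outweighs that cubically weighted slowdown, which is why no sub-2.3 GHz variant of MiniMD can match the original's ED2P.

```latex
% Energy Delay Squared Product: E = energy, T = runtime, P = average power.
% Since E = P T, runtime is weighted cubically against power:
\mathrm{ED^2P} = E \, T^{2} = P \, T^{3}
```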
Simulation Modelling of Cloud Mini and Mega Data Centers Using Cloud Analyst
Cloud computing has become a base technology for many others, including the Internet of Things and big data. The cloud's responsibility becomes critical for real-time applications, where services are required in real time and data processed in the cloud must arrive within a predefined time interval; delays in the cloud's response can have serious consequences, even loss of life. With the current infrastructure, cloud performance has suffered delays due to multiple issues in the traditional cloud network model. This paper proposes a Cloud Mini Data Center architecture, simulated using Cloud Analyst, to minimize delays in cloud service delivery. The paper also simulates the traditional cloud network model using Cloud Analyst and provides a comparative study of both models.
Revisiting the high-performance reconfigurable computing for future datacenters
Modern datacenters are reinforcing their computational power and energy efficiency by assimilating field-programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requisite amplifies the importance of the communication architecture and the virtualization method, with the required features, in order to meet the high-end objective. Consequently, in the last decade, academia and industry have proposed several virtualization techniques and hardware architectures addressing resource management, scheduling, adaptability, segregation, scalability, performance overhead, availability, programmability, time-to-market, security, and, chiefly, multi-tenancy. This paper provides an extensive survey covering three important aspects: a discussion of non-standard terms used in the existing literature; network-on-chip evaluation choices as a means to explore the communication architecture; and virtualization methods under the latest classification. The purpose is to emphasize the importance of choosing an appropriate communication architecture, virtualization technique, and standard language to evolve multi-tenant FPGAs in datacenters. No previous survey has encapsulated these aspects in one work. Open problems are also indicated for the scientific community.
Energy-efficient computing for HPC workloads on heterogeneous manycore chips
Power and energy efficiency is one of the major challenges to achieving exascale computing in the next several years. While chips operating at low voltage have been shown to be highly energy-efficient, low-voltage operation leads to heterogeneity across cores within the microprocessor chip. In this work, we study chips with low-voltage operation and discuss programming systems and performance modeling in the presence of heterogeneity. We propose an integer linear programming based approach for selecting the optimal configuration of a chip that minimizes its energy consumption. We obtain average savings of 26% and 10.7% in the chip's energy consumption for two HPC mini-applications, miniMD and Jacobi, respectively. We also evaluate the energy savings under execution time constraints using the proposed approach. These energy savings are significantly greater than the savings from sub-optimal configurations obtained by heuristics.
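A hedged sketch of this kind of selection problem, assuming the PuLP library and hypothetical per-configuration energy/time measurements (the paper's actual ILP and constraints are richer):

```python
# Minimal sketch of ILP-based configuration selection: choose exactly one
# chip configuration minimizing energy subject to an execution time limit.
# The (energy, time) numbers below are hypothetical placeholders.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value

# candidate configurations: (label, energy in J, execution time in s)
configs = [("8 cores @ low V",   90.0, 12.0),
           ("8 cores @ high V", 140.0,  8.0),
           ("16 cores @ low V", 110.0,  7.5),
           ("16 cores @ high V", 180.0, 5.0)]
T_MAX = 9.0  # execution time constraint, seconds

prob = LpProblem("chip_configuration", LpMinimize)
x = [LpVariable(f"x{i}", cat=LpBinary) for i in range(len(configs))]

prob += lpSum(x[i] * configs[i][1] for i in range(len(configs)))   # minimize energy
prob += lpSum(x) == 1                                              # pick exactly one config
prob += lpSum(x[i] * configs[i][2] for i in range(len(configs))) <= T_MAX

prob.solve()
chosen = next(c for c, xi in zip(configs, x) if value(xi) == 1)
print("selected:", chosen[0])
```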
Multithreading opportunities for program optimizations
The introduction of chip multiprocessors (CMPs) led to a substantial reformulation of Moore's law, now stating that the number of cores on a single chip doubles every year and a half.
The tech boom around CMPs gave a strong impulse to parallel program design, narrowing its "gap" with parallel architectures.
Nowadays, a leading trend in high-performance products is CMPs with multithreaded CPU nodes.
Basically, CPU multithreading tries to overcome the underutilization of superscalar processors, caused by the lack of exploitable instruction-level parallelism (ILP), by allowing the simultaneous processing of different programs during the same time slot.
In multithreaded architectures, a thread is a concurrent computational entity supported directly at the firmware level (such threads are usually called hardware threads).
Multithreading technology opens a broad range of possible optimizations that can be applied to improve the performance of sequential and parallel applications.
This thesis treats four such optimizations targeted at multithreading architectures: Speculative Precomputation, Threaded Multipath Execution, Speculative Multithreading, and Communication Threads.
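These four are hardware-level techniques, but the helper-thread idea behind Speculative Precomputation has a simple software analogue, sketched below with ordinary Python threads standing in for hardware threads (all names and the workload are illustrative): a companion thread runs ahead of the main computation and fetches upcoming data so the consumer rarely stalls on a slow load.

```python
# Software analogue of a helper (precomputation) thread: while the main
# thread processes block i, a companion thread fetches block i+1, hiding
# the latency of the slow load behind useful work.
import threading, queue, time

def slow_load(i):
    time.sleep(0.05)           # stands in for a cache miss / disk / network read
    return list(range(i * 4, (i + 1) * 4))

def helper(n_blocks, buf):
    for i in range(n_blocks):  # run ahead of the consumer, prefetching blocks
        buf.put(slow_load(i))

def main(n_blocks=8):
    buf = queue.Queue(maxsize=2)   # small lookahead window
    threading.Thread(target=helper, args=(n_blocks, buf), daemon=True).start()
    total = 0
    for _ in range(n_blocks):
        total += sum(buf.get())    # data is usually already waiting
    print(total)

if __name__ == "__main__":
    main()
```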
Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget
Building future-generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, much research is being done to revolutionize hardware design in order to build power-efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages a hardware-facilitated capability to constrain the power consumption of each node, in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions as new jobs arrive or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to a 5.2x improvement in job throughput compared with the power-unaware SLURM scheduling policy. We corroborate our results with real experiments on a relatively small-scale cluster, in which we obtain a 1.7x improvement.
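To make the flavor of the allocation problem concrete, here is an illustrative greedy sketch (with hypothetical, concave power-response curves, not the paper's model or algorithm): power is handed out in fixed increments to whichever job currently gains the most throughput per watt, until the machine-level budget is exhausted.

```python
# Rough illustration of power-aware allocation under a strict budget.
# Each job's throughput response to its power cap is a made-up concave
# curve; real systems would use measured power characteristics.
import math

def throughput(job, watts):
    p_min, scale = job                # (minimum watts, responsiveness)
    return 0.0 if watts < p_min else scale * math.log1p(watts - p_min)

def allocate(jobs, budget, step=5.0):
    alloc = [job[0] for job in jobs]  # start every job at its minimum power
    remaining = budget - sum(alloc)
    while remaining >= step:
        gains = [throughput(j, a + step) - throughput(j, a)
                 for j, a in zip(jobs, alloc)]
        best = max(range(len(jobs)), key=gains.__getitem__)
        alloc[best] += step           # fund the best marginal gain
        remaining -= step
    return alloc

jobs = [(40.0, 3.0), (60.0, 1.2), (50.0, 2.1)]
print(allocate(jobs, budget=300.0))   # per-job power caps in watts
```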
Analytical Modeling is Enough for High Performance BLIS
We show how the BLAS-like Library Instantiation Software (BLIS) framework, which provides a more detailed layering of the GotoBLAS (now maintained as OpenBLAS) implementation, allows one to analytically determine tuning parameters for high-end instantiations of the matrix-matrix multiplication. This is of both practical and scientific importance, as it greatly reduces the development effort required for the implementation of the level-3 BLAS while also advancing our understanding of how hierarchically layered memories interact with high-performance software. This allows the community to move on from valuable engineering solutions (empirical autotuning) to scientific understanding (analytical insight). This research was sponsored in part by NSF grants ACI-1148125/1340293 and CCF-0917167.
Enrique S. Quintana-Ortí was supported by project TIN2011-23283 of the Ministerio de Ciencia e Innovación and FEDER. Francisco D. Igual was supported by project TIN2012-32180 of the Ministerio de Ciencia e Innovación.
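To give a flavor of what "analytically determined tuning parameters" means here, the sketch below derives GEMM blocking sizes from cache geometry rather than from empirical autotuning: a k_c x n_r micro-panel of B is kept in L1, an m_c x k_c block of A in L2, and a k_c x n_c panel of B in L3. The cache sizes, micro-kernel shape, and half-cache occupancy fractions are assumed placeholders, not the exact BLIS rules.

```python
# Simplified flavor of the analytical model: derive GEMM blocking sizes
# from cache geometry. Sizes and occupancy fractions are placeholders.
def blocking_params(l1=32*1024, l2=256*1024, l3=8*1024*1024,
                    mr=6, nr=8, dtype_bytes=8):
    kc = l1 // (2 * nr * dtype_bytes)      # k_c: B micro-panel (k_c x n_r) in ~half of L1
    mc = (l2 // 2) // (kc * dtype_bytes)   # m_c: A block (m_c x k_c) in ~half of L2
    mc -= mc % mr                          # round down to a multiple of m_r
    nc = (l3 // 2) // (kc * dtype_bytes)   # n_c: B panel (k_c x n_c) in ~half of L3
    nc -= nc % nr                          # round down to a multiple of n_r
    return mc, kc, nc

print(blocking_params())  # e.g. (60, 256, 2048) for these assumed caches
```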