1,503 research outputs found
HPC Cloud for Scientific and Business Applications: Taxonomy, Vision, and Research Challenges
High Performance Computing (HPC) clouds are becoming an alternative to
on-premise clusters for executing scientific applications and business
analytics services. Most research efforts in HPC cloud aim to understand the
cost-benefit of moving resource-intensive applications from on-premise
environments to public cloud platforms. Industry trends show hybrid
environments are the natural path to get the best of the on-premise and cloud
resources---steady (and sensitive) workloads can run on on-premise resources
and peak demand can leverage remote resources in a pay-as-you-go manner.
Nevertheless, there are plenty of questions to be answered in HPC cloud, which
range from how to extract the best performance of an unknown underlying
platform to what services are essential to make its usage easier. Moreover, the
discussion on the right pricing and contractual models to fit small and large
users is relevant for the sustainability of HPC clouds. This paper brings a
survey and taxonomy of efforts in HPC cloud and a vision on what we believe is
ahead of us, including a set of research challenges that, once tackled, can
help advance businesses and scientific discoveries. This becomes particularly
relevant due to the fast-increasing wave of new HPC applications coming from
big data and artificial intelligence.
Comment: 29 pages, 5 figures, published in ACM Computing Surveys (CSUR).
Power efficient job scheduling by predicting the impact of processor manufacturing variability
Modern CPUs suffer from performance and power-consumption variability due to the manufacturing process. As a result, systems that do not account for this manufacturing-induced variability incur performance degradation and wasted power. To avoid this negative impact, users and system administrators must actively counteract manufacturing variability.
In this work we show that parallel systems benefit from taking the consequences of manufacturing variability into account when making scheduling decisions at the job-scheduler level. We also show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications, utilizing up to 4096 cores in total. We demonstrate that, compared to contemporary scheduling policies used on production clusters, they decrease job turnaround time by up to 31% while saving up to 5.5% energy.
Postprint (author's final draft).
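The placement idea above (favoring nodes whose variability-aware power models predict the lowest draw, while keeping total consumption under a system-wide budget) can be sketched as follows. This is a minimal illustration, not the paper's actual policies: `place_job`, its inputs, and the greedy selection are all invented for the example.

```python
# Hypothetical sketch of variability-aware job placement under a power budget.
# Node-level power predictions would come from variability-aware power models;
# here they are plain numbers, and the greedy policy is a simplification.

def place_job(predicted_watts, committed_watts, nodes_needed, budget_watts):
    """Pick the `nodes_needed` free nodes with the lowest predicted power.

    predicted_watts: {node_id: predicted watts for this job on that node}
    committed_watts: power already committed to running jobs
    Returns the chosen node ids, or None if the job must wait.
    """
    # Prefer the nodes predicted to draw the least power for this job.
    candidates = sorted(predicted_watts, key=predicted_watts.get)[:nodes_needed]
    if len(candidates) < nodes_needed:
        return None  # not enough free nodes
    added = sum(predicted_watts[n] for n in candidates)
    if committed_watts + added > budget_watts:
        return None  # placement would exceed the system-wide power budget
    return candidates
```

Under this toy policy, a job that would push the system over its budget waits rather than run on power-hungry parts of the machine.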
A methodology for full-system power modeling in heterogeneous data centers
The need for energy-awareness in current data centers has encouraged the use of power modeling to estimate their power consumption. However, existing models present noticeable limitations, which make them application-dependent, platform-dependent, inaccurate, or computationally complex. In this paper, we propose a platform- and application-agnostic methodology for full-system power modeling in heterogeneous data centers that overcomes those limitations. It derives a single model per platform, which works with high accuracy for heterogeneous applications with different patterns of resource usage and energy consumption, by systematically selecting a minimum set of resource usage indicators and extracting complex relations among them that capture the impact on energy consumption of all the resources in the system. We demonstrate our methodology by generating power models for heterogeneous platforms with very different power consumption profiles. Our validation experiments with real Cloud applications show that such models provide high accuracy (around 5% average estimation error).
This work is supported by the Spanish Ministry of Economy and Competitiveness under contract TIN2015-65316-P, by the Generalitat de Catalunya under contract 2014-SGR-1051, and by the European Commission under FP7-SMARTCITIES-2013 contract 608679 (RenewIT) and FP7-ICT-2013-10 contracts 610874 (ASCETiC) and 610456 (EuroServer).
Peer reviewed. Postprint (author's final draft).
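As a rough illustration of resource-usage-based power modeling, the sketch below fits the simplest possible full-system model (idle power plus a slope on one utilization indicator) by ordinary least squares. The paper's methodology instead selects a minimum set of indicators and captures complex relations among them; the function names and the single-indicator model are assumptions for the example.

```python
# Illustrative only: a one-indicator linear power model fitted by ordinary
# least squares, as a stand-in for full-system power modeling from resource
# usage indicators.

def fit_linear(utilization, watts):
    """Fit watts = idle + slope * utilization over paired samples."""
    n = len(utilization)
    mean_u = sum(utilization) / n
    mean_w = sum(watts) / n
    cov = sum((u - mean_u) * (w - mean_w) for u, w in zip(utilization, watts))
    var = sum((u - mean_u) ** 2 for u in utilization)
    slope = cov / var
    idle = mean_w - slope * mean_u  # estimated idle (static) power
    return idle, slope

def predict_power(idle, slope, utilization):
    """Estimate full-system power draw at a given utilization level."""
    return idle + slope * utilization
```

A real model of this family would add indicators for memory, disk, and network activity, plus interaction terms, which is where the platform-agnostic indicator selection matters.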
Scheduling, Characterization and Prediction of HPC Workloads for Distributed Computing Environments
As High Performance Computing (HPC) has grown considerably and is expected to grow even more, effective resource management for distributed computing systems is motivated more than ever. As computational workloads grow in quantity, it becomes more crucial to apply efficient resource management and workload scheduling, using resources efficiently while keeping computational performance reasonably good. The problem of efficiently scheduling workloads on resources while meeting performance standards is hard, and the non-clairvoyance of job dimensions makes resource management even harder in real-world scenarios. Our research methodology investigates the scheduling problem for HPC and the challenges of deploying scheduling in real-world scenarios using state-of-the-art machine learning and data science techniques.
To this end, this Ph.D. dissertation makes the following core contributions: a) We perform a theoretical analysis of space-sharing, non-preemptive scheduling: we study this scheduling problem, propose scheduling algorithms with polynomial computation time, and prove constant upper bounds on the performance of these algorithms. b) We study the sensitivity of scheduling algorithms to runtime-estimate accuracy and devise a meta-learning approach to estimate prediction accuracy for newly submitted jobs on an HPC system. c) We study the runtime-prediction problem for HPC applications: we examine the distributions of available public workloads and propose two different solutions that can predict multi-modal distributions, switching state-space models and Mixture Density Networks. d) We study the effectiveness of recent recurrent neural network models for CPU usage-trace prediction, for individual VM traces as well as aggregate CPU usage traces.
In this dissertation, we explore solutions to improve the performance of scheduling workloads on distributed systems. We begin by looking at the problem from a theoretical perspective. Modeling the problem mathematically, we first propose a scheduling algorithm that finds a constant approximation of the optimal solution in polynomial time; we prove that the algorithm's performance (average completion time) is a constant approximation of that of the optimal schedule. We next look at the problem in real-world scenarios. Considering High-Performance Computing (HPC) workload environments as the closest real-world equivalent of our mathematical model, we explore the problem of predicting application runtime. We propose an algorithm to handle the uncertainties that exist in the real world and showcase its effectiveness in terms of response time and resource utilization. After addressing the uncertainty problem, we focus on improving the accuracy of existing prediction approaches for HPC application runtime. We propose two solutions, one based on Kalman filters and one based on deep mixture density networks, and showcase their effectiveness by comparing them with previous approaches in terms of prediction accuracy and impact on scheduling performance. In the end, we focus on predicting resource usage for individual applications during their execution, exploring recurrent neural networks for predicting the resource usage of applications deployed on individual virtual machines. To validate our proposed models and solutions, we performed extensive trace-driven simulations and measured the effectiveness of our approaches.
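A toy version of the multi-modal runtime-prediction idea: fit two modes to historical runtimes with a one-dimensional two-center clustering and report each mode's weight and mean. This is purely illustrative, standing in for the dissertation's switching state-space models and Mixture Density Networks; `two_mode_mixture` and its fixed two-mode assumption are not from the source.

```python
# Hedged sketch: summarizing a multi-modal runtime distribution with two
# modes via 1-D k-means. Real MDN-style predictors would learn mixture
# weights, means, and variances conditioned on job features.

def two_mode_mixture(runtimes, iters=20):
    """Fit two centers to historical runtimes and return (weight, mean)
    per mode, in ascending order of the initial centers."""
    centers = [min(runtimes), max(runtimes)]
    for _ in range(iters):
        groups = ([], [])
        for r in runtimes:
            # Assign each runtime to its nearest center (True indexes slot 1).
            groups[abs(r - centers[0]) > abs(r - centers[1])].append(r)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    n = len(runtimes)
    return [(len(g) / n, sum(g) / len(g)) for g in groups if g]
```

A scheduler consuming this summary could, for example, backfill using the mean of the dominant mode instead of a single global average, which is where multi-modality pays off.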
Power Bounded Computing on Current & Emerging HPC Systems
Power has become a critical constraint for the evolution of large scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technologies, from IC chips all the way up to data centers due to physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emergent computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets.
We have multiple research objectives in this dissertation, centered on understanding the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application-aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristics and the power budget on a cluster. Third, we investigate performance interference induced by hardware and power contention, and propose contention-aware job scheduling to maximize system throughput under given power budgets for node-sharing systems. Fourth, we extend this work to GPU-accelerated systems and workloads and develop an online dynamic performance and power approach that meets both performance requirements and power-efficiency goals.
Power bounded computing improves performance scalability and power efficiency and decreases the operating costs of HPC systems and data centers. This dissertation opens up several new avenues for research in power bounded computing to address the power challenges in HPC systems. The proposed power and resource management techniques provide new directions and guidelines for green exascale computing and other computing systems.
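One simple way to picture cluster-level power coordination under a budget is a water-filling allocation: give every node its minimum power, then spread the remainder evenly up to each node's cap. The sketch below uses made-up per-node caps and is not the dissertation's actual hierarchical, application-aware policy.

```python
# Illustrative only: water-filling division of a cluster power budget across
# nodes, respecting per-node (min, max) power caps.

def allocate_power(budget, caps):
    """caps: list of (min_w, max_w) per node. Every node gets its minimum,
    then leftover power is spread evenly, clamped at each node's maximum."""
    alloc = [mn for mn, _ in caps]
    leftover = budget - sum(alloc)
    assert leftover >= 0, "budget below the sum of node minimums"
    open_idx = set(range(len(caps)))  # nodes that can still absorb power
    while leftover > 1e-9 and open_idx:
        share = leftover / len(open_idx)
        leftover = 0.0
        for i in list(open_idx):
            room = caps[i][1] - alloc[i]
            give = min(share, room)
            alloc[i] += give
            leftover += share - give  # undistributed power goes another round
            if room <= share:
                open_idx.discard(i)   # node saturated at its cap
    return alloc
```

An application-aware variant would weight each node's share by the predicted performance gained per watt rather than splitting evenly.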
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the
context of the proposed Blue Waters system, which is planned to be the largest
NSF-funded supercomputer when it begins production use in 2012. The aim of this
report is to inform the BW project about MTC, including understanding aspects
of MTC applications that can be used to characterize the domain and
understanding the implications of these aspects to middleware and policies.
Many MTC applications do not neatly fit the stereotypes of high-performance
computing (HPC) or high-throughput computing (HTC) applications. Like HTC
applications, by definition MTC applications are structured as graphs of
discrete tasks, with explicit input and output dependencies forming the graph
edges. However, MTC applications have significant features that distinguish
them from typical HTC applications. In particular, different engineering
constraints for hardware and software must be met in order to support these
applications. HTC applications have traditionally run on platforms such as
grids and clusters, through either workflow systems or parallel programming
systems. MTC applications, in contrast, will often demand a short time to
solution, may be communication intensive or data intensive, and may comprise
very short tasks. Therefore, hardware and software for MTC must be engineered
to support the additional communication and I/O and must minimize task dispatch
overheads. The hardware of large-scale HPC systems, with its high degree of
parallelism and support for intensive communication, is well suited for MTC
applications. However, HPC systems often lack a dynamic resource-provisioning
feature, are not ideal for task communication via the file system, and have an
I/O system that is not optimized for MTC-style applications. Hence, additional
software support is likely to be required to gain full benefit from the HPC
hardware.
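The task-graph structure described above (discrete tasks with explicit input and output dependencies forming the graph edges) can be dispatched with a simple wave-based scheduler. This is only a sketch with hypothetical names; real MTC middleware must amortize task-dispatch overhead far more aggressively than a loop like this.

```python
# Sketch (not Blue Waters middleware): dispatching an MTC-style task graph.
# A task becomes ready once all of its input dependencies have completed.

def run_task_graph(deps, run):
    """deps: {task: set of prerequisite tasks}; run: callable executing a task.
    Dispatches ready tasks in waves and returns the dispatch order."""
    remaining = {t: set(d) for t, d in deps.items()}
    done, order = set(), []
    while remaining:
        # Everything whose dependencies are all satisfied forms the next wave.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        for t in ready:  # real MTC systems batch dispatch to cut overhead
            run(t)
            order.append(t)
            done.add(t)
            del remaining[t]
    return order
```

The cost the report highlights lives inside `run`: with very short tasks, per-dispatch overhead and file-system-mediated communication dominate unless the hardware and software are engineered for them.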
A Study on Earth Science Data Generation via Numerical Modeling and Machine Learning in a Cloud Computing Environment
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, School of Earth and Environmental Sciences, August 2022. Advisor: Yang-Ki Cho.
To investigate changes and phenomena on Earth, many scientists use high-resolution model results based on numerical models or develop and utilize machine learning-based prediction models with observed data. As information technology advances, there is a need for a practical methodology for generating local and global high-resolution numerical modeling and machine learning-based earth science data.
This study recommends data generation and processing using high-resolution numerical models of earth science and machine learning-based prediction models in a cloud environment.
To verify the reproducibility and portability of high-resolution numerical ocean model implementation on cloud computing, I simulated and analyzed the performance of a numerical ocean model at various resolutions in the model domain, including the Northwest Pacific Ocean, the East Sea, and the Yellow Sea. With the containerization method, it was possible to respond to changes in various infrastructure environments and achieve computational reproducibility effectively.
Data augmentation of subsurface temperature data was performed using generative models to prepare large datasets for training a model that predicts the vertical temperature distribution in the ocean. Augmentation targeted the observed data, which are relatively scarce compared to the satellite dataset.
In addition to observation data, HYCOM datasets were used for performance comparison, and the data distribution of augmented data was similar to the input data distribution. The ensemble method, which combines stand-alone predictive models, improved the performance of the predictive model compared to that of the model based on the existing observed data. Large amounts of computational resources were required for data synthesis, and the synthesis was performed in a cloud-based graphics processing unit environment.
High-resolution numerical ocean model simulation, predictive model development, and the data generation method can improve predictive capabilities in the field of ocean science. The numerical modeling and generative models based on cloud computing used in this study can be broadly applied to various fields of earth science.
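The ensemble step described in the abstract (combining stand-alone predictive models) can be illustrated minimally by averaging the depth-profile predictions of independent models. The models and profiles below are synthetic placeholders, not the thesis's trained networks or its HYCOM/observation data.

```python
# Hedged sketch of ensemble prediction for vertical temperature profiles:
# each model maps input features to a temperature value per depth level,
# and the ensemble averages them level by level.

def ensemble_predict(models, features):
    """Average the depth-profile predictions of independent models."""
    profiles = [m(features) for m in models]
    depth_count = len(profiles[0])
    return [sum(p[i] for p in profiles) / len(profiles)
            for i in range(depth_count)]
```

Averaging tends to cancel the uncorrelated errors of the individual predictors, which matches the abstract's observation that the ensemble outperformed models trained on the observed data alone.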
1. General Introduction
2. Performance of numerical ocean modeling on cloud computing
2.1. Introduction
2.2. Cloud Computing
2.2.1. Cloud computing overview
2.2.2. Commercial cloud computing services
2.3. Numerical model for performance analysis of commercial clouds
2.3.1. High Performance Linpack Benchmark
2.3.2. Benchmark Sustainable Memory Bandwidth and Memory Latency
2.3.3. Numerical Ocean Model
2.3.4. Deployment of Numerical Ocean Model and Benchmark Packages on Cloud Clusters
2.4. Simulation results
2.4.1. Benchmark simulation
2.4.2. Ocean model simulation
2.5. Analysis of ROMS performance on commercial clouds
2.5.1. Performance of ROMS according to H/W resources
2.5.2. Performance of ROMS according to grid size
2.6. Summary
3. Reproducibility of numerical ocean models on cloud computing
3.1. Introduction
3.2. Containerization of numerical ocean models
3.2.1. Container virtualization
3.2.2. Container-based architecture for HPC
3.2.3. Container-based architecture for hybrid cloud
3.3. Materials and Methods
3.3.1. Comparison of traditional and container-based HPC cluster workflows
3.3.2. Model domain and datasets for numerical simulation
3.3.3. Building the container image and registration in the repository
3.3.4. Configuring a numerical model execution cluster
3.4. Results and Discussion
3.4.1. Reproducibility
3.4.2. Portability and Performance
3.5. Conclusions
4. Generative models for the prediction of ocean temperature profiles
4.1. Introduction
4.2. Materials and Methods
4.2.1. Model domain and datasets for predicting the subsurface temperature
4.2.2. Model architecture for predicting the subsurface temperature
4.2.3. Neural network generative models
4.2.4. Prediction Models
4.2.5. Accuracy
4.3. Results and Discussion
4.3.1. Data Generation
4.3.2. Ensemble Prediction
4.3.3. Limitations of this study and future works
4.4. Conclusion
5. Summary and conclusion
6. References
7. Abstract (in Korean)