8 research outputs found

    DREditor: A Time-efficient Approach for Building a Domain-specific Dense Retrieval Model

    Deploying dense retrieval models efficiently is becoming increasingly important across various industries. This is especially true for enterprise search services, where customizing search engines to meet the time demands of different enterprises in different domains is crucial. Motivated by this, we develop a time-efficient approach called DREditor to edit the matching rule of an off-the-shelf dense retrieval model to suit a specific domain. This is achieved by directly calibrating the output embeddings of the model using an efficient and effective linear mapping. This mapping is powered by an edit operator that is obtained by solving a specially constructed least-squares problem. Compared to implicit rule modification via lengthy fine-tuning, our experimental results show that DREditor provides significant advantages on different domain-specific datasets, dataset sources, retrieval models, and computing devices. It consistently enhances time efficiency by 100-300 times while maintaining comparable or even superior retrieval performance. In a broader context, we take the first step in introducing a novel embedding calibration approach for the retrieval task, filling a gap in the current field of embedding calibration. This approach also paves the way for building domain-specific dense retrieval models efficiently and inexpensively. Comment: 15 pages, 6 figures. Code is available at https://github.com/huangzichun/DREdito
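    As a rough illustration of the embedding-calibration idea sketched above, and not the paper's exact formulation, the snippet below fits a linear edit operator by ridge-regularized least squares from question/answer embedding pairs of a target domain and applies it to a retriever's output embeddings. The function names, the regularization term, and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_edit_operator(Q, A, lam=1e-2):
    """Fit a linear edit operator W mapping question embeddings toward their
    matching answer embeddings by solving the ridge-regularized least-squares
    problem  min_W ||Q W - A||^2 + lam ||W||^2  (illustrative only; the
    paper's construction of the edit operator may differ).
    Q, A: (n, d) arrays of row embeddings from the target domain."""
    d = Q.shape[1]
    # Closed-form ridge solution: W = (Q^T Q + lam I)^{-1} Q^T A
    return np.linalg.solve(Q.T @ Q + lam * np.eye(d), Q.T @ A)

def edit_embeddings(X, W):
    """Calibrate off-the-shelf embeddings with the edit operator."""
    return X @ W

# Toy usage with random vectors standing in for a dense retriever's outputs.
rng = np.random.default_rng(0)
Q = rng.normal(size=(128, 64))      # question embeddings (target domain)
A = rng.normal(size=(128, 64))      # matching answer embeddings
W = fit_edit_operator(Q, A)
calibrated = edit_embeddings(Q, W)  # embeddings then used for retrieval
```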

    Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics

    The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth requirements of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) across 25 CPU benchmarks and 61% across 24 GPU benchmarks, compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline. Comment: 15 pages, 12 figures, 4 tables. Published in IEEE Cluster 202
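    The memory-module and NIC savings quoted above stem from pooling: a disaggregated rack can be provisioned for the peak of the rack-wide aggregate demand rather than the sum of per-node peaks. The toy sketch below, with made-up demand traces, illustrates that statistical-multiplexing effect; it is not the paper's methodology, and the workload model and numbers are arbitrary assumptions.

```python
import numpy as np

# Toy illustration of why pooled (disaggregated) provisioning needs less
# hardware: per-node provisioning sizes each node for its own peak demand,
# while a pooled rack is sized for the peak of the aggregate demand, which
# is smaller whenever node peaks do not coincide.
rng = np.random.default_rng(0)
nodes, timesteps = 32, 1000
# Memory-module demand per node over time (synthetic, arbitrary units).
demand = rng.gamma(shape=2.0, scale=1.0, size=(nodes, timesteps))

per_node = np.ceil(demand.max(axis=1)).sum()   # sum of per-node peaks
pooled = np.ceil(demand.sum(axis=0).max())     # peak of rack-wide aggregate

print(f"per-node provisioning: {per_node:.0f} modules")
print(f"pooled provisioning:   {pooled:.0f} modules "
      f"({per_node / pooled:.1f}x fewer)")
```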

    Automatic regenerative simulation via non-reversible simulated tempering

    Simulated Tempering (ST) is an MCMC algorithm for complex target distributions that operates on a path between the target and a more amenable reference distribution. Crucially, if the reference enables i.i.d. sampling, ST is regenerative and can be parallelized across independent tours. However, the difficulty of tuning ST has hindered its widespread adoption. In this work, we develop a simple nonreversible ST (NRST) algorithm, provide a general theoretical analysis of ST, and derive an automated tuning procedure for ST. A core contribution that arises from the analysis is a novel performance metric -- Tour Effectiveness (TE) -- that controls the asymptotic variance of estimates from ST for bounded test functions. We use the TE to show that NRST dominates its reversible counterpart. We then develop an automated tuning procedure for NRST algorithms that targets the TE while minimizing computational cost. This procedure enables straightforward integration of NRST into existing probabilistic programming languages. We provide extensive experimental evidence that our tuning scheme improves the performance and robustness of NRST algorithms on a diverse set of probabilistic models.
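    As a loose, self-contained illustration of the structure described above -- an annealing path between a tractable reference and the target, regenerative tours that begin with an i.i.d. draw from the reference, and a non-reversible move on the temperature index that flips direction on rejection -- here is a toy one-dimensional sketch. The schedule, the zeroed pseudo-prior weights, and the tour bookkeeping are simplifying assumptions for the example; this is not the paper's NRST algorithm nor its TE-based tuning procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D problem: reference pi_0 = N(0, 1), target pi = N(3, 0.5^2).
def log_ref(x):
    return -0.5 * x ** 2

def log_target(x):
    return -0.5 * ((x - 3.0) / 0.5) ** 2

betas = np.linspace(0.0, 1.0, 10)  # annealing schedule (assumed uniform)
log_c = np.zeros(len(betas))       # pseudo-prior log-weights (~ -log Z_k); zero here, tuned in practice

def log_anneal(x, beta):
    # Log density (up to a constant) on the path between reference and target.
    return (1.0 - beta) * log_ref(x) + beta * log_target(x)

def explore(x, beta, steps=5, scale=0.5):
    # Random-walk Metropolis at a fixed temperature.
    for _ in range(steps):
        y = x + scale * rng.normal()
        if np.log(rng.random()) < log_anneal(y, beta) - log_anneal(x, beta):
            x = y
    return x

def nrst_tour():
    # One regenerative tour: start from an i.i.d. reference draw at level 0
    # moving up, flip direction whenever a level move is rejected, and stop
    # once the chain is back at level 0 moving down.
    k, eps = 0, +1
    x = rng.normal()                   # i.i.d. reference draw => regeneration
    target_states = []
    while True:
        if k == 0:
            x = rng.normal()           # level 0 allows i.i.d. refresh from the reference
        else:
            x = explore(x, betas[k])
        if k == len(betas) - 1:
            target_states.append(x)    # states visited at the target level
        j = k + eps
        if 0 <= j < len(betas):
            log_r = (log_anneal(x, betas[j]) - log_anneal(x, betas[k])
                     + log_c[j] - log_c[k])
            if np.log(rng.random()) < log_r:
                k = j
            else:
                eps = -eps             # non-reversible: reverse direction on rejection
        else:
            eps = -eps
        if k == 0 and eps == -1:
            return target_states       # tour ends back at the reference

# Independent tours could run in parallel; here they run sequentially.
draws = [x for _ in range(200) for x in nrst_tour()]
print(f"collected {len(draws)} target-level states across 200 tours")
```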

    λ™μ‹œμ— μ‹€ν–‰λ˜λŠ” 병렬 처리 μ–΄ν”Œλ¦¬μΌ€μ΄μ…˜λ“€μ„ μœ„ν•œ 병렬성 관리

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming increasingly important for improving utilization of the available hardware resources. While co-location of parallel jobs is common practice, efficiently executing multiple parallel applications simultaneously remains a challenge for current parallel runtime systems. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, in order to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical cores. Moreover, the OS kernel must manage all worker threads generated by all running parallel applications, and this management cost grows with the number of co-located applications. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which assigns distinct core resources to each application, is considered a promising and scalable approach for executing co-located applications. Despite its growing importance, spatial scheduling raises two fundamental research issues. First, it requires runtime support so that parallel applications can run efficiently under spatial core allocations that may change at runtime. Second, the scheduler needs to assign the proper number of cores to each application, depending on the application's performance characteristics, to achieve good runtime performance. To this end, this thesis presents three novel runtime-level techniques for efficiently executing co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying availability of core resources, which lets them run efficiently under changing core availability, in contrast to conventional runtime systems that do not adjust an application's degree of parallelism. Second, this thesis introduces an analytical performance model that estimates the resource utilization and performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and we employ queueing theory to model the memory system; the queueing-based approach allows performance to be estimated from closed-form equations and hardware performance counters (a toy illustration follows this record). Third, we present a core allocation framework that manages core resources among co-located parallel applications. Through analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage generally leads to better performance than conventional core allocation policies that maximize only CPU usage. The presented core allocation framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, building on the cooperative parallel runtime support and the analytical model.
μ œμ•ˆλœ ν”„λ ˆμž„μ›Œν¬λŠ” λ™μ‹œμ— 동 μž‘ν•˜λŠ” 병렬 처리 μ–΄ν”Œλ¦¬μΌ€μ΄μ…˜μ˜ 병렬성 및 μ½”μ–΄ μžμ›μ„ κ΄€λ¦¬ν•˜μ—¬ λ©€ν‹° μ†ŒμΌ“ λ©€ν‹°μ½”μ–΄ μ‹œμŠ€ν…œμ—μ„œ CPU μžμ› 및 λ©”λͺ¨λ¦¬ λŒ€μ—­ν­ μžμ› ν™œμš©λ„λ₯Ό λ™μ‹œμ— 졜적 ν™”ν•œλ‹€. 해석적인 λͺ¨λΈλ§κ³Ό μ œμ•ˆλœ μ½”μ–΄ ν• λ‹Ή ν”„λ ˆμž„μ›Œν¬μ˜ μ„±λŠ₯ 평가λ₯Ό ν†΅ν•΄μ„œ, μš°λ¦¬κ°€ μ œμ•ˆν•˜λŠ” 정책이 일반적인 κ²½μš°μ— CPU μžμ›μ˜ ν™œμš©λ„λ§Œμ„ μ΅œμ ν™”ν•˜λŠ” 방법에 λΉ„ν•΄μ„œ ν•¨κ»˜ λ™μž‘ν•˜λŠ” μ–΄ν”Œλ¦¬μΌ€μ΄μ…˜λ“€μ˜ μ‹€ν–‰μ‹œκ°„μ„ κ°μ†Œμ‹œν‚¬ 수 μžˆμŒμ„ 보여쀀닀.1 Introduction 1 1.1 Motivation 1 1.2 Background 5 1.2.1 The OpenMP Runtime System 5 1.2.2 Target Multi-Socket Multicore Systems 7 1.3 Contributions 8 1.3.1 Cooperative Runtime Systems 9 1.3.2 Performance Modeling 9 1.3.3 Parallelism Management 10 1.4 Related Work 11 1.4.1 Cooperative Runtime Systems 11 1.4.2 Performance Modeling 12 1.4.3 Parallelism Management 14 1.5 Organization of this Thesis 15 2 Dynamic Spatial Scheduling with Cooperative Runtime Systems 17 2.1 Overview 17 2.2 Malleable Workloads 19 2.3 Cooperative OpenMP Runtime System 21 2.3.1 Cooperative User-Level Tasking 22 2.3.2 Cooperative Dynamic Loop Scheduling 27 2.4 Experimental Results 30 2.4.1 Standalone Application Performance 30 2.4.2 Performance in Spatial Core Allocation 33 2.5 Discussion 35 2.5.1 Contributions 35 2.5.2 Limitations and Future Work 36 2.5.3 Summary 37 3 Performance Modeling of Parallel Loops using Queueing Systems 38 3.1 Overview 38 3.2 Background 41 3.2.1 Queueing Models 41 3.2.2 Insights on Performance Modeling of Parallel Loops 43 3.2.3 Performance Analysis 46 3.3 Queueing Systems for Multi-Socket Multicores 54 3.3.1 Hierarchical Queueing Systems 54 3.3.2 Computingthe Parameter Values 60 3.4 The Speedup Prediction Model 63 3.4.1 The Speedup Model 63 3.4.2 Implementation 64 3.5 Evaluation 65 3.5.1 64-core AMD Opteron Platform 66 3.5.2 72-core Intel Xeon Platform 68 3.6 Discussion 70 3.6.1 Applicability of the Model 70 3.6.2 Limitations of the Model 72 3.6.3 Summary 73 4 Maximizing System Utilization via Parallelism Management 74 4.1 Overview 74 4.2 Background 76 4.2.1 Modeling Performance Metrics 76 4.2.2 Our Resource Management Policy 79 4.3 NuPoCo: Parallelism Management for Co-Located Parallel Loops 82 4.3.1 Online Performance Model 82 4.3.2 Managing Parallelism 86 4.4 Evaluation of NuPoCo 90 4.4.1 Evaluation Scenario 1 90 4.4.2 Evaluation Scenario 2 98 4.5 MOCA: An Evolutionary Approach to Core Allocation 103 4.5.1 Evolutionary Core Allocation 104 4.5.2 Model-Based Allocation 106 4.6 Evaluation of MOCA 113 4.7 Discussion 118 4.7.1 Contributions and Limitations 118 4.7.2 Summary 119 5 Conclusion and Future Work 120 5.1 Conclusion 120 5.2 Future work 122 5.2.1 Improving Multi-Objective Core Allocation 122 5.2.2 Co-Scheduling of Parallel Jobs for HPC Systems 123 A Additional Experiments for the Performance Model 124 A.1 Memory Access Distribution and Poisson Distribution 124 A.1.1 Memory Access Distribution 124 A.1.2 Kolmogorov Smirnov Test 127 A.2 Additional Performance Modeling Results 134 A.2.1 Results with Intel Hyperthreading 134 A.2.2 Results with Cooperative User-Level Tasking 134 A.2.3 Results with Other Loop Schedulers 138 A.2.4 Results with Different Number of Memory Nodes 138 B Other Research Contributions of the Author 141 B.1 Compiler and Runtime Support for Integrated CPU-GPU Systems 141 B.2 Modeling NUMA Architectures with Stochastic Tool 143 B.3 Runtime Environment for a Manycore Architecture 143 초둝 159 Acknowledgements 161Docto

    Machine learning-based performance analytics for high-performance computing systems

    High-performance Computing (HPC) systems play pivotal roles in societal and scientific advancements, executing up to quintillions of calculations every second. As we shift towards exascale computing and beyond, modern HPC systems emphasize resource sharing, where various applications share processors, memory, networks, and other components. While this sharing enhances power efficiency, it complicates performance prediction and introduces significant variations in application running times, affecting overall system efficiency and operational costs. HPC systems utilize monitoring frameworks that gather numerical telemetry data on resource usage to track operational status. Given the massive complexity and volume of this data, manual analysis is often daunting and inefficient. Machine learning (ML) techniques offer automated performance anomaly diagnosis, but the transition from successful research outcomes to production-scale deployment encounters two critical obstacles. First, the scarcity of labeled training data (i.e., identifying healthy and anomalous runs) in telemetry datasets makes it hard to train these ML systems effectively. Second, runtime analysis, required for providing timely detection and diagnosis of performance anomalies, demands seamless integration of ML-based methods with the monitoring frameworks. This thesis claims that ML-based performance analytics frameworks that leverage a limited amount of labeled data and ensure runtime analysis can achieve sufficient anomaly diagnosis performance for production HPC systems. To support this claim, we undertake ML-based performance analytics on two fronts. First, we design and develop novel frameworks for anomaly diagnosis that leverage semi-supervised or unsupervised learning techniques to reduce the need for extensive labeled data. Second, we design a simple yet adaptable architecture to enable deployment and demonstrate that these frameworks are feasible for runtime analysis. This thesis makes the following specific contributions: First, we design a semi-supervised anomaly diagnosis framework, Proctor, which operates with hundreds of labeled samples (in contrast to tens of thousands) and a vast number of unlabeled samples. We show that Proctor outperforms the fully supervised baseline by up to 11% in F1-score for diagnosing anomalies when there are approximately 30 labeled samples. We then reframe the problem and introduce ALBADRoss to determine which samples should be labeled by experts to maximize the model performance using active learning. On a production HPC dataset, ALBADRoss achieves a 0.95 F1-score (the same score that a fully supervised framework achieved) and a near-zero false alarm rate using 24x fewer labeled samples. Finally, with Prodigy, we address the anomaly detection problem with a focus on deployment. Prodigy is designed for detecting performance anomalies on compute nodes using unsupervised learning. Our framework achieves a 0.95 F1-score in detecting anomalies on a production HPC system telemetry dataset. We also design a simple and adaptable software architecture and deploy it on a 1488-node production HPC system, detecting real-world performance anomalies with 88% accuracy.
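    Proctor, ALBADRoss, and Prodigy are only summarized above and their code is not reproduced here. As a rough, generic illustration of the unsupervised setting that Prodigy targets -- detecting anomalous telemetry windows without labels -- the sketch below trains scikit-learn's IsolationForest on synthetic "healthy" metric windows and flags outliers. The metric names, contamination rate, and data are assumptions for the example, not the thesis's features or results.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic telemetry: rows are time windows on a compute node, columns are
# metrics (stand-ins for e.g. CPU utilization, memory bandwidth, network
# counters gathered by a monitoring framework).
rng = np.random.default_rng(0)
healthy = rng.normal(loc=[0.6, 0.4, 0.2], scale=0.05, size=(500, 3))
anomalous = rng.normal(loc=[0.95, 0.9, 0.1], scale=0.05, size=(25, 3))
X = np.vstack([healthy, anomalous])

# Unsupervised detector trained without any labels, mirroring the setting
# where labeled healthy/anomalous runs are scarce.
detector = IsolationForest(contamination=0.05, random_state=0).fit(healthy)

pred = detector.predict(X)              # +1 = normal, -1 = flagged anomalous
scores = detector.decision_function(X)  # higher = more normal
print(f"flagged {np.sum(pred == -1)} of {len(X)} windows as anomalous")
```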

    Supercomputing Frontiers

    This open access book constitutes the refereed proceedings of the 6th Asian Supercomputing Conference, SCFA 2020, which was planned to be held in February 2020 but was cancelled as a physical event due to the COVID-19 pandemic. The 8 full papers presented in this book were carefully reviewed and selected from 22 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platforms, container image configuration workflows, large-scale applications, and scheduling.