2,711 research outputs found

    DeSyRe: on-Demand System Reliability

    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect-free and fault-free system would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. To reduce the overheads of fault tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints.

    Dynamic load balancing for the distributed mining of molecular structures

    In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, for which no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute's HIV-screening data set, where we were able to show close-to-linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non-dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids.
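    The receiver-initiated idea mentioned in this abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Worker` class, the `expand` toy search tree, and the donate-half policy are hypothetical and are not taken from the paper. The point it shows is that an idle worker (the receiver) asks its peers for work, rather than a busy worker pushing work out.

```python
from collections import deque


class Worker:
    """Holds a stack of unexplored search-tree nodes (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()
        self.processed = 0

    def donate(self):
        """Give away the oldest half of the pending nodes (nothing if only one is left)."""
        share = len(self.pending) // 2
        return [self.pending.popleft() for _ in range(share)]


def expand(node):
    """Toy irregular tree: node (depth, width) has `width` children of widths 0..width-1."""
    depth, width = node
    if depth == 0:
        return []
    return [(depth - 1, w) for w in range(width)]


def run(root, n_workers=4):
    """Round-based simulation; returns how many nodes each worker processed."""
    workers = [Worker(f"w{i}") for i in range(n_workers)]
    workers[0].pending.append(root)
    while any(w.pending for w in workers):
        for i, w in enumerate(workers):
            if not w.pending:
                # Receiver-initiated: the idle worker polls its peers for work.
                for offset in range(1, n_workers):
                    donor = workers[(i + offset) % n_workers]
                    w.pending.extend(donor.donate())
                    if w.pending:
                        break
            if w.pending:
                node = w.pending.pop()  # depth-first: newest node first
                w.processed += 1
                w.pending.extend(expand(node))
    return [w.processed for w in workers]


if __name__ == "__main__":
    print(run((6, 5)))
```

    Donating the oldest pending nodes is a common heuristic in this kind of scheme, since nodes near the root of the search tree tend to represent larger subtrees. Whether the paper uses the same policy is not stated in the abstract.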

    SplitPlace: AI augmented splitting and placement of large-scale neural networks in mobile edge environments

    In recent years, deep learning models have become ubiquitous in industry and academia alike. Deep neural networks can solve some of the most complex pattern-recognition problems today, but come at the price of massive compute and memory requirements. This makes deploying such large-scale neural networks challenging on resource-constrained mobile edge computing platforms, specifically in mission-critical domains like surveillance and healthcare. A promising solution is to split resource-hungry neural networks into lightweight, disjoint, smaller components for pipelined distributed processing. At present, there are two main approaches to do this: semantic and layer-wise splitting. The former partitions a neural network into parallel disjoint models that each produce a part of the result, whereas the latter partitions it into sequential models that produce intermediate results. However, there is no intelligent algorithm that decides which splitting strategy to use and places such modular splits on edge nodes for optimal performance. To address this, this work proposes a novel AI-driven online policy, SplitPlace, that uses Multi-Armed-Bandits to intelligently decide between layer and semantic splitting strategies based on the input task's service deadline demands. SplitPlace places such neural network split fragments on mobile edge devices using decision-aware reinforcement learning for efficient and scalable computing. Moreover, SplitPlace fine-tunes its placement engine to adapt to volatile environments. Our experiments on physical mobile-edge environments with real-world workloads show that SplitPlace can significantly improve on the state of the art in terms of average response time, deadline violation rate, inference accuracy, and total reward by up to 46, 69, 3, and 12 percent, respectively.
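    The bandit-based choice between splitting strategies can be sketched as follows. The abstract does not say which bandit algorithm SplitPlace uses, so this example swaps in a plain epsilon-greedy bandit, keeps one bandit per deadline class, and uses an invented reward table. The names `EpsilonGreedyBandit`, `toy_reward`, and the "tight"/"loose" classes are all assumptions made for illustration.

```python
import random

ARMS = ("layer", "semantic")


class EpsilonGreedyBandit:
    """Generic epsilon-greedy bandit (an assumed stand-in, not SplitPlace's own policy)."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = tuple(arms)
        self.epsilon = epsilon
        self.counts = {a: 0 for a in self.arms}
        self.values = {a: 0.0 for a in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        # Try every arm once before exploiting.
        untried = [a for a in self.arms if self.counts[a] == 0]
        if untried:
            return untried[0]
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore
        return self.best()  # exploit

    def update(self, arm, reward):
        # Incremental running mean of the observed rewards for this arm.
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

    def best(self):
        return max(self.arms, key=lambda a: self.values[a])


def toy_reward(deadline_class, arm):
    """Invented rewards: parallel (semantic) splits favour tight deadlines in this toy."""
    table = {
        ("tight", "semantic"): 0.9,
        ("tight", "layer"): 0.3,
        ("loose", "semantic"): 0.6,
        ("loose", "layer"): 0.8,
    }
    return table[(deadline_class, arm)]


def train(rounds=200):
    """One bandit per deadline class, fed alternating tight and loose tasks."""
    bandits = {c: EpsilonGreedyBandit(ARMS, seed=i) for i, c in enumerate(("tight", "loose"))}
    for t in range(rounds):
        deadline_class = "tight" if t % 2 == 0 else "loose"
        arm = bandits[deadline_class].select()
        bandits[deadline_class].update(arm, toy_reward(deadline_class, arm))
    return bandits


if __name__ == "__main__":
    for name, bandit in train().items():
        print(name, bandit.best())
```

    In a real deployment the reward would come from measured response time, deadline violations, and accuracy rather than a fixed table.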

    Dynamic Optimization Techniques for Resource-Efficient Execution of Distributed Machine Learning

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Computer Science and Engineering, February 2020. Advisor: Byung-Gon Chun.
    Machine learning (ML) systems are widely used to extract insights from data. Ever-increasing dataset sizes and model complexity have given rise to many efforts towards efficient distributed machine learning systems. One of the popular approaches to support large-scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks, where workers iteratively compute gradients to update the global model parameters that are kept in servers. To improve PS system performance, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. First, we propose a solution that optimizes the resource configuration and workload partitioning of distributed ML training on a PS system. To find the best configuration, we build an Optimizer based on a cost model that works with online metrics. To efficiently apply the decisions made by the Optimizer, we design our runtime to be elastic, so that it performs reconfiguration in the background with minimal overhead. The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Specifically, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks in fine-grained units to avoid resource contention. To alleviate the memory pressure caused by co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data between disk and memory. We build a working system that implements our approaches. The two solutions are implemented in the same system and share the runtime part that can dynamically migrate jobs between machines and reallocate machine resources. We evaluate our system with popular ML applications to verify the effectiveness of our solutions.
    Contents:
    Chapter 1. Introduction: 1.1 Distributed Machine Learning on Parameter Servers; 1.2 Automating System Configuration of Distributed Machine Learning; 1.3 Scheduling of Multiple Distributed Machine Learning Jobs; 1.4 Contributions; 1.5 Dissertation Structure
    Chapter 2. Background
    Chapter 3. Automating System Configuration of Distributed Machine Learning: 3.1 System Configuration Challenges; 3.2 Finding Good System Configuration (3.2.1 Cost Model; 3.2.2 Cost Formulation; 3.2.3 Optimization); 3.3 Cruise (3.3.1 Optimizer; 3.3.2 Elastic Runtime); 3.4 Evaluation (3.4.1 Experimental Setup; 3.4.2 Finding Baselines with Grid Search; 3.4.3 Optimization in the Homogeneous Environment; 3.4.4 Utilizing Opportunistic Resources; 3.4.5 Optimization in the Heterogeneous Environment; 3.4.6 Reconfiguration Speed); 3.5 Related Work; 3.6 Summary
    Chapter 4. A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs: 4.1 Resource Under-utilization Problems in PS ML Training; 4.2 Harmony Overview; 4.3 Multiplexing ML Jobs (4.3.1 Fine-grained Execution with Subtasks; 4.3.2 Dynamic Grouping of Jobs; 4.3.3 Dynamic Data Reloading); 4.4 Evaluation (4.4.1 Baselines; 4.4.2 Experimental Setup; 4.4.3 Performance Comparison; 4.4.4 Performance Breakdown; 4.4.5 Workload Sensitivity Analysis; 4.4.6 Accuracy of the Performance Model; 4.4.7 Performance and Scalability of the Scheduling Algorithm; 4.4.8 Dynamic Data Reloading); 4.5 Discussion; 4.6 Related Work; 4.7 Summary
    Chapter 5. Conclusion: 5.1 Summary; 5.2 Future Work (5.2.1 Other Communication Architecture Support; 5.2.2 Deep Learning & GPU Resource Support)
    Abstract (in Korean)
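    The parameter server pattern described in this abstract (workers pull the global parameters, compute gradients on their own data shard, and push updates back to the server) can be sketched in a few lines. This is a single-process, sequential simulation under stated assumptions: the `ParameterServer` class, the linear-regression task, and the learning rate are illustrative choices, not the dissertation's Cruise or Harmony systems, which run distributed tasks in parallel.

```python
class ParameterServer:
    """Keeps the global model parameters and applies pushed gradients (toy version)."""

    def __init__(self, dim, lr):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        return list(self.weights)

    def push(self, grad):
        # Plain gradient-descent step on the global parameters.
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]


def worker_gradient(weights, shard):
    """Gradient of the mean squared error of a linear model on one data shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2 * err * xj / len(shard)
    return grad


def train(shards, dim, lr=0.1, iterations=1000):
    server = ParameterServer(dim, lr)
    for _ in range(iterations):
        for shard in shards:  # each shard stands in for one worker
            weights = server.pull()
            server.push(worker_gradient(weights, shard))
    return server.weights


def make_shards():
    """Noise-free samples of y = 3*x + 1; the constant 1.0 feature models the bias."""
    point = lambda x: ((x, 1.0), 3 * x + 1)
    return [
        [point(0.0), point(0.5), point(1.0)],
        [point(0.25), point(0.75)],
    ]


if __name__ == "__main__":
    print(train(make_shards(), dim=2))
```

    Real parameter server systems differ mainly in that workers run concurrently and may push stale gradients, which is where configuration and scheduling decisions such as those studied in the dissertation start to matter.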

    DyScale: A MapReduce Job Scheduler for Heterogeneous Multicore Processors

    The functionality of modern multi-core processors is often driven by a given power budget that requires designers to evaluate different design trade-offs, e.g., to choose between many slow, power-efficient cores, or fewer faster, power-hungry cores, or a combination of them. Here, we prototype and evaluate a new Hadoop scheduler, called DyScale, that exploits capabilities offered by heterogeneous cores within a single multi-core processor for achieving a variety of performance objectives. A typical MapReduce workload contains jobs with different performance goals: large batch jobs that are throughput-oriented, and smaller interactive jobs that are response-time sensitive. Heterogeneous multi-core processors enable creating virtual resource pools based on slow and fast cores for multi-class priority scheduling. Since the same data can be accessed with either slow or fast slots, spare resources (slots) can be shared between different resource pools. Using measurements on an actual experimental setting and via simulation, we argue in favor of heterogeneous multi-core processors, as they achieve faster (up to 40 percent) processing of small, interactive MapReduce jobs, while offering improved throughput (up to 40 percent) for large batch jobs. We evaluate the performance benefits of DyScale versus the FIFO and Capacity job schedulers that are broadly used in the Hadoop community.
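    The virtual resource pools with shared spare slots can be pictured with a very small sketch. The abstract does not spell out DyScale's actual policy, so the `Pool` type, the job class names, and the prefer-then-fall-back rule below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """A virtual pool of task slots backed by one kind of core (hypothetical model)."""
    name: str
    free_slots: int


def assign_slot(job_class, fast, slow):
    """Return the name of the pool that serves a task, or None if every slot is busy.

    Assumed rule: interactive jobs prefer fast slots and batch jobs prefer slow
    slots, but either class may borrow a spare slot from the other pool.
    """
    preferred, fallback = (fast, slow) if job_class == "interactive" else (slow, fast)
    for pool in (preferred, fallback):
        if pool.free_slots > 0:
            pool.free_slots -= 1
            return pool.name
    return None


if __name__ == "__main__":
    fast, slow = Pool("fast", 1), Pool("slow", 2)
    for job in ("interactive", "interactive", "batch", "batch"):
        print(job, "->", assign_slot(job, fast, slow))
```

    Borrowing is possible here only because, as the abstract notes, the same data can be accessed from either slow or fast slots.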