3 research outputs found

    Efficient Resource Management for Deep Learning Clusters

    Deep Learning (DL) is rapidly gaining popularity in domains such as computer vision and speech recognition. To meet the increasing demand, large clusters have been built to develop DL models (i.e., for data preparation and model training). DL jobs have unique features, ranging from their hardware requirements to their execution patterns. However, the resource management techniques applied in existing DL clusters have not yet been adapted to those features, which leads to resource inefficiency and hurts the performance of DL jobs. We observed three major challenges brought by DL jobs. First, data preparation jobs, which prepare training datasets from a large volume of raw data, are memory intensive. DL clusters often over-allocate memory to those jobs to protect their performance, which causes memory underutilization in DL clusters. Second, the execution time of a DL training job is often unknown before the job completes. Without such information, existing cluster schedulers are unable to minimize the average Job Completion Time (JCT) of those jobs. Third, model aggregations in Distributed Deep Learning (DDL) training are often assigned a fixed group of CPUs. However, a large portion of those CPUs is wasted because the bursty model aggregations cannot saturate them all the time. In this thesis, we propose a suite of techniques to eliminate the mismatches between DL jobs and resource management in DL clusters. First, we apply the idea of memory disaggregation to improve the memory utilization of DL clusters. The unused memory in data preparation jobs is exposed as remote memory to other machines that are running out of local memory. Second, we design a two-dimensional attained-service-based scheduler to optimize the average JCT of DL training jobs. This scheduler takes the temporal and spatial characteristics of DL training jobs into consideration and can efficiently schedule them without knowing their execution time. Third, we define a shared model aggregation service to reduce the CPU cost of DDL training. Using this service, model aggregations from different DDL training jobs are carefully packed together and use the same group of CPUs in a time-sharing manner. With these techniques, we demonstrate that large improvements in resource efficiency and job performance can be obtained when the cluster's resource management matches the features of DL jobs.
    PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/169955/1/jcgu_1.pd
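The abstract above does not define "two-dimensional attained service". A minimal sketch of one plausible reading follows, assuming a job's attained service is the number of GPUs it uses (spatial) multiplied by the time it has already run (temporal), and that jobs with the least attained service are served first. The `Job` class, the `pick_jobs` function, and the GPU counts are hypothetical names and values for illustration, not the thesis's implementation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int               # spatial dimension: GPUs the job occupies
    executed: float = 0.0   # temporal dimension: seconds run so far

    @property
    def attained_service(self) -> float:
        # Assumed definition: GPUs multiplied by time already executed.
        return self.gpus * self.executed


def pick_jobs(jobs, free_gpus):
    """Pick jobs in order of least attained service until GPUs run out.

    No job's total execution time is needed; only what it has consumed so far.
    """
    chosen = []
    for job in sorted(jobs, key=lambda j: j.attained_service):
        if job.gpus <= free_gpus:
            chosen.append(job)
            free_gpus -= job.gpus
    return chosen


jobs = [Job("a", gpus=4, executed=100), Job("b", gpus=1, executed=200),
        Job("c", gpus=2, executed=50)]
selected = [j.name for j in pick_jobs(jobs, free_gpus=4)]
```

With four free GPUs, the sketch favors jobs "c" and "b" over job "a", which has already consumed the most GPU time.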

    mPart: Miss Ratio Curve Guided Partitioning in Key-Value Stores

    Web applications employ key-value stores to cache the data that is most commonly accessed. The cache improves a web application's performance by serving its requests from memory, avoiding fetching them from the backend database. Since memory space is limited, maximizing memory utilization is key to delivering the best possible performance. This has led to the use of multi-tenant systems, which allow applications to share cache space. In addition, application data access patterns change over time, so the system should be adaptive in its memory allocation. In this thesis, we address both multi-tenancy (where a single cache is used for multiple applications) and dynamic workloads (changing access patterns) using a model that relates the cache size to the application miss ratio, known as a miss ratio curve. Intuitively, the larger the cache, the less likely the system will need to fetch the data from the database. Our efficient, online construction of the miss ratio curve allows us to determine a near-optimal memory allocation given the available system memory, while adapting to changing data access patterns. We show that our model outperforms an existing state-of-the-art sharing model, Memshare, in terms of cache hit ratio and does so at a lower time cost. We show that the average hit ratio is consistently 1 percentage point greater and the 99.9th percentile latency is reduced by as much as 2.9% under standard web application workloads containing millions of requests.
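To illustrate how a miss ratio curve can guide partitioning, here is a minimal sketch. It is not the thesis's algorithm: it builds an exact LRU miss ratio curve from stack distances (the classic offline technique, not an efficient online one) and then splits a cache among tenants with a small dynamic program. Cache size is counted in objects rather than bytes, and the function names and traces are hypothetical.

```python
from collections import OrderedDict


def miss_ratio_curve(trace, max_size):
    """Return mrc where mrc[c] is the LRU miss ratio with room for c objects."""
    stack = OrderedDict()                 # most recently used key is last
    hits_at = [0] * (max_size + 1)        # hits_at[d]: accesses with stack distance d
    for key in trace:
        if key in stack:
            keys = list(stack)
            distance = len(keys) - keys.index(key)   # 1 = most recently used
            if distance <= max_size:
                hits_at[distance] += 1
            stack.move_to_end(key)
        else:
            stack[key] = True             # first access is always a miss
    mrc, hits = [1.0], 0
    for c in range(1, max_size + 1):
        hits += hits_at[c]                # a cache of size c hits every distance <= c
        mrc.append(1 - hits / len(trace))
    return mrc


def allocate(mrcs, requests, total_units):
    """Split total_units among tenants to minimize total expected misses."""
    best = [(0.0, [])] * (total_units + 1)   # best[u]: (misses, sizes) using at most u units
    for tenant, mrc in mrcs.items():
        new_best = []
        for u in range(total_units + 1):
            options = []
            for c in range(min(u, len(mrc) - 1) + 1):
                misses, sizes = best[u - c]
                options.append((misses + mrc[c] * requests[tenant], sizes + [c]))
            new_best.append(min(options, key=lambda o: o[0]))
        best = new_best
    misses, sizes = best[total_units]
    return dict(zip(mrcs, sizes)), misses


traces = {"A": ["a", "a", "a", "a"], "B": ["x", "y", "x", "y"]}
mrcs = {t: miss_ratio_curve(tr, max_size=3) for t, tr in traces.items()}
requests = {t: len(tr) for t, tr in traces.items()}
allocation, expected_misses = allocate(mrcs, requests, total_units=3)
```

The dynamic program is used instead of a greedy split because miss ratio curves are not always convex: tenant B gains nothing from its first object slot and only benefits once it holds two.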

    Faster slab reassignment in memcached

    Web applications, databases, and many datacenter services rely on in-memory key-value stores to cache frequently accessed data. In this work, we focus on a commonly used system, memcached, where even small performance improvements can result in large end-to-end speedups in request latency. memcached organizes its memory into slabs that belong to different classes corresponding to object sizes. Many prior works have explored the problem of how many slabs each class should be assigned in the face of dynamic workloads, typically moving hundreds of slabs during a reassignment. However, we find that as workloads scale and applications use increasing amounts of memory, the current reassignment mechanism in memcached is inefficient. In fact, we measure that reassignments can take millions of requests to complete. Motivated by these findings, we introduce a faster slab reassignment mechanism in memcached with minimal changes to the existing source code. In our experiments, we show that the time needed to reassign a slab is reduced by over 99%, making it possible to reach a workload's steady-state miss ratio 53% to 75% faster. By arriving at the steady-state miss ratio faster, we reduce the overall average miss ratio by 3.42% to 11.5%.