77 research outputs found

    Greening Multi-Tenant Data Center Demand Response

    Get PDF
    Data centers have emerged as promising resources for demand response, particularly for emergency demand response (EDR), which saves the power grid from incurring blackouts during emergency situations. However, currently, data centers typically participate in EDR by turning on backup (diesel) generators, which is both expensive and environmentally unfriendly. In this paper, we focus on "greening" demand response in multi-tenant data centers, i.e., colocation data centers, by designing a pricing mechanism through which the data center operator can efficiently extract load reductions from tenants during emergency periods to fulfill energy reduction requirement for EDR. In particular, we propose a pricing mechanism for both mandatory and voluntary EDR programs, ColoEDR, that is based on parameterized supply function bidding and provides provably near-optimal efficiency guarantees, both when tenants are price-taking and when they are price-anticipating. In addition to analytic results, we extend the literature on supply function mechanism design, and evaluate ColoEDR using trace-based simulation studies. These validate the efficiency analysis and conclude that the pricing mechanism is both beneficial to the environment and to the data center operator (by decreasing the need for backup diesel generation), while also aiding tenants (by providing payments for load reductions).Comment: 34 pages, 6 figure

    Optimizing Resource Management in Cloud Analytics Services

    Get PDF
    The fundamental challenge in the cloud today is how to build and optimize machine learning and data analytical services. Machine learning and data analytical platforms are changing computing infrastructure from expensive private data centers to easily accessible online services. These services pack user requests as jobs and run them on thousands of machines in parallel in geo-distributed clusters. The scale and the complexity of emerging jobs lead to increasing challenges for the clusters at all levels, from power infrastructure to system architecture and corresponding software framework design. These challenges come in many forms. Today's clusters are built on commodity hardware and hardware failures are unavoidable. Resource competition, network congestion, and mixed generations of hardware make the hardware environment complex and hard to model and predict. Such heterogeneity becomes a crucial roadblock for efficient parallelization on both the task level and job level. Another challenge comes from the increasing complexity of the applications. For example, machine learning services run jobs made up of multiple tasks with complex dependency structures. This complexity leads to difficulties in framework designs. The scale, especially when services span geo-distributed clusters, leads to another important hurdle for cluster design. Challenges also come from the power infrastructure. Power infrastructure is very expensive and accounts for more than 20% of the total costs to build a cluster. Power sharing optimization to maximize the facility utilization and smooth peak hour usages is another roadblock for cluster design. In this thesis, we focus on solutions for these challenges at the task level, on the job level, with respect to the geo-distributed data cloud design and for power management in colocation data centers. At the task level, a crucial hurdle to achieving predictable performance is stragglers, i.e., tasks that take significantly longer than expected to run. At this point, speculative execution has been widely adopted to mitigate the impact of stragglers in simple workloads. We apply straggler mitigation for approximation jobs for the first time. We present GRASS, which carefully uses speculation to mitigate the impact of stragglers in approximation jobs. GRASS's design is based on the analysis of a model we develop to capture the optimal speculation levels for approximation jobs. Evaluations with production workloads from Facebook and Microsoft Bing in an EC2 cluster of 200 nodes show that GRASS increases accuracy of deadline-bound jobs by 47% and speeds up error-bound jobs by 38%. Moving from task level to job level, task level speculation mechanisms are designed and operated independently of job scheduling when, in fact, scheduling a speculative copy of a task has a direct impact on the resources available for other jobs. Thus, we present Hopper, a job-level speculation-aware scheduler that integrates the tradeoffs associated with speculation into job scheduling decisions based on a model generalized from the task-level speculation model. We implement both centralized and decentralized prototypes of the Hopper scheduler and show that 50% (66%) improvements over state-of-the-art centralized (decentralized) schedulers and speculation strategies can be achieved through the coordination of scheduling and speculation. As computing resources move from local clusters to geo-distributed cloud services, we are expecting the same transformation for data storage. We study two crucial pieces of a geo-distributed data cloud system: data acquisition and data placement. Starting from developing the optimal algorithm for the case of a data cloud made up of a single data center, we propose a near-optimal, polynomial-time algorithm for a geo-distributed data cloud in general. We show, via a case study, that the resulting design, Datum, is near-optimal (within 1.6%) in practical settings. Efficient power management is a fundamental challenge for data centers when providing reliable services. Power oversubscription in data centers is very common and may occasionally trigger an emergency when the aggregate power demand exceeds the capacity. We study power capping solutions for handling such emergencies in a colocation data center, where the operator supplies power to multiple tenants. We propose a novel market mechanism based on supply function bidding, called COOP, to financially incentivize and coordinate tenants' power reduction for minimizing total performance loss while satisfying multiple power capping constraints. We demonstrate that COOP is "win-win", increasing the operator's profit (through oversubscription) and reducing tenants' costs (through financial compensation for their power reduction during emergencies).</p

    Communication-Aware Scheduling of Precedence-Constrained Tasks on Related Machines

    Get PDF
    Scheduling precedence-constrained tasks is a classical problem that has been studied for more than fifty years. However, little progress has been made in the setting where there are communication delays between tasks. Results for the case of identical machines were derived nearly thirty years ago, and yet no results for related machines have followed. In this work, we propose a new scheduler, Generalized Earliest Time First (GETF), and provide the first provable, worst-case approximation guarantees for the goals of minimizing both the makespan and total weighted completion time of tasks with precedence constraints on related machines with machine-dependent communication times

    Joint Data Purchasing and Data Placement in a Geo-Distributed Data Market

    Full text link

    Communication-Aware Scheduling of Precedence-Constrained Tasks

    Get PDF
    Jobs in large-scale machine learning platforms are expressed using a computational graph of tasks with precedence constraints. To handle such precedence-constrained tasks that have machine-dependent communication demands in settings with heterogeneous service rates and communication times, we propose a new scheduling framework, Generalized Earliest Time First (GETF), that improves upon stateof- the-art results in the area. Specifically, we provide the first provable, worst-case approximation guarantee for the goal of minimizing the makespan of tasks with precedence constraints on related machines with machine-dependent communication times
    • …
    corecore