
    A survey on architectures and energy efficiency in Data Center Networks

    Data Center Networks (DCNs) are attracting growing interest from both academia and industry to keep pace with the exponential growth in cloud computing and enterprise networks. Modern DCNs face two main challenges: scalability and cost-effectiveness. The architecture of a DCN directly affects its scalability, while its cost is largely driven by its power consumption. In this paper, we conduct a detailed survey of the most recent advances and research activities in DCNs, with a special focus on the architectural evolution of DCNs and their energy efficiency. The paper provides a qualitative categorization of existing DCN architectures into switch-centric and server-centric topologies, along with their design technologies. Energy efficiency in data centers is discussed in detail, with a survey of existing techniques for energy savings, green data centers, and renewable energy approaches. Finally, we outline potential future research directions in DCNs.

    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources: web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    Hyperscale Data Processing With Network-Centric Designs

    Today’s largest data processing workloads are hosted in cloud data centers. Due to unprecedented data growth and the end of Moore’s Law, these workloads have ballooned to the hyperscale level, encompassing billions to trillions of data items and hundreds to thousands of machines per query. These workloads are enabled by, and grow with, highly scalable data center networks that connect up to hundreds of thousands of servers. These massive scales fundamentally challenge the designs of both data processing systems and data center networks, and the classic layered designs are no longer sustainable. Rather than optimizing these massive layers in silos, we build systems across them with principled network-centric designs. In current networks, we redesign data processing systems with network awareness to minimize the cost of moving data in the network. In future networks, we propose new interfaces and services that the cloud infrastructure offers to applications, and we codesign data processing systems to achieve optimal query processing performance. To move the network toward these future designs, we facilitate network innovation at scale. This dissertation presents a line of systems work that covers all three directions. It first discusses GraphRex, a network-aware system that combines classic database and systems techniques to push the performance of massive graph queries in current data centers. It then introduces data processing in disaggregated data centers, a promising new cloud proposal. It details TELEPORT, a compute pushdown feature that eliminates data processing performance bottlenecks in disaggregated data centers, and Redy, which provides high-performance caches using remote disaggregated memory. Finally, it presents MimicNet, a fine-grained simulation framework that evaluates network proposals at datacenter scale with machine learning approximation. These systems demonstrate that our network-centric designs achieve orders of magnitude higher efficiency than the state of the art at hyperscale.
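
    A minimal sketch may make the compute-pushdown direction concrete. All names and the data model here are illustrative assumptions, not TELEPORT's actual interface; the point is only that shipping a filter to the storage side moves far less data than pulling raw rows across the network.

    def pull_then_filter(storage_rows, predicate):
        # Baseline: every row crosses the network, then is filtered locally.
        transferred = list(storage_rows)  # simulated network transfer
        return [r for r in transferred if predicate(r)], len(transferred)

    def pushdown_filter(storage_rows, predicate):
        # Pushdown: the predicate runs at the storage side; only matches move.
        matches = [r for r in storage_rows if predicate(r)]
        return matches, len(matches)

    if __name__ == "__main__":
        rows = range(1_000_000)
        is_hot = lambda r: r % 1000 == 0      # assumed selective predicate
        _, moved_pull = pull_then_filter(rows, is_hot)
        _, moved_push = pushdown_filter(rows, is_hot)
        print(f"rows moved: pull={moved_pull}, pushdown={moved_push}")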

    Improving the end-to-end latency of datacenter applications using coordination across application components

    To handle millions of user requests every second and process hundreds of terabytes of data each day, many organizations have turned to large datacenter-scale computing systems. The applications running in these datacenters consist of a multitude of dependent logical components, or stages, each of which performs a specific function. These stages are connected to form a directed acyclic graph (DAG), with edges representing input-output dependencies. Each stage can run over tens to thousands of machines and involves multiple cluster sub-systems such as storage, network, and compute. The scale and complexity of these applications can lead to significant delays in their end-to-end latency, yet the organizations running these applications have strict requirements on this latency, as it directly affects their revenue and operational costs. Addressing this problem, the goal of this dissertation is to develop scheduling and resource allocation techniques that optimize the end-to-end latency of datacenter applications. The key idea behind these techniques is to coordinate different application components, allowing us to allocate cluster resources efficiently. In particular, we develop planning algorithms that coordinate the storage and compute sub-systems in datacenters to determine how many resources should be allocated to each stage in an application, and where in the cluster they should be allocated, to meet application requirements (e.g., meeting completion-time goals or minimizing average completion time). To further speed up applications at runtime, we develop several latency-reduction techniques: reissuing laggards elsewhere in the cluster, returning partial results, and speeding up laggards by giving them extra resources. We perform a global optimization to coordinate across all the stages in an application DAG and determine which of these techniques works best for each stage, while ensuring that the cost incurred by these techniques stays within a given end-to-end budget. We use application characteristics to predict and determine how resources should be allocated to different application components to meet the end-to-end latency requirements. We evaluate our techniques on two kinds of datacenter applications: (a) web services and (b) data analytics. With large-scale simulations and an implementation in Apache YARN (Hadoop 2.0), we use workloads derived from production traces to show that our techniques can achieve more than 50% reduction in the 99th-percentile latency of web services and up to 56% reduction in the median latency of data analytics jobs.
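
    As a rough illustration of the per-stage planning step, the sketch below greedily assigns at most one mitigation technique to each stage under an end-to-end cost budget. It is a deliberately simplified stand-in for the global DAG optimization the dissertation describes, and all names, costs, and the greedy heuristic itself are illustrative assumptions.

    from dataclasses import dataclass

    @dataclass
    class Option:
        stage: str
        technique: str       # "reissue", "partial_results", or "extra_resources"
        reduction_ms: float  # predicted latency reduction for this stage
        cost: float          # extra resource cost of applying it

    def plan(options: list[Option], budget: float) -> dict[str, str]:
        """Greedily pick at most one technique per stage within the budget."""
        chosen: dict[str, str] = {}
        spent = 0.0
        # Prefer options with the best reduction-to-cost ratio.
        for opt in sorted(options, key=lambda o: o.reduction_ms / o.cost,
                          reverse=True):
            if opt.stage in chosen or spent + opt.cost > budget:
                continue
            chosen[opt.stage] = opt.technique
            spent += opt.cost
        return chosen

    if __name__ == "__main__":
        opts = [
            Option("frontend", "reissue", 12.0, 2.0),
            Option("ranker", "extra_resources", 30.0, 5.0),
            Option("ranker", "partial_results", 20.0, 1.0),
            Option("storage", "reissue", 8.0, 3.0),
        ]
        print(plan(opts, budget=6.0))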

    Towards Power- and Energy-Efficient Datacenters

    As the Internet evolves, cloud computing is now a dominant form of computation in modern lives. Warehouse-scale computers (WSCs), or datacenters, which form the foundation of this cloud-centric web, have been able to deliver satisfactory performance to both the Internet companies and their customers. With the increased focus on and popularity of the cloud, however, datacenter loads are rising rapidly, and Internet companies need greater computing capacity to serve this demand. Unfortunately, power and energy are often the major limiting factors prohibiting datacenter growth: it is often the case that no more servers can be added to a datacenter without exceeding the capacity of the existing power infrastructure. This dissertation investigates the issues of power and energy usage in a modern datacenter environment. We identify sources of power and energy inefficiency at three levels of a modern datacenter environment and provide insights and solutions to address each of these problems, aiming to prepare datacenters for critical future growth. We start at the datacenter level and find that peak provisioning and improper service placement in multi-level power delivery infrastructures fragment the power budget inside production datacenters, degrading the compute capacity the existing infrastructure can support. We find that the heterogeneity among datacenter workloads is key to addressing this issue and design systematic methods to reduce the fragmentation and improve the utilization of the power budget. The dissertation then narrows its focus to the energy usage of individual servers running cloud workloads. In particular, we examine the power management mechanisms employed in these servers and find that their coarse time granularity is one critical factor leading to excessive energy consumption. We propose an intelligent, low-overhead solution, built on emerging finer-granularity voltage/frequency boosting circuits, that pinpoints and boosts the queries most likely to lengthen the tail of the latency distribution and to benefit from a voltage/frequency boost, improving energy efficiency without sacrificing quality of service. The final part of this dissertation investigates how a fundamentally more efficient computing substrate, field-programmable gate arrays (FPGAs), benefits datacenter power and energy efficiency. Unlike other types of hardware acceleration, FPGAs can be reconfigured on the fly to provide fine-grain control over hardware resource allocation, which presents a unique set of challenges for optimal workload scheduling and resource allocation. We design a set of coordinated algorithms that manage these two key factors simultaneously and fully explore the benefits of deploying FPGAs in the highly varying cloud environment.

    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/144043/1/hsuch_1.pd
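
    A minimal, self-contained sketch of the per-query boosting idea follows. Everything here (the deadline, the predictor, the DVFS stub) is an illustrative assumption rather than the dissertation's mechanism, which targets real fine-grained voltage/frequency hardware; the sketch only shows the policy shape: boost the cores running tail-bound queries, and leave short queries at the efficient frequency.

    DEADLINE_MS = 50.0  # assumed per-query tail-latency target

    def set_frequency(core: int, level: str) -> None:
        """Stub standing in for a fast per-core DVFS interface."""
        print(f"core {core}: frequency -> {level}")

    def on_progress(core: int, predicted_total_ms: float,
                    boosted: bool) -> bool:
        """Called at a fine time granularity while a query runs.

        Boosts the core only when the query is predicted to land in the
        latency tail, so short queries never pay the energy cost of a boost.
        """
        if not boosted and predicted_total_ms > DEADLINE_MS:
            set_frequency(core, "max")
            return True
        return boosted

    if __name__ == "__main__":
        # Two queries on core 0: one comfortably short, one predicted slow.
        on_progress(core=0, predicted_total_ms=20.0, boosted=False)  # no boost
        on_progress(core=0, predicted_total_ms=80.0, boosted=False)  # boosted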

    Optimization and Management Techniques for Geo-distributed SDN-enabled Cloud Datacenters' Provisioning

    Cloud computing has become a business reality that impacts technology users around the world. It has become a cornerstone for emerging technologies and an enabler of future Internet services, as it provides on-demand IT service delivery via geographically distributed data centers. At the core of cloud computing, virtualization technology has played a crucial role by allowing resource sharing, which in turn allows cloud service providers to offer computing services without discrepancies in platform compatibility. At the same time, a trend has emerged in which enterprises are adopting software-based network infrastructure, with paradigms such as software-defined networking gaining further attention for large-scale networks. This trend is due to the flexibility and agility such paradigms offer. Software-defined networks allow for network resource sharing by facilitating network virtualization. Hence, combining cloud computing with a software-defined network architecture promises to enhance the quality of the services delivered to clients and to reduce the operational costs of service providers. However, this combined architecture introduces several challenges for cloud service providers, including resource management, energy efficiency, virtual network provisioning, and controller placement. This thesis tackles these challenges by proposing innovative resource provisioning techniques and developing novel frameworks to improve resource utilization, power efficiency, and quality-of-service performance, metrics that directly impact service providers' capital and operational expenditure. The thesis models and formulates the problem of virtual computing and network provisioning in geographically distributed, SDN-enabled cloud data centers, and it proposes and evaluates optimal and sub-optimal heuristic solutions to validate their efficiency. To address the energy efficiency of SDN-enabled cloud environments, the thesis presents an innovative architecture and develops a comprehensive power consumption model that accurately describes the power consumption behavior of such environments. To address the challenge of determining the number and locations of software-defined network controllers, a sub-optimal solution based on unsupervised hierarchical clustering is proposed. Finally, betweenness centrality is proposed as an efficient solution to the controller placement problem.
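
    The betweenness-centrality heuristic for controller placement can be sketched directly with networkx: rank switches by betweenness centrality and place controllers at the top-k nodes. The topology and k below are illustrative assumptions; the thesis additionally combines placement with unsupervised hierarchical clustering.

    import networkx as nx

    def place_controllers(topology: nx.Graph, k: int) -> list:
        """Pick k controller locations by descending betweenness centrality."""
        centrality = nx.betweenness_centrality(topology)
        return sorted(centrality, key=centrality.get, reverse=True)[:k]

    if __name__ == "__main__":
        # A small assumed WAN topology connecting geo-distributed data centers.
        g = nx.Graph([("NY", "CHI"), ("CHI", "SEA"), ("CHI", "DAL"),
                      ("DAL", "LA"), ("LA", "SEA"), ("NY", "DAL")])
        print(place_controllers(g, k=2))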

    Designing data center networks for high throughput

    Data centers with tens of thousands of servers now support popular Internet services, scientific research, and industrial applications. The network is the foundation of such facilities, giving the large server pool the ability to work together on these applications. The network needs to provide high throughput between servers to ensure that computations are not slowed down by network bottlenecks, with servers waiting on data from other servers. This work addresses two broad, related questions about high-throughput data center network design: (a) how do we measure and benchmark various network designs for throughput? and (b) how do we design such networks for near-optimal throughput? The problem of designing high-throughput networks has received a lot of attention, with multiple interesting architectures being proposed every year. However, there is no clarity on how one should benchmark these networks or how they compare to each other. In fact, this work shows that commonly used measurement approaches, in particular cut-metrics like bisection bandwidth, do not predict throughput accurately. In contrast, we directly evaluate the throughput of networks on both uniform and (heretofore unknown) nearly-worst-case traffic matrices, and include here a comparison of 10 networks using this approach. Further, prior work has not addressed a fundamental question: how far are we from throughput-optimal design? In this work, we propose the first upper bound on network throughput for any topology with identical switches. Although designing optimal topologies is infeasible, we demonstrate that random graphs achieve throughput surprisingly close to this bound: within a few percent at the scale of a few thousand servers for uniform traffic. Our approach also addresses important practical concerns in the design of data center networks, such as incremental expansion and heterogeneous design: as more and varied equipment is added to a data center over the years in response to evolving needs, how do we best accommodate it? Our networks can achieve the same incremental growth at 40% of the expense such growth would incur with past techniques for Clos networks. Further, our approach to designing heterogeneous topologies (i.e., where not all network switches are identical) achieves 43% higher throughput than a comparable VL2 topology, a heterogeneous network already deployed in Microsoft’s data centers. We acknowledge that the use of random graphs also poses challenges, particularly with regard to efficient routing and physical cabling, and we thus present high-efficiency routing and cabling schemes for such networks as well.
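
    The random-graph design and the path-length flavor of the throughput bound can be sketched briefly. The switch count and port count below are illustrative assumptions, and the bound shown is the standard average-path-length argument for uniform all-to-all traffic, not necessarily the exact form proved in this work.

    import networkx as nx

    SWITCHES = 200   # assumed number of identical switches
    NET_PORTS = 10   # assumed switch-to-switch ports per switch

    # Wire the switches as a random regular graph (a Jellyfish-style design).
    g = nx.random_regular_graph(d=NET_PORTS, n=SWITCHES, seed=42)
    assert nx.is_connected(g)
    avg_path = nx.average_shortest_path_length(g)

    # Under uniform all-to-all traffic, each unit of traffic occupies
    # avg_path links on average, so per-flow throughput is capped by
    # total directed link capacity / (number of flows * avg_path).
    directed_capacity = 2 * g.number_of_edges()  # unit capacity per direction
    flows = SWITCHES * (SWITCHES - 1)
    bound = directed_capacity / (flows * avg_path)
    print(f"average shortest path: {avg_path:.2f} hops")
    print(f"uniform-traffic throughput bound per flow: {bound:.5f}")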