27 research outputs found

    Disaggregated Computing. An Evaluation of Current Trends for Datacentres

    Get PDF
    Next generation data centers will likely be based on the emerging paradigm of disaggregated function-blocks-as-a-unit departing from the current state of mainboard-as-a-unit. Multiple functional blocks or bricks such as compute, memory and peripheral will be spread through the entire system and interconnected together via one or multiple high speed networks. The amount of memory available will be very large distributed among multiple bricks. This new architecture brings various benefits that are desirable in today’s data centers such as fine-grained technology upgrade cycles, fine-grained resource allocation, and access to a larger amount of memory and accelerators. An analysis of the impact and benefits of memory disaggregation is presented in this paper. One of the biggest challenges when analyzing these architectures is that memory accesses should be modeled correctly in order to obtain accurate results. However, modeling every memory access would generate a high overhead that can make the simulation unfeasible for real data center applications. A model to represent and analyze memory disaggregation has been designed and a statistics-based queuing-based full system simulator was developed to rapidly and accurately analyze applications performance in disaggregated systems. With a mean error of 10%, simulation results pointed out that the network layers may introduce overheads that degrade applications’ performance up to 66%. Initial results also suggest that low memory access bandwidth may degrade up to 20% applications’ performance.This project has received funding from the European Unions Horizon 2020 research and innovation programme under grant agreement No 687632 (dReDBox project) and TIN2015-65316-P - Computacion de Altas Prestaciones VII.Peer ReviewedPostprint (published version

    Characterization and optimization of network traffic in cortical simulation

    Get PDF
    Considering the great variety of obstacles the Exascale systems have to face in the next future, a deeper attention will be given in this thesis to the interconnect and the power consumption. The data movement challenge involves the whole hierarchical organization of components in HPC systems — i.e. registers, cache, memory, disks. Running scientific applications needs to provide the most effective methods of data transport among the levels of hierarchy. On current petaflop systems, memory access at all the levels is the limiting factor in almost all applications. This drives the requirement for an interconnect achieving adequate rates of data transfer, or throughput, and reducing time delays, or latency, between the levels. Power consumption is identified as the largest hardware research challenge. The annual power cost to operate the system would be above 2.5 B$ per year for an Exascale system using current technology. The research for alternative power-efficient computing device is mandatory for the procurement of the future HPC systems. In this thesis, a preliminary approach will be offered to the critical process of co-design. Co-desing is defined as the simultaneos design of both hardware and software, to implement a desired function. This process both integrates all components of the Exascale initiative and illuminates the trade-offs that must be made within this complex undertaking

    The case for in-network computing on demand

    Get PDF
    Programmable network hardware can run services traditionally deployed on servers, resulting in orders-of-magnitude improvements in performance. Yet, despite these performance improvements, network operators remain skeptical of in-network computing. The conventional wisdom is that the operational costs from increased power consumption outweigh any performance benefits. Unless in-network computing can justify its costs, it will be disregarded as yet another academic exercise. In this paper, we challenge that assumption, by providing a detailed power analysis of several in-network computing use cases. Our experiments show that in-network computing can be extremely power-efficient. In fact, for a single watt, a software system on commodity CPU can be improved by a factor of x100 using FPGA, and a factor of x1000 utilizing ASIC implementations. However, this efficiency depends on the system load. To address changing workloads, we propose In-Network Computing On Demand, where services can be dynamically moved between servers and the network. By shifting the placement of services on-demand, data centers can optimize for both performance and power efficiency

    An Experimental Evaluation of Datacenter Workloads On Low-Power Embedded Micro Servers

    Get PDF
    This paper presents a comprehensive evaluation of an ultra-low power cluster, built upon the Intel Edison based micro servers. The improved performance and high energy efficiency of micro servers have driven both academia and industry to explore the possibility of replacing conventional brawny servers with a larger swarm of embedded micro servers. Existing attempts mostly focus on mobile-class micro servers, whose capacities are similar to mobile phones. We, on the other hand, target on sensor-class micro servers, which are originally intended for uses in wearable technologies, sensor networks, and Internet-of-Things. Although sensor-class micro servers have much less capacity, they are touted for minimal power consumption (< 1 Watt), which opens new possibilities of achieving higher energy efficiency in datacenter workloads. Our systematic evaluation of the Edison cluster and comparisons to conventional brawny clusters involve careful workload choosing and laborious parameter tuning, which ensures maximum server utilization and thus fair comparisons. Results show that the Edison cluster achieves up to 3.5× improvement on work-done-per-joule for web service applications and data-intensive MapReduce jobs. In terms of scalability, the Edison cluster scales linearly on the throughput of web service workloads, and also shows satisfactory scalability for MapReduce workloads despite coordination overhead.This research was supported in part by NSF grant 13-20209.Ope

    Enabling Hyperscale Web Services

    Full text link
    Modern web services such as social media, online messaging, web search, video streaming, and online banking often support billions of users, requiring data centers that scale to hundreds of thousands of servers, i.e., hyperscale. In fact, the world continues to expect hyperscale computing to drive more futuristic applications such as virtual reality, self-driving cars, conversational AI, and the Internet of Things. This dissertation presents technologies that will enable tomorrow’s web services to meet the world’s expectations. The key challenge in enabling hyperscale web services arises from two important trends. First, over the past few years, there has been a radical shift in hyperscale computing due to an unprecedented growth in data, users, and web service software functionality. Second, modern hardware can no longer support this growth in hyperscale trends due to a decline in hardware performance scaling. To enable this new hyperscale era, hardware architects must become more aware of hyperscale software needs and software researchers can no longer expect unlimited hardware performance scaling. In short, systems researchers can no longer follow the traditional approach of building each layer of the systems stack separately. Instead, they must rethink the synergy between the software and hardware worlds from the ground up. This dissertation establishes such a synergy to enable futuristic hyperscale web services. This dissertation bridges the software and hardware worlds, demonstrating the importance of that bridge in realizing efficient hyperscale web services via solutions that span the systems stack. The specific goal is to design software that is aware of new hardware constraints and architect hardware that efficiently supports new hyperscale software requirements. This dissertation spans two broad thrusts: (1) a software and (2) a hardware thrust to analyze the complex hyperscale design space and use insights from these analyses to design efficient cross-stack solutions for hyperscale computation. In the software thrust, this dissertation contributes uSuite, the first open-source benchmark suite of web services built with a new hyperscale software paradigm, that is used in academia and industry to study hyperscale behaviors. Next, this dissertation uses uSuite to study software threading implications in light of today’s hardware reality, identifying new insights in the age-old research area of software threading. Driven by these insights, this dissertation demonstrates how threading models must be redesigned at hyperscale by presenting an automated approach and tool, uTune, that makes intelligent run-time threading decisions. In the hardware thrust, this dissertation architects both commodity and custom hardware to efficiently support hyperscale software requirements. First, this dissertation characterizes commodity hardware’s shortcomings, revealing insights that influenced commercial CPU designs. Based on these insights, this dissertation presents an approach and tool, SoftSKU, that enables cheap commodity hardware to efficiently support new hyperscale software paradigms, improving the efficiency of real-world web services that serve billions of users, saving millions of dollars, and meaningfully reducing the global carbon footprint. This dissertation also presents a hardware-software co-design, uNotify, that redesigns commodity hardware with minimal modifications by using existing hardware mechanisms more intelligently to overcome new hyperscale overheads. Next, this dissertation characterizes how custom hardware must be designed at hyperscale, resulting in industry-academia benchmarking efforts, commercial hardware changes, and improved software development. Based on this characterization’s insights, this dissertation presents Accelerometer, an analytical model that estimates gains from hardware customization. Multiple hyperscale enterprises and hardware vendors use Accelerometer to make well-informed hardware decisions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/169802/1/akshitha_1.pd

    Future Energy Efficient Data Centers With Disaggregated Servers

    Get PDF
    The popularity of the Internet and the demand for 24/7 services uptime is driving system performance and reliability requirements to levels that today's data centers can no longer support. This paper examines the traditional monolithic conventional server (CS) design and compares it to a new design paradigm: the disaggregated server (DS) data center design. The DS design arranges data centers resources in physical pools, such as processing, memory, and IO module pools, rather than packing each subset of such resources into a single server box. In this paper, we study energy efficient resource provisioning and virtual machine (VM) allocation in DS-based data centers compared to CS-based data centers. First, we present our new design for the photonic DS-based data center architecture, supplemented with a complete description of the architectural components. Second, we develop a mixed integer linear programming (MILP) model to optimize VM allocation for the DS-based data center, including the data center communication fabric power consumption. Our results indicate that, in DS data centers, the optimum allocation of pooled resources and their communication power yields up to 42% average savings in total power consumption when compared with the CS approach. Due to the MILP high computational complexity, we developed an energy efficient resource provisioning heuristic for DS with communication fabric (EERP-DSCF), based on the MILP model insights, with comparable power efficiency to the MILP model. With EERP-DSCF, we can extend the number of served VMs, where the MILP model scalability for a large number of VMs is challenging. Furthermore, we assess the energy efficiency of the DS design under stringent conditions by increasing the CPU to memory traffic and by including high noncommunication power consumption to determine the conditions at which the DS and CS designs become comparable in power consumption. Finally, we present a complete analysis of the communication patterns in our new DS design and some recommendations for design and implementation challenges

    Consensus protocols exploiting network programmability

    Get PDF
    Services rely on replication mechanisms to be available at all time. The service demanding high availability is replicated on a set of machines called replicas. To maintain the consistency of replicas, a consensus protocol such as Paxos or Raft is used to synchronize the replicas' state. As a result, failures of a minority of replicas will not affect the service as other non-faulty replicas continue serving requests. A consensus protocol is a procedure to achieve an agreement among processors in a distributed system involving unreliable processors. Unfortunately, achieving such an agreement involves extra processing on every request, imposing a substantial performance degradation. Consequently, performance has long been a concern for consensus protocols. Although many efforts have been made to improve consensus performance, it continues to be an important problem for researchers. This dissertation presents a novel approach to improving consensus performance. Essentially, it exploits the programmability of a new breed of network devices to accelerate consensus protocols that traditionally run on commodity servers. The benefits of using programmable network devices to run consensus protocols are twofold: The network switches process packets faster than commodity servers and consensus messages travel fewer hops in the network. It means that the system throughput is increased and the latency of requests is reduced. The evaluation of our network-accelerated consensus approach shows promising results. Individual components of our FPGA- based and switch-based consensus implementations can process 10 million and 2.5 billion consensus messages per second, respectively. Our FPGA-based system as a whole delivers 4.3 times performance of a traditional software consensus implementation. The latency is also better for our system and is only one third of the latency of the software consensus implementation when both systems are under half of their maximum throughputs. In order to drive even higher performance, we apply a partition mechanism to our switch-based system, leading to 11 times better throughput and 5 times better latency. By dynamically switching between software-based and network-based implementations, our consensus systems not only improve performance but also use energy more efficiently. Encouraged by those benefits, we developed a fault-tolerant non-volatile memory system. A prototype using software memory controller demonstrated reasonable overhead over local memory access, showing great promise as scalable main memory. Our network-based consensus approach would have a great impact in data centers. It not only improves performance of replication mechanisms which relied on consensus, but also enhances performance of services built on top of those replication mechanisms. Our approach also motivates others to move new functionalities into the network, such as, key-value store and stream processing. We expect that in the near future, applications that typically run on traditional servers will be folded into networks for performance
    corecore