
    Venice: Exploring Server Architectures for Effective Resource Sharing

    Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. These servers are still largely built and organized as they were when they existed as distributed, individual entities. Given that many fields increasingly rely on analytics over huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enable user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.
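
    A minimal sketch of one of the findings above, namely that reducing or hiding latency matters when using non-local resources: a double-buffered loop that issues the next remote transfer before processing the current block. The remote_get() primitive, block size, and memcpy-based emulation are illustrative assumptions, not the Venice API; real overlap would only materialize with an asynchronous one-sided transfer.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 64
#define TOTAL (BLOCK * 8)

static uint8_t remote_region[TOTAL];   /* local stand-in for another server's memory */

/* Hypothetical one-sided "get"; a real substrate would issue this asynchronously. */
static void remote_get(void *dst, size_t off, size_t len)
{
    memcpy(dst, remote_region + off, len);
}

static uint64_t process(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

int main(void)
{
    for (size_t i = 0; i < TOTAL; i++)
        remote_region[i] = (uint8_t)i;

    uint8_t buf[2][BLOCK];
    uint64_t sum = 0;
    int cur = 0;

    remote_get(buf[cur], 0, BLOCK);                 /* prime the first block */
    for (size_t off = 0; off < TOTAL; off += BLOCK) {
        size_t next = off + BLOCK;
        if (next < TOTAL)                           /* issue the next transfer early;      */
            remote_get(buf[cur ^ 1], next, BLOCK);  /* with an async get it would overlap  */
        sum += process(buf[cur], BLOCK);            /* processing of the current block     */
        cur ^= 1;
    }
    printf("checksum: %llu\n", (unsigned long long)sum);
    return 0;
}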

    dReDBox: A Disaggregated Architectural Perspective for Data Centers

    Data centers are currently constructed with fixed blocks (blades); the hard boundaries of this approach lead to suboptimal utilization of resources and increased energy requirements. The dReDBox (disaggregated Recursive Datacenter in a Box) project addresses the problem of fixed resource proportionality in next-generation, low-power data centers by proposing a paradigm shift toward finer resource allocation granularity, where the unit is the function block rather than the mainboard tray. This introduces various challenges at the system design level, requiring elastic hardware architectures, efficient software support and management, and programmable interconnect. Memory and hardware accelerators can be dynamically assigned to processing units to boost application performance, while high-speed, low-latency electrical and optical interconnect is a prerequisite for realizing the concept of data center disaggregation. This chapter presents the dReDBox hardware architecture and discusses design aspects of the software infrastructure for resource allocation and management. Furthermore, initial simulation and evaluation results for accessing remote, disaggregated memory are presented, employing benchmarks from the Splash-3 and CloudSuite benchmark suites. This work was supported in part by the EU H2020 ICT project dReDBox, contract #687632.
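
    As a rough illustration of why low-latency interconnect is called a prerequisite above, the following back-of-the-envelope model (not the dReDBox simulator) estimates average memory access time as a growing fraction of accesses is served from remote, disaggregated memory. The local and remote latencies are assumed values chosen only for illustration.

#include <stdio.h>

int main(void)
{
    const double local_ns  = 100.0;    /* assumed local DRAM access latency            */
    const double remote_ns = 1000.0;   /* assumed remote access over the interconnect  */

    for (double remote_frac = 0.0; remote_frac <= 1.0; remote_frac += 0.25) {
        double avg = (1.0 - remote_frac) * local_ns + remote_frac * remote_ns;
        printf("remote fraction %.2f -> average access %.0f ns (%.1fx local)\n",
               remote_frac, avg, avg / local_ns);
    }
    return 0;
}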

    Disaggregated Memory Architectures for Blade Servers.

    Current trends in memory capacity and power of servers indicate the need for memory system redesign. Memory capacity is projected to grow at a smaller rate relative to the growth in compute capacity, leading to a potential memory capacity wall in future systems. Furthermore, per-server memory demands are increasing due to large-memory applications, virtual machine consolidation, and bigger operating system footprints. The large amount of memory required is leading to memory power being a substantial and growing portion of server power budgets. As these capacity and power trends continue, a new memory architecture is needed that provides increased capacity and maximizes resource efficiency. This thesis presents the design of a disaggregated memory architecture for blade servers that provides expanded memory capacity and dynamic capacity sharing across multiple servers. Unlike traditional architectures that co-locate compute and memory resources, the proposed design disaggregates a portion of the servers’ memory, which is then assembled in separate memory blades optimized for both capacity and power usage. The servers access memory blades through a redesigned memory hierarchy that is extended to include a remote level that augments local memory. Through the shared interconnect of blade enclosures, multiple compute blades can connect to a single memory blade and dynamically share its capacity. This sharing increases resource efficiency by taking advantage of the differing memory utilization patterns of the compute blades. This thesis evaluates two system architectures that provide operating system-transparent access to the memory blade; one uses virtualization and a commodity-based interconnect, and the other uses minor hardware additions and a high-speed interconnect. The ability to extend and share memory can achieve orders of magnitude performance improvements in cases where applications run out of memory capacity, and similar improvements in performance-per-dollar in cases where systems are overprovisioned for peak memory usage. To complement the evaluation, a hypervisor-based prototype of one system architecture is developed. Finally, by extending the principles of disaggregation to both compute and memory resources, new server architectures are proposed for large-scale data centers that can double performance-per-dollar when considering total cost of ownership compared to traditional servers. (Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies; http://deepblue.lib.umich.edu/bitstream/2027.42/76007/1/ktlim_1.pd)
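
    A toy sketch of the remote memory level described above: a few local page frames back a larger capacity held on a memory blade, and any page that is not locally resident is fetched from the blade while a victim page is written back. The page size, round-robin eviction, and all names are illustrative assumptions; the thesis mechanism operates transparently below the OS or hypervisor rather than in application code.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE        4096
#define LOCAL_PAGES 4                       /* small local memory                 */
#define BLADE_PAGES 16                      /* larger remote memory blade         */

static uint8_t blade[BLADE_PAGES][PAGE];    /* stand-in for the memory blade      */
static uint8_t local_mem[LOCAL_PAGES][PAGE];
static int     resident[LOCAL_PAGES];       /* which blade page each slot holds   */
static int     next_victim;

static uint8_t *touch(int blade_page)
{
    for (int i = 0; i < LOCAL_PAGES; i++)
        if (resident[i] == blade_page)
            return local_mem[i];                    /* local hit                  */

    int slot = next_victim;                         /* round-robin eviction       */
    next_victim = (next_victim + 1) % LOCAL_PAGES;
    if (resident[slot] >= 0)
        memcpy(blade[resident[slot]], local_mem[slot], PAGE);  /* write back      */
    memcpy(local_mem[slot], blade[blade_page], PAGE);          /* remote fetch    */
    resident[slot] = blade_page;
    return local_mem[slot];
}

int main(void)
{
    memset(resident, -1, sizeof resident);          /* nothing resident yet       */
    for (int p = 0; p < BLADE_PAGES; p++)
        touch(p)[0] = (uint8_t)p;                   /* write through the remote level */
    printf("page 3 holds %d\n", touch(3)[0]);
    return 0;
}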

    Pluggable Optical Connector Interfaces for Electro-Optical Circuit Boards

    A study is presented on system-embedded photonic interconnect technologies, which would address the communications bottleneck in modern exascale data centre systems driven by exponentially rising consumption of digital information and the associated complexity of intra-data-centre network management, along with dwindling data storage capacities. It is proposed that this bottleneck be addressed by adopting electro-optical printed circuit boards (OPCBs) within the system, on which conventional electrical layers provide power distribution and static or low-speed signaling, while high-speed signals are conveyed by optical channels on separate embedded optical layers. One crucial prerequisite to adopting OPCBs in modern data storage and switch systems is a reliable method of optically connecting peripheral cards and devices within the system to an OPCB backplane or motherboard in a pluggable manner. However, the large mechanical misalignment tolerances between connecting cards and devices inherent to such systems contrast with the small sizes of the optical waveguides required to support optical communication at the speeds defined by prevailing communication protocols. An innovative approach is therefore required to decouple the contrasting mechanical tolerances of the electrical and optical domains in the system in order to enable reliable pluggable optical connectivity. This thesis presents the design, development, and characterisation of a suite of new optical waveguide connector interface solutions for OPCBs based on embedded planar polymer waveguides and planar glass waveguides. The technologies described include waveguide receptacles allowing parallel fibre connectors to be connected directly to OPCB-embedded planar waveguides, and board-to-board connectors with embedded parallel optical transceivers allowing daughtercards to be orthogonally connected to an OPCB backplane. For OPCBs based on embedded planar polymer waveguides and embedded planar glass waveguides, a complete demonstration platform was designed and developed to evaluate the connector interfaces and the associated embedded optical interconnect. Furthermore, a large portfolio of intellectual property comprising 19 patents and patent applications was generated during the course of this study, spanning the fields of OPCBs, optical waveguides, optical connectors, optical assembly, and system-embedded optical interconnects.

    A software-defined architecture and prototype for disaggregated memory rack scale systems

    Disaggregation and rack-scale systems have the potential to drastically improve TCO and utilization of cloud datacenters while maintaining performance. In this paper, we present a novel rack-scale system architecture featuring software-defined remote memory disaggregation. Our hardware design and operating system extensions enable unmodified applications to dynamically attach to memory segments residing on physically remote memory pools and to use such remote segments in a byte-addressable manner, as if they were local to the application. Our system also features a control plane that automates the software-defined dynamic matching of compute to memory resources, as driven by datacenter workload needs. We prototyped our system on the commercially available Zynq Ultrascale+ MPSoC platform. To our knowledge, this is the first time a software-defined disaggregated system has been prototyped on commercial hardware and evaluated through industry-standard software benchmarks. Our initial results, using benchmarks that are deliberately adversarial in terms of memory bandwidth, show that disaggregated memory access exhibits a round-trip latency of only 134 clock cycles and a throughput penalty of as low as 55% relative to locally attached memory. We also discuss estimates of how our findings may translate to applications with milder memory-bandwidth demands, as well as innovation avenues across the stack opened up by our work.
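
    From an application's point of view, byte-addressable access to a remote segment can look like an ordinary memory mapping, which is what lets unmodified programs use it. The sketch below shows that usage pattern; the device path and the way segments are exposed are assumptions rather than the paper's actual interface. As a side note, the quoted 134-cycle round trip corresponds to roughly half a microsecond at an assumed fabric clock of a few hundred MHz; the abstract does not state the clock frequency.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;                     /* 1 MiB of the remote segment     */
    int fd = open("/dev/remote_mem0", O_RDWR);      /* hypothetical segment device     */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    uint8_t *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* ordinary loads and stores; the fabric forwards them to the remote pool */
    seg[0] = 42;
    printf("first byte of remote segment: %u\n", seg[0]);

    munmap(seg, len);
    close(fd);
    return 0;
}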

    Improving the Scalability of High Performance Computer Systems

    Improving the performance of future computing systems will depend on the ability to increase the scalability of current technology. New paths need to be explored, as operating principles that applied until now are becoming irrelevant for upcoming computer architectures. Scaling the number of cores, processors, and nodes within a system appears to be the only feasible way to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems.

    The Tightly Coupled Cluster technique significantly improves inter-node communication within compute clusters. By improving latency by an order of magnitude over existing solutions, the cost of communication is considerably reduced. This makes it possible to exploit fine-grained parallelism within applications, thereby extending scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel-layer software on off-the-shelf AMD processors. We present a proof-of-concept implementation and real-world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes.

    The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed that substantially accelerates design and implementation. Due to its flexibility and modularity, it covers a large application space, ranging from systems-on-chip to high-performance many-core processors. The Network-on-Chip compiler generates complex networks in the form of synthesizable register-transfer-level code from an abstract design description. Our engine supports different target technologies, including Field Programmable Gate Arrays and Application-Specific Integrated Circuits. The framework makes it possible to build large designs while minimizing development and verification effort. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by introducing a protocol-agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability.

    The third part of the dissertation addresses the processor-memory interface within computer architectures. The increasing compute power of many-core processors leads to an equally growing demand for memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue, we propose a memory extension technique that attaches large amounts of DRAM to the processor via a low pin count interface using high-speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency; applications can therefore utilize the memory extension with regular shared-memory programming techniques. By supporting daisy-chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor-memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux-based system. We show microbenchmarks that prove the feasibility of our design.
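
    Software-to-software latency figures such as the 240 ns quoted above are typically obtained with a ping-pong microbenchmark: one endpoint sends, the other replies, and half of the average round-trip time is reported. The sketch below only illustrates that methodology, using two threads and a shared flag instead of two nodes on the proposed interconnect; compile with -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 100000

static atomic_int ball;                 /* 0: ping's turn, 1: pong's turn */

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&ball, memory_order_acquire) != 1)
            ;                                       /* wait for ping */
        atomic_store_explicit(&ball, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&ball, 1, memory_order_release);
        while (atomic_load_explicit(&ball, memory_order_acquire) != 0)
            ;                                       /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("one-way latency: ~%.0f ns\n", ns / (2.0 * ROUNDS));
    return 0;
}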