603 research outputs found

    dReDBox: Materializing a full-stack rack-scale system prototype of a next-generation disaggregated datacenter

    Current datacenters are based on server machines, whose mainboard and hardware components form the baseline, monolithic building block that the rest of the system software, middleware and application stack are built upon. This leads to the following limitations: (a) resource proportionality of a multi-tray system is bounded by the basic building block (mainboard), (b) resource allocation to processes or virtual machines (VMs) is bounded by the available resources within the boundary of the mainboard, leading to spare resource fragmentation and inefficiencies, and (c) upgrades must be applied to each and every server even when only a specific component needs to be upgraded. The dReDBox project (Disaggregated Recursive Datacentre-in-a-Box) addresses the above limitations, and proposes the next generation, low-power, across form-factor datacenters, departing from the paradigm of the mainboard-as-a-unit and enabling the creation of function-block-as-a-unit. Hardware-level disaggregation and software-defined wiring of resources are supported by a full-fledged Type-1 hypervisor that can execute commodity virtual machines, which communicate over a low-latency and high-throughput software-defined optical network. To evaluate its novel approach, dReDBox will demonstrate application execution in the domains of network functions virtualization, infrastructure analytics, and real-time video surveillance. This work has been supported in part by EU H2020 ICT project dReDBox, contract #687632. Peer Reviewed. Postprint (author's final draft).

    dReDBox: A Disaggregated Architectural Perspective for Data Centers

    Data centers are currently constructed with fixed blocks (blades); the hard boundaries of this approach lead to suboptimal utilization of resources and increased energy requirements. The dReDBox (disaggregated Recursive Datacenter in a Box) project addresses the problem of fixed resource proportionality in next-generation, low-power data centers by proposing a paradigm shift toward finer resource allocation granularity, where the unit is the function block rather than the mainboard tray. This introduces various challenges at the system design level, requiring elastic hardware architectures, efficient software support and management, and programmable interconnect. Memory and hardware accelerators can be dynamically assigned to processing units to boost application performance, while high-speed, low-latency electrical and optical interconnect is a prerequisite for realizing the concept of data center disaggregation. This chapter presents the dReDBox hardware architecture and discusses design aspects of the software infrastructure for resource allocation and management. Furthermore, initial simulation and evaluation results for accessing remote, disaggregated memory are presented, employing benchmarks from the Splash-3 and the CloudSuite benchmark suites. This work was supported in part by EU H2020 ICT project dReDBox, contract #687632. Peer Reviewed. Postprint (author's final draft).

    LLM: Realizing Low-Latency Memory by Exploiting Embedded Silicon Photonics for Irregular Workloads

    As emerging workloads exhibit irregular memory access patterns with poor data reuse and locality, they would benefit from a DRAM that achieves low latency without sacrificing bandwidth and energy efficiency. We propose LLM (Low Latency Memory), a codesign of the DRAM microarchitecture, the memory controller and the LLC/DRAM interconnect by leveraging embedded silicon photonics in 2.5D/3D integrated system on chip. LLM relies on Wavelength Division Multiplexing (WDM)-based photonic interconnects to reduce the contention throughout the memory subsystem. LLM also increases the bank-level parallelism, eliminates bus conflicts by using dedicated optical data paths, and reduces the access energy per bit with shorter global bitlines and smaller row buffers. We evaluate the design space of LLM for a variety of synthetic benchmarks and representative graph workloads on a full-system simulator (gem5). LLM exhibits low memory access latency for traffic with both regular and irregular access patterns. For irregular traffic, LLM achieves high bandwidth utilization (over 80% of peak throughput, compared to 20% for HBM2.0). For real workloads, LLM achieves 3× and 1.8× lower execution time compared to HBM2.0 and a state-of-the-art memory system with high memory level parallelism, respectively. This study also demonstrates that by reducing queuing on the data path, LLM can achieve on average 3.4× lower memory latency variation compared to HBM2.0.

    All-Optical Programmable Disaggregated Data Centre Network realized by FPGA-based Switch and Interface Card

    This paper reports an FPGA-based switch and interface card (SIC) and its application scenario in an all-optical, programmable disaggregated data center network (DCN). Our novel SIC is designed and implemented to replace traditional optical network interface cards, plugged into the server directly, supporting optical packet switching (OPS)/optical circuit switching (OCS) or time division multiplexing (TDM)/wavelength division multiplexing (WDM) traffic on demand. Placing the SIC in each server/blade, we eliminate electronics from the top of rack (ToR) switch by pushing all the functionality on each blade while enabling direct intrarack blade-to-blade communication to deliver ultralow chip-to-chip latency. We demonstrate the disaggregated DCN architecture scenarios along with all-optical dimension-programmable N × M spectrum selective switches (SSS) and an architecture-on-demand (AoD) optical backplane. OPS and OCS complement each other as do TDM and WDM, which can support variable traffic flows. A flat disaggregated DCN architecture is realized by connecting the optical ToR switches directly to either an optical top of cluster switch or the intracluster AoD optical backplane, while clusters are further interconnected to an intercluster AoD for scaling out.

    A Software-defined SoC Memory Bus Bridge Architecture for Disaggregated Computing

    Disaggregation and rack-scale systems have the potential of drastically decreasing TCO and increasing utilization of cloud datacenters, while maintaining performance. While the concept of organising resources in separate pools and interconnecting them on demand is straightforward, its materialisation can be radically different in terms of performance and scale potential. In this paper, we present a memory bus bridge architecture which enables communication between 100s of masters and slaves in today's complex multiprocessor SoCs, that are physically integrated in different chips and even different mainboards. The bridge tightly couples serial transceivers and a circuit network for chip-to-chip transfers. A key property of the proposed bridge architecture is that it is software-defined and thus can be configured at runtime, via a software control plane, to prepare and steer memory access transactions to remote slaves. This is particularly important because it enables datacenter orchestration tools to manage the disaggregated resource allocation. Moreover, we evaluate a bridge prototype we have built for ARM AXI4 memory bus interconnect and we discuss application-level observed performance. Comment: 3rd International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems (AISTECS 2018, part of HiPEAC 2018).

    Venice: Exploring Server Architectures for Effective Resource Sharing

    Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. These servers are still largely built and organized as when they were distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enable user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.

    Scalability of broadcast performance in wireless network-on-chip

    Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such a design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC. Peer Reviewed. Postprint (published version).