
    Venice: Exploring Server Architectures for Effective Resource Sharing

    Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. These servers are still largely built and organized as they were when they existed as distributed, individual entities. Given that many fields increasingly rely on analytics over huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enable user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.
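
    A minimal sketch of one of the findings above, namely that reducing or hiding latency matters when using non-local resources: a double-buffered loop that issues the next remote transfer before processing the current block. The remote_get() primitive, block size, and memcpy-based emulation are illustrative assumptions, not the Venice API; real overlap would only materialize with an asynchronous one-sided transfer.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 64
#define TOTAL (BLOCK * 8)

static uint8_t remote_region[TOTAL];   /* local stand-in for another server's memory */

/* Hypothetical one-sided "get"; a real substrate would issue this asynchronously. */
static void remote_get(void *dst, size_t off, size_t len)
{
    memcpy(dst, remote_region + off, len);
}

static uint64_t process(const uint8_t *buf, size_t len)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

int main(void)
{
    for (size_t i = 0; i < TOTAL; i++)
        remote_region[i] = (uint8_t)i;

    uint8_t buf[2][BLOCK];
    uint64_t sum = 0;
    int cur = 0;

    remote_get(buf[cur], 0, BLOCK);                 /* prime the first block */
    for (size_t off = 0; off < TOTAL; off += BLOCK) {
        size_t next = off + BLOCK;
        if (next < TOTAL)                           /* issue the next transfer early;      */
            remote_get(buf[cur ^ 1], next, BLOCK);  /* with an async get it would overlap  */
        sum += process(buf[cur], BLOCK);            /* processing of the current block     */
        cur ^= 1;
    }
    printf("checksum: %llu\n", (unsigned long long)sum);
    return 0;
}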

    dReDBox: A Disaggregated Architectural Perspective for Data Centers

    Data centers are currently constructed with fixed blocks (blades); the hard boundaries of this approach lead to suboptimal utilization of resources and increased energy requirements. The dReDBox (disaggregated Recursive Datacenter in a Box) project addresses the problem of fixed resource proportionality in next-generation, low-power data centers by proposing a paradigm shift toward finer resource allocation granularity, where the unit is the function block rather than the mainboard tray. This introduces various challenges at the system design level, requiring elastic hardware architectures, efficient software support and management, and programmable interconnect. Memory and hardware accelerators can be dynamically assigned to processing units to boost application performance, while high-speed, low-latency electrical and optical interconnect is a prerequisite for realizing the concept of data center disaggregation. This chapter presents the dReDBox hardware architecture and discusses design aspects of the software infrastructure for resource allocation and management. Furthermore, initial simulation and evaluation results for accessing remote, disaggregated memory are presented, employing benchmarks from the Splash-3 and CloudSuite benchmark suites. This work was supported in part by the EU H2020 ICT project dReDBox, contract #687632.
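
    As a rough illustration of why low-latency interconnect is called a prerequisite above, the following back-of-the-envelope model (not the dReDBox simulator) estimates average memory access time as a growing fraction of accesses is served from remote, disaggregated memory. The local and remote latencies are assumed values chosen only for illustration.

#include <stdio.h>

int main(void)
{
    const double local_ns  = 100.0;    /* assumed local DRAM access latency            */
    const double remote_ns = 1000.0;   /* assumed remote access over the interconnect  */

    for (double remote_frac = 0.0; remote_frac <= 1.0; remote_frac += 0.25) {
        double avg = (1.0 - remote_frac) * local_ns + remote_frac * remote_ns;
        printf("remote fraction %.2f -> average access %.0f ns (%.1fx local)\n",
               remote_frac, avg, avg / local_ns);
    }
    return 0;
}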

    Disaggregated Memory Architectures for Blade Servers.

    Current trends in memory capacity and power of servers indicate the need for memory system redesign. Memory capacity is projected to grow at a smaller rate relative to the growth in compute capacity, leading to a potential memory capacity wall in future systems. Furthermore, per-server memory demands are increasing due to large-memory applications, virtual machine consolidation, and bigger operating system footprints. The large amount of memory required is leading to memory power being a substantial and growing portion of server power budgets. As these capacity and power trends continue, a new memory architecture is needed that provides increased capacity and maximizes resource efficiency. This thesis presents the design of a disaggregated memory architecture for blade servers that provides expanded memory capacity and dynamic capacity sharing across multiple servers. Unlike traditional architectures that co-locate compute and memory resources, the proposed design disaggregates a portion of the servers’ memory, which is then assembled in separate memory blades optimized for both capacity and power usage. The servers access memory blades through a redesigned memory hierarchy that is extended to include a remote level that augments local memory. Through the shared interconnect of blade enclosures, multiple compute blades can connect to a single memory blade and dynamically share its capacity. This sharing increases resource efficiency by taking advantage of the differing memory utilization patterns of the compute blades. This thesis evaluates two system architectures that provide operating system-transparent access to the memory blade; one uses virtualization and a commodity-based interconnect, and the other uses minor hardware additions and a high-speed interconnect. The ability to extend and share memory can achieve orders of magnitude performance improvements in cases where applications run out of memory capacity, and similar improvements in performance-per-dollar in cases where systems are overprovisioned for peak memory usage. To complement the evaluation, a hypervisor-based prototype of one system architecture is developed. Finally, by extending the principles of disaggregation to both compute and memory resources, new server architectures are proposed for large-scale data centers that can double performance-per-dollar when considering total cost of ownership compared to traditional servers. (Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies; http://deepblue.lib.umich.edu/bitstream/2027.42/76007/1/ktlim_1.pd)
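
    A toy sketch of the remote memory level described above: a few local page frames back a larger capacity held on a memory blade, and any page that is not locally resident is fetched from the blade while a victim page is written back. The page size, round-robin eviction, and all names are illustrative assumptions; the thesis mechanism operates transparently below the OS or hypervisor rather than in application code.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE        4096
#define LOCAL_PAGES 4                       /* small local memory                 */
#define BLADE_PAGES 16                      /* larger remote memory blade         */

static uint8_t blade[BLADE_PAGES][PAGE];    /* stand-in for the memory blade      */
static uint8_t local_mem[LOCAL_PAGES][PAGE];
static int     resident[LOCAL_PAGES];       /* which blade page each slot holds   */
static int     next_victim;

static uint8_t *touch(int blade_page)
{
    for (int i = 0; i < LOCAL_PAGES; i++)
        if (resident[i] == blade_page)
            return local_mem[i];                    /* local hit                  */

    int slot = next_victim;                         /* round-robin eviction       */
    next_victim = (next_victim + 1) % LOCAL_PAGES;
    if (resident[slot] >= 0)
        memcpy(blade[resident[slot]], local_mem[slot], PAGE);  /* write back      */
    memcpy(local_mem[slot], blade[blade_page], PAGE);          /* remote fetch    */
    resident[slot] = blade_page;
    return local_mem[slot];
}

int main(void)
{
    memset(resident, -1, sizeof resident);          /* nothing resident yet       */
    for (int p = 0; p < BLADE_PAGES; p++)
        touch(p)[0] = (uint8_t)p;                   /* write through the remote level */
    printf("page 3 holds %d\n", touch(3)[0]);
    return 0;
}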

    Pluggable Optical Connector Interfaces for Electro-Optical Circuit Boards

    A study is presented on system-embedded photonic interconnect technologies, which would address the communications bottleneck in modern exascale data centre systems driven by exponentially rising consumption of digital information and the associated complexity of intra-data-centre network management, along with dwindling data storage capacities. It is proposed that this bottleneck be addressed by adopting electro-optical printed circuit boards (OPCBs) within the system, on which conventional electrical layers provide power distribution and static or low-speed signaling, while high-speed signals are conveyed by optical channels on separate embedded optical layers. One crucial prerequisite to adopting OPCBs in modern data storage and switch systems is a reliable method of optically connecting peripheral cards and devices within the system to an OPCB backplane or motherboard in a pluggable manner. However, the large mechanical misalignment tolerances between connecting cards and devices inherent to such systems contrast with the small sizes of the optical waveguides required to support optical communication at the speeds defined by prevailing communication protocols. An innovative approach is therefore required to decouple the contrasting mechanical tolerances of the electrical and optical domains in the system in order to enable reliable pluggable optical connectivity. This thesis presents the design, development, and characterisation of a suite of new optical waveguide connector interface solutions for OPCBs based on embedded planar polymer waveguides and planar glass waveguides. The technologies described include waveguide receptacles allowing parallel fibre connectors to be connected directly to OPCB-embedded planar waveguides, and board-to-board connectors with embedded parallel optical transceivers allowing daughtercards to be orthogonally connected to an OPCB backplane. For OPCBs based on embedded planar polymer waveguides and embedded planar glass waveguides, a complete demonstration platform was designed and developed to evaluate the connector interfaces and the associated embedded optical interconnect. Furthermore, a large portfolio of intellectual property comprising 19 patents and patent applications was generated during the course of this study, spanning the fields of OPCBs, optical waveguides, optical connectors, optical assembly, and system-embedded optical interconnects.

    A software-defined architecture and prototype for disaggregated memory rack scale systems

    Disaggregation and rack-scale systems have the potential to drastically improve TCO and utilization of cloud datacenters while maintaining performance. In this paper, we present a novel rack-scale system architecture featuring software-defined remote memory disaggregation. Our hardware design and operating system extensions enable unmodified applications to dynamically attach to memory segments residing on physically remote memory pools and to use such remote segments in a byte-addressable manner, as if they were local to the application. Our system also features a control plane that automates the software-defined dynamic matching of compute to memory resources, as driven by datacenter workload needs. We prototyped our system on the commercially available Zynq Ultrascale+ MPSoC platform. To our knowledge, this is the first time a software-defined disaggregated system has been prototyped on commercial hardware and evaluated through industry-standard software benchmarks. Our initial results, using benchmarks that are deliberately adversarial in terms of memory bandwidth, show that disaggregated memory access exhibits a round-trip latency of only 134 clock cycles and a throughput penalty of as low as 55% relative to locally attached memory. We also discuss estimates of how our findings may translate to applications with milder memory-bandwidth demands, as well as innovation avenues across the stack opened up by our work.
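
    From an application's point of view, byte-addressable access to a remote segment can look like an ordinary memory mapping, which is what lets unmodified programs use it. The sketch below shows that usage pattern; the device path and the way segments are exposed are assumptions rather than the paper's actual interface. As a side note, the quoted 134-cycle round trip corresponds to roughly half a microsecond at an assumed fabric clock of a few hundred MHz; the abstract does not state the clock frequency.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 1 << 20;                     /* 1 MiB of the remote segment     */
    int fd = open("/dev/remote_mem0", O_RDWR);      /* hypothetical segment device     */
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    uint8_t *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return EXIT_FAILURE; }

    /* ordinary loads and stores; the fabric forwards them to the remote pool */
    seg[0] = 42;
    printf("first byte of remote segment: %u\n", seg[0]);

    munmap(seg, len);
    close(fd);
    return 0;
}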

    Improving the Scalability of High Performance Computer Systems

    Improving the performance of future computing systems will depend on the ability to increase the scalability of current technology. New paths need to be explored, as operating principles that applied until now are becoming irrelevant for upcoming computer architectures. Scaling the number of cores, processors, and nodes within a system appears to be the only feasible way to achieve Exascale performance. To accomplish this goal, we propose three novel techniques addressing different layers of computer systems.

    The Tightly Coupled Cluster technique significantly improves inter-node communication within compute clusters. By improving latency by an order of magnitude over existing solutions, the cost of communication is considerably reduced. This makes it possible to exploit fine-grained parallelism within applications, thereby extending scalability considerably. The mechanism virtually moves the network interconnect into the processor, bypassing the latency of the I/O interface and rendering protocol conversions unnecessary. The technique is implemented entirely through firmware and kernel-layer software on off-the-shelf AMD processors. We present a proof-of-concept implementation and real-world benchmarks to demonstrate the superior performance of our technique. In particular, our approach achieves a software-to-software communication latency of 240 ns between two remote compute nodes.

    The second part of the dissertation introduces a new framework for scalable Networks-on-Chip. A novel rapid prototyping methodology is proposed that substantially accelerates design and implementation. Due to its flexibility and modularity, it covers a large application space, ranging from systems-on-chip to high-performance many-core processors. The Network-on-Chip compiler generates complex networks in the form of synthesizable register-transfer-level code from an abstract design description. Our engine supports different target technologies, including Field Programmable Gate Arrays and Application-Specific Integrated Circuits. The framework makes it possible to build large designs while minimizing development and verification effort. Many topologies and routing algorithms are supported by partitioning the tasks into several layers and by introducing a protocol-agnostic architecture. We provide a thorough evaluation of the design that shows excellent results regarding performance and scalability.

    The third part of the dissertation addresses the processor-memory interface within computer architectures. The increasing compute power of many-core processors leads to an equally growing demand for memory bandwidth and capacity. Current processor designs exhibit physical limitations that restrict the scalability of main memory. To address this issue, we propose a memory extension technique that attaches large amounts of DRAM to the processor via a low pin count interface using high-speed serial transceivers. Our technique transparently integrates the extension memory into the system architecture by providing full cache coherency; applications can therefore utilize the memory extension with regular shared-memory programming techniques. By supporting daisy-chained memory extension devices and by introducing the asymmetric probing approach, the proposed mechanism ensures high scalability. We furthermore propose a DMA offloading technique to improve the performance of the processor-memory interface. The design has been implemented in a Field Programmable Gate Array based prototype. Driver software and firmware modifications have been developed to bring up the prototype in a Linux-based system. We show microbenchmarks that prove the feasibility of our design.
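
    Software-to-software latency figures such as the 240 ns quoted above are typically obtained with a ping-pong microbenchmark: one endpoint sends, the other replies, and half of the average round-trip time is reported. The sketch below only illustrates that methodology, using two threads and a shared flag instead of two nodes on the proposed interconnect; compile with -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ROUNDS 100000

static atomic_int ball;                 /* 0: ping's turn, 1: pong's turn */

static void *pong(void *arg)
{
    (void)arg;
    for (int i = 0; i < ROUNDS; i++) {
        while (atomic_load_explicit(&ball, memory_order_acquire) != 1)
            ;                                       /* wait for ping */
        atomic_store_explicit(&ball, 0, memory_order_release);
    }
    return NULL;
}

int main(void)
{
    pthread_t t;
    struct timespec a, b;

    pthread_create(&t, NULL, pong, NULL);
    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int i = 0; i < ROUNDS; i++) {
        atomic_store_explicit(&ball, 1, memory_order_release);
        while (atomic_load_explicit(&ball, memory_order_acquire) != 0)
            ;                                       /* wait for pong */
    }
    clock_gettime(CLOCK_MONOTONIC, &b);
    pthread_join(t, NULL);

    double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
    printf("one-way latency: ~%.0f ns\n", ns / (2.0 * ROUNDS));
    return 0;
}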