
    Rethinking Design Metrics for Datacenter DRAM

    Over the years, the evolution of DRAM has provided little improvement in access latencies, but has been optimized to deliver ever greater peak bandwidth from the devices. The combined bandwidth in a contemporary multi-socket server system runs into hundreds of GB/s. However, datacenter-scale applications running on server platforms care largely about having access to a large pool of low-latency main memory (DRAM) and, even in the best case, utilize only a small fraction of the total memory bandwidth. In this extended abstract, we use measured data from state-of-the-art servers running memory-intensive datacenter workloads such as Memcached to argue that main memory design should steer away from optimizing traditional DRAM metrics like peak bandwidth, so as to cater to the growing needs of the datacenter server industry for high-density, low-latency memory with moderate bandwidth requirements.
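
    A back-of-the-envelope sketch of the abstract's core argument: a latency-bound service with a bounded number of outstanding cache-line misses can draw only a small share of peak bandwidth. All numbers below are illustrative placeholders, not measurements from the paper.

        # Bandwidth a latency-bound service can actually draw:
        # outstanding_misses * line_size / access_latency, summed over cores.
        LINE_SIZE = 64            # bytes per cache line
        ACCESS_LATENCY = 100e-9   # seconds; loaded DRAM latency (assumed)
        OUTSTANDING = 4           # concurrent misses per core (assumed)
        CORES = 16
        PEAK_BW = 300e9           # bytes/s; multi-socket peak (assumed)

        used_bw = CORES * OUTSTANDING * LINE_SIZE / ACCESS_LATENCY
        print(f"Used: {used_bw / 1e9:.1f} GB/s "
              f"({100 * used_bw / PEAK_BW:.0f}% of peak)")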

    Hardware-Software Co-Design for Network Performance Measurement

    Diagnosing performance problems in networks is important, for example to determine where packets experience high latency or loss. However, existing approaches to performance diagnosis are constrained by the limited measurement mechanisms available in switches. Alternatively, operators use endpoint information indirectly to infer the root causes of problematic latency or drops. Instead of designing piecemeal solutions to work around such switch restrictions, we believe that the right approach is to co-design language abstractions and switch hardware primitives for network performance measurement. This approach provides confidence that the switch primitives are sufficiently general, i.e., that they can support a variety of existing and unanticipated use cases. We present a declarative query language that allows operators to ask a diverse set of network performance questions. We show that these queries can be implemented efficiently in switch hardware using a novel programmable key-value store primitive. Our preliminary evaluations show that our hardware design is feasible at modest chip area overhead relative to existing switching chips.
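
    As a rough software analogue of the proposed programmable key-value store primitive, the sketch below folds per-packet latency observations into per-flow state with a read-modify-write update, then answers a declarative-style query over that state. The key layout, update function, and threshold are assumptions for illustration, not the paper's actual language.

        from collections import defaultdict

        state = defaultdict(float)  # key-value store: flow key -> EWMA latency (ns)

        def update(flow, queue_latency_ns, alpha=0.1):
            # Read-modify-write aggregation, as a switch primitive would apply it.
            state[flow] = (1 - alpha) * state[flow] + alpha * queue_latency_ns

        packets = [(("10.0.0.1", "10.0.0.2", 80), 1200),
                   (("10.0.0.3", "10.0.0.2", 80), 90)]
        for flow, latency in packets:
            update(flow, latency)

        # Query: which flows see high smoothed queueing latency?
        print({f: round(v) for f, v in state.items() if v > 100})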

    Relaxing state-access constraints in stateful programmable data planes

    Supporting the programming of stateful packet-forwarding functions in hardware has recently attracted the interest of the research community. When designing such switching chips, the challenge is to guarantee the ability to program functions that can read and modify the data plane's state, while maintaining line-rate performance and state consistency. Current state-of-the-art designs are based on a very conservative all-or-nothing model: programmability is limited to functions that are guaranteed to sustain line rate under any traffic workload. In effect, this limits the maximum time available to execute state-update operations. In this paper, we explore possible options for relaxing these constraints through simulations on real traffic traces. We then propose a model in which functions can execute within a larger but bounded time, while preventing data hazards through memory locking. We present results showing that such flexibility can be supported with little or no throughput degradation.
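
    A minimal simulation sketch of the proposed model, with illustrative parameters: state updates take a bounded number of cycles, and per-flow memory locking stalls only packets that would race with an in-flight update on the same flow.

        import random

        UPDATE_CYCLES = 3        # bounded state-update time, > 1 cycle (assumed)
        N_FLOWS, N_PACKETS = 100, 10_000
        random.seed(0)

        lock_until = {}          # flow -> cycle at which its state unlocks
        cycle = 0
        for _ in range(N_PACKETS):
            flow = random.randrange(N_FLOWS)
            # Data-hazard prevention: wait until this flow's lock is released.
            cycle = max(cycle + 1, lock_until.get(flow, 0))
            lock_until[flow] = cycle + UPDATE_CYCLES

        print(f"Sustained rate: {N_PACKETS / cycle:.3f} packets/cycle "
              f"(1.000 = line rate)")

    With many concurrent flows, a packet rarely finds its flow's state locked, which is consistent with the paper's finding of little or no throughput degradation.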

    Design Guidelines for High-Performance SCM Hierarchies

    With emerging storage-class memory (SCM) nearing commercialization, there is evidence that it will deliver the much-anticipated high density along with access latencies within only a small factor of DRAM's. Nevertheless, the latency-sensitive nature of memory-resident services makes seamless integration of SCM in servers questionable. In this paper, we ask how best to introduce SCM in such servers to improve overall performance/cost over existing DRAM-only architectures. We first show that even with the most optimistic latency projections for SCM, the higher memory access latency results in prohibitive performance degradation. However, we find that deploying a modestly sized, high-bandwidth 3D-stacked DRAM cache makes the performance of an SCM-mostly memory system competitive. The high degree of spatial locality that memory-resident services exhibit not only simplifies the DRAM cache's design as page-based, but also enables amortizing the increased SCM access latencies and mitigating SCM's read/write latency disparity. We identify the set of memory hierarchy design parameters that plays a key role in the performance and cost of a memory system combining an SCM technology with a 3D-stacked DRAM cache. We then introduce a methodology to drive provisioning for each of these design parameters under a target performance/cost goal. Finally, we use our methodology to derive concrete results for specific SCM technologies. With PCM as a case study, we show that a two-bits/cell technology hits the performance/cost sweet spot, reducing memory subsystem cost by 40% while keeping performance within 3% of the best-performing DRAM-only system, whereas single-level and triple-level cell organizations are impractical for use as memory replacements.
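
    The provisioning methodology lends itself to a first-order model; the sketch below combines a DRAM-cache hit rate with SCM latency and per-gigabyte prices. All latencies, capacities, prices, and the hit rate are placeholders, not the paper's measured values.

        def amat(hit_rate, t_cache_ns, t_scm_ns):
            # Average memory access time: DRAM cache in front of SCM.
            return hit_rate * t_cache_ns + (1 - hit_rate) * t_scm_ns

        def cost(cache_gb, scm_gb, dram_per_gb, scm_per_gb):
            return cache_gb * dram_per_gb + scm_gb * scm_per_gb

        # Page-based caching plus high spatial locality -> high hit rate (assumed).
        print(f"AMAT: {amat(0.95, 50, 300):.1f} ns")
        print(f"Cost: ${cost(16, 512, 8.0, 2.0):.0f} hybrid "
              f"vs ${cost(512, 0, 8.0, 0.0):.0f} DRAM-only")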

    The transprecision computing paradigm: Concept, design, and applications

    Guaranteed numerical precision of each elementary step in a complex computation has been the mainstay of traditional computing systems for many years. This era, fueled by Moore’s law and the constant exponential improvement in computing efficiency, is at its twilight: from tiny Internet-of-Things nodes to large HPC computing centers, sub-picojoule/operation energy efficiency is essential for practical realizations. To overcome the power wall, a shift away from traditional computing paradigms is now mandatory. In this paper we present the driving motivations, roadmap, and expected impact of the European project OPRECOMP. OPRECOMP aims to (i) develop the first complete transprecision computing framework, (ii) apply it to a wide range of hardware platforms, from the sub-milliwatt to the megawatt range, and (iii) demonstrate impact in a wide range of computational domains, spanning IoT, big data analytics, deep learning, and HPC simulations. By seamlessly combining transprecision advances in devices, circuits, software tools, and algorithms, we expect to achieve major energy efficiency improvements, even when there is no freedom to relax end-to-end application quality of results. Indeed, OPRECOMP aims to demolish the ultraconservative “precise” computing abstraction, replacing it with a more flexible and efficient one, namely transprecision computing.
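
    The precision-vs-quality trade-off at the heart of transprecision computing can be illustrated in a few lines of software; OPRECOMP itself spans hardware, tools, and algorithms, so the sketch below is only a software analogue.

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.random(10_000)
        y = rng.random(10_000)

        exact = float(np.dot(x, y))  # float64 reference result
        for dtype in (np.float32, np.float16):
            approx = float(np.dot(x.astype(dtype), y.astype(dtype)))
            rel_err = abs(approx - exact) / abs(exact)
            # Accept the cheaper precision only if end-to-end quality holds.
            print(f"{np.dtype(dtype).name}: relative error {rel_err:.1e}")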