12 research outputs found

    Towards near-threshold server processors

    Get PDF
    The popularity of cloud computing has led to a dramatic increase in the number of data centers in the world. The ever-increasing computational demands along with the slowdown in technology scaling has ushered an era of power-limited servers. Techniques such as near-threshold computing (NTC) can be used to improve energy efficiency in the post-Dennard scaling era. This paper describes an architecture based on the FD-SOI process technology for near-threshold operation in servers. Our work explores the trade-offs in energy and performance when running a wide range of applications found in private and public clouds, ranging from traditional scale-out applications, such as web search or media streaming, to virtualized banking applications. Our study demonstrates the benefits of near-threshold operation and proposes several directions to synergistically increase the energy proportionality of a near-threshold server

    When parallel speedups hit the memory wall

    Get PDF
    After Amdahl's trailblazing work, many other authors proposed analytical speedup models but none have considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but data-access delays are assumed to be constant. Nevertheless, such delays can vary, for example, according to the number of cores used and the ratio between processor and memory frequencies. Given the large number of possible configurations of operating frequency and number of cores that current architectures can offer, suitable speedup models to describe such variations among these configurations are quite desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations of the average data-access delay to describe the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed modeling can capture the desired behavior while experimental hardware results validate the former. Additionally, we show that when accounting for parameters that reflect the intrinsic characteristics of the applications, such as degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, besides being black-box modeling, our experiments show that conventional machine-learning modeling needs about one order of magnitude more measurements to reach the same level of accuracy achieved in our modeling.Comment: 24 page

    Energy Proportionality in Near-Threshold Computing Servers and Cloud Data Centers: Consolidating or Not?

    Get PDF
    Cloud Computing aims to efficiently tackle the increasing demand of computing resources, and its popularity has led to a dramatic increase in the number of computing servers and data centers worldwide. However, as effect of post-Dennard scaling, computing servers have become power-limited, and new system-level approaches must be used to improve their energy efficiency. This paper first presents an accurate power modelling characterization for a new server architecture based on the FD-SOI process technology for near-threshold computing (NTC). Then, we explore the existing energy vs. performance trade-offs when virtualized applications with different CPU utilization and memory footprint characteristics are executed. Finally, based on this analysis, we propose a novel dynamic virtual machine (VM) allocation method that exploits the knowledge of VMs characteristics together with our accurate server power model for next-generation NTC-based data centers, while guaranteeing quality of service (QoS) requirements. Our results demonstrate the inefficiency of current workload consolidation techniques for new NTC-based data center designs, and how our proposed method provides up to 45% energy savings when compared to state-of-the-art consolidation-based approaches

    Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores

    Get PDF
    In a drive to maximize resource utilization, today's datacenters are moving to colocation of latency-sensitive and batch workloads on the same server. State-of-the-art deployments, such as those at Google, colocate such diverse workloads even on a single SMT core. This form of aggressive colocation is afforded by virtue of the fact that a latency-sensitive service operating below its peak load has significant slack in its response latency with respect to the QoS target. The slack affords a degradation in single-thread performance, which is inevitable under SMT colocation, without compromising QoS targets. This work makes the observation that many batch applications can greatly benefit from a large instruction window to uncover ILP and MLP. Under SMT colocation, conventional wisdom holds that individual hardware threads should be limited in their ability to acquire and hold a disproportionately large share of microarchitectural resources so as not to compromise the performance of a co-running thread. We show that the performance slack inherent in latency-sensitive workloads operating at low to moderate load makes it safe to shift microarchitectural resources to a co-running batch thread without compromising QoS targets. Based on this insight, we introduce Stretch, a simple ROB partitioning scheme that is invoked by system software to provide one hardware thread with a much larger ROB partition at the expense of another thread. When Stretch is enabled for latency-sensitive workloads operating below their peak load on an SMT core, co-running batch applications gain 13% of performance on average (30% max) over a baseline SMT colocation and without compromising QoS constraints

    Circuits and Systems Advances in Near Threshold Computing

    Get PDF
    Modern society is witnessing a sea change in ubiquitous computing, in which people have embraced computing systems as an indispensable part of day-to-day existence. Computation, storage, and communication abilities of smartphones, for example, have undergone monumental changes over the past decade. However, global emphasis on creating and sustaining green environments is leading to a rapid and ongoing proliferation of edge computing systems and applications. As a broad spectrum of healthcare, home, and transport applications shift to the edge of the network, near-threshold computing (NTC) is emerging as one of the promising low-power computing platforms. An NTC device sets its supply voltage close to its threshold voltage, dramatically reducing the energy consumption. Despite showing substantial promise in terms of energy efficiency, NTC is yet to see widescale commercial adoption. This is because circuits and systems operating with NTC suffer from several problems, including increased sensitivity to process variation, reliability problems, performance degradation, and security vulnerabilities, to name a few. To realize its potential, we need designs, techniques, and solutions to overcome these challenges associated with NTC circuits and systems. The readers of this book will be able to familiarize themselves with recent advances in electronics systems, focusing on near-threshold computing

    Near-Memory Address Translation

    Get PDF
    Virtual memory (VM) is a crucial abstraction in modern computer systems at any scale, from handheld devices to datacenters. VM provides programmers the illusion of an always sufficiently large and linear memory, making programming easier. Although the core components of VM have remained largely unchanged since early VM designs, the design constraints and usage patterns of VM have radically shifted from when it was invented. Today, computer systems integrate hundreds of gigabytes to a few terabytes of memory, while tightly integrated heterogeneous computing platforms (e.g., CPUs, GPUs, FPGAs) are becoming increasingly ubiquitous. As there is a clear trend towards extending the CPU's VM to all computing elements in the system for an efficient and easy to use programming model, the continuous demand for faster memory accesses calls for fast translations to terabytes of memory for any computing element in the system. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of today's translation caches, such as TLBs. In this thesis, we provide fundamental insights into the reason why address translation sits on the critical path of accessing memory. We observe that the traditional fully associative flexibility to map any virtual page to any page frame precludes accessing memory before translating. We study the associativity in VM across a variety of scenarios by classifying page faults using the 3C model developed for caches. Our study demonstrates that the full associativity of VM is unnecessary, and only modest associativity is required. We conclude that capacity and compulsory misses---which are unaffected by associativity---dominate, while conflict misses rapidly disappear as the associativity of VM increases. Building on the modest associativity requirements, we propose a distributed memory management unit close to where the data resides to reduce or eliminate the TLB miss penalty

    Heterogeneous Architectures For Parallel Acceleration

    Get PDF
    To enable a new generation of digital computing applications, the greatest challenge is to provide a better level of energy efficiency (intended as the performance that a system can provide within a certain power budget) without giving up a systems's flexibility. This constraint applies to digital system across all scales, starting from ultra-low power implanted devices up to datacenters for high-performance computing and for the "cloud". In this thesis, we show that architectural heterogeneity is the key to provide this efficiency and to respond to many of the challenges of tomorrow's computer architecture - and at the same time we show methodologies to introduce it with little or no loss in terms of flexibility. In particular, we show that heterogeneity can be employed to tackle the "walls" that impede further development of new computing applications: the utilization wall, i.e. the impossibility to keep all transistors on in deeply integrated chips, and the "data deluge", i.e. the amount of data to be processed that is scaling up much faster than the computing performance and efficiency. We introduce a methodology to improve heterogeneous design exploration of tightly coupled clusters; moreover we propose a fractal heterogeneity architecture that is a parallel accelerator for low-power sensor nodes, and is itself internally heterogeneous thanks to an heterogeneous coprocessor for brain-inspired computing. This platform, which is silicon-proven, can lead to more than 100x improvement in terms of energy efficiency with respect to typical computing nodes used within the same domain, enabling the application of complex algorithms, vastly more performance-hungry than the current state-of-the-art in the ULP computing domain

    SLICING-BASED RESOURCE ALLOCATION AND MOBILITY MANAGEMENT FOR EMERGING WIRELESS NETWORKS

    Get PDF
    The proliferation of smart mobile devices and user applications has continued to contribute to the tremendous volume of data traffic in cellular networks. Moreover, with the feature of heterogeneous connectivity interfaces of these smart devices, it becomes more complex for managing the traffic volume in the context of mobility. To surmount this challenge, service and resource providers are looking for alternative mechanisms that can successfully facilitate managing network resources and mobility in a more dynamic, predictive and distributed manner. New concepts of network architectures such as Software-Defined Network (SDN) and Network Function Virtualization (NFV) have paved the way to move from static to flexible networks. They make networks more flexible (i.e., network providers capable of on-demand provisioning), easily customizable and cost effective. In this regard, network slicing is emerging as a new technology built on the concepts of SDN and NFV. It splits a network infrastructure into isolated virtual networks and allows them to manage network resources based on their requirements and characteristics. Most of the existing solutions for network slicing are facing challenges in terms of resource and mobility management. Regarding resource management, it creates challenges in terms of provisioning network throughput, end-to-end delay, and fairness resources allocation for each slice, whereas, in the case of mobility management, due to the rapid change of user mobility the network slice operator would like to hold the mobility controlling over its clients across different access networks, rather than the network operator, to ensure better services and user experience. In this thesis, we propose two novel architectural solutions to solve the challenges identified above. The first proposed solution introduces a Network Slicing Resource Management (NSRM) mechanism that assigns the required resources for each slice, taking into consideration resource isolation between different slices. The second proposed v solution provides a Mobility Management architecture-based Network Slicing (MMNS) where each slice manages its users across heterogeneous radio access technologies such as WiFi, LTE and 5G networks. In MMNS architecture, each slice has different mobility demands (e.g,. latency, speed and interference) and these demands are governed by a network slice configuration and service characteristics. In addition, NSRM ensures isolating, customizing and fair sharing of distributed bandwidths between various network slices and users belonging to the same slice depending on different requirements of each one. Whereas, MMNS is a logical platform that unifies different Radio Access Technologies (RATs) and allows all slices to share them in order to satisfy different slice mobility demands. We considered two software simulations, namely OPNET Modeler and OMNET++, to validate the performance evaluation of the thesis contributions. The simulation results for both proposed architectures show that, in case of NSRM, the resource blocking is approximately 35% less compared to the legacy LTE network, which it allows to accommodate more users. The NSRM also successfully maintains the isolation for both the inter and intra network slices. Moreover, the results show that the NSRM is able to run different scheduling mechanisms where each network slice guarantee perform its own scheduling mechanism and simultaneously with other slices. Regarding the MMNS, the results show the advantages of the proposed architecture that are the reduction of the tunnelling overhead and the minimization of the handover latency. The MMNS results show the packets delivery cost is optimal by reducing the number of hops that the packets transit between a source node and destination. Additionally, seamless session continues of a user IP-flow between different access networks interfaces has been successfully achieved
    corecore