5,984 research outputs found

    Modeling Data-Plane Power Consumption of Future Internet Architectures

    Full text link
    With current efforts to design Future Internet Architectures (FIAs), the evaluation and comparison of different proposals is an interesting research challenge. Previously, metrics such as bandwidth or latency have commonly been used to compare FIAs to IP networks. We suggest the use of power consumption as a metric to compare FIAs. While low power consumption is an important goal in its own right (as lower energy use translates to smaller environmental impact as well as lower operating costs), power consumption can also serve as a proxy for other metrics such as bandwidth and processor load. Lacking power consumption statistics about either commodity FIA routers or widely deployed FIA testbeds, we propose models for power consumption of FIA routers. Based on our models, we simulate scenarios for measuring power consumption of content delivery in different FIAs. Specifically, we address two questions: 1) which of the proposed FIA candidates achieves the lowest energy footprint; and 2) which set of design choices yields a power-efficient network architecture? Although the lack of real-world data makes numerous assumptions necessary for our analysis, we explore the uncertainty of our calculations through sensitivity analysis of input parameters

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Get PDF
    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric--static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency

    Near-Memory Address Translation

    Full text link
    Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure

    Formal Executable Models for Automatic Detection of Timing Anomalies

    Get PDF
    A timing anomaly is a counterintuitive timing behavior in the sense that a local fast execution slows down an overall global execution. The presence of such behaviors is inconvenient for the WCET analysis which requires, via abstractions, a certain monotony property to compute safe bounds. In this paper we explore how to systematically execute a previously proposed formal definition of timing anomalies. We ground our work on formal designs of architecture models upon which we employ guided model checking techniques. Our goal is towards the automatic detection of timing anomalies in given computer architecture designs

    A unified approach to the performance analysis of caching systems

    Get PDF
    We propose a unified methodology to analyse the performance of caches (both isolated and interconnected), by extending and generalizing a decoupling technique originally known as Che's approximation, which provides very accurate results at low computational cost. We consider several caching policies, taking into account the effects of temporal locality. In the case of interconnected caches, our approach allows us to do better than the Poisson approximation commonly adopted in prior work. Our results, validated against simulations and trace-driven experiments, provide interesting insights into the performance of caching systems.Comment: in ACM TOMPECS 20016. Preliminary version published at IEEE Infocom 201
    • …
    corecore