903 research outputs found

    Software-Controlled Instruction Prefetch Buffering for Low-End Processors

    Get PDF
    This paper proposes a method of buffering instructions by software-based prefetching. The method allows low-end processors to improve their instruction throughput with a minimum of additional logic and power consumption. Low-end embedded processors do not employ caches for mainly two reasons. The first reason is that the overhead of cache implementation in terms of energy and area is considerable. The second reason is that, because a cache's performance primarily depends on the number of hits, an increasing number of misses could cause a processor to remain in stall mode for a longer duration. As a result, a cache may become more of a liability than an advantage. In contrast, the benchmarked results for the proposed software-based prefetch buffering without a cache show a 5-10% improvement in execution time. They also show a 4% or more reduction in the energy-delay-square-product (ED2P) with a maximum reduction of 40%. The results additionally demonstrate that the performance and efficiency of the proposed architecture scales with the number of multicycle instructions. The benchmarked routines tested to arrive at these results are widely deployed components of embedded applications

    Adaptive runtime-assisted block prefetching on chip-multiprocessors

    Get PDF
    Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft

    A Survey of Techniques for Architecting TLBs

    Get PDF
    “Translation lookaside buffer” (TLB) caches virtual to physical address translation information and is used in systems ranging from embedded devices to high-end servers. Since TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of TLB is important for improving performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects and system engineers

    WCET-aware prefetching of unlocked instruction caches: a technique for reconciling real-time guarantees and energy efficiency

    Get PDF
    Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2015.A computação embarcada requer crescente vazão sob baixa potência. Ela requer um aumento de eficiência energética quando se executam programas de crescente complexidade. Muitos sistemas embarcados são também sistemas de tempo real, cuja correção temporal precisa ser garantida através de análise de escalonabilidade, a qual costuma assumir que o WCET de uma tarefa é conhecido em tempo de projeto. Como resultado da crescente complexidade do software, uma quantidade significativa de energia é gasta ao se prover instruções através da hierarquia de memória. Como a cache de instruções consome cerca de 40% da energia gasta em um processador embarcado e afeta a energia consumida em memória principal, ela se torna um relevante alvo para otimização. Entretanto, como ela afeta substancialmente o WCET, o comportamento da cache precisa ser restrito via  cache locking ou previsto via análise de WCET. Para obter eficiência energética sob restrições de tempo real, é preciso estender a consciência que o compilador tem da plataforma de hardware. Entretanto, compiladores para tempo real ignoram a energia, embora determinem rapidamente limites superiores para o WCET, enquanto compiladores para sistemas embarcados estimem com precisão a energia, mas gastem muito tempo em  profiling . Por isso, esta tese propõe um método unificado para estimar a energia gasta em memória, o qual é baseado em Interpretação Abstrata, exatamente o mesmo substrato matemático usado para a análise de WCET em caches. As estimativas mostram derivadas que são tão precisas quanto as obtidas via  profiling , mas são computadas 1000 vezes mais rápido, sendo apropriadas para induzir otimização de código através de melhoria iterativa. Como  cache locking troca eficiência energética por previsibilidade, esta tese propõe uma nova otimização de código, baseada em pré-carga por software, a qual reduz a taxa de faltas de caches de instruções e, provadamente, não aumenta o WCET. A otimização proposta é comparada com o estado-da-arte em  cache locking parcial para 37 programas do  Malardalen WCET benchmark para 36 configurações de cache e duas tecnologias distintas (2664 casos de uso). Em média, para obter uma melhoria de 68% no WCET,  cache locking parcial requer 8% mais energia. Por outro lado, a pré-carga por software diminui o consumo de energia em 11% enquanto melhora em 15% o WCET, reconciliando assim eficiência energética e garantias de tempo real.Abstract : Embedded computing requires increasing throughput at low power budgets. It asks for growing energy efficiency when executing programs of rising complexity. Many embedded systems are also real-time systems, whose temporal correctness is asserted through schedulability analysis, which often assumes that the WCET of each task is known at design-time. As a result of the growing software complexity, a significant amount of energy is spent in supplying instructions through the memory hierarchy. Since an instruction cache consumes around 40% of an embedded processor s energy and affects the energy spent in main memory, it becomes a relevant optimization target. However, since it largely impacts the WCET, cache behavior must be either constrained via cache locking or predicted by WCET analysis. To achieve energy efficiency under real-time constraints, a compiler must have extended awareness of the hardware platform. However, real-time compilers ignore energy, although they quickly determine bounds for WCET, whereas embedded compilers accurately estimate energy but require time-consuming profiling. That is why this thesis proposes a unifying method to estimate memory energy consumption that is based on Abstract Interpretation, the very same mathematical framework employed for the WCET analysis of caches. The estimates exhibit derivatives that are as accurate as those obtained by profiling, but are computed 1000 times faster, being suitable for driving code optimization through iterative improvement. Since cache locking gives up energy efficiency for predictability, this thesis proposes a novel code optimization, based on software prefetching, which reduces miss rate of unlocked instruction caches and, provenly, does not increase the WCET. The proposed optimization is compared with a state-of-the-art partial cache locking technique for the 37 programs of the Malardalen WCET benchmarks under 36 cache configurations and two distinct target technologies (2664 use cases). On average, to achieve an improvement of 68% in the WCET, partial cache locking required 8% more energy. On the other hand, software prefetching decreased the energy consumption by 11% while leading to an improvement of 15% in the WCET, thereby reconciling energy efficiency and real-time guarantees

    Three-dimensional memory vectorization for high bandwidth media memory systems

    Get PDF
    Vector processors have good performance, cost and adaptability when targeting multimedia applications. However, for a significant number of media programs, conventional memory configurations fail to deliver enough memory references per cycle to feed the SIMD functional units. This paper addresses the problem of the memory bandwidth. We propose a novel mechanism suitable for 2-dimensional vector architectures and targeted at providing high effective bandwidth for SIMD memory instructions. The basis of this mechanism is the extension of the scope of vectorization at the memory level, so that 3-dimensional memory patterns can be fetched into a second-level register file. By fetching long blocks of data and by reusing 2-dimensional memory streams at this second-level register file, we obtain a significant increase in the effective memory bandwidth. As side benefits, the new 3-dimensional load instructions provide a high robustness to memory latency and a significant reduction of the cache activity, thus reducing power and energy requirements. At the investment of a 50% more area than a regular SIMD register file, we have measured and average speed-up of 13% and the potential for power savings in the L2 cache of a 30%.Peer ReviewedPostprint (published version

    Analysis of the reconfiguration latency and energy overheads for a Xilinx Virtex-5 FPGA

    Get PDF
    In this paper we have evaluated the overhead and the tradeoffs of a set of components usually included in a system with run-time partial reconfiguration implemented on a Xilinx Virtex-5. Our analysis shows the benefits of including a scratchpad memory inside the reconfiguration controller in order to improve the efficiency of the reconfiguration process. We have designed a simple controller for this scratchpad that includes support for prefetching and caching in order to further reduce both the energy and latency overhead

    Scalable Hierarchical Instruction Cache for Ultra-Low-Power Processors Clusters

    Full text link
    High Performance and Energy Efficiency are critical requirements for Internet of Things (IoT) end-nodes. Exploiting tightly-coupled clusters of programmable processors (CMPs) has recently emerged as a suitable solution to address this challenge. One of the main bottlenecks limiting the performance and energy efficiency of these systems is the instruction cache architecture due to its criticality in terms of timing (i.e., maximum operating frequency), bandwidth, and power. We propose a hierarchical instruction cache tailored to ultra-low-power tightly-coupled processor clusters where a relatively large cache (L1.5) is shared by L1 private caches through a two-cycle latency interconnect. To address the performance loss caused by the L1 capacity misses, we introduce a next-line prefetcher with cache probe filtering (CPF) from L1 to L1.5. We optimize the core instruction fetch (IF) stage by removing the critical core-to-L1 combinational path. We present a detailed comparison of instruction cache architectures' performance and energy efficiency for parallel ultra-low-power (ULP) clusters. Focusing on the implementation, our two-level instruction cache provides better scalability than existing shared caches, delivering up to 20\% higher operating frequency. On average, the proposed two-level cache improves maximum performance by up to 17\% compared to the state-of-the-art while delivering similar energy efficiency for most relevant applications.Comment: 14 page
    corecore