185 research outputs found

    Cache remapping to improve the performance of tiled algorithms

    Get PDF

    Impulse: building a smarter memory controller

    Get PDF
    Journal ArticleImpulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper we describe the design of the Impulse architecture, and show how an Impulse memory system can be used to improve the performance of memory-bound programs. For the NAS conjugate gradient benchmark, Impulse improves performance by 67%. Because it requires no modification to processor, cache, or bus designs, Impulse can be adopted in conventional systems. In addition to scientific applications, we expect that Impulse will benefit regularly strided, memory-bound applications of commercial importance, such as database and multimedia programs

    Impulse: Memory System Support for Scientific Applications

    Get PDF

    Dynamic Parameter Allocation in Parameter Servers

    Full text link
    To keep up with increasing dataset sizes and model complexity, distributed training has become a necessity for large machine learning tasks. Parameter servers ease the implementation of distributed parameter management---a key concern in distributed training---, but can induce severe communication overhead. To reduce communication overhead, distributed machine learning algorithms use techniques to increase parameter access locality (PAL), achieving up to linear speed-ups. We found that existing parameter servers provide only limited support for PAL techniques, however, and therefore prevent efficient training. In this paper, we explore whether and to what extent PAL techniques can be supported, and whether such support is beneficial. We propose to integrate dynamic parameter allocation into parameter servers, describe an efficient implementation of such a parameter server called Lapse, and experimentally compare its performance to existing parameter servers across a number of machine learning tasks. We found that Lapse provides near-linear scaling and can be orders of magnitude faster than existing parameter servers

    Memory system support for image processing

    Get PDF
    Journal ArticleProcessor speeds are increasing rapidly, but memory speeds are not keeping pace. Image processing is an important application domain that is particularly impacted by this growing performance gap. Image processing algorithms tend to have poor memory locality because they access their data in a non-sequential fashion and reuse that data infrequently. As a result, they often exhibit poor cache and TLB hit rates on conventional memory systems, which limits overall performance. Most current approaches to addressing the memory bottleneck focus on modifying cache organizations or introducing processor-based prefetching. The Impulse memory system takes a different approach: allowing application software to control how, when, and where data are loaded into a conventional processor cache. Impulse does this by letting software configure how the memory controller interprets the physical addresses exported by a processor. Introducing an extra level of address translation in the memory. Data that is sparse in memory can be accessed densely, which improves both cache and TLB utilization, and Impulse hides memory latency by prefectching data within the memory controller. We describe how Impulse improves the performance of three image processing algorithms: an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system

    WCET-aware prefetching of unlocked instruction caches: a technique for reconciling real-time guarantees and energy efficiency

    Get PDF
    Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2015.A computação embarcada requer crescente vazão sob baixa potência. Ela requer um aumento de eficiência energética quando se executam programas de crescente complexidade. Muitos sistemas embarcados são também sistemas de tempo real, cuja correção temporal precisa ser garantida através de análise de escalonabilidade, a qual costuma assumir que o WCET de uma tarefa é conhecido em tempo de projeto. Como resultado da crescente complexidade do software, uma quantidade significativa de energia é gasta ao se prover instruções através da hierarquia de memória. Como a cache de instruções consome cerca de 40% da energia gasta em um processador embarcado e afeta a energia consumida em memória principal, ela se torna um relevante alvo para otimização. Entretanto, como ela afeta substancialmente o WCET, o comportamento da cache precisa ser restrito via  cache locking ou previsto via análise de WCET. Para obter eficiência energética sob restrições de tempo real, é preciso estender a consciência que o compilador tem da plataforma de hardware. Entretanto, compiladores para tempo real ignoram a energia, embora determinem rapidamente limites superiores para o WCET, enquanto compiladores para sistemas embarcados estimem com precisão a energia, mas gastem muito tempo em  profiling . Por isso, esta tese propõe um método unificado para estimar a energia gasta em memória, o qual é baseado em Interpretação Abstrata, exatamente o mesmo substrato matemático usado para a análise de WCET em caches. As estimativas mostram derivadas que são tão precisas quanto as obtidas via  profiling , mas são computadas 1000 vezes mais rápido, sendo apropriadas para induzir otimização de código através de melhoria iterativa. Como  cache locking troca eficiência energética por previsibilidade, esta tese propõe uma nova otimização de código, baseada em pré-carga por software, a qual reduz a taxa de faltas de caches de instruções e, provadamente, não aumenta o WCET. A otimização proposta é comparada com o estado-da-arte em  cache locking parcial para 37 programas do  Malardalen WCET benchmark para 36 configurações de cache e duas tecnologias distintas (2664 casos de uso). Em média, para obter uma melhoria de 68% no WCET,  cache locking parcial requer 8% mais energia. Por outro lado, a pré-carga por software diminui o consumo de energia em 11% enquanto melhora em 15% o WCET, reconciliando assim eficiência energética e garantias de tempo real.Abstract : Embedded computing requires increasing throughput at low power budgets. It asks for growing energy efficiency when executing programs of rising complexity. Many embedded systems are also real-time systems, whose temporal correctness is asserted through schedulability analysis, which often assumes that the WCET of each task is known at design-time. As a result of the growing software complexity, a significant amount of energy is spent in supplying instructions through the memory hierarchy. Since an instruction cache consumes around 40% of an embedded processor s energy and affects the energy spent in main memory, it becomes a relevant optimization target. However, since it largely impacts the WCET, cache behavior must be either constrained via cache locking or predicted by WCET analysis. To achieve energy efficiency under real-time constraints, a compiler must have extended awareness of the hardware platform. However, real-time compilers ignore energy, although they quickly determine bounds for WCET, whereas embedded compilers accurately estimate energy but require time-consuming profiling. That is why this thesis proposes a unifying method to estimate memory energy consumption that is based on Abstract Interpretation, the very same mathematical framework employed for the WCET analysis of caches. The estimates exhibit derivatives that are as accurate as those obtained by profiling, but are computed 1000 times faster, being suitable for driving code optimization through iterative improvement. Since cache locking gives up energy efficiency for predictability, this thesis proposes a novel code optimization, based on software prefetching, which reduces miss rate of unlocked instruction caches and, provenly, does not increase the WCET. The proposed optimization is compared with a state-of-the-art partial cache locking technique for the 37 programs of the Malardalen WCET benchmarks under 36 cache configurations and two distinct target technologies (2664 use cases). On average, to achieve an improvement of 68% in the WCET, partial cache locking required 8% more energy. On the other hand, software prefetching decreased the energy consumption by 11% while leading to an improvement of 15% in the WCET, thereby reconciling energy efficiency and real-time guarantees

    Locality Enhancement and Dynamic Optimizations on Multi-Core and GPU

    Get PDF
    Enhancing the match between software executions and hardware features is key to computing efficiency. The match is a continuously evolving and challenging problem. This dissertation focuses on the development of programming system support for exploiting two key features of modern hardware development: the massive parallelism of emerging computational accelerators such as Graphic Processing Units (GPU), and the non-uniformity of cache sharing in modern multicore processors. They are respectively driven by the important role of accelerators in today\u27s general-purpose computing and the ultimate importance of memory performance. This dissertation particularly concentrates on optimizing control flows and memory references, at both compilation and execution time, to tap into the full potential of pure software solutions in taking advantage of the two key hardware features.;Conditional branches cause divergences in program control flows, which may result in serious performance degradation on massively data-parallel GPU architectures with Single Instruction Multiple Data (SIMD) parallelism. On such an architecture, control divergence may force computing units to stay idle for a substantial time, throttling system throughput by orders of magnitude. This dissertation provides an extensive exploration of the solution to this problem and presents program level transformations based upon two fundamental techniques --- thread relocation and data relocation. These two optimizations provide fundamental support for swapping jobs among threads so that the control flow paths of threads converge within every SIMD thread group.;In memory performance, this dissertation concentrates on two aspects: the influence of nonuniform sharing on multithreading applications, and the optimization of irregular memory references on GPUs. In shared cache multicore chips, interactions among threads are complicated due to the interplay of cache contention and synergistic prefetching. This dissertation presents the first systematic study on the influence of non-uniform shared cache on contemporary parallel programs, reveals the mismatch between the software development and underlying cache sharing hierarchies, and further demonstrates it by proposing and applying cache-sharing-aware data transformations that bring significant performance improvement. For the second aspect, the efficiency of GPU accelerators is sensitive to irregular memory references, which refer to the memory references whose access patterns remain unknown until execution time (e.g., A[P[i]]). The root causes of the irregular memory reference problem are similar to that of the control flow problem, while in a more general and complex form. I developed a framework, named G-Streamline, as a unified software solution to dynamic irregularities in GPU computing. It treats both types of irregularities at the same time in a holistic fashion, maximizing the whole-program performance by resolving conflicts among optimizations

    An accurate prefetching policy for object oriented systems

    Get PDF
    PhD ThesisIn the latest high-performance computers, there is a growing requirement for accurate prefetching(AP) methodologies for advanced object management schemes in virtual memory and migration systems. The major issue for achieving this goal is that of finding a simple way of accurately predicting the objects that will be referenced in the near future and to group them so as to allow them to be fetched same time. The basic notion of AP involves building a relationship for logically grouping related objects and prefetching them, rather than using their physical grouping and it relies on demand fetching such as is done in existing restructuring or grouping schemes. By this, AP tries to overcome some of the shortcomings posed by physical grouping methods. Prefetching also makes use of the properties of object oriented languages to build inter and intra object relationships as a means of logical grouping. This thesis describes how this relationship can be established at compile time and how it can be used for accurate object prefetching in virtual memory systems. In addition, AP performs control flow and data dependency analysis to reinforce the relationships and to find the dependencies of a program. The user program is decomposed into prefetching blocks which contain all the information needed for block prefetching such as long branches and function calls at major branch points. The proposed prefetching scheme is implemented by extending a C++ compiler and evaluated on a virtual memory simulator. The results show a significant reduction both in the number of page fault and memory pollution. In particular, AP can suppress many page faults that occur during transition phases which are unmanageable by other ways of fetching. AP can be applied to a local and distributed virtual memory system so as to reduce the fault rate by fetching groups of objects at the same time and consequently lessening operating system overheads.British Counci

    Steganographic database management system

    Get PDF
    Master'sMASTER OF SCIENC
    corecore