30 research outputs found

    Is random access memory random?

    Get PDF
    Most software is constructed on the assumption that programs and data are stored in random access memory (RAM). Physical limitations on the relative speeds of processor and memory elements lead to a variety of memory organizations that match the processor's addressing rate with the memory's service rate, including interleaved and cached memory. A very high fraction of a processor's address requests can be satisfied from the cache without reference to the main memory. The cache requests information from main memory in blocks that can be transferred at the full memory speed. Programmers who organize algorithms for locality can realize the highest performance from these computers.
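    The closing point about locality rewards being concrete. The sketch below (ours, not from the paper) sums the same row-major matrix in two loop orders; the stride-1 traversal reuses each fetched cache line, while the strided traversal forces a new line fill on almost every access.

```cpp
#include <cstddef>
#include <vector>

// Sum a matrix stored in row-major order. With the row index in the outer
// loop, consecutive iterations touch consecutive addresses, so each cache
// line fetched from main memory serves several subsequent accesses.
double sum_rowwise(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t i = 0; i < rows; ++i)        // cache-friendly: stride 1
        for (std::size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];
    return s;
}

// Same data, column-first traversal: each access jumps `cols` elements
// ahead, so line reuse is lost and miss rates climb for large matrices.
double sum_columnwise(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double s = 0.0;
    for (std::size_t j = 0; j < cols; ++j)        // cache-hostile: stride = cols
        for (std::size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
    return s;
}
```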

    Improving Data Locality in Applications using Message Passing

    Get PDF
    This thesis presents a systematic study of two modes of program execution: synchronous and asynchronous. In synchronous mode, program components are tightly coupled; the traditional procedure call represents this mode. In asynchronous mode, program components execute independently of each other; asynchronous message passing represents this mode. The asynchronous mode introduces communication overhead into the execution of program components. However, it improves the temporal locality of data in a program by facilitating temporal and spatial reorganization of program components. Temporal reorganization refers to the batched execution of program components. Spatial reorganization refers to the scheduling of components on different processors in order to avoid over-subscription of cache memory. Synchronous execution avoids the communication overhead. The goal of this study is to systematically understand the trade-offs associated with each execution mode and the effect of each mode on the throughput and resource utilization of applications. The findings of this study help derive application designs for achieving high throughput on current and future multicore hardware.
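    As a rough illustration of the two modes (a sketch of ours, not the thesis's code), the synchronous version below invokes the component once per request, while the asynchronous version drains a message queue in one batch so the component's code and data stay cache-resident across consecutive messages:

```cpp
#include <queue>
#include <vector>

struct Msg { int payload; };

// Synchronous mode: the caller invokes the component directly; the
// component's code and data enter the cache on every call, interleaved
// with the caller's own working set.
int process(const Msg& m) { return m.payload * 2; }

// Asynchronous mode (temporal reorganization): messages accumulate in a
// queue and the component runs once over the whole batch, so its code
// and data stay hot in the cache across consecutive messages.
std::vector<int> process_batch(std::queue<Msg>& inbox) {
    std::vector<int> results;
    while (!inbox.empty()) {
        results.push_back(process(inbox.front()));
        inbox.pop();
    }
    return results;
}
```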

    Location-dependent data caching with handover and replacement for mobile ad hoc networks

    Get PDF
    Master's thesis (Master of Engineering)

    PERFORMANCE OPTIMIZATION OF A STRUCTURED CFD CODE - GHOST ON COMMODITY CLUSTER ARCHITECTURES

    Get PDF
    This thesis focuses on optimizing the performance of an in-house, structured, 2D CFD code, GHOST, on commodity cluster architectures. The basic philosophy of the work is to optimize the code's cache usage through efficient coding techniques without changing the underlying numerical algorithm. The optimization techniques that were implemented and the resulting changes in performance are presented. Two techniques implemented earlier to tune the performance of this code, external and internal blocking, are reviewed, followed by further tuning efforts to circumvent the problems associated with the blocking techniques. To establish the generality of the optimization techniques, testing was also done on a more complicated test case. All the techniques presented in this thesis have been tested on steady, laminar test cases. The optimized versions of the code are shown to achieve better performance on the variety of commodity cluster architectures chosen in this study.
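    Internal blocking of the kind reviewed here is essentially loop tiling. The following generic sketch (an illustration under our own assumptions, not GHOST's code; the tile edge B would be tuned to the target cache) sweeps a 2D grid tile by tile so each tile stays cache-resident while it is worked on:

```cpp
#include <cstddef>

// Tile edge in grid points; chosen so a B x B working set fits in cache
// (the value here is an assumption, to be tuned per machine).
constexpr std::size_t B = 64;

// Blocked stencil-style smoothing pass over the interior of an n x n grid.
// The two outer loops walk tiles; the two inner loops stay within a tile,
// so the tile's lines are reused instead of streaming the whole grid
// through the cache on every sweep.
void smooth_blocked(double* u, const double* f, std::size_t n) {
    for (std::size_t ii = 1; ii + 1 < n; ii += B)
        for (std::size_t jj = 1; jj + 1 < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i + 1 < n; ++i)
                for (std::size_t j = jj; j < jj + B && j + 1 < n; ++j)
                    u[i * n + j] = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j]
                                         + u[i * n + (j - 1)] + u[i * n + (j + 1)])
                                 + f[i * n + j];
}
```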

    Exploiting data locality in cache-coherent NUMA systems

    Get PDF
    The end of Dennard scaling has caused a stagnation of the clock frequency in computers. To overcome this issue, in the last two decades vendors have been integrating larger numbers of processing elements in their systems: interconnecting many nodes, including multiple chips per node, and increasing the number of cores in each chip. The speed of main memory has not evolved at the same rate as processors; it is much slower, and there is a need to provide more total bandwidth to the processors, especially as the number of cores and chips grows. While still keeping a shared address space, where all processors can access the whole memory, solutions have come from integrating more memories: using newer technologies like high-bandwidth memory (HBM) and non-volatile memory (NVM), giving groups of cores (such as sockets) faster access to some subset of the DRAM, or combining several of these solutions. This has introduced heterogeneity in the access speed to main memory, depending on the CPU requesting access to a memory address and the actual physical location of that address, causing non-uniform memory access (NUMA) behaviours. Moreover, many of these systems are cache-coherent (ccNUMA), meaning that changes made to memory from one CPU must be visible to the other CPUs, transparently to the programmer. These NUMA behaviours reduce the performance of applications and can pose a challenge to programmers. To tackle this issue, this thesis proposes solutions, at the software and hardware levels, to improve data locality in NUMA systems and, therefore, the performance of applications on these systems. The first contribution shows that considering hardware prefetching simultaneously with thread and data placement in NUMA systems can find configurations with better performance than considering these aspects separately. The performance results, combined with performance counters, are then used to build a performance model that predicts, both offline and online, the best configuration for new applications not in the model. The evaluation is done on two different high-performance NUMA systems, and the performance counters collected on one machine are used to predict the best configurations on the other. The second contribution builds on the idea that prefetching can have a strong effect in NUMA systems and proposes a NUMA-aware hardware prefetching scheme. This scheme is generic and can be applied to multiple hardware prefetchers at a low hardware cost while giving very good results. The evaluation is done using a cycle-accurate architectural simulator and provides detailed results on performance, data-transfer reduction, and energy costs. Finally, the third and last contribution consists of scheduling algorithms for task-based programming models. These programming models help improve the programmability of applications on parallel systems and also provide useful information to the underlying runtime system. This information is used to build a task dependency graph (TDG), a directed acyclic graph that models the application, where the nodes are sequential pieces of code known as tasks and the edges are the data dependencies between them. The proposed scheduling algorithms use graph-partitioning techniques to produce a schedule of the tasks in the TDG that minimises the data transfers between the different NUMA regions of the system. The results have been evaluated on real ccNUMA systems with multiple NUMA regions.
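    As a concrete flavour of the thread-and-data-placement side of the first contribution, the hedged sketch below uses Linux's libnuma to pin the calling thread to one NUMA node and allocate its working set on the same node; it is a minimal illustration, not the thesis's mechanism:

```cpp
// Build with: g++ numa_sketch.cpp -lnuma   (Linux with libnuma installed)
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::puts("NUMA is not available on this system");
        return 1;
    }
    const int node = 0;
    const std::size_t bytes = 1 << 20;

    // Run the calling thread on NUMA node 0 and allocate its working set
    // from the same node, so its loads avoid remote-node memory latency.
    numa_run_on_node(node);
    char* buf = static_cast<char*>(numa_alloc_onnode(bytes, node));
    if (!buf) return 1;
    std::memset(buf, 0, bytes);  // touch the pages while on the local node

    numa_free(buf, bytes);
    return 0;
}
```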

    Exploring the replacement of DRAM with NVM in main memory through simulation

    Get PDF
    Advisors: Rodolfo Jardim de Azevedo, Emílio de Camargo Francesquini. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Abstract: Computer memory systems have long relied on volatile memories to enhance their performance. SRAM technology is used at the layer closest to the CPU to accelerate access to the main memory, which is traditionally composed of DRAM. Non-volatile memories are left as secondary storage, serving as an extension of the main memory and allowing data to be persisted. Because persistent data reside in the memory layer farthest from the CPU, they are commonly not manipulated directly; they are manipulated through transient copies that may differ in form from their persistent form, requiring a translation between the two forms. These transient copies are also scattered throughout the several volatile memories in the memory hierarchy, incurring data replication. This scenario may change with the adoption of emerging non-volatile memories (NVMs), such as phase-change memory, which may allow persistent data to exist in main memory. This would allow direct manipulation of persistent data, accelerating their access time and potentially reducing the use of replicas. Unfortunately, NVMs are still not broadly available on the market, and research on their usage is still mostly done through simulation. We present a simulator to explore the usage of NVMs in main memory and demonstrate it in two scenarios: one in which DRAM is completely replaced by NVM, and one in which a hybrid architecture employing both DRAM and NVM is explored. For now, DRAM provides faster access times than NVMs. We show that a main memory composed exclusively of NVM may incur slowdowns as high as 5.3x, although the impact is marginal for some programs. In the hybrid scenario, we show that, although persistent data can be manipulated directly, there are cases in which it is still better to work with transient copies in DRAM, depending on how frequently the persistent data are used. To allow programs to make use of the non-volatility present in main memory, we provide an API, called NVMalloc, that can allocate persistent memory in the main memory. We expect this simulator to be a starting point for future research on the usage of NVMs.
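    The abstract names the NVMalloc API but not its interface, so the following sketch is speculative: hypothetical nvmalloc/nvfree signatures (stubbed here with malloc/free) stand in for an allocator that, under the simulator, would return addresses in the NVM-backed region, letting persistent data be built in place with no transient DRAM copy:

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-ins for the NVMalloc API named in the abstract; the
// real signatures are not given there, so these are assumptions. They fall
// back to malloc/free here, whereas under the simulator they would return
// addresses inside the NVM region of the hybrid main memory.
void* nvmalloc(std::size_t size) { return std::malloc(size); }
void  nvfree(void* p)            { std::free(p); }

struct Record { int id; char name[32]; };

// Persistent data manipulated in place: no transient DRAM copy and no
// translation between an in-memory form and an on-disk form.
Record* save_record(int id, const char* name) {
    Record* r = static_cast<Record*>(nvmalloc(sizeof(Record)));
    r->id = id;
    std::strncpy(r->name, name, sizeof(r->name) - 1);
    r->name[sizeof(r->name) - 1] = '\0';
    return r;  // on real NVM, this record would survive the process
}
```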

    Cache implementation in the ArchC project

    Get PDF
    Advisors: Paulo Cesar Centoducatte, Rodolfo Jardim de Azevedo. Master's dissertation, Universidade Estadual de Campinas, Instituto de Computação. Abstract: The ArchC project aims to create an architecture description language for building complete computer architecture simulators and toolchains. The goal of this work is to add support in ArchC for simulating caches. To achieve this, a detailed study of caches (types, organizations, configurations, etc.) and of the ArchC code was carried out. The result is a collection of parameterized caches that may be added to the architectures described with ArchC. The cache implementation is modular, with isolated code for the cache's storage and its operation policies. Correctness was verified through a series of simulations of many cache configurations and through comparisons against results from the dinero simulator. The resulting cache showed a simulation-time overhead varying between 10% and 60% when compared to a simulator without caches.
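    A minimal parameterized cache model in the spirit described above, though not ArchC's actual implementation, might look like the following: sets, associativity, and line size are constructor parameters, replacement is LRU, and hits and misses are counted for comparison against a reference simulator such as dinero:

```cpp
#include <cstdint>
#include <vector>

// Minimal parameterized, set-associative, LRU cache model (an illustration,
// not ArchC's code). Addresses are mapped to sets by block number.
class Cache {
    struct Line { std::uint64_t tag = 0; unsigned age = 0; bool valid = false; };
    std::vector<Line> lines_;
    unsigned sets_, ways_, line_bits_;
public:
    std::uint64_t hits = 0, misses = 0;

    Cache(unsigned sets, unsigned ways, unsigned line_bytes_log2)
        : lines_(sets * ways), sets_(sets), ways_(ways), line_bits_(line_bytes_log2) {}

    void access(std::uint64_t addr) {
        const std::uint64_t block = addr >> line_bits_;
        Line* set = &lines_[(block % sets_) * ways_];
        for (unsigned w = 0; w < ways_; ++w) ++set[w].age;   // age every way
        for (unsigned w = 0; w < ways_; ++w)
            if (set[w].valid && set[w].tag == block) {       // hit
                set[w].age = 0;
                ++hits;
                return;
            }
        // Miss: fill an invalid way if one exists, else evict the oldest.
        Line* victim = set;
        for (unsigned w = 0; w < ways_; ++w) {
            if (!set[w].valid) { victim = &set[w]; break; }
            if (set[w].age > victim->age) victim = &set[w];
        }
        ++misses;
        *victim = Line{block, 0, true};
    }
};
```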

    Auto-tuning compiler options for HPC

    Get PDF

    NRC-SIM: A Node-RED Based Multi-Level, Many-Core Cache Simulator

    Get PDF
    As computational systems become ever more integral to daily life, so too does the importance of understanding how these complex systems work. For those unfamiliar with the underlying concepts, this can be a daunting task. To address such concerns, this paper presents a Node-RED based cache simulator that enables users to observe the effects of their desired cache configuration, with the ability to easily modify parameters such as the core count, the number of cache levels, and the coherence protocol. Through the use of Node-RED, NRC-SIM offers web-based cache simulation to any web-connected device, including computers, laptops, and mobile devices; users need only select their desired parameters, and the formatting for execution is handled in the background. Node-RED's modular, flow-based design enables the execution of NRC-SIM, converting the user's selected inputs into the proper format. As NRC-SIM is a trace-driven cache simulator, trace files were collected from benchmark programs in PARSEC and SPLASH2 using Intel's Pin, an open-source dynamic instrumentation tool framework. NRC-SIM's experimental results show that it is an efficient, well-rounded cache simulator that generates results comparable to, or an improvement over, similar established cache simulators such as SMPCache and SIMNCORE. As such, NRC-SIM can serve as an effective simulation tool for research, or as an educational tool that introduces those less familiar with cache memory to the subject.
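    Trace-driven simulation of this sort has a simple core loop. The skeleton below (our sketch; the one-hex-address-per-line trace format is an assumption, not NRC-SIM's) replays a memory-address trace, such as one captured with a Pin-based tool, through a cache model:

```cpp
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>

// Trace-driven simulation skeleton: read one hexadecimal address per line
// (an assumed format) and feed each reference to a cache model.
int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: sim <trace-file>\n";
        return 1;
    }
    std::ifstream trace(argv[1]);
    std::uint64_t refs = 0;
    std::string tok;
    while (trace >> tok) {
        std::uint64_t addr = std::stoull(tok, nullptr, 16);
        // cache.access(addr);  // plug in a cache model like the one above
        (void)addr;
        ++refs;
    }
    std::cout << "replayed " << refs << " references\n";
    return 0;
}
```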

    Reducing Waste in Memory Hierarchies

    Get PDF
    Memory hierarchies play an important role in microarchitectural design, bridging the performance gap between modern microprocessors and main memory. However, memory hierarchies are inefficient because they store waste. This dissertation quantifies two types of waste, dead blocks and data redundancy, studies them in diverse memory hierarchies, and proposes techniques that reduce waste to improve performance with limited overhead. It observes that the dead blocks in an inclusive last-level cache are of two kinds: blocks that are highly accessed in the core caches, and blocks that have low temporal locality in both the core caches and the last-level cache. Blindly replacing all dead blocks in an inclusive last-level cache may degrade performance, so the dissertation proposes temporal-based multilevel correlating cache replacement to improve the performance of inclusive cache hierarchies. It also observes that waste exists in the private caches of graphics processing units (GPUs) as zero-reuse blocks, defined as blocks that are dead after being inserted into the cache, and proposes an adaptive GPU cache-bypassing technique that improves performance and reduces power consumption by dynamically bypassing zero-reuse blocks. Finally, the dissertation examines data redundancy at block-level granularity and finds that conventional cache designs waste capacity by storing duplicate data; it quantifies the percentage of data duplication, analyzes its causes, and proposes a practical cache deduplication technique that increases the effectiveness of the cache with limited area and power cost.
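    To make the block-level deduplication idea concrete, here is a generic sketch (not the dissertation's mechanism): incoming 64-byte blocks are hashed, and a hash match lets a new tag share an existing data entry instead of consuming another one. A real design would verify the full contents on a hash match to rule out collisions:

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

using Block = std::array<std::uint8_t, 64>;  // one 64-byte cache line

// FNV-1a content hash used to detect candidate duplicate blocks.
std::uint64_t fnv1a(const Block& b) {
    std::uint64_t h = 1469598103934665603ull;
    for (std::uint8_t byte : b) { h ^= byte; h *= 1099511628211ull; }
    return h;
}

struct DedupDirectory {
    std::unordered_map<std::uint64_t, int> by_hash;  // content hash -> data entry id
    int next_id = 0;
    int duplicates = 0;  // fills that shared an existing data entry

    // Returns the data entry the incoming block occupies or shares.
    int insert(const Block& b) {
        auto [it, fresh] = by_hash.try_emplace(fnv1a(b), next_id);
        if (fresh) return next_id++;
        ++duplicates;  // another tag now references the same data entry
        return it->second;
    }
};
```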