12 research outputs found

    Coordinate Channel-Aware Page Mapping Policy and Memory Scheduling for Reducing Memory Interference Among Multimedia Applications

    Full text link
    "© 2017 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] In a modern multicore system, memory is shared among more and more concurrently running multimedia applications. Therefore, memory contention and interference are more andmore serious, inducing system performance degradation significantly, the performance degradation of each thread differently, unfairness in resource sharing, and priority inversion, even starvation. In this paper, we propose an approach of coordinating channel-aware page mapping policy and memory scheduling (CCPS) to reduce intermultimedia application interference in a memory system. The idea is to map the data of different threads to different channels, together with memory scheduling. The key principles of the policies of page mapping and memory scheduling are: 1) the memory address space, the thread priority, and the load balance; and 2) prioritizing a low-memory request thread, a row-buffer hit access, and an older request. We evaluate the CCPS on a variety of mixed single-thread and multithread benchmarks and system configurations, and we compare them with four previously proposed state-of-the-art interference-reducing policies. Experimental results demonstrate that the CCPS improves the performance while reducing the energy consumption significantly; moreover, the CCPS incurs a much lower hardware overhead than the current existing policies.This work was supported in part by the Qing Lan Project; by the National Science Foundation of China under Grant 61003077, Grant 61100193, and Grant 61401147; and by the Zhejiang Provincial Natural Science Foundation under Grant LQ14F020011.Jia, G.; Han, G.; Li, A.; Lloret, J. (2017). Coordinate Channel-Aware Page Mapping Policy and Memory Scheduling for Reducing Memory Interference Among Multimedia Applications. IEEE Systems Journal. 11(4):2839-2851. https://doi.org/10.1109/JSYST.2015.2430522S2839285111

    Using Virtual Load/Store Queues (VLSQs) to Reduce the Negative Effects of Reordered Memory Instructions

    Get PDF
    The use of large instruction windows coupled with aggressive out-of-order and prefetching capabilities has provided significant improvements in processor performance. In this paper, we quantify the effects of increased out-of-order aggressiveness on a processor's memory ordering/consistency model as well as on an application's cache behavior. We observe that with increasing reorder buffer sizes, fewer than one third of issued memory instructions are executed in actual program order. We show that increasing the reorder buffer size from 80 to 512 entries increases the frequency of memory traps by a factor of six and increases total execution overhead by 10–40%. Additionally, we observe that the reordering of memory instructions increases L1 data cache accesses by 10–60% and L1 data cache misses by 10–20%. These findings reveal that increased out-of-order capability can waste energy in two ways. First, re-fetching and re-executing instructions flushed due to traps requires the fetch, map, and execution units to dissipate energy on work that has already been done. Second, the increase in the number of cache accesses and cache misses needlessly dissipates energy. Both side effects can be traced to the reordering of memory instructions. Thus, to avoid wasting both energy and performance, we propose a virtual load/store queue (VLSQ) within the existing physical load/store queue. The VLSQ reduces the reordering of memory instructions by limiting the number of memory instructions visible to the select and issue logic. We show that VLSQs can reduce trap overhead, cache accesses, and cache misses by as much as 45%, 50%, and 15%, respectively, when compared to traditional load/store queues. We observe that these reductions yield net power savings of 10–50% with performance degradation of only 1–5%.
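    A minimal software model of the VLSQ idea (not the paper's hardware design): the physical queue holds all in-flight memory operations, but the select/issue logic only sees a small window of the oldest entries, which keeps memory issue order close to program order.

```python
# Illustrative model of a virtual load/store queue; names are assumptions.
from collections import deque

class VirtualLSQ:
    def __init__(self, physical_size=64, window=8):
        self.entries = deque()          # oldest entry at the left
        self.physical_size = physical_size
        self.window = window            # size of the virtual (visible) queue

    def insert(self, op):
        if len(self.entries) < self.physical_size:
            self.entries.append(op)
            return True
        return False                    # physical LSQ full: stall dispatch

    def issuable(self):
        # Select logic only sees the oldest `window` ops; younger ready ops
        # must wait, limiting how far memory issue drifts from program order.
        return [op for op in list(self.entries)[:self.window] if op["ready"]]

    def retire(self, op):
        self.entries.remove(op)

lsq = VirtualLSQ(window=4)
for i in range(6):
    lsq.insert({"pc": i, "ready": i % 2 == 0})
print([op["pc"] for op in lsq.issuable()])   # [0, 2]: only visible+ready ops
```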

    Extensions for Base-Delta-Immediate Compression

    Get PDF
    Advisor: Rodolfo Jardim de Azevedo. Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação.

    Abstract: Cache memories have long been used to reduce problems deriving from the memory-processor performance discrepancy: many levels of on-chip cache reduce the average memory latency at the cost of extra die area and power. To decrease the outlay of these extra components, cache compression techniques are used to store compressed data and allow a cache capacity boost. This project introduces extensions to Base-Delta-Immediate Compression: several modifications of the original technique that minimize the number of padding bits in a compression by relaxing the allowed delta sizes for each base and increasing the number of bases. The extensions were tested using ZSim and evaluated against state-of-the-art methods, and the performance results were compared and evaluated to determine the validity of the proposed techniques. We verified an improvement of the original BDI compression factor from 1.37x to 1.58x at an energy increase as low as 27%.

    Master's in Computer Science, funded under CAPES grant 1564395.
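    As a rough illustration of the Base-Delta-Immediate scheme the extensions build on, the sketch below tests whether a line of fixed-size words fits as one base plus small deltas, with near-zero words stored as immediates. The original BDI uses fixed delta sizes and a single non-zero base, which is exactly what the proposed extensions relax; all sizes and names here are illustrative.

```python
# Sketch of the Base-Delta-Immediate compressibility test; parameters are
# illustrative, not the thesis configuration.
def bdi_compressible(words, delta_bytes=1, imm_bytes=1):
    """Return True if every word is either a small immediate or within a
    small signed delta of a single common base."""
    d_max = 1 << (8 * delta_bytes - 1)   # signed delta range
    i_max = 1 << (8 * imm_bytes - 1)     # signed immediate range
    base = next((w for w in words if abs(w) >= i_max), None)
    if base is None:
        return True                      # all words are small immediates
    return all(abs(w) < i_max or abs(w - base) < d_max for w in words)

line = [0x7000, 0x7001, 0x7003, 0, 0x7002, 1, 0x7000, 0x7004]
print(bdi_compressible(line))            # True: one base, 1-byte deltas
```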

    Reducing main memory access latency through SDRAM address mapping techniques and access reordering mechanisms

    Get PDF
    As the performance gap between microprocessors and memory continues to increase, main memory accesses result in long latencies that limit system performance. Previous studies show that main memory access streams contain significant locality and that SDRAM devices provide parallelism through multiple banks and channels. This locality and parallelism have not been thoroughly exploited by conventional memory controllers. In this thesis, SDRAM address mapping techniques and memory access reordering mechanisms are studied and applied to memory controller design with the goal of reducing observed main memory access latency. The proposed bit-reversal address mapping distributes main memory accesses evenly in the SDRAM address space to enable bank parallelism. As memory accesses to distinct banks are interleaved, the access latencies are partially hidden and therefore reduced. By taking cache conflict misses into account, bit-reversal address mapping is also able to direct potential row conflicts to different banks, further improving performance. The proposed burst scheduling is a novel access reordering mechanism that creates bursts by clustering accesses directed to the same rows of the same banks. Subject to a threshold, reads are allowed to preempt writes, and qualified writes are piggybacked at the end of bursts. A sophisticated access scheduler selects accesses based on priorities and interleaves accesses to maximize SDRAM data bus utilization. Consequently, burst scheduling reduces the row conflict rate, increasing and exploiting the available row locality. Using revised SimpleScalar and M5 simulators, both techniques are evaluated and compared with existing academic and industrial solutions. With SPEC CPU2000 benchmarks, bit-reversal reduces execution time by 14% on average over traditional page-interleaving address mapping. Burst scheduling achieves a 15% reduction in execution time over conventional in-order scheduling. Working together, bit-reversal and burst scheduling achieve a 19% speedup across the simulated benchmarks.
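    A small sketch of the bit-reversal idea (field widths are illustrative assumptions, not the thesis configuration): reversing the address-index bits makes the bank number depend on high-order address bits, so addresses that conflict in the cache, and would otherwise collide in one bank, spread across banks.

```python
# Illustrative sketch of bit-reversal SDRAM address mapping.
def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result

BANK_BITS, COL_BITS, INDEX_BITS = 3, 10, 16   # assumed field widths

def bank_conventional(addr):
    """Page-interleaved mapping: bank index from low-order index bits."""
    return (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)

def bank_bit_reversed(addr):
    """Bit-reversed mapping: the bank index now depends on high-order
    address bits, which is where cache-conflicting addresses differ."""
    idx = (addr >> COL_BITS) & ((1 << INDEX_BITS) - 1)
    return bit_reverse(idx, INDEX_BITS) & ((1 << BANK_BITS) - 1)

a, b = 0x00400400, 0x00800400                 # differ only in high-order bits
print(bank_conventional(a), bank_conventional(b))   # 1 1 -> same bank conflict
print(bank_bit_reversed(a), bank_bit_reversed(b))   # 0 4 -> different banks
```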

    Improving Processor Design by Exploiting Performance Variance

    Get PDF
    Programs exhibit significant performance variance in their access to microarchitectural structures, of three types. First, semantically equivalent programs running on the same system can yield different performance due to characteristics of microarchitectural structures. Second, program phase behavior varies significantly. Third, different types of operations on a microarchitectural structure can lead to different performance. In this dissertation, we explore this performance variance and propose techniques to improve processor design. We explore performance variance caused by microarchitectural structures and propose program interferometry, a technique that perturbs benchmark executables to yield a wide variety of performance points without changing program semantics or other important execution characteristics such as the number of retired instructions. By observing the behavior of the benchmarks over a range of branch prediction accuracies, we can estimate the impact of a microarchitectural optimization in isolation from the rest of the microarchitecture. We explore performance variance caused by phase changes and develop prediction-driven last-level cache (LLC) writeback techniques: a rank-idle-time-prediction-driven LLC writeback technique and a last-write-prediction-driven LLC writeback technique. These techniques improve performance by reducing write-induced interference. Finally, we explore performance variance caused by different types of operations to Non-Volatile Memory (NVM) and propose LLC management policies that reduce the write overhead of NVM: an adaptive placement and migration policy for an STT-RAM-based hybrid cache, and writeback-aware dynamic cache management for an NVM-based main memory system. These techniques reduce write latency and write energy, leading to performance improvement and energy reduction.
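    As one concrete example of these ideas, here is a hypothetical sketch of a last-write predictor of the kind the LLC writeback technique relies on: a small confidence table keyed by a write signature decides whether a dirty block has likely received its final write and can be written back early, off the critical path. The 2-bit counters and all names are assumptions, not the dissertation's design.

```python
# Hypothetical last-write predictor sketch; not the dissertation's mechanism.
class LastWritePredictor:
    def __init__(self):
        self.conf = {}                       # write signature -> 2-bit counter

    def train(self, sig, was_last_write):
        """Strengthen confidence when a write turned out to be the block's
        last before eviction; weaken it otherwise."""
        c = self.conf.get(sig, 1)
        self.conf[sig] = min(3, c + 1) if was_last_write else max(0, c - 1)

    def predict_last(self, sig):
        return self.conf.get(sig, 0) >= 2    # confident: eligible for
                                             # eager (early) writeback

pred = LastWritePredictor()
for _ in range(3):
    pred.train(sig=0x42, was_last_write=True)
if pred.predict_last(0x42):
    print("schedule eager writeback for dirty blocks written by 0x42")
```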

    Methodology for cycle-accurate performance analysis of DRAM memories

    Get PDF
    Main memory is one of the key components in a computer system. In modern systems, main memory is almost always implemented using the DRAM type of memory. Memory has a direct impact on the price, power consumption, and performance of the system, and an indirect impact on its internal architecture and organization. That is why a lot of attention is paid to DRAM performance analysis during system development. Accuracy is of utmost importance in the measurement and analysis of DRAM performance. Inaccuracy reduces the reliability of results and conclusions, which can lead to wrong architectural or design decisions and a significant loss of time, effort, and money. General trends in manufacturing technology, computer architecture and organization, and system software development lead toward increasing virtualization and greater utilization of main memory, which exacerbates the problem of accuracy. Considering that there is no technology in sight that could replace DRAM in the near future, the importance of this problem will only increase.

    The main problem in the analysis of DRAM performance is the inability to determine which cycles on the memory bus are busy and which are idle. That makes even basic memory performance parameters, such as utilization or efficiency, difficult to measure. In essence, it is not clear how to answer the fundamental question: "How can DRAM performance be measured with the required level of accuracy?". Even for performance parameters that can be measured from observable signals, such as data bandwidth, there is a problem of interpreting the measured results. Theoretical maximums of performance parameters continually fluctuate and cannot be directly measured, so it is not possible to interpret results by comparing measured and maximum values. It is thus not clear how to answer the following key questions either: "How good or bad are measured performance results?" and "How can results measured in two different time periods or for different workloads be compared?".

    The main goal of the dissertation was to overcome these fundamental problems. As a result, a new theoretical foundation for DRAM performance measurement and analysis was created, along with a methodology that specifies how to conduct the measurement and analysis in practice. The methodology is based on accurate characterization of memory cycles as idle (cycles not used), active (cycles used, whose state is observable), or overhead (cycles that cannot be used due to DRAM protocol constraints, whose state is not observable). The most important components of the proposed methodology are: 1) a functional and timing model of DRAM memory that abstracts DRAM as a generic device in the form of a state machine, parameterized by DRAM device configuration and timing parameters, whose operation can be analyzed with the desired level of accuracy; 2) a DRAM measurement and performance analysis model that enables accurate performance characterization of memory cycles; 3) a DRAM performance metric that represents a new theoretical foundation for DRAM performance measurement and analysis; and 4) a method for estimating the DRAM performance maximum that enables solving the problem of interpreting results. The methodology specifies how these components are defined, constructed, and parameterized according to a particular DRAM implementation, and describes all the relevant processes and procedures that enable measurement and analysis of DRAM performance in real systems.

    The proposed methodology makes fundamental advances over existing solutions. Its most important advantages from the theoretical point of view are: guaranteed maximum accuracy; accurate estimation of the theoretical performance maximum; root-causing of suboptimal DRAM performance; and completeness (it takes into account all DRAM commands and timing parameters). Its most important practical advantages are: system agnosticism (it does not depend on the system that generates the workload); workload agnosticism (it does not depend on the size or type of workload); portability (it can be implemented on any type of system); efficiency (it generates results quickly and with little user effort); and low implementation and verification complexity and cost. The proposed methodology can replace existing solutions in all domains where accuracy and efficiency are of importance. At the same time, it can be applied in completely new domains, such as the analysis of critical scenarios (sequences that occur sporadically but have a tangible impact on performance), transactional analysis (analyzing system operation by following individual transactions), comparative analysis (comparing results from different platforms, for different workloads, or across time periods), etc. The methodology is relatively simple to apply at the engineering level, uses few resources, and is highly efficient. As a result of systematization on firm theoretical grounds, all key problems in the domain of DRAM performance measurement and analysis were solved. This fundamental improvement allows DRAM performance measurement and analysis to be elevated from an engineering art to a scientific method, enabling a quantitative and qualitative leap in computer system analysis.
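    A minimal sketch of the cycle characterization at the heart of the methodology, assuming a toy trace format: each bus cycle is classified as idle, active, or overhead, from which utilization and efficiency follow directly.

```python
# Toy cycle characterization: idle (unused), active (observably transferring
# data), overhead (busy due to DRAM timing constraints, not observable).
def characterize(trace):
    counts = {"idle": 0, "active": 0, "overhead": 0}
    for cycle_state in trace:
        counts[cycle_state] += 1
    total = len(trace)
    utilization = (counts["active"] + counts["overhead"]) / total  # busy share
    efficiency = counts["active"] / total                          # useful share
    return counts, utilization, efficiency

# e.g. data bursts separated by tRP/tRCD-style overhead cycles, then idle time
trace = ["active"] * 4 + ["overhead"] * 3 + ["active"] * 4 + ["idle"] * 5
counts, util, eff = characterize(trace)
print(counts, f"utilization={util:.2f} efficiency={eff:.2f}")
```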

    Iterative Compilation and Performance Prediction for Numerical Applications

    Get PDF
    Institute for Computing Systems Architecture

    As the current rate of improvement in processor performance far exceeds the rate of improvement in memory performance, memory latency is the dominant overhead in many performance-critical applications. In many cases, automatic compiler-based approaches to improving memory performance are limited, and programmers frequently resort to manual optimisation techniques. However, this process is tedious and time-consuming. Furthermore, a diverse range of rapidly evolving hardware makes the optimisation process even more complex. It is often hard to predict the potential benefit of different optimisations, and there are no simple criteria for stopping optimisation, i.e., for deciding when optimal memory performance has been achieved or sufficiently approached. This thesis presents a platform-independent optimisation approach for numerical applications based on iterative feedback-directed program restructuring, using a new, reasonably fast and accurate performance prediction technique for guiding optimisations. New strategies for searching the optimisation space by means of profiling, to find the best possible program variant, have been developed. These strategies have been evaluated using a range of kernels and programs on different platforms and operating systems. A significant performance improvement has been achieved using the new approaches when compared to state-of-the-art native static and platform-specific feedback-directed compilers.
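    The skeleton below illustrates the feedback-directed loop described above, under stand-in cost functions: a cheap predictor prunes the space of program variants (here, tile size and unroll factor), and only a shortlist is actually profiled. It is a sketch of the approach, not the thesis toolchain.

```python
# Skeleton of iterative, prediction-guided optimisation search; the cost
# functions below are stand-ins, not the thesis's models.
import itertools
import random

TILE_SIZES = [8, 16, 32, 64]
UNROLL_FACTORS = [1, 2, 4, 8]

def predict_runtime(tile, unroll):
    """Stand-in for a fast, approximate performance-prediction model."""
    return abs(tile - 32) * 0.8 + abs(unroll - 4) * 1.1 + random.random()

def profile(tile, unroll):
    """Stand-in for compiling and running the restructured program."""
    return abs(tile - 32) + abs(unroll - 4) * 1.5

candidates = list(itertools.product(TILE_SIZES, UNROLL_FACTORS))
# Prediction prunes the space; only the top few variants get profiled.
shortlist = sorted(candidates, key=lambda c: predict_runtime(*c))[:4]
best = min(shortlist, key=lambda c: profile(*c))
print(f"best variant: tile={best[0]}, unroll={best[1]}")
```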

    ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction

    Get PDF
    The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles required to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration impractical. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where migration is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine using a set of high-performance migration mechanisms. An implementation of this architecture can migrate a null thread in 66 cycles -- over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and relies instead on multithreading and the migration mechanism to hide and reduce access latencies.
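    A back-of-the-envelope way to read the numbers above: with null-thread migration at 66 cycles and a typical thread costing 4 to 5 times that, migration pays off once the remote-access latency it eliminates exceeds roughly 300 cycles. The remote latency used below is an illustrative assumption, not an ADAM parameter.

```python
# Worked arithmetic on the migration figures quoted in the abstract.
NULL_THREAD_MIGRATION = 66          # cycles, from the abstract
TYPICAL_THREAD_FACTOR = 5           # typical thread ~ 4-5x a null thread

def should_migrate(expected_remote_accesses, remote_latency=100):
    """Migrate when expected remote-access cycles saved exceed the
    migration cost. remote_latency is an assumed value."""
    migration_cost = NULL_THREAD_MIGRATION * TYPICAL_THREAD_FACTOR  # ~330
    saved = expected_remote_accesses * remote_latency
    return saved > migration_cost

print(should_migrate(3))    # False: 300 cycles saved < ~330-cycle cost
print(should_migrate(10))   # True: 1000 cycles saved
```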