26 research outputs found
Vote the OS off your Core
Recent trends in OS research have shown evidence that there are performance benefits to running OS services on different cores than the user applications that rely on them. We quantitatively evaluate this claim in terms of one of the most significant architectural constraints: memory performance. To this end, we have created CachEMU, an open-source memory trace generator and cache simulator, built as an extension to QEMU for working with system traces. Using CachEMU, we determined that for five common Linux test workloads, it was best to run the OS close, but not too close: on the same package, but not on the same core.
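A trace-driven cache simulator of the kind CachEMU implements can be sketched in a few lines. This is an illustrative toy (direct-mapped, single level, invented parameters), not CachEMU's actual code or API:

```python
# Minimal direct-mapped cache simulator over a memory address trace,
# in the spirit of what a CachEMU-like tool computes. Sizes and the
# function name are assumptions for illustration only.

def simulate_cache(trace, cache_bytes=32 * 1024, line_bytes=64):
    """Return (hits, misses) for a direct-mapped cache fed the address trace."""
    n_lines = cache_bytes // line_bytes
    tags = [None] * n_lines          # one stored tag per cache slot
    hits = misses = 0
    for addr in trace:
        line = addr // line_bytes    # cache line the address falls into
        index = line % n_lines       # direct-mapped slot
        if tags[index] == line:
            hits += 1
        else:
            misses += 1
            tags[index] = line       # fill on miss
    return hits, misses

# Two passes over a 4 KiB array with 8-byte accesses: the first pass
# misses once per 64-byte line, the second pass hits entirely.
trace = [base for base in range(0, 4096, 8)] * 2
hits, misses = simulate_cache(trace)
```

Feeding such a simulator a trace of OS-only versus application-only accesses is how per-core cache behaviour can be compared.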
Estimating the Data Rates Required for Wireless Communication Between CPU Cores
In this paper, the principal architecture of a general-purpose CPU and its main components are discussed, the evolution of CPUs is reviewed, and the drawbacks that prevent further CPU development are noted. Solutions proposed so far are then addressed, and a new CPU architecture is introduced. The proposed architecture is based on wireless cache access, enabling reliable interaction between the cores of a multicore CPU over the terahertz band (0.1-10 THz). The presented architecture addresses the scalability problem of existing processors and may allow scaling them to tens of cores. Since an in-depth analysis of the applicability of the suggested architecture requires accurate prediction of traffic in current and next-generation processors, we consider a set of approaches for traffic estimation in modern CPUs, discussing their benefits and drawbacks. The authors identify traffic measurement with existing software tools as the most promising approach and use the Intel Performance Counter Monitor for this purpose. Three types of CPU load are considered: two artificial tests and background system load. For each load type, the amount of data transmitted across the L2-L3 interface is reported, along with its dependence on the number of active cores and the operating frequency.
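The counter-based traffic estimate described above reduces to a back-of-envelope calculation: each L2 miss moves one cache line across the L2-L3 interface, so per-core miss counts translate directly into bytes of traffic. The miss counts below are invented inputs; in practice they would come from a tool such as Intel PCM:

```python
# Back-of-envelope L2<->L3 traffic estimate from per-core L2 miss counts.
# Counter values here are hypothetical, not measured data.

LINE_BYTES = 64  # cache line size on mainstream x86 parts

def l2_l3_traffic_bytes(l2_misses_per_core):
    """Each L2 miss transfers one cache line across the L2-L3 interface."""
    return {core: misses * LINE_BYTES
            for core, misses in l2_misses_per_core.items()}

counts = {0: 1_000_000, 1: 250_000}      # hypothetical per-core miss counters
per_core = l2_l3_traffic_bytes(counts)
total_mb = sum(per_core.values()) / 1e6  # aggregate traffic in MB
```

Scaling such totals by core count and sampling interval gives the required per-link data rate a wireless interconnect would have to sustain.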
Parallelism-Aware Memory Interference Delay Analysis for COTS Multicore Systems
In modern Commercial Off-The-Shelf (COTS) multicore systems, each core can
generate many parallel memory requests at a time. The processing of these
parallel requests in the DRAM controller greatly affects the memory
interference delay experienced by running tasks on the platform. In this paper,
we model a modern COTS multicore system which has a nonblocking last-level
cache (LLC) and a DRAM controller that prioritizes reads over writes. To
minimize interference, we focus on LLC and DRAM bank partitioned systems. Based
on the model, we propose an analysis that computes a safe upper bound for the
worst-case memory interference delay. We validated our analysis on a real COTS
multicore platform with a set of carefully designed synthetic benchmarks as
well as SPEC2006 benchmarks. Evaluation results show that our analysis more
accurately captures the worst-case memory interference delay and provides safer
upper bounds than a recently proposed analysis, which significantly
underestimates the delay.
Comment: Technical Report
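The shape of such a bound can be illustrated with a much-simplified model (not the paper's analysis): each of a task's memory requests may wait behind up to the maximum number of parallel requests every other core can have in flight. All constants below are hypothetical:

```python
# Greatly simplified flavor of a parallelism-aware interference bound.
# The real analysis models read/write prioritization, bank partitioning,
# and DRAM timing; this sketch only shows the multiplicative structure.

def interference_bound_ns(own_requests, other_cores, per_request_ns,
                          mshrs_per_core):
    """Upper-bound extra delay: every own request may wait behind up to
    `mshrs_per_core` outstanding requests from each contending core."""
    per_request_delay = other_cores * mshrs_per_core * per_request_ns
    return own_requests * per_request_delay

# Hypothetical task: 1000 misses, 3 contending cores, 50 ns service time,
# 10 miss-status-holding registers (parallel requests) per core.
bound = interference_bound_ns(own_requests=1000, other_cores=3,
                              per_request_ns=50, mshrs_per_core=10)
```

The key point the paper makes is that ignoring the `mshrs_per_core` factor (i.e., assuming one request per core at a time) yields an unsafe, too-small bound.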
Implementing a hybrid SRAM / eDRAM NUCA architecture
In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is never accessed again before being replaced.
Preprint
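The locality observation above, that many lines are never touched again between fill and eviction, can be measured with a small LRU model. The fully associative cache and sizes below are illustrative, not the paper's NUCA configuration:

```python
from collections import OrderedDict

# Measures the fraction of cache lines evicted without ever being
# re-accessed after their initial fill. Fully associative LRU with a
# tiny capacity, purely for illustration.

def dead_on_eviction_fraction(trace, capacity_lines=4, line_bytes=64):
    cache = OrderedDict()            # line -> "re-accessed" flag, LRU order
    evictions = dead = 0
    for addr in trace:
        line = addr // line_bytes
        if line in cache:
            cache[line] = True       # line proved useful after its fill
            cache.move_to_end(line)
        else:
            if len(cache) == capacity_lines:
                _, reused = cache.popitem(last=False)  # evict LRU line
                evictions += 1
                dead += not reused
            cache[line] = False
    return dead / evictions if evictions else 0.0

# Streaming access pattern: every line is evicted before any reuse.
frac = dead_on_eviction_fraction(range(0, 64 * 100, 64))
```

A high dead-on-eviction fraction is what justifies steering such lines to the dense, slower eDRAM portion of a hybrid cache.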
Simple, Parallel, High-Performance Virtual Machines for Extreme Computations
We introduce a high-performance virtual machine (VM) written in a numerically
fast language like Fortran or C to evaluate very large expressions. We discuss
the general concept of how to perform computations in terms of a VM and present
specifically a VM that is able to compute tree-level cross sections for any
number of external legs, given the corresponding byte code from the optimal
matrix element generator, O'Mega. Furthermore, this approach allows us to
formulate the parallel computation of a single phase space point in a simple
and obvious way. We analyze the scaling behaviour with multiple threads
as well as the benefits and drawbacks introduced by this method. Our
implementation of a VM can run faster than the corresponding native, compiled
code for certain processes and compilers, especially for very high
multiplicities, and generally has runtimes of the same order of magnitude. By
avoiding the tedious compile and link steps, which may fail for source code
files of gigabyte size, new processes or complex higher-order corrections that
are currently out of reach could be evaluated with a VM given enough computing
power.
Comment: 19 pages, 8 figures
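The VM concept can be illustrated with a toy stack machine that evaluates arithmetic byte code. The opcodes below are invented and bear no relation to O'Mega's actual byte-code format, and the paper's VM is written in a fast language like Fortran or C rather than Python:

```python
# Toy stack-based virtual machine: byte code is a list of
# (opcode, argument) pairs evaluated against an operand stack.

PUSH, ADD, MUL = range(3)   # invented opcode set for illustration

def run_vm(code):
    stack = []
    for op, arg in code:
        if op == PUSH:
            stack.append(arg)
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == MUL:
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4 expressed as byte code:
program = [(PUSH, 2), (PUSH, 3), (ADD, None), (PUSH, 4), (MUL, None)]
result = run_vm(program)
```

Because the expression lives in data rather than in compiled source, a gigabyte-scale expression never has to pass through a compiler or linker, which is the paper's central advantage.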
nanoBench: A Low-Overhead Tool for Running Microbenchmarks on x86 Systems
We present nanoBench, a tool for evaluating small microbenchmarks using
hardware performance counters on Intel and AMD x86 systems. Most existing tools
and libraries are intended to either benchmark entire programs, or program
segments in the context of their execution within a larger program. In
contrast, nanoBench is specifically designed to evaluate small, isolated pieces
of code. Such code is common in microbenchmark-based hardware analysis
techniques.
Unlike previous tools, nanoBench can execute microbenchmarks directly in
kernel space. This makes it possible to benchmark privileged instructions, and
it enables more accurate measurements. The reading of the performance counters
is implemented with minimal overhead, avoiding function calls and branches. As a
consequence, nanoBench is precise enough to measure individual memory accesses.
We illustrate the utility of nanoBench by means of two case studies.
First, we briefly discuss how nanoBench has been used to determine the latency,
throughput, and port usage of more than 13,000 instruction variants on recent
x86 processors. Second, we show how to generate microbenchmarks to precisely
characterize the cache architectures of eleven Intel Core microarchitectures.
This includes the most comprehensive analysis of the employed cache replacement
policies to date.
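The cache-characterization idea can be imitated in user space with a pointer-chasing loop: each load depends on the previous one, so the average step time tracks the latency of the cache level the working set fits in. nanoBench does this with hardware performance counters in kernel space; a Python timing loop is far noisier and only illustrates the dependent-load access pattern:

```python
import random
import time

# Chase a random cyclic permutation so every load depends on the
# previous one, then report the average time per dependent access.
# User-space Python timing is only a rough analogue of what a
# counter-based tool like nanoBench measures precisely.

def pointer_chase_ns(n_slots, steps=100_000, seed=0):
    rng = random.Random(seed)
    perm = list(range(n_slots))
    rng.shuffle(perm)
    next_idx = [0] * n_slots
    cur = perm[0]
    for nxt in perm[1:] + perm[:1]:   # link all slots into one cycle
        next_idx[cur] = nxt
        cur = nxt
    i = 0
    start = time.perf_counter_ns()
    for _ in range(steps):
        i = next_idx[i]               # serialized, dependent accesses
    return (time.perf_counter_ns() - start) / steps

small = pointer_chase_ns(1 << 10)     # small working set
```

Sweeping `n_slots` across cache-size boundaries and looking for jumps in the per-step time is the standard way such microbenchmarks expose cache capacities and replacement behaviour.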
Reducing cache coherence traffic with a NUMA-aware runtime approach
Cache Coherent NUMA (ccNUMA) architectures are a widespread paradigm due to the benefits they provide for scaling core count and memory capacity. Also, the flat memory address space they offer considerably improves programmability. However, ccNUMA architectures require sophisticated and expensive cache coherence protocols to enforce correctness during parallel executions, which trigger a significant amount of on- and off-chip traffic in the system. This paper analyses how coherence traffic may be best constrained in a large, real ccNUMA platform comprising 288 cores through the use of a joint hardware/software approach. For several benchmarks, we study coherence traffic in detail under the influence of an added hierarchical cache layer in the directory protocol combined with runtime-managed NUMA-aware scheduling and data allocation techniques to make the most efficient use of the added hardware. The effectiveness of this joint approach is demonstrated by speedups of 3.14x to 9.97x and coherence traffic reductions of up to 99% in comparison to NUMA-oblivious scheduling and data allocation.
This work has been supported by the Spanish Government (Severo Ochoa grant SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253), and by the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement no. 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (contract 2013 BP B 00243).
Peer Reviewed. Postprint (author's final draft)
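The benefit of NUMA-aware placement can be illustrated by comparing a locality-first scheduler against a NUMA-oblivious round-robin one: every task placed away from its data incurs remote, coherence-traffic-inducing accesses. The task/data layout below is invented for illustration:

```python
# Sketch: NUMA-aware vs. NUMA-oblivious task placement.
# tasks: list of (task_id, node_holding_its_data) pairs (hypothetical).

def numa_aware(tasks):
    """Place every task on the NUMA node that holds its data."""
    return dict(tasks)

def remote_accesses(placement, tasks):
    """Count tasks placed away from their data (remote traffic sources)."""
    return sum(placement[tid] != node for tid, node in tasks)

tasks = [("t0", 0), ("t1", 0), ("t2", 1), ("t3", 1)]
aware = numa_aware(tasks)
oblivious = {tid: i % 2 for i, (tid, _) in enumerate(tasks)}  # round-robin
```

A real runtime must also balance load across nodes, which is why the paper combines scheduling with data allocation rather than using locality alone.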
Cooperative cache scrubbing
Managing the limited resources of power and memory bandwidth while improving performance on multicore hardware is challenging. In particular, more cores demand more memory bandwidth, and multi-threaded applications increasingly stress memory systems, leading to more energy consumption. However, we demonstrate that not all memory traffic is necessary. For modern Java programs, 10 to 60% of DRAM writes are useless, because the data on these lines are dead: the program is guaranteed to never read them again. Furthermore, reading memory only to immediately zero initialize it wastes bandwidth. We propose a software/hardware cooperative solution: the memory manager communicates dead and zero lines with cache scrubbing instructions. We show how scrubbing instructions satisfy MESI cache coherence protocol invariants and demonstrate them in a Java Virtual Machine and multicore simulator. Scrubbing reduces average DRAM traffic by 59%, total DRAM energy by 14%, and dynamic DRAM energy by 57% on a range of configurations. Cooperative software/hardware cache scrubbing reduces memory bandwidth and improves energy efficiency, two critical problems in modern systems.
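The software/hardware cooperation can be sketched as a cache that accepts dead-line hints from the memory manager and drops those dirty lines instead of writing them back. The class and method names are invented for illustration; the real mechanism operates on MESI state in hardware:

```python
# Sketch of cooperative cache scrubbing: the allocator tells the cache
# which lines are dead, so their dirty copies never reach DRAM.

class ScrubbableCache:
    def __init__(self):
        self.dirty = set()        # lines holding modified data
        self.writebacks = 0       # DRAM write traffic actually issued

    def store(self, line):
        self.dirty.add(line)      # a store dirties the line

    def scrub(self, line):
        """Dead-line hint from the memory manager: discard, no writeback."""
        self.dirty.discard(line)

    def evict_all(self):
        self.writebacks += len(self.dirty)   # only live dirty lines written
        self.dirty.clear()

cache = ScrubbableCache()
for line in range(10):
    cache.store(line)
for line in range(6):             # allocator frees objects on lines 0-5
    cache.scrub(line)
cache.evict_all()
```

Here 6 of 10 dirty lines are scrubbed, so only 4 writebacks reach DRAM, mirroring the useless-write fractions the paper reports for Java heaps.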
Spatial Locality Speculation to Reduce Energy in Chip-Multiprocessor Networks-on-Chip
As processor chips become increasingly parallel, an efficient communication substrate is critical for meeting performance and energy targets. In this work, we target the root cause of network energy consumption through techniques that reduce link- and router-level switching activity. We specifically focus on memory subsystem traffic, as it comprises the bulk of NoC load in a CMP. By transmitting only the flits that contain words predicted to be useful by a novel spatial locality predictor, our scheme seeks to reduce network activity. We aim to further lower NoC energy through microarchitectural mechanisms that inhibit datapath switching activity for unused words in individual flits. Using simulation-based performance studies and detailed energy models based on synthesized router designs and different link wire types, we show that 1) the prediction mechanism achieves very high accuracy, with an average false-unused prediction rate of just 2.5 percent; 2) the combined NoC energy savings enabled by the predictor and microarchitectural support are 36 percent on average, and up to 57 percent in the best case; and 3) there is no system performance penalty as a result of this technique.
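A word-granularity predictor of the kind described can be sketched as a table of used-word bitmasks indexed by the requesting instruction's PC: on a miss, fetch only the words the same PC used last time, defaulting to the full block when no history exists. The PC indexing and table layout are simplifications, not the paper's design:

```python
# Sketch of a spatial locality predictor: per-PC bitmask of which words
# in a cache block were actually used, consulted to decide which flits
# (word payloads) to transmit on the next miss from that PC.

WORDS_PER_BLOCK = 8   # illustrative block geometry

class SpatialPredictor:
    def __init__(self):
        self.table = {}                       # pc -> last observed word mask

    def predict(self, pc):
        """Bitmask of words to transmit; default to the whole block."""
        return self.table.get(pc, (1 << WORDS_PER_BLOCK) - 1)

    def train(self, pc, used_mask):
        """Record which words the block's consumer actually touched."""
        self.table[pc] = used_mask

pred = SpatialPredictor()
pred.train(pc=0x400, used_mask=0b00000011)    # only words 0 and 1 were used
full = pred.predict(pc=0x500)                 # untrained PC: send all 8 words
narrow = pred.predict(pc=0x400)               # trained PC: send 2 words
```

A false-unused prediction (a word withheld but later needed) costs an extra fetch, which is why the paper's low 2.5 percent false-unused rate matters for avoiding a performance penalty.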