
    Hardware schemes for early register release

    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since the instruction-level parallelism they can exploit is closely related to their size and number of ports. In conventional register renaming schemes, a register is conservatively released only after the instruction that redefines it has committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early-release hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
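
    The core condition can be summarized with a minimal sketch (ours, not the paper's hardware design): under conventional renaming a physical register waits for the commit of its redefining instruction, whereas early release frees it once it has been redefined and its last in-flight reader has consumed the value.

        # Minimal sketch of the early-release condition; names are illustrative.
        class PhysReg:
            def __init__(self):
                self.pending_readers = 0   # in-flight instructions still needing the value
                self.redefined = False     # a later instruction has renamed the same
                                           # architectural register

            def can_release_early(self) -> bool:
                # No further use is possible: the mapping has been superseded and
                # every in-flight reader has already obtained the value.
                return self.redefined and self.pending_readers == 0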

    A generic framework to integrate data caches in the WCET analysis of real-time systems

    Worst-case execution time (WCET) analysis of systems with data caches is one of the key challenges in real-time systems. Caches exploit the inherent reuse properties of programs by temporarily storing certain memory contents near the processor, so that further accesses to such contents do not require costly memory transfers. Current worst-case data cache analysis methods focus on specific cache organizations (set-associative LRU, locked, ACDC, etc.), most of the time adapting techniques designed to analyze instruction caches. On the other hand, there are methodologies to analyze the data reuse of a program independently of the data cache. In this paper we propose a generic WCET analysis framework that analyzes data caches by taking advantage of such reuse information. It includes the categorization of data references and their integration in an IPET model. We apply it to a conventional LRU cache, an ACDC, and other baseline systems, and compare them using the TACLeBench benchmark suite. Our results show that persistence-based LRU analyses discard essential information about data, and that a reuse-based analysis improves the WCET bound by around 17% on average. In general, the best WCET estimations are obtained with optimization level 2, where the ACDC cache performs 39% better than a set-associative LRU.
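
    For context, the implicit path enumeration technique (IPET) into which the reference categorizations are integrated is usually stated as an integer linear program; in its standard first-order form (our notation, not necessarily the paper's):

        \mathrm{WCET} \;\le\; \max \sum_{i \in B} c_i \, x_i
        \qquad \text{s.t.} \qquad
        x_i \;=\; \sum_{e \in \mathrm{in}(i)} f_e \;=\; \sum_{e \in \mathrm{out}(i)} f_e

    where $x_i$ counts executions of basic block $i$, $c_i$ is its worst-case cost, and the edge frequencies $f_e$ are constrained by program structure and loop bounds. Data-cache categorizations refine $c_i$ by deciding which accesses may safely be charged as hits.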

    A comparison of cache hierarchies for SMT processors

    In the multithreaded and multicore era, programs are forced to share part of the processor structures. On one hand, the state of the art in multithreading describes how to efficiently manage and distribute inner resources such as the reorder buffer or the issue window. On the other hand, there is a substantial body of work focused on outer resources, mainly on how to effectively share last-level caches in multicores. Between these ends, first- and second-level caches have remained apart, even though they are shared in most commercial multithreaded processors. This work analyzes multiprogrammed workloads as the worst-case scenario for cache sharing among threads. In order to obtain representative results, we present a sampling-based methodology that, for multiple metrics such as STP, ANTT, IPC throughput, or fairness, reduces simulation time by up to four orders of magnitude when running 8-thread workloads, with an error lower than 3% and a confidence level of 97%. With this methodology, we compare several state-of-the-art cache hierarchies and observe that Light NUCA provides performance benefits in SMT processors regardless of the organization of the last-level cache. Most importantly, the Light NUCA gains are consistent across the whole range of simulated thread counts, from one to eight.
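
    The multiprogram metrics named above have standard definitions, which we assume here: system throughput (STP) accumulates per-thread progress, while average normalized turnaround time (ANTT) averages per-thread slowdown,

        \mathrm{STP} = \sum_{i=1}^{n} \frac{\mathrm{IPC}_i^{\,\mathrm{MT}}}{\mathrm{IPC}_i^{\,\mathrm{ST}}},
        \qquad
        \mathrm{ANTT} = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{IPC}_i^{\,\mathrm{ST}}}{\mathrm{IPC}_i^{\,\mathrm{MT}}}

    where $\mathrm{IPC}_i^{\,\mathrm{ST}}$ and $\mathrm{IPC}_i^{\,\mathrm{MT}}$ are thread $i$'s IPC running alone and within the multithreaded mix, respectively.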

    Late allocation and early release of physical registers

    The register file is one of the critical components of current processors in terms of access time and power consumption. Among other things, the potential to exploit instruction-level parallelism is closely related to the size and number of ports of the register file. In conventional register renaming schemes, both register allocation and release are done conservatively: the former at the rename stage, before registers are loaded with values, and the latter at the commit stage of the instruction redefining the same register, once the register can no longer be used. We introduce VP-LAER, a renaming scheme that allocates registers later and releases them earlier than conventional schemes. Specifically, physical registers are allocated at the end of the execution stage and released as soon as the processor realizes that there will be no further use of them. VP-LAER enhances register utilization, that is, the fraction of allocated registers holding a value still to be read in the future. Detailed cycle-level simulations show either a significant speedup for a given register file size or a reduction in register file size for a given performance level, especially for floating-point codes, where register file pressure is usually high.
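
    A minimal sketch of the late-allocation side (our simplification, not the paper's microarchitecture): rename can bind a cheap virtual tag so dependences are tracked as usual, while a physical register is consumed only when the value is actually produced.

        # Illustrative only: virtual tags at rename; a physical register is taken
        # from the free list only at writeback, when the value actually exists.
        free_physical = list(range(64))      # hypothetical 64-entry physical file
        register_file = {}                   # physical register -> value
        virt_to_phys = {}                    # resolved mappings for later readers
        next_virtual = 0

        def rename_dest():
            global next_virtual
            next_virtual += 1                # dependences tracked through the virtual tag;
            return next_virtual              # no physical register consumed yet

        def writeback(virt_tag, value):
            preg = free_physical.pop()       # late allocation happens here
            register_file[preg] = value
            virt_to_phys[virt_tag] = preg    # readers can now locate the physical copy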

    Balancer: bandwidth allocation and cache partitioning for multicore processors

    The management of shared resources in multicore processors is an open problem due to the continuous evolution of these systems. The trend toward increasing the number of cores and organizing them in clusters sets out new challenges not considered in previous work. In this paper, we characterize the use of the shared cache and memory bandwidth of an AMD Rome processor executing multiprogrammed workloads and propose Balancer, a set of mechanisms that control the use of these resources to improve system performance and fairness. Our control mechanisms require no hardware or operating system modifications. We evaluate Balancer on a real system running SPEC CPU2006 and CPU2017 applications. Balancer tuned for performance shows an average increase of 7.1% in system performance and an unfairness reduction of 18.6% with respect to a system without any control mechanism. Balancer tuned for fairness decreases performance by 1.3% in exchange for a 64.5% reduction in unfairness.
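
    Since the mechanisms require no hardware or OS modifications, the kind of knob involved can be illustrated with the standard Linux resctrl interface, which exposes cache-way masks and memory-bandwidth throttling on recent AMD processors. This is a hedged sketch, not Balancer's actual code; the group name, masks, and PID are illustrative.

        import os

        GROUP = "/sys/fs/resctrl/noisy"            # hypothetical control group
        os.makedirs(GROUP, exist_ok=True)
        with open(os.path.join(GROUP, "schemata"), "w") as f:
            f.write("L3:0=0x00ff\n"                # confine the group to half the L3 ways
                    "MB:0=50\n")                   # cap its memory bandwidth at ~50%
        with open(os.path.join(GROUP, "tasks"), "w") as f:
            f.write("1234\n")                      # PID of the application to throttle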

    Filtering directory lookups in CMPs

    Coherence protocols consume a significant fraction of power to determine which coherence action should take place. In this paper we focus on CMPs with a shared cache and a directory-based coherence protocol implemented as a duplicate of the local cache tags. We observe that a large fraction of directory lookups produce a miss, since the block being looked up is not cached in any local cache. We propose to add a filter before the directory lookup in order to reduce the number of lookups to this structure. The filter identifies whether the current block was last accessed as data or as an instruction. With this information, looking up the whole directory can be avoided for most accesses. We evaluate the filter in a CMP with 8 in-order processors with 4 threads each and a memory hierarchy with a shared L2 cache. We show that a filter with a size of 3% of the tag array of the shared cache can avoid more than 70% of all comparisons performed by directory lookups, with a performance loss of just 0.2% for SPLASH-2 and 1.5% for SPECweb2005. On average, the number of 15-bit comparisons avoided per cycle is 54 out of 77 for SPLASH-2 and 29 out of 41 for SPECweb2005. In both cases, the filter requires less than one 1-bit read per cycle.
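
    A minimal sketch of the filtering idea (ours, not the paper's hardware): one bit per filter entry records whether a block was last touched as data or as an instruction, so the duplicate-tag lookup for the other kind of access can often be skipped.

        # Illustrative filter model; sizing and indexing are assumptions.
        class DirectoryFilter:
            def __init__(self, entries):
                self.last_was_instr = [False] * entries

            def _idx(self, block_addr):
                return block_addr % len(self.last_was_instr)

            def record(self, block_addr, is_instr):
                self.last_was_instr[self._idx(block_addr)] = is_instr

            def maybe_skip_lookup(self, block_addr, is_instr_access):
                # Skip the full duplicate-tag lookup when the filter says the block
                # was last accessed as the other type of reference.
                return self.last_was_instr[self._idx(block_addr)] != is_instr_access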

    Scaling a convolutional neural network for classification of adjective noun pairs with TensorFlow on GPU clusters

    Deep neural networks have gained popularity in recent years, obtaining outstanding results in a wide range of applications such as computer vision, in both academia and multiple industry areas. The progress made in recent years cannot be understood without taking into account the technological advancements in key domains such as High Performance Computing, more specifically in the Graphics Processing Unit (GPU) domain. These deep neural networks need massive amounts of data to effectively train their millions of parameters, and training can take days or weeks depending on the hardware used. In this work, we present how the training of a deep neural network can be parallelized on a distributed GPU cluster. The effect of distributing the training process is addressed from two points of view. First, the scalability of the task and its performance in the distributed setting are analyzed. Second, the impact of distributed training methods on the training time and final accuracy of the models is studied. We used TensorFlow on the GPU cluster at the Barcelona Supercomputing Center (BSC), whose servers have two K80 GPU cards each. The results show an improvement in both areas. On one hand, the experiments show promising results for training a neural network faster, decreasing the training time from 106 hours to 16 hours. On the other hand, increasing the number of GPUs in one node raises the throughput (images per second) in a near-linear way; moreover, an additional distributed speedup of 10.3 is achieved with 16 nodes, taking the speedup of one node as baseline. This work is partially supported by the Spanish Ministry of Economy and Competitivity under contract TIN2012-34557, by the BSC-CNS Severo Ochoa program (SEV-2011-00067), by the SGR programmes (2014-SGR-1051 and 2014-SGR-1421) of the Catalan Government, and by the project BigGraph TEC2013-43935-R, funded by the Spanish Ministerio de Economia y Competitividad and the European Regional Development Fund (ERDF). We would also like to thank the technical support team at the Barcelona Supercomputing Center (BSC), especially Carlos Tripiana.
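
    As a present-day illustration only (the paper predates this API and would have used TensorFlow's earlier distributed runtime, so this is not the authors' code), synchronous data parallelism across the GPUs of one node can be sketched as follows; the model and class count are placeholders, not the adjective-noun-pair network.

        import tensorflow as tf

        strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU
        with strategy.scope():
            model = tf.keras.applications.ResNet50(weights=None, classes=1000)
            model.compile(optimizer="sgd",
                          loss="sparse_categorical_crossentropy")
        # Each replica processes global_batch / num_replicas samples per step,
        # and gradients are all-reduced before every weight update, which is
        # what produces the near-linear per-node throughput scaling described above.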

    Instruction characterization in cloud applications

    Market trends indicate that the business of processors for large data centers will keep growing, driven by the economics of virtualization and the deep corporate and social penetration of applications that reside in the cloud (cloud computing). Designing a future processor adapted to this market requires experimenting with an appropriate workload. For this reason, this project focuses on characterizing the instruction-cache behavior of a four-processor system, using the CloudSuite 2.0 application suite from the PARSA research laboratory, which is representative of cloud computing. We used the Simics simulation platform, a full-system simulator, working with the five CloudSuite applications that come with public checkpoints. In addition, we contributed a Simics tutorial, together with hands-on material, to ease and speed up the training phase of other projects that also use this platform. To run the desired experiments, two Simics memory-hierarchy modules were programmed based on the g-cache module, implementing two efficient, purpose-built algorithms to record miss ratios and memory footprints: one obtains results for multiple caches in a single simulation, and the other is specialized for fully associative caches. From these experiments we analyzed the miss ratio of the benchmarks as a function of cache size and associativity, suggesting practical size and associativity configurations for each application. We also examined the instruction memory footprint over time, concluding that every application takes many seconds to reach a steady state and that the appearance of several phases complicates the selection of simulation windows. Finally, we computed the aggregate instruction bandwidth of the four simulated processors, concluding that the pressure on the next level of the hierarchy can be quite high, and suggesting second-level configurations with enough capacity to absorb the demands of the first level.
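
    The two algorithms mentioned have a well-known single-pass counterpart: for fully associative LRU caches, the classic idea is Mattson's stack algorithm, which evaluates every capacity in one simulation. A hedged sketch of that idea (not the project's g-cache module code):

        from collections import Counter

        def stack_distances(trace):
            stack, hist = [], Counter()      # LRU stack and distance histogram
            for block in trace:
                if block in stack:
                    d = stack.index(block)   # 0 = most recently used
                    stack.remove(block)
                    hist[d] += 1             # hit in any cache larger than d blocks
                else:
                    hist["miss"] += 1        # cold miss at every capacity
                stack.insert(0, block)
            return hist

    A cache of C blocks misses exactly on the cold misses plus the references whose stack distance is C or larger, so a single pass over the trace yields the miss ratio for all sizes at once.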

    Applying the SERT method to analyze computer energy efficiency while varying processor voltage and frequency

    The accelerating digitalization taking place worldwide has led to growing interest in optimizing the energy efficiency of computer systems, which poses the complex challenge of quantifying that efficiency. For this reason, recent years have seen great strides in the development of benchmarks able to score a computer system by its energy efficiency under a typical workload. The SERT suite from the SPEC consortium is one of the most widely recognized tools, to the point of having recently been adopted by the United States Environmental Protection Agency (EPA) for the Energy Star server certification program (Version 3.0, ENERGY STAR Computer Server Specification, June 2019). This work presents an experimental energy-efficiency study on an Intel Skylake-X platform, using the i7 7800X processor on an ASUS Rampage VI Extreme Omega board, selected for how easily its frequencies and supply voltages can be changed. First, system stability tests were run, followed by a characterization of the power consumed by the processor as supply voltage, frequency, and temperature vary. Particular attention was paid to temperature, a variable that is hard to control and is underestimated in other experimental studies. The results are discussed in detail, along with the anomalies with respect to theoretical CMOS power-consumption models, and both physical and microarchitectural explanations are proposed for those anomalies. Next, the energy efficiency of the platform was analyzed with the SERT suite over a set of supply-voltage and frequency combinations, including the processor's stock frequency as well as overclocking and undervolting configurations. The results are discussed in terms of the best configurations: first the best configuration for a balanced use of CPU, memory, and storage, and then the optimal configurations for workloads centered on each of those three components. Finally, an alternative methodology is proposed to measure efficiency under a CPU workload far more intense than the one imposed by the SERT suite, together with an analysis that applies this method to the same set of configurations used with the SERT method, seeking the highest energy efficiency under a truly CPU-intensive workload.
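
    The theoretical CMOS consumption model against which those anomalies are judged is, in its usual first-order form (our notation):

        P \approx \underbrace{\alpha \, C \, V^{2} f}_{\text{dynamic}}
                  + \underbrace{V \, I_{\mathrm{leak}}(V, T)}_{\text{static}}

    where $\alpha$ is the activity factor, $C$ the switched capacitance, $V$ the supply voltage, and $f$ the clock frequency; the leakage current $I_{\mathrm{leak}}$ grows with both voltage and temperature, which is why controlling temperature matters so much in these measurements.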

    Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP

    SPEC CPU is one of the most common benchmark suites used in computer architecture research. CPU2017 has recently been released to replace CPU2006. In this paper we present a detailed evaluation of the memory hierarchy performance of both the CPU2006 and the single-threaded CPU2017 benchmarks. The experiments were executed on an Intel Xeon Skylake-SP, the first Intel processor to implement a mostly non-inclusive last-level cache (LLC). We present a classification of the benchmarks according to their memory pressure and analyze the performance impact of different LLC sizes. We also test all the hardware prefetchers, showing that they improve performance in most of the benchmarks. After comprehensive experimentation, we can highlight the following conclusions: i) almost half of the SPEC CPU benchmarks have very low miss ratios in the second- and third-level caches, even with small LLC sizes and without hardware prefetching; ii) overall, the SPEC CPU2017 benchmarks demand even fewer memory hierarchy resources than the SPEC CPU2006 ones; iii) hardware prefetching is very effective in reducing LLC misses for most benchmarks, even with the smallest LLC size; and iv) from the memory hierarchy standpoint, the methodologies commonly used to select benchmarks or simulation points do not guarantee representative workloads.
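
    Characterizations like this one rest on hardware performance counters. A hedged sketch of the kind of measurement involved (not the paper's methodology; perf event aliases vary across processors and tool versions):

        import subprocess

        def llc_miss_ratio(cmd):
            # Run a benchmark under Linux perf and parse the CSV output
            # (-x,) that perf stat writes to stderr: value,unit,event,...
            events = ["LLC-loads", "LLC-load-misses"]   # generic perf aliases
            res = subprocess.run(
                ["perf", "stat", "-x", ",", "-e", ",".join(events), "--"] + cmd,
                capture_output=True, text=True)
            counts = {}
            for line in res.stderr.splitlines():
                fields = line.split(",")
                if len(fields) >= 3 and fields[2] in events:
                    counts[fields[2]] = int(fields[0])
            return counts["LLC-load-misses"] / counts["LLC-loads"]

    For example, llc_miss_ratio(["./benchmark"]) would return the fraction of last-level-cache loads that miss for a hypothetical workload binary.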