11 research outputs found

    POSTER: SPiDRE: accelerating sparse memory access patterns

    Advances in process technology have led to an exponential increase in processor speed and memory capacity. However, memory latencies have not improved as dramatically and remain a well-known problem in computer architecture. Cache memories provide more bandwidth and lower latencies than main memory, but they are capacity limited. Locality-friendly applications benefit from a large and deep cache hierarchy; nevertheless, this is a limited solution for applications that suffer from sparse and irregular memory access patterns, such as data analytics. To accelerate them, we should maximize usable bandwidth, reduce latency, and maximize the reuse of the data that is moved. In this work we explore the Sparse Data Rearrange Engine (SPiDRE), a novel hardware approach to accelerate these applications through near-memory data reorganization. This work has been supported by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P, Ramon y Cajal fellowship number RYC-2016-21104 and FPI fellowship number BES-2017-080635), and by the Arm-BSC Centre of Excellence initiative. Peer Reviewed. Postprint (author's final draft).
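
    As a rough software analogue of the rearrangement SPiDRE performs near memory (the element type, index array and function name are illustrative assumptions, not the paper's interface), the following C++ sketch gathers sparsely scattered elements into a dense buffer so that subsequent compute loops see unit-stride accesses:

    #include <cstddef>
    #include <vector>

    // Hypothetical software stand-in for a SPiDRE-style rearrangement: gather
    // elements scattered across a large array into a dense buffer so that later
    // compute loops enjoy unit-stride, cache-friendly accesses.
    std::vector<double> rearrange(const std::vector<double>& sparse,
                                  const std::vector<std::size_t>& indices) {
        std::vector<double> dense;
        dense.reserve(indices.size());
        for (std::size_t i : indices)    // irregular, latency-bound accesses...
            dense.push_back(sparse[i]);  // ...paid once, near memory in the SPiDRE approach
        return dense;
    }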

    An optimized predication execution for SIMD extensions

    Vector processing is a widely used technique to improve performance and energy efficiency in modern processors. Most of them rely on predication to support divergence control. However, the performance and energy consumption of predicated instructions are usually independent of the number of true values in the mask. This means that the efficiency of the system becomes sub-optimal as vector length increases. In this work we propose the Optimized Predication Execution (OPE) technique. OPE delays the execution of sparsely masked vector instructions sharing the same PC, extracts their active elements, and creates a new dense instruction with a higher mask density. After executing this dense instruction, the results are restored to the original sparse instructions. Our approach improves performance by up to 25% and reduces dynamic energy consumption by up to 43% on real applications with predication. This work has been partially supported by the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Spanish Government (contract TIN2015-65316-P). A. Barredo has been supported by the Spanish Government under Formación del Personal Investigador fellowship number BES-2017-080635. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Peer Reviewed. Postprint (author's final draft).
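
    As a behavioural sketch of the idea only (not the hardware implementation; the vector length, element type and the add operation are assumptions), the C++ model below packs the active elements of two sparse instances of the same masked instruction into one dense operation and then restores the results to their original positions:

    #include <array>
    #include <cstddef>
    #include <utility>

    constexpr std::size_t VL = 8;           // illustrative vector length

    struct MaskedOp {                       // one dynamic instance of a masked vector add
        std::array<double, VL> a, b, result;
        std::array<bool, VL> mask;
    };

    // Behavioural sketch of OPE: active elements from two sparse instances are
    // packed into one dense operation, executed, and written back to their origins.
    void execute_compacted(MaskedOp& x, MaskedOp& y) {
        std::array<double, 2 * VL> pa{}, pb{}, pr{};
        std::array<std::pair<MaskedOp*, std::size_t>, 2 * VL> origin{};
        std::size_t n = 0;
        for (MaskedOp* op : {&x, &y})
            for (std::size_t i = 0; i < VL; ++i)
                if (op->mask[i]) { pa[n] = op->a[i]; pb[n] = op->b[i]; origin[n] = {op, i}; ++n; }
        for (std::size_t i = 0; i < n; ++i)      // dense execution: every lane is active
            pr[i] = pa[i] + pb[i];
        for (std::size_t i = 0; i < n; ++i)      // restoration to the original sparse instances
            origin[i].first->result[origin[i].second] = pr[i];
    }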

    Novel techniques to improve the performance and the energy of vector architectures

    The rate of annual data generation grows exponentially, and at the same time there is a high demand to analyze that information quickly. In the past, every processor generation came with a substantial frequency increase, leading to higher application throughput. Nowadays, due to the end of Dennard scaling, further performance must come from exploiting parallelism. Vector architectures offer an efficient way, in terms of performance and energy, of exploiting data-level parallelism by means of instructions that operate on multiple elements at the same time, an approach popularly known as Single Instruction Multiple Data (SIMD). Traditionally, vector processors were employed to accelerate research applications and were not industry-oriented. However, they are now widely used for data processing in multimedia applications and are entering new application domains such as machine learning and genomics. In this thesis, we study the circumstances that cause inefficiencies in vector processors and propose new hardware/software techniques to improve their performance and energy consumption. We first analyze the behavior of predicated vector instructions on a real machine and observe that their execution time depends on the vector register length and not on the source mask employed. We therefore propose a hardware/software mechanism to alleviate this situation, which will have a larger impact on future processors with wider vector registers. We then study the impact of memory accesses on performance. We identify that irregular memory access patterns prevent efficient vectorization, which the compiler then discards automatically. For this reason, we propose a near-memory accelerator capable of rearranging data structures and transforming irregular memory accesses into dense ones; this operation can be performed by the devices while the host processor computes other code regions. Finally, we observe that many applications with irregular memory access patterns perform only a simple operation on the data before it is evicted back to main memory. In these situations there is little data access locality, leading to an inefficient use of the memory hierarchy, so we propose to use the accelerators described above to compute directly near memory. The performance of these accelerators is evaluated on high-performance computing and graph applications, a domain strongly affected by this problem. Postprint (published version).
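
    Purely for illustration of the observation above (the intrinsic choice is an assumption; the thesis abstract does not prescribe a specific ISA extension here), a hypothetical AVX-512 kernel of the kind such a study might time:

    #include <immintrin.h>

    // With zero-masking, inactive lanes are zeroed rather than skipped, so a call
    // with mask 0x01 (one active lane out of eight) and a call with mask 0xFF
    // (all lanes active) typically occupy the vector unit for the same time on
    // current hardware, which is the inefficiency described above.
    __m512d masked_add(__m512d a, __m512d b, __mmask8 m) {
        return _mm512_maskz_add_pd(m, a, b);
    }
    // Example: masked_add(a, b, 0x01); vs. masked_add(a, b, 0xFF);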

    Evaluation of Raspberry Pi as a Microprocessor Architecture teaching platform

    Studying and understanding how processors work is one of the fundamental requirements in the education of a telecommunications engineer, above all because of the wide range of opportunities it opens up in different sectors of today's market, such as communication equipment like routers, modulators, computers, smartphones and tablets. The technological push towards ever smaller devices with increasingly advanced features demands professionals who are more specialized in their development. Throughout history, many microprocessors have tried to carve out a place in technological development. Among them, the so-called ARM processors have achieved the greatest market impact, ARM currently being the most successful 32-bit architecture in the world in terms of production volume. These facts have led the degree programme to require students to acquire a solid grounding in the ARM architecture, which in turn requires an environment capable of executing and debugging assembly code for this kind of system. The goal of this project is to find a platform capable of instructing students in the ARM architecture, providing them with new tools for its study so that they can deepen the knowledge acquired in earlier, thematically related courses of the degree. The aim is also to find an affordable platform that lets students investigate on their own how the processor works, achieving two goals: on the one hand, motivating them by turning their studies into something tangible, and on the other, helping them learn by making the subject part of their everyday practice. Bachelor's degree in Telecommunication Technologies Engineering (Grado en Ingeniería de Tecnologías de Telecomunicación).

    Study and improvement of emerging applications by means of vector architectures

    No full text

    PLANAR: a programmable accelerator for near-memory data rearrangement

    Many applications employ irregular and sparse memory accesses that cannot take advantage of existing cache hierarchies in high-performance processors. To address this problem, Data Layout Transformation (DLT) techniques rearrange sparse data into a dense representation, improving locality and cache utilization. However, prior proposals in this space fail to provide a design that (i) scales with multi-core systems, (ii) hides rearrangement latency, and (iii) provides the necessary interfaces to ease programmability. In this work we present PLANAR, a programmable near-memory accelerator that rearranges sparse data into a dense representation. By placing PLANAR devices at the memory controller level we obtain a design that scales well with multi-core systems, hides operation latency by performing non-blocking fine-grain data rearrangements, and eases programmability by supporting virtual memory and conventional memory allocation mechanisms. Our evaluation shows that PLANAR leads to significant reductions in data movement and dynamic energy, providing an average 4.58× speedup. A. Barredo has been partially supported by the Spanish Government under Formación del Personal Investigador fellowship number BES-2017-080635. A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number IJCI-2017-33945. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. This work has been partially supported by the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877), by the Spanish Ministry of Science and Innovation (PID2019-107255GB-C21/AEI/10.13039/501100011033), by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328), and by the Arm-BSC Centre of Excellence. Peer Reviewed. Postprint (author's final draft).
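
    The paper's actual programming interface is not reproduced in this abstract; purely to illustrate the kind of host-side usage a PLANAR-like device could enable, the sketch below assumes hypothetical planar_rearrange/planar_wait calls (the names, signatures and the software fallback are invented for illustration):

    #include <cstddef>

    // Software fallback standing in for the device, so the sketch is self-contained;
    // a real PLANAR-like accelerator would perform the gather near memory.
    void planar_rearrange(const double* sparse_base, const std::size_t* indices,
                          std::size_t count, double* dense_out) {
        for (std::size_t i = 0; i < count; ++i)
            dense_out[i] = sparse_base[indices[i]];
    }
    void planar_wait(const double*) {}      // completion barrier; nothing to wait for here

    double sum_selected(const double* data, const std::size_t* idx, std::size_t n,
                        double* scratch) {
        planar_rearrange(data, idx, n, scratch);  // non-blocking request in the real design...
        planar_wait(scratch);                     // ...so other work could overlap here
        double acc = 0.0;
        for (std::size_t i = 0; i < n; ++i)       // dense, unit-stride, vectorizable loop
            acc += scratch[i];
        return acc;
    }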

    Improving predication efficiency through compaction/restoration of SIMD instructions

    Vector processors offer a wide range of unexplored opportunities to improve performance and energy efficiency. However, despite this potential, vector code generation and execution face significant challenges, the most relevant being control-flow divergence. Most modern processors with SIMD extensions (such as AVX) rely on predication to support divergence control. In predicated codes, performance and energy consumption are usually insensitive to the number of true values in the predicate mask. This implies that system efficiency becomes sub-optimal as vector length increases. In this paper we focus on SIMD extensions and propose a novel approach to improve the execution efficiency of predicated SIMD instructions, the Compaction/Restoration (CR) technique. CR delays predicated SIMD instructions with inactive elements and compacts them with instances of the same instruction from different loop iterations to form an equivalent dense vector instruction in which, in the best case, all the elements are active. After executing such dense instructions, their results are restored to the original instructions. Our evaluation shows that CR improves performance by up to 25% and reduces dynamic energy consumption by up to 43% on real, unmodified applications with predicated execution. Moreover, CR allows executing unmodified legacy code with short vector instructions (AVX-2) on newer architectures with wider vectors (AVX-512), achieving up to 56% performance benefits. This work has been partially supported by the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence, the Spanish Government (contract TIN2015-65316-P) and the European Union’s Horizon 2020 research and innovation program under the Mont-Blanc 2020 project (grant agreement 779877). A. Barredo has been supported by the Spanish Government under Formación del Personal Investigador fellowship number BES-2017-080635. M. Moretó and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship numbers RYC-2016-21104 and RYC-2017-23269. Peer Reviewed. Postprint (author's final draft).

    Efficiency analysis of modern vector architectures: vector ALU sizes, core counts and clock frequencies

    Moore’s Law predicted that the number of transistors on a chip would double approximately every two years. However, this trend is reaching an impasse. Optimizing the use of the available transistors within the thermal dissipation capabilities of the package remains an open problem. Multi-core processors exploit coarse-grain parallelism to improve energy efficiency. Vectorization allows developers to exploit data-level parallelism, operating on several elements per instruction and thus reducing the pressure on the fetch and decode pipeline stages. In this paper, we analyze different resource optimization strategies for vector architectures. In particular, we expose the need to break down the voltage and frequency domains of the LLC, the ALUs and the vector ALUs if we aim to optimize the energy efficiency and performance of the system. We also show the need for a dynamic reconfiguration strategy that adapts the vector register length at runtime. Funding was provided by the RoMoL ERC Advanced Grant (Grant No. GA 321253), Juan de la Cierva (Grant No. JCI-2012-15047) and Marie Curie (Grant No. 2013 BP_B 00243). Peer Reviewed.
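
    For context, the textbook first-order CMOS power relation (a well-known model, not a result of this paper) makes the motivation concrete:

        P ≈ α · C · V² · f + P_leak(V)

    Because dynamic power grows with V²·f, clocking the LLC, the scalar ALUs and the vector ALUs from a single voltage/frequency domain forces every structure to the worst-case operating point, which is why per-structure domains and runtime adaptation of the vector register length can improve energy efficiency.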

    Semi-automatic validation of cycle-accurate simulation infrastructures: The case for gem5-x86

    No full text
    Since the early 70s, simulation infrastructures have been a keystone of computer architecture research, providing a fast and reliable way to prototype and evaluate ideas for future computing systems. There are different types of simulators, from the most detailed (cycle-accurate) to timing/functional and analytical models, and increasing accuracy costs several orders of magnitude in simulation speed. Yet one question remains open: are the results derived from the simulation infrastructure representative of a real machine? Validating these infrastructures is complex and costly, and is usually performed only upon release; moreover, most simulators do not provide appropriate means to verify or validate new architectural models. In this paper, we introduce a semi-automatic validation framework based on performance counter information from real hardware. The framework provides two levels of abstraction: (a) a high-level definition of processor behavior (the Top-Down model) and (b) a detailed per-structure and per-pipeline-stage usage breakdown to pinpoint simulator issues. We used this framework to validate the latest available gem5-x86 simulation environment and found several sources of error that alter the expected behavior of the simulated processor, which we subsequently documented and corrected. This work has been partially supported by the Spanish Government (Severo Ochoa grants SEV-2015-0493 and SEV-2011-00067), the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-779877). A. Barredo has been supported by the Spanish Government under Formación del Personal Investigador fellowship number BES-2017-080635. M. Moretó and M. Casas have been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship numbers RYC-2016-21104 and RYC-2017-23269. Peer Reviewed. Postprint (author's final draft).
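
    The framework itself is not reproduced here; as a minimal sketch of the kind of real-hardware counter measurement such a framework compares against simulator statistics (the Linux perf_event interface, the counter choice and the placeholder kernel are assumptions), the following Linux-only program reads cycles and retired instructions around a region of interest and derives IPC:

    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    // Open one hardware counter for the calling thread (created disabled).
    static int open_counter(std::uint64_t config) {
        perf_event_attr attr{};
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = config;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
    }

    int main() {
        int cyc = open_counter(PERF_COUNT_HW_CPU_CYCLES);
        int ins = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
        ioctl(cyc, PERF_EVENT_IOC_ENABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_ENABLE, 0);

        volatile double acc = 0.0;                       // placeholder region of interest
        for (int i = 0; i < 1000000; ++i) acc += i * 0.5;

        ioctl(cyc, PERF_EVENT_IOC_DISABLE, 0);
        ioctl(ins, PERF_EVENT_IOC_DISABLE, 0);
        long long cycles = 0, instructions = 0;
        if (read(cyc, &cycles, sizeof(cycles)) <= 0) return 1;
        if (read(ins, &instructions, sizeof(instructions)) <= 0) return 1;
        std::printf("cycles=%lld instructions=%lld IPC=%.2f\n",
                    cycles, instructions, double(instructions) / double(cycles));
        return 0;
    }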