64 research outputs found

    Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications

    Get PDF
    High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMP) combine few high-performance big cores for serial regions, together with many low-power lean cores for throughput computing. The low requirements of HPC applications in the core front-end lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists to analyze the benefit of sharing the I-cache among full cores, which seems compelling as a solution to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections with McPAT and a rich set of HPC benchmarks show 11% area savings with a 5% energy reduction at no performance cost.The research was supported by European Unions 7th Framework Programme [FP7/2007-2013] under project Mont-Blanc (288777), the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925), Generalitat de Catalunya (2014-SGR-1051 and 2014-SGR-1272), HiPEAC-3 Network of Excellence (ICT-287759), and finally the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.Peer ReviewedPostprint (author's final draft

    Multicore architecture optimizations for HPC applications

    Get PDF
    From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP.Postprint (published version

    Multicore architecture optimizations for HPC applications

    Get PDF
    From single-core CPUs to detachable compute accelerators, supercomputers made a tremendous progress by using available transistors on chip and specializing hardware for a given type of computation. Today, compute nodes used in HPC employ multi-core CPUs tailored for serial execution and multiple accelerators (many-core devices or GPUs) for throughput computing. However, designing next-generation HPC system requires not only the performance improvement but also better energy efficiency. Current trend of reaching exascale level of computation asks for at least an order of magnitude increase in both of these metrics. This thesis explores HPC-specific optimizations in order to make better utilization of the available transistors and to improve performance by transparently executing parallel code across multiple GPU accelerators. First, we analyze several HPC benchmark suites, compare them against typical desktop applications, and identify the differences which advocate for proper core tailoring. Moreover, within the HPC applications, we evaluate serial and parallel code sections separately, resulting in an Asymmetric Chip Multiprocessor (ACMP) design with one core optimized for single-thread performance and many lean cores for parallel execution. Our results presented here suggests downsizing of core front-end structures providing an HPC-tailored lean core which saves 16% of the core area and 7% of power, without performance loss. Further improving an ACMP design, we identify that multiple lean cores run the same code during parallel regions. This motivated us to evaluate the idea where lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections presented in this thesis show additional 11% area savings with a 5% energy reduction at no performance cost. These area and power savings might be attractive for many-core accelerators either for increasing the performance per area and power unit, or adding additional cores and thus improving the performance for the same hardware budget. Finally, in this thesis we study the effects of future NUMA accelerators comprised of multiple GPU devices. Reaching the limits of a single-GPU die size, next-generation GPU compute accelerators will likely embrace multi-socket designs increasing the core count and memory bandwidth. However, maintaining the UMA behavior of a single-GPU in multi-GPU systems without code rewriting stands as a challenge. We investigate multi-socket NUMA GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 sockets designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for performance scaling of future compute nodes used in HPC.Empezando por CPUs de un solo procesador, y pasando por aceleradores discretos, los supercomputadores han avanzado enormemente utilizando todos los transistores disponibles en el chip, y especializando los diseños para cada tipo de cálculo. Actualmente, los nodos de cálculo de un sistema de Computación de Altas Prestaciones (CAP) utilizan CPUs de múltiples procesadores, optimizados para el cálculo serial de instrucciones, y múltiples aceleradores (aceleradores gráficos, o many-core), optimizados para el cálculo paralelo. El diseño de un sistema CAP de nueva generación requiere no solo mejorar el rendimiento de cálculo, sino también mejorar la eficiencia energética. La siguiente generación de sistemas requiere mejorar un orden de magnitud en ambas métricas simultáneamente. Esta tesis doctoral explora optimizaciones específicas para sistemas CAP para hacer un mejor uso de los transistores, y para mejorar las prestaciones de forma transparente ejecutando las aplicaciones en múltiples aceleradores en paralelo. Primero, analizamos varios conjuntos de aplicaciones CAP, y las comparamos con aplicaciones para servidores y escritorio, identificando las principales diferencias que nos indican cómo ajustar la arquitectura para CAP. En las aplicaciones CAP, también analizamos la parte secuencial del código y la parte paralela de forma separada, . El resultado de este análisis nos lleva a proponer una arquitectura multiprocesador asimétrica (ACMP) , con un procesador optimizado para el código secuencial, y múltiples procesadores, más pequeños, optimizados para el procesamiento paralelo. Nuestros resultados muestran que reducir el tamaño de las estructuras del front-end (fetch, y predicción de saltos) en los procesadores paralelos nos proporciona un 16% extra de área en el chip, y una reducción de consumo del 7%. Como mejora a nuestra arquitectura ACMP, proponemos explotar el hecho de que todos los procesadores paralelos ejecutan el mismo código al mismo tiempo. Evaluamos una propuesta en que los procesadores paralelos comparten la caché de instrucciones, con la intención de que uno de ellos precargue las instrucciones para los demás procesadores (prefetching), sin aumentar la latencia media de acceso. Nuestra exploración de los distintos parámetros determina que el punto óptimo requiere una interconexión de alto ancho de banda para acceder a la caché compartida, y el uso de unos pocos line buffers para mantener el ancho de banda y la latencia necesarios. Nuestras proyecciones muestran un ahorro adicional del 11% en área y el 5% en energía, sin impacto en el rendimiento. Estos ahorros de área y energía permiten a un multiprocesador incrementar la eficiencia energética, o aumentar el rendimiento añadiendo procesador adicionales. Por último, estudiamos el efecto de usar múltiples aceleradores (GPU) en una arquitectura con tiempo de acceso a memoria no uniforme (NUMA). Una vez alcanzado el límite de número de transistores y tamaño máximo por chip, la siguiente generación de aceleradores deberá utilizar múltiples chips para aumentar el número de procesadores y el ancho de banda de acceso a memoria. Sin embargo, es muy difícil mantener la ilusión de un tiempo de acceso a memoria uniforme en un sistema multi-GPU sin reescribir el código de la aplicación. Nuestra investigación sobre sistemas multi-GPU muestra retos significativos en el diseño de la interconexión entre las GPU y la jerarquía de memorias cache. Nuestros resultados muestran que se puede explotar el comportamiento en fases de las aplicaciones para optimizar la configuración de la interconexión y las cachés de forma dinámica, minimizando el impacto de la arquitectura NUMA. Nuestro diseño mejora el rendimiento de un sistema con una única GPU en 1.5x, 2.3x y 3.2x (el 89%, 84%, y 76% del máximo teórico) usando 2, 4, y 8 GPUs en paralelo. Siendo su implementación posible hoy en dia, los nodos de cálculo con múltiples aceleradores son una alternativa atractiva para futuros sistemas CAP

    Cooperative Data and Computation Partitioning for Decentralized Architectures.

    Full text link
    Scalability of future wide-issue processor designs is severely hampered by the use of centralized resources such as register files, memories and interconnect networks. While the use of centralized resources eases both hardware design and compiler code generation efforts, they can become performance bottlenecks as access latencies increase with larger designs. The natural solution to this problem is to adapt the architecture to use smaller, decentralized resources. Decentralized architectures use smaller, faster components and exploit distributed instruction-level parallelism across the resources. A multicluster architecture is an example of such a decentralized processor, where subsets of smaller register files, functional units, and memories are grouped together in a tightly coupled unit, forming a cluster. These clusters can then be replicated and connected together to form a scalable, high-performance architecture. The main difficulty with decentralized architectures resides in compiler code generation. In a centralized Very Long Instruction Word (VLIW) processor, the compiler must statically schedule each operation to both a functional unit and a time slot for execution. In contrast, for a decentralized multicluster VLIW, the compiler must consider the additional effects of cluster assignment, recognizing that communication between clusters will result in a delay penalty. In addition, if the multicluster processor also has partitioned data memories, the compiler has the additional task of assigning data objects to their respective memories. Each decision, of cluster, functional unit, memory, and time slot, are highly interrelated and can have dramatic effects on the best choice for every other decision. This dissertation addresses the issues of extracting and exploiting inherent parallelism across decentralized resources through compiler analysis and code generation techniques. First, a static analysis technique to partition data objects is presented, which maps data objects to decentralized scratchpad memories. Second, an alternative profile-guided technique for memory partitioning is presented which can effectively map data access operations onto distributed caches. Finally, a detailed, resource-aware partitioning algorithm is presented which can effectively split computation operations of an application across a set of processing elements. These partitioners work in tandem to create a high-performance partition assignment of both memory and computation operations for decentralized multicluster or multicore processors.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57649/2/mchu_1.pd

    Architecting a One-to-many Traffic-Aware and Secure Millimeter-Wave Wireless Network-in-Package Interconnect for Multichip Systems

    Get PDF
    With the aggressive scaling of device geometries, the yield of complex Multi Core Single Chip(MCSC) systems with many cores will decrease due to the higher probability of manufacturing defects especially, in dies with a large area. Disintegration of large System-on-Chips(SoCs) into smaller chips called chiplets has shown to improve the yield and cost of complex systems. Therefore, platform-based computing modules such as embedded systems and micro-servers have already adopted Multi Core Multi Chip (MCMC) architectures overMCSC architectures. Due to the scaling of memory intensive parallel applications in such systems, data is more likely to be shared among various cores residing in different chips resulting in a significant increase in chip-to-chip traffic, especially one-to-many traffic. This one-to-many traffic is originated mainly to maintain cache-coherence between many cores residing in multiple chips. Besides, one-to-many traffics are also exploited by many parallel programming models, system-level synchronization mechanisms, and control signals. How-ever, state-of-the-art Network-on-Chip (NoC)-based wired interconnection architectures do not provide enough support as they handle such one-to-many traffic as multiple unicast trafficusing a multi-hop MCMC communication fabric. As a result, even a small portion of such one-to-many traffic can significantly reduce system performance as traditional NoC-basedinterconnect cannot mask the high latency and energy consumption caused by chip-to-chipwired I/Os. Moreover, with the increase in memory intensive applications and scaling of MCMC systems, traditional NoC-based wired interconnects fail to provide a scalable inter-connection solution required to support the increased cache-coherence and synchronization generated one-to-many traffic in future MCMC-based High-Performance Computing (HPC) nodes. Therefore, these computation and memory intensive MCMC systems need an energy-efficient, low latency, and scalable one-to-many (broadcast/multicast) traffic-aware interconnection infrastructure to ensure high-performance. Research in recent years has shown that Wireless Network-in-Package (WiNiP) architectures with CMOS compatible Millimeter-Wave (mm-wave) transceivers can provide a scalable, low latency, and energy-efficient interconnect solution for on and off-chip communication. In this dissertation, a one-to-many traffic-aware WiNiP interconnection architecture with a starvation-free hybrid Medium Access Control (MAC), an asymmetric topology, and a novel flow control has been proposed. The different components of the proposed architecture are individually one-to-many traffic-aware and as a system, they collaborate with each other to provide required support for one-to-many traffic communication in a MCMC environment. It has been shown that such interconnection architecture can reduce energy consumption and average packet latency by 46.96% and 47.08% respectively for MCMC systems. Despite providing performance enhancements, wireless channel, being an unguided medium, is vulnerable to various security attacks such as jamming induced Denial-of-Service (DoS), eavesdropping, and spoofing. Further, to minimize the time-to-market and design costs, modern SoCs often use Third Party IPs (3PIPs) from untrusted organizations. An adversary either at the foundry or at the 3PIP design house can introduce a malicious circuitry, to jeopardize an SoC. Such malicious circuitry is known as a Hardware Trojan (HT). An HTplanted in the WiNiP from a vulnerable design or manufacturing process can compromise a Wireless Interface (WI) to enable illegitimate transmission through the infected WI resulting in a potential DoS attack for other WIs in the MCMC system. Moreover, HTs can be used for various other malicious purposes, including battery exhaustion, functionality subversion, and information leakage. This information when leaked to a malicious external attackercan reveals important information regarding the application suites running on the system, thereby compromising the user profile. To address persistent jamming-based DoS attack in WiNiP, in this dissertation, a secure WiNiP interconnection architecture for MCMC systems has been proposed that re-uses the one-to-many traffic-aware MAC and existing Design for Testability (DFT) hardware along with Machine Learning (ML) approach. Furthermore, a novel Simulated Annealing (SA)-based routing obfuscation mechanism was also proposed toprotect against an HT-assisted novel traffic analysis attack. Simulation results show that,the ML classifiers can achieve an accuracy of 99.87% for DoS attack detection while SA-basedrouting obfuscation could reduce application detection accuracy to only 15% for HT-assistedtraffic analysis attack and hence, secure the WiNiP fabric from age-old and emerging attacks

    On-Chip Optical Interconnection Networks for Multi/Manycore Architectures

    Get PDF
    The rapid development of multi/manycore technologies offers the opportunity for highly parallel architectures implemented on a single chip. While the first, low-parallelism multicore products have been based on simple interconnection structures (single bus, very simple crossbar), the emerging highly parallel architectures will require complex, limited-degree interconnection networks. This thesis studies this trend according to the general theory of interconnection structures for parallel machines, and investigates some solutions in terms of performance, cost, fault-tolerance, and run-time support to shared-memory and/or message passing programming mechanisms

    Polymorphic computing abstraction for heterogeneous architectures

    Get PDF
    Integration of multiple computing paradigms onto system on chip (SoC) has pushed the boundaries of design space exploration for hardware architectures and computing system software stack. The heterogeneity of computing styles in SoC has created a new class of architectures referred to as Heterogeneous Architectures. Novel applications developed to exploit the different computing styles are user centric for embedded SoC. Software and hardware designers are faced with several challenges to harness the full potential of heterogeneous architectures. Applications have to execute on more than one compute style to increase overall SoC resource utilization. The implication of such an abstraction is that application threads need to be polymorphic. Operating system layer is thus faced with the problem of scheduling polymorphic threads. Resource allocation is also an important problem to be dealt by the OS. Morphism evolution of application threads is constrained by the availability of heterogeneous computing resources. Traditional design optimization goals such as computational power and lower energy per computation are inadequate to satisfy user centric application resource needs. Resource allocation decisions at application layer need to permeate to the architectural layer to avoid conflicting demands which may affect energy-delay characteristics of application threads. We propose Polymorphic computing abstraction as a unified computing model for heterogeneous architectures to address the above issues. Simulation environment for polymorphic applications is developed and evaluated under various scheduling strategies to determine the effectiveness of polymorphism abstraction on resource allocation. User satisfaction model is also developed to complement polymorphism and used for optimization of resource utilization at application and network layer of embedded systems

    Memory Hierarchy Design for Next Generation Scalable Many-core Platforms

    Get PDF
    Performance and energy consumption in modern computing platforms is largely dominated by the memory hierarchy. The increasing computational power in the multiprocessors and accelerators, and the emergence of the data-intensive workloads (e.g. large-scale graph traversal and scientific algorithms) requiring fast transfer of large volumes of data, are two main trends which intensify this problem by putting even higher pressure on the memory hierarchy. This increasing gap between computation speed and data transfer speed is commonly referred as the “memory wall” problem. With the emergence of heterogeneous Three Dimensional (3D) Integration based on through-silicon-vias (TSV), this situation has started to recover in the past years. On one hand, it is now possible to improve memory access bandwidth and/or latency by either stacking memories directly on top of processors or through abstracted memory interfaces such as Micron’s Hybrid Memory Cube (HMC). On the other hand, near memory computation has become worthy of revisiting due to the cost-effective integration of logic and memory in 3D stacks. These two directions bring about several interesting opportunities including performance improvement, energy and cost reduction, product miniaturization, and modular design for improved time to market. In this research, we study the effectiveness of the 3D integration technology and the optimization opportunities which it can provide in the different layers of the memory hierarchy in cluster-based many-core platforms ranging from intra-cluster L1 to inter-cluster L2 scratchpad memories (SPMs), as well as the main memory. In addition, by moving a part of the computation to where data resides, in the 3D-stacked memory context, we demonstrate further energy and performance improvement opportunities

    Doctor of Philosophy in Computing

    Get PDF
    dissertatio
    corecore