752 research outputs found

    Parallel and Distributed Computing

    Get PDF
    The 14 chapters presented in this book cover a wide variety of representative works ranging from hardware design to application development. Particularly, the topics that are addressed are programmable and reconfigurable devices and systems, dependability of GPUs (General Purpose Units), network topologies, cache coherence protocols, resource allocation, scheduling algorithms, peertopeer networks, largescale network simulation, and parallel routines and algorithms. In this way, the articles included in this book constitute an excellent reference for engineers and researchers who have particular interests in each of these topics in parallel and distributed computing

    Runtime home mapping for effective memory resource usage

    Full text link
    In tiled Chip Multiprocessors (CMPs) last-level cache (LLC) banks are usually shared but distributed among the tiles. A static mapping of cache blocks to the LLC banks leads to poor efficiency since a block may be mapped away from the tiles actually accessing it. Dynamic policies either rely on the static mapping of blocks to a set of banks (D-NUCA) or rely on the OS to dynamically load pages to statically mapped addresses (first-touch). In this paper, we propose Runtime Home Mapping (RHM), a new dynamic approach where the LLC home bank is determined at runtime by the memory controller when the block is fetched from main memory, trying to map each block as close as possible to the requestor thus speeding up execution time and lowering message latencies. Block migration and replication provide further improvements to basic RHM. Also, in a further optimization we eliminate the directory structure. All these optimizations involve specific NoC optimizations and co-designs. Results with PARSEC and SPLASH-2 applications show, when compared with alternative solutions, that RHM achieves a 41% and 35% average reduction in load and store latencies respectively compared to static mapping. This leads to an average reduction of 28% in applications execution.Lodde, M.; Flich Cardo, J. (2014). Runtime home mapping for effective memory resource usage. Microprocessors and Microsystems. 38(4):276-291. doi:10.1016/j.micpro.2014.03.008S27629138

    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    Full text link
    The trend in industry is towards heterogeneous multicore processors (HMCs), including chips with CPUs and massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current HMC. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM

    Fault- and Yield-Aware On-Chip Memory Design and Management

    Get PDF
    Ever decreasing device size causes more frequent hard faults, which becomes a serious burden to processor design and yield management. This problem is particularly pronounced in the on-chip memory which consumes up to 70% of a processor' s total chip area. Traditional circuit-level techniques, such as redundancy and error correction code, become less effective in error-prevalent environments because of their large area overhead. In this work, we suggest an architectural solution to building reliable on-chip memory in the future processor environment. Our approaches have two parts, a design framework and architectural techniques for on-chip memory structures. Our design framework provides important architectural evaluation metrics such as yield, area, and performance based on low level defects and process variations parameters. Processor architects can quickly evaluate their designs' characteristics in terms of yield, area, and performance. With the framework, we develop architectural yield enhancement solutions for on-chip memory structures including L1 cache, L2 cache and directory memory. Our proposed solutions greatly improve yield with negligible area and performance overhead. Furthermore, we develop a decoupled yield model of compute cores and L2 caches in CMPs, which show that there will be many more L2 caches than compute cores in a chip. We propose efficient utilization techniques for excess caches. Evaluation results show that excess caches significantly improve overall performance of CMPs

    CROSS-LAYER CUSTOMIZATION FOR LOW POWER AND HIGH PERFORMANCE EMBEDDED MULTI-CORE PROCESSORS

    Get PDF
    Due to physical limitations and design difficulties, computer processor architecture has shifted to multi-core and even many-core based approaches in recent years. Such architectures provide potentials for sustainable performance scaling into future peta-scale/exa-scale computing platforms, at affordable power budget, design complexity, and verification efforts. To date, multi-core processor products have been replacing uni-core processors in almost every market segment, including embedded systems, general-purpose desktops and laptops, and super computers. However, many issues still remain with multi-core processor architectures that need to be addressed before their potentials could be fully realized. People in both academia and industry research community are still seeking proper ways to make efficient and effective use of these processors. The issues involve hardware architecture trade-offs, the system software service, the run-time management, and user application design, which demand more research effort into this field. Due to the architectural specialties with multi-core based computers, a Cross-Layer Customization framework is proposed in this work, which combines application specific information and system platform features, along with necessary operating system service support, to achieve exceptional power and performance efficiency for targeted multi-core platforms. Several topics are covered with specific optimization goals, including snoop cache coherence protocol, inter-core communication for producer-consumer applications, synchronization mechanisms, and off-chip memory bandwidth limitations. Analysis of benchmark program execution with conventional mechanisms is made to reveal the overheads in terms of power and performance. Specific customizations are proposed to eliminate such overheads with support from hardware, system software, compiler, and user applications. Experiments show significant improvement on system performance and power efficiency

    Scalable directoryless shared memory coherence using execution migration

    Get PDF
    We introduce the concept of deadlock-free migration-based coherent shared memory to the NUCA family of architectures. Migration-based architectures move threads among cores to guarantee sequential semantics in large multicores. Using a execution migration (EM) architecture, we achieve performance comparable to directory-based architectures without using directories: avoiding automatic data replication significantly reduces cache miss rates, while a fast network-level thread migration scheme takes advantage of shared data locality to reduce remote cache accesses that limit traditional NUCA performance. EM area and energy consumption are very competitive, and, on the average, it outperforms a directory-based MOESI baseline by 6.8% and a traditional S-NUCA design by 9.2%. We argue that with EM scaling performance has much lower cost and design complexity than in directory-based coherence and traditional NUCA architectures: by merely scaling network bandwidth from 128 to 256 (512) bit flits, the performance of our architecture improves by an additional 8% (12%), while the baselines show negligible improvement

    Adaptive memory hierarchies for next generation tiled microarchitectures

    Get PDF
    Les últimes dècades el rendiment dels processadors i de les memòries ha millorat a diferent ritme, limitant el rendiment dels processadors i creant el conegut memory gap. Sol·lucionar aquesta diferència de rendiment és un camp d'investigació d'actualitat i que requereix de noves sol·lucions. Una sol·lució a aquest problema són les memòries “cache”, que permeten reduïr l'impacte d'unes latències de memòria creixents i que conformen la jerarquia de memòria. La majoria de d'organitzacions de les “caches” estan dissenyades per a uniprocessadors o multiprcessadors tradicionals. Avui en dia, però, el creixent nombre de transistors disponible per xip ha permès l'aparició de xips multiprocessador (CMPs). Aquests xips tenen diferents propietats i limitacions i per tant requereixen de jerarquies de memòria específiques per tal de gestionar eficientment els recursos disponibles. En aquesta tesi ens hem centrat en millorar el rendiment i la eficiència energètica de la jerarquia de memòria per CMPs, des de les “caches” fins als controladors de memòria. A la primera part d'aquesta tesi, s'han estudiat organitzacions tradicionals per les “caches” com les privades o compartides i s'ha pogut constatar que, tot i que funcionen bé per a algunes aplicacions, un sistema que s'ajustés dinàmicament seria més eficient. Tècniques com el Cooperative Caching (CC) combinen els avantatges de les dues tècniques però requereixen un mecanisme centralitzat de coherència que té un consum energètic molt elevat. És per això que en aquesta tesi es proposa el Distributed Cooperative Caching (DCC), un mecanisme que proporciona coherència en CMPs i aplica el concepte del cooperative caching de forma distribuïda. Mitjançant l'ús de directoris distribuïts s'obté una sol·lució més escalable i que, a més, disposa d'un mecanisme de marcatge més flexible i eficient energèticament. A la segona part, es demostra que les aplicacions fan diferents usos de la “cache” i que si es realitza una distribució de recursos eficient es poden aprofitar els que estan infrautilitzats. Es proposa l'Elastic Cooperative Caching (ElasticCC), una organització capaç de redistribuïr la memòria “cache” dinàmicament segons els requeriments de cada aplicació. Una de les contribucions més importants d'aquesta tècnica és que la reconfiguració es decideix completament a través del maquinari i que tots els mecanismes utilitzats es basen en estructures distribuïdes, permetent una millor escalabilitat. ElasticCC no només és capaç de reparticionar les “caches” segons els requeriments de cada aplicació, sinó que, a més a més, és capaç d'adaptar-se a les diferents fases d'execució de cada una d'elles. La nostra avaluació també demostra que la reconfiguració dinàmica de l'ElasticCC és tant eficient que gairebé proporciona la mateixa taxa de fallades que una configuració amb el doble de memòria.Finalment, la tesi es centra en l'estudi del comportament de les memòries DRAM i els seus controladors en els CMPs. Es demostra que, tot i que els controladors tradicionals funcionen eficientment per uniprocessadors, en CMPs els diferents patrons d'accés obliguen a repensar com estan dissenyats aquests sistemes. S'han presentat múltiples sol·lucions per CMPs però totes elles es veuen limitades per un compromís entre el rendiment global i l'equitat en l'assignació de recursos. En aquesta tesi es proposen els Thread Row Buffers (TRBs), una zona d'emmagatenament extra a les memòries DRAM que permetria guardar files de dades específiques per a cada aplicació. Aquest mecanisme permet proporcionar un accés equitatiu a la memòria sense perjudicar el seu rendiment global. En resum, en aquesta tesi es presenten noves organitzacions per la jerarquia de memòria dels CMPs centrades en la escalabilitat i adaptativitat als requeriments de les aplicacions. Els resultats presentats demostren que les tècniques proposades proporcionen un millor rendiment i eficiència energètica que les millors tècniques existents fins a l'actualitat.Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well known "memory gap". Solving this performance difference is an important research field and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, that reduce the impact of longer memory accesses and conform the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the apparition of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories. In the first part of this thesis we have studied traditional cache organizations such as shared or private caches and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires the usage of a centralized coherence structure and has a high energy consumption. Therefore we propose the Distributed Cooperative Caching (DCC), a mechanism to provide coherence to chip multiprocessors and apply the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution and, in addition, has a more flexible and energy-efficient tag allocation method. We also show that applications make different uses of cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing a better scalability. ElasticCC not only is able to repartition cache sizes to application requirements, but also is able to dynamically adapt to the different execution phases of each thread. Our experimental evaluation also has shown that the cache partitioning provided by ElasticCC is efficient and is almost able to match the off-chip miss rate of a configuration that doubles the cache space. Finally, we focus in the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate for a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers, however, all of them must trade-off between memory throughput and fairness. We propose Thread Row Buffers, an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables a fair memory access scheduling without hurting memory throughput. Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability and of the proposed structures and adaptivity to application behavior. Results show that the presented techniques provide a better performance and energy-efficiency than existing state-of-the-art solutions

    A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS

    Get PDF
    Accelerator-based -or heterogeneous- computing has become increasingly important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can play here a key role, as they enable unprecedented levels of power-efficiency compared to CPUs/GPUs. However, such paradigms are still immature and deeper exploration is indispensable. This dissertation investigates customizability and proposes novel solutions for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent scratchpad memory with a configurable bank remapping system to reduce bank conflicts. The experimental results show the benefits of both using a customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed synchronization master better suits many-cores than standard centralized solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based on the sparse directory approach, with a selective coherence maintenance system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and non-coherent architectural mechanism along with an extended coherence protocol can enhance performance. The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration of power-efficient high-performance computing architectures. The system is based on a NoC and a customizable GPU-like accelerator core, as well as a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological results as part of the contribution in this dissertation. In fact, as a key benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms do not always support a comprehensive heterogeneous architecture exploration
    corecore