137 research outputs found

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    Broadcast-enabled massive multicore architectures: a wireless RF approach

    Get PDF
    Broadcast traditionally has been regarded as a prohibitive communication transaction in multiprocessor environments. Nowadays, such a constraint largely drives the design of architectures and algorithms all-pervasive in diverse computing domains, directly and indirectly leading to diminishing performance returns as the many-core era is approaching. Novel interconnect technologies could help revert this trend by offering, among others, improved broadcast support, even in large-scale chip multiprocessors. This article outlines the prospects of wireless on-chip communication technologies pointing toward low-latency (a few cycles) and energy-efficient broadcast (a few picojoules per bit). It also discusses the challenges and potential impact of adopting these technologies as key enablers of unconventional hardware architectures and algorithmic approaches, in the pathway of significantly improving the performance, energy efficiency, scalability, and programmability of many-core chips.Peer ReviewedPostprint (author's final draft

    On the simulation and design of manycore CMPs

    Get PDF
    The progression of Moore’s Law has resulted in both embedded and performance computing systems which use an ever increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures for up to 1024 cores. Embedded multicores are increasingly popular as it is easier to guarantee hard-realtime constraints using individual cores dedicated for tasks, than to use traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial and error approaches to investigate the design space. This thesis tackles the problems encountered in the design of these large scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, through partitioning the NoC and using packet counting and cycle skipping reduces the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining less error, when compared to real hardware, than any similar simulator. Simulation speeds reach up to 370MIPS (Million (target) Instructions Per Second), or 110MHz, which is better than typical FPGA prototypes, and approaching final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems up to 64-1024 core processors, adding support for coherent architectures using the same packet counting techniques along with low overhead context switching to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation to further improve simulation speeds in a manner which did not sacrifice any accuracy. These innovations were leveraged to investigate significant novel energy saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, with the name wait-on-address, the energy spent during spin-wait style synchronisation events can be significantly reduced. This functions by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency action. Upon an update or invalidation (or traditional timer or external interrupts) the core will resume execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, such as would be the case with a system running many tasks with a small degree of parallelism in each. The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications on a huge diverse range of potential MPSoC designs. This data was used to train a series of machine learning based models which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space, using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space, and on average Random Forest was able to predict the best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay and area, Random Forest on average produced results within 93% of the optimum result. In summary this thesis improves upon the state of the art for cycle accurate multicore simulation, introduces novel energy saving changes the the ISA and microarchitecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design space exploration required to bring a new manycore design to market

    SCORPIO: A 36-Core Research Chip Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-Network Ordering

    Get PDF
    URL to conference programIn the many-core era, scalable coherence and on-chip interconnects are crucial for shared memory processors. While snoopy coherence is common in small multicore systems, directory-based coherence is the de facto choice for scalability to many cores, as snoopy relies on ordered interconnects which do not scale. However, directory-based coherence does not scale beyond tens of cores due to excessive directory area overhead or inaccurate sharer tracking. Prior techniques supporting ordering on arbitrary unordered networks are impractical for full multicore chip designs. We present SCORPIO, an ordered mesh Network-on-Chip(NoC) architecture with a separate fixed-latency, bufferless network to achieve distributed global ordering. Message delivery is decoupled from the ordering, allowing messages to arrive in any order and at any time, and still be correctly ordered. The architecture is designed to plug-and-play with existing multicore IP and with practicality, timing, area, and power as top concerns. Full-system 36 and 64-core simulations on SPLASH-2 and PARSEC benchmarks show an average application run time reduction of 24.1% and 12.9%, in comparison to distributed directory and AMD HyperTransport coherence protocols, respectively. The SCORPIO architecture is incorporated in an 11 mm-by- 13 mm chip prototype, fabricated in IBM 45nm SOI technology, comprising 36 Freescale e200 Power Architecture TM cores with private L1 and L2 caches interfacing with the NoC via ARM AMBA, along with two Cadence on-chip DDR2 controllers. The chip prototype achieves a post synthesis operating frequency of 1 GHz (833 MHz post-layout) with an estimated power of 28.8 W (768 mW per tile), while the network consumes only 10% of tile area and 19 % of tile power.United States. Defense Advanced Research Projects Agency (DARPA UHPC grant at MIT (Angstrom))Center for Future Architectures ResearchMicroelectronics Advanced Research Corporation (MARCO)Semiconductor Research Corporatio

    Runtime-assisted cache coherence deactivation in task parallel programs

    Get PDF
    With increasing core counts, the scalability of directory-based cache coherence has become a challenging problem. To reduce the area and power needs of the directory, recent proposals reduce its size by classifying data as private or shared, and disable coherence for private data. However, existing classification methods suffer from inaccuracies and require complex hardware support with limited scalability. This paper proposes a hardware/software co-designed approach: the runtime system identifies data that is guaranteed by the programming model semantics to not require coherence and notifies the microarchitecture. The microarchitecture deactivates coherence for this private data and powers off unused directory capacity. Our proposal reduces directory accesses to just 26% of the baseline system, and supports a 64x smaller directory with only 2.8% performance degradation. By dynamically calibrating the directory size our proposal saves 86% of dynamic energy consumption in the directory without harming performance.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Unions Horizon 2020 research and innovation programme (grant agreements 671697 and 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.Peer ReviewedPostprint (author's final draft

    Graphene-enabled wireless communication for massive multicore architectures

    Get PDF

    Lagarto I RISC-V Multi-core: Research Challenges to Build and Integrate a Network-on-Chip

    Get PDF
    Current compute-intensive applications largely exceed the resources of single-core processors. To face this problem, multi-core processors along with parallel computing techniques have become a solution to increase the computational performance. Likewise, multi-processors are fundamental to support new technologies and new science applications challenges. A specific objective of the Lagarto project developed at the National Polytechnic Institute of Mexico is to generate an ecosystem of high-performance processors for the industry and HPC in Mexico, supporting new technologies and scientific applications. This work presents the first approach of the Lagarto project to the design of multi-core processors and the research challenges to build an infrastructure that allows the flagship core of the Lagarto project to scale to multi- and many-cores. Using the OpenPiton platform with the Ariane RISC-V core, a functional tile has been built, integrating a Lagarto I core with memory coherence that executes atomic instructions, and a NoC that allows scaling the project to many-core versions. This work represents the initial state of the design of mexican multi-and many-cores processors

    Programming models for many-core architectures: a co-design approach

    Get PDF
    Common many-core processors contain tens of cores and distributed memory. Compared to a multicore system, which only has a few tightly coupled cores sharing a single bus and memory, several complex problems arise. Notably, many cores require many parallel tasks to fully utilize the cores, and communication happens in a distributed and decentralized way. Therefore, programming such a processor requires the application to exhibit concurrency. In contrast to a single-core application, a concurrent application has to deal with memory state changes with an observable (non-deterministic) intermediate state. The complexity introduced by these problems makes programming a many-core system with a single-core-based programming approach notoriously hard.\ud \ud The central concept of this thesis is that abstractions, which are related to (many-core) programming, are structured in a single platform model. A platform is a layered view of the hardware, a memory model, a concurrency model, a model of computation, and compile-time and run-time tooling. Then, a programming model is a specific view on this platform, which is used by a programmer. In this view, some details can be hidden from the programmer's perspective, some details cannot. For example, an operating system presents an infinite number of parallel virtual execution units to the application whilst it hides details regarding scheduling. On the other hand, a programmer usually has balance workload among threads by hand.\ud \ud This thesis presents modifications to different abstraction layers of a many-core architecture, in order to make the system as a whole more efficient, and to reduce the programming complexity. These modifications influence other abstractions in the platform, and especially the programming model. Therefore, this thesis applies co-design on all models. Notably, co-design of the memory model, concurrency model, and model of computation is required for a scalable implementation of lambda-calculus. Moreover, only the combination of requirements of the many-core hardware from one side and the concurrency model from the other leads to a memory model abstraction. Hence, this thesis shows that to cope with the current trends in many-core architectures from a programming perspective, it is essential and feasible to inspect and adapt all abstractions collectively
    corecore