35 research outputs found
MPSoCBench : um framework para avaliação de ferramentas e metodologias para sistemas multiprocessados em chip
Orientador: Rodolfo Jardim de AzevedoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Recentes metodologias e ferramentas de projetos de sistemas multiprocessados em chip (MPSoC) aumentam a produtividade por meio da utilização de plataformas baseadas em simuladores, antes de definir os últimos detalhes da arquitetura. No entanto, a simulação só é eficiente quando utiliza ferramentas de modelagem que suportem a descrição do comportamento do sistema em um elevado nível de abstração. A escassez de plataformas virtuais de MPSoCs que integrem hardware e software escaláveis nos motivou a desenvolver o MPSoCBench, que consiste de um conjunto escalável de MPSoCs incluindo quatro modelos de processadores (PowerPC, MIPS, SPARC e ARM), organizado em plataformas com 1, 2, 4, 8, 16, 32 e 64 núcleos, cross-compiladores, IPs, interconexões, 17 aplicações paralelas e estimativa de consumo de energia para os principais componentes (processadores, roteadores, memória principal e caches). Uma importante demanda em projetos MPSoC é atender às restrições de consumo de energia o mais cedo possível. Considerando que o desempenho do processador está diretamente relacionado ao consumo, há um crescente interesse em explorar o trade-off entre consumo de energia e desempenho, tendo em conta o domínio da aplicação alvo. Técnicas de escalabilidade dinâmica de freqüência e voltagem fundamentam-se em gerenciar o nível de tensão e frequência da CPU, permitindo que o sistema alcance apenas o desempenho suficiente para processar a carga de trabalho, reduzindo, consequentemente, o consumo de energia. Para explorar a eficiência energética e desempenho, foram adicionados recursos ao MPSoCBench, visando explorar escalabilidade dinâmica de voltaegem e frequência (DVFS) e foram validados três mecanismos com base na estimativa dinâmica de energia e taxa de uso de CPUAbstract: Recent design methodologies and tools aim at enhancing the design productivity by providing a software development platform before the definition of the final Multiprocessor System on Chip (MPSoC) architecture details. However, simulation can only be efficiently performed when using a modeling and simulation engine that supports system behavior description at a high abstraction level. The lack of MPSoC virtual platform prototyping integrating both scalable hardware and software in order to create and evaluate new methodologies and tools motivated us to develop the MPSoCBench, a scalable set of MPSoCs including four different ISAs (PowerPC, MIPS, SPARC, and ARM) organized in platforms with 1, 2, 4, 8, 16, 32, and 64 cores, cross-compilers, IPs, interconnections, 17 parallel version of software from well-known benchmarks, and power consumption estimation for main components (processors, routers, memory, and caches). An important demand in MPSoC designs is the addressing of energy consumption constraints as early as possible. Whereas processor performance comes with a high power cost, there is an increasing interest in exploring the trade-off between power and performance, taking into account the target application domain. Dynamic Voltage and Frequency Scaling techniques adaptively scale the voltage and frequency levels of the CPU allowing it to reach just enough performance to process the system workload while meeting throughput constraints, and thereby, reducing the energy consumption. To explore this wide design space for energy efficiency and performance, both for hardware and software components, we provided MPSoCBench features to explore dynamic voltage and frequency scalability (DVFS) and evaluated three mechanisms based on energy estimation and CPU usage rateDoutoradoCiência da ComputaçãoDoutora em Ciência da Computaçã
A tuneable software cache coherence protocol for heterogeneous MPSoCs
ABSTRACT In a multiprocessor system-on-chip (MPSoC) private caches introduce the cache coherence problem. Here, we target at heterogeneous MPSoCs with a network-on-chip (NoC). Existing hardware cache coherence protocols are less suitable for MPSoCs because many off-the-shelf processors used in MPSoCs do not support these protocols. Furthermore, these protocols typically rely on global visibility and serialization of writes which does not match well with the parallel pointto-point communication provided by a NoC. Therefore, we propose a software cache coherence protocol, which can be applied in a heterogeneous MPSoC with a NoC. The software cache coherence protocol relies on explicit synchronization in the software. More specifically, caches are guaranteed to be coherent according to the Release Consistency model, on top of which we have implemented the standard Pthreads communication library. Heterogeneous MPSoCs with off-the-shelf processors can easily be supported, because processors are only required to provide cache control operations, e.g., clean and invalidate. All cache coherence operations are interruptible and do not impact the execution of tasks on other processors, therefore this protocol is suitable for predictable MPSoCs. Our software cache coherence protocol is implemented on an ARM926EJ-S MPSoC which is mapped on an FPGA. From experiments we conclude that the protocol overhead is low for the applications taken from the SPLASH-2 benchmark set. For these applications we observed a speedup between 1.89 and 2.01 on the two processor MPSoC
Memory Consistency and Cache Coherency in Network-on-Chip Based Multi-Core Systems
The complexity of modern Systems-on-Chips (SoC) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly more and more on communications characteristics. Having in mind scalability challenges, Networks-on-Chip (NoC) are already de facto standard for the communication backbone of SoC systems. As such, those systems are targeting more and more parallel execution of user defined, real-time applications, but the computer engineering society aims at hiding underlying platform specific characteristics and providing user with platform-independent services.
Shared memory services are quite often a needed crucial property of such systems, therefore providing a coherent view, ensuring memory consistency, and still achieving the desired performance system characteristics is a huge challenge for scientists nowadays. With the invention of 3D integration, and opportunities of stacking memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing.
In this thesis, the concept of a scalable coherency protocol which dynamically adopts to inputs of system and shared resources, is presented. Protocol ingredients, structure and internal modules interaction are described in detail. The conceptual idea of this protocol, influenced by widely accepted best practices in bus based systems as well of other NoC systems, is implemented for one particular type of NoC platform - XhiNoC (extendable Hierarchical Network-on Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is confirmed by simulation-based experimental results.The complexity of modern Systems-on-Chips (SoC) is increasing with technology innovations. Designers of such systems are devoting significant attention not only to computation attributes, but increasingly more and more on communications characteristics. Having in mind scalability challenges, Networks-on-Chip (NoC) are already de facto standard for the communication backbone of SoC systems. As such, those systems are targeting more and more parallel execution of user defined, real-time applications, but the computer engineering society aims at hiding underlying platform specific characteristics and providing user with platform-independent services.
Shared memory services are quite often a needed crucial property of such systems, therefore providing a coherent view, ensuring memory consistency, and still achieving the desired performance system characteristics is a huge challenge for scientists nowadays. With the invention of 3D integration, and opportunities of stacking memory modules on top of it, the concept of scalable shared memory will be one of the main memory access concepts besides message passing.
In this thesis, the concept of a scalable coherency protocol which dynamically adopts to inputs of system and shared resources, is presented. Protocol ingredients, structure and internal modules interaction are described in detail. The conceptual idea of this protocol, influenced by widely accepted best practices in bus based systems as well of other NoC systems, is implemented for one particular type of NoC platform - XhiNoC (extendable Hierarchical Network-on Chip). The feasibility of the presented concept for distributed shared memory (DSM) coherency within NoC-based SoC architectures is confirmed by simulation-based experimental results
Behind the Last Line of Defense -- Surviving SoC Faults and Intrusions
Today, leveraging the enormous modular power, diversity and flexibility of
manycore systems-on-a-chip (SoCs) requires careful orchestration of complex
resources, a task left to low-level software, e.g. hypervisors. In current
architectures, this software forms a single point of failure and worthwhile
target for attacks: once compromised, adversaries gain access to all
information and full control over the platform and the environment it controls.
This paper proposes Midir, an enhanced manycore architecture, effecting a
paradigm shift from SoCs to distributed SoCs. Midir changes the way platform
resources are controlled, by retrofitting tile-based fault containment through
well known mechanisms, while securing low-overhead quorum-based consensus on
all critical operations, in particular privilege management and, thus,
management of containment domains. Allowing versatile redundancy management,
Midir promotes resilience for all software levels, including at low level. We
explain this architecture, its associated algorithms and hardware mechanisms
and show, for the example of a Byzantine fault tolerant microhypervisor, that
it outperforms the highly efficient MinBFT by one order of magnitude
Behind the Last Line of Defense -- Surviving SoC Faults and Intrusions
Today, leveraging the enormous modular power, diversity and flexibility of manycore systems-on-a-chip (SoCs) requires careful orchestration of complex resources, a task left to low-level software, e.g. hypervisors. In current architectures, this software forms a single point of failure and worthwhile target for attacks: once compromised, adversaries gain access to all information and full control over the platform and the environment it controls. This paper proposes Midir, an enhanced manycore architecture, effecting a paradigm shift from SoCs to distributed SoCs. Midir changes the way platform resources are controlled, by retrofitting tile-based fault containment through well known mechanisms, while securing low-overhead quorum-based consensus on all critical operations, in particular privilege management and, thus, management of containment domains. Allowing versatile redundancy management, Midir promotes resilience for all software levels, including at low level. We explain this architecture, its associated algorithms and hardware mechanisms and show, for the example of a Byzantine fault tolerant microhypervisor, that it outperforms the highly efficient MinBFT by one order of magnitude
A Reactive and Cycle-True IP Emulator for MPSoC Exploration
The design of MultiProcessor Systems-on-Chip
(MPSoC) emphasizes intellectual-property (IP)-based
communication-centric approaches. Therefore, for the optimization
of the MPSoC interconnect, the designer must develop
traffic models that realistically capture the application behavior
as executing on the IP core. In this paper, we introduce a
Reactive IP Emulator (RIPE) that enables an effective emulation
of the IP-core behavior in multiple environments, including bitand
cycle-true simulation. The RIPE is built as a multithreaded
abstract instruction-set processor, and it can generate reactive
traffic patterns. We compare the RIPE models with cycle-true
functional simulation of complex application behavior (tasksynchronization,
multitasking, and input/output operations).
Our results demonstrate high-accuracy and significant speedups.
Furthermore, via a case study, we show the potential use of the
RIPE in a design-space-exploration context
Multi-core devices for safety-critical systems: a survey
Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).Peer ReviewedPostprint (author's final draft
On the simulation and design of manycore CMPs
The progression of Moore’s Law has resulted in both embedded and performance
computing systems which use an ever increasing number of processing cores integrated
in a single chip. Commercial systems are now available which provide hundreds
of cores, and academics have proposed architectures for up to 1024 cores. Embedded
multicores are increasingly popular as it is easier to guarantee hard-realtime constraints
using individual cores dedicated for tasks, than to use traditional time-multiplexed processing.
However, finding the optimal hardware configuration to meet these requirements
at minimum cost requires extensive trial and error approaches to investigate the
design space.
This thesis tackles the problems encountered in the design of these large scale multicore
systems by first addressing the problem of fast, detailed micro-architectural simulation.
Initially addressing embedded systems, this work exploits the lack of hardware
cache-coherence support in many deeply embedded systems to increase the available
parallelism in the simulation. Then, through partitioning the NoC and using packet
counting and cycle skipping reduces the amount of computation required to accurately
model the NoC interconnect. In combination, this enables simulation speeds significantly
higher than the state of the art, while maintaining less error, when compared
to real hardware, than any similar simulator. Simulation speeds reach up to 370MIPS
(Million (target) Instructions Per Second), or 110MHz, which is better than typical
FPGA prototypes, and approaching final ASIC production speeds. This is achieved
while maintaining an error of only 2.1%, significantly lower than other similar simulators.
The thesis continues by scaling the simulator past large embedded systems up to
64-1024 core processors, adding support for coherent architectures using the same
packet counting techniques along with low overhead context switching to enable the
simulation of such large systems with stricter synchronisation requirements. The new
interconnect model was partitioned to enable parallel simulation to further improve
simulation speeds in a manner which did not sacrifice any accuracy.
These innovations were leveraged to investigate significant novel energy saving optimisations
to the coherency protocol, processor ISA, and processor micro-architecture.
By introducing a new instruction, with the name wait-on-address, the energy spent during
spin-wait style synchronisation events can be significantly reduced. This functions
by putting the core into a low-power idle state while the cache line of the indicated
address is monitored for coherency action. Upon an update or invalidation (or traditional
timer or external interrupts) the core will resume execution, but the active
energy of running the core pipeline and repeatedly accessing the data and instruction
caches is effectively reduced to static idle power. The thesis also shows that existing
combined software-hardware schemes to track data regions which do not require coherency
can adequately address the directory-associativity problem, and introduces a
new coherency sharer encoding which reduces the energy consumed by sharer invalidations
when sharers are grouped closely together, such as would be the case with a
system running many tasks with a small degree of parallelism in each.
The research concludes by using the extremely fast simulation speeds developed to
produce a large set of training data, collecting various runtime and energy statistics for
a wide range of embedded applications on a huge diverse range of potential MPSoC
designs. This data was used to train a series of machine learning based models which
were then evaluated on their capacity to predict performance characteristics of unseen
workload combinations across the explored MPSoC design space, using only two sample
simulations, with promising results from some of the machine learning techniques.
The models were then used to produce a ranking of predicted performance across the
design space, and on average Random Forest was able to predict the best design within
89% of the runtime performance of the actual best tested design, and better than 93%
of the alternative design space. When predicting for a weighted metric of energy, delay
and area, Random Forest on average produced results within 93% of the optimum
result.
In summary this thesis improves upon the state of the art for cycle accurate multicore
simulation, introduces novel energy saving changes the the ISA and microarchitecture
of future multicore processors, and demonstrates the viability of machine
learning techniques to significantly accelerate the design space exploration required to
bring a new manycore design to market
Predictable multi-processor system on chip design for multimedia applications
The design of multimedia systems has become increasingly complex due to consumer requirements. Consumers demand the functionalities offered by a huge desktop from these systems. Many of these systems are mobile. Therefore, power consumption and size of these devices should be small. These systems are increasingly becoming multi-processor based (MPSoCs) for the reasons of power and performance. Applications execute on these systems in different combinations also known as use-cases. Applications may have different performance requirements in each use-case. Currently, verification of all these use-cases takes bulk of the design effort. There is a need for analysis based techniques so that the platforms have a predictable behaviour and in turn provide guarantees on performance without expending precious man hours on verification. In this dissertation, techniques and architectures have been developed to design and manage these multi-processor based systems efficiently. The dissertation presents predictable architectural components for MPSoCs, a Predictable MPSoC design strategy, automatic platform synthesis tool, a run-time system and an MPSoC simulation technique. The introduction of predictability helps in rapid design of MPSoC platforms. Chapter 1 of the thesis studies the trends in modern multimedia applications and processor architectures. The chapter further highlights the problems in the design of MPSoC platforms and emphasizes the need of predictable design techniques. Predictable design techniques require predictable application and architectural components. The chapter further elaborates on Synchronous Data Flow Graphs which are used to model the applications throughout this thesis. The chapter presents the architecture template used in this thesis and enlists the contributions of the thesis. One of the contributions of this thesis is the design of a predictable component called communication assist. Chapter 2 of the thesis describes the architecture of this communication assist. The communication assist presented in this thesis not only decouples the communication from computation but also provides timing guarantees. Based on this communication assist, an MPSoC platform generation technique has been presented that can design MPSoC platforms capable of satisfying the throughput constraints of multiple applications in all use-cases. The technique is presented in Chapter 3. The design strategy uses three simple steps for platform design. In the first step it finds the required number of processors. The second step minimizes the communication interconnect between the processors and the third step minimizes the communication memory requirement of the platform. Further in Chapter 4, a tool has been developed to generate CA-based platforms for FPGAs. The output of this tool can be used to synthesize platforms on real hardware with the help of FPGA synthesis tools. The applications executing on these platforms often exhibit dynamism e.g. variation in task execution times and change in application throughput requirements. Further, new applications may often be added by consumers at run-time. Resource managers have been presented in literature to handle such dynamic situations. However, the scalability of these resource managers becomes an issue with the increase in number of processors and applications. Chapter 5 presents distributed run-time resource management techniques. Two versions of distributed resource managers have been presented which are scalable with the number of applications and processors. MPSoC platforms for real-time applications are designed assuming worst-case task execution times. It is known that the difference between average-case and worst-case behaviour can be quite large. Therefore, knowing the average case performance is also important for the system designer, and software simulation is often employed to estimate this. However, simulation in software is slow and does not scale with the number of applications and processing elements. In Chapter 6, a fast and scalable simulation methodology is introduced that can simulate the execution of multiple applications on an MPSoC platform. It is based on parallel execution of SDF (Synchronous Data Flow) models of applications. The simulation methodology uses Parallel Discrete Event Simulation (PDES) primitives and it is termed as "Smart Conservative PDES". The methodology generates a parallel simulator which is synthesizable on FPGAs. The framework can also be used to model dynamic arbitration policies which are difficult to analyse using models. The generated platform is also useful in carrying out Design Space Exploration as shown in the thesis. Finally, Chapter 7 summarizes the main findings and (practical) implications of the studies described in previous chapters of this dissertation. Using the contributions mentioned in the thesis, a designer can design and implement predictable multiprocessor based systems capable of satisfying throughput constraints of multiple applications in given set of use-cases, and employ resource management strategies to deal with dynamism in the applications. The chapter also describes the main limitations of this dissertation and makes suggestions for future research