Search CORE

1,969 research outputs found

Fast and Accurate TLM Simulations using Temporal Decoupling for FIFO-based Communications

Author: Cornet Jérôme
Galilée Bruno
Helmstetter Claude
Moy Matthieu
Vivet Pascal
Publication venue: HAL CCSD
Publication date: 01/01/2013
Field of study

International audienceUntimed models of large embedded systems, generally written using SystemC/TLM, allow the software team to start simulations before the RTL description is available, and then provide a golden reference model to the verification team. For those two purposes, only a correct functional behavior is required, but users are asking more and more for timing estimations early in the design flow. Because companies cannot afford to maintain two simulators for the same chip, only local modifications of the untimed model are considered. A known approach is to add timing annotations into the code and to reduce the number of costly context switches using temporal decoupling, meaning that a process can go ahead of the simulation time before synchronizing again. Our current goal is to apply temporal decoupling to the TLM platform of a many-core SoC dedicated to high performance computing. Part of this SoC communicates using classic memory-mapped buses, but it can be extended with hardware accelerators communicating using FIFOs. Whereas temporal decoupling for memory-based transactions has been widely studied, FIFO-based communications raise issues that have not been addressed before. In this paper, we provide an efficient solution to combine temporal decoupling and FIFO-based communications

Crossref

Hal - Université Grenoble Alpes

HAL-CEA

A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing

Author: Buyya Rajkumar
Ramamohanarao Kotagiri
Venugopal Srikumar
Publication venue
Publication date: 10/06/2005
Field of study

Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases. We then provide comprehensive taxonomies that cover various aspects of architecture, data transportation, data replication and resource allocation and scheduling. Finally, we map the proposed taxonomy to various Data Grid systems not only to validate the taxonomy but also to identify areas for future exploration. Through this taxonomy, we aim to categorise existing systems to better understand their goals and their methodology. This would help evaluate their applicability for solving similar problems. This taxonomy also provides a "gap analysis" of this area through which researchers can potentially identify new issues for investigation. Finally, we hope that the proposed taxonomy and mapping also helps to provide an easy way for new practitioners to understand this complex area of research.Comment: 46 pages, 16 figures, Technical Repor

arXiv.org e-Print Archive

CiteSeerX

University of Melbourne Institutional Repository

Modeling Power Consumption and Temperature in TLM Models

Author: Bouhadiba Tayeb
Helmstetter Claude
Maraninchi Florence
Moy Matthieu
Publication venue: European Design and Automation Association (EDAA) \ EMbedded Systems Special Interest Group (EMSIG) and Schloss Dagstuhl -- Leibniz-Zentrum für Informatik GmbH, Dagstuhl Publishing.
Publication date: 01/01/2016
Field of study

International audienceMany techniques and tools exist to estimate the power consumption and the temperature map of a chip. These tools help the hardware designers develop power efficient chips in the presence of temperature constraints. For this task, the application can be ignored or at least abstracted by some high level scenarios; at this stage, the actual embedded software is generally not available yet. However, after the hardware is defined, the embedded software can still have a significant influence on the power consumption; i.e., two implementations of the same application can consume more or less power. Moreover, the actual software powe

Hal - Université Grenoble Alpes

Architectural techniques to extend multi-core performance scaling

Author: Sohail Hamza Bin
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2015
Field of study

Multi-cores have successfully delivered performance improvements over the past decade; however, they now face problems on two fronts: power and off-chip memory bandwidth. Dennard\u27s scaling is effectively coming to an end which has lead to a gradual increase in chip power dissipation. In addition, sustaining off-chip memory bandwidth has become harder due to the limited space for pins on the die and greater current needed to drive the increasing load . My thesis focuses on techniques to address the power and off-chip memory bandwidth challenges in order to avoid the premature end of the multi-core era. ^ In the first part of my thesis, I focus on techniques to address the power problem. One option to cope with the power limit, as suggested by some recent papers, is to ensure that an increasing number of cores are kept powered down (i.e., dark silicon) due to lack of power; but this option imposes a low upper bound on performance. The alternative option of customizing the cores to improve power efficiency may incur increased effort for hardware design, verification and test, and degraded programmability. I propose a gentler evolutionary path for multi-cores, called successive frequency unscaling ( SFU), to cope with the slowing of Dennard\u27s scaling. SFU keeps powered significantly more cores (compared to the option of keeping them \u27dark\u27) running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. ^ In the second part of my thesis, I focus on techniques to avert the limited off-chip memory bandwidth problem. Die-stacking of DRAM on a processor die promises to continue scaling the pin bandwidth to off-chip memory. While the die-stacked DRAM is expected to be used as a cache, storing any part of the tag in the DRAM itself erodes the bandwidth advantage of die-stacking. As such, the on-die space overhead of the large DRAM cache\u27s tag is a concern. A well-known compromise is to employ a small on-die tag cache (T

) for the tag metadata while the full tag stays in the DRAM. However, tag caching fundamentally requires exploiting page-level metadata locality to ensure efficient use of the 3-D DRAM bandwidth. Plain sub-blocking exploits this locality but incurs holes in the cache (i.e., diminished DRAM cache capacity), whereas decoupled organizations avoid holes but destroy this locality. I propose Bandwidth-Efficient Tag Access (BETA) DRAM cache (β

) which avoids holes while exploiting the locality through various metadata organizational techniques. Using simulations, I conclusively show that the primary concern in DRAM caches is bandwidth and not latency, and that due to β

\u27s tag bandwidth efficiency, β

with a T

performs 15% better than the best previous scheme with a similarly-sized T

Purdue E-Pubs

CONTREX: Design of embedded mixed-criticality CONTRol systems under consideration of EXtra-functional properties

Author: Azzoni Paolo
Bocchio Sara
Brandolese Carlo
Ceva Luca
Cusenza Salvatore
Favaro John
Fornaciari William
Gadioli Davide
Goren Ralph
Gruttner Kim
Herrera Fernando
Macii Enrico
Medina Julio
Palermo Gianluca
Penil Pablo
Poncino Massimo
Quaglia Davide
Rosvall Kathrin
Sander Ingo
Valencia Raul
Villar Eugenio
Vinco Sara
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

The increasing processing power of today’s HW/SW platforms leads to the integration of more and more functions in a single device. Additional design challenges arise when these functions share computing resources and belong to different criticality levels. The paper presents the CONTREX European project and its preliminary results. CONTREX complements current activities in the area of predictable computing platforms and segregation mechanisms with techniques to consider the extra-functional properties, i.e., timing constraints, power, and temperature. CONTREX enables energy efficient and cost aware design through analysis and optimization of these properties with regard to application demands at different criticality levels

Archivio istituzionale della ricerca - Politecnico di Milano

Crossref

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Energy Efficient Network-on-Chip Architectures for Many-Core Near-Threshold Computing System

Author: Chakraborty Koushik
Rajamanikkam Chidhambaranathan
Rajesh Jayashankara S.
Samineni Meher
Publication venue: Hosted by Utah State University Libraries
Publication date: 01/06/2019
Field of study

Near threshold computing has unraveled a promising design space for energy efficient computing. However, it is still plagued by sub-optimal system performance. Application characteristics and hardware non-idealities of conventional architectures (those optimized for nominal voltage) prevent us from fully leveraging the potential of NTC systems. Increasing the computational core count still forms the bedrock of a multitude of contemporary works that address the problem of performance degradation in NTC systems. However, these works do not categorically address the shortcomings of the conventional on-chip interconnect fabric in a many core environment. In this work, we quantitatively demonstrate the performance bottleneck created by a conventional NTC architecture in many-core NTC systems. To reclaim the performance lost due to a sub-optimal NoC in many-core NTC systems, we propose BoostNoC—a power efficient, multi-layered network-on-chip architecture. BoostNoC improves the system performance by nearly 2× over a conventional NTC system, while largely sustaining its energy benefits. Further, capitalizing on the application characteristics, we propose two BoostNoC derivative designs: (i) PG BoostNoC; and (ii) Drowsy BoostNoC; to improve the energy efficiency by 1.4× and 1.37×, respectively over conventional NTC system

DigitalCommons@USU

MPSoCBench : um framework para avaliação de ferramentas e metodologias para sistemas multiprocessados em chip

Author: Garanhani Liana Dessandre Duenha, 1977-
Publication venue: [s.n.]
Publication date: 30/08/2018
Field of study

Orientador: Rodolfo Jardim de AzevedoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Recentes metodologias e ferramentas de projetos de sistemas multiprocessados em chip (MPSoC) aumentam a produtividade por meio da utilização de plataformas baseadas em simuladores, antes de definir os últimos detalhes da arquitetura. No entanto, a simulação só é eficiente quando utiliza ferramentas de modelagem que suportem a descrição do comportamento do sistema em um elevado nível de abstração. A escassez de plataformas virtuais de MPSoCs que integrem hardware e software escaláveis nos motivou a desenvolver o MPSoCBench, que consiste de um conjunto escalável de MPSoCs incluindo quatro modelos de processadores (PowerPC, MIPS, SPARC e ARM), organizado em plataformas com 1, 2, 4, 8, 16, 32 e 64 núcleos, cross-compiladores, IPs, interconexões, 17 aplicações paralelas e estimativa de consumo de energia para os principais componentes (processadores, roteadores, memória principal e caches). Uma importante demanda em projetos MPSoC é atender às restrições de consumo de energia o mais cedo possível. Considerando que o desempenho do processador está diretamente relacionado ao consumo, há um crescente interesse em explorar o trade-off entre consumo de energia e desempenho, tendo em conta o domínio da aplicação alvo. Técnicas de escalabilidade dinâmica de freqüência e voltagem fundamentam-se em gerenciar o nível de tensão e frequência da CPU, permitindo que o sistema alcance apenas o desempenho suficiente para processar a carga de trabalho, reduzindo, consequentemente, o consumo de energia. Para explorar a eficiência energética e desempenho, foram adicionados recursos ao MPSoCBench, visando explorar escalabilidade dinâmica de voltaegem e frequência (DVFS) e foram validados três mecanismos com base na estimativa dinâmica de energia e taxa de uso de CPUAbstract: Recent design methodologies and tools aim at enhancing the design productivity by providing a software development platform before the definition of the final Multiprocessor System on Chip (MPSoC) architecture details. However, simulation can only be efficiently performed when using a modeling and simulation engine that supports system behavior description at a high abstraction level. The lack of MPSoC virtual platform prototyping integrating both scalable hardware and software in order to create and evaluate new methodologies and tools motivated us to develop the MPSoCBench, a scalable set of MPSoCs including four different ISAs (PowerPC, MIPS, SPARC, and ARM) organized in platforms with 1, 2, 4, 8, 16, 32, and 64 cores, cross-compilers, IPs, interconnections, 17 parallel version of software from well-known benchmarks, and power consumption estimation for main components (processors, routers, memory, and caches). An important demand in MPSoC designs is the addressing of energy consumption constraints as early as possible. Whereas processor performance comes with a high power cost, there is an increasing interest in exploring the trade-off between power and performance, taking into account the target application domain. Dynamic Voltage and Frequency Scaling techniques adaptively scale the voltage and frequency levels of the CPU allowing it to reach just enough performance to process the system workload while meeting throughput constraints, and thereby, reducing the energy consumption. To explore this wide design space for energy efficiency and performance, both for hardware and software components, we provided MPSoCBench features to explore dynamic voltage and frequency scalability (DVFS) and evaluated three mechanisms based on energy estimation and CPU usage rateDoutoradoCiência da ComputaçãoDoutora em Ciência da Computaçã

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio da Producao Cientifica e Intelectual da Unicamp

GPU Based Acceleration of SystemC and Transaction Level Models for MPSOC Simulation

Author: Gangrade Rohit
Publication venue
Publication date: 08/07/2016
Field of study

With increasing number of cores on a chip, the complexity of modeling hardware using virtual prototype is increasing rapidly. Typical SOCs today have multipro-cessors connected through a bus or NOC architecture which can be modeled using SystemC framework. SystemC is a popular language used for early design exploration and performance analysis of complex embedded systems. TLM2.0, an extension of SystemC, is increasingly used in MPSOC designs for simulating loosely and approxi-mately timed transaction level models. The OSCI reference kernel which implements SystemC library runs on a single thread, slowing up the simulation speed to a large extent. Previous works have used the computational power of multi-core systems and GPUs which can run multiple threads simultaneously, speeding up the simu-lation. Multi-core simulations are not as eﬀective in cases where thread runtime is low, because synchronization overhead becomes comparable to thread runtime. Modern GPUs can run thousands of threads at a time and have shown good results for synthesizable designs in recent eﬀorts. However, development in these works are limited to synthesizable subsets of SystemC models, not supporting timed events for process communication. In this research work, a methodology is proposed for accelerating timed event based SystemC TLM2.0 model to GPU based kernel, which maps SystemC processes to CUDA threads in GPU, providing high data level par-allelism. This work aims to provide a scalable solution for simulating large MPSOC designs, facilitating early design exploration and performance analysis. Experiments have shown that the proposed technique provides a speed-up of the order of 100x for typical MPSOC designs

Texas A&M Repository

Store-Ordered Streaming of Shared Memory

Author: Ailamaki Anastassia
Falsafi Babak
Gniady Chris
Hardavellas Nikolaos
Kim Jangwoo
Somogyi Stephen
Wenisch Thomas F.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 23/01/2009
Field of study

Coherence misses in shared-memory multiprocessors account for a substantial fraction of execution time in many important scientific and commercial workloads. Memory streaming provides a promising solution to the coherence miss bottleneck because it improves memory level parallelism and lookahead while using on-chip resources efficiently. We observe that the order in which shared data are consumed by one processor is correlated to the order in which they were produced by another. We investigate this phenomenon and demonstrate that it can be exploited to send Store- ORDered Streams (SORDS) of shared data from producers to consumers, thereby eliminating coherent read misses. Using a trace-driven analysis of all user and OS memory references in a cache-coherent distributed shared- memory multiprocessor, we show that SORDS based memory streaming can eliminate between 36% and 100% of all coherent read misses in scientific workloads and between 23% and 48%in online transaction processing workloads

Infoscience - École polytechnique fédérale de Lausanne