161 research outputs found
Exploiting memory allocations in clusterized many-core architectures
Power-efficient architectures have become the most important feature required for future embedded systems. Modern
designs, like those released on mobile devices, reveal that clusterization is the way to improve energy efficiency. However, such
architectures are still limited by the memory subsystem (i.e., memory latency problems). This work investigates an alternative
approach that exploits on-chip data locality to a large extent, through distributed shared memory systems that permit efficient
reuse of on-chip mapped data in clusterized many-core architectures. First, this work reviews the current literature on memory
allocations and explore the limitations of cluster-based many-core architectures. Then, several memory allocations are introduced
and benchmarked scalability, performance and energy-wise, compared to the conventional centralized shared memory solution to
reveal which memory allocation is the most appropriate for future mobile architectures. Our results show that distributed shared
memory allocations bring performance gains and opportunities to reduce energy consumption
Energy-Efficient and Reliable Computing in Dark Silicon Era
Dark silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency is decreasing in each technology generation. Moore’s law and Dennard scaling had been backed and coupled appropriately for five decades to bring commensurate exponential performance via single core and later muti-core design. However, recalculating Dennard scaling for recent small technology sizes shows that current ongoing multi-core growth is demanding exponential thermal design power to achieve linear performance increase. This process hits a power wall where raises the amount of dark or dim silicon on future multi/many-core chips more and more. Furthermore, from another perspective, by increasing the number of transistors on the area of a single chip and susceptibility to internal defects alongside aging phenomena, which also is exacerbated by high chip thermal density, monitoring and managing the chip reliability before and after its activation is becoming a necessity. The proposed approaches and experimental investigations in this thesis focus on two main tracks: 1) power awareness and 2) reliability awareness in dark silicon era, where later these two tracks will combine together. In the first track, the main goal is to increase the level of returns in terms of main important features in chip design, such as performance and throughput, while maximum power limit is honored. In fact, we show that by managing the power while having dark silicon, all the traditional benefits that could be achieved by proceeding in Moore’s law can be also achieved in the dark silicon era, however, with a lower amount. Via the track of reliability awareness in dark silicon era, we show that dark silicon can be considered as an opportunity to be exploited for different instances of benefits, namely life-time increase and online testing. We discuss how dark silicon can be exploited to guarantee the system lifetime to be above a certain target value and, furthermore, how dark silicon can be exploited to apply low cost non-intrusive online testing on the cores. After the demonstration of power and reliability awareness while having dark silicon, two approaches will be discussed as the case study where the power and reliability awareness are combined together. The first approach demonstrates how chip reliability can be used as a supplementary metric for power-reliability management. While the second approach provides a trade-off between workload performance and system reliability by simultaneously honoring the given power budget and target reliability
Optimizing Coherence Traffic in Manycore Processors Using Closed-Form Caching/Home Agent Mappings
[Abstract]
Manycore processors feature a high number of general-purpose cores designed to work in a multithreaded fashion. Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Intel Mesh interconnect, which consists of a network-on-chip interconnecting “tiles”, each of which contains computation cores, local caches, and coherence masters. The distributed coherence subsystem must be queried for every out-of-tile access, imposing an overhead on memory latency. This paper studies the physical layout of an Intel Knights Landing processor, with a particular focus on the coherence subsystem, and uncovers the pseudo-random mapping function of physical memory blocks across the pieces of the distributed directory. Leveraging this knowledge, candidate optimizations to improve memory latency through the minimization of coherence traffic are studied. Although these optimizations do improve memory throughput, ultimately this does not translate into performance gains due to inherent overheads stemming from the computational complexity of the mapping functions.Ministerio de Educación; FPU16/00816U.S. National Science Foundation; CCF-1750399Xunta de Galicia and FEDER; ED431G 2019/01Ministerio de Ciencia e Innovación; PID2019-104184RB-I0
EM2: A Scalable Shared-Memory Multicore Architecture
We introduce the Execution Migration Machine (EM2), a novel, scalable shared-memory architecture for large-scale multicores constrained by off-chip memory bandwidth. EM2 reduces cache miss rates, and consequently off-chip memory usage, by permitting only one copy of data to be stored anywhere in the system: when a thread wishes to access an address not locally cached on the core it is executing on, it migrates to the appropriate core and continues execution. Using detailed simulations of a range of 256-core configurations on the SPLASH-2 benchmark suite, we show that EM2 improves application completion times by 18% on the average while remaining competitive with traditional architectures in silicon area
Performance evaluation of caching techniques for video on demand workload in named data network
The rapid growing use of the Internet in the contemporary context is mainly for content
distribution. This is derived primarily due to the emergence of Information-Centric Networking (ICN) in the wider domains of academia and industry. Named Data Network (NDN) is one of ICN architectures. In addition, the NDN has been emphasized as the video traffic architecture that ensures smooth communication between the request and receiver of online video. The concise research problem of the current study is the issue of congestion in Video on Demand (VoD) workload caused by frequent storing of signed content object in the local repositories, which leads to buffering problems and data packet loss. The study will assess the NDN cache techniques to select the preferable cache replacement technique suitable for dealing with the congestion issues, and evaluate its performance. To do that, the current study adopts a research process based on the Design Research Methodology (DRM) and VoD approach in order to explain the main activities that produced an increase in the expected findings at the end of the activities or research. Datasets, as well as Internet2 network topology and the statistics of video views were gathered from the PPTV platform. Actually, a total of 221 servers is connected to the network from the same access points as in the real deployment
of PPTV. In addition, an NS3 analysis the performance metrics of caching replacement
technique (LRU, LFU, and FIFO) for VoD in Named Data Network (NDN) in terms of cache hit ratio, throughput, and server load results in reasonable outcomes that appears to serve as a potential replacement with the current implementation of the Internet2 topology, where nodes are distributed randomly. Based on the results, LFU technique gives the preferable result for congestion from among the presented
techniques. Finally, the research finds that the performance metrics of cache hit ratio,
throughput, and server load for the LFU that produces the lowest congestion rate which
is sufficient. Therefore, the researcher concluded that the efficiency of the different replacement techniques needs to be well investigated in order to provide the insights
necessary to implement these techniques in certain context. However, this result enriches
the current understanding of replacement techniques in handling different cache sizes. After having addressed the different replacement techniques and examined their
performances, the performance characteristics along with their expected performance were also found to stimulate a cache model for providing a relatively fast running time of across a broad range of embedded applications
CROSS-LAYER DESIGN, OPTIMIZATION AND PROTOTYPING OF NoCs FOR THE NEXT GENERATION OF HOMOGENEOUS MANY-CORE SYSTEMS
This thesis provides a whole set of design methods to enable and manage the
runtime heterogeneity of features-rich industry-ready Tile-Based Networkon-
Chips at different abstraction layers (Architecture Design, Network Assembling,
Testing of NoC, Runtime Operation). The key idea is to maintain
the functionalities of the original layers, and to improve the performance
of architectures by allowing, joint optimization and layer coordinations. In
general purpose systems, we address the microarchitectural challenges by codesigning
and co-optimizing feature-rich architectures. In application-specific
NoCs, we emphasize the event notification, so that the platform is continuously
under control. At the network assembly level, this thesis proposes a
Hold Time Robustness technique, to tackle the hold time issue in synchronous
NoCs. At the network architectural level, the choice of a suitable synchronization
paradigm requires a boost of synthesis flow as well as the coexistence
with the DVFS. On one hand this implies the coexistence of mesochronous
synchronizers in the network with dual-clock FIFOs at network boundaries.
On the other hand, dual-clock FIFOs may be placed across inter-switch links
hence removing the need for mesochronous synchronizers. This thesis will
study the implications of the above approaches both on the design flow and
on the performance and power quality metrics of the network. Once the manycore
system is composed together, the issue of testing it arises. This thesis
takes on this challenge and engineers various testing infrastructures. At the
upper abstraction layer, the thesis addresses the issue of managing the fully
operational system and proposes a congestion management technique named
HACS. Moreover, some of the ideas of this thesis will undergo an FPGA
prototyping. Finally, we provide some features for emerging technology by
characterizing the power consumption of Optical NoC Interfaces
- …