192 research outputs found
Scalability of broadcast performance in wireless network-on-chip
Networks-on-Chip (NoCs) are currently the paradigm of choice to interconnect the cores of a chip multiprocessor. However, conventional NoCs may not suffice to fulfill the on-chip communication requirements of processors with hundreds or thousands of cores. The main reason is that the performance of such networks drops as the number of cores grows, especially in the presence of multicast and broadcast traffic. This not only limits the scalability of current multiprocessor architectures, but also sets a performance wall that prevents the development of architectures that generate moderate-to-high levels of multicast. In this paper, a Wireless Network-on-Chip (WNoC) where all cores share a single broadband channel is presented. Such design is conceived to provide low latency and ordered delivery for multicast/broadcast traffic, in an attempt to complement a wireline NoC that will transport the rest of communication flows. To assess the feasibility of this approach, the network performance of WNoC is analyzed as a function of the system size and the channel capacity, and then compared to that of wireline NoCs with embedded multicast support. Based on this evaluation, preliminary results on the potential performance of the proposed hybrid scheme are provided, together with guidelines for the design of MAC protocols for WNoC.Peer ReviewedPostprint (published version
Reliability-aware and energy-efficient system level design for networks-on-chip
2015 Spring.Includes bibliographical references.With CMOS technology aggressively scaling into the ultra-deep sub-micron (UDSM) regime and application complexity growing rapidly in recent years, processors today are being driven to integrate multiple cores on a chip. Such chip multiprocessor (CMP) architectures offer unprecedented levels of computing performance for highly parallel emerging applications in the era of digital convergence. However, a major challenge facing the designers of these emerging multicore architectures is the increased likelihood of failure due to the rise in transient, permanent, and intermittent faults caused by a variety of factors that are becoming more and more prevalent with technology scaling. On-chip interconnect architectures are particularly susceptible to faults that can corrupt transmitted data or prevent it from reaching its destination. Reliability concerns in UDSM nodes have in part contributed to the shift from traditional bus-based communication fabrics to network-on-chip (NoC) architectures that provide better scalability, performance, and utilization than buses. In this thesis, to overcome potential faults in NoCs, my research began by exploring fault-tolerant routing algorithms. Under the constraint of deadlock freedom, we make use of the inherent redundancy in NoCs due to multiple paths between packet sources and sinks and propose different fault-tolerant routing schemes to achieve much better fault tolerance capabilities than possible with traditional routing schemes. The proposed schemes also use replication opportunistically to optimize the balance between energy overhead and arrival rate. As 3D integrated circuit (3D-IC) technology with wafer-to-wafer bonding has been recently proposed as a promising candidate for future CMPs, we also propose a fault-tolerant routing scheme for 3D NoCs which outperforms the existing popular routing schemes in terms of energy consumption, performance and reliability. To quantify reliability and provide different levels of intelligent protection, for the first time, we propose the network vulnerability factor (NVF) metric to characterize the vulnerability of NoC components to faults. NVF determines the probabilities that faults in NoC components manifest as errors in the final program output of the CMP system. With NVF aware partial protection for NoC components, almost 50% energy cost can be saved compared to the traditional approach of comprehensively protecting all NoC components. Lastly, we focus on the problem of fault-tolerant NoC design, that involves many NP-hard sub-problems such as core mapping, fault-tolerant routing, and fault-tolerant router configuration. We propose a novel design-time (RESYN) and a hybrid design and runtime (HEFT) synthesis framework to trade-off energy consumption and reliability in the NoC fabric at the system level for CMPs. Together, our research in fault-tolerant NoC routing, reliability modeling, and reliability aware NoC synthesis substantially enhances NoC reliability and energy-efficiency beyond what is possible with traditional approaches and state-of-the-art strategies from prior work
Exploiting Properties of CMP Cache Traffic in Designing Hybrid Packet/Circuit Switched NoCs
Chip multiprocessors with few to tens of processing cores are already commercially available. Increased scaling of technology is making it feasible to integrate even more cores on a single chip. Providing the cores with fast access to data is vital to overall system performance. When a core requires access to a piece of data, the core's private cache memory is searched first. If a miss occurs, the data is looked up in the next level(s) of the memory hierarchy, where often one or more levels of cache are shared between two or more cores. Communication between the cores and the slices of the on-chip shared cache is carried through the network-on-chip(NoC). Interestingly, the cache and NoC mutually affect the operation of each other; communication over the NoC affects the access latency of cache data, while the cache organization generates the coherence and data messages, thus affecting the communication patterns and latency over the NoC.
This thesis considers hybrid packet/circuit switched NoCs, i.e., packet switched NoCs enhanced with the ability to configure circuits. The communication and performance benefit that come from using circuits is predicated on amortizing the time cost incurred for configuring the circuits. To address this challenge, NoC designs are proposed that take advantage of properties of the cache traffic, namely temporal locality and predictability, to amortize or hide the circuit configuration time cost.
First, a coarse-grained circuit configuration policy is proposed that exploits the temporal locality in the cache traffic to periodically configure circuits for the heavily communicating nodes. This allows the design of a locality-aware cache that promotes temporal communication locality through data placement, while designing suitable data replacement and migration policies.
Next, a fine-grained configuration policy, called DĂ©jĂ Vu switching, is proposed for leveraging predictability of data messages by initiating a circuit configuration as soon as a cache hit is detected and before the data becomes available. Its benefit is demonstrated for saving interconnect energy in multi-plane NoCs.
Finally, a more proactive configuration policy is proposed for fast caches, where circuit reservations are initiated by request messages, which can greatly improve communication latency and system performance
Addressing Manufacturing Challenges in NoC-based ULSI Designs
Hernández Luz, C. (2012). Addressing Manufacturing Challenges in NoC-based ULSI Designs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/1669
Architectural Support for Efficient Communication in Future Microprocessors
Traditionally, the microprocessor design has focused on the computational aspects
of the problem at hand. However, as the number of components on a single chip
continues to increase, the design of communication architecture has become a crucial
and dominating factor in defining performance models of the overall system. On-chip
networks, also known as Networks-on-Chip (NoC), emerged recently as a promising
architecture to coordinate chip-wide communication.
Although there are numerous interconnection network studies in an inter-chip
environment, an intra-chip network design poses a number of substantial challenges
to this well-established interconnection network field. This research investigates designs
and applications of on-chip interconnection network in next-generation microprocessors
for optimizing performance, power consumption, and area cost. First,
we present domain-specific NoC designs targeted to large-scale and wire-delay dominated
L2 cache systems. The domain-specifically designed interconnect shows 38%
performance improvement and uses only 12% of the mesh-based interconnect. Then,
we present a methodology of communication characterization in parallel programs
and application of characterization results to long-channel reconfiguration. Reconfigured
long channels suited to communication patterns enhance the latency of the
mesh network by 16% and 14% in 16-core and 64-core systems, respectively. Finally,
we discuss an adaptive data compression technique that builds a network-wide frequent value pattern map and reduces the packet size. In two examined multi-core
systems, cache traffic has 69% compressibility and shows high value sharing among
flows. Compression-enabled NoC improves the latency by up to 63% and saves energy
consumption by up to 12%
Scale-Out Processors
Global-scale online services, such as Google’s Web search and Facebook’s social networking, run in large-scale datacenters. Due to their massive scale, these services are designed to scale out (or distribute) their respective loads and datasets across thousands of servers in datacenters. The growing demand for online services forced service providers to build networks of datacenters, which require an enormous capital outlay for infrastructure, hardware, and power consumption. Consequently, efficiency has become a major concern in the design and operation of such datacenters, with processor efficiency being of, utmost importance, due to the significant contribution of processors to the overall datacenter performance and cost. Scale-out workloads, which are behind today’s online services, serve independent requests, and have large instruction footprints and little data locality. As such, they benefit from processor designs that feature many cores and a modestly sized Last-Level Cache (LLC), a fast access path to the LLC, and high-bandwidth interfaces to memory. Existing server-class processors with large LLCs and a handful of aggressive out-of-order cores are inefficient in executing scale-out workloads. Moreover, the scaling trajectory for these processors leads to even lower efficiency in future technology nodes. This thesis presents a family of throughput-optimal processors, called Scale-Out Processors, for the efficient execution of scale-out workloads. A unique feature of Scale-Out Processors is that they consist of multiple stand-alone modules, called pods, wherein each module is a server running an operating system and a full software stack. To design a throughput-optimal processor, we developed a methodology based on performance density, defined as throughput per unit area, to quantify how effectively an architecture uses the silicon real estate. The proposed methodology derives a performance-density optimal processor building block (i.e., pod), which tightly couples a number of cores to a small LLC via a fast interconnect. Scale-Out Processors simply consist of multiple pods with no inter-pod connectivity or coherence. Moreover, they deliver the highest throughput in today’s technology and afford near-ideal scalability as process technology advances. We demonstrate that Scale-Out Processors improve datacenters’ efficiency by 4.4x-7.1x over datacenters designed using existing server-class processors
Recommended from our members
On Multicast in Asynchronous Networks-on-Chip: Techniques, Architectures, and FPGA Implementation
In this era of exascale computing, conventional synchronous design techniques are facing unprecedented challenges. The consumer electronics market is replete with many-core systems in the range of 16 cores to thousands of cores on chip, integrating multi-billion transistors. However, with this ever increasing complexity, the traditional design approaches are facing key issues such as increasing chip power, process variability, aging, thermal problems, and scalability. An alternative paradigm that has gained significant interest in the last decade is asynchronous design. Asynchronous designs have several potential advantages: they are naturally energy proportional, burning power only when active, do not require complex clock distribution, are robust to different forms of variability, and provide ease of composability for heterogeneous platforms. Networks-on-chip (NoCs) is an interconnect paradigm that has been introduced to deal with the ever-increasing system complexity. NoCs provide a distributed, scalable, and efficient interconnect solution for today’s many-core systems. Moreover, NoCs are a natural match with asynchronous design techniques, as they separate communication infrastructure and timing from the computational elements. To this end, globally-asynchronous locally-synchronous (GALS) systems that interconnect multiple processing cores, operating at different clock speeds, using an asynchronous NoC, have gained significant interest. While asynchronous NoCs have several advantages, they also face a key challenge of supporting new types of traffic patterns. Once such pattern is multicast communication, where a source sends packets to arbitrary number of destinations. Multicast is not only common in parallel computing, such as for cache coherency, but also for emerging areas such as neuromorphic computing. This important capability has been largely missing from asynchronous NoCs. This thesis introduces several efficient multicast solutions for these interconnects. In particular, techniques, and network architectures are introduced to support high-performance and low-power multicast. Two leading network topologies are the focus: a variant mesh-of-trees (MoT) and a 2D mesh. In addition, for a more realistic implementation and analysis, as well as significantly advancing the field of asynchronous NoCs, this thesis also targets synthesis of these NoCs on commercial FPGAs. While there has been significant advances in FPGA technologies, there has been only limited research on implementing asynchronous NoCs on FPGAs. To this end, a systematic computeraided design (CAD) methodology has been introduced to efficiently and safely map asynchronous NoCs on FPGAs. Overall, this thesis makes the following three contributions. The first contribution is a multicast solution for a variant MoT network topology. This topology consists of simple low-radix switches, and has been used in high-performance computing platforms. A novel local speculation technique is introduced, where a subset of the network’s switches are speculative that always broadcast every packet. These switches are very simple and have high performance. Speculative switches are surrounded by non-speculative ones that route packets based on their destinations and also throttle any redundant copies created by the former. This hybrid network architecture achieved significant performance and power benefits over other multicast approaches. The second contribution is a multicast solution for a 2D-mesh topology, which is more complex with higher-radix switches and also is more commonly used. A novel continuous-time replication strategy is introduced to optimize the critical multi-way forking operation of a multicast transmission. In this technique, a multicast packet is first stored in an input port of a switch, from where it is sent through distinct output ports towards different destinations concurrently, at each output’s own rate and in continuous time. This strategy is shown to have significant latency and energy benefits over an approach that performs multicast using multiple distinct serial unicasts to each destination. Finally, a systematic CAD methodology is introduced to synthesize asynchronous NoCs on commercial FPGAs. A two-fold goal is targeted: correctness and high performance. For ease of implementation, only existing FPGA synthesis tools are used. Moreover, since asynchronous NoCs involve special asynchronous components, a comprehensive guide is introduced to map these elements correctly and efficiently. Two asynchronous NoC switches are synthesized using the proposed approach on a leading Xilinx FPGA in 28 nm: one that only handles unicast, and the other that also supports multicast. Both showed significant energy benefits with some performance gains over a state-of-the-art synchronous switch
Implementation of Bus-Based and NoC-Based MP3 Decoders on FPGA
The trend of modern System-on-Chip (SoC) design is increasing in size and number of Processing Elements (PE) for various and general purpose tasks. Emergence of Field Programmable Gate Array (FPGA) into the world of technology has lowered the limitations faced by Application Specific Integrated Circuit (ASIC) design. FPGA has a less timeto- market and is a perfect candidate for prototyping purposes due to the flexibility they
create for the design and this is the key feature of the FPGA technology. Technology advancements have introduced reconfiguration concepts which increase the flexibility of FPGA designs more. One method to improve SoC's performance is to adopt a sophi sticated communication medium between PEs to achieve a high throughput. Bus architecture has been improved to meet the requirements of high-performance SoCs, however, its inherently poor scalability limjts their enhancement. The Network-on-Chip (NoC) design paradigm has emerged to overcome the scalability limitations of point-to-point and bus communkation. This thesis presents an investigation towards NoC versus bus based implementation of an SoC. An MP3 decoder has been selected as an application to be implemented on the proposed design. The final design in the thes is demonstrated that the NoC based MP3 decoder achieves a 14% faster clock frequency and real time operation with the NoC based
design decode an MP3 frame on average in 10% less time that the bus based MP3 decoder
Design Space Exploration and Resource Management of Multi/Many-Core Systems
The increasing demand of processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips as they facilitate parallel processing. However, there is a desire for these platforms to be energy-efficient and reliable, and they need to perform secure computations for the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends
Network-on-Chip
Limitations of bus-based interconnections related to scalability, latency, bandwidth, and power consumption for supporting the related huge number of on-chip resources result in a communication bottleneck. These challenges can be efficiently addressed with the implementation of a network-on-chip (NoC) system. This book gives a detailed analysis of various on-chip communication architectures and covers different areas of NoCs such as potentials, architecture, technical challenges, optimization, design explorations, and research directions. In addition, it discusses current and future trends that could make an impactful and meaningful contribution to the research and design of on-chip communications and NoC systems
- …