3,652 research outputs found
Congestion-aware wireless network-on-chip for high-speed communication
The design of system-on-chip (SoC) requires the complex integration between a multi-number of cores on a single chip. To establish the effective communication between multiple cores there aremore challenging issues on designing the network-on-chip (NoC) architectures. The proposed system deals with the utilization of on-chip antennas for the wireless communication between the long distance cores to minimize the latency and power. In this proposed work, we have designed high-speed wireless NoC (WiNoC) for on-chip communication. This high-speed WiNoC has been achieved by designing a congestion measure unit, which monitors and measures the congestion in the input data and establishes the effective wireless communication between the output channels and routers. The designed architecture is synthesized and implemented by using Altera Quartus II, where the SoC is designed using Qsys builder. The proposed WiNoC shows better performance parameters like throughput, latency and power than the conventional NoC
A Scalable Packet-Switch Based on Output-Queued NoCs for Data Centre Networks
The switch fabric in a Data-Center Network (DCN) handles constantly variable loads. This is stressing the need for high-performance packet switches able to keep pace with climbing throughput while maintaining resiliency and scalability. Conventional multistage switches with their space-memory variants proved to be performance limited as they do not scale well with the proliferating DC requirements. Most proposals are either too complex to implement or not cost effective. In this paper, we present a highly scalable multistage switching architecture for DC switching fabrics. We describe a three-stage Clos packet-switch fabric with Output-Queued Unidirectional NoC (OQ-UDN) modules and Round-Robin packets dispatching scheme. The proposed OQ Clos-UDN architecture avoids the need for complex and costly input modules and simplifies the scheduling process. Thanks to a dynamic packets dispatching and the multi-hop nature of the UDN modules, the switch provides load balancing and path-diversity. We compared our proposed architecture to state-of-the art previous architectures under extensive uniform and non-uniform DC traffic settings. Simulations of various switch settings have shown that the proposed OQ Clos-UDN outperforms previous proposals and maintains high throughput and latency performance
A efficacy of different buffer size on latency of network on chip (NoC)
Moore's prediction has been used to set targets for research and development in semiconductor industry for years now. A burgeoning number of processing cores on a chip demand competent and scalable communication architecture such as network-on-chip (NoC). NoC technology applies networking theory and methods to on-chip communication and brings noteworthy improvements over conventional bus and crossbar interconnections. Calculated performances such as latency, throughput, and bandwidth are characterized at design time to assured the performance of NoC. However, if communication pattern or parameters set like buffer size need to be altered, there might result in large area and power consumption or increased latency. Routers with large input buffers improve the efficiency of NoC communication while routers with small buffers reduce power consumption but result in high latency. This paper intention is to validate that size of buffer exert influence to NoC performance in several different network topologies. It is concluded that the way in which routers are interrelated or arranged affect NoC’s performance (latency) where different buffer sizes were adapted. That is why buffering requirements for different routers may vary based on their location in the network and the tasks assigned to them
Bibliometric Review of NoC Router Optimization
Network on chip (NoC) has been proposed as an emerging solution for scalability and performance demands of next generation System on Chip (SoC). NoC provides a solution for the bus based interconnection issue of SoC, where large numbers of Intellectual Property modules (IP) are integrated on a single chip for better performance. The NoC has several advantages such as scalability, low latency and low power consumption, high bandwidth over dedicated wires and buses. Interconnections between multiple chip cores have a significant impact on the communication and performance of the chip design in terms of region, latency, throughput and power. In the NoC architecture, the router is a dominant component that significantly affects the performance of the NoC. NoC router architectures evolved since the year 2002 and progress in the domain pertaining to the optimization in the NoC router architectures has been discussed. The key objective of this bibliometric review is to understand the extent of the existing literature in the domain of performance efficient NoC router architectures. The bibliometric analysis is primarily based on data extracted from Scopus. It reveals that major contributions are done by researchers from USA, China followed by India in the form of conference, journals and articles publications. The major contribution is by the subject areas of Computer Science and Engineering followed by Mathematics and Material Science. The geographical analysis is done by using the GPS visualize tool. The clusters were created using Gephi
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
A recent trend in DNN development is to extend the reach of deep learning
applications to platforms that are more resource and energy constrained, e.g.,
mobile devices. These endeavors aim to reduce the DNN model size and improve
the hardware processing efficiency, and have resulted in DNNs that are much
more compact in their structures and/or have high data sparsity. These compact
or sparse models are different from the traditional large ones in that there is
much more variation in their layer shapes and sizes, and often require
specialized hardware to exploit sparsity for performance improvement. Thus,
many DNN accelerators designed for large DNNs do not perform well on these
models. In this work, we present Eyeriss v2, a DNN accelerator architecture
designed for running compact and sparse DNNs. To deal with the widely varying
layer shapes and sizes, it introduces a highly flexible on-chip network, called
hierarchical mesh, that can adapt to the different amounts of data reuse and
bandwidth requirements of different data types, which improves the utilization
of the computation resources. Furthermore, Eyeriss v2 can process sparse data
directly in the compressed domain for both weights and activations, and
therefore is able to improve both processing speed and energy efficiency with
sparse models. Overall, with sparse MobileNet, Eyeriss v2 in a 65nm CMOS
process achieves a throughput of 1470.6 inferences/sec and 2560.3 inferences/J
at a batch size of 1, which is 12.6x faster and 2.5x more energy efficient than
the original Eyeriss running MobileNet. We also present an analysis methodology
called Eyexam that provides a systematic way of understanding the performance
limits for DNN processors as a function of specific characteristics of the DNN
model and accelerator design; it applies these characteristics as sequential
steps to increasingly tighten the bound on the performance limits.Comment: accepted for publication in IEEE Journal on Emerging and Selected
Topics in Circuits and Systems. This extended version on arXiv also includes
Eyexam in the appendi
Castell: a heterogeneous cmp architecture scalable to hundreds of processors
Technology improvements and power constrains have taken multicore architectures to dominate
microprocessor designs over uniprocessors. At the same time, accelerator based architectures
have shown that heterogeneous multicores are very efficient and can provide high throughput for
parallel applications, but with a high-programming effort. We propose Castell a scalable chip
multiprocessor architecture that can be programmed as uniprocessors, and provides the high
throughput of accelerator-based architectures.
Castell relies on task-based programming models that simplify software development. These
models use a runtime system that dynamically finds, schedules, and adds hardware-specific features
to parallel tasks. One of these features is DMA transfers to overlap computation and data
movement, which is known as double buffering. This feature allows applications on Castell
to tolerate large memory latencies and lets us design the memory system focusing on memory
bandwidth.
In addition to provide programmability and the design of the memory system, we have used
a hierarchical NoC and added a synchronization module. The NoC design distributes memory
traffic efficiently to allow the architecture to scale. The synchronization module is a consequence
of the large performance degradation of application for large synchronization latencies.
Castell is mainly an architecture framework that enables the definition of domain-specific
implementations, fine-tuned to a particular problem or application. So far, Castell has been
successfully used to propose heterogeneous multicore architectures for scientific kernels, video
decoding (using H.264), and protein sequence alignment (using Smith-Waterman and clustalW).
It has also been used to explore a number of architecture optimizations such as enhanced DMA
controllers, and architecture support for task-based programming models.
ii
- …