30 research outputs found

Exploration within the Network-on-Chip Paradigm

Author: Wolkotte Pascal Theodoor
Publication venue: University of Twente
Publication date: 01/01/2009

A general purpose processor used to consist of a single processing core, which performed and controlled all tasks on the chip. Its functionality and maximum clock frequency grew steadily over the years. Due to the continuous increase of the number of transistors available on-chip and the operational clock frequency, it became impossible to reach every function within the chip in a single clock cycle. Furthermore, centralized control becomes hard with the increase in functionality. This led to the split of the processing into a set of independent processing cores integrated into a single chip.

These multi-core architectures will rely on a well-designed on-chip communication architecture. Global wires and bus-based systems need to be replaced to overcome the problem of wiring and the single point of arbitration. This is introduced as the Network-on-Chip (NoC) paradigm. Most of the communication architectures classified as a NoC are a network of routers on-chip, but the paradigm embodies a broader scope. The paradigm enables the sharing of on-chip wiring resources for multiple communication streams to reduce the total wiring required. Furthermore, it enables concurrent communication of concurrently handled data packets. The latter is in contrast to the central arbitration and single communication channel in bus-based systems.

In this thesis we explore the paradigm by implementation and characterization of multiple NoC router architectures. The scope of the communication architecture is the embedding in a heterogeneous multi-core System-on-Chip (SoC) for streaming applications. Six streaming applications, which are used in mobile devices, are analysed. Their common communication characteristics and specific bandwidth requirements are presented.
One of the major constraints of these applications is the requirement of Quality of Service (QoS) for the interprocess communication.

Based on application analysis we propose a circuit switched router architecture as opposed to a more flexible packet switched router architecture. The reason for this architecture is the observation that communication patterns in the applications are static. The circuit switched network is integrated in an ARM based heterogeneous reconfigurable multi-core SoC realized in a 0.13 μm CMOS technology.

Besides this architecture, an existing packet switched router architecture that also offers QoS is improved and compared with the circuit switched router. Next to the exploration of those two router designs, two other packet switched routers, designed at the University of Cambridge, are included in the in-depth comparison. The four routers are placed and routed in 90 nm CMOS technology. The required buffering dominates the resource usage of all packet switched routers, which is significantly reduced in a circuit switched architecture. However, the latter pays a penalty by a larger required crossbar and reduced flexibility.

The four routers are also compared for their latency performance and energy consumption. For latency the packet switched networks are simulated with popular synthetic traffic scenarios. The circuit switched router has a deterministic latency, due to the congestion free routes.

The latency analysis shows the higher network utilization for NoCs using virtual channel flow control over wormhole flow control. Furthermore, the allocation mechanisms used in the improved packet switched router cause a higher latency for randomly distributed packets compared to the router with speculation logic that is tailored for this type of traffic.
Despite its higher latency for random traffic, the packet switched network is able to give end-to-end latency guarantees for specific connections, due to deterministic arbitration, as is shown in this thesis.

For the power analysis we compared the four routers using various traffic scenarios. One of the first observations is a high power consumption in idle mode, where no data is transported. The clock-tree and the connected synchronous elements consume the majority of the power. A minor part is the static power, which is directly related to the router's required chip area. Automatic insertion of fine-grain clock gating tremendously reduces this idle dynamic power consumption. With clock-gating, both the static and dynamic component have an equal share in the idle power at a clock frequency of 200 MHz.

The increase in dynamic power consumption is directly related to the number of packets that are transported over the network and the number of bit flips, i.e. activity, in the payload. Transportation of random payload, i.e. 25% activity, requires almost a factor three more in comparison with a payload of constant values, i.e. all bits inactive. Random activity is observed in the analysed streaming applications for most of the intermediate data. The buffer size has no influence on the packet's dynamic energy consumption, due to the fine-grain clock gating, which makes the packet switched routers as energy efficient as the circuit switched router. Most of the difference in energy consumption between the routers is caused by the different crossbar dimensions and the extra bits in a packet which are required for routing and allocation. The larger crossbar is required for the circuit switched router to add flexibility, and for the improved packet switched router to enable QoS.
A marginal increase in energy consumption is caused by the network congestion.

During the design of the heterogeneous SoC architecture as well as the evaluation of the packet switched routers, we were hampered by the prohibitive simulation times of the architecture's bit and cycle accurate models. Motivated by simulation speed-ups of an FPGA in a Hardware-in-the-Loop (HIL) simulation, we developed a framework to simulate large many-core architectures on a single FPGA. Instead of the instantiation of the whole architecture in parallel in the FPGA, the individual cores are evaluated sequentially. Each core is modified such that the core's internal state and combinational functionality are separated.

As all cores in a homogeneous many-core architecture are identical, we can construct a single hyper core that embodies all combinational functionality of a single core. The state of the whole architecture, stored in the FPGA's memory blocks, is updated sequentially by offering a core's old state to the hyper core and storing its new state. Using the sequential simulation approach in an FPGA, we are able to simulate two to three orders of magnitude faster compared to cycle and bit-accurate simulations in software.
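
The sequential simulation idea in the abstract above (one shared block of combinational logic applied to each core's stored state in turn) can be illustrated with a minimal software sketch. This is not the thesis framework: the ring topology, the 8-bit adder behaviour, and the names `hyper_core` and `step_sequential` are assumptions chosen only to show the separation of state from combinational functionality.

```python
from typing import List, Tuple


def hyper_core(state: int, left_input: int) -> Tuple[int, int]:
    """Hypothetical combinational logic shared by all cores.

    Takes a core's old state and the previous-cycle output of its left
    neighbour, and returns (new_state, new_output). Here the core is an
    8-bit accumulator, purely for illustration.
    """
    new_state = (state + left_input) & 0xFF
    return new_state, new_state


def step_sequential(states: List[int],
                    outputs: List[int]) -> Tuple[List[int], List[int]]:
    """Simulate one clock cycle of all cores, one core at a time.

    The old state and old outputs play the role of the memory blocks
    holding the architecture's state. New values are written to separate
    lists so that every core sees the values of the previous cycle, just
    as it would if all cores ran in parallel.
    """
    n = len(states)
    new_states = [0] * n
    new_outputs = [0] * n
    for i in range(n):
        left = outputs[(i - 1) % n]  # assumed ring interconnect
        new_states[i], new_outputs[i] = hyper_core(states[i], left)
    return new_states, new_outputs


if __name__ == "__main__":
    states, outputs = [1, 2, 3], [1, 2, 3]
    for cycle in range(3):
        states, outputs = step_sequential(states, outputs)
        print(cycle, states)
```

Writing the new values into separate lists is the design choice that matters here: updating in place would let a core observe a neighbour's new output within the same simulated clock cycle, which a parallel hardware instantiation would not.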

University of Twente Research Information

Energy-Efficient NoC for Best-Effort Communication

Author: Becker Jens E.
Smit Gerard J.M.
Wolkotte Pascal T.
Publication venue: IEEE Circuits and Systems Society
Publication date: 01/01/2005

A Network-on-Chip (NoC) is an energy-efficient on-chip communication architecture for Multi-Processor System-on-Chip (MPSoC) architectures. In an earlier paper we proposed an energy-efficient reconfigurable circuit-switched NoC to reduce the energy consumption compared to a packet-switched NoC. In this paper we investigate a chordal slotted ring and a bus architecture that can be used to handle the best-effort traffic in the system and configure the circuit-switched network. Both architectures are compared on their latency behavior and power consumption. At the same clock frequency, the chordal ring has the major benefit of a lower latency and higher throughput, but the bus has a lower overall power consumption. However, if we tune the frequency of the network to meet the throughput requirements of the control network, we see that the ring consumes less energy per transported bit.

CiteSeerX

University of Twente Research Information

Low Power Implementation of Non Power-of-Two FFTs on Coarse-Grain Reconfigurable Architectures

Author: Quevremont Jérôme
Rivaton Arnaud
Smit Gerard
Wolkotte Pascal
Zhang Qiwei
Publication venue: Technology Foundation, STW
Publication date: 01/01/2005

The DRM standard for digital radio broadcast in the AM band requires integrated devices for radio receivers at very low power. A System on Chip (SoC) called DiMITRI was developed based on a dual ARM9 RISC core architecture. Analyses showed that most computation power is used in the Coded Orthogonal Frequency Division Multiplexing (COFDM) demodulation to compute Fast Fourier Transforms (FFT) and inverse transforms (IFFT) on complex samples. These FFTs have to be computed on non power-of-two numbers of samples, which is very uncommon in the signal processing world. The results obtained with this chip led to the objective to decrease the power dissipated by the COFDM demodulation part using a coarse-grain reconfigurable structure as a coprocessor. This paper introduces two different coarse-grain architectures: PACT XPP technology and the Montium, developed by the University of Twente, and presents the implementation of a Fast Fourier Transform on 1920 complex samples. The implementation result on the Montium shows a saving of a factor 35 in terms of processing time, and 14 in terms of power consumption compared to the RISC implementation, and a smaller area. Then, as a conclusion, the paper presents the next steps of the development and some development issues.
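
To make the non power-of-two case concrete: 1920 factors as 2^7 · 3 · 5, so a transform of that length can be decomposed with a mixed-radix Cooley-Tukey scheme rather than a pure radix-2 FFT. The sketch below is a generic textbook decomposition, not the Montium or PACT XPP implementation from the paper; the function names are assumptions made for illustration.

```python
import cmath
from typing import List, Sequence


def prime_factors(n: int) -> List[int]:
    """Return the prime factors of n in ascending order."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def direct_dft(x: Sequence[complex]) -> List[complex]:
    """O(n^2) reference DFT, used only to check the fast version."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n)
                for j in range(n)) for k in range(n)]


def mixed_radix_fft(x: Sequence[complex]) -> List[complex]:
    """Recursive decimation-in-time FFT for any length.

    The length n is split as n = p * m with p the smallest prime factor.
    The input is divided into p interleaved sub-sequences of length m,
    each is transformed recursively, and the results are combined with
    twiddle factors. A prime length falls back to a direct DFT.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    p = prime_factors(n)[0]
    m = n // p
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    out = []
    for k in range(n):
        acc = 0j
        for r in range(p):
            twiddle = cmath.exp(-2j * cmath.pi * r * k / n)
            acc += twiddle * subs[r][k % m]
        out.append(acc)
    return out


if __name__ == "__main__":
    print(prime_factors(1920))
    spectrum = mixed_radix_fft([1.0] + [0.0] * 1919)
    print(len(spectrum))
```

The sketch favours clarity over speed; a hardware mapping would typically fix the radix order in advance and precompute the twiddle factors.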

University of Twente Research Information

An Energy-Efficient Reconfigurable Circuit Switched Network-on-Chip

Author: Rauwerda Gerard K.
Smit Gerard J.M.
Smit Lodewijk T.
Wolkotte Pascal T.
Publication venue: IEEE Computer Society
Publication date: 01/01/2005

Network-on-Chip (NoC) is an energy-efficient on-chip communication architecture for multi-tile System-on-Chip (SoC) architectures. The SoC architecture, including its run-time software, can replace inflexible ASICs for future ambient systems. These ambient systems have to be flexible as well as energy-efficient. To find an energy-efficient solution for the communication network we analyze three wireless applications. Based on their communication requirements we observe that revisiting circuit switching techniques is beneficial. In this paper we propose a new energy-efficient reconfigurable circuit-switched Network-on-Chip. By physically separating the concurrent data streams we reduce the overall energy consumption. The circuit-switched router has been synthesized and analyzed for its power consumption in 0.13 μm technology. A 5-port circuit-switched router has an area of 0.05 mm² and runs at 1075 MHz. The proposed architecture consumes 3.5 times less energy compared to its packet-switched equivalent.

CiteSeerX

University of Twente Research Information

Energy Model of Networks-on-Chip and a Bus

Author: Becker Jens E.
Becker Jürgen
Kavaldjiev Nikolay
Smit Gerard J.M.
Wolkotte Pascal T.
Publication venue: IEEE Computer Society
Publication date: 01/01/2005

A Network-on-Chip (NoC) is an energy-efficient on-chip communication architecture for Multi-Processor System-on-Chip (MPSoC) architectures. In earlier papers we proposed two Network-on-Chip architectures based on packet-switching and circuit-switching. In this paper we derive an energy model for both NoC architectures to predict their energy consumption per transported bit. Both architectures are also compared with a traditional bus architecture. The energy model is primarily needed to find a near optimal run-time mapping (from an energy point of view) of inter-process communication to NoC links.

CiteSeerX

Crossref

University of Twente Research Information

An Approximate Maximum Common Subgraph Algorithm for Large Digital Circuits

Author: Hölzenspies Philip K.F.
Kuper Jan
Rutgers Jochem H.
Smit Gerard J.M.
Wolkotte Pascal T.
Publication venue: IEEE Computer Society
Publication date: 01/01/2010

This paper presents an approximate Maximum Common Subgraph (MCS) algorithm, specifically for directed, cyclic graphs representing digital circuits. Because of the application domain, the graphs have nice properties: they are very sparse; have many different labels; and most vertices have only one predecessor. The algorithm iterates over all vertices once and uses heuristics to find the MCS. It is linear in computational complexity with respect to the size of the graph. Experiments show that very large common subgraphs were found in graphs of up to 200,000 vertices within a few minutes, when a quarter or less of the graphs differ. The variation in run-time and quality of the result is low.

Crossref

University of Twente Research Information

A Virtual Channel Network-on-Chip for GT and BE traffic

Author: Jansen Pierre G.
Kavaldjiev Nikolay
Smit Gerard J.M.
Wolkotte Pascal T.
Publication venue: Centre for Telematics and Information Technology (CTIT)
Publication date: 01/01/2005

This paper presents an on-chip network for a run-time reconfigurable System-on-Chip. The network uses packet-switching with virtual channels. It can provide guaranteed services as well as best effort services. The guaranteed services are based on virtual channel allocation, in contrast to other on-chip networks where guarantees are provided by time-division multiplexing. The network is particularly suitable for systems in which the traffic is dominated by streams. We model the data traffic in the system and simulate the behaviour of the network with this model. The results show that the network is capable of handling the system traffic and can provide the required guarantees.

University of Twente Research Information

The Chameleon Architecture for Streaming DSP Applications

Author: Burgwal Marcel D. van de
Heysters Paul M.
Hölzenspies Philip K.F.
Kokkeler André B.J.
Smit Gerard J.M.
Wolkotte Pascal T.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2007

We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm² in a 130 nm process), is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast, for example, loading the coefficients for a 200 tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC) via a network interface (NI). Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT) and best effort (BE). For both NoCs estimates of power consumption are presented. The NI synchronizes data transfers, configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

University of Twente Research Information

An Energy and Performance Exploration of Network-on-Chip Architectures

Author: Arnab Banerjee
Gerard J. M. Smit
Pascal T. Wolkotte
Robert D. Mullins
Simon W. Moore
Publication venue: IEEE Circuits and Systems Society
Publication date: 01/01/2009

In this paper, we explore the designs of a circuit-switched router, a wormhole router, a quality-of-service (QoS) supporting virtual channel router and a speculative virtual channel router and accurately evaluate the energy-performance tradeoffs they offer. Power results from the designs placed and routed in a 90-nm CMOS process show that all the architectures dissipate significant idle state power. The additional energy required to route a packet through the router is then shown to be dominated by the data path. This leads to the key result that, if this trend continues, the use of more elaborate control can be justified and will not be immediately limited by the energy budget. A performance analysis also shows that dynamic resource allocation leads to the lowest network latencies, while static allocation may be used to meet QoS goals. Combining the power and performance figures then allows an energy-latency product to be calculated to judge the efficiency of each of the networks. The speculative virtual channel router was shown to have a very similar efficiency to the wormhole router, while providing a better performance, supporting its use for general purpose designs. Finally, area metrics are also presented to allow a comparison of implementation costs.
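
The energy-latency product mentioned in the abstract above can be read as a single figure of merit in which lower is better. The sketch below shows one plausible way to compute and rank it; the exact definition used in the paper is not given in the abstract, and the design names and numbers here are placeholders, not measured results.

```python
from typing import Dict, List, Tuple


def energy_latency_product(energy_per_packet: float,
                           avg_latency: float) -> float:
    """Multiply energy per packet by average latency (units are arbitrary)."""
    return energy_per_packet * avg_latency


def rank_by_efficiency(designs: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return design names ordered from lowest to highest product."""
    return sorted(
        designs,
        key=lambda name: energy_latency_product(*designs[name]),
    )


if __name__ == "__main__":
    # Placeholder values for illustration only: (energy, latency).
    placeholder = {"router_a": (1.0, 40.0), "router_b": (1.5, 20.0)}
    print(rank_by_efficiency(placeholder))
```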

CiteSeerX

University of Twente Research Information