The effect of an optical network on-chip on the performance of chip multiprocessors
Optical networks on-chip (ONoC) have been proposed to reduce power consumption and increase bandwidth density in high performance chip multiprocessors (CMP), compared to electrical NoCs. However, as buffering in an ONoC is not viable, the end-to-end message path needs to be acquired in advance, during which time the message is buffered at the network ingress. This waiting latency is therefore a combination of path setup latency and contention, and forms a significant part of the total message latency. Many proposed ONoCs, such as Single Writer, Multiple Reader (SWMR), avoid path setup latency at the expense of more optical components. In contrast, this thesis investigates a simple circuit-switched ONoC with a lower component count, where nodes need to request a channel before transmission. To hide the path setup latency, a coherence-based message predictor is proposed to set up circuits before message arrival. Firstly, the effect of latency and bandwidth on application performance is thoroughly investigated using full-system simulations of shared memory CMPs. It is shown that the latency of an ideal NoC affects the CMP performance more than the NoC bandwidth. Increasing the number of wavelengths per channel decreases the serialisation latency and improves the performance of both ONoC types. With 2 or more wavelengths modulating at 25 Gbit/s, the ONoCs will outperform a conventional electrical mesh (maximal speedup of 20%). The SWMR ONoC outperforms the circuit-switched ONoC. Next, coherence-based prediction techniques are proposed to reduce the waiting latency. The ideal coherence-based predictor reduces the waiting latency by 42%. A more streamlined predictor (smaller than an L1 cache) reduces the waiting latency by 31%. Without prediction, the message latency in the circuit-switched ONoC is 11% larger than in the SWMR ONoC. Applying the realistic predictor reverses this: the message latency in the SWMR ONoC is now 18% larger than in the predictive circuit-switched ONoC
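The link between wavelength count and serialisation latency stated in this abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the thesis: it assumes a hypothetical 512-bit message (one 64-byte cache line), assumes the bits are striped evenly across the wavelengths of a channel, and uses an illustrative function name. Only the 25 Gbit/s modulation rate comes from the abstract.

```python
def serialisation_latency_ns(message_bits, wavelengths, gbit_per_s_per_wavelength=25.0):
    """Time needed to push a whole message onto one optical channel, in ns.

    Assumes the message is striped evenly over all wavelengths of the channel.
    """
    if wavelengths < 1:
        raise ValueError("a channel needs at least one wavelength")
    # 1 Gbit/s is 1 bit per nanosecond, so bits / (Gbit/s) is already in ns.
    return message_bits / (wavelengths * gbit_per_s_per_wavelength)


MESSAGE_BITS = 512  # assumed message size (64-byte cache line), not from the thesis

for w in (1, 2, 4, 8):
    latency = serialisation_latency_ns(MESSAGE_BITS, w)
    print(f"{w} wavelength(s): {latency:.2f} ns")
```

Under these assumptions, doubling the number of wavelengths halves the serialisation latency; path setup latency and contention are separate components and are not modelled here.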
An Efficient NoC-based Framework To Improve Dataflow Thread Management At Runtime
This doctoral thesis focuses on how application threads that are based on the dataflow
execution model can be managed at Network-on-Chip (NoC) level. The roots of the
dataflow execution model date back to the early 1970s. Applications adhering to such a
program execution model follow a simple producer-consumer communication scheme for
synchronising parallel thread related activities. In a dataflow execution environment, a
thread can run if and only if all its required inputs are available. Applications running
on a large and complex computing environment can significantly benefit from the
adoption of the dataflow model.
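The firing rule described above (a thread runs if and only if all its required inputs are available) can be sketched in a few lines. This is a minimal illustration only, not the thesis's NoC-level mechanism; the class and method names are hypothetical.

```python
class DataflowThread:
    """A thread that may run only once every required input has arrived."""

    def __init__(self, name, required_inputs):
        self.name = name
        self.required = set(required_inputs)
        self.received = {}

    def deliver(self, input_name, value):
        """Called by a producer to hand one input to this consumer thread."""
        if input_name not in self.required:
            raise KeyError(f"{self.name} does not expect input {input_name!r}")
        self.received[input_name] = value

    def is_ready(self):
        # Firing rule: ready if and only if all required inputs are present.
        return self.required.issubset(self.received)


adder = DataflowThread("adder", ["a", "b"])
adder.deliver("a", 3)
print(adder.is_ready())  # only one of two inputs has arrived
adder.deliver("b", 4)
print(adder.is_ready())  # both inputs have arrived, so the thread may run
```

Producers never signal the consumer explicitly beyond delivering data; readiness follows from input availability alone, which is what makes the producer-consumer synchronisation implicit.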
In the first part of the thesis, the work is focused on the thread distribution mechanism.
It is shown how a scalable hash-based thread distribution mechanism
can be implemented at the router level with low overheads. To enhance the support further,
a tool to monitor the dataflow threads' status and a simple, functional model are
also incorporated into the design. Next, a software defined NoC is proposed to
manage the distribution of dataflow threads by exploiting its reconfigurability.
The second part of this work is focused more on the NoC microarchitecture level. The traditional
2D-mesh topology is combined with a standard ring, to understand how such a
hybrid network topology can outperform the traditional topology (such as 2D-mesh). Finally,
a mixed-integer linear programming based analytical model is proposed
to verify whether the mapping of application threads onto the free cores is optimal. The
proposed mathematical model can be used as a yardstick to verify the solution quality
of the newly developed mapping policy. It is not trivial to provide a complete low-level
framework for dataflow thread execution for better resource and power management.
However, this work could be considered as a primary framework to which improvements
could be made
Performance measurements and analysis of the existing wireless communication technology in Iraq.
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Iraq may be considered as the largest wireless market in the Gulf region. A key driving factor in the market of wireless communication, it has seen enormous growth in the mobile phone market over the last five years, leading to almost 24 million subscribers in 2011. Moreover, several technologies and services operate in Iraq: three GSM operators, three national CDMA operators and three provincial CDMA operators. The recent growth in the mobile phone market is based on the Global System for Mobile (GSM) communications and Code Division Multiple Access (CDMA) standards, creating the next-generation wireless technologies in the Iraqi wireless communication market. One of the essential issues of this research is to investigate the decreased Quality of Service (QoS) caused by interference in the services of GSM/CDMA operators in Iraq. Many issues should be studied and taken into consideration, such as: do the Multi-Coalition Forces cause the interference, jamming, higher rate of dropped calls and false ringing, or are these caused by bad design and planning? Do we need to optimise our network due to the large number of users? All these factors are investigated, and the measurements of most service providers and government agencies are gathered. A detailed analysis is included from the providers, with measurements of performance and the reasons for the deterioration of wireless services. The novel contribution of this thesis is the extensive radio measurement campaign over the three mobile and CDMA operator networks, and the analysis and recommendations that were drawn to suggest the best approach to improve the QoS of wireless communication technologies. Awareness of the actual reasons behind the deterioration of services will be raised with the Iraqi Government, the CMC and the wireless service providers
Design of complex integrated systems based on networks-on-chip: Trading off performance, power and reliability
The steady advancement of microelectronics is associated with an escalating number of challenges for design engineers due to both the tiny dimensions and the enormous complexity of integrated systems. Against this background, this work deals with Network-On-Chip (NOC) as the emerging design paradigm to cope with diverse issues of nanotechnology. The detailed investigations within the chapters focus on the communication-centric aspects of multi-core-systems, whereas performance, power consumption as well as reliability are considered likewise as the essential design criteria
RA-LPEL: A Resource-Aware Light-Weight Parallel Execution Layer for Reactive Stream Processing Networks on The SCC Many-core Tiled Architecture
In computing, the available computing power has continuously fallen short of the demanded computing performance. As a consequence, performance improvement has been the main focus of processor design. However, due to the phenomenon called "Power Wall" it has become infeasible to build faster processors by just increasing the processor's clock speed. One of the resulting trends in hardware design is to integrate several simple and power-efficient cores on the same chip. This design shift poses challenges of its own. In the past, with increasing clock frequency the programs became automatically faster as well without modifications. This is no longer true with many-core architectures. To achieve maximum performance the programs have to run concurrently on more than one core, which forces the general computing paradigm to become increasingly parallel to leverage maximum processing power.
In this thesis, we will focus on the Reactive Stream Program (RSP). In stream processing, the system consists of computing nodes, which are connected via communication streams. These streams simplify the concurrency management on modern many-core architectures due to their implicit synchronisation. RSP is a stream processing system that implements the reactive system. The RSPs work in tandem with their environment and the load imposed by the environment may vary over time. This provides a unique opportunity to increase performance per watt. In this thesis the research contribution focuses on the design of the execution layer to run RSPs on tiled many-core architectures, using Intel's Single-chip Cloud Computer (SCC) processor as a concrete experimentation platform. Further, we have developed a Dynamic Voltage and Frequency Scaling (DVFS) technique for RSPs deployed on many-core architectures. In contrast to many other approaches, our DVFS technique does not require the capability of controlling the power settings of individual computing elements, thus making it applicable to modern many-core architectures, in which power can be changed only for power islands. The experimental results confirm that the proposed DVFS technique can effectively improve the energy efficiency, i.e. increase the performance per watt, for RSPs
An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores), with further improvements from compiler optimisations (x14 for bitonic with the same configuration)
SpiNNaker - A Spiking Neural Network Architecture
20 years in conception and 15 in construction, the SpiNNaker project has delivered the world's largest neuromorphic computing platform incorporating over a million ARM mobile phone processors and capable of modelling spiking neural networks of the scale of a mouse brain in biological real time. This machine, hosted at the University of Manchester in the UK, is freely available under the auspices of the EU Flagship Human Brain Project. This book tells the story of the origins of the machine, its development and its deployment, and the immense software development effort that has gone into making it openly available and accessible to researchers and students the world over. It also presents exemplar applications from "Talk", a SpiNNaker-controlled robotic exhibit at the Manchester Art Gallery as part of "The Imitation Game", a set of works commissioned in 2016 in honour of Alan Turing, through to a way to solve hard computing problems using stochastic neural networks. The book concludes with a look to the future, and the SpiNNaker-2 machine which is yet to come
An FPGA implementation of an investigative many-core processor, Fynbos : in support of a Fortran autoparallelising software pipeline
Includes bibliographical references. In light of the power, memory, ILP, and utilisation walls facing the computing industry, this work examines the hypothetical many-core approach to finding greater compute performance and efficiency. In order to achieve greater efficiency in an environment in which Moore's law continues but TDP has been capped, a means of deriving performance from dark and dim silicon is needed. The many-core hypothesis is one approach to exploiting these available transistors efficiently. As understood in this work, it involves trading in hardware control complexity for hundreds to thousands of parallel simple processing elements, and operating at a clock speed sufficiently low as to allow the efficiency gains of near threshold voltage operation. Performance is therefore dependent on exploiting a new degree of fine-grained parallelism such as is currently only found in GPGPUs, but in a manner that is not as restrictive in application domain range. While removing the complex control hardware of traditional CPUs provides space for more arithmetic hardware, a basic level of control is still required. For a number of reasons this work chooses to replace this control largely with static scheduling. This pushes the burden of control primarily to the software and specifically the compiler, rather than to the programmer or to an application specific means of control simplification. An existing legacy tool chain capable of autoparallelising sequential Fortran code to the degree of parallelism necessary for many-core exists. This work implements a many-core architecture to match it. Prototyping the design on an FPGA, it is possible to examine the real world performance of the compiler-architecture system to a greater degree than simulation only would allow.
Comparing theoretical peak performance and real performance in a case study application, the system is found to be more efficient than any other reviewed, but also to significantly underperform relative to current competing architectures. This failing is attributed to taking the need for simple hardware too far, and to an inability to implement tactics that mitigate static scheduling, due to a lack of support for them in the compiler
Many-core and heterogeneous architectures: programming models and compilation toolchains
The abstract is in the attachment. Author: Barchi, Francesc