3,996 research outputs found
Automated Hardware Prototyping for 3D Network on Chips
Vor mehr als 50 Jahren stellte IntelÂź MitbegrĂŒnder Gordon Moore eine Prognose zum Entwicklungsprozess der Transistortechnologie auf. Er prognostizierte, dass sich die Zahl der Transistoren in integrierten Schaltungen alle zwei Jahre verdoppeln wird. Seine Aussage ist immer noch gĂŒltig, aber ein Ende von Moores Gesetz ist in Sicht. Mit dem Ende von Mooreâs Gesetz mĂŒssen neue Aspekte untersucht werden, um weiterhin die Leistung von integrierten Schaltungen zu steigern. Zwei mögliche AnsĂ€tze fĂŒr "More than Mooreâ sind 3D-Integrationsverfahren und heterogene Systeme. Gleichzeitig entwickelt sich ein Trend hin zu Multi-Core Prozessoren, basierend auf Networks on chips (NoCs).
Neben dem Ende des Mooreschen Gesetzes ergeben sich bei immer kleiner werdenden TechnologiegröĂen, vor allem jenseits der 60 nm, neue Herausforderungen. Eine Schwierigkeit ist die WĂ€rmeableitung in groĂskalierten integrierten Schaltkreisen und die daraus resultierende Ăberhitzung des Chips. Um diesem Problem in modernen Multi-Core Architekturen zu begegnen, muss auch die Verlustleistung der Netzwerkressourcen stark reduziert werden. Diese Arbeit umfasst eine durch Hardware gesteuerte Kombination aus Frequenzskalierung und Power Gating fĂŒr 3D On-Chip Netzwerke, einschlieĂlich eines FPGA Prototypen. DafĂŒr wurde ein Takt-synchrones 2D Netzwerk auf ein dreidimensionales asynchrones Netzwerk mit mehreren Frequenzbereichen erweitert. ZusĂ€tzlich wurde ein skalierbares Online-Power-Management System mit geringem Ressourcenaufwand entwickelt.
Die Verifikation neuer Hardwarekomponenten ist einer der zeitaufwendigsten Schritte im Entwicklungsprozess hochintegrierter digitaler Schaltkreise. Um diese Aufgabe zu beschleunigen und um eine parallele Softwareentwicklung zu ermöglichen, wurde im Rahmen dieser Arbeit ein automatisiertes und benutzerfreundliches Tool fĂŒr den Entwurf neuer Hardware Projekte entwickelt. Eine grafische BenutzeroberflĂ€che zum Erstellen des gesamten Designablaufs, vom Erstellen der Architektur, Parameter Deklaration, Simulation, Synthese und Test ist Teil dieses Werkzeugs. Zudem stellt die GröĂe der Architektur fĂŒr die Erstellung eines Prototypen eine besondere Herausforderung dar. FrĂŒhere Arbeiten haben es versĂ€umt, eine schnelles und unkompliziertes Prototyping, insbesondere von Architekturen mit mehr als 50 Prozessorkernen, zu realisieren. Diese Arbeit umfasst eine Design Space Exploration und FPGA-basierte Prototypen von verschiedenen 3D-NoC Implementierungen mit mehr als 80 Prozessoren
Vector support for multicore processors with major emphasis on configurable multiprocessors
It recently became increasingly difficult to build higher speed uniprocessor chips because of performance degradation and high power consumption. The quadratically increasing circuit complexity forbade the exploration of more instruction-level parallelism (JLP). To continue raising the performance, processor designers then focused on thread-level parallelism (TLP) to realize a new architecture design paradigm. Multicore processor design is the result of this trend. It has proven quite capable in performance increase and provides new opportunities in power management and system scalability. But current multicore processors do not provide powerful vector architecture support which could yield significant speedups for array operations while maintaining arealpower efficiency.
This dissertation proposes and presents the realization of an FPGA-based prototype of a multicore architecture with a shared vector unit (MCwSV). FPGA stands for Filed-Programmable Gate Array. The idea is that rather than improving only scalar or TLP performance, some hardware budget could be used to realize a vector unit to greatly speedup applications abundant in data-level parallelism (DLP). To be realistic, limited by the parallelism in the application itself and by the compiler\u27s vectorizing abilities, most of the general-purpose programs can only be partially vectorized. Thus, for efficient resource usage, one vector unit should be shared by several scalar processors. This approach could also keep the overall budget within acceptable limits. We suggest that this type of vector-unit sharing be established in future multicore chips.
The design, implementation and evaluation of an MCwSV system with two scalar processors and a shared vector unit are presented for FPGA prototyping. The MicroBlaze processor, which is a commercial IP (Intellectual Property) core from Xilinx, is used as the scalar processor; in the experiments the vector unit is connected to a pair of MicroBlaze processors through standard bus interfaces. The overall system is organized in a decoupled and multi-banked structure. This organization provides substantial system scalability and better vector performance. For a given area budget, benchmarks from several areas show that the MCwSV system can provide significant performance increase as compared to a multicore system without a vector unit.
However, a MCwSV system with two MicroBlazes and a shared vector unit is not always an optimized system configuration for various applications with different percentages of vectorization. On the other hand, the MCwSV framework was designed for easy scalability to potentially incorporate various numbers of scalar/vector units and various function units. Also, the flexibility inherent to FPGAs can aid the task of matching target applications. These benefits can be taken into account to create optimized MCwSV systems for various applications. So the work eventually focused on building an architecture design framework incorporating performance and resource management for application-specific MCwSV (AS-MCwSV) systems. For embedded system design, resource usage, power consumption and execution latency are three metrics to be used in design tradeoffs. The product of these metrics is used here to choose the MCwSV system with the smallest value
The Design of a System Architecture for Mobile Multimedia Computers
This chapter discusses the system architecture of a portable computer, called Mobile Digital Companion, which provides support for handling multimedia applications energy efficiently. Because battery life is limited and battery weight is an important factor for the size and the weight of the Mobile Digital Companion, energy management plays a crucial role in the architecture. As the Companion must remain usable in a variety of environments, it has to be flexible and adaptable to various operating conditions. The Mobile Digital Companion has an unconventional architecture that saves energy by using system decomposition at different levels of the architecture and exploits locality of reference with dedicated, optimised modules. The approach is based on dedicated functionality and the extensive use of energy reduction techniques at all levels of system design. The system has an architecture with a general-purpose processor accompanied by a set of heterogeneous autonomous programmable modules, each providing an energy efficient implementation of dedicated tasks. A reconfigurable internal communication network switch exploits locality of reference and eliminates wasteful data copies
Towards the development of a reliable reconfigurable real-time operating system on FPGAs
In the last two decades, Field Programmable Gate Arrays (FPGAs) have been
rapidly developed from simple âglue-logicâ to a powerful platform capable of
implementing a System on Chip (SoC). Modern FPGAs achieve not only the high
performance compared with General Purpose Processors (GPPs), thanks to hardware
parallelism and dedication, but also better programming flexibility, in comparison to
Application Specific Integrated Circuits (ASICs). Moreover, the hardware
programming flexibility of FPGAs is further harnessed for both performance and
manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR
allows a part or parts of a circuit to be reconfigured at run-time, without interrupting
the rest of the chipâs operation. As a result, hardware resources can be more
efficiently exploited since the chip resources can be reused by swapping in or out
hardware tasks to or from the chip in a time-multiplexed fashion. In addition, DPR
improves fault tolerance against transient errors and permanent damage, such as
Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid
error accumulation. Furthermore, power and heat can be reduced by removing
finished or idle tasks from the chip. For all these reasons above, DPR has
significantly promoted Reconfigurable Computing (RC) and has become a very hot
topic. However, since hardware integration is increasing at an exponential rate, and
applications are becoming more complex with the growth of user demands, highlevel
application design and low-level hardware implementation are increasingly
separated and layered. As a consequence, users can obtain little advantage from DPR
without the support of system-level middleware.
To bridge the gap between the high-level application and the low-level hardware
implementation, this thesis presents the important contributions towards a Reliable,
Reconfigurable and Real-Time Operating System (R3TOS), which facilitates the
user exploitation of DPR from the application level, by managing the complex
hardware in the background. In R3TOS, hardware tasks behave just like software
tasks, which can be created, scheduled, and mapped to different computing resources
on the fly. The novel contributions of this work are: 1) a novel implementation of an efficient task scheduler and allocator; 2) implementation of a novel real-time
scheduling algorithm (FAEDF) and two efficacious allocating algorithms (EAC and
EVC), which schedule tasks in real-time and circumvent emerging faults while
maintaining more compact empty areas. 3) Design and implementation of a faulttolerant
microprocessor by harnessing the existing FPGA resources, such as Error
Correction Code (ECC) and configuration primitives. 4) A novel symmetric
multiprocessing (SMP)-based architectures that supports shared memory programing
interface. 5) Two demonstrations of the integrated system, including a) the K-Nearest
Neighbour classifier, which is a non-parametric classification algorithm widely used
in various fields of data mining; and b) pairwise sequence alignment, namely the
Smith Waterman algorithm, used for identifying similarities between two biological
sequences.
R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking
applications, whereby resources can be dynamically managed in respect of
user requirements and hardware availability. Benefiting from this, not only the
hardware resources can be more efficiently used, but also the system performance
can be significantly increased. Results show that the scheduling and allocating
efficiencies have been improved up to 2x, and the overall system performance is
further improved by ~2.5x. Future work includes the development of Network on
Chip (NoC), which is expected to further increase the communication throughput; as
well as the standardization and automation of our system design, which will be
carried out in line with the enablement of other high-level synthesis tools, to allow
application developers to benefit from the system in a more efficient manner
Packet Switched vs. Time Multiplexed FPGA Overlay Networks
Dedicated, spatially configured FPGA interconnect
is efficient for applications that require high throughput connections
between processing elements (PEs) but with a limited degree
of PE interconnectivity (e.g. wiring up gates and datapaths).
Applications which virtualize PEs may require a large number
of distinct PE-to-PE connections (e.g. using one PE to simulate
100s of operators, each requiring input data from thousands of
other operators), but with each connection having low throughput
compared with the PEâs operating cycle time. In these highly interconnected
conditions, dedicating spatial interconnect resources
for all possible connections is costly and inefficient. Alternatively,
we can time share physical network resources by virtualizing
interconnect links, either by statically scheduling the sharing
of resources prior to runtime or by dynamically negotiating
resources at runtime. We explore the tradeoffs (e.g. area, route
latency, route quality) between time-multiplexed and packet-switched
networks overlayed on top of commodity FPGAs. We
demonstrate modular and scalable networks which operate on
a Xilinx XC2V6000-4 at 166MHz. For our applications, time-multiplexed,
offline scheduling offers up to a 63% performance
increase over online, packet-switched scheduling for equivalent
topologies. When applying designs to equivalent area, packet-switching
is up to 2Ă faster for small area designs while time-multiplexing
is up to 5Ă faster for larger area designs. When
limited to the capacity of a XC2V6000, if all communication is
known, time-multiplexed routing outperforms packet-switching;
however when the active set of links drops below 40% of the
potential links, packet-switched routing can outperform time-multiplexing
A Scalable and Adaptive Network on Chip for Many-Core Architectures
In this work, a scalable network on chip (NoC) for future many-core architectures is proposed and investigated. It supports different QoS mechanisms to ensure predictable communication. Self-optimization is introduced to adapt the energy footprint and the performance of the network to the communication requirements. A fault tolerance concept allows to deal with permanent errors. Moreover, a template-based automated evaluation and design methodology and a synthesis flow for NoCs is introduced
- âŠ