36 research outputs found

    Novel parallel architectures and algorithms for linear algebra processors

    Number representations, processing architectures and algorithms, optical linear algebra processor fabrication and test results, case study descriptions, and future system plans are covered.

    Optimising a fluid plasma turbulence simulation on modern high performance computers

    Nuclear fusion offers the potential of almost limitless energy from sea water and lithium without the dangers of carbon emissions or long-term radioactive waste. At the forefront of fusion technology are the tokamaks, toroidal magnetic confinement devices that contain miniature stars on Earth. Nuclei can only fuse by overcoming the strong electrostatic forces between them, which requires high temperatures and pressures. The temperatures in a tokamak are so great that the Deuterium-Tritium fusion fuel forms a plasma, which must be kept hot and under pressure to maintain the fusion reaction. Turbulence in the plasma causes disruption by transporting mass and energy away from this core, reducing the efficiency of the reaction. Understanding and controlling the mechanisms of plasma turbulence is key to building a fusion reactor capable of producing sustained output. The extreme temperatures make detailed empirical observations difficult to acquire, so numerical simulations are used as an additional method of investigation. One numerical model used to study turbulence and diffusion is CENTORI, a direct two-fluid magneto-hydrodynamic simulation of a tokamak plasma developed by the Culham Centre for Fusion Energy (CCFE, formerly UKAEA:Fusion). It simulates the entire tokamak plasma with realistic geometry, evolving bulk plasma quantities like pressure, density and temperature through millions of timesteps. This requires CENTORI to run in parallel on a Massively Parallel Processing (MPP) supercomputer to produce results in an acceptable time. Any improvement in CENTORI’s performance increases the rate and/or total number of results that can be obtained from access to supercomputer resources. This thesis presents the substantial effort to optimise CENTORI on the current generation of academic supercomputers. It investigates and reviews the properties of contemporary computer architectures, then proposes, implements and executes a benchmark suite of CENTORI’s fundamental kernels. The suite is used to compare the performance of three competing memory layouts of the primary vector data structure using a selection of compilers on a variety of computer architectures. The results show there is no memory layout that is optimal on all platforms, so a flexible strategy was adopted to pursue “portable” optimisation, i.e. optimisations that can easily be added, adapted or removed from future platforms depending on their performance. This required designing an interface of functions and datatypes that separates CENTORI’s fundamental algorithms from repetitive, low-level implementation details. This approach offered multiple benefits: the clearer representation of CENTORI’s core equations as mathematical expressions in Fortran source code allows rapid prototyping and development of new features; the reduction in the total data volume by a factor of three reduces the amount of data transferred over the memory bus to almost a third; and the reduction in the number of intense floating-point kernels reduces the effort of optimising the application on new platforms. The project proceeds to rewrite CENTORI using the new Application Programming Interface (API) and evaluates two optimised implementations. The first is a traditional library implementation that uses hand-optimised subroutines to implement the library functions. The second uses a dynamic optimisation engine to perform automatic stripmining to improve the performance of the memory hierarchy. The automatic stripmining implementation uses lazy evaluation to delay calculations until absolutely necessary, allowing it to identify temporary data structures and minimise them for optimal cache use. This novel technique is combined with highly optimised implementations of the kernel operations and optimised parallel communication routines to produce a significant improvement in CENTORI’s performance. The maximum measured speed-up of the optimised versions over the original code was 3.4 times on 128 processors on HPCx, 2.8 times on 1024 processors on HECToR and 2.3 times on 256 processors on HPC-FF.
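    As an illustration of the stripmining idea that such a dynamic optimisation engine automates, the C sketch below contrasts a naive fused operation that builds a full-size temporary with a version that processes the data in cache-sized strips; the function names, block size and the operation itself are placeholders and do not come from CENTORI's API.

        /* stripmine_demo.c -- illustrative only; names and sizes are hypothetical,
         * not CENTORI's real data structures or interface. */
        #include <stddef.h>
        #include <stdlib.h>

        #define BLOCK 1024   /* strip length chosen to fit comfortably in L1 cache */

        /* Naive version of r = (a + b) * c: the n-sized temporary t streams
         * through main memory twice. */
        void fused_naive(const double *a, const double *b, const double *c,
                         double *r, size_t n)
        {
            double *t = malloc(n * sizeof *t);
            if (!t) return;
            for (size_t i = 0; i < n; i++) t[i] = a[i] + b[i];
            for (size_t i = 0; i < n; i++) r[i] = t[i] * c[i];
            free(t);
        }

        /* Stripmined version: the temporary shrinks to one cache-resident block,
         * so each element of a, b and c is read from main memory only once. */
        void fused_stripmined(const double *a, const double *b, const double *c,
                              double *r, size_t n)
        {
            double t[BLOCK];
            for (size_t base = 0; base < n; base += BLOCK) {
                size_t len = (n - base < BLOCK) ? n - base : BLOCK;
                for (size_t i = 0; i < len; i++) t[i] = a[base + i] + b[base + i];
                for (size_t i = 0; i < len; i++) r[base + i] = t[i] * c[base + i];
            }
        }

    Lazy evaluation lets such an engine discover that the temporary never needs to exist at full length, which is exactly the saving the strip-sized buffer captures.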

    Embedded electronic systems driven by run-time reconfigurable hardware

    This doctoral thesis addresses the design of embedded electronic systems based on run-time reconfigurable hardware technology, available through SRAM-based FPGA/SoC devices, with the aim of contributing to enhance people's quality of life. The work investigates the system architecture and the reconfiguration engine that provide the FPGA with the capability of dynamic partial reconfiguration, in order to synthesize, by means of hardware/software co-design, a given application partitioned into processing tasks which are multiplexed in time and space, thus optimizing its physical implementation (silicon area, processing time, complexity, flexibility, functional density, cost and power consumption) in comparison with other alternatives based on static hardware (MCU, DSP, GPU, ASSP, ASIC, etc.). The design flow of this technology is evaluated through the prototyping of several engineering applications (control systems, mathematical coprocessors, complex image processors, etc.), showing a level of maturity high enough for its exploitation in industry.

    Dynamically reconfigurable architecture for embedded computer vision systems

    The objective of this research work is to design, develop and implement a new architecture which integrates on the same chip all the processing levels of a complete Computer Vision system, so that execution is efficient without compromising power consumption while keeping the cost low. For this purpose, an analysis and classification of different mathematical operations and algorithms commonly used in Computer Vision are carried out, as well as an in-depth review of the image processing capabilities of current-generation hardware devices. This makes it possible to determine the requirements and the key aspects for an efficient architecture. A representative set of algorithms is employed as a benchmark to evaluate the proposed architecture, which is implemented on an FPGA-based system-on-chip. Finally, the prototype is compared to other related approaches in order to determine its advantages and weaknesses.

    A Solder-Defined Computer Architecture for Backdoor and Malware Resistance

    This research is about securing control of those devices we most depend on for integrity and confidentiality. An emerging concern is that complex integrated circuits may be subject to exploitable defects or backdoors, and measures for inspection and audit of these chips are neither supported nor scalable. One approach for providing a “supply chain firewall” may be to forgo such components, and instead to build central processing units (CPUs) and other complex logic from simple, generic parts. This work investigates the capability and speed ceiling when open-source hardware methodologies are fused with maker-scale assembly tools and visible-scale final inspection. The author has designed, and demonstrated in simulation, a 36-bit CPU and protected memory subsystem that use only synchronous static random access memory (SRAM) and trivial glue logic integrated circuits as components. The design presently lacks preemptive multitasking, the ability to load firmware into the SRAMs used as logic elements, and input/output. Strategies are presented for adding these missing subsystems, again using only SRAM and trivial glue logic. A load-store architecture is employed with four clock cycles per instruction. Simulations indicate that a clock speed of at least 64 MHz is probable, corresponding to 16 million instructions per second (16 MIPS), despite the architecture containing no microprocessors, field programmable gate arrays, programmable logic devices, application specific integrated circuits, or other purchased complex logic. The lower speed, larger size, higher power consumption, and higher cost of an “SRAM minicomputer,” compared to traditional microcontrollers, may be offset by the fully open architecture (hardware and firmware), along with more rigorous user control, reliability, transparency, and auditability of the system. SRAM logic is also particularly well suited for building arithmetic logic units, and can implement complex operations such as population count, a hash function for associative arrays, or a pseudorandom number generator with good statistical properties in as few as eight clock cycles per 36-bit word processed. A 36-bit unsigned multiplication can be implemented in software in 47 instructions or fewer (188 clock cycles). A general theory is developed for fast SRAM parallel multipliers, should they be needed.
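    As a generic illustration of using a wide SRAM as a lookup table for complex per-word operations (the thesis's actual microarchitecture is not reproduced here), the C sketch below computes the population count of a 36-bit word with four table lookups and three additions; the 9-bit field width and table layout are assumptions made for the example.

        /* popcount36.c -- illustrative sketch of SRAM-as-lookup-table logic.
         * The 9-bit field width and table layout are assumptions, not the thesis design. */
        #include <stdint.h>
        #include <stdio.h>

        #define FIELD_BITS 9
        #define TABLE_SIZE (1u << FIELD_BITS)   /* 512-entry table, one lookup per field */

        static uint8_t popcount_table[TABLE_SIZE];

        /* Fill the table once; in an SRAM-logic machine this corresponds to loading
         * the lookup SRAM with precomputed results. */
        static void init_table(void)
        {
            for (uint32_t v = 0; v < TABLE_SIZE; v++) {
                uint8_t c = 0;
                for (uint32_t b = v; b; b >>= 1) c += b & 1u;
                popcount_table[v] = c;
            }
        }

        /* Population count of a 36-bit word held in a uint64_t:
         * four table lookups plus three additions. */
        static unsigned popcount36(uint64_t word)
        {
            word &= (1ull << 36) - 1;           /* keep only the low 36 bits */
            return popcount_table[word & 0x1FF]
                 + popcount_table[(word >> 9)  & 0x1FF]
                 + popcount_table[(word >> 18) & 0x1FF]
                 + popcount_table[(word >> 27) & 0x1FF];
        }

        int main(void)
        {
            init_table();
            printf("%u\n", popcount36(0xFFFFFFFFFull));   /* 36 set bits -> prints 36 */
            return 0;
        }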

    The Fifth NASA Symposium on VLSI Design

    The fifth annual NASA Symposium on VLSI Design comprised 13 sessions, including Radiation Effects, Architectures, Mixed Signal, Design Techniques, Fault Testing, Synthesis, Signal Processing, and other Featured Presentations. The symposium provides insights into developments in VLSI and digital systems which can be used to increase data systems performance. The presentations share insights into next-generation advances that will serve as a basis for future VLSI design.

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)

    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research on SoC-related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion-transistor era, taking into account the emerging techniques and architectures that explore the synergy between flexible on-chip communication and system reconfigurability.

    Hardware / Software Architectural and Technological Exploration for Energy-Efficient and Reliable Biomedical Devices

    Nowadays, the ubiquity of smart appliances in our everyday lives is increasingly strengthening the links between humans and machines. Beyond making our lives easier and more convenient, smart devices now play an important role in personalized healthcare delivery. This technological breakthrough is particularly relevant in a world where population aging and unhealthy habits have made non-communicable diseases the leading cause of death worldwide, according to international public health organizations. In this context, smart health monitoring systems, termed Wireless Body Sensor Nodes (WBSNs), represent a paradigm shift in the healthcare landscape by greatly lowering the cost of long-term monitoring of chronic diseases, as well as improving patients' lifestyles. WBSNs are able to autonomously acquire biological signals and embed on-node Digital Signal Processing (DSP) capabilities to deliver clinically accurate health diagnoses in real time, even outside a hospital environment. Energy efficiency and reliability are fundamental requirements for WBSNs, since they must operate for extended periods of time while relying on compact batteries. These constraints, in turn, demand carefully designed hardware and software architectures for hosting the execution of complex biomedical applications. In this thesis, I develop and explore novel solutions at the architectural and technological level of the integrated circuit design domain to enhance the energy efficiency and reliability of current WBSNs. Firstly, following a top-down approach driven by the characteristics of biomedical algorithms, I perform an architectural exploration of a heterogeneous and reconfigurable computing platform devoted to bio-signal analysis. By interfacing a shared Coarse-Grained Reconfigurable Array (CGRA) accelerator, this domain-specific platform can achieve higher performance and energy savings beyond the capabilities offered by a baseline multi-processor system. More precisely, I propose three CGRA architectures, each contributing differently to the maximization of application parallelization. The proposed Single, Multi and Interleaved-Datapath CGRA designs allow the developed platform to achieve substantial energy savings of up to 37% when executing complex biomedical applications, with respect to a multi-core-only platform. Secondly, I investigate how modeling technology reliability issues in logic and memory components can be exploited to adequately adjust the frequency and supply voltage of a circuit, with the aim of optimizing its computing performance and energy efficiency. To this end, I propose a novel framework for workload-dependent Bias Temperature Instability (BTI) impact analysis on the quality of biomedical application results. Remarkably, the framework is able to determine the range of safe circuit operating frequencies without introducing worst-case guard bands. Experiments highlight the possibility of safely raising the frequency up to 101% above the maximum obtained with classical static timing analysis. Finally, through the study of several well-known biomedical algorithms, I propose an approach that saves energy by dynamically and unequally protecting an under-powered data memory, in a new way compared to regular error protection schemes. This solution relies on the Dynamic eRror compEnsation And Masking (DREAM) technique, which reduces the energy consumed by traditional error correction codes by approximately 21%.
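    As a generic illustration of unequal error protection, and not of the DREAM technique itself, the C sketch below protects only the high-order byte of each stored 16-bit sample with two extra copies and a bitwise majority vote, while the low-order bits are left unprotected because a flip there perturbs the signal only slightly; all names, widths and the injected faults are hypothetical.

        /* unequal_protect.c -- generic sketch of unequal error protection for a data
         * memory holding 16-bit bio-signal samples.  NOT the DREAM technique from the
         * thesis; it only shows that high-order bits deserve stronger protection. */
        #include <stdint.h>
        #include <stdio.h>

        #define N_SAMPLES 4

        static int16_t mem[N_SAMPLES];       /* under-powered memory: any bit may flip */
        static uint8_t copy_a[N_SAMPLES];    /* two extra copies of the high byte only */
        static uint8_t copy_b[N_SAMPLES];

        static void write_sample(int idx, int16_t value)
        {
            mem[idx]    = value;
            copy_a[idx] = (uint16_t)value >> 8;   /* replicate only the 8 MSBs */
            copy_b[idx] = (uint16_t)value >> 8;
        }

        static int16_t read_sample(int idx)
        {
            uint8_t hi  = (uint16_t)mem[idx] >> 8;
            uint8_t lo  = mem[idx] & 0xFF;        /* low byte unprotected: errors are small */
            /* Bitwise majority vote over the three copies of the high byte. */
            uint8_t maj = (hi & copy_a[idx]) | (hi & copy_b[idx]) | (copy_a[idx] & copy_b[idx]);
            return (int16_t)(((uint16_t)maj << 8) | lo);
        }

        int main(void)
        {
            write_sample(0, 1000);
            mem[0] ^= (int16_t)0x4000;       /* inject a flip in a protected high-order bit */
            mem[0] ^= 0x0001;                /* and one in an unprotected low-order bit */
            printf("%d\n", read_sample(0));  /* prints 1001: large error corrected, small one kept */
            return 0;
        }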

    Towards an embedded board-level tester: study of a configurable test processor

    The demand for electronic systems with more features, higher performance, and lower power consumption increases continuously. This is a real challenge for design and test engineers, because they have to deal with electronic systems of ever-increasing complexity while keeping production and test costs low and meeting critical time-to-market deadlines. For a test engineer working at the board level, this means that manufacturing defects must be detected as soon as possible and at a low cost. However, classical test techniques are not sufficient for testing modern printed circuit boards, and in the worst case they cannot be used at all. This is mainly due to modern packaging technologies, high device density, and the high operating frequencies of modern printed circuit boards, which lead to very long test times, low fault coverage, and high test costs. This dissertation addresses these issues and proposes an FPGA-based test approach for printed circuit boards. The concept is based on a configurable test processor that is temporarily implemented in the on-board FPGA and provides the corresponding mechanisms to communicate with external test equipment and with co-processors implemented in the FPGA. This embedded test approach provides the flexibility to implement test functions either in the external test equipment or in the FPGA. In this manner, tests are executed at speed, increasing the fault coverage, test times are reduced, and the test system can be adapted automatically to the properties of the FPGA and of the devices located on the board. An essential part of the FPGA-based test approach deals with the development of a test processor. This dissertation discusses the required properties of the processor and shows that adaptation to the specific test scenario plays a very important role in the optimization. For this purpose, the test processor is equipped with configuration parameters at the instruction set architecture and microarchitecture levels. Additionally, an automatic generation process for the test system and for the computation of some of the processor's configuration parameters is proposed. The automatic generation process uses as input a model known as the device under test model (DUT-M). In order to evaluate the entire FPGA-based test approach and the viability of a processor for testing printed circuit boards, the developed test system is used to test interconnections to two different devices: a static random access memory (SRAM) and a liquid crystal display (LCD). Experiments were conducted to determine the resource utilization of the processor and of the FPGA-based test system, and to measure test time when different test functions are implemented in the external test equipment or in the FPGA. It is shown that the introduced approach is suitable for testing printed circuit boards and that the test processor represents a realistic alternative for board-level testing.
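    As a generic illustration of the kind of interconnect test a board-level test processor might execute against an on-board SRAM (the dissertation's actual test functions are not reproduced here), the C sketch below runs a walking-ones data-bus test and a power-of-two address-bus test on a memory-mapped device; the data width, word count and the memory-mapped base pointer are placeholders.

        /* interconnect_test.c -- generic walking-ones data-bus and power-of-two
         * address-bus test for a memory-mapped SRAM.  Base address, widths and sizes
         * are placeholders, not values from the dissertation's test system. */
        #include <stdint.h>
        #include <stddef.h>

        #define DATA_BITS  16
        #define ADDR_WORDS 0x1000u              /* number of 16-bit words under test */

        /* Walking-ones test: detects data lines stuck at 0/1 or bridged together. */
        int test_data_bus(volatile uint16_t *base)
        {
            for (unsigned bit = 0; bit < DATA_BITS; bit++) {
                uint16_t pattern = (uint16_t)(1u << bit);
                base[0] = pattern;
                if (base[0] != pattern)
                    return -1;                  /* data line 'bit' is faulty */
            }
            return 0;
        }

        /* Address test: writes a marker at each power-of-two offset, then writes a
         * different value at offset 0 and checks that no marker was overwritten,
         * which would indicate a stuck or shorted address line aliasing onto 0. */
        int test_address_bus(volatile uint16_t *base)
        {
            for (size_t off = 1; off < ADDR_WORDS; off <<= 1)
                base[off] = 0xAAAA;
            base[0] = 0x5555;
            for (size_t off = 1; off < ADDR_WORDS; off <<= 1)
                if (base[off] != 0xAAAA)
                    return -1;                  /* an address line aliases onto offset 0 */
            return 0;
        }

    In the embedded test approach described above, routines of this kind could run either on the external test equipment or at speed on the FPGA-implemented test processor.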