14 research outputs found
Soft Error Analysis and Mitigation at High Abstraction Levels
Radiation-induced soft errors, one of the major reliability challenges in future technology nodes, have to be carefully taken into consideration during design space exploration. This thesis presents several novel and efficient techniques for soft error evaluation and mitigation at high abstraction levels, i.e. from register transfer level up to behavioral algorithmic level. The effectiveness of the proposed techniques is demonstrated with extensive synthesis experiments.
Dependable Embedded Systems
This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. It introduces the most prominent reliability concerns from today's point of view and briefly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level, such as circuit level or system level alone, this book deals with the different reliability challenges across different levels, starting from the physical level all the way to the system level (cross-layer approaches). The book aims to demonstrate how new hardware/software co-design solutions can be proposed to effectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with the latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are proactively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many-core systems.
Integration and validation of embedded flight software on space-qualified multicore architectures
In recent decades, the importance of software in space missions has notably increased, reflecting the need to integrate advanced on-board functionalities. With multicore processors lately being introduced to host critical high-performance applications, the complexity of validating software has risen significantly with respect to single-core architectures. While there has been a big step forward in avionics after the publication of the CAST-32A paper, the ECSS-E-ST-40C software engineering standard used by the European Space Agency (ESA) still does not provide validation support for multicore processors. Hence, it is expected that standardising guidelines for developing software on such platforms will become a recurring topic in the industry to match the demands of future space exploration missions.
Resiliency Mechanisms for In-Memory Column Stores
The key objective of database systems is to reliably manage data, while high query throughput and low query latency are core requirements. To date, database research has mostly concentrated on the latter. However, due to the constant shrinking of transistor feature sizes, integrated circuits are becoming increasingly unreliable, and transient hardware errors in the form of multi-bit flips are becoming more prominent. A 2013 study of a large high-performance cluster with around 8500 nodes measured a failure rate of 40 FIT per DRAM device. For that system, this means that a single- or multi-bit flip occurs every 10 hours, which is unacceptably high for enterprise and HPC scenarios. Causes can be cosmic rays, heat, or electrical crosstalk, with the latter being actively exploited by the RowHammer attack. It has been shown that memory cells are more prone to bit flips than logic gates, and several surveys have found multi-bit flip events in the main memory modules of today's data centers. Due to the shift towards in-memory data management systems, where all business-related data and query intermediate results are kept solely in fast main memory, such systems are in great danger of delivering corrupt results to their users. Hardware techniques cannot be scaled to compensate for the exponentially increasing error rates. In other domains, there is increasing interest in software-based solutions to this problem, but the proposed methods come with huge runtime and/or storage overheads, which are unacceptable for in-memory data management systems.
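The relation between the quoted FIT rate and the quoted 10-hour interval can be checked with a short calculation. The sketch below uses the standard definition of FIT (expected failures per 10^9 device-hours). The abstract does not state how many DRAM devices the cluster contains, so the device count computed here is only what the two quoted figures imply together, not a number from the study; the function names are illustrative.

```python
# FIT ("failures in time") = expected failures per 10**9 device-hours.
FIT_HOURS = 1e9


def mean_hours_between_failures(fit_per_device: float, device_count: float) -> float:
    """Mean time between failures, in hours, across a population of devices."""
    return FIT_HOURS / (fit_per_device * device_count)


def devices_for_interval(fit_per_device: float, hours_between_failures: float) -> float:
    """Number of devices at which one failure is expected every given interval."""
    return FIT_HOURS / (fit_per_device * hours_between_failures)


if __name__ == "__main__":
    fit = 40.0      # FIT per DRAM device, as quoted in the abstract
    interval = 10.0  # hours between bit flips, as quoted in the abstract
    nodes = 8500     # approximate node count, as quoted in the abstract

    implied_devices = devices_for_interval(fit, interval)
    print(f"implied DRAM devices in the cluster: {implied_devices:,.0f}")
    print(f"implied DRAM devices per node:       {implied_devices / nodes:.0f}")
```

The point of the calculation is that a per-device rate that looks negligible becomes a frequent event once it is multiplied by the number of devices in a large installation.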
In this thesis, we investigate how to integrate bit flip detection mechanisms into in-memory data management systems. To achieve this goal, we first build an understanding of bit flip detection techniques and select two error codes, AN codes and XOR checksums, suited to the requirements of in-memory data management systems. The most important requirement is the effectiveness of the codes in detecting bit flips. We meet this goal through AN codes, which exhibit better and adaptable error detection capabilities than those found in today's hardware. The second most important goal is efficiency in terms of coding latency. We meet this by introducing fundamental performance improvements to AN codes and by vectorizing both chosen codes' operations. We integrate bit flip detection mechanisms into the lowest storage layer and the query processing layer in such a way that the remaining data management system and the user can stay oblivious of any error detection. This includes both base columns and pointer-heavy index structures such as the ubiquitous B-Tree. Additionally, our approach allows adaptable, on-the-fly bit flip detection during query processing, with only very little impact on query latency. AN coding allows intermediate results to be recoded with virtually no performance penalty. We support our claims by providing exhaustive runtime and throughput measurements throughout the whole thesis and with an end-to-end evaluation using the Star Schema Benchmark. To the best of our knowledge, we are the first to present such holistic and fast bit flip detection in a large software infrastructure such as in-memory data management systems. Finally, most of the source code fragments used to obtain the results in this thesis are open source and freely available.
1 INTRODUCTION
1.1 Contributions of this Thesis
1.2 Outline
2 PROBLEM DESCRIPTION AND RELATED WORK
2.1 Reliable Data Management on Reliable Hardware
2.2 The Shift Towards Unreliable Hardware
2.3 Hardware-Based Mitigation of Bit Flips
2.4 Data Management System Requirements
2.5 Software-Based Techniques For Handling Bit Flips
2.5.1 Operating System-Level Techniques
2.5.2 Compiler-Level Techniques
2.5.3 Application-Level Techniques
2.6 Summary and Conclusions
3 ANALYSIS OF CODING TECHNIQUES
3.1 Selection of Error Codes
3.1.1 Hamming Coding
3.1.2 XOR Checksums
3.1.3 AN Coding
3.1.4 Summary and Conclusions
3.2 Probabilities of Silent Data Corruption
3.2.1 Probabilities of Hamming Codes
3.2.2 Probabilities of XOR Checksums
3.2.3 Probabilities of AN Codes
3.2.4 Concrete Error Models
3.2.5 Summary and Conclusions
3.3 Throughput Considerations
3.3.1 Test Systems Descriptions
3.3.2 Vectorizing Hamming Coding
3.3.3 Vectorizing XOR Checksums
3.3.4 Vectorizing AN Coding
3.3.5 Summary and Conclusions
3.4 Comparison of Error Codes
3.4.1 Effectiveness
3.4.2 Efficiency
3.4.3 Runtime Adaptability
3.5 Performance Optimizations for AN Coding
3.5.1 The Modular Multiplicative Inverse
3.5.2 Faster Softening
3.5.3 Faster Error Detection
3.5.4 Comparison to Original AN Coding
3.5.5 The Multiplicative Inverse Anomaly
3.6 Summary
4 BIT FLIP DETECTING STORAGE
4.1 Column Store Architecture
4.1.1 Logical Data Types
4.1.2 Storage Model
4.1.3 Data Representation
4.1.4 Data Layout
4.1.5 Tree Index Structures
4.1.6 Summary
4.2 Hardened Data Storage
4.2.1 Hardened Physical Data Types
4.2.2 Hardened Lightweight Compression
4.2.3 Hardened Data Layout
4.2.4 UDI Operations
4.2.5 Summary and Conclusions
4.3 Hardened Tree Index Structures
4.3.1 B-Tree Verification Techniques
4.3.2 Justification For Further Techniques
4.3.3 The Error Detecting B-Tree
4.4 Summary
5 BIT FLIP DETECTING QUERY PROCESSING
5.1 Column Store Query Processing
5.2 Bit Flip Detection Opportunities
5.2.1 Early Onetime Detection
5.2.2 Late Onetime Detection
5.2.3 Continuous Detection
5.2.4 Miscellaneous Processing Aspects
5.2.5 Summary and Conclusions
5.3 Hardened Intermediate Results
5.3.1 Materialization of Hardened Intermediates
5.3.2 Hardened Bitmaps
5.4 Summary
6 END-TO-END EVALUATION
6.1 Prototype Implementation
6.1.1 AHEAD Architecture
6.1.2 Diversity of Physical Operators
6.1.3 One Concrete Operator Realization
6.1.4 Summary and Conclusions
6.2 Performance of Individual Operators
6.2.1 Selection on One Predicate
6.2.2 Selection on Two Predicates
6.2.3 Join Operators
6.2.4 Grouping and Aggregation
6.2.5 Delta Operator
6.2.6 Summary and Conclusions
6.3 Star Schema Benchmark Queries
6.3.1 Query Runtimes
6.3.2 Improvements Through Vectorization
6.3.3 Storage Overhead
6.3.4 Summary and Conclusions
6.4 Error Detecting B-Tree
6.4.1 Single Key Lookup
6.4.2 Key Value-Pair Insertion
6.5 Summary
7 SUMMARY AND CONCLUSIONS
7.1 Future Work
A APPENDIX
A.1 List of Golden As
A.2 More on Hamming Coding
A.2.1 Code examples
A.2.2 Vectorization
BIBLIOGRAPHY
LIST OF FIGURES
LIST OF TABLES
LIST OF LISTINGS
LIST OF ACRONYMS
LIST OF SYMBOLS
LIST OF DEFINITIONS
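As an illustration of the AN coding named in the abstract and in the contents above, the following is a minimal sketch of the general technique: a value is encoded by multiplying it with a constant A, and a code word is valid only if it is a multiple of A. The second check decodes with the modular multiplicative inverse of A and tests the value range, which is one common way to avoid a division. The constant A = 233, the 16-bit code width, and the 8-bit data width are assumptions chosen for the example and are not taken from the thesis, and the function names are hypothetical.

```python
CODE_BITS = 16                    # assumed code word width
CODE_MASK = (1 << CODE_BITS) - 1
DATA_MAX = 255                    # assumed 8-bit payload
A = 233                           # hypothetical odd constant, not from the thesis

# Modular multiplicative inverse of A modulo 2**CODE_BITS (needs Python 3.8+).
# It exists because A is odd.
A_INV = pow(A, -1, 1 << CODE_BITS)


def encode(value: int) -> int:
    """Harden a value by multiplying it with A."""
    if not 0 <= value <= DATA_MAX:
        raise ValueError("value out of range")
    return value * A


def is_valid_by_modulo(code: int) -> bool:
    """Classic check: a valid code word is a multiple of A."""
    return code % A == 0


def is_valid_by_inverse(code: int) -> bool:
    """Division-free check: decode with the inverse, then test the value range."""
    return ((code * A_INV) & CODE_MASK) <= DATA_MAX


def decode(code: int) -> int:
    """Recover the value, raising if the code word is corrupted."""
    value = (code * A_INV) & CODE_MASK
    if value > DATA_MAX:
        raise ValueError("bit flip detected")
    return value


if __name__ == "__main__":
    word = encode(42)
    print(decode(word))                      # recovers the original value
    flipped = word ^ (1 << 5)                # simulate a single bit flip
    print(is_valid_by_modulo(flipped), is_valid_by_inverse(flipped))
```

Because multiplication by the inverse is a bijection on the code word range, a word decodes to a value within the data range only if it is one of the valid code words, so the range test rejects every corrupted word that the modulo test rejects.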
Optimization of high-throughput real-time processes in physics reconstruction
The current thesis has been developed in collaboration between
Universidad de Sevilla and the European Organization for Nuclear
Research, CERN.
The LHCb detector is one of four large detectors placed along
the Large Hadron Collider, LHC. In LHCb, particles are
collided at high energies in order to understand the difference
between matter and antimatter. Due to the massive quantity
of data generated by the detector, it is necessary to filter data
in real time, based on the current knowledge captured in the
Standard Model of particle physics. The filtering, also known as
High Level Trigger, processes a throughput of 40 Tb/s of data and
performs a selection of approximately 1 000:1. The throughput is
thus reduced to roughly 40 Gb/s of data output, which is then
stored for later analysis.
The High Level Trigger process is subdivided into two stages:
High Level Trigger 1 (HLT1) and High Level Trigger 2 (HLT2).
HLT1 occurs in real time and yields a data reduction of approximately
30:1. HLT1 consists of a series of software processes
that reconstruct particle collisions. The HLT1 reconstruction only
analyzes the trajectories of particles produced in the collision,
solving a problem known as track reconstruction, which determines
whether the collision data is kept or discarded. In contrast,
HLT2 is a finer process, which requires more time to execute
and reconstructs all subdetectors composing LHCb.
Towards 2020, the LHCb detector and all the components
composing the data acquisition system will be upgraded. As
part of the data acquisition system, the servers that process
HLT1 and HLT2 will also be upgraded. In addition, the LHC
accelerator will be upgraded, increasing the data generated in
every bunch crossing roughly fivefold. Due to the accelerator
and detector upgrades, the amount of data that the HLT will
have to process is expected to increase by a factor of 40.
Projections of the software's scalability through 2020 underestimated
the resources required to cope with the increase in data
throughput. As a consequence, studies of all algorithms composing
HLT1 and HLT2 and code modernizations were carried
out in order to obtain better performance and increase the
processing capability of the hardware resources foreseen for the
upgrade.
In this thesis, several algorithms of the LHCb reconstruction
are explored. The track reconstruction problem is analyzed
in depth, and new algorithms are proposed. Since the analyzed
problems are massively parallel, these algorithms are implemented
in specialized languages for modern graphics cards
(GPUs), due to their inherently parallel architecture. Two track
reconstruction algorithm designs stem from this work. Furthermore,
four decoding algorithms and a clustering algorithm have been designed
and implemented, which are also part of HLT1. In addition,
a parallel Kalman filter algorithm has been designed
and implemented, which can be used in both HLT stages.
The developed algorithms satisfy the requirements of the
LHCb collaboration for the LHCb upgrade. In order to execute
the algorithms efficiently on GPUs, a software framework specialized
for GPUs is developed, which allows executing GPU
reconstruction sequences in parallel. Combining the developed
algorithms with the framework, an execution sequence is completed
as the foundations of a GPU HLT1.
During the research carried out in this thesis, the aforementioned
developments and a small group of collaborators coordinated
by the author led to the completion of a full GPU
HLT1 sequence. The performance obtained on GPUs allows
executing a reconstruction sequence in real time, under LHCb
upgrade conditions. The developed GPU HLT1 constitutes the
first GPU high level trigger ever developed for an LHC experiment.
Finally, various possible realizations of the GPU HLT1 for
integration into a production GPU-equipped data acquisition system
are detailed.
Towards the development of flexible, reliable, reconfigurable, and high-performance imaging systems
Current FPGAs can implement large systems because of the high density of
reconfigurable logic resources in a single chip. FPGAs are comprehensive devices
that combine flexibility and high performance in the same platform compared to
other platforms such as General-Purpose Processors (GPPs) and Application Specific
Integrated Circuits (ASICs). The flexibility of modern FPGAs is further enhanced by
the Dynamic Partial Reconfiguration (DPR) feature, which allows the
functionality of part of the system to be changed while other parts keep functioning.
FPGAs have become an important platform for digital image processing applications
because of the aforementioned features. They can fulfil the need for efficient and
flexible platforms that execute imaging tasks efficiently and reliably with
low power, high performance and high flexibility. The use of FPGAs as accelerators
for image processing outperforms most of the current solutions. Current FPGA
solutions can load the part of the imaging application that needs high computational
power onto dedicated reconfigurable hardware accelerators while other parts
run on the traditional solution, increasing the system performance. Moreover,
the use of the DPR feature enhances the flexibility of image processing further by
swapping accelerators in and out at run-time. The use of fault mitigation techniques
in FPGAs enables imaging applications to operate in harsh environments, given
that FPGAs are sensitive to radiation and extreme conditions.
The aim of this thesis is to present a platform for efficient implementations of
imaging tasks. The research uses FPGAs as the key component of this platform and
uses the concept of DPR to increase performance and flexibility, to reduce power
dissipation, and to expand the range of possible imaging applications. In this context,
it proposes the use of FPGAs to accelerate the Image Processing Pipeline (IPP)
stages, the core part of most imaging devices. The thesis introduces a number of novel
concepts. The first novel concept is the use of the FPGA hardware environment and
the DPR feature to increase parallelism and achieve high flexibility. The concept also
increases performance and reduces power consumption and area utilisation.
Based on this concept, the following implementations are presented in this thesis: an
implementation of the Adams Hamilton Demosaicing algorithm for camera colour
interpolation, which exploits FPGA parallelism to outperform other equivalents,
and an implementation of Automatic White Balance (AWB), another IPP
stage, which employs the DPR feature to demonstrate the mentioned novelty aspects. Another
novel concept in this thesis is presented in chapter 6, which uses the DPR feature to
develop a novel flexible imaging system that requires less logic and can be
implemented in small FPGAs. The system can be employed as a template for any
imaging application with no limitation. Moreover, this thesis discusses a novel
reliable version of the imaging system that adopts novel techniques including
scrubbing, Built-In Self Test (BIST), and Triple Modular Redundancy (TMR) to
detect and correct errors using the Internal Configuration Access Port (ICAP)
primitive. These techniques exploit the datapath-based nature of the implemented
imaging system to improve the system's overall reliability. The thesis presents a
proposal for integrating the imaging system with the Robust Reliable Reconfigurable
Real-Time Heterogeneous Operating System (R4THOS) to get the best out of the
system. The proposal shows the suitability of the proposed DPR imaging system to
be used as part of the core system of autonomous cars because of its unbounded
flexibility. These novel works are presented in a number of publications, as shown in section
1.3 later in this thesis.
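The voting step at the heart of Triple Modular Redundancy, mentioned in the abstract above, can be illustrated in software. The sketch below shows only the generic bitwise majority function over three replicated results; it is not the FPGA implementation from the thesis, whose details the abstract does not give, and the function names are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three replicas: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_replicas(a: int, b: int, c: int) -> list:
    """Indices of replicas whose value differs from the voted result."""
    voted = majority_vote(a, b, c)
    return [index for index, value in enumerate((a, b, c)) if value != voted]


if __name__ == "__main__":
    good = 0b1010
    faulty = 0b0110          # replica with two flipped bits
    print(bin(majority_vote(good, good, faulty)))
    print(disagreeing_replicas(good, good, faulty))
```

A single faulty replica is outvoted, and identifying which replica disagreed gives a repair mechanism such as scrubbing a place to start.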
Towards the development of a reliable reconfigurable real-time operating system on FPGAs
In the last two decades, Field Programmable Gate Arrays (FPGAs) have
rapidly developed from simple “glue-logic” to a powerful platform capable of
implementing a System on Chip (SoC). Modern FPGAs achieve not only high
performance compared with General Purpose Processors (GPPs), thanks to hardware
parallelism and dedication, but also better programming flexibility in comparison to
Application Specific Integrated Circuits (ASICs). Moreover, the hardware
programming flexibility of FPGAs is further harnessed for both performance and
manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR
allows a part or parts of a circuit to be reconfigured at run-time, without interrupting
the rest of the chip’s operation. As a result, hardware resources can be more
efficiently exploited since the chip resources can be reused by swapping
hardware tasks in and out of the chip in a time-multiplexed fashion. In addition, DPR
improves fault tolerance against transient errors and permanent damage; for example,
Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid
error accumulation. Furthermore, power and heat can be reduced by removing
finished or idle tasks from the chip. For all these reasons, DPR has
significantly promoted Reconfigurable Computing (RC) and has become a very active
topic. However, since hardware integration is increasing at an exponential rate, and
applications are becoming more complex with the growth of user demands, high-level
application design and low-level hardware implementation are increasingly
separated and layered. As a consequence, users can obtain little advantage from DPR
without the support of system-level middleware.
To bridge the gap between the high-level application and the low-level hardware
implementation, this thesis presents important contributions towards a Reliable,
Reconfigurable and Real-Time Operating System (R3TOS), which facilitates the
user exploitation of DPR from the application level by managing the complex
hardware in the background. In R3TOS, hardware tasks behave just like software
tasks, which can be created, scheduled, and mapped to different computing resources
on the fly. The novel contributions of this work are: 1) a novel implementation of an efficient task scheduler and allocator; 2) implementation of a novel real-time
scheduling algorithm (FAEDF) and two efficacious allocating algorithms (EAC and
EVC), which schedule tasks in real-time and circumvent emerging faults while
maintaining more compact empty areas; 3) design and implementation of a fault-tolerant
microprocessor by harnessing the existing FPGA resources, such as Error
Correction Code (ECC) and configuration primitives; 4) a novel symmetric
multiprocessing (SMP)-based architecture that supports a shared memory programming
interface; 5) two demonstrations of the integrated system, including a) the K-Nearest
Neighbour classifier, which is a non-parametric classification algorithm widely used
in various fields of data mining; and b) pairwise sequence alignment, namely the
Smith Waterman algorithm, used for identifying similarities between two biological
sequences.
R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking
applications, whereby resources can be dynamically managed according to
user requirements and hardware availability. Benefiting from this, not only can the
hardware resources be used more efficiently, but the system performance
can also be significantly increased. Results show that the scheduling and allocating
efficiencies have been improved by up to 2x, and the overall system performance is
further improved by ~2.5x. Future work includes the development of a Network on
Chip (NoC), which is expected to further increase the communication throughput, as
well as the standardization and automation of our system design, which will be
carried out in line with the enablement of other high-level synthesis tools, to allow
application developers to benefit from the system in a more efficient manner.
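For readers unfamiliar with the Smith Waterman algorithm used as one of the demonstrations above, the following is a minimal software sketch of its scoring recurrence with a linear gap penalty. The scoring parameters (match +3, mismatch -3, gap -2) are illustrative assumptions, and the code does not reflect how the thesis maps the algorithm onto FPGA hardware tasks; the function name is hypothetical.

```python
def smith_waterman_score(a: str, b: str, match: int = 3,
                         mismatch: int = -3, gap: int = -2) -> int:
    """Best local alignment score of a and b using a linear gap penalty."""
    previous = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for char_a in a:
        current = [0]               # first column is always zero
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            up = previous[j] + gap          # gap in b
            left = current[j - 1] + gap     # gap in a
            cell = max(0, diagonal, up, left)  # local alignment never goes below 0
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


if __name__ == "__main__":
    print(smith_waterman_score("TGTTACGG", "GGTTGACTA"))
```

Only two rows of the score matrix are kept because the sketch returns the score alone; recovering the aligned substrings would require the full matrix and a traceback.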
Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports ; 7551)
ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state of the art research around SoC related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.