7 research outputs found
Compiled Low-Level Virtual Instruction Set Simulation and Profiling for Code Partitioning and ASIP-Synthesis
Abstract We present ongoing work and first results in static and detailed quantitative runtime analysis of LLVM byte code for the purpose of automatic procedural level partitioning and cosynthesis of complex software systems. Runtime behaviour is captured by reverse compilation of LLVM bytecode into augmented, self-profiling ANSI-C simulator programs retaining the LLVM instruction level. The actual global data flow is captured both in quantity and value range to guide function unit layout in the synthesis of application specific processors. Currently the implemented tool LLILA (Low Level Intermediate Language Analyzer) focuses on static code analysis on the inter-procedural data flow via e.g. function parameters and global variables to uncover a program's potential paths of data exchange
Conception de systèmes embarqués fiables et auto-réglables : applications sur les systèmes de transport ferroviaire
During the last few decades, a tremendous progress in the performance of semiconductor devices has been accomplished. In this emerging era of high performance applications, machines need not only to be efficient but also need to be dependable at circuit and system levels. Several works have been proposed to increase embedded systems efficiency by reducing the gap between software flexibility and hardware high-performance. Due to their reconfigurable aspect, Field Programmable Gate Arrays (FPGAs) represented a relevant step towards bridging this performance/flexibility gap. Nevertheless, Dynamic Reconfiguration (DR) has been continuously suffering from a bottleneck corresponding to a long reconfiguration time.In this thesis, we propose a novel medium-grained high-speed dynamic reconfiguration technique for DSP48E1-based circuits. The idea is to take advantage of the DSP48E1 slices runtime reprogrammability coupled with a re-routable interconnection block to change the overall circuit functionality in one clock cycle. In addition to the embedded systems efficiency, this thesis deals with the reliability chanllenges in new sub-micron electronic systems. In fact, as new technologies rely on reduced transistor size and lower supply voltages to improve performance, electronic circuits are becoming remarkably sensitive and increasingly susceptible to transient errors. The system-level impact of these errors can be far-reaching and Single Event Transients (SETs) have become a serious threat to embedded systems reliability, especially for especially for safety critical applications such as transportation systems. The reliability enhancement techniques that are based on overestimated soft error rates (SERs) can lead to unnecessary resource overheads as well as high power consumption. Considering error masking phenomena is a fundamental element for an accurate estimation of SERs.This thesis proposes a new cross-layer model of circuits vulnerability based on a combined modeling of Transistor Level (TLM) and System Level Masking (SLM) mechanisms. We then use this model to build a self adaptive fault tolerant architecture that evaluates the circuit’s effective vulnerability at runtime. Accordingly, the reliability enhancement strategy is adapted to protect only vulnerable parts of the system leading to a reliable circuit with optimized overheads. Experimentations performed on a radar-based obstacle detection system for railway transportation show that the proposed approach allows relevant reliability/resource utilization tradeoffs.Un énorme progrès dans les performances des semiconducteurs a été accompli ces dernières années. Avec l’´émergence d’applications complexes, les systèmes embarqués doivent être à la fois performants et fiables. Une multitude de travaux ont été proposés pour améliorer l’efficacité des systèmes embarqués en réduisant le décalage entre la flexibilité des solutions logicielles et la haute performance des solutions matérielles. En vertu de leur nature reconfigurable, les FPGAs (Field Programmable Gate Arrays) représentent un pas considérable pour réduire ce décalage performance/flexibilité. Cependant, la reconfiguration dynamique a toujours souffert d’une limitation liée à la latence de reconfiguration.Dans cette thèse, une nouvelle technique de reconfiguration dynamiqueau niveau ”grain-moyen” pour les circuits à base de blocks DSP48E1 est proposée. L’idée est de profiter de la reprogrammabilité des blocks DSP48E1 couplée avec un circuit d’interconnection reconfigurable afin de changer la fonction implémentée par le circuit en un cycle horloge. D’autre part, comme les nouvelles technologies s’appuient sur la réduction des dimensions des transistors ainsi que les tensions d’alimentation, les circuits électroniques sont devenus de plus en plus susceptibles aux fautes transitoires. L’impact de ces erreurs au niveau système peut être catastrophique et les SETs (Single Event Transients) sont devenus une menace tangible à la fiabilité des systèmes embarqués, en l’occurrence pour les applications critiques comme les systèmes de transport. Les techniques de fiabilité qui se basent sur des taux d’erreurs (SERs) surestimés peuvent conduire à un gaspillage de ressources et par conséquent un cout en consommation de puissance électrique. Il est primordial de prendre en compte le phénomène de masquage d’erreur pour une estimation précise des SERs.Cette thèse propose une nouvelle modélisation inter-couches de la vulnérabilité des circuits qui combine les mécanismes de masquage au niveau transistor (TLM) et le masquage au niveau Système (SLM). Ce modèle est ensuite utilisé afin de construire une architecture adaptative tolérante aux fautes qui évalue la vulnérabilité effective du circuit en runtime. La stratégie d’amélioration de fiabilité est adaptée pour ne protéger que les parties vulnérables du système, ce qui engendre un circuit fiable avec un cout optimisé. Les expérimentations effectuées sur un système de détection d’obstacles à base de radar pour le transport ferroviaire montre que l’approche proposée permet d’´établir un compromis fiabilité/ressources utilisées
A benchmark for assessing nanosatellite on-board computer power-efficiency and performance
Dissertação (mestrado)—Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Elétrica, 2019.O presente manuscrito tem como escopo a especificação, implementação e execução de um
benchmark para avaliar o desempenho de diferentes microcontroladores de computadores de
bordo de nanossatélites. O foco deste trabalho está nas aplicações dos sistemas de determinação
e controle de atitude (ADCS, do inglês Attitude Determination and Control System), e portanto,
uma revisão de algoritmos de determinação e controle de atitude utilizados em missões espaciais
é apresentada, visando a identificação de recursos a serem explorados. A proposta do benchmark
especifica uma carga de trabalho, que corresponde a um conjunto de instruções representativas
da aplicação em questão, uma métrica para avaliar o desempenho de diferentes arquiteturas e
regras operacionais para garantir resultados justos e confiáveis. A implementação foi realizada
em linguagem de alto nível, C, e sua validação foi realizada através de sua execução em quatro
diferentes arquiteturas, onde medidas de tempo e consumo de potência são utilizados para
avaliar o consumo de energia. As plataformas de desenvolvimento avaliadas foram o
MSP430FR5994 Launchpad Kit da Texas Instruments, a placa Nucleo-L432KC da
STMicroeletronics, o Arduino Uno e o Raspberry Pi 3B. Os resultados obtidos mostram que o
Arduino Uno apresenta o maior consumo de energia (1.41mJ) enquanto o Nucleo L432KC possui
o menor (0.05mJ). Esses resultados também foram analisados visando a utilização dessas
plataformas nos projetos desenvolvidos atualmente no Laboratório de Simulação e Controle de
Sistemas Aeroespaciais da Universidade de Brasília, onde o Raspberry Pi foi escolhido para as
aplicações do simulador de atitude de nanossatélites e o Nucleo L432KC para as atividades do
LAICAnSat.This manuscript is aimed at specifying, implementing and executing a benchmark to evaluate the
performance of different microcontrollers of nanosatellite onboard computers. The focus of this
work is on Attitude Determination and Control System (ADCS) applications and therefore, a review
of common attitude determination and control algorithms is presented in order to identify the main
features to be explored. The benchmark proposal specifies a workload, which corresponds to a set
of instructions that represent the application in question, a metric for evaluating the performance of
different architectures and operational rules to ensure fair and reliable results. The implementation
was carried out in a higher-level language, C, and it was validated by running it on four different
architectures, where execution time and power measurements were used to evaluate energy
consumption. The development platforms evaluated were the Texas Instruments MSP430FR5994
Launchpad Kit, STMicroeletronics Nucleo-L432KC board, Arduino Uno and Raspberry Pi 3B. The
results obtained show that Arduino Uno has the highest energy consumption (1.41mJ) while
Nucleo L432KC has the lowest (0.05mJ). These results were also analyzed with the purpose of
using these platforms in the projects currently developed at the Laboratory of Simulation and
Control of Aerospace Systems at the University of Brasilia where the Raspberry Pi was chosen for
the applications of the nanosatellite simulator facility and the Nucleo L432KC for LAICAnSat
activities
On the simulation and design of manycore CMPs
The progression of Moore’s Law has resulted in both embedded and performance
computing systems which use an ever increasing number of processing cores integrated
in a single chip. Commercial systems are now available which provide hundreds
of cores, and academics have proposed architectures for up to 1024 cores. Embedded
multicores are increasingly popular as it is easier to guarantee hard-realtime constraints
using individual cores dedicated for tasks, than to use traditional time-multiplexed processing.
However, finding the optimal hardware configuration to meet these requirements
at minimum cost requires extensive trial and error approaches to investigate the
design space.
This thesis tackles the problems encountered in the design of these large scale multicore
systems by first addressing the problem of fast, detailed micro-architectural simulation.
Initially addressing embedded systems, this work exploits the lack of hardware
cache-coherence support in many deeply embedded systems to increase the available
parallelism in the simulation. Then, through partitioning the NoC and using packet
counting and cycle skipping reduces the amount of computation required to accurately
model the NoC interconnect. In combination, this enables simulation speeds significantly
higher than the state of the art, while maintaining less error, when compared
to real hardware, than any similar simulator. Simulation speeds reach up to 370MIPS
(Million (target) Instructions Per Second), or 110MHz, which is better than typical
FPGA prototypes, and approaching final ASIC production speeds. This is achieved
while maintaining an error of only 2.1%, significantly lower than other similar simulators.
The thesis continues by scaling the simulator past large embedded systems up to
64-1024 core processors, adding support for coherent architectures using the same
packet counting techniques along with low overhead context switching to enable the
simulation of such large systems with stricter synchronisation requirements. The new
interconnect model was partitioned to enable parallel simulation to further improve
simulation speeds in a manner which did not sacrifice any accuracy.
These innovations were leveraged to investigate significant novel energy saving optimisations
to the coherency protocol, processor ISA, and processor micro-architecture.
By introducing a new instruction, with the name wait-on-address, the energy spent during
spin-wait style synchronisation events can be significantly reduced. This functions
by putting the core into a low-power idle state while the cache line of the indicated
address is monitored for coherency action. Upon an update or invalidation (or traditional
timer or external interrupts) the core will resume execution, but the active
energy of running the core pipeline and repeatedly accessing the data and instruction
caches is effectively reduced to static idle power. The thesis also shows that existing
combined software-hardware schemes to track data regions which do not require coherency
can adequately address the directory-associativity problem, and introduces a
new coherency sharer encoding which reduces the energy consumed by sharer invalidations
when sharers are grouped closely together, such as would be the case with a
system running many tasks with a small degree of parallelism in each.
The research concludes by using the extremely fast simulation speeds developed to
produce a large set of training data, collecting various runtime and energy statistics for
a wide range of embedded applications on a huge diverse range of potential MPSoC
designs. This data was used to train a series of machine learning based models which
were then evaluated on their capacity to predict performance characteristics of unseen
workload combinations across the explored MPSoC design space, using only two sample
simulations, with promising results from some of the machine learning techniques.
The models were then used to produce a ranking of predicted performance across the
design space, and on average Random Forest was able to predict the best design within
89% of the runtime performance of the actual best tested design, and better than 93%
of the alternative design space. When predicting for a weighted metric of energy, delay
and area, Random Forest on average produced results within 93% of the optimum
result.
In summary this thesis improves upon the state of the art for cycle accurate multicore
simulation, introduces novel energy saving changes the the ISA and microarchitecture
of future multicore processors, and demonstrates the viability of machine
learning techniques to significantly accelerate the design space exploration required to
bring a new manycore design to market
Heap Data Allocation to Scratch-Pad Memory in Embedded Systems
This thesis presents the first-ever compile-time method for allocating a portion of a program's dynamic data to scratch-pad memory. A scratch-pad is a fast directly addressed compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees vs cache and by its significantly lower overheads in access time, energy consumption, area and overall runtime. Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global and stack data (static data) in scratch-pad memory; heap and recursive-function objects(dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data.
In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no run-time per-access extra address translation; and (iv) is able to move dynamic data back and forth between scratch-pad and DRAM to better track the program's locality characteristics. With our method, code, global, stack and heap variables can share the same scratch-pad. When compared to placing all dynamic data variables in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3%, and the average power consumption by 26.7%, for the same size of scratch-pad fixed at 5% of total data size. Significant savings in runtime and energy across a large number of benchmarks were also observed when compared against cache memory organizations, showing our method's success under constrained SRAM sizes when dealing with dynamic data. Lastly, our method is able to minimize the profile dependence issues which plague all similar allocation methods through careful analysis of static and dynamic profile information
A Course in Field Theory
Extensively classroom-tested, A Course in Field Theory provides material for an introductory course for advanced undergraduate and graduate students in physics. Based on the author's course that he has been teaching for more than 20 years, the text presents complete and detailed coverage of the core ideas and theories in quantum field theory