7 research outputs found

    Compiled Low-Level Virtual Instruction Set Simulation and Profiling for Code Partitioning and ASIP-Synthesis

    We present ongoing work and first results in static and detailed quantitative runtime analysis of LLVM bytecode for the purpose of automatic procedural-level partitioning and cosynthesis of complex software systems. Runtime behaviour is captured by reverse compilation of LLVM bytecode into augmented, self-profiling ANSI-C simulator programs that retain the LLVM instruction level. The actual global data flow is captured both in quantity and in value range to guide function-unit layout in the synthesis of application-specific processors. Currently the implemented tool LLILA (Low Level Intermediate Language Analyzer) focuses on static analysis of inter-procedural data flow, e.g. via function parameters and global variables, to uncover a program's potential paths of data exchange.
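
    As a rough illustration of the reverse-compilation idea described above (not LLILA's actual output), the sketch below shows how an LLVM-level operation can be lowered to ANSI-C code that both executes the operation and counts it; the counter layout and function names are hypothetical.

```c
/* Hypothetical sketch of a self-profiling ANSI-C lowering of LLVM-level
 * instructions; counter layout and names are illustrative, not LLILA output. */
#include <stdint.h>
#include <stdio.h>

enum { OP_ADD = 0, OP_LOAD = 1, OP_COUNT };
static uint64_t llvm_op_count[OP_COUNT];   /* dynamic execution counts */

/* %sum = add i32 %a, %b  ->  executed and profiled */
static int32_t sim_add_i32(int32_t a, int32_t b) {
    llvm_op_count[OP_ADD]++;
    return a + b;
}

/* %v = load i32, i32* %p */
static int32_t sim_load_i32(const int32_t *p) {
    llvm_op_count[OP_LOAD]++;
    return *p;
}

int main(void) {
    int32_t x = 3, y = 4;
    int32_t s = sim_add_i32(sim_load_i32(&x), sim_load_i32(&y));
    printf("sum=%d adds=%llu loads=%llu\n", s,
           (unsigned long long)llvm_op_count[OP_ADD],
           (unsigned long long)llvm_op_count[OP_LOAD]);
    return 0;
}
```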

    Design of reliable and self-adjusting embedded systems: applications to railway transport systems

    Over the last few decades, tremendous progress has been made in the performance of semiconductor devices. In this emerging era of high-performance applications, machines need to be not only efficient but also dependable at circuit and system levels. Several works have been proposed to increase embedded systems' efficiency by reducing the gap between software flexibility and hardware performance. Owing to their reconfigurable nature, Field Programmable Gate Arrays (FPGAs) represent a relevant step towards bridging this performance/flexibility gap. Nevertheless, Dynamic Reconfiguration (DR) has continuously suffered from a bottleneck: long reconfiguration times. In this thesis, we propose a novel medium-grained high-speed dynamic reconfiguration technique for DSP48E1-based circuits. The idea is to take advantage of the runtime reprogrammability of DSP48E1 slices, coupled with a re-routable interconnection block, to change the overall circuit functionality in one clock cycle. In addition to embedded systems' efficiency, this thesis deals with reliability challenges in new sub-micron electronic systems. As new technologies rely on reduced transistor sizes and lower supply voltages to improve performance, electronic circuits are becoming remarkably sensitive and increasingly susceptible to transient errors. The system-level impact of these errors can be far-reaching, and Single Event Transients (SETs) have become a serious threat to embedded systems reliability, especially for safety-critical applications such as transportation systems. Reliability enhancement techniques based on overestimated soft error rates (SERs) can lead to unnecessary resource overheads as well as high power consumption, so accounting for error-masking phenomena is fundamental to an accurate estimation of SERs. This thesis proposes a new cross-layer model of circuit vulnerability based on a combined modeling of Transistor-Level Masking (TLM) and System-Level Masking (SLM) mechanisms. We then use this model to build a self-adaptive fault-tolerant architecture that evaluates the circuit's effective vulnerability at runtime. Accordingly, the reliability enhancement strategy is adapted to protect only vulnerable parts of the system, leading to a reliable circuit with optimized overheads. Experiments performed on a radar-based obstacle detection system for railway transportation show that the proposed approach allows relevant reliability/resource-utilization tradeoffs.
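
    The following C sketch is only a loose illustration of the self-adaptive idea: a runtime policy that enables protection for a block once its estimated effective vulnerability crosses a threshold. The multiplicative combination of transistor-level and system-level masking used here is an assumption made for illustration, not the thesis' actual cross-layer model.

```c
/* Illustrative only: protect a block when its estimated effective
 * vulnerability exceeds a threshold. The multiplicative masking model
 * below is an assumption for this sketch, not the thesis' TLM/SLM model. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    double raw_ser;       /* raw soft error rate of the block (errors/hour) */
    double tlm_masking;   /* fraction masked at transistor level, 0..1 */
    double slm_masking;   /* fraction masked at system level, 0..1 */
} block_profile;

static double effective_vulnerability(const block_profile *b) {
    return b->raw_ser * (1.0 - b->tlm_masking) * (1.0 - b->slm_masking);
}

/* Enable redundancy (e.g. TMR) only for blocks above the threshold. */
static bool needs_protection(const block_profile *b, double threshold) {
    return effective_vulnerability(b) > threshold;
}

int main(void) {
    block_profile radar_filter = { 1e-3, 0.6, 0.5 };   /* hypothetical values */
    printf("protect=%d\n", needs_protection(&radar_filter, 1e-4));
    return 0;
}
```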

    A benchmark for assessing nanosatellite on-board computer power-efficiency and performance

    Master's dissertation, Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Elétrica, 2019. This manuscript is aimed at specifying, implementing and executing a benchmark to evaluate the performance of different microcontrollers used in nanosatellite on-board computers. The focus of this work is on Attitude Determination and Control System (ADCS) applications; therefore, a review of common attitude determination and control algorithms is presented in order to identify the main features to be explored. The benchmark proposal specifies a workload, which corresponds to a set of instructions representative of the application in question, a metric for evaluating the performance of different architectures, and operational rules to ensure fair and reliable results. The implementation was carried out in a high-level language, C, and it was validated by running it on four different architectures, where execution time and power measurements were used to evaluate energy consumption. The development platforms evaluated were the Texas Instruments MSP430FR5994 LaunchPad Kit, the STMicroelectronics Nucleo-L432KC board, the Arduino Uno and the Raspberry Pi 3B. The results obtained show that the Arduino Uno has the highest energy consumption (1.41 mJ) while the Nucleo L432KC has the lowest (0.05 mJ). These results were also analyzed with the purpose of using these platforms in the projects currently developed at the Laboratory of Simulation and Control of Aerospace Systems at the University of Brasília, where the Raspberry Pi was chosen for the applications of the nanosatellite attitude simulator facility and the Nucleo L432KC for LAICAnSat activities.
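
    A minimal sketch of the energy metric described above, assuming energy is derived from measured average power and execution time (E = P × t); the figures below are placeholders, not the measurements reported in the dissertation.

```c
/* Minimal sketch: energy per benchmark run from average power and time.
 * The numbers below are placeholders, not the dissertation's measurements. */
#include <stdio.h>

static double energy_mj(double avg_power_mw, double exec_time_s) {
    return avg_power_mw * exec_time_s;   /* mW * s = mJ */
}

int main(void) {
    printf("board A: %.2f mJ\n", energy_mj(25.0, 0.056));  /* hypothetical */
    printf("board B: %.2f mJ\n", energy_mj(10.0, 0.005));  /* hypothetical */
    return 0;
}
```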

    On the simulation and design of manycore CMPs

    The progression of Moore's Law has resulted in both embedded and performance computing systems which use an ever-increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures with up to 1024 cores. Embedded multicores are increasingly popular, as it is easier to guarantee hard real-time constraints using individual cores dedicated to tasks than with traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial-and-error exploration of the design space. This thesis tackles the problems encountered in the design of these large-scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, by partitioning the NoC and using packet counting and cycle skipping, it reduces the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining lower error, compared to real hardware, than any similar simulator. Simulation speeds reach up to 370 MIPS (Million (target) Instructions Per Second), or 110 MHz, which is better than typical FPGA prototypes and approaches final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems up to 64-1024 core processors, adding support for coherent architectures using the same packet-counting techniques along with low-overhead context switching to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation, further improving simulation speeds without sacrificing any accuracy. These innovations were leveraged to investigate significant novel energy-saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, named wait-on-address, the energy spent during spin-wait style synchronisation events can be significantly reduced. This works by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency action. Upon an update or invalidation (or a traditional timer or external interrupt) the core resumes execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, such as would be the case in a system running many tasks with a small degree of parallelism in each.
The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications on a huge, diverse range of potential MPSoC designs. This data was used to train a series of machine-learning-based models, which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space, and on average Random Forest was able to predict the best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay and area, Random Forest on average produced results within 93% of the optimum result. In summary, this thesis improves upon the state of the art for cycle-accurate multicore simulation, introduces novel energy-saving changes to the ISA and micro-architecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design space exploration required to bring a new manycore design to market.
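
    As a conceptual sketch of the wait-on-address mechanism described above, the C fragment below contrasts a conventional spin-wait with one that parks the core on the monitored address; wait_on_address() is a hypothetical intrinsic standing in for the proposed instruction, not part of any existing ISA or compiler.

```c
/* Conceptual sketch of how a wait-on-address instruction changes spin-wait
 * synchronisation. wait_on_address() is a hypothetical intrinsic standing in
 * for the proposed instruction; the stub below is a no-op for illustration. */
#include <stdatomic.h>

/* On real hardware this would be a single instruction that parks the core in
 * a low-power idle state until the monitored cache line is written or
 * invalidated (or an interrupt arrives). */
static void wait_on_address(volatile void *addr) { (void)addr; }

/* Conventional spin-wait: pipeline and caches stay active the whole time. */
static void lock_spin(volatile atomic_int *flag) {
    while (atomic_exchange(flag, 1) != 0) {
        /* busy wait, burning active energy */
    }
}

/* Spin-wait using wait-on-address: the core idles until coherency traffic
 * touches the lock's cache line, then retries the acquisition. */
static void lock_wait_on_address(volatile atomic_int *flag) {
    while (atomic_exchange(flag, 1) != 0) {
        wait_on_address(flag);
    }
}

int main(void) {
    atomic_int lock = 0;
    lock_spin(&lock);               /* acquires immediately: lock was free */
    atomic_store(&lock, 0);
    lock_wait_on_address(&lock);    /* same, via the wait-on-address variant */
    return 0;
}
```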

    Heap Data Allocation to Scratch-Pad Memory in Embedded Systems

    This thesis presents the first-ever compile-time method for allocating a portion of a program's dynamic data to scratch-pad memory. A scratch-pad is a fast, directly addressed, compiler-managed SRAM memory that replaces the hardware-managed cache. It is motivated by its better real-time guarantees versus cache and by its significantly lower overheads in access time, energy consumption, area and overall runtime. Dynamic data refers to all objects allocated at run-time in a program, as opposed to static data objects, which are allocated at compile-time. Existing compiler methods for allocating data to scratch-pad are able to place only code, global and stack data (static data) in scratch-pad memory; heap and recursive-function objects (dynamic data) are allocated entirely in DRAM, resulting in poor performance for these dynamic data types. Runtime methods based on software caching can place data in scratch-pad, but because of their high overheads from software address translation, they have not been successful, especially for dynamic data. In this thesis we present a dynamic yet compiler-directed allocation method for dynamic data that, for the first time, (i) is able to place a portion of the dynamic data in scratch-pad; (ii) has no software-caching tags; (iii) requires no run-time per-access extra address translation; and (iv) is able to move dynamic data back and forth between scratch-pad and DRAM to better track the program's locality characteristics. With our method, code, global, stack and heap variables can share the same scratch-pad. When compared to placing all dynamic data in DRAM and only static data in scratch-pad, our results show that our method reduces the average runtime of our benchmarks by 22.3% and the average power consumption by 26.7%, for the same scratch-pad size fixed at 5% of total data size. Significant savings in runtime and energy across a large number of benchmarks were also observed when compared against cache memory organizations, showing our method's success under constrained SRAM sizes when dealing with dynamic data. Lastly, our method is able to minimize the profile dependence issues which plague all similar allocation methods through careful analysis of static and dynamic profile information.
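
    The sketch below loosely illustrates the idea of compiler-directed heap placement without per-access address translation (it is not the thesis' algorithm): allocation sites the compiler marks as hot draw from a scratch-pad region, while everything else falls back to DRAM-backed malloc.

```c
/* Loose illustration (not the thesis' algorithm): allocation sites the
 * compiler marks as hot draw from a scratch-pad region with no per-access
 * address translation; everything else stays in DRAM-backed malloc. */
#include <stddef.h>
#include <stdlib.h>

#define SPM_SIZE 4096
static _Alignas(max_align_t) char spm_pool[SPM_SIZE];  /* stands in for SPM SRAM */
static size_t spm_used;

/* Simple bump allocator over the scratch-pad; spills to DRAM when full. */
static void *spm_alloc(size_t n) {
    if (spm_used + n > SPM_SIZE)
        return malloc(n);
    void *p = &spm_pool[spm_used];
    spm_used += n;
    return p;
}

int main(void) {
    int *hot_buffer  = spm_alloc(64 * sizeof(int));   /* compiler-chosen hot site */
    int *cold_buffer = malloc(1024 * sizeof(int));    /* cold object left in DRAM */
    (void)hot_buffer;
    free(cold_buffer);
    return 0;
}
```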

    A Course in Field Theory

    Extensively classroom-tested, A Course in Field Theory provides material for an introductory course for advanced undergraduate and graduate students in physics. Based on the course the author has been teaching for more than 20 years, the text presents complete and detailed coverage of the core ideas and theories in quantum field theory.