The SHAP Microarchitecture and Java Virtual Machine
This report presents the SHAP platform, consisting of its microarchitecture and its implementation of the Java Virtual Machine (JVM). Like quite a few other embedded implementations of the Java platform, the SHAP microarchitecture relies on an instruction set architecture based on Java bytecode. Unlike them, however, it features a design with well-encapsulated components that autonomously manage their duties at rather high abstraction levels. Permanent runtime duties are thus transferred from the central computing core to concurrently working components, so that the core can spend a larger fraction of its time executing application code. This increases the degree of parallelism between the application and the runtime implementation. Currently, stack and heap management, including automatic garbage collection, are implemented this way. After detailing the design of the microarchitecture, the report describes the SHAP implementation of the Java Virtual Machine. A major focus is the layout and use of the runtime data structures representing the various language abstractions provided by Java. The boot sequence that starts the JVM is also described.
Automatic generation of synthetic workloads for multicore systems
When designing a computer system, benchmark programs are used with cycle-accurate performance/power simulators and HDL-level simulators to evaluate novel architectural enhancements, perform design space exploration, understand the worst-case power characteristics of various designs and find performance bottlenecks. This research effort is directed towards automatically generating synthetic benchmarks to tackle three design challenges: 1) For most simulation-related purposes, full runs of modern real-world parallel applications such as the PARSEC and SPLASH suites cannot be used, because they take machine-weeks on cycle-accurate and HDL-level simulators, incurring a prohibitively large time cost. 2) Some of these real-world applications are intellectual property and cannot be shared with processor vendors for design studies. 3) The most significant problem in the design stage is the complexity involved in fixing the maximum power consumption of a multicore design, called the Thermal Design Power (TDP). To fix this maximum power consumption at the most suitable point, designers traditionally hand-craft code snippets called power viruses, but manually writing such maximum-power-consuming code snippets is very tedious.
All of these challenges have led to the resurrection of synthetic benchmarks in the recent past, as they offer a promising solution to each of them. During the design stage of a multicore system, a framework that automatically generates system-level synthetic benchmarks will greatly simplify the design process and result in more confident design decisions. The key idea behind such an adaptable benchmark synthesis framework is to identify the characteristics of real-world parallel applications that affect the performance and power consumption of a real program, and to create synthetic executable programs by varying the values of these characteristics. Firstly, with such a framework, one can generate miniaturized synthetic clones of large target (current and futuristic) parallel applications, enabling an architect to use them with slow low-level simulation models (e.g., RTL models in VHDL/Verilog) and to tailor designs to the targeted applications. Secondly, these synthetic benchmark clones can be distributed to architects and designers even if the original applications are intellectual property and not publicly available. Lastly, such a framework can be used to automatically create maximum-power-consuming code snippets that help in fixing the TDP, heat sinks, cooling system and other power-related features of the system.
The workload cloning framework built using the proposed synthetic benchmark generation methodology is evaluated to show its superiority over existing cloning methodologies for single-core systems by generating miniaturized clones of CPU2006 and ImplantBench workloads, with an average error of only 2.9% in performance for up to five orders of magnitude of simulation speedup. The correlation coefficient predicting the sensitivity to design changes is 0.95 for performance and 0.98 for power consumption. The proposed framework is also evaluated by cloning parallel applications in the PARSEC benchmark suite that are implemented with pthreads and OpenMP. The average error in predicting performance is 4.87% and that of power consumption is 2.73%. The correlation coefficient predicting the sensitivity to design changes is 0.92 for performance. The efficacy of the proposed synthetic benchmark generation framework for power virus generation is evaluated on the SPARC, Alpha and x86 ISAs using full-system simulators and also using real hardware. The results show that the power viruses generated for single-core systems consume 14-41% more power than MPrime on the SPARC ISA. Similarly, the power viruses generated for multicore systems consume 45-98%, 40-89% and 41-56% more power than PARSEC workloads, multiple copies of MPrime and multithreaded SPECjbb, respectively. Electrical and Computer Engineering
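The abstract above reports an average error and a correlation coefficient but does not define them. A minimal sketch, assuming the error is a mean absolute percentage error between original and clone, and the correlation is a Pearson coefficient across design configurations; the function names and any input data are hypothetical and not taken from the thesis:

```python
import math


def mean_abs_pct_error(original, clone):
    """Average of |clone - original| / original, in percent (assumed metric)."""
    errors = [abs(c - o) / o for o, c in zip(original, clone)]
    return 100.0 * sum(errors) / len(errors)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)
```

Under this reading, each input sequence would hold one measurement (for example, IPC or watts) per simulated design configuration, once for the original workload and once for its clone.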
Real-Time Operating Systems and Programming Languages for Embedded Systems
In this chapter, we present the alternatives available today for the development of real-time embedded systems. We focus on programming languages such as C++, Java and Ada, and on operating systems such as Linux-RT, FreeRTOS and TinyOS. In particular, we analyze the current state of the art for developing embedded systems under the write once, run anywhere (WORA) paradigm with standard Java [1], with its Real-Time Specification, and with Real-Time Core Extensions and picoJava-based CPUs [5]. We expect readers to come away with a clear view of the options available when starting a design, with their pros and cons, so that they can choose the one that best fits their case. Affiliation: Orozco, Javier Dario. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Investigaciones en Ingeniería Eléctrica "Alfredo Desages". Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras. Instituto de Investigaciones en Ingeniería Eléctrica "Alfredo Desages"; Argentina. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras. Laboratorio de Sistemas Digitales; Argentina. Affiliation: Santos, Rodrigo Martin. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Investigaciones en Ingeniería Eléctrica "Alfredo Desages". Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras. Instituto de Investigaciones en Ingeniería Eléctrica "Alfredo Desages"; Argentina. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras. Laboratorio de Sistemas Digitales; Argentina
SHAP — Scalable Multi-Core Java Bytecode Processor
This paper introduces a new embedded Java multi-core architecture that performs significantly better for a large number of cores than the related projects JopCMP and jamuth IP multi-core. The cores gain fast access to the shared heap through a full-duplex bus with pipelined transactions. Each core is equipped with local on-chip memory for the Java operand stack and the method cache to further reduce the memory bandwidth requirements. As opposed to the related projects, synchronization is supported on a per-object basis instead of through a single lock. Load balancing is implemented in Java and requires no additional hardware. The multi-port memory manager includes an exact and fully concurrent garbage collector for automatic memory management. The design can be synthesized for a variable number of parallel cores and shows a linear increase in chip space. Three different benchmarks demonstrate the very good scalability of our architecture. Due to limited chip space on our evaluation platform, the core count could not be increased beyond 8. But, we expect a smooth performance decrease
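The per-object synchronization mentioned in this abstract can be illustrated in software. This is only a sketch of the general idea, not the SHAP hardware mechanism: `SharedObject`, `worker` and `run_demo` are hypothetical names, and Python threads stand in for processor cores.

```python
import threading


class SharedObject:
    """Hypothetical heap object that carries its own monitor lock."""

    def __init__(self):
        self.monitor = threading.Lock()
        self.counter = 0


def worker(obj, increments):
    # Only threads working on the same object contend for its monitor;
    # a single global lock would serialize all of them instead.
    for _ in range(increments):
        with obj.monitor:
            obj.counter += 1


def run_demo(threads_per_object=2, increments=1000):
    objects = [SharedObject(), SharedObject()]
    threads = [
        threading.Thread(target=worker, args=(obj, increments))
        for obj in objects
        for _ in range(threads_per_object)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return [obj.counter for obj in objects]
```

With one lock per object, updates to different objects never wait on each other, which is the property the abstract contrasts with a single shared lock.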
A Time-predictable Object Cache
Static cache analysis for data allocated on the heap is practically impossible for standard data caches. We propose a distinct object cache for heap-allocated data. The cache is highly associative so that symbolic object addresses can be tracked in the static analysis. Cache lines are organized to hold single objects, and individual fields are loaded on a miss. This cache organization is statically analyzable and improves performance. In this paper we present the design and implementation of the object cache in a uniprocessor and a chip-multiprocessor version of the Java processor JOP. Keywords: real-time systems; time-predictable computer architecture; worst-case execution time analysis
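The organization described in this abstract (one cache line per object, individual fields loaded on a miss) can be modelled with a small simulation. This is a toy sketch under stated assumptions, not the JOP implementation: the class name `ObjectCache`, the FIFO replacement policy and the bypass for fields beyond the line size are all assumptions made for illustration.

```python
from collections import OrderedDict


class ObjectCache:
    """Toy model: fully associative, one line per object, fields filled on demand."""

    def __init__(self, num_lines, fields_per_line):
        self.num_lines = num_lines
        self.fields_per_line = fields_per_line
        self.lines = OrderedDict()  # object reference -> {field index: value}
        self.hits = 0
        self.misses = 0

    def read(self, heap, ref, field):
        # Assumption: fields that do not fit in a line bypass the cache.
        if field >= self.fields_per_line:
            self.misses += 1
            return heap[ref][field]
        line = self.lines.get(ref)
        if line is None:
            # Object miss: allocate a line, evicting the oldest one (FIFO, assumed).
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)
            line = {}
            self.lines[ref] = line
        if field in line:
            self.hits += 1
            return line[field]
        # Field miss: load only the requested field, not the whole object.
        self.misses += 1
        line[field] = heap[ref][field]
        return line[field]
```

Because lines are tagged by object reference rather than by memory address, a hit or miss depends only on which objects were accessed, which is what makes this kind of cache easier to reason about statically.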
Operating System Support for Redundant Multithreading
Failing hardware is a fact and trends in microprocessor design indicate that the fraction of hardware suffering from permanent and transient faults will continue to increase in future chip generations. Researchers proposed various solutions to this issue with different downsides: Specialized hardware components make hardware more expensive in production and consume additional energy at runtime. Fault-tolerant algorithms and libraries enforce specific programming models on the developer. Compiler-based fault tolerance requires the source code for all applications to be available for recompilation. In this thesis I present ASTEROID, an operating system architecture that integrates applications with different reliability needs.
ASTEROID is built on top of the L4/Fiasco.OC microkernel and extends the system with Romain, an operating system service that transparently replicates user applications. Romain supports single- and multi-threaded applications without requiring access to the application's source code. Romain replicates applications and their resources completely and thereby does not rely on hardware extensions, such as ECC-protected memory. In my thesis I describe how to efficiently implement replication as a form of redundant multithreading in software. I develop mechanisms to manage replica resources and to make multi-threaded programs behave deterministically for replication.
I furthermore present an approach to handle applications that use shared-memory channels with other programs. My evaluation shows that Romain provides 100% error detection and more than 99.6% error correction for single-bit flips in memory and general-purpose registers. At the same time, Romain's execution time overhead is below 14% for single-threaded applications running in triple-modular redundant mode. The last part of my thesis acknowledges that software-implemented fault tolerance methods often rely on the correct functioning of a certain set of hardware and software components, the Reliable Computing Base (RCB).
I introduce the concept of the RCB and discuss what constitutes the RCB of the ASTEROID system and other fault tolerance mechanisms. Thereafter I show three case studies that evaluate approaches to protecting RCB components and thereby aim to achieve a software stack that is fully protected against hardware errors.
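The triple-modular redundant mode mentioned in this abstract rests on comparing replica states and letting the majority win. A minimal sketch of such a voter, assuming replica state can be summarized as a hashable value; the function `vote` is hypothetical and does not reflect how Romain compares or repairs replicas internally:

```python
from collections import Counter


def vote(replica_states):
    """Return (majority_state, indices_of_disagreeing_replicas).

    Raises RuntimeError when no strict majority exists, in which case the
    error is detected but cannot be corrected by voting.
    """
    counts = Counter(replica_states)
    state, occurrences = counts.most_common(1)[0]
    if occurrences <= len(replica_states) // 2:
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, s in enumerate(replica_states) if s != state]
    return state, faulty
```

With three replicas, a single corrupted replica is outvoted and can be corrected from the majority state. With only two replicas, a mismatch is detected but there is no majority to recover from.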