5,186 research outputs found

    Towards a Time-predictable Dual-Issue Microprocessor: The Patmos Approach

    Get PDF
    Current processors are optimized for average case performance, often leading to a high worst-case execution time (WCET). Many architectural features that increase the average case performance are hard to be modeled for the WCET analysis. In this paper we present Patmos, a processor optimized for low WCET bounds rather than high average case performance. Patmos is a dual-issue, statically scheduled RISC processor. The instruction cache is organized as a method cache and the data cache is organized as a split cache in order to simplify the cache WCET analysis. To fill the dual-issue pipeline with enough useful instructions, Patmos relies on a customized compiler. The compiler also plays a central role in optimizing the application for the WCET instead of average case performance

    Java Dust: How Small Can Embedded Java Be?

    Get PDF
    Java is slowly being accepted as a language and platform for embedded devices. However, the memory requirements of the Java library and runtime are still troublesome. A Java system is considered small when it requires less than 1 MB, and within the embedded domain small microcontollers with a few KB on-chip Flash memory and even less on-chip RAM are very common. For such small devices Java is a clearly challenging. In this paper we present the combination of the Java compiler Muvium for microcontrollers with the tiny soft-core Leros for an FPGA. To the best of our knowledge, the presented embedded Java system is the smallest Java system available. The Leros processor consumes less than 5 % of the logic cells of the smallest FPGA from Altera and the Muvium compiler produces a JVM, including the Java application, that can execute in a few KB ROM and less than 1 KB RAM. The Leros processor is available in open-source and the Leros port of Muvium is freely available

    JIST: just-in-time scheduling translation for parallel processors

    Get PDF
    The application fields of bytecode virtual machines and VLIW processors overlap in the area of embedded and mobile systems, where the two technologies offer different benefits, namely high code portability, low power consumption and reduced hardware cost. Dynamic compilation makes it possible to bridge the gap between the two technologies, but special attention must be paid to software instruction scheduling, a must for the VLIW architectures. We have implemented JIST, a Virtual Machine and JIT compiler for Java Bytecode targeted to a VLIW processor. We show the impact of various optimizations on the performance of code compiled with JIST through the experimental study on a set of benchmark programs. We report significant speedups, and increments in the number of instructions issued per cycle up to 50% with respect to the non-scheduling version of the JITcompiler. Further optimizations are discussed

    Design Space Exploration of Object Caches with Cross-Profiling

    Get PDF
    To avoid data cache trashing between heapallocated data and other data areas, a distinct object cache has been proposed for embedded real-time Java processors. This object cache uses high associativity in order to statically track different object pointers for worst-case execution-time analysis. However, before implementing such an object cache, an empirical analysis of different organization forms is needed. We use a cross-profiling technique based on aspect-oriented programming in order to evaluate different object cache organizations with standard Java benchmarks. From the evaluation we conclude that field access exhibits some temporal locality, but almost no spatial locality. Therefore, filling long cache lines on a miss just introduces a high miss penalty without increasing the hit rate enough to make up for the increased miss penalty. For an object cache, it is more efficient to fill individual words within the cache line on a miss

    Inferring Energy Bounds via Static Program Analysis and Evolutionary Modeling of Basic Blocks

    Full text link
    The ever increasing number and complexity of energy-bound devices (such as the ones used in Internet of Things applications, smart phones, and mission critical systems) pose an important challenge on techniques to optimize their energy consumption and to verify that they will perform their function within the available energy budget. In this work we address this challenge from the software point of view and propose a novel parametric approach to estimating tight bounds on the energy consumed by program executions that are practical for their application to energy verification and optimization. Our approach divides a program into basic (branchless) blocks and estimates the maximal and minimal energy consumption for each block using an evolutionary algorithm. Then it combines the obtained values according to the program control flow, using static analysis, to infer functions that give both upper and lower bounds on the energy consumption of the whole program and its procedures as functions on input data sizes. We have tested our approach on (C-like) embedded programs running on the XMOS hardware platform. However, our method is general enough to be applied to other microprocessor architectures and programming languages. The bounds obtained by our prototype implementation can be tight while remaining on the safe side of budgets in practice, as shown by our experimental evaluation.Comment: Pre-proceedings paper presented at the 27th International Symposium on Logic-Based Program Synthesis and Transformation (LOPSTR 2017), Namur, Belgium, 10-12 October 2017 (arXiv:1708.07854). Improved version of the one presented at the HIP3ES 2016 workshop (v1): more experimental results (added benchmark to Table 1, added figure for new benchmark, added Table 3), improved Fig. 1, added Fig.

    Evolutionary improvement of programs

    Get PDF
    Most applications of genetic programming (GP) involve the creation of an entirely new function, program or expression to solve a specific problem. In this paper, we propose a new approach that applies GP to improve existing software by optimizing its non-functional properties such as execution time, memory usage, or power consumption. In general, satisfying non-functional requirements is a difficult task and often achieved in part by optimizing compilers. However, modern compilers are in general not always able to produce semantically equivalent alternatives that optimize non-functional properties, even if such alternatives are known to exist: this is usually due to the limited local nature of such optimizations. In this paper, we discuss how best to combine and extend the existing evolutionary methods of GP, multiobjective optimization, and coevolution in order to improve existing software. Given as input the implementation of a function, we attempt to evolve a semantically equivalent version, in this case optimized to reduce execution time subject to a given probability distribution of inputs. We demonstrate that our framework is able to produce non-obvious optimizations that compilers are not yet able to generate on eight example functions. We employ a coevolved population of test cases to encourage the preservation of the function's semantics. We exploit the original program both through seeding of the population in order to focus the search, and as an oracle for testing purposes. As well as discussing the issues that arise when attempting to improve software, we employ rigorous experimental method to provide interesting and practical insights to suggest how to address these issues

    Dusty Caches to Save Memory Traffic

    Get PDF
    Reference counting is a garbage-collection technique that maintains a per-object count of the number of pointers to that object. When the count reaches zero, the object must be dead and can be collected. Although it is not an exact method, it is well suited for real-time systems and is widely implemented, sometimes in conjunction with other methods to increase the overall precision. A disadvantage of reference counting is the extra storage trac that is introduced. In this paper, we describe a new cache write-back policy that can substantially decrease the reference-counting traffic to RAM. We propose a new cache design that remembers the first-fetched value of a cache subblock, so that the subblock need not be written back to RAM unless a different value is present. We present results from experiments that show the effectiveness of this approach, particularly in mitigating the storage traffic due to reference counting

    VXA: A Virtual Architecture for Durable Compressed Archives

    Full text link
    Data compression algorithms change frequently, and obsolete decoders do not always run on new hardware and operating systems, threatening the long-term usability of content archived using those algorithms. Re-encoding content into new formats is cumbersome, and highly undesirable when lossy compression is involved. Processor architectures, in contrast, have remained comparatively stable over recent decades. VXA, an archival storage system designed around this observation, archives executable decoders along with the encoded content it stores. VXA decoders run in a specialized virtual machine that implements an OS-independent execution environment based on the standard x86 architecture. The VXA virtual machine strictly limits access to host system services, making decoders safe to run even if an archive contains malicious code. VXA's adoption of a "native" processor architecture instead of type-safe language technology allows reuse of existing "hand-optimized" decoders in C and assembly language, and permits decoders access to performance-enhancing architecture features such as vector processing instructions. The performance cost of VXA's virtualization is typically less than 15% compared with the same decoders running natively. The storage cost of archived decoders, typically 30-130KB each, can be amortized across many archived files sharing the same compression method.Comment: 14 pages, 7 figures, 2 table

    A Co-Processor Approach for Efficient Java Execution in Embedded Systems

    Get PDF
    This thesis deals with a hardware accelerated Java virtual machine, named REALJava. The REALJava virtual machine is targeted for resource constrained embedded systems. The goal is to attain increased computational performance with reduced power consumption. While these objectives are often seen as trade-offs, in this context both of them can be attained simultaneously by using dedicated hardware. The target level of the computational performance of the REALJava virtual machine is initially set to be as fast as the currently available full custom ASIC Java processors. As a secondary goal all of the components of the virtual machine are designed so that the resulting system can be scaled to support multiple co-processor cores. The virtual machine is designed using the hardware/software co-design paradigm. The partitioning between the two domains is flexible, allowing customizations to the resulting system, for instance the floating point support can be omitted from the hardware in order to decrease the size of the co-processor core. The communication between the hardware and the software domains is encapsulated into modules. This allows the REALJava virtual machine to be easily integrated into any system, simply by redesigning the communication modules. Besides the virtual machine and the related co-processor architecture, several performance enhancing techniques are presented. These include techniques related to instruction folding, stack handling, method invocation, constant loading and control in time domain. The REALJava virtual machine is prototyped using three different FPGA platforms. The original pipeline structure is modified to suit the FPGA environment. The performance of the resulting Java virtual machine is evaluated against existing Java solutions in the embedded systems field. The results show that the goals are attained, both in terms of computational performance and power consumption. Especially the computational performance is evaluated thoroughly, and the results show that the REALJava is more than twice as fast as the fastest full custom ASIC Java processor. In addition to standard Java virtual machine benchmarks, several new Java applications are designed to both verify the results and broaden the spectrum of the tests.Siirretty Doriast
    corecore