Understanding the Memory Consumption of the MiBench Embedded Benchmark
Complex embedded systems today commonly involve a mix of real-time and best-effort applications. The recent emergence of small low-cost commodity multi-core processors raises the possibility of running both kinds of applications on a single machine, with virtualization ensuring that the best-effort applications cannot steal CPU cycles from the real-time applications. Nevertheless, memory contention can introduce other sources of delay that can lead to missed deadlines. In this paper, we analyze the sources of memory consumption for the real-time applications found in the MiBench embedded benchmark suite.
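The per-application measurement this kind of study relies on can be approximated with the POSIX `getrusage` interface. A minimal sketch (assuming a Linux host, where `ru_maxrss` is reported in KiB; on macOS it is in bytes):

```python
import resource

def peak_rss_kib():
    """Return this process's peak resident set size (KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Touch a buffer comparable to a small benchmark working set,
# then observe how the peak grows.
before = peak_rss_kib()
buf = bytearray(8 * 1024 * 1024)      # 8 MiB working set
for i in range(0, len(buf), 4096):    # fault in every page
    buf[i] = 1
after = peak_rss_kib()
print(after >= before)                # peak RSS never decreases
```

A real analysis would attribute the peak to code, stack, heap, and shared libraries (e.g., via `/proc/<pid>/smaps`); the sketch only captures the aggregate.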
Lost in translation: Exposing hidden compiler optimization opportunities
Existing iterative compilation and machine-learning-based optimization
techniques have proven very successful in achieving better optimizations
than the standard optimization levels of a compiler. However, they were not
engineered to support the tuning of a compiler's optimizer as part of the
compiler's daily development cycle. In this paper, we first establish the
required properties which a technique must exhibit to enable such tuning. We
then introduce an enhancement to the classic nightly routine testing of
compilers which exhibits all the required properties, and thus, is capable of
driving the improvement and tuning of the compiler's common optimizer. This is
achieved by leveraging resource usage and compilation information collected
while systematically exploiting prefixes of the transformations applied at
standard optimization levels. Experimental evaluation using the LLVM v6.0.1
compiler demonstrated that the new approach was able to reveal hidden
cross-architecture and architecture-dependent potential optimizations on two
popular processors: the Intel i5-6300U and the Arm Cortex-A53-based Broadcom
BCM2837 used in the Raspberry Pi 3B+. As a case study, we demonstrate how the
insights from our approach enabled us to identify and remove a significant
shortcoming of the CFG simplification pass of the LLVM v6.0.1 compiler.
Comment: 31 pages, 7 figures, 2 tables. arXiv admin note: text overlap with
arXiv:1802.0984
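The core idea of systematically exploiting prefixes of a standard pass pipeline can be sketched as follows. The pass names below are an illustrative (not exhaustive) slice of an -O2-style order, and the compile-and-measure step is only indicated in a comment:

```python
# An illustrative slice of an -O2-style LLVM pass order (not exhaustive).
O2_PASSES = ["mem2reg", "instcombine", "simplifycfg", "gvn", "licm"]

def pipeline_prefixes(passes):
    """Yield every prefix of the pass sequence, shortest first."""
    for n in range(1, len(passes) + 1):
        yield passes[:n]

for prefix in pipeline_prefixes(O2_PASSES):
    flags = ",".join(prefix)
    # A real harness would compile and measure each variant, e.g.
    # (hypothetical invocation): opt -passes=<flags> input.ll -o out.ll
    print(flags)
```

Comparing resource usage across prefixes localizes which transformation in the sequence helps or hurts a given program, which is what exposes the "hidden" optimization opportunities.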
Detecting and Understanding Dynamically Dead Instructions for Contemporary Architectures
Instructions executed by the processor are dynamically dead if the values they produce are not used by the program. Executing such useless instructions can potentially slow down program execution and waste power. The goal of this work is to quantify and understand the occurrence of dynamically dead instructions (DDI) for programs compiled using modern compilers for the most relevant contemporary architectures. We expect our extensive study to highlight the issue of DDI and to play a critical role in the development of compiler and/or architectural techniques to avoid DDI execution at runtime. In this thesis, we introduce our novel GCC-based instrumentation and analysis framework to determine DDI during program execution. We present the ratio and characteristics of DDI in our benchmark programs. We find that programs compiled with GCC (with and without optimizations) execute a significant fraction of DDI on x86- and ARM-based machines. Additionally, the ample amount of predication employed by GCC results in a large fraction of the instructions executed on the ARM being dynamically dead. We observe that a handful of static instructions contribute a large majority of the overall DDI in standard benchmark programs. We also find that employing a small amount of static context information can significantly benefit the detection of DDI at run-time. Additionally, we describe the results of our manual study to analyze and categorize the DDI instances in our x86 benchmarks. We briefly outline compiler- and architecture-based techniques that can be used to eliminate each category of DDI in future programs. Overall, we believe that a close synergy between compiler and architecture techniques may be the most effective strategy to eliminate DDI and improve sequential program performance and energy efficiency on modern machines.
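The dead-value criterion the abstract describes can be illustrated on a linear trace: a write is dynamically dead if no later instruction reads it before the destination is overwritten, and the destination is not live at exit. A minimal sketch (the trace format and register names are invented for illustration):

```python
def find_dead(trace, live_out=()):
    """Return indices of dynamically dead instructions in a linear trace.

    Each trace entry is (dest, srcs): the register written and the
    registers read. A write is dead if it is overwritten before any
    read, or never read and not live at exit.
    """
    last_write = {}   # reg -> index of the pending (not-yet-read) write
    dead = []
    for i, (dest, srcs) in enumerate(trace):
        for s in srcs:
            last_write.pop(s, None)      # the pending write to s was used
        if dest in last_write:           # overwritten before any read
            dead.append(last_write[dest])
        last_write[dest] = i
    for reg, idx in last_write.items():  # never read, not live-out
        if reg not in live_out:
            dead.append(idx)
    return sorted(dead)

trace = [
    ("r1", ()),          # 0: r1 = const
    ("r2", ("r1",)),     # 1: r2 = f(r1)   -> instruction 0 was used
    ("r1", ()),          # 2: r1 = const
    ("r1", ()),          # 3: overwrites 2 before any read -> 2 is dead
    ("r3", ("r1",)),     # 4: reads r1 written at 3
]
print(find_dead(trace, live_out=("r2", "r3")))  # [2]
```

The thesis's GCC-based framework works at the level of real machine instructions and predication, but the kill-before-use rule is the same.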
CVA6 RISC-V Virtualization: Architecture, Microarchitecture, and Design Space Exploration
Virtualization is a key technology used in a wide range of applications, from
cloud computing to embedded systems. Over the last few years, mainstream
computer architectures were extended with hardware virtualization support,
giving rise to a set of virtualization technologies (e.g., Intel VT, Arm VE)
that are now proliferating in modern processors and SoCs. In this article, we
describe our work on hardware virtualization support in the RISC-V CVA6 core.
Our contribution is multifold and encompasses architecture, microarchitecture,
and design space exploration. In particular, we highlight the design of a set
of microarchitectural enhancements (i.e., G-Stage Translation Lookaside Buffer
(GTLB), L2 TLB) to alleviate the virtualization performance overhead. We also
perform a Design Space Exploration (DSE) and accompanying post-layout
simulations (based on 22nm FDX technology) to assess Performance, Power, and
Area (PPA). Further, we map design variants on an FPGA platform (Genesys 2) to
assess the functional performance-area trade-off. Based on the DSE, we select
an optimal design point for the CVA6 with hardware virtualization support. For
this optimal hardware configuration, we collected functional performance
results by running the MiBench benchmark on Linux atop Bao hypervisor for a
single-core configuration. We observed a performance speedup of up to 16%
(approx. 12.5% on average) compared with the virtualization-aware non-optimized
design, at the minimal cost of 0.78% in area and 0.33% in power. Finally, all
work described in this article is publicly available and open-sourced for the
community to further evaluate additional design configurations and software
stacks.
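The two-stage translation that the GTLB accelerates can be sketched with dictionary-based page tables. The single-level, 4 KiB-page layout below is a deliberate simplification of the RISC-V multi-level walk:

```python
PAGE = 4096  # 4 KiB pages for both stages (simplification)

def translate(addr, table):
    """One-level page lookup: virtual page -> physical page, or fault."""
    page, off = divmod(addr, PAGE)
    if page not in table:
        raise KeyError(f"page fault on page {page:#x}")
    return table[page] * PAGE + off

def two_stage(gva, s1, s2):
    """Guest VA -> guest PA (stage 1) -> host PA (G-stage).

    Real hardware must also G-stage-translate every stage-1 table
    walk, which is the overhead a dedicated GTLB aims to cut.
    """
    gpa = translate(gva, s1)   # guest's own page tables
    return translate(gpa, s2)  # hypervisor's G-stage tables

s1 = {0x10: 0x40}     # guest page 0x10 -> guest-physical page 0x40
s2 = {0x40: 0x200}    # guest-physical 0x40 -> host page 0x200
hpa = two_stage(0x10 * PAGE + 0x123, s1, s2)
print(hex(hpa))       # 0x200123
```

The nested lookup is why TLB capacity and organization dominate virtualization overhead: every miss multiplies the number of memory accesses needed per translation.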
Specific Read Only Data Management for Memory Hierarchy Optimization
The multiplication of the number of cores inside embedded systems has raised the pressure on the memory hierarchy. The cost of coherence protocols and the scalability of the memory hierarchy are nowadays major issues. In this paper, a specific data management for read-only data is investigated, because these data can be duplicated in several memories without being tracked. Based on analysis of standard benchmarks for embedded systems, we show that read-only data represent 62% of all the data used by applications and 18% of all the memory accesses. A specific data path for read-only data is then evaluated by using simulations. On the first level of the memory hierarchy, removing read-only data from the L1 cache and placing them in another read-only cache improves the data locality of the read-write data by 30% and decreases the total energy consumption of the first-level memory by 5%.
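The trace-based classification behind the 62%/18% figures can be sketched directly: an address is read-only if it is never the target of a write anywhere in the trace. The toy trace below is illustrative only:

```python
def read_only_stats(trace):
    """Fraction of read-only addresses and of accesses touching them.

    trace: iterable of (addr, is_write) pairs. An address counts as
    read-only if no entry in the trace ever writes it.
    """
    written = {a for a, w in trace if w}
    addrs = {a for a, _ in trace}
    ro_addrs = addrs - written
    ro_accesses = sum(1 for a, _ in trace if a in ro_addrs)
    return len(ro_addrs) / len(addrs), ro_accesses / len(trace)

# Addresses 0 and 2 are never written; address 1 is read then written.
trace = [(0, False), (0, False), (1, False), (1, True), (2, False)]
data_frac, access_frac = read_only_stats(trace)
print(data_frac, access_frac)  # 2/3 of addresses, 3/5 of accesses
```

Because such data need no coherence tracking, they can safely be replicated across per-core read-only caches, which is what the evaluated data path exploits.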
RT-Bench: an Extensible Benchmark Framework for the Analysis and Management of Real-Time Applications
Benchmarking is crucial for testing and validating any system, even more so
in real-time systems. Typical real-time applications adhere to well-understood
abstractions: they exhibit a periodic behavior, operate on a well-defined
working set, and strive for stable response times, avoiding unpredictable
factors such as page faults. Unfortunately, available benchmark suites fail to
reflect key characteristics of real-time applications. Practitioners and
researchers must resort either to benchmarking heavily approximated real-time
environments, or to re-engineering available benchmarks to add -- if possible --
the sought-after features. Additionally, the measuring and logging capabilities
provided by most benchmark suites are not tailored "out-of-the-box" to
real-time environments, and changing basic parameters such as the scheduling
policy often becomes a tiring and error-prone exercise.
In this paper, we present RT-Bench, an open-source framework adding standard
real-time features to virtually any existing benchmark. Furthermore, RT-Bench
provides an easy-to-use, unified command line interface to customize key
aspects of the real-time execution of a set of benchmarks. Our framework is
guided by four main criteria: 1) cohesive interface, 2) support for periodic
application behavior and deadline semantics, 3) controllable memory footprint,
and 4) extensibility and portability. We have integrated within the framework
applications from the widely used SD-VBS and IsolBench suites. We showcase a
set of use-cases that are representative of typical real-time system evaluation
scenarios and that can be easily conducted via RT-Bench.Comment: 11 pages, 12 figures; code available at
https://gitlab.com/rt-bench/rt-bench, documentation available at
https://rt-bench.gitlab.io/rt-bench
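The periodic-behavior-with-deadline-semantics abstraction (criterion 2 above) can be sketched as a release loop with absolute activation times, so that one overrun does not shift every later activation. This is a generic sketch, not RT-Bench's actual implementation:

```python
import time

def run_periodic(job, period_s, iterations):
    """Release `job` every period_s seconds; count deadline misses.

    The deadline is implicit (equal to the period). Absolute release
    times keep activations anchored to the original schedule.
    """
    misses, t0 = 0, time.monotonic()
    for k in range(iterations):
        release = t0 + k * period_s
        now = time.monotonic()
        if now < release:
            time.sleep(release - now)   # wait for this activation
        job()
        if time.monotonic() > release + period_s:
            misses += 1                 # finished after the deadline
    return misses

# A job well under the 10 ms period should rarely, if ever, miss.
misses = run_periodic(lambda: sum(range(1000)), 0.010, 20)
print(misses)
```

A production framework would additionally pin the scheduling policy (e.g., SCHED_FIFO), lock memory to avoid page faults, and log per-activation response times, which is what RT-Bench's unified interface automates.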