SACR: Scheduling-Aware Cache Reconfiguration for Real-Time Embedded Systems
Dynamic reconfiguration techniques are widely used for efficient system optimization. Dynamic cache reconfiguration is a promising approach for reducing energy consumption as well as for improving overall system performance. It is a major challenge to introduce cache reconfiguration into real-time embedded systems since dynamic analysis may adversely affect tasks with real-time constraints. This paper presents a novel approach for implementing cache reconfiguration in soft real-time systems by efficiently leveraging static analysis during execution to both minimize energy and maximize performance. To the best of our knowledge, this is the first attempt to integrate dynamic cache reconfiguration in real-time scheduling techniques. Our experimental results using a wide variety of applications have demonstrated that our approach can significantly (up to 74%) reduce the overall energy consumption of the cache hierarchy in soft real-time systems.
A survey on hardware-based malware detection approaches
This paper delves into the dynamic landscape of computer security, where malware poses a paramount threat. Our focus is a riveting exploration of the recent and promising hardware-based malware detection approaches. Leveraging hardware performance counters and machine learning prowess, hardware-based malware detection approaches bring forth compelling advantages such as real-time detection, resilience to code variations, minimal performance overhead, protection disablement fortitude, and cost-effectiveness. Navigating through a generic hardware-based detection framework, we meticulously analyze the approach, unraveling the most common methods, algorithms, tools, and datasets that shape its contours. This survey is not only a resource for seasoned experts but also an inviting starting point for those venturing into the field of malware detection. However, challenges emerge in detecting malware based on hardware events. We struggle with the imperative of accuracy improvements and strategies to address the remaining classification errors. The discussion extends to crafting mixed hardware and software approaches for collaborative efficacy, essential enhancements in hardware monitoring units, and a better understanding of the correlation between hardware events and malware applications.
Cache Sharing Administration for Performance Fairness using D3C Miss Classification in Chip Multi-Processors
This work presents a study of fairness in cache sharing between processes in a chip multiprocessor (CMP). We propose a new algorithm that uses a metric based on the D3C miss classification and LRU stack distance to measure fairness in the administration of resources, with the aim of increasing the global IPC of all executed processes. Shared cache miss rate, IPC and bandwidth metrics were considered to analyze the simulation results obtained using three test sets. The results showed that, compared to the Capitalist management policy, the proposed dynamic management policy has a lower global miss rate in the shared cache and lower bandwidth usage for each test set studied, and fulfills its objective of managing the shared cache space for every process while improving the overall IPC. Sociedad Argentina de Informática e Investigación Operativa
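As a rough illustration of the LRU stack distance that the fairness metric above builds on, the following sketch computes the stack distance of each access in a block trace. The function name and the example trace are hypothetical, and the sketch does not attempt to reproduce the D3C miss classification or the proposed management policy.

```python
from typing import Hashable, List, Optional


def lru_stack_distances(trace: List[Hashable]) -> List[Optional[int]]:
    """Return the LRU stack distance of every access in `trace`.

    The distance is the number of distinct other blocks touched since the
    previous access to the same block. A first-time access has no distance
    and is reported as None.
    """
    stack: List[Hashable] = []  # most recently used block sits at the end
    distances: List[Optional[int]] = []
    for block in trace:
        if block in stack:
            pos = stack.index(block)
            # Blocks above `pos` were all touched after the last use of `block`.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)
        stack.append(block)  # `block` becomes the most recently used
    return distances


if __name__ == "__main__":
    print(lru_stack_distances(["A", "B", "C", "A", "B", "B"]))
```

An access with stack distance d hits in a fully associative LRU cache that holds more than d blocks, which is why a histogram of these distances is a common way to estimate how a process's miss count changes with its share of the cache.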
Analysis of Super Fine-Grained Program Phases
Dynamic reconfiguration systems guided by coarse-grained program phases have found success in improving overall program performance and energy efficiency. These performance and energy savings are limited by the granularity at which program phases are detected, since phases that occur at a finer granularity go undetected and reconfiguration opportunities are missed. In this study, we detect program phases using interval sizes on the order of tens, hundreds, and thousands of program cycles. This is in stark contrast with prior phase detection studies, where the interval size is on the order of several thousands to millions of cycles. The primary goal of this study is to begin to fill a gap in the literature on phase detection by characterizing super fine-grained program phases and demonstrating an application where detection of these relatively short-lived phases can be instrumental. Traditional models for phase detection, including basic block vectors and working set signatures, are used to detect super fine-grained phases, as well as a less traditional model based on microprocessor activity. Finally, we show an analytical case study where super fine-grained phases are applied to voltage and frequency scaling optimizations.
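A minimal sketch of interval-based phase detection with basic block vectors, in the spirit of the traditional model mentioned above. It assumes the trace is a list of executed basic block IDs with each entry weighted equally; the function names, the interval size and the threshold are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Dict, List


def bbv(interval: List[int]) -> Dict[int, float]:
    """Normalized basic block vector: share of the interval spent in each block."""
    counts = Counter(interval)
    total = len(interval)
    return {block: n / total for block, n in counts.items()}


def manhattan(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Manhattan distance between two sparse vectors (0.0 = identical, 2.0 = disjoint)."""
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def phase_changes(trace: List[int], interval_size: int, threshold: float) -> List[int]:
    """Indices of intervals whose vector differs from the previous interval's
    vector by more than `threshold`. A trailing partial interval is ignored."""
    vectors = [
        bbv(trace[i:i + interval_size])
        for i in range(0, len(trace) - interval_size + 1, interval_size)
    ]
    return [
        i for i in range(1, len(vectors))
        if manhattan(vectors[i - 1], vectors[i]) > threshold
    ]


if __name__ == "__main__":
    trace = [1, 2, 1, 2] * 2 + [3, 4, 3, 4] * 2
    print(phase_changes(trace, interval_size=4, threshold=0.5))
```

Shrinking `interval_size` is what moves such a detector toward finer-grained phases, at the cost of noisier vectors, since each interval then contains fewer samples.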
Efficient and scalable scheduling for performance heterogeneous multicore systems
Performance heterogeneous multicore processors (HMP for brevity), consisting of multiple cores with the same instruction set but different performance characteristics (e.g., clock speed, issue width), are of great interest since they are able to deliver higher performance per watt and area for programs with diverse architectural requirements than comparable homogeneous ones. However, such power and area efficiencies can only be achieved when workloads are matched with cores according to both the properties of the workload and the features of the cores. Several heterogeneity-aware schedulers were proposed in previous work. Depending on whether workload properties are obtained online or not, these scheduling algorithms can be categorized into two classes: online monitoring and offline profiling. The previous online monitoring approaches had to trace threads' execution on all core types, which is impractical as the number of core types grows. Besides, to trace all core types, threads have to be migrated among cores, which may cause load imbalance and degrade performance. The existing offline profiling approaches profile programs with a given input set before actually executing them and thus remove the overhead associated with the number of core types. However, offline profiling approaches do not account for phase changes of threads. Moreover, since the properties they collect are based on the given input set, these approaches adapt poorly to other input sets, which can drastically degrade program performance. To address these problems, we propose a new metric, ASTPI (Average Stall Time Per Instruction), to measure the efficiency of threads in using fast cores. We design, implement and evaluate a new online monitoring approach called ESHMP, which is based on this metric.
Our evaluation in the Linux 2.6.21 operating system shows that ESHMP delivers scalability while adapting to a wide variety of applications. Also, our experimental results show that, among HMP systems in which heterogeneity-aware schedulers are adopted and there is more than one LLC (Last Level Cache), the architecture where heterogeneous cores share LLCs gains better performance than the ones where homogeneous cores share LLCs.
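The abstract above names ASTPI but does not give its formula or the scheduling rule built on it. The sketch below therefore rests on two labeled assumptions: that ASTPI is computed as stall cycles divided by retired instructions over a sampling window, and that threads with the lowest ASTPI are the ones placed on fast cores. All names and sample values are hypothetical, and this is not the ESHMP implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ThreadSample:
    """Counters for one thread over one sampling window (hypothetical fields)."""
    name: str
    stall_cycles: int
    instructions: int


def astpi(sample: ThreadSample) -> float:
    """Assumed definition: average stall time (in cycles) per retired instruction."""
    return sample.stall_cycles / sample.instructions


def assign(samples: List[ThreadSample], fast_cores: int) -> Tuple[List[str], List[str]]:
    """Assumed policy: threads that stall least per instruction get the fast cores.

    Returns (threads on fast cores, threads on slow cores), each ordered by
    increasing ASTPI.
    """
    ranked = sorted(samples, key=astpi)
    fast = [s.name for s in ranked[:fast_cores]]
    slow = [s.name for s in ranked[fast_cores:]]
    return fast, slow


if __name__ == "__main__":
    window = [
        ThreadSample("a", stall_cycles=100, instructions=1000),
        ThreadSample("b", stall_cycles=800, instructions=1000),
        ThreadSample("c", stall_cycles=300, instructions=1000),
    ]
    print(assign(window, fast_cores=1))
```

The intuition behind the assumed policy is that a thread spending most of its time stalled, for example on memory, gains little from a faster pipeline, whereas a thread that rarely stalls can convert a higher clock speed or issue width directly into throughput.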
Power, Performance, and Energy Management of Heterogeneous Architectures
Modern many-core multiprocessor systems-on-chip offer tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks that make up the application change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to the large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations.
This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): mobile heterogeneous systems and the tile-based Intel Xeon Phi architecture.
For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at a basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of the instrumented applications, a vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. The data collected has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in heterogeneous architectures using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively. The second work, using the same data, presents a novel imitation learning framework for dynamically controlling the type, number, and frequencies of active cores to achieve an average of 109% PPW improvement compared to the default governors.
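To make the performance-per-watt (PPW) comparisons above concrete, the following sketch shows how a percentage improvement over a baseline governor can be computed. The function names and all numbers are hypothetical and are not measurements from the thesis.

```python
def ppw(performance: float, power_watts: float) -> float:
    """Performance per watt, e.g. instructions per second divided by average power."""
    return performance / power_watts


def improvement_pct(baseline_ppw: float, new_ppw: float) -> float:
    """Relative PPW improvement over the baseline, in percent."""
    return (new_ppw / baseline_ppw - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical measurements for one application under two policies.
    baseline = ppw(performance=100.0, power_watts=4.0)  # 25.0
    tuned = ppw(performance=150.0, power_watts=3.0)     # 50.0
    print(improvement_pct(baseline, tuned))
```

Under this definition an improvement of 100% means the PPW has doubled, so a reported average of 109% corresponds to slightly more than twice the baseline PPW.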
This work also presents how to accurately profile the tile-based Intel Xeon Phi architecture while training different types of neural networks using an open image dataset on a deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect the performance of the Xeon Phi.
Dissertation/Thesis, Masters Thesis, Engineering 201