Operating system support for overlapping-ISA heterogeneous multi-core architectures
A heterogeneous processor consists of cores that are asymmetric in performance and functionality. Such a design provides a cost-effective solution for processor manufacturers to continuously improve both single-thread performance and multi-thread throughput. This design, however, faces significant challenges in the operating system, which traditionally assumes only homogeneous hardware. This paper presents a comprehensive study of OS support for heterogeneous architectures in which cores have asymmetric performance and overlapping, but non-identical instruction sets. Our algorithms allow applications to transparently execute on and fairly share different types of cores. We have implemented these algorithms in the Linux 2.6.24 kernel and evaluated them on an actual heterogeneous platform. Evaluation results demonstrate that our designs efficiently manage heterogeneous hardware and enable significant performance improvements for a range of applications.
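The scheduling problem the abstract describes can be illustrated with a small sketch. The feature names and the fault-and-migrate policy below are illustrative assumptions, not the paper's actual kernel implementation: a thread may only run on a core whose instruction set covers every feature the thread uses, and a thread that faults on an unsupported instruction is migrated to a core that supports it.

```python
# Hypothetical ISA feature sets; the names are illustrative, not from the paper.
SMALL_CORE = {"base"}                   # reduced instruction set
BIG_CORE = {"base", "simd", "fpu"}      # full instruction set

def core_can_run(core_feats, thread_feats):
    """A core can run a thread only if it supports every feature the thread uses."""
    return thread_feats <= core_feats

def pick_core(cores, thread_feats):
    """Return the index of the first core whose ISA covers the thread's features."""
    for i, feats in enumerate(cores):
        if core_can_run(feats, thread_feats):
            return i
    return None  # no core supports this feature mix

cores = [SMALL_CORE, BIG_CORE]
thread = {"base"}
print(pick_core(cores, thread))       # 0: starts on the small core
thread = thread | {"simd"}            # thread faults on a SIMD instruction
print(pick_core(cores, thread))       # 1: fault-and-migrate to the big core
```

A real kernel would also balance load fairly across core types, which is the harder part of the paper's contribution; this sketch only shows the correctness constraint.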
Doctor of Philosophy dissertation
With the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors.
Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and the application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache-topology aware, which, in turn, improves the performance and scalability of many-threaded applications.
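One common mechanism for the kind of OS/application cache segregation the second proposal describes is way partitioning. The sketch below is an assumption about one way such segregation could work, not the dissertation's actual design: a fixed number of cache ways is reserved for OS lines, and application accesses may only allocate into the remaining ways.

```python
def allowed_ways(total_ways, os_ways, is_os_access):
    """Return the set of cache ways an access may allocate into.

    Reserving ways for the OS means OS and application lines can never
    evict each other, trading some capacity for reduced interference.
    """
    if is_os_access:
        return set(range(os_ways))           # OS confined to reserved ways
    return set(range(os_ways, total_ways))   # application uses the rest

print(sorted(allowed_ways(8, 2, True)))    # [0, 1]
print(sorted(allowed_ways(8, 2, False)))   # [2, 3, 4, 5, 6, 7]
```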
The dynamic speculation and performance prediction of parallel loops
General-purpose computer systems have seen increased performance potential through the parallel processing capabilities of multicore processors. Yet this potential performance can only be attained through parallel applications, forcing software developers to rethink how everyday applications are designed. The most readily available form of Thread Level Parallelism (TLP) in any program is the loop. Unfortunately, the majority of loops cannot be easily multithreaded due to inter-iteration dependencies, conditional statements, nested functions, and dynamic memory allocation. This dissertation seeks to understand the fundamental characteristics and relationships of loops in order to assist programmers and compilers in exploiting TLP.
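The inter-iteration dependencies mentioned above can be seen in even a tiny loop. This example is illustrative only: each iteration of a prefix-sum loop reads a value produced by the previous iteration, so the iterations cannot simply be distributed across threads.

```python
def prefix_sums(xs):
    """Running totals of xs. Iteration i reads `acc` written by iteration
    i - 1 (a loop-carried dependency), so naive parallelization is unsafe."""
    out = []
    acc = 0
    for x in xs:
        acc += x
        out.append(acc)
    return out

print(prefix_sums([1, 2, 3, 4]))  # [1, 3, 6, 10]
```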
First, this dissertation explores a hardware solution that exploits TLP through Dynamic Speculative Multithreading (D-SpMT), which can extract multiple threads from a sequential program without compiler support or instruction set extensions. This dissertation presents Cascadia, a D-SpMT multicore architecture that provides multi-grain thread-level support. Cascadia applies a unique sustainable IPC (sIPC) metric to a comprehensive loop tree to select the best performing nested loop level to multithread. Results showed that Cascadia can extract large amounts of TLP but ultimately yielded only moderate performance gains. The lack of overall performance gains was due to the sequential nature of the applications, rather than to Cascadia's ability to perform D-SpMT.
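The loop-level selection step can be sketched as a tree walk. The tree shape and the sIPC numbers below are invented for illustration, and the metric is treated as an opaque per-loop score rather than Cascadia's actual computation: the selector simply returns the nesting level with the highest estimated sIPC.

```python
class Loop:
    """A node in a loop tree: a loop, its estimated sIPC, and its nested loops."""
    def __init__(self, name, sipc, children=()):
        self.name, self.sipc, self.children = name, sipc, list(children)

def best_level(loop):
    """Walk the loop tree and return the loop with the highest estimated sIPC."""
    best = loop
    for child in loop.children:
        cand = best_level(child)
        if cand.sipc > best.sipc:
            best = cand
    return best

# Hypothetical loop nest: the middle level offers the best sustained throughput.
tree = Loop("outer", 1.2, [Loop("middle", 2.5, [Loop("inner", 1.8)])])
print(best_level(tree).name)  # middle
```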
In order to fully exploit TLP through loops, some loop-level analysis and transformation must first be performed. Therefore, the second contribution of this dissertation is the development of several theoretical methodologies to aid programmers and auto-tuners in parallelizing loops. This work found that inter-iteration dependencies have a two-fold effect on a loop's parallel performance. First, performance is primarily affected by a single, dominant dependency, and it is the execution of this dominant dependency path that directly determines the parallel performance of the loop. Any additional dependencies cause a secondary effect that may increase the execution time due to relative differences between dependency paths. Furthermore, this study analyzes the effects of non-ideal conditions, such as a limited number of processors, multithreading overhead, and irregular loop structures.
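The primary effect of the dominant dependency can be captured by a simple work/span-style bound. This model is an assumption for illustration, not the dissertation's exact formulation: the dominant dependency chain bounds the critical path, while total work divided by processor count bounds throughput, and whichever is larger limits the loop.

```python
def parallel_time(n_iters, iter_work, dep_chain, procs, overhead=0.0):
    """Estimated parallel execution time of a loop (illustrative model).

    n_iters   -- number of loop iterations
    iter_work -- work per iteration (arbitrary time units)
    dep_chain -- time to execute the dominant dependency path
    procs     -- number of processors
    overhead  -- per-processor multithreading overhead (non-ideal condition)
    """
    total_work = n_iters * iter_work
    return max(dep_chain, total_work / procs) + overhead * procs

# With a 40-unit dominant dependency chain, adding processors beyond the
# point where work/procs drops below 40 no longer helps.
print(parallel_time(n_iters=100, iter_work=1.0, dep_chain=40.0, procs=2))  # 50.0
print(parallel_time(n_iters=100, iter_work=1.0, dep_chain=40.0, procs=4))  # 40.0
print(parallel_time(n_iters=100, iter_work=1.0, dep_chain=40.0, procs=8))  # 40.0
```

This matches the abstract's finding qualitatively: once the dominant dependency path dominates, it alone determines the loop's parallel performance.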