Bottleneck identification and scheduling in multithreaded applications
Performance of multithreaded applications is limited by a variety of bottlenecks, e.g. critical sections, barriers and slow pipeline stages. These bottlenecks serialize execution, waste valuable execution cycles, and limit scalability of applications. This paper proposes Bottleneck Identification and Scheduling (BIS), a cooperative software-hardware mechanism to identify and accelerate the most critical bottlenecks. BIS identifies which bottlenecks are likely to reduce performance by measuring the number of cycles threads have to wait for each bottleneck, and accelerates those bottlenecks using one or more fast cores on an Asymmetric Chip MultiProcessor (ACMP). Unlike previous work that targets specific bottlenecks, BIS can identify and accelerate bottlenecks regardless of their type. We compare BIS to four previous approaches and show that it outperforms the best of them by 15% on average. BIS' performance improvement increases as the number of cores and the number of fast cores in the system increase.
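The selection step described in the abstract above (accumulate the cycles threads spend waiting on each bottleneck, then accelerate the worst ones) can be illustrated with a minimal software sketch. This is not the BIS hardware/software mechanism itself: the function name `rank_bottlenecks`, the event format, and the sample numbers are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def rank_bottlenecks(wait_events, num_fast_cores=1):
    """Pick the bottlenecks with the most accumulated thread waiting cycles.

    wait_events: iterable of (bottleneck_id, waiting_cycles) samples
                 (hypothetical format, assumed for this sketch).
    num_fast_cores: how many bottlenecks can be accelerated at once.
    """
    total_wait = defaultdict(int)
    for bottleneck_id, cycles in wait_events:
        # Sum waiting cycles per bottleneck, regardless of its type.
        total_wait[bottleneck_id] += cycles

    # Most waited-on bottlenecks first.
    ranked = sorted(total_wait, key=lambda b: total_wait[b], reverse=True)
    return ranked[:num_fast_cores], dict(total_wait)


# Made-up samples: a lock, a barrier, and a slow pipeline stage.
events = [("lock_A", 120), ("barrier_1", 300), ("lock_A", 250), ("stage_2", 90)]
selected, totals = rank_bottlenecks(events, num_fast_cores=1)
print(selected, totals)
```

With these made-up samples, `lock_A` accumulates the most waiting cycles and would be the candidate for the single fast core.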
Feedback-driven threading: power-efficient and high-performance execution of multithreaded workloads on CMPs
Extracting high performance from the emerging Chip Multiprocessors (CMPs) requires that the application be divided into multiple threads. Each thread executes on a separate core, thereby increasing concurrency and improving performance. As the number of cores on a CMP continues to increase, the performance of some multi-threaded applications will benefit from the increased number of threads, whereas the performance of other multi-threaded applications will become limited by data-synchronization and off-chip bandwidth. For applications that get limited by data-synchronization, increasing the number of threads significantly degrades performance and increases on-chip power. Similarly, for applications that get limited by off-chip bandwidth, increasing the number of threads increases on-chip power without providing any performance improvement. Furthermore, whether an applicatio
2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set
Static compilers use profiling to predict run-time program behavior. A program can behave differently with different input data sets. Aggressively optimized code for one input set can hurt performance when the code is run using a different input set. Hence, compilers need to profile using multiple input sets that can represent a wide range of program behavior. However, using multiple input sets for profiling is expensive in terms of resources and profiling time. To eliminate the need to profile with multiple input sets, this paper proposes a novel profiling mechanism called 2D-profiling. A compiler that uses 2D-profiling profiles with only one input set and predicts whether the result of the profile will remain similar across input sets or change significantly. 2D-profiling measures time-varying phase characteristics of the program during the profiling run, in addition to the average program characteristics over the whole profiling run. Using the profile-based phase information, 2D-profiling predicts if a program property is input-dependent. We develop a 2D-profiling algorithm to predict whether or not the prediction accuracy of a branch instruction will remain similar when the input set of a program is changed. The key insight behind our algorithm is that if the prediction accuracy of a branch changes significantly over time during the profiling run with a single input data set, then the prediction accuracy of that branch is more likely dependent on the input data set. We evaluate the 2D-profiling mechanism with the SPEC INT 2000 benchmarks. Our analysis shows that 2D-profiling can identify most of the input-dependent branches accurately.
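The key insight in the abstract above (a branch whose prediction accuracy varies strongly over time within one profiling run is more likely input-dependent) can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the slice size, the standard-deviation test, the threshold value, and the function names are assumptions made for this example.

```python
from statistics import pstdev


def slice_accuracies(correct_flags, slice_size):
    """Prediction accuracy of one branch in each fixed-size time slice.

    correct_flags: list of 1 (predicted correctly) or 0 (mispredicted),
                   in execution order. A trailing partial slice is ignored.
    """
    accs = []
    for start in range(0, len(correct_flags) - slice_size + 1, slice_size):
        window = correct_flags[start:start + slice_size]
        accs.append(sum(window) / slice_size)
    return accs


def looks_input_dependent(correct_flags, slice_size=4, std_threshold=0.1):
    """Flag a branch whose per-slice accuracy varies a lot over time.

    std_threshold is a hypothetical cutoff chosen for this sketch.
    """
    accs = slice_accuracies(correct_flags, slice_size)
    if len(accs) < 2:
        return False  # not enough slices to observe any time variation
    return pstdev(accs) > std_threshold


# Accuracy is 0.75 in every slice: steady behavior.
stable = [1, 1, 1, 0] * 4
# Accuracy drops from 1.0 to 0.25 halfway through the run: phase change.
phased = [1] * 8 + [0, 1, 0, 0] * 2

print(looks_input_dependent(stable), looks_input_dependent(phased))
```

Both branches could have similar average accuracy over a longer run; only the time-varying view separates them, which is the second "dimension" the abstract refers to.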
Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs
Asymmetric Chip Multiprocessors (ACMPs) are becoming a reality. ACMPs can speed up parallel applications if they can identify and accelerate code segments that are critical for performance. Proposals already exist for using coarse-grained thread scheduling and fine-grained bottleneck acceleration. Unfortunately, there have been no proposals offered thus far to decide which code segments to accelerate in cases where both coarse-grained thread scheduling and fine-grained bottleneck acceleration could have value. This paper proposes Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs (UBA), a cooperative software/hardware mechanism for identifying and accelerating the most likely critical code segments from a set of multithreaded applications running on an ACMP. The key idea is a new Utility of Acceleration metric that quantifies the performance benefit of accelerating a bottleneck or a thread by taking into account both the criticality and the expected speedup. UBA outperforms the best of two state-of-the-art mechanisms by 11% for single-application workloads and by 7% for two-application workloads on an ACMP with 52 small cores and 3 large cores.
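The abstract above does not give the formula for the Utility of Acceleration metric, only that it combines criticality with expected speedup. The sketch below therefore uses an assumed, Amdahl-style stand-in (fraction of execution time on the critical path, times the share of that time a large core would remove) to show how a single metric lets threads and bottlenecks compete for the same large cores. The function names, the formula, and the candidate numbers are all hypothetical.

```python
def utility(critical_fraction, speedup):
    """Assumed stand-in metric: estimated fraction of run time saved.

    critical_fraction: share of execution time the segment spends on the
                       critical path (0.0 to 1.0).
    speedup: expected speedup of the segment on a large core (> 1.0).
    """
    return critical_fraction * (1.0 - 1.0 / speedup)


def pick_for_large_cores(candidates, num_large_cores):
    """Rank threads and bottlenecks together; keep the top num_large_cores.

    candidates: list of (name, critical_fraction, speedup) tuples.
    """
    ranked = sorted(candidates, key=lambda c: utility(c[1], c[2]), reverse=True)
    return [name for name, _, _ in ranked[:num_large_cores]]


# Made-up candidates: one lagging thread and two fine-grained bottlenecks.
candidates = [
    ("lagging_thread", 0.40, 1.5),
    ("critical_section", 0.25, 3.0),
    ("pipeline_stage", 0.10, 2.0),
]
chosen = pick_for_large_cores(candidates, num_large_cores=2)
print(chosen)
```

In this made-up case the critical section ranks above the lagging thread even though it is less critical, because its expected speedup is much higher; that trade-off is what a combined metric is meant to capture.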