258 research outputs found
Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Resource sharing is a critical issue in simultaneous multithreading (SMT) processors, as threads running simultaneously on an SMT core compete for shared resources. Symbiotic job scheduling, which co-schedules applications with complementary resource demands, is an effective solution to maximize hardware utilization and improve overall system performance. However, symbiotic job scheduling typically distributes threads evenly among cores, i.e., all cores are assigned the same number of threads, which we find leads to sub-optimal performance. In this paper, we show that asymmetric schedules (i.e., schedules that assign a different number of threads to each SMT core) can significantly improve performance compared to symmetric schedules. To leverage this finding, we propose thread isolation, a technique that turns symmetric schedules into asymmetric ones, yielding higher overall system performance. Thread isolation identifies SMT-adverse applications and schedules them in isolation on a dedicated core to mitigate their sharp performance degradation under SMT. Our experimental results on an IBM POWER8 processor show that thread isolation improves system throughput by up to 5.5 percent compared to a state-of-the-art symmetric symbiotic job scheduler.
Josue Feliu has been partially supported through a postdoctoral fellowship by the Generalitat Valenciana (APOSTD/2017/052). Additional support has been provided by the Ministerio de Ciencia, Innovación y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, as well as by the Universitat Politècnica de València through the "Ayudas a Primeros Proyectos de Investigación" (PAID-06-18) under grant SP20180140. Lieven Eeckhout's research program is supported through FWO grants no. G.0434.16N and G.0144.17N, and the European Research Council (ERC) Advanced Grant agreement no. 741097.
Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2020). Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors. IEEE Transactions on Parallel and Distributed Systems. 31(2):359-373. https://doi.org/10.1109/TPDS.2019.2934955
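The thread isolation idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `smt_slowdown` estimates, the `threshold` value, and the round-robin placement of the remaining applications are all assumptions made for illustration, since the abstract does not say how SMT-adverse applications are identified.

```python
def isolate_smt_adverse(apps, smt_slowdown, n_cores, threshold=2.0):
    """Build a per-core schedule, isolating the most SMT-adverse application.

    apps:         list of application names
    smt_slowdown: dict name -> estimated slowdown under SMT (hypothetical metric)
    threshold:    slowdown above which an app counts as SMT-adverse (assumed value)
    Returns a list with one list of application names per core.
    """
    cores = [[] for _ in range(n_cores)]
    adverse = [a for a in apps if smt_slowdown[a] >= threshold]

    if adverse and n_cores > 1:
        # Asymmetric schedule: the worst offender gets core 0 to itself.
        worst = max(adverse, key=lambda a: smt_slowdown[a])
        cores[0].append(worst)
        rest = [a for a in apps if a != worst]
        shared = cores[1:]
    else:
        # No SMT-adverse application: fall back to a symmetric schedule.
        rest = list(apps)
        shared = cores

    # Spread the remaining applications round-robin over the shared cores.
    for i, app in enumerate(rest):
        shared[i % len(shared)].append(app)
    return cores
```

With six applications on three cores and one application above the threshold, the sketch produces one core running a single thread and two cores sharing the other five, which is the kind of asymmetric schedule the abstract refers to.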
Cache-Hierarchy contention-aware scheduling in CMPs
© 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
To improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to worsen in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches, increase with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., shared L2) than to main memory contention. In this paper, we propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy. The strategy selects the processes to be coscheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, performance improvements of 5.38 and 6.64 percent when compared with the Linux scheduler, while this improvement is 3.61 percent for a state-of-the-art memory contention-aware scheduler under the evaluated mixes.
This work was supported by the Spanish MINECO under Grant TIN2012-38341-C04-01, and by the Universitat Politècnica de València under Grant PAID-05-12 SP20120748.
Feliu Pérez, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2014). Cache-Hierarchy contention-aware scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems. 25(3):581-590. https://doi.org/10.1109/TPDS.2013.61
Addressing fairness in SMT multicores with a progress-aware scheduler
© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last level cache) and DRAM modules. Per-process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different paces. If schedulers are not progress aware, the unpredictable execution time caused by unfairness can introduce undesirable behaviors in the system, such as difficulty maintaining priority-based scheduling.
This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute per-process progress. Then, the processes with the least progress are prioritized to enhance fairness.
Experimental results on an Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over the Linux OS. Moreover, thanks to the thread-to-core allocation policy, the scheduler slightly improves throughput and turnaround time.
This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.
Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48
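The progress-aware prioritization described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the published scheduler: progress is modelled here as the accumulated ratio of measured IPC to estimated standalone IPC per quantum, and unfairness as the ratio between the largest and smallest progress. The function names and both definitions are hypothetical.

```python
def update_progress(progress, measured_ipc, standalone_ipc):
    """Accumulate per-process progress for the quantum that just finished.

    measured_ipc:   dict pid -> IPC observed while co-scheduled
    standalone_ipc: dict pid -> estimated IPC when running with low contention
    A process running at its standalone speed earns 1.0 per quantum (assumed model).
    """
    for pid, ipc in measured_ipc.items():
        progress[pid] = progress.get(pid, 0.0) + ipc / standalone_ipc[pid]
    return progress


def pick_next(progress, slots):
    """Select the `slots` processes with the least accumulated progress."""
    # Ties are broken by pid so the result is deterministic.
    return sorted(progress, key=lambda pid: (progress[pid], pid))[:slots]


def unfairness(progress):
    """Ratio between the most and least advanced process (assumed metric)."""
    return max(progress.values()) / min(progress.values())
```

A scheduler loop would call `update_progress` at the end of each quantum and `pick_next` to choose which processes occupy the hardware threads in the next one.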
Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores
Nowadays, high-performance multicore processors implement
multithreading capabilities. The processes running concurrently on these
processors are continuously competing for the shared resources, not only among
cores, but also within the core. While resource sharing increases the resource
utilization, the interference among processes accessing the shared resources can
strongly affect the performance of individual processes and its predictability. In this
scenario, process scheduling plays a key role to deal with performance and
fairness. In this work we present a process scheduler for SMT multicores that
simultaneously addresses both performance and fairness. This is a major design
issue since scheduling for only one of the two targets tends to damage the other.
To address performance, the scheduler tackles bandwidth contention at the L1
cache and main memory. To deal with fairness, the scheduler estimates the
progress experienced by the processes, and gives priority to the processes with
lower accumulated progress. Experimental results on an Intel Xeon E5645
featuring six dual-threaded SMT cores show that the proposed scheduler improves
both performance and fairness over two state-of-the-art schedulers and the Linux
OS scheduler. Compared to Linux, unfairness is halved while still
improving performance by 5.6 percent.
We thank the anonymous reviewers for their constructive and insightful feedback. This work was supported in part by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246-EXP, and by the Intel Early Career Faculty Honor Program Award.
Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2017). Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers. 66(5):905-911. https://doi.org/10.1109/TC.2016.2620977
L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Improving the utilization of shared resources is a
key issue to increase performance in SMT processors. Recent
work has focused on resource sharing policies to enhance the
processor performance, but their proposals mainly concentrate on
novel hardware mechanisms that adapt to the dynamic resource
requirements of the running threads.
This work addresses the L1 cache bandwidth problem in SMT
processors experimentally on real hardware. Unlike previous
work, this paper concentrates on thread allocation, by selecting
the proper pair of co-runners to be launched to the same
core. The relation between L1 bandwidth requirements of each
benchmark and its performance (IPC) is analyzed. We found that
for individual benchmarks, performance is strongly connected to
L1 bandwidth consumption, and this observation remains valid
when several co-runners are launched to the same SMT core.
Based on these findings, we propose two L1-bandwidth-aware
thread-to-core (t2c) allocation policies, namely Static and
Dynamic t2c allocation. The aim of these policies is
to properly balance L1 bandwidth requirements of the running
threads among the processor cores. Experiments on a Xeon E5645
processor show that the proposed policies significantly improve
on the performance of the Linux OS kernel regardless of the number
of cores considered.
This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01; and by Programa de Apoyo a la Investigación y Desarrollo (PAID-05-12) of the Universitat Politècnica de València under Grant SP20120748.
Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. IEEE. https://doi.org/10.1109/PACT.2013.6618810
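The goal of balancing L1 bandwidth requirements among cores, stated in this abstract, can be sketched with a simple greedy heuristic. This is an assumption for illustration only; the abstract does not describe how the Static and Dynamic policies are implemented. The function name, the `l1_bw` input (per-thread L1 bandwidth, in any consistent unit), and the largest-first placement rule are all hypothetical.

```python
def allocate_threads(l1_bw, n_cores, threads_per_core=2):
    """Greedy thread-to-core allocation that balances L1 bandwidth per core.

    l1_bw: dict thread name -> measured L1 bandwidth (hypothetical input)
    Returns a list with one list of thread names per core.
    """
    if len(l1_bw) > n_cores * threads_per_core:
        raise ValueError("more threads than hardware contexts")

    cores = [[] for _ in range(n_cores)]
    load = [0.0] * n_cores

    # Place the most bandwidth-hungry threads first (ties broken by name).
    for thread in sorted(l1_bw, key=lambda t: (-l1_bw[t], t)):
        # Only cores with a free hardware context are candidates.
        free = [c for c in range(n_cores) if len(cores[c]) < threads_per_core]
        # Choose the candidate core with the lowest accumulated L1 bandwidth.
        target = min(free, key=lambda c: (load[c], c))
        cores[target].append(thread)
        load[target] += l1_bw[thread]
    return cores
```

On a dual-threaded core configuration this tends to pair a bandwidth-hungry thread with a light one, so the summed L1 demand per core stays close to the average.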
Improving IBM POWER8 Performance Through Symbiotic Job Scheduling
Symbiotic job scheduling, i.e., scheduling applications that co-run well together on a core, can have a considerable impact on the performance of processors with simultaneous multithreading (SMT) cores. SMT cores share most of their microarchitectural components among the co-running applications, which causes performance interference between them. Therefore, scheduling applications with complementary resource requirements on the same core can greatly improve the throughput of the system. This paper enhances symbiotic job scheduling for the IBM POWER8 processor. We leverage the existing cycle accounting mechanism to build an interference model that predicts symbiosis between applications. The proposed models achieve higher accuracy than previous models by predicting job symbiosis from throttled CPI stacks, i.e., CPI stacks of the applications when running in the same SMT mode to consider the statically partitioned resources, but without interference from other applications. The symbiotic scheduler uses these interference models to decide, at run-time, which applications should run on the same core or on separate cores. We prototype the symbiotic scheduler as a user-level scheduler in the Linux operating system and evaluate it on an IBM POWER8 server running multiprogram workloads. The symbiotic job scheduler significantly improves performance compared to both an agnostic random scheduler and the default Linux scheduler. Across all evaluated workloads in SMT4 mode, throughput improves by 12.4 and 5.1 percent on average over the random and Linux schedulers, respectively.
This work was supported in part by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246-EXP, as well as by the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement No. 259295.
Feliu-Pérez, J.; Eyerman, S.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2017). Improving IBM POWER8 Performance Through Symbiotic Job Scheduling. IEEE Transactions on Parallel and Distributed Systems. 28(10):2838-2851. https://doi.org/10.1109/TPDS.2017.2691708
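The decision step of a symbiotic scheduler, choosing which applications share a core given predicted interference, can be sketched as a small search. This is a toy illustration under assumptions: it considers two applications per core rather than SMT4, and it takes the predictions from a hypothetical `slowdown` table instead of the CPI-stack-based interference model the abstract describes.

```python
def best_pairing(apps, slowdown):
    """Exhaustively find the pairing with the lowest total predicted slowdown.

    apps:     list with an even number of application names
    slowdown: dict frozenset({a, b}) -> predicted combined slowdown when a and b
              share a core (hypothetical stand-in for an interference model)
    Returns (total_cost, list_of_pairs). Only practical for small workloads.
    """
    if len(apps) % 2:
        raise ValueError("expected an even number of applications")
    if not apps:
        return 0.0, []

    first, rest = apps[0], apps[1:]
    best_cost, best_pairs = float("inf"), []
    for i, partner in enumerate(rest):
        # Pair `first` with `partner`, then solve the smaller problem.
        remaining = rest[:i] + rest[i + 1:]
        cost, pairs = best_pairing(remaining, slowdown)
        cost += slowdown[frozenset((first, partner))]
        if cost < best_cost:
            best_cost, best_pairs = cost, [(first, partner)] + pairs
    return best_cost, best_pairs
```

The number of pairings grows quickly with the workload size, so a run-time scheduler would likely need a cheaper matching method; the exhaustive search is used here only to keep the example short.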
Addressing bandwidth contention in SMT multicores through scheduling
© Owner/Author 2014. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '14 Proceedings of the 28th ACM international conference on Supercomputing; http://dx.doi.org/10.1145/2597652.2600109
To mitigate the impact of bandwidth contention, which in some processes can lead to performance degradation of up to 40%, we devise a scheduling algorithm that tackles main memory and L1 bandwidth contention. Experimental evaluation on a real system shows that the proposal achieves an average speedup of 5% with respect to Linux.
This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.
Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2014). Addressing bandwidth contention in SMT multicores through scheduling. ACM. https://doi.org/10.1145/2597652.2600109
Understanding cache hierarchy contention in CMPs to improve job scheduling
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
In order to improve CMP performance, recent research has focused on scheduling to mitigate contention produced by the limited memory bandwidth. Nowadays, commercial CMPs implement multi-level cache hierarchies where last level caches are shared by at least two cache structures located at the immediately lower cache level. In turn, these caches can be shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to worsen in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches, increase with each microprocessor generation. In this paper we characterize the impact on performance of the different contention points that appear along the memory subsystem. Then, we propose a generic scheduling strategy for CMPs that takes into account the available bandwidth at each level of the cache hierarchy. The proposed strategy selects the processes to be co-scheduled and allocates them to cores in order to minimize contention effects. The proposal has been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. Although the potential for contention on this platform is lower than in more recent processor designs, the proposal reaches performance improvements of up to 9% over the Linux scheduler, while these benefits (across the studied benchmark mixes) are always lower than 6% for a memory-aware scheduler that does not take into account the cache hierarchy. Moreover, in some cases the proposal doubles the speedup achieved by the memory-aware scheduler.
This work was supported by the Spanish MICINN, Consolider Programme and Plan E funds, as well as European Commission FEDER funds, under Grants CSD2006-00046 and TIN2009-14475-C04-01.
Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2012). Understanding cache hierarchy contention in CMPs to improve job scheduling. IEEE. https://doi.org/10.1109/IPDPS.2012.54
Bandwidth-Aware On-Line Scheduling in SMT Multicores
© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
The memory hierarchy plays a critical role in the performance of current chip multiprocessors. Main memory is shared by all the running processes, which can cause significant bandwidth contention. In addition, when the processor implements SMT cores, the L1 bandwidth becomes shared among the threads running on each core. In such a case, bandwidth-aware schedulers emerge as an interesting approach to mitigate the contention. This work investigates the performance degradation that processes suffer due to memory bandwidth constraints. Experiments show that main memory and L1 bandwidth contention negatively impact process performance; in both cases, performance degradation can reach up to 40 percent for some applications. To deal with contention, we devise a scheduling algorithm that consists of two policies guided by the bandwidth consumption gathered at runtime. The process selection policy balances the number of memory requests over the execution time to address main memory bandwidth contention. The process allocation policy tackles L1 bandwidth contention by balancing the L1 accesses among the L1 caches. The proposal is evaluated on a Xeon E5645 platform using a wide set of multiprogrammed workloads, achieving performance benefits of up to 6.7 percent with respect to the Linux scheduler.
This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.
Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2016). Bandwidth-Aware On-Line Scheduling in SMT Multicores. IEEE Transactions on Computers. 65(2):422-434. https://doi.org/10.1109/TC.2015.2428694
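The process selection policy in this abstract balances memory requests over the execution time. One way to read that, offered here only as an assumption, is to choose in each quantum a set of processes whose combined main memory bandwidth is close to the long-run average. The sketch below follows that reading; the function name, the `mem_bw` input, and the closest-fit rule are hypothetical and not taken from the paper.

```python
def select_processes(mem_bw, slots):
    """Choose `slots` processes whose total memory bandwidth is near the average.

    mem_bw: dict pid -> main memory bandwidth measured at runtime (hypothetical)
    slots:  number of hardware threads available in the next quantum
    """
    if slots >= len(mem_bw):
        return sorted(mem_bw)

    # Bandwidth the quantum should carry if the demand were spread evenly.
    budget = sum(mem_bw.values()) * slots / len(mem_bw)
    remaining = dict(mem_bw)
    chosen = []

    for left in range(slots, 0, -1):
        # Ideal bandwidth for each slot that is still empty.
        ideal = budget / left
        # Closest fit to the ideal, with ties broken by pid.
        pick = min(remaining, key=lambda p: (abs(remaining[p] - ideal), p))
        chosen.append(pick)
        budget -= remaining.pop(pick)
    return chosen
```

The selected processes could then be handed to an allocation step that balances L1 accesses among cores, mirroring the two-policy structure the abstract describes.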
A Workload Generator for Evaluating SMT Real-Time Systems
Real-time tasks have experienced a significant increase in complexity in recent years. Examples can be found in today's systems that control self-driving cars or multimedia systems, among others. To cope with the high performance requirements of such systems, real-time systems are moving from simple in-order processors to complex out-of-order multicore processors. Furthermore, we expect real-time systems to use simultaneous multithreading (SMT) processors in the near future, since these architectures address two key design concerns of embedded systems: they provide higher performance and better power efficiency than single-threaded multicores.
The main drawback of multicores and SMT architectures from a real-time perspective is that they implement shared resources. Single-threaded multicores usually share the main memory and the LLC, and SMT processors additionally share most of the microarchitectural core resources. Processes running concurrently can interfere in the shared resources, which increases performance variability and reduces the predictability of these systems. We expect an increasing effort in the coming years to mitigate these drawbacks and implement real-time systems with multicore SMT processors.
Workload generation is a tedious and time-consuming task in the real-time research field because workloads have many parameters that must be correctly adjusted to provide flexible and representative workloads. Commonly used workload generators, however, fail when designing workloads for these architectures because they are not aware of the architectural characteristics of SMT systems. In this paper we present the task class-based (TCB) workload generator, aimed at providing workloads to evaluate real-time systems with SMT multicore processors in an easy and automated way.
Furió Novejarque, C.; Feliu-Pérez, J.; Petit Martí, SV.; Duro-Gómez, J.; Sahuquillo Borrás, J. (2018). A Workload Generator for Evaluating SMT Real-Time Systems. IEEE Computer Society. 367-374. doi:10.1109/HPCS.2018.00067