Efficient resources assignment schemes for clustered multithreaded processors
New feature sizes provide a larger number of transistors per chip that architects could use to further exploit instruction-level parallelism. However, these technologies also bring new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction-level parallelism is reaching diminishing returns, so other sources of parallelism, such as thread-level parallelism, must be exploited to keep raising performance with reasonable hardware complexity. On the other hand, clustered architectures have been widely studied as a way to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand why conventional SMT resource assignment schemes are less effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that achieves an average speedup of 17.6% over Icount while improving fairness by 24%.
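The Icount baseline mentioned in this abstract is a classic SMT fetch policy: each cycle, fetch priority goes to the thread with the fewest instructions in the front-end and issue stages. A minimal sketch of that baseline (the function name and data layout are ours, for illustration only, not the paper's mechanism):

```python
# Sketch of an Icount-style fetch decision: prefer the thread with the
# fewest in-flight instructions in the decode/rename/issue stages, which
# naturally throttles stalled threads and favors fast-moving ones.

def icount_pick(inflight_counts):
    """Return the thread id with the fewest in-flight instructions.

    inflight_counts: dict mapping thread id -> number of instructions
    currently occupying the front-end and issue queues for that thread.
    """
    return min(inflight_counts, key=inflight_counts.get)

# Thread 2 has stalled (e.g. on a cache miss) and accumulated 40
# in-flight instructions; thread 1 has the fewest, so it is fetched.
counts = {0: 12, 1: 5, 2: 40}
print(icount_pick(counts))  # -> 1
```

In a clustered SMT design, such a global count ignores which cluster the instructions occupy, which is part of why the paper finds conventional schemes less effective there.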
An Overview of Approaches Towards the Timing Analysability of Parallel Architecture
In order to meet performance, low-energy, and integration requirements, parallel architectures (multithreaded cores and multicores) are increasingly considered in the design of embedded systems running critical software. The objective is to run several applications concurrently. When applications have strict real-time constraints, two questions arise: (a) how can the worst-case execution time (WCET) of each application be computed while concurrent applications might interfere? (b) how can the tasks be scheduled so that they are guaranteed to meet their deadlines? The second question has received much attention for several years [CFHS04, DaBu11]. Proposed schemes generally assume that the first question has been solved and, in addition, that scheduling decisions do not impact the WCETs. In fact, the first question is far from being answered, even though several approaches have been proposed in the literature. In this paper, we present an overview of these approaches from the point of view of static WCET analysis techniques.
Dynamically controlled resource allocation in SMT processors
SMT processors increase performance by executing instructions from several threads simultaneously. These threads use the resources of the processor better by sharing them but, at the same time, threads are competing for these resources. The way critical resources are distributed among threads determines the final performance. Currently, processor resources are distributed among threads as determined by the fetch policy that decides which threads enter the processor to compete for resources. However, current fetch policies only use indirect indicators of resource usage in their decision, which can lead to resource monopolization by a single thread or to resource waste when no thread can use them. Both situations can harm performance and happen, for example, after an L2 cache miss. In this paper, we introduce the concept of dynamic resource control in SMT processors. Using this concept, we propose a novel resource allocation policy for SMT processors. This policy directly monitors the usage of resources by each thread and guarantees that all threads get their fair share of the critical shared resources, avoiding monopolization. We also define a mechanism to allow a thread to borrow resources from another thread if that thread does not require them, thereby reducing resource under-use. Simulation results show that our dynamic resource allocation policy outperforms a static resource allocation policy by 8%, on average. It also improves the best dynamic resource-conscious fetch policies like FLUSH++ by 4%, on average, using the harmonic mean as a metric. This indicates that our policy does not obtain the ILP boost by unfairly running high ILP threads over slow memory-bounded threads. Instead, it achieves a better throughput-fairness balance.
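The fair-share-with-borrowing idea described in this abstract can be illustrated with a small sketch. This is our own toy accounting scheme, not the paper's hardware mechanism: each thread is guaranteed up to an equal share of a critical resource (say, issue-queue entries), and entries left unused by threads under their share are lent to threads that want more.

```python
# Illustrative sketch: divide `total_entries` of a shared resource among
# threads, guaranteeing each its fair share, and lend leftover entries
# from undemanding threads to threads that exceed their share.

def allocate(total_entries, demand):
    """demand: dict thread id -> entries the thread currently wants.
    Returns dict thread id -> entries granted."""
    n = len(demand)
    fair = total_entries // n
    # Guarantee phase: nobody exceeds the fair share yet.
    alloc = {t: min(d, fair) for t, d in demand.items()}
    spare = total_entries - sum(alloc.values())
    # Borrowing phase: hand spare entries to threads still demanding more.
    for t, d in sorted(demand.items()):
        if d > alloc[t] and spare > 0:
            extra = min(d - alloc[t], spare)
            alloc[t] += extra
            spare -= extra
    return alloc

# Thread 0 needs little; thread 1 borrows the entries thread 0 left unused.
print(allocate(32, {0: 6, 1: 30}))  # -> {0: 6, 1: 26}
```

The key property, as in the abstract, is that borrowing never cuts into a thread's guaranteed share: a thread demanding its full share always gets it.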
L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Improving the utilization of shared resources is a key issue to increase performance in SMT processors. Recent work has focused on resource sharing policies to enhance processor performance, but these proposals mainly concentrate on novel hardware mechanisms that adapt to the dynamic resource requirements of the running threads.

This work addresses the L1 cache bandwidth problem in SMT processors experimentally on real hardware. Unlike previous work, this paper concentrates on thread allocation, selecting the proper pair of co-runners to be launched to the same core. The relation between the L1 bandwidth requirements of each benchmark and its performance (IPC) is analyzed. We found that for individual benchmarks, performance is strongly connected to L1 bandwidth consumption, and this observation remains valid when several co-runners are launched to the same SMT core.

Based on these findings, we propose two L1-bandwidth-aware thread-to-core (t2c) allocation policies, namely Static and Dynamic t2c allocation. The aim of these policies is to properly balance the L1 bandwidth requirements of the running threads among the processor cores. Experiments on a Xeon E5645 processor show that the proposed policies significantly improve performance over the Linux OS kernel regardless of the number of cores considered.

This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01, and by the Programa de Apoyo a la Investigación y Desarrollo (PAID-05-12) of the Universitat Politècnica de València under Grant SP20120748.

Feliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, S. V.; Duato Marín, J. F. (2013). L1-Bandwidth Aware Thread Allocation in Multicore SMT Processors. IEEE. https://doi.org/10.1109/PACT.2013.6618810
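A static t2c policy in the spirit this abstract describes can be sketched as follows. The pairing rule here (heaviest L1 consumer with lightest) is our own assumption for illustration; the paper's actual policies may differ:

```python
# Illustrative sketch: balance measured per-thread L1 bandwidth across
# SMT cores by pairing the heaviest remaining consumer with the lightest,
# so each core sees a similar aggregate L1 demand.

def static_t2c(l1_bw):
    """l1_bw: dict thread id -> measured L1 accesses per cycle.
    Returns a list of (thread, thread) pairs, one pair per SMT core."""
    order = sorted(l1_bw, key=l1_bw.get)  # ascending bandwidth
    pairs = []
    while len(order) >= 2:
        light = order.pop(0)   # lightest remaining consumer
        heavy = order.pop()    # heaviest remaining consumer
        pairs.append((heavy, light))
    return pairs

# Hypothetical per-benchmark L1 bandwidth measurements (made-up numbers).
bw = {"mcf": 1.9, "gcc": 0.7, "lbm": 1.6, "povray": 0.4}
print(static_t2c(bw))  # -> [('mcf', 'povray'), ('lbm', 'gcc')]
```

A dynamic variant would re-run such a pairing periodically from fresh performance-counter readings rather than from one offline profile.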
Efficient memory-level parallelism extraction with decoupled strands
We present Outrider, an architecture for throughput-oriented processors that exploits intra-thread memory-level parallelism (MLP) to improve performance efficiency on highly threaded workloads. Outrider enables a single thread of execution to be presented to the architecture as multiple decoupled instruction streams, consisting of either memory-accessing or memory-consuming instructions. The key insight is that by decoupling the instruction streams, the processor pipeline can expose MLP in a way similar to out-of-order designs while relying on a low-complexity in-order microarchitecture. Instead of adding more threads as is done in modern GPUs, Outrider can expose the same MLP with fewer threads and reduced contention for resources shared among threads.

We demonstrate that Outrider can outperform single-threaded cores by 23-131% and a 4-way simultaneous multithreaded core by up to 87% in data-parallel applications in a 1024-core system. Outrider achieves these performance gains without incurring the overhead of additional hardware thread contexts, which results in improved efficiency compared to a multithreaded core.
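The decoupling idea in this abstract can be illustrated in software. This is only a conceptual sketch, not Outrider's microarchitecture: a memory-accessing strand runs ahead issuing loads into a queue, while a memory-consuming strand dequeues the values and computes, never issuing loads itself. In hardware, the queue lets the access strand run far ahead and overlap load latencies (the MLP); here it merely demonstrates the split.

```python
# Illustrative sketch of splitting one logical thread into a
# memory-accessing strand and a memory-consuming strand, coupled by a
# queue of loaded values.

from collections import deque

def access_strand(data, indices, queue):
    # Memory-accessing strand: performs all the loads, running ahead
    # of the consumer and enqueueing loaded values.
    for i in indices:
        queue.append(data[i])

def consume_strand(queue, n):
    # Memory-consuming strand: computes on loaded values without
    # issuing any loads of its own.
    total = 0
    for _ in range(n):
        total += queue.popleft()
    return total

data = [10, 20, 30, 40]
q = deque()
access_strand(data, [3, 1, 0], q)   # loads 40, 20, 10 into the queue
print(consume_strand(q, 3))          # -> 70
```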