14 research outputs found

    Enabling SMT for real-time embedded systems

    Get PDF
    In order to deal with real time constraints, current embedded processors are usually simple in-order processors with no speculation capabilities to ensure that execution times of applications are predictable. However, embedded systems require ever more compute power and the trend is that they will become as complex as current high performance systems. SMTs are viable candidates for future high performance embedded processors, because of their good cost/performance trade-off. However, current SMTs exhibit unpredictable performance. Hence, the SMT hardware needs to be adapted in order to meet real time constraints. This paper is a first step toward the use of high performance SMT processors in future real time systems. We present a novel collaboration between OS and SMT processors that entails that the OS exercises control over how resources are shared inside the processor. We illustrate this collaboration by a mechanism in which the OS cooperates with the SMT hardware to guarantee that a given thread runs at a specific speed, enabling SMT for real-time systems.Peer ReviewedPostprint (published version

    Efficient resources assignment schemes for clustered multithreaded processors

    Get PDF
    New feature sizes provide larger number of transistors per chip that architects could use in order to further exploit instruction level parallelism. However, these technologies bring also new challenges that complicate conventional monolithic processor designs. On the one hand, exploiting instruction level parallelism is leading us to diminishing returns and therefore exploiting other sources of parallelism like thread level parallelism is needed in order to keep raising performance with a reasonable hardware complexity. On the other hand, clustering architectures have been widely studied in order to reduce the inherent complexity of current monolithic processors. This paper studies the synergies and trade-offs between two concepts, clustering and simultaneous multithreading (SMT), in order to understand the reasons why conventional SMT resource assignment schemes are not so effective in clustered processors. These trade-offs are used to propose a novel resource assignment scheme that gets and average speed up of 17.6% versus Icount improving fairness in 24%.Peer ReviewedPostprint (published version

    Performance Enhancement of Multicore Architecture

    Get PDF
    Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherent diverse requirements of different applications. The permanent need to enhance the performance of multicore architecture motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in fetch stage. Moreover, this paper presents three new fetch stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on Ordinary Least Square (OLS) regression statistic method. These new fetch policies differ on thread selection time which is represented by instructions’ count and window size. Furthermore, the simulation multicore tool, , is adapted to cope with multicore processor dynamic design by adding a dynamic feature in the policy of thread selection in fetch stage. SPLASH2, parallel scientific workloads, has been used to validate the proposed adaptation for multi2sim. Intensive simulated experiments have been conducted and the obtained results show that remarkable performance enhancements have been achieved in terms of execution time and number of instructions per second produces less broadcast operations compared to the typical algorithm

    Software-controlled priority characterization of POWER5 processor

    Get PDF
    Due to the limitations of instruction-level parallelism, thread-level parallelism has become a popular way to improve processor performance. One example is the IBM POWER5TM processor, a two-context simultaneous-multithreaded dual-core chip. In each SMT core, the IBM POWER5 features two levels of thread resource balancing and prioritization. The first level provides automatic in-hardware resource balancing, while the second level is a software-controlled priority mechanism that presents eight levels of thread priorities. Currently, software-controlled prioritization is only used in limited number of cases in the software platforms due to lack of performance characterization of the effects of this mechanism. In this work, we characterize the effects of the software-based prioritization on several different workloads. We show that the impact of the prioritization significantly depends on the workloads coscheduled on a core. By prioritizing the right task, it is possible to obtain more than two times of throughput improvement for synthetic workloads compared to the baseline. We also present two application case studies targeting two different performance metrics: the first case study improves overall throughput by 23.7% and the second case study reduces the total execution time by 9.3%. In addition, we show the circumstances when a background thread can be run transparently without affecting the performance of the foreground thread.Peer ReviewedPostprint (published version

    Managing SMT Resource Usage through Speculative Instruction Window Weighting

    Get PDF
    Simultaneous multithreading processors dynamically share processor resources between multiple threads. In general, shared SMT resources may be managed explicitly, e.g. by dynamically setting queue occupation bounds for each thread as in the DCRA and Hill-Climbing policies. Alternatively, resources may be managed implicitly, i.e. resource usage is controlled by placing the desired instruction mix in the resources. In this case, the main resource management tool is the instruction fetch policy which must predict the behavior of each thread (branch mispredictions, long-latency loads, etc.) as it fetches instructions. In this paper, we present the use of Speculative Instruction Window Weighting (SIWW) to bridge the gap between implicit and explicit SMT fetch policies. SIWW estimates for each thread the amount of outstanding work in the processor pipeline. Fetch proceeds for the thread with the least amount of work left. SIWW policies are implicit as fetch proceeds for the thread with the least amount of work left. They are also explicit as maximum resource allocation can also be set. SIWW can use and combine virtually any of the indicators that were previously proposed for guiding the instruction fetch policy (number of in-flight instructions, number of low confidence branches, number of predicted cache misses, etc.). Therefore, SIWW is an \emph{approach to designing SMT fetch policies}, rather than a particular fetch policy. Targeting fairness or throughput is often contradictory and a SMT scheduling policy often optimizes only one performance metric at the sacrifice of the other metric. Our simulations show that the SIWW fetch policy can achieve at the same time state-of-the-art throughput, state-of-the-art fairness and state-of-the-art harmonic performance mean

    Hybrid Designs for Caches and Cores.

    Full text link
    Processor power constraints have come to the forefront over the last decade, heralded by the stagnation of clock frequency scaling. High-performance core and cache designs often utilize power-hungry techniques to increase parallelism. Conversely, the most energy-efficient designs opt for a serial execution to avoid unnecessary overheads. While both of these extremes constitute one-size-fits-all approaches, a judicious mix of parallel and serial execution has the potential to achieve the best of both high-performing and energy-efficient designs. This dissertation examines such hybrid designs for cores and caches. Firstly, we introduce a novel, hybrid out-of-order/in-order core microarchitecture. Instructions that are steered towards in-order execution skip register allocation, reordering and dynamic scheduling. At the same time, these instructions can interleave on an instruction-by-instruction basis with instructions that continue to benefit from these conventional out-of-order mechanisms. Secondly, this dissertation revisits a hybrid technique introduced for L1 caches, way-prediction, in the context of last-level caches that are larger, have higher associativity, and experience less locality.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113484/1/sleimanf_1.pd
    corecore