194 research outputs found

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Get PDF
    Processor efficiency can be described with the help of a number of  desirable effects or metrics, for example, performance, power, area, design complexity and access latency. These metrics serve as valuable tools used in designing new processors and they also act as  effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency,  and show how to improve efficiency by exploiting common-case or  predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on in our research are the predictability of memory dependences, the predictability in data forwarding patterns,   predictability in instruction criticality and conservativeness in resource allocation and deallocation policies. We first design a scalable  and high-performance memory dependence predictor and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry  hardware in the processor with no loss in performance.  We then move to  studying instruction criticality to improve  processor efficiency. We study the behavior of critical load instructions  and propose applications that can be optimized using  predictable, load-criticality  information. Finally, we explore conventional techniques for allocation and deallocation  of critical structures that process memory instructions and propose new techniques to optimize the same.  Our new designs have the potential to reduce  the power and the area required by processors significantly without losing  performance, which lead to efficient designs of processors.Ph.D.Committee Chair: Loh, Gabriel H.; Committee Member: Clark, Nathan; Committee Member: Jaleel, Aamer; Committee Member: Kim, Hyesoon; Committee Member: Lee, Hsien-Hsin S.; Committee Member: Prvulovic, Milo

    Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications

    Get PDF
    High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMP) combine few high-performance big cores for serial regions, together with many low-power lean cores for throughput computing. The low requirements of HPC applications in the core front-end lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists to analyze the benefit of sharing the I-cache among full cores, which seems compelling as a solution to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections with McPAT and a rich set of HPC benchmarks show 11% area savings with a 5% energy reduction at no performance cost.The research was supported by European Unions 7th Framework Programme [FP7/2007-2013] under project Mont-Blanc (288777), the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925), Generalitat de Catalunya (2014-SGR-1051 and 2014-SGR-1272), HiPEAC-3 Network of Excellence (ICT-287759), and finally the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.Peer ReviewedPostprint (author's final draft

    Exploring coordinated software and hardware support for hardware resource allocation

    Get PDF
    Multithreaded processors are now common in the industry as they offer high performance at a low cost. Traditionally, in such processors, the assignation of hardware resources between the multiple threads is done implicitly, by the hardware policies. However, a new class of multithreaded hardware allows the explicit allocation of resources to be controlled or biased by the software. Currently, there is little or no coordination between the allocation of resources done by the hardware and the prioritization of tasks done by the software.This thesis targets to narrow the gap between the software and the hardware, with respect to the hardware resource allocation, by proposing a new explicit resource allocation hardware mechanism and novel schedulers that use the currently available hardware resource allocation mechanisms.It approaches the problem in two different types of computing systems: on the high performance computing domain, we characterize the first processor to present a mechanism that allows the software to bias the allocation hardware resources, the IBM POWER5. In addition, we propose the use of hardware resource allocation as a way to balance high performance computing applications. Finally, we propose two new scheduling mechanisms that are able to transparently and successfully balance applications in real systems using the hardware resource allocation. On the soft real-time domain, we propose a hardware extension to the existing explicit resource allocation hardware and, in addition, two software schedulers that use the explicit allocation hardware to improve the schedulability of tasks in a soft real-time system.In this thesis, we demonstrate that system performance improves by making the software aware of the mechanisms to control the amount of resources given to each running thread. In particular, for the high performance computing domain, we show that it is possible to decrease the execution time of MPI applications biasing the hardware resource assignation between threads. In addition, we show that it is possible to decrease the number of missed deadlines when scheduling tasks in a soft real-time SMT system.Postprint (published version

    Addressing fairness in SMT multicores with a progress-aware scheduler

    Full text link
    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last level cache) and DRAM modules. Per process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different pace. If schedulers are not progress aware, the unpredictable execution time caused by unfairness can introduce undesirable behaviors on the system such as difficulties to keep priority-based scheduling. This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute the per process progress. Then, those processes with less progress are prioritized to enhance fairness. Experimental results on a Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over Linux OS. Moreover, thanks to the tread to core allocation policy, the scheduler slightly improves throughput and turnaround time.This work was supported by the Spanish Ministerio de Econom´ıa y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program AwardFeliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48

    SEPARATING INSTRUCTION FETCHES FROM MEMORY ACCESSES : ILAR (INSTRUCTION LINE ASSOCIATIVE REGISTERS)

    Get PDF
    Due to the growing mismatch between processor performance and memory latency, many dynamic mechanisms which are “invisible” to the user have been proposed: for example, trace caches and automatic pre-fetch units. However, these dynamic mechanisms have become inadequate due to implicit memory accesses that have become so expensive. On the other hand, compiler-visible mechanisms like SWAR (SIMD Within A Register) and LARs (Line Associative Registers) are potentially more effective at improving data access performance. This thesis investigates applying the same ideas to improve instruction access. ILAR (Instruction LARs) store instructions in wide registers. Instruction blocks are explicitly loaded into ILAR, using block compression to enhance memory bandwidth. The control flow of the program then refers to instructions directly by their position within an ILAR, rather than by lengthy memory addresses. Because instructions are accessed directly from within registers, there is no implicit instruction fetch from memory. This thesis proposes an instruction set architecture for ILAR, investigates a mechanism to load ILAR using the best available block compression algorithm and also develop hardware descriptions for both ILAR and a conventional memory cache model so that performance comparisons could be made on the instruction fetch stage

    A Multi-core processor for hard real-time systems

    Get PDF
    The increasing demand for new functionalities in current and future hard real-time embedded systems, like the ones deployed in automotive and avionics industries, is driving an increment in the performance required in current embedded processors. Multi-core processors represent a good design solution to cope with such higher performance requirements due to their better performance-per-watt ratio while maintaining the core design simple. Moreover, multi-cores also allow executing mixed-criticality level workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption. Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interferences between different tasks when accessing hardware shared resources. As a result, estimating a meaningful Worst-Case Execution Time (WCET) estimation - i.e. to compute an upper bound of the application's execution time - becomes extremely difficult, if not even impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks. Providing a WCET estimation independent from the other tasks (time composability property) is a key requirement in hard real-time systems. This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals the WCET estimation of a HRT is independent from the other co-running tasks. To that end, we design a multi-core processor in which the maximum delay a request from a Hard Real-time Task (HRT), accessing a hardware shared resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD). In addition, the UBD allows identifying the impact that different processor configurations may have on the WCET by determining the sensitivity of a HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm (called IA3: Interference-Aware Allocation Algorithm), that allocates tasks in a task set based on the HRT's sensitivity to different resource allocations. As a result the hardware shared resources used by HRTs are minimized, by allowing Non Hard Real-time Tasks (NHRTs) to use the rest of resources. Overall, our proposals provide analyzability for the HRTs allowing NHRTs to be executed into the same chip without any effect on the HRTs. The previous first two proposals of this thesis focused on supporting the execution of multi-programmed workloads with mixed-criticality levels (composed of HRTs and NHRTs). Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee a predictable execution of software pipelined parallel programs. This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification in the core design: we design a hardware unit which is interfaced with the processor and integrated into a functional-safety aware methodology. This unit monitors the execution time of a block of instructions and it detects if it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform.La creciente demanda de nuevas funcionalidades en los sistemas empotrados de tiempo real actuales y futuros en industrias como la automovilística y la de aviación, está impulsando un incremento en el rendimiento necesario en los actuales procesadores empotrados. Los procesadores multi-núcleo son una solución eficiente para obtener un mayor rendimiento ya que aumentan el rendimiento por vatio, manteniendo el diseño del núcleo simple. Por otra parte, los procesadores multi-núcleo también permiten ejecutar cargas de trabajo con niveles de tiempo real mixtas (formadas por tareas de tiempo real duro y laxo así como tareas sin requerimientos de tiempo real), maximizando así la utilización de los recursos de procesador y garantizando el bajo consumo de energía. Sin embargo, a pesar los beneficios mencionados anteriormente, los actuales procesadores multi-núcleo son menos analizables que los de un solo núcleo debido a las interferencias surgidas cuando múltiples tareas acceden simultáneamente a los recursos compartidos del procesador. Como resultado, la estimación del peor tiempo de ejecución (conocido como WCET) - es decir, una cota superior del tiempo de ejecución de la aplicación - se convierte en extremadamente difícil, si no imposible, porque el tiempo de ejecución de una tarea puede cambiar dependiendo de las otras tareas que se estén ejecutando concurrentemente. Determinar una estimación del WCET independiente de las otras tareas es un requisito clave en los sistemas empotrados de tiempo real duro. Esta tesis propone un nuevo diseño de procesador multi-núcleo en el que el tiempo de ejecución de las tareas se puede componer, lo que permitirá el uso de procesadores multi-núcleo en los sistemas de tiempo real duro. Para ello, diseñamos un procesador multi-núcleo en el que la máxima demora que puede sufrir una petición de una tarea de tiempo real duro (HRT) para acceder a un recurso hardware compartido debido a otras tareas está acotado, tiene un límite superior (UBD). Además, UBD permite identificar el impacto que las diferentes posibles configuraciones del procesador pueden tener en el WCET, mediante la determinación de la sensibilidad en la variación del tiempo de ejecución de diferentes reservas de recursos del procesador. Esta tesis propone un algoritmo estático de reserva de recursos (llamado IA3), que asigna tareas a núcleos en función de dicha sensibilidad. Como resultado los recursos compartidos del procesador usados por tareas HRT se reducen al mínimo, permitiendo que las tareas sin requerimiento de tiempo real (NHRTs) puedas beneficiarse del resto de recursos. Por lo tanto, las propuestas presentadas en esta tesis permiten el análisis del WCET para tareas HRT, permitiendo así mismo la ejecución de tareas NHRTs en el mismo procesador multi-núcleo, sin que estas tengan ningún efecto sobre las tareas HRT. Las propuestas presentadas anteriormente se centran en el soporte a la ejecución de múltiples cargas de trabajo con diferentes niveles de tiempo real (HRT y NHRTs). Sin embargo, un mayor rendimiento puede lograrse mediante la transformación una tarea en múltiples sub-tareas paralelas. Esta tesis propone una nueva técnica, con soporte del procesador y del sistema operativo, que garantiza una ejecución analizable del modelo de ejecución paralela software pipelining. Esta tesis también investiga una solución para verificar la corrección del WCET de HRT sin necesidad de ninguna modificación en el diseño de la base: un nuevo componente externo al procesador se conecta a este sin necesidad de modificarlo. Esta nueva unidad monitorea el tiempo de ejecución de un bloque de instrucciones y detecta si se excede el WCET. Esta unidad permite detectar fallos de sincronización en sistemas de computación utilizados en automóviles

    Energy-Efficient Acceleration of Asynchronous Programs.

    Full text link
    Asynchronous or event-driven programming has become the dominant programming model in the last few years. In this model, computations are posted as events to an event queue from where they get processed asynchronously by the application. A huge fraction of computing systems built today use asynchronous programming. All the Web 2.0 JavaScript applications (e.g., Gmail, Facebook) use asynchronous programming. There are now more than two million mobile applications available between the Apple App Store and Google Play, which are all written using asynchronous programming. Distributed servers (e.g., Twitter, LinkedIn, PayPal) built using actor-based languages (e.g., Scala) and platforms such as node.js rely on asynchronous events for scalable communication. Internet-of-Things (IoT), embedded systems, sensor networks, desktop GUI applications, etc., all rely on the asynchronous programming model. Despite the ubiquity of asynchronous programs, their unique execution characteristics have been largely ignored by conventional processor architectures, which have remained heavily optimized for synchronous programs. Asynchronous programs are characterized by short events executing varied tasks. This results in a large instruction footprint with little cache locality, severely degrading cache performance. Also, event execution has few repeatable patterns causing poor branch prediction. This thesis proposes novel processor optimizations exploiting the unique execution characteristics of asynchronous programs for performance optimization and energy-efficiency. These optimizations are designed to make the underlying hardware aware of discrete events and thereafter, exploit the latent Event-Level Parallelism present in these applications. Through speculative pre-execution of future events, cache addresses and branch outcomes are recorded and later used for improving cache and branch predictor performance. A hardware instruction prefetcher specialized for asynchronous programs is also proposed as a comparative design direction.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120780/1/gauravc_1.pd

    Stretch: Balancing QoS and Throughput for Colocated Server Workloads on SMT Cores

    Get PDF
    In a drive to maximize resource utilization, today's datacenters are moving to colocation of latency-sensitive and batch workloads on the same server. State-of-the-art deployments, such as those at Google, colocate such diverse workloads even on a single SMT core. This form of aggressive colocation is afforded by virtue of the fact that a latency-sensitive service operating below its peak load has significant slack in its response latency with respect to the QoS target. The slack affords a degradation in single-thread performance, which is inevitable under SMT colocation, without compromising QoS targets. This work makes the observation that many batch applications can greatly benefit from a large instruction window to uncover ILP and MLP. Under SMT colocation, conventional wisdom holds that individual hardware threads should be limited in their ability to acquire and hold a disproportionately large share of microarchitectural resources so as not to compromise the performance of a co-running thread. We show that the performance slack inherent in latency-sensitive workloads operating at low to moderate load makes it safe to shift microarchitectural resources to a co-running batch thread without compromising QoS targets. Based on this insight, we introduce Stretch, a simple ROB partitioning scheme that is invoked by system software to provide one hardware thread with a much larger ROB partition at the expense of another thread. When Stretch is enabled for latency-sensitive workloads operating below their peak load on an SMT core, co-running batch applications gain 13% of performance on average (30% max) over a baseline SMT colocation and without compromising QoS constraints
    corecore