394 research outputs found

    BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures

    Full text link
    We introduce BriskStream, an in-memory data stream processing system (DSPSs) specifically designed for modern shared-memory multicore architectures. BriskStream's key contribution is an execution plan optimization paradigm, namely RLAS, which takes relative-location (i.e., NUMA distance) of each pair of producer-consumer operators into consideration. We propose a branch and bound based approach with three heuristics to resolve the resulting nontrivial optimization problem. The experimental evaluations demonstrate that BriskStream yields much higher throughput and better scalability than existing DSPSs on multi-core architectures when processing different types of workloads.Comment: To appear in SIGMOD'1

    LIKWID: Lightweight Performance Tools

    Full text link
    Exploiting the performance of today's microprocessors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter metrics, and microbenchmarking for reliable upper performance bounds. Moreover, it includes a mpirun wrapper allowing for portable thread-core affinity in MPI and hybrid MPI/threaded applications. To demonstrate the capabilities of the tool set we show the influence of thread affinity on performance using the well-known OpenMP STREAM triad benchmark, use hardware counter tools to study the performance of a stencil code, and finally show how to detect bandwidth problems on ccNUMA-based compute nodes.Comment: 12 page

    Architecting Data Centers for High Efficiency and Low Latency

    Full text link
    Modern data centers, housing remarkably powerful computational capacity, are built in massive scales and consume a huge amount of energy. The energy consumption of data centers has mushroomed from virtually nothing to about three percent of the global electricity supply in the last decade, and will continuously grow. Unfortunately, a significant fraction of this energy consumption is wasted due to the inefficiency of current data center architectures, and one of the key reasons behind this inefficiency is the stringent response latency requirements of the user-facing services hosted in these data centers such as web search and social networks. To deliver such low response latency, data center operators often have to overprovision resources to handle high peaks in user load and unexpected load spikes, resulting in low efficiency. This dissertation investigates data center architecture designs that reconcile high system efficiency and low response latency. To increase the efficiency, we propose techniques that understand both microarchitectural-level resource sharing and system-level resource usage dynamics to enable highly efficient co-locations of latency-critical services and low-priority batch workloads. We investigate the resource sharing on real-system simultaneous multithreading (SMT) processors to enable SMT co-locations by precisely predicting the performance interference. We then leverage historical resource usage patterns to further optimize the task scheduling algorithm and data placement policy to improve the efficiency of workload co-locations. Moreover, we introduce methodologies to better manage the response latency by automatically attributing the source of tail latency to low-level architectural and system configurations in both offline load testing environment and online production environment. We design and develop a response latency evaluation framework at microsecond-level precision for data center applications, with which we construct statistical inference procedures to attribute the source of tail latency. Finally, we present an approach that proactively enacts carefully designed causal inference micro-experiments to diagnose the root causes of response latency anomalies, and automatically correct them to reduce the response latency.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/144144/1/yunqi_1.pd

    Contention-Aware Scheduling for SMT Multicore Processors

    Get PDF
    The recent multicore era and the incoming manycore/manythread era generate a lot of challenges for computer scientists going from productive parallel programming, over network congestion avoidance and intelligent power management, to circuit design issues. The ultimate goal is to squeeze out as much performance as possible while limiting power and energy consumption and guaranteeing a reliable execution. The increasing number of hardware contexts of current and future systems makes the scheduler an important component to achieve this goal, as there is often a combinatorial amount of different ways to schedule the distinct threads or applications, each with a different performance due to the inter-application interference. Picking an optimal schedule can result in substantial performance gains. This thesis deals with inter-application interference, covering the problems this fact causes on performance and fairness on actual machines. The study starts with single-threaded multicore processors (Intel Xeon X3320), follows with simultaneous multithreading (SMT) multicores supporting up to two threads per core (Intel Xeon E5645), and goes to the most highly threaded per-core processor that has ever been built (IBM POWER8). The dissertation analyzes the main contention points of each experimental platform and proposes scheduling algorithms that tackle the interference arising at each contention point to improve the system throughput and fairness. First we analyze contention through the memory hierarchy of current multicore processors. The performed studies reveal high performance degradation due to contention on main memory and any shared cache the processors implement. To mitigate such contention, we propose different bandwidth-aware scheduling algorithms with the key idea of balancing the memory accesses through the workload execution time and the cache requests among the different caches at each cache level. The high interference that different applications suffer when running simultaneously on the same SMT core, however, does not only affect performance, but can also compromise system fairness. In this dissertation, we also analyze fairness in current SMT multicores. To improve system fairness, we design progress-aware scheduling algorithms that estimate, at runtime, how the processes progress, which allows to improve system fairness by prioritizing the processes with lower accumulated progress. Finally, this dissertation tackles inter-application contention in the IBM POWER8 system with a symbiotic scheduler that addresses overall SMT interference. The symbiotic scheduler uses an SMT interference model, based on CPI stacks, that estimates the slowdown of any combination of applications if they are scheduled on the same SMT core. The number of possible schedules, however, grows too fast with the number of applications and makes unfeasible to explore all possible combinations. To overcome this issue, the symbiotic scheduler models the scheduling problem as a graph problem, which allows finding the optimal schedule in reasonable time. In summary, this thesis addresses contention in the shared resources of the memory hierarchy and SMT cores of multicore processors. We identify the main contention points of three systems with different architectures and propose scheduling algorithms to tackle contention at these points. The evaluation on the real systems shows the benefits of the proposed algorithms. The symbiotic scheduler improves system throughput by 6.7\% over Linux. Regarding fairness, the proposed progress-aware scheduler reduces Linux unfairness to a third. Besides, since the proposed algorithm are completely software-based, they could be incorporated as scheduling policies in Linux and used in small-scale servers to achieve the mentioned benefits.La actual era multinúcleo y la futura era manycore/manythread generan grandes retos en el área de la computación incluyendo, entre otros, la programación paralela productiva o la gestión eficiente de la energía. El último objetivo es alcanzar las mayores prestaciones limitando el consumo energético y garantizando una ejecución confiable. El incremento del número de contextos hardware de los sistemas hace que el planificador se convierta en un componente importante para lograr este objetivo debido a que existen múltiples formas diferentes de planificar las aplicaciones, cada una con distintas prestaciones debido a las interferencias que se producen entre las aplicaciones. Seleccionar la planificación óptima puede proporcionar importantes mejoras de prestaciones. Esta tesis se ocupa de las interferencias entre aplicaciones, cubriendo los problemas que causan en las prestaciones y equidad de los sistemas actuales. El estudio empieza con procesadores multinúcleo monohilo (Intel Xeon X3320), sigue con multinúcleos con soporte para la ejecución simultanea (SMT) de dos hilos (Intel Xeon E5645), y llega al procesador que actualmente soporta un mayor número de hilos por núcleo (IBM POWER8). La disertación analiza los principales puntos de contención en cada plataforma y propone algoritmos de planificación que mitigan las interferencias que se generan en cada uno de ellos para mejorar la productividad y equidad de los sistemas. En primer lugar, analizamos la contención a lo largo de la jerarquía de memoria. Los estudios realizados revelan la alta degradación de prestaciones provocada por la contención en memoria principal y en cualquier cache compartida. Para mitigar esta contención, proponemos diversos algoritmos de planificación cuya idea principal es distribuir los accesos a memoria a lo largo del tiempo de ejecución de la carga y las peticiones a las caches entre las diferentes caches compartidas en cada nivel. Las altas interferencias que sufren las aplicaciones que se ejecutan simultáneamente en un núcleo SMT, sin embargo, no solo afectan a las prestaciones, sino que también pueden comprometer la equidad del sistema. En esta tesis, también abordamos la equidad en los actuales multinúcleos SMT. Para mejorarla, diseñamos algoritmos de planificación que estiman el progreso de las aplicaciones en tiempo de ejecución, lo que permite priorizar los procesos con menor progreso acumulado para reducir la inequidad. Finalmente, la tesis se centra en la contención entre aplicaciones en el sistema IBM POWER8 con un planificador simbiótico que aborda la contención en todo el núcleo SMT. El planificador simbiótico utiliza un modelo de interferencia basado en pilas de CPI que predice las prestaciones para la ejecución de cualquier combinación de aplicaciones en un núcleo SMT. El número de posibles planificaciones, no obstante, crece muy rápido y hace inviable explorar todas las posibles combinaciones. Por ello, el problema de planificación se modela como un problema de teoría de grafos, lo que permite obtener la planificación óptima en un tiempo razonable. En resumen, esta tesis aborda la contención en los recursos compartidos en la jerarquía de memoria y el núcleo SMT de los procesadores multinúcleo. Identificamos los principales puntos de contención de tres sistemas con diferentes arquitecturas y proponemos algoritmos de planificación para mitigar esta contención. La evaluación en sistemas reales muestra las mejoras proporcionados por los algoritmos propuestos. Así, el planificador simbiótico mejora la productividad, en promedio, un 6.7% con respecto a Linux. En cuanto a la equidad, el planificador que considera el progreso consigue reducir la inequidad de Linux a una tercera parte. Además, dado que los algoritmos propuestos son completamente software, podrían incorporarse como políticas de planificación en Linux y usarse en servidores a pequeña escala para obtener los benefiL'actual era multinucli i la futura era manycore/manythread generen grans reptes en l'àrea de la computació incloent, entre d'altres, la programació paral·lela productiva o la gestió eficient de l'energia. L'últim objectiu és assolir les majors prestacions limitant el consum energètic i garantint una execució confiable. L'increment del número de contextos hardware dels sistemes fa que el planificador es convertisca en un component important per assolir aquest objectiu donat que existeixen múltiples formes distintes de planificar les aplicacions, cadascuna amb unes prestacions diferents degut a les interferències que es produeixen entre les aplicacions. Seleccionar la planificació òptima pot donar lloc a millores importants de les prestacions. Aquesta tesi s'ocupa de les interferències entre aplicacions, cobrint els problemes que provoquen en les prestacions i l'equitat dels sistemes actuals. L'estudi comença amb processadors multinucli monofil (Intel Xeon X3320), segueix amb multinuclis amb suport per a l'execució simultània (SMT) de dos fils (Intel Xeon E5645), i arriba al processador que actualment suporta un major nombre de fils per nucli (IBM POWER8). Aquesta dissertació analitza els principals punts de contenció en cada plataforma i proposa algoritmes de planificació que aborden les interferències que es generen en cadascun d'ells per a millorar la productivitat i l'equitat dels sistemes. En primer lloc, estudiem la contenció al llarg de la jerarquia de memòria en els processadors multinucli. Els estudis realitzats revelen l'alta degradació de prestacions provocada per la contenció en memòria principal i en qualsevol cache compartida. Per a mitigar la contenció, proposem diversos algoritmes de planificació amb la idea principal de distribuir els accessos a memòria al llarg del temps d'execució de la càrrega i les peticions a les caches entre les diferents caches compartides en cada nivell. Les altes interferències que sofreixen las aplicacions que s'executen simultàniament en un nucli SMT, no obstant, no sols afecten a las prestacions, sinó que també poden comprometre l'equitat del sistema. En aquesta tesi, també abordem l'equitat en els actuals multinuclis SMT. Per a millorar-la, dissenyem algoritmes de planificació que estimen el progrés de les aplicacions en temps d'execució, el que permet prioritzar els processos amb menor progrés acumulat para a reduir la inequitat. Finalment, la tesi es centra en la contenció entre aplicacions en el sistema IBM POWER8 amb un planificador simbiòtic que aborda la contenció en tot el nucli SMT. El planificador simbiòtic utilitza un model d'interferència basat en piles de CPI que prediu les prestacions per a l'execució de qualsevol combinació d'aplicacions en un nucli SMT. El nombre de possibles planificacions, no obstant, creix molt ràpid i fa inviable explorar totes les possibles combinacions. Per resoldre aquest contratemps, el problema de planificació es modela com un problema de teoria de grafs, la qual cosa permet obtenir la planificació òptima en un temps raonable. En resum, aquesta tesi aborda la contenció en els recursos compartits en la jerarquia de memòria i el nucli SMT dels processadors multinucli. Identifiquem els principals punts de contenció de tres sistemes amb diferents arquitectures i proposem algoritmes de planificació per a mitigar aquesta contenció. L'avaluació en sistemes reals mostra les millores proporcionades pels algoritmes proposats. Així, el planificador simbiòtic millora la productivitat una mitjana del 6.7% respecte a Linux. Pel que fa a l'equitat, el planificador que considera el progrés aconsegueix reduir la inequitat de Linux a una tercera part. A més, donat que els algoritmes proposats son completament software, podrien incorporar-se com a polítiques de planificació en Linux i emprar-se en servidors a petita escala per obtenir els avantatges mencionats.Feliu Pérez, J. (2017). Contention-Aware Scheduling for SMT Multicore Processors [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/79081TESISPremios Extraordinarios de tesis doctorale

    Improving IBM POWER8 Performance Through Symbiotic Job Scheduling

    Full text link
    [EN] Symbiotic job scheduling, i.e., scheduling applications that co-run well together on a core, can have a considerable impact on the performance of processors with simultaneous multithreading (SMT) cores. SMT cores share most of their microarchitectural components among the co-running applications, which causes performance interference between them. Therefore, scheduling applications with complementary resource requirements on the same core can greatly improve the throughput of the system. This paper enhances symbiotic job scheduling for the IBM POWER8 processor. We leverage the existing cycle accounting mechanism to build an interference model that predicts symbiosis between applications. The proposed models achieve higher accuracy than previous models by predicting job symbiosis from throttled CPI stacks, i.e., CPI stacks of the applications when running in the same SMT mode to consider the statically partitioned resources, but without interference from other applications. The symbiotic scheduler uses these interference models to decide, at run-time, which applications should run on the same core or on separate cores. We prototype the symbiotic scheduler as a user-level scheduler in the Linux operating system and evaluate it on an IBM POWER8 server running multiprogram workloads. The symbiotic job scheduler significantly improves performance compared to both an agnostic random scheduler and the default Linux scheduler. Across all evaluated workloads in SMT4 mode, throughput improves by 12.4 and 5.1 percent on average over the random and Linux schedulers, respectively.This work was supported in part by the Spanish Ministerio de Econom ıa y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972- C5-1-R and TIN2014-62246-EXP, as well as by the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement No. 259295.Feliu-Pérez, J.; Eyerman, S.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2017). Improving IBM POWER8 Performance Through Symbiotic Job Scheduling. IEEE Transactions on Parallel and Distributed Systems. 28(10):2838-2851. https://doi.org/10.1109/TPDS.2017.269170828382851281

    Multi-core devices for safety-critical systems: a survey

    Get PDF
    Multi-core devices are envisioned to support the development of next-generation safety-critical systems, enabling the on-chip integration of functions of different criticality. This integration provides multiple system-level potential benefits such as cost, size, power, and weight reduction. However, safety certification becomes a challenge and several fundamental safety technical requirements must be addressed, such as temporal and spatial independence, reliability, and diagnostic coverage. This survey provides a categorization and overview at different device abstraction levels (nanoscale, component, and device) of selected key research contributions that support the compliance with these fundamental safety requirements.This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under grant TIN2015-65316-P, Basque Government under grant KK-2019-00035 and the HiPEAC Network of Excellence. The Spanish Ministry of Economy and Competitiveness has also partially supported Jaume Abella under Ramon y Cajal postdoctoral fellowship (RYC-2013-14717).Peer ReviewedPostprint (author's final draft
    corecore