
    Performance Characterization of Spark Workloads on Shared NUMA Systems

    As the adoption of Big Data technologies becomes the norm in an increasing number of scenarios, there is also a growing need to optimize them for modern processors. Spark has gained momentum over the last few years among companies looking for high-performance solutions that can scale out across different cluster sizes. At the same time, modern processors can be connected to large amounts of physical memory, up to a few terabytes. This opens an enormous range of opportunities for runtimes and applications that aim to improve their performance by leveraging the low latencies and high bandwidth provided by RAM. As a result, several applications today have started pushing the in-memory computing paradigm to accelerate tasks. To deliver such large physical memory capacities, hardware vendors have adopted Non-Uniform Memory Access (NUMA) architectures. This paper explores how Spark-based workloads are impacted by NUMA-placement decisions, how different Spark configurations result in changes in delivered performance, how the characteristics of the applications can be used to predict workload collocation conflicts, and how to improve performance by collocating workloads in scale-up nodes. We explore several workloads run on top of the IBM POWER8 processor, and provide manual strategies that deliver performance improvements of up to 40% on Spark workloads when using smart processor pinning and workload collocation. This work is partially supported by the European Research Council (ERC) under the EU Horizon 2020 programme (GA 639595), the Spanish Ministry of Economy, Industry and Competitiveness (TIN2015-65316-P) and the Generalitat de Catalunya (2014-SGR-1051).
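
    The pinning side of such strategies can be illustrated with numactl, which binds a process's threads and memory allocations to a given NUMA node. Below is a minimal sketch assuming a hypothetical two-node machine and worker command; the paper's actual POWER8 placement strategies are more involved than this illustration.

```python
# Sketch: launching per-NUMA-node worker processes with numactl.
# The node layout and worker command are hypothetical placeholders.
import subprocess

NUMA_NODES = [0, 1]                                 # assumed two-node topology
WORKER_CMD = ["java", "-jar", "spark-worker.jar"]   # hypothetical worker launch

def launch_pinned_workers():
    """Start one worker per NUMA node, binding both CPU and memory
    to that node so allocations stay local to the executing cores."""
    procs = []
    for node in NUMA_NODES:
        cmd = ["numactl",
               f"--cpunodebind={node}",   # run only on this node's cores
               f"--membind={node}",       # allocate only this node's memory
               *WORKER_CMD]
        procs.append(subprocess.Popen(cmd))
    return procs

if __name__ == "__main__":
    for p in launch_pinned_workers():
        p.wait()
```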

    Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

    In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single-node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe memory-bound latency to be the major cause of work time inflation. Accepted to the 5th IEEE International Conference on Big Data and Cloud Computing (BDCloud 2015).
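
    The two scalability effects named here can be made concrete. The sketch below shows one way to compute work time inflation and thread-level load imbalance from per-thread CPU times; both the exact metric definitions and the input numbers are illustrative assumptions, not the paper's measurements.

```python
# Sketch: work time inflation and thread-level load imbalance metrics
# computed from (made-up) per-thread CPU times.

def work_time_inflation(single_thread_time, per_thread_times):
    """Ratio of combined CPU time with N threads to the CPU time a
    single thread needs for the same work (1.0 = no inflation)."""
    return sum(per_thread_times) / single_thread_time

def load_imbalance(per_thread_times):
    """Max-over-mean per-thread busy time (1.0 = perfectly balanced)."""
    mean = sum(per_thread_times) / len(per_thread_times)
    return max(per_thread_times) / mean

# Example: 12 threads spend 130 s of combined CPU time on work that
# takes one thread 100 s -> inflation 1.30, imbalance 1.15.
times = [10.0, 10.5, 11.0, 10.2, 12.5, 10.8,
         11.2, 10.4, 10.9, 10.6, 11.3, 10.6]
print(f"inflation: {work_time_inflation(100.0, times):.2f}")
print(f"imbalance: {load_imbalance(times):.2f}")
```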

    Data center's telemetry reduction and prediction through modeling techniques

    Nowadays, Cloud Computing is widely used to host and deliver services over the Internet. The architecture of clouds is complex due to the heterogeneous nature of their hardware, and they are hosted in large-scale data centers. Effectively and efficiently managing such complex infrastructure requires constant monitoring. This monitoring generates large amounts of telemetry data streams (e.g. hardware utilization metrics), which are used for multiple purposes including problem detection, resource management, workload characterization, resource utilization prediction, capacity planning, and job scheduling. These telemetry streams consume costly bandwidth and storage space, particularly over the medium to long term for large data centers. Moreover, accurately forecasting these telemetry streams is challenging due to multi-tenant co-hosted applications and dynamic workloads, and inaccurate estimation leads to either under- or over-provisioning of data center resources. In this Ph.D. thesis, we propose to improve prediction accuracy and reduce the bandwidth utilization and storage space requirements with the help of modeling and prediction methods from machine learning. Most existing methods are based on a single model, which often does not appropriately estimate different workload scenarios. Moreover, these prediction methods use observation windows of a fixed size, which cannot produce accurate results because they are not adjusted adaptively to capture the local trends in the recent data; an estimation method trained on a fixed sliding window uses a large number of irrelevant observations, which yields inaccurate estimations. In summary, we C1) efficiently reduce bandwidth and storage for telemetry data through real-time modeling using a Markov chain model, C2) propose a novel method to adaptively and automatically identify the most appropriate model to accurately estimate data center resource utilization, and C3) propose a deep-learning-based adaptive window size selection method which dynamically limits the sliding window size to capture the local trend in the latest resource utilization for building the estimation model.
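
    Contribution C1 can be illustrated with a toy version of the idea: discretize a utilization stream into bins, fit a first-order Markov transition matrix, and transmit the compact matrix instead of the raw samples. This is a minimal sketch with an assumed bin count and a synthetic stream; the thesis builds its models in real time over live telemetry.

```python
# Sketch of the C1 idea: replace a raw telemetry stream with a compact
# first-order Markov chain model. Bin count and input data are assumptions.
import numpy as np

N_BINS = 10  # discretize 0-100% utilization into 10% buckets

def fit_markov_chain(stream):
    """Fit a transition matrix over discretized utilization levels."""
    states = np.minimum((np.asarray(stream) / 100.0 * N_BINS).astype(int),
                        N_BINS - 1)
    counts = np.zeros((N_BINS, N_BINS))
    for a, b in zip(states[:-1], states[1:]):
        counts[a, b] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows for states never observed stay all-zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts),
                     where=row_sums > 0)

def synthesize(matrix, start_state, n, rng):
    """Regenerate a statistically similar stream from the model."""
    out, s = [], start_state
    for _ in range(n):
        s = rng.choice(N_BINS, p=matrix[s])
        out.append(s * 100.0 / N_BINS)  # map bin index back to a % value
    return out

rng = np.random.default_rng(0)
cpu_stream = np.clip(50 + rng.normal(0, 15, 10_000), 0, 100)  # fake telemetry
model = fit_markov_chain(cpu_stream)   # 10x10 floats vs. 10,000 raw samples
print(synthesize(model, 5, 5, rng))
```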

    Parallelism Management for Concurrently Executing Parallel Applications

    Ph.D. thesis, Seoul National University, Department of Electrical and Computer Engineering, College of Engineering, August 2020. Advisor: Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming increasingly important for improving the utilization of the available hardware resources. While co-location of parallel jobs is common practice, efficiently executing multiple parallel applications simultaneously remains a challenge for current parallel runtime systems. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical cores. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, which can incur substantial management costs as the number of co-located applications grows. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to each application, is considered a promising and scalable approach for executing co-located applications. Despite its growing importance, spatial scheduling still poses two fundamental research issues. First, it requires runtime support for parallel applications to run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of cores to each application, depending on the applications' performance characteristics, to achieve good runtime performance. To this end, this thesis presents three novel runtime-level techniques for efficiently executing co-located parallel applications under spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP applications: applications dynamically adapt their degree of parallelism to the varying availability of core resources, allowing them to run efficiently under changing core allocations, in contrast to conventional runtime systems that do not adjust an application's degree of parallelism. Second, this thesis introduces an analytical performance model that estimates resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory system. The queueing-based approach allows us to estimate performance using closed-form equations and hardware performance counters. Third, we present a core allocation framework that manages core resources among co-located parallel applications. Through analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage generally leads to better performance than conventional core allocation policies that maximize CPU usage alone. The presented framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, building on the cooperative parallel runtime support and the analytical model.
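
    The malleable-execution idea can be sketched at a high level: a parallel loop is processed in batches, and the worker count is re-read from the (possibly changed) core allocation before each batch. This is a conceptual Python illustration in which allowed_cores() is a hypothetical stand-in for the spatial scheduler; the thesis implements malleability inside the OpenMP runtime itself.

```python
# Conceptual sketch of malleable execution: a parallel loop that adapts
# its worker count to a changing core allocation between batches.
from concurrent.futures import ThreadPoolExecutor

def allowed_cores():
    """Hypothetical query of the current spatial core allocation."""
    return 4  # imagine this value changing at runtime

def malleable_parallel_for(work_items, body, chunk=64):
    results = []
    i = 0
    while i < len(work_items):
        width = allowed_cores()            # re-read the allocation
        batch = work_items[i:i + chunk * width]
        # Size the pool to the cores we are currently allowed to use;
        # recreating the pool per batch keeps the sketch simple.
        with ThreadPoolExecutor(max_workers=width) as pool:
            results.extend(pool.map(body, batch))
        i += len(batch)
    return results

print(sum(malleable_parallel_for(range(1000), lambda x: x * x)))
```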
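
    The queueing-based performance model can be sketched in its simplest form by treating the memory system as a single M/M/1 server shared by k cores. The thesis uses hierarchical queueing systems for multi-socket NUMA machines and derives the parameters from hardware performance counters, so the single-queue model and the numbers below are illustrative assumptions only.

```python
# Simplified sketch of a queueing-based speedup model: memory is one
# M/M/1 server; k cores issue requests at a fixed per-core rate, and
# queueing delay grows as the aggregate arrival rate approaches the
# service rate, flattening the predicted speedup.

def predicted_speedup(k, compute_time, mem_service_time, mem_rate_per_core):
    """Speedup on k cores under an open M/M/1 memory-contention model.

    compute_time      -- CPU work per loop iteration (s)
    mem_service_time  -- memory service time per request, 1/mu (s)
    mem_rate_per_core -- memory requests issued per second by one core
    """
    mu = 1.0 / mem_service_time
    lam = k * mem_rate_per_core          # aggregate arrival rate
    if lam >= mu:
        return float("inf")              # open model invalid: saturation
    sojourn = 1.0 / (mu - lam)           # M/M/1 time in system per request
    t1 = compute_time + 1.0 / (mu - mem_rate_per_core)
    tk = compute_time + sojourn
    return k * t1 / tk

# Speedup flattens as the memory queue fills (made-up parameters).
for cores in (1, 2, 4, 8, 16):
    s = predicted_speedup(cores, 50e-9, 10e-9, 5e6)
    print(f"{cores:2d} cores -> speedup {s:.2f}")
```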

    Venice: Exploring Server Architectures for Effective Resource Sharing

    Consolidated server racks are quickly becoming the backbone of IT infrastructure for science, engineering, and business alike. Yet these servers are still largely built and organized as they were when they operated as distributed, individual entities. Given that many fields increasingly rely on analytics of huge datasets, it makes sense to support flexible resource utilization across servers to improve cost-effectiveness and performance. We introduce Venice, a family of data-center server architectures that builds a strong communication substrate as a first-class resource for server chips. Venice provides a diverse set of resource-joining mechanisms that enable user programs to efficiently leverage non-local resources. To better understand the implications of design decisions about system support for resource sharing, we have constructed a hardware prototype that allows us to more accurately measure the end-to-end performance of at-scale applications and to explore tradeoffs among performance, power, and resource-sharing transparency. We present results from our initial studies analyzing these tradeoffs when sharing memory, accelerators, or NICs. We find that it is particularly important to reduce or hide latency, that data-sharing access patterns should match the features of the communication channels employed, and that inter-channel collaboration can be exploited for better performance.