4 research outputs found

    Enhancing in-memory Efficiency for MapReduce-based Data Processing

    This is a post-peer-review, pre-copyedit version of an article published in the Journal of Parallel and Distributed Computing. The final authenticated version is available online at: https://doi.org/10.1016/j.jpdc.2018.04.001

    [Abstract] As the memory capacity of computational systems increases, the in-memory data management of Big Data processing frameworks becomes more crucial for performance. This paper analyzes and improves the memory efficiency of Flame-MR, a framework that accelerates Hadoop applications, providing valuable insight into the impact of memory management on performance. By optimizing memory allocation, garbage collection overheads and execution times have been reduced by up to 85% and 44%, respectively, on a multi-core cluster. Moreover, different data buffer implementations are evaluated, showing that off-heap buffers achieve better results overall. Memory resources are also leveraged by caching intermediate results, improving iterative applications by up to 26%. The memory-enhanced version of Flame-MR has been compared with Hadoop and Spark on the Amazon EC2 cloud platform. The experimental results show significant performance benefits, reducing Hadoop execution times by up to 65% while remaining very competitive with Spark.

    Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P, AEI/FEDER/EU. Ministerio de Educación; FPU14/0280
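    The off-heap finding is easy to illustrate. The abstract does not show Flame-MR's buffer classes, so the following minimal Java sketch only contrasts the two standard ByteBuffer variants: heap buffers, whose backing arrays add to garbage collection work, and direct (off-heap) buffers, which live in native memory outside the collector's reach.

```java
import java.nio.ByteBuffer;

public class BufferDemo {
    static final int BUF_SIZE = 64 * 1024 * 1024; // 64 MiB per buffer

    public static void main(String[] args) {
        // On-heap buffer: backed by a byte[] that the GC must scan and move.
        ByteBuffer heap = ByteBuffer.allocate(BUF_SIZE);

        // Off-heap (direct) buffer: native memory outside the managed heap,
        // so large, long-lived buffers no longer inflate GC pauses.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(BUF_SIZE);

        heap.putLong(42L);
        offHeap.putLong(42L);
        System.out.printf("heap direct=%b, offHeap direct=%b%n",
                heap.isDirect(), offHeap.isDirect());
    }
}
```

    Keeping large, long-lived intermediate buffers off-heap is one way a MapReduce-style framework can reduce GC overheads of the kind reported above.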

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    Department of Computer Science and Engineering

    As the performance and energy efficiency requirements of GPGPUs have risen, memory management techniques for GPGPUs have improved to meet them by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher memory bandwidth. However, they do not always guarantee improved performance and energy efficiency due to the small cache size and the heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques. In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management.

    First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and demonstrate that ACI continues to achieve high performance in various settings.

    Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (by 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (by 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy-efficiency gains over the baseline GPGPU architecture even when it is enhanced with advanced architectural technologies (e.g., higher capacity and associativity).

    Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on this ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (by 13.4% and 16.7%, respectively) and performs similarly to the static-best version (within a 1.2% difference), which requires extensive offline profiling.
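    The dissertation's concrete ACI schemes are not detailed in the abstract; as a rough illustration, the sketch below contrasts conventional modulo indexing with a classic static bitwise-XOR scheme under assumed set and line sizes. The XOR variant disperses a power-of-two stride that conventional indexing maps entirely to set 0.

```java
public class CacheIndexing {
    static final int NUM_SETS = 1 << 7; // 128 cache sets (assumed)
    static final int LINE_BITS = 7;     // 128-byte lines (assumed)
    static final int SET_BITS = 7;      // log2(NUM_SETS)

    // Conventional modulo indexing: low-order bits after the line offset.
    static int conventionalIndex(long addr) {
        return (int) ((addr >>> LINE_BITS) & (NUM_SETS - 1));
    }

    // Bitwise-XOR indexing, one classic static scheme: XOR the set bits
    // with the next-higher address bits so that strided accesses which
    // would collide in one set are spread across the cache.
    static int xorIndex(long addr) {
        long setBits = (addr >>> LINE_BITS) & (NUM_SETS - 1);
        long upperBits = (addr >>> (LINE_BITS + SET_BITS)) & (NUM_SETS - 1);
        return (int) (setBits ^ upperBits);
    }

    public static void main(String[] args) {
        long stride = (long) NUM_SETS << LINE_BITS; // pathological stride
        for (long i = 0; i < 4; i++) {
            long addr = i * stride;
            System.out.printf("addr=%#x conventional=%d xor=%d%n",
                    addr, conventionalIndex(addr), xorIndex(addr));
        }
    }
}
```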

    Quantifying the performance impact of large pages on in-memory big-data workloads

    In-memory big-data processing is rapidly emerging as a promising solution for large-scale data analytics with high-performance and/or real-time requirements. In-memory big-data workloads are often hosted on servers that consist of a few multi-core CPUs and large physical memory, exhibiting non-uniform memory access (NUMA) characteristics. While large pages are commonly known as an effective technique for reducing the performance overheads of virtual memory and are widely supported across modern hardware and system software stacks, relatively little work has been done to investigate their performance impact on in-memory big-data workloads hosted on NUMA systems. To bridge this gap, this work quantifies the performance impact of large pages on in-memory big-data workloads running on a large-scale NUMA system. Our experimental results show that large pages provide little or no performance gain over 4KB pages when the in-memory big-data workloads process sufficiently large datasets. In addition, they show that large pages achieve higher performance gains as the dataset size of the in-memory big-data workloads decreases and the NUMA system scale increases. We also discuss possible performance optimizations for large pages and estimate the potential performance improvements.
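    The headline finding, that large pages help little once the dataset is large enough, follows from simple TLB-reach arithmetic. The sketch below uses assumed values, a 1,536-entry TLB and a 256 GB in-memory dataset (neither number comes from the paper), to show that even 2 MB pages cover only a small fraction of such a working set.

```java
public class TlbReach {
    public static void main(String[] args) {
        long entries = 1536;                      // assumed L2 TLB entries
        long small = 4L * 1024;                   // 4KB base pages
        long large = 2L * 1024 * 1024;            // 2MB large pages
        long dataset = 256L * 1024 * 1024 * 1024; // assumed 256GB dataset

        // TLB reach: memory addressable without a page-table walk.
        long reachSmall = entries * small; // 6 MB
        long reachLarge = entries * large; // 3 GB

        System.out.printf("4KB reach: %d MB%n", reachSmall >> 20);
        System.out.printf("2MB reach: %d GB%n", reachLarge >> 30);
        // Both reaches are tiny fractions of the dataset, consistent with
        // the small gains the study reports for large enough datasets.
        System.out.printf("coverage: %.4f%% vs %.2f%%%n",
                100.0 * reachSmall / dataset, 100.0 * reachLarge / dataset);
    }
}
```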

    System Software Techniques for Effectively Leveraging Emerging Hardware to Improve the Performance, Efficiency, and Fairness of Computer Systems

    Department of Computer Science and Engineering

    Hardware with advanced functionalities and/or improved performance and efficiency has been introduced in modern computer systems. However, such emerging hardware poses several challenges. First, its characteristics are unknown, and deriving useful properties through characterization studies is hard because emerging hardware affects applications with different characteristics in different ways. Second, using emerging hardware alone is suboptimal, but coordinating it with other techniques is hard due to the large and complex system state space. To address these problems, we first conduct in-depth characterization studies of emerging hardware based on applications with various characteristics. Guided by the observations from these studies, we propose a set of system software techniques that combine emerging hardware with other techniques to improve the performance, efficiency, and fairness of computer systems based on efficient optimization algorithms.

    First, we investigate system software techniques to effectively manage hardware-based last-level cache (LLC) and memory bandwidth partitioning functionalities. For effective memory bandwidth partitioning on commodity servers, we propose HyPart, a hybrid technique for practical memory bandwidth partitioning. HyPart combines the three widely used memory bandwidth partitioning techniques (i.e., thread packing, clock modulation, and Intel MBA) in a coordinated manner, considering the characteristics of the target applications, and we demonstrate its effectiveness through quantitative evaluation. We also propose CoPart, coordinated partitioning of LLC and memory bandwidth for fairness-aware workload consolidation on commodity servers. We first characterize the impact of LLC and memory bandwidth partitioning on the performance and fairness of consolidated workloads. Guided by this characterization, we design and implement CoPart, which dynamically profiles the characteristics of the consolidated workloads and partitions LLC and memory bandwidth in a coordinated manner to maximize their fairness. Through quantitative evaluation with various workloads and system configurations, we demonstrate that CoPart significantly improves the overall fairness of consolidated workloads.
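    The abstract does not define CoPart's fairness objective. A common choice in the consolidation literature, sketched below in Java with hypothetical IPC measurements, is the ratio of the smallest to the largest per-application slowdown; a CoPart-style controller would drive this ratio toward 1 by shifting LLC ways and memory bandwidth toward the most-slowed application.

```java
import java.util.Map;

public class FairnessDemo {
    // One widely used fairness metric (not necessarily CoPart's exact one):
    // fairness = min(slowdown) / max(slowdown), where
    // slowdown = solo performance / consolidated performance.
    static double fairness(Map<String, double[]> ipc) {
        double min = Double.MAX_VALUE, max = Double.MIN_VALUE;
        for (double[] soloAndShared : ipc.values()) {
            double slowdown = soloAndShared[0] / soloAndShared[1];
            min = Math.min(min, slowdown);
            max = Math.max(max, slowdown);
        }
        return min / max; // 1.0 = perfectly fair
    }

    public static void main(String[] args) {
        // Hypothetical per-app {solo IPC, consolidated IPC} measurements.
        Map<String, double[]> ipc = Map.of(
                "streaming", new double[]{1.8, 1.5},
                "latency-critical", new double[]{1.2, 0.6});
        System.out.printf("fairness = %.2f%n", fairness(ipc));
        // A coordinated controller would grant more LLC ways and memory
        // bandwidth to the most-slowed application until fairness nears 1.
    }
}
```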
    Second, we investigate a system software technique to effectively leverage hardware-based power capping functionality. We first characterize the performance impact of the two key system knobs for power capping (i.e., the concurrency level of the target applications and cross-component power allocation). Guided by the characterization results, we design and implement RPPC, a holistic runtime system for maximizing performance under power capping. RPPC dynamically controls the key system knobs in a cooperative manner, considering the characteristics (e.g., scalability and memory intensity) of the target applications, and our evaluation shows that it significantly improves performance under power capping across various application and system configurations.

    Third, we investigate system software techniques for effective dynamic concurrency control on many-core and heterogeneous multiprocessing systems. We propose RMC, an integrated runtime system for adaptive many-core computing. RMC combines the two widely used dynamic concurrency control techniques (i.e., thread packing and dynamic threading) in a coordinated manner to exploit the advantages of both: it quickly adjusts the concurrency level of the target applications through thread packing, and further improves performance and efficiency by determining the optimal thread count through dynamic threading. Our quantitative experiments show that RMC outperforms existing dynamic concurrency control techniques in both performance and energy efficiency. In addition, we propose PALM, progress- and locality-aware adaptive task migration for efficient thread packing. We first conduct an in-depth performance analysis of thread packing with various synchronization-intensive benchmarks and system configurations, identifying the root causes of its performance pathologies. Based on these results, we design and implement PALM, which supports both symmetric and heterogeneous multiprocessing systems and solves three key problems for efficient thread packing: progress-aware task migration, locality-aware task migration, and scheduling period control. Our quantitative evaluation shows that PALM achieves substantially higher performance and energy efficiency than conventional thread packing, and we present case studies in which PALM considerably improves the efficiency of dynamic server consolidation and the performance under power capping.
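    Thread packing pins an application's threads onto a subset of cores (e.g., via sched_setaffinity on Linux), which standard Java cannot express directly. As a minimal sketch, the code below approximates the knob such runtimes adapt, the concurrency level, by bounding the number of simultaneously runnable tasks with a semaphore; all names and values are illustrative, not RMC's implementation.

```java
import java.util.concurrent.*;

public class ConcurrencyControlDemo {
    public static void main(String[] args) throws InterruptedException {
        int totalTasks = 16;
        // Concurrency level: an adaptive runtime would tune this value
        // online; here it is fixed for illustration.
        int concurrency = 4;

        // Bound how many tasks may run at once, emulating a reduced
        // concurrency level without changing the total task count.
        Semaphore slots = new Semaphore(concurrency);
        ExecutorService pool = Executors.newFixedThreadPool(totalTasks);

        for (int i = 0; i < totalTasks; i++) {
            final int id = i;
            pool.submit(() -> {
                slots.acquireUninterruptibly();
                try {
                    Thread.sleep(100); // stand-in for real work
                    System.out.println("task " + id + " done");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    slots.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```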