8 research outputs found

    Efficient Memory Management for GPU-based Deep Learning Systems

    Get PDF
    GPUs (graphics processing units) are used for many data-intensive applications, and deep learning systems are among their most important consumers today. As deep learning applications adopt deeper and larger models to achieve higher accuracy, memory management becomes an important research topic for deep learning systems, given that GPU memory is limited. Many approaches have been proposed to address this issue, e.g., model compression and memory swapping, but they either degrade model accuracy or require substantial manual intervention. In this paper, we propose two orthogonal approaches that reduce memory cost from the system perspective. Our approaches are transparent to the models and thus do not affect model accuracy. They exploit the iterative nature of the deep learning training algorithm to derive the lifetime and read/write order of all variables. With the lifetime semantics, we implement a memory pool with minimal fragmentation. However, the underlying optimization problem is NP-complete, so we propose a heuristic algorithm that reduces memory usage by up to 13.3% compared with Nvidia's default memory pool, at equal time complexity. With the read/write semantics, variables that are not in use can be swapped out from GPU to CPU to reduce the memory footprint. We propose multiple swapping strategies that automatically decide which variable to swap and when to swap it out (and back in), reducing memory cost by up to 34.2% without communication overhead.
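    The lifetime-based memory pool described above is an instance of offline dynamic storage allocation: once each variable's allocation and release steps are known, variables whose lifetimes never overlap can share the same region of the pool. The sketch below illustrates that idea in Python using an assumed size-descending first-fit heuristic; it is not the paper's actual algorithm, and all names and numbers are invented.

```python
# A minimal sketch, assuming an offline planner with full lifetime
# information. The size-descending first-fit strategy and all names
# here are illustrative assumptions, not the paper's heuristic.

def plan_pool(variables):
    """variables: list of (name, alloc_step, free_step, size_bytes)."""
    placed = []        # (offset, size, alloc_step, free_step)
    offsets = {}
    # Heuristic: place the largest variables first.
    for name, a, f, size in sorted(variables, key=lambda v: -v[3]):
        # Offset ranges already occupied by lifetime-overlapping variables.
        busy = sorted((o, o + s) for o, s, a2, f2 in placed
                      if a < f2 and a2 < f)
        # First fit: lowest offset gap large enough for this variable.
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break
            offset = max(offset, hi)
        placed.append((offset, size, a, f))
        offsets[name] = offset
    peak = max((o + s for o, s, *_ in placed), default=0)
    return offsets, peak

# Example: "act1" and "act2" never coexist, so they can share one region
# and the pool's peak size stays below the sum of all variable sizes.
offsets, peak = plan_pool([("w_grad", 0, 5, 400),
                           ("act1", 1, 3, 300),
                           ("act2", 3, 6, 300)])
print(offsets, peak)   # {'w_grad': 0, 'act1': 400, 'act2': 400} 700
```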

    PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

    Full text link
    This paper describes PlinyCompute, a system for developing high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database-style optimization to figure out how to stage distributed computations. In the small, however, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and an associated memory management system designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are built on top of the Java Virtual Machine (JVM) and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach---declarative in the large, trusting the programmer's ability to use the PC object model efficiently in the small---results in a system that is ideal for developing reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can yield speedups of 2x to more than 50x over equivalent implementations on Spark.
    Comment: 48 pages, including references and appendix
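    The heart of the contrast drawn above is who controls object layout and deallocation. As a rough illustration of the arena-style idea (application-managed layout in one contiguous buffer, bulk release instead of per-object garbage collection), here is a minimal Python sketch; the class and record layout are invented, and this is not PlinyCompute's actual PC object model API.

```python
# A rough sketch of application-controlled layout, assuming a fixed
# record format. All names here are invented for illustration.
import struct

REC = struct.Struct("<qd")    # assumed layout: int64 key, float64 value

class Arena:
    def __init__(self):
        self.buf = bytearray()            # one contiguous backing buffer

    def append(self, key, value):
        off = len(self.buf)
        self.buf += REC.pack(key, value)  # records packed back to back
        return off                        # handle = byte offset, not a pointer

    def get(self, off):
        return REC.unpack_from(self.buf, off)

    def release_all(self):
        self.buf = bytearray()            # one bulk release, no per-object GC

arena = Arena()
h = arena.append(42, 3.14)
print(arena.get(h))   # (42, 3.14)
```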

    Enhancing in-memory Efficiency for MapReduce-based Data Processing

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in the Journal of Parallel and Distributed Computing. The final authenticated version is available online at: https://doi.org/10.1016/j.jpdc.2018.04.001
    [Abstract] As the memory capacity of computational systems increases, the in-memory data management of Big Data processing frameworks becomes more crucial for performance. This paper analyzes and improves the memory efficiency of Flame-MR, a framework that accelerates Hadoop applications, providing valuable insight into the impact of memory management on performance. By optimizing memory allocation, garbage collection overheads and execution times have been reduced by up to 85% and 44%, respectively, on a multi-core cluster. Moreover, different data buffer implementations are evaluated, showing that off-heap buffers achieve better results overall. Memory resources are also leveraged by caching intermediate results, improving iterative applications by up to 26%. The memory-enhanced version of Flame-MR has been compared with Hadoop and Spark on the Amazon EC2 cloud platform. The experimental results show significant performance benefits, reducing Hadoop execution times by up to 65% while providing very competitive results compared to Spark.
    Ministerio de Economía, Industria y Competitividad; TIN2016-75845-P, AEI/FEDER/EU
    Ministerio de Educación; FPU14/0280
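    One way to read the allocation optimization above is as buffer reuse: handing recycled buffers back out instead of allocating fresh ones removes most allocation churn, and with it most garbage-collection work. A minimal sketch of that general pattern follows, with invented names; it is not Flame-MR's actual API.

```python
# A minimal sketch of buffer pooling. Reusing buffers avoids
# per-operation allocations and the GC pressure they create.
class BufferPool:
    def __init__(self, buf_size, count):
        self.buf_size = buf_size
        self.free = [bytearray(buf_size) for _ in range(count)]

    def acquire(self):
        # Hand out a recycled buffer when available; allocate only on a miss.
        return self.free.pop() if self.free else bytearray(self.buf_size)

    def release(self, buf):
        self.free.append(buf)   # recycle instead of discarding to the GC

pool = BufferPool(64 * 1024, count=8)
buf = pool.acquire()
# ... fill buf with intermediate data, write it out ...
pool.release(buf)               # the buffer survives for the next round
```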

    Deca : a garbage collection optimizer for in-memory data processing

    Get PDF
    In-memory caching of intermediate data and active combining of data in shuffle buffers have been shown to be very effective in minimizing the recomputation and I/O cost in big data processing systems such as Spark and Flink. However, it has also been widely reported that these techniques create a large number of long-living data objects in the heap. These objects may quickly saturate the garbage collector, especially when handling a large dataset, and hence limit the scalability of the system. To eliminate this problem, we propose a lifetime-based memory management framework which, by automatically analyzing the user-defined functions and data types, obtains the expected lifetime of the data objects and then allocates and releases memory space accordingly to minimize the garbage collection overhead. In particular, we present Deca, a concrete implementation of our proposal on top of Spark, which transparently decomposes and groups objects with similar lifetimes into byte arrays and releases their space altogether when their lifetimes come to an end. When processing very large data, Deca also provides field-oriented memory pages to ensure high compression efficiency. Extensive experimental studies using both synthetic and real datasets show that, compared to Spark, Deca is able to (1) reduce garbage collection time by up to 99.9%, (2) reduce memory consumption by up to 46.6% and storage space by 23.4%, (3) achieve 1.2× to 22.7× speedup in execution time in cases without data spilling and 16× to 41.6× speedup in cases with data spilling, and (4) provide similar performance compared to domain-specific systems.
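    The central mechanism here is grouping: objects with the same expected lifetime are serialized into one shared byte array, so the end of that lifetime frees them all in a single bulk operation rather than leaving many small objects for the garbage collector to trace. Below is a minimal Python sketch of that idea; the lifetime tags and record layout are invented for illustration, and this is not Deca's actual implementation.

```python
# A minimal sketch of lifetime-grouped storage, assuming each object's
# lifetime tag is known in advance (Deca derives this by analyzing
# user-defined functions and data types; tags here are invented).
import struct

PAIR = struct.Struct("<qq")      # assumed record: two int64 fields

class LifetimeGroups:
    def __init__(self):
        self.groups = {}         # lifetime tag -> one packed byte array

    def put(self, lifetime, a, b):
        # Objects with the same lifetime share a single byte array.
        self.groups.setdefault(lifetime, bytearray()).extend(PAIR.pack(a, b))

    def end(self, lifetime):
        # Everything with this lifetime dies together: one bulk release.
        self.groups.pop(lifetime, None)

g = LifetimeGroups()
g.put("shuffle", 1, 10)
g.put("shuffle", 2, 20)
g.put("cache", 3, 30)
g.end("shuffle")   # both shuffle records are freed in one operation
```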