246 research outputs found

    The Unexpected Efficiency of Bin Packing Algorithms for Dynamic Storage Allocation in the Wild: An Intellectual Abstract

    Full text link
    Recent work has shown that viewing allocators as black-box 2DBP solvers bears meaning. For instance, there exists a 2DBP-based fragmentation metric which often correlates monotonically with maximum resident set size (RSS). Given the field's indeterminacy with respect to fragmentation definitions, as well as the immense value of physical memory savings, we are motivated to set allocator-generated placements against their 2DBP-devised, makespan-optimizing counterparts. Of course, allocators must operate online while 2DBP algorithms work on complete request traces; but since both sides optimize criteria related to minimizing memory wastage, the idea of studying their relationship preserves its intellectual--and practical--interest. Unfortunately no implementations of 2DBP algorithms for DSA are available. This paper presents a first, though partial, implementation of the state-of-the-art. We validate its functionality by comparing its outputs' makespan to the theoretical upper bound provided by the original authors. Along the way, we identify and document key details to assist analogous future efforts. Our experiments comprise 4 modern allocators and 8 real application workloads. We make several notable observations on our empirical evidence: in terms of makespan, allocators outperform Robson's worst-case lower bound 93.75%93.75\% of the time. In 87.5%87.5\% of cases, GNU's \texttt{malloc} implementation demonstrates equivalent or superior performance to the 2DBP state-of-the-art, despite the second operating offline. Most surprisingly, the 2DBP algorithm proves competent in terms of fragmentation, producing up to 2.462.46x better solutions. Future research can leverage such insights towards memory-targeting optimizations.Comment: 13 pages, 10 figures, 3 tables. To appear in ISMM '2

    Resource Aware GPU Scheduling in Kubernetes Infrastructure

    Get PDF
    Nowadays, there is an ever-increasing number of artificial intelligence inference workloads pushed and executed on the cloud. To effectively serve and manage the computational demands, data center operators have provisioned their infrastructures with accelerators. Specifically for GPUs, support for efficient management lacks, as state-of-the-art schedulers and orchestrators, threat GPUs only as typical compute resources ignoring their unique characteristics and application properties. This phenomenon combined with the GPU over-provisioning problem leads to severe resource under-utilization. Even though prior work has addressed this problem by colocating applications into a single accelerator device, its resource agnostic nature does not manage to face the resource under-utilization and quality of service violations especially for latency critical applications. In this paper, we design a resource aware GPU scheduling framework, able to efficiently colocate applications on the same GPU accelerator card. We integrate our solution with Kubernetes, one of the most widely used cloud orchestration frameworks. We show that our scheduler can achieve 58.8% lower end-to-end job execution time 99%-ile, while delivering 52.5% higher GPU memory usage, 105.9% higher GPU utilization percentage on average and 44.4% lower energy consumption on average, compared to the state-of-the-art schedulers, for a variety of ML representative workloads

    Dataflow acceleration of Smith-Waterman with Traceback for high throughput Next Generation Sequencing

    Get PDF
    Smith-Waterman algorithm is widely adopted bymost popular DNA sequence aligners. The inherent algorithmcomputational intensity and the vast amount of NGS input datait operates on, create a bottleneck in genomic analysis flows forshort-read alignment. FPGA architectures have been extensivelyleveraged to alleviate the problem, each one adopting a differentapproach. In existing solutions, effective co-design of the NGSshort-read alignment still remains an open issue, mainly due tonarrow view on real integration aspects, such as system widecommunication and accelerator call overheads. In this paper, wepropose a dataflow architecture for Smith-Waterman Matrix-filland Traceback alignment stages, to perform short-read alignmenton NGS data. The architectural decision of moving both stages onchip extinguishes the communication overhead, and coupled withradical software restructuring, allows for efficient integration intowidely-used Bowtie2 aligner. This approach deliversĂ—18 speedupover the respective Bowtie2 standalone components, while our co-designed Bowtie2 demonstrates a 35% boost in performance
    • …
    corecore