1,163 research outputs found

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    Quality of experience and access network traffic management of HTTP adaptive video streaming

    Get PDF
    The thesis focuses on Quality of Experience (QoE) of HTTP adaptive video streaming (HAS) and traffic management in access networks to improve the QoE of HAS. First, the QoE impact of adaptation parameters and time on layer was investigated with subjective crowdsourcing studies. The results were used to compute a QoE-optimal adaptation strategy for given video and network conditions. This allows video service providers to develop and benchmark improved adaptation logics for HAS. Furthermore, the thesis investigated concepts to monitor video QoE on application and network layer, which can be used by network providers in the QoE-aware traffic management cycle. Moreover, an analytic and simulative performance evaluation of QoE-aware traffic management on a bottleneck link was conducted. Finally, the thesis investigated socially-aware traffic management for HAS via Wi-Fi offloading of mobile HAS flows. A model for the distribution of public Wi-Fi hotspots and a platform for socially-aware traffic management on private home routers was presented. A simulative performance evaluation investigated the impact of Wi-Fi offloading on the QoE and energy consumption of mobile HAS.Die Doktorarbeit beschäftigt sich mit Quality of Experience (QoE) – der subjektiv empfundenen Dienstgüte – von adaptivem HTTP Videostreaming (HAS) und mit Verkehrsmanagement, das in Zugangsnetzwerken eingesetzt werden kann, um die QoE des adaptiven Videostreamings zu verbessern. Zuerst wurde der Einfluss von Adaptionsparameters und der Zeit pro Qualitätsstufe auf die QoE von adaptivem Videostreaming mittels subjektiver Crowdsourcingstudien untersucht. Die Ergebnisse wurden benutzt, um die QoE-optimale Adaptionsstrategie für gegebene Videos und Netzwerkbedingungen zu berechnen. Dies ermöglicht Dienstanbietern von Videostreaming verbesserte Adaptionsstrategien für adaptives Videostreaming zu entwerfen und zu benchmarken. Weiterhin untersuchte die Arbeit Konzepte zum Überwachen von QoE von Videostreaming in der Applikation und im Netzwerk, die von Netzwerkbetreibern im Kreislauf des QoE-bewussten Verkehrsmanagements eingesetzt werden können. Außerdem wurde eine analytische und simulative Leistungsbewertung von QoE-bewusstem Verkehrsmanagement auf einer Engpassverbindung durchgeführt. Schließlich untersuchte diese Arbeit sozialbewusstes Verkehrsmanagement für adaptives Videostreaming mittels WLAN Offloading, also dem Auslagern von mobilen Videoflüssen über WLAN Netzwerke. Es wurde ein Modell für die Verteilung von öffentlichen WLAN Zugangspunkte und eine Plattform für sozialbewusstes Verkehrsmanagement auf privaten, häuslichen WLAN Routern vorgestellt. Abschließend untersuchte eine simulative Leistungsbewertung den Einfluss von WLAN Offloading auf die QoE und den Energieverbrauch von mobilem adaptivem Videostreaming

    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    Get PDF
    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system, and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources are usually shared, such as the cache, to maximize utilization but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches.However, microarchitectural optimizations which were assumed to be fundamentally secure for a long time, can be used in side-channel attacks to exploit secrets, as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to be able to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes with three adaptive microarchitectural resource management optimizations to improve security and/or\ua0performance\ua0of multi-core architectures\ua0and a systematization-of-knowledge of timing-based side-channel attacks.\ua0We observe that to achieve high-performance cache partitioning in a multi-core system\ua0three requirements need to be met: i) fine-granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to\ua0high overheads for current centralized partitioning solutions, especially as the number of cores in the\ua0system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. The\ua0allocations occur through core-to-core challenges, where applications with larger performance benefit will gain cache capacity. The\ua0solution is implementable in hardware, due to low computational complexity, and can scale to large core counts.According to our analysis, better performance can be achieved by coordination of multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but is challenging due to the increased number of possible allocations which need to be evaluated.\ua0Based on these observations, we present a solution (CBP) for coordinated management of the optimizations: cache partitioning, bandwidth partitioning and prefetching.\ua0Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.The continuously growing number of\ua0side-channel attacks leveraging\ua0microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify the four root causes of timing-based side-channel attacks: determinism, sharing, access violation\ua0and information flow.\ua0Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection.\ua0Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.\ua0Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance.\ua0To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness,\ua0and provides quantifiable and information theoretic security guarantees using differential privacy. The solution closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications

    Critical Analysis of Solutions to Hadoop Small File Problem

    Get PDF
    Hadoop big data platform is designed to process large volume of data Small file problem is a performance bottleneck in Hadoop processing Small files lower than the block size of Hadoop creates huge storage overhead at Namenode s and also wastes computational resources due to spawning of many map tasks Various solutions like merging small files mapping multiple map threads to same java virtual machine instance etc have been proposed to solve the small file problems in Hadoop This survey does a critical analysis of existing works addressing small file problems in Hadoop and its variant platforms like Spark The aim is to understand their effectiveness in reducing the storage computational overhead and identify the open issues for further researc

    Application-centric bandwidth allocation in datacenters

    Get PDF
    Today's datacenters host a large number of concurrently executing applications with diverse intra-datacenter latency and bandwidth requirements. Some of these applications, such as data analytics, graph processing, and machine learning training, are data-intensive and require high bandwidth to function properly. However, these bandwidth-hungry applications can often congest the datacenter network, leading to queuing delays that hurt application completion time. To remove the network as a potential performance bottleneck, datacenter operators have begun deploying high-end HPC-grade networks like InfiniBand. These networks offer fully offloaded network stacks, remote direct memory access (RDMA) capability, and non-discarding links, which allow them to provide both low latency and high bandwidth for a single application. However, it is unclear how well such networks accommodate a mix of latency- and bandwidth-sensitive traffic in a real-world deployment. In this thesis, we aim to answer the above question. To do so, we develop RPerf, a latency measurement tool for RDMA-based networks that can precisely measure the InfiniBand switch latency without hardware support. Using RPerf, we benchmark a rack-scale InfiniBand cluster in both isolated and mixed-traffic scenarios. Our key finding is that the evaluated switch can provide either low latency or high bandwidth, but not both simultaneously in a mixed-traffic scenario. We also evaluate several options to improve the latency-bandwidth trade-off and demonstrate that none are ideal. We find that while queue separation is a solution to protect latency-sensitive applications, it fails to properly manage the bandwidth of other applications. We also aim to resolve the problem with bandwidth management for non-latency-sensitive applications. Previous efforts to address this problem have generally focused on achieving max-min fairness at the flow level. However, we observe that different workloads exhibit varying levels of sensitivity to network bandwidth. For some workloads, even a small reduction in available bandwidth can significantly increase completion time, while for others, completion time is largely insensitive to available network bandwidth. As a result, simply splitting the bandwidth equally among all workloads is sub-optimal for overall application-level performance. To address this issue, we first propose a robust methodology capable of effectively measuring the sensitivity of applications to bandwidth. We then design Saba, an application-aware bandwidth allocation framework that distributes network bandwidth based on application-level sensitivity. Saba combines ahead-of-time application profiling to determine bandwidth sensitivity with runtime bandwidth allocation using lightweight software support, with no modifications to network hardware or protocols. Experiments with a 32-server hardware testbed show that Saba can significantly increase overall performance by reducing the job completion time for bandwidth-sensitive jobs

    Victima: Drastically Increasing Address Translation Reach by Leveraging Underutilized Cache Resources

    Full text link
    Address translation is a performance bottleneck in data-intensive workloads due to large datasets and irregular access patterns that lead to frequent high-latency page table walks (PTWs). PTWs can be reduced by using (i) large hardware TLBs or (ii) large software-managed TLBs. Unfortunately, both solutions have significant drawbacks: increased access latency, power and area (for hardware TLBs), and costly memory accesses, the need for large contiguous memory blocks, and complex OS modifications (for software-managed TLBs). We present Victima, a new software-transparent mechanism that drastically increases the translation reach of the processor by leveraging the underutilized resources of the cache hierarchy. The key idea of Victima is to repurpose L2 cache blocks to store clusters of TLB entries, thereby providing an additional low-latency and high-capacity component that backs up the last-level TLB and thus reduces PTWs. Victima has two main components. First, a PTW cost predictor (PTW-CP) identifies costly-to-translate addresses based on the frequency and cost of the PTWs they lead to. Second, a TLB-aware cache replacement policy prioritizes keeping TLB entries in the cache hierarchy by considering (i) the translation pressure (e.g., last-level TLB miss rate) and (ii) the reuse characteristics of the TLB entries. Our evaluation results show that in native (virtualized) execution environments Victima improves average end-to-end application performance by 7.4% (28.7%) over the baseline four-level radix-tree-based page table design and by 6.2% (20.1%) over a state-of-the-art software-managed TLB, across 11 diverse data-intensive workloads. Victima (i) is effective in both native and virtualized environments, (ii) is completely transparent to application and system software, and (iii) incurs very small area and power overheads on a modern high-end CPU.Comment: To appear in 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 202

    Modular Collaborative Program Analysis

    Get PDF
    With our world increasingly relying on computers, it is important to ensure the quality, correctness, security, and performance of software systems. Static analysis that computes properties of computer programs without executing them has been an important method to achieve this for decades. However, static analysis faces major chal- lenges in increasingly complex programming languages and software systems and increasing and sometimes conflicting demands for soundness, precision, and scalability. In order to cope with these challenges, it is necessary to build static analyses for complex problems from small, independent, yet collaborating modules that can be developed in isolation and combined in a plug-and-play manner. So far, no generic architecture to implement and combine a broad range of dissimilar static analyses exists. The goal of this thesis is thus to design such an architecture and implement it as a generic framework for developing modular, collaborative static analyses. We use several, diverse case-study analyses from which we systematically derive requirements to guide the design of the framework. Based on this, we propose the use of a blackboard-architecture style collaboration of analyses that we implement in the OPAL framework. We also develop a formal model of our architectures core concepts and show how it enables freely composing analyses while retaining their soundness guarantees. We showcase and evaluate our architecture using the case-study analyses, each of which shows how important and complex problems of static analysis can be addressed using a modular, collaborative implementation style. In particular, we show how a modular architecture for the construction of call graphs ensures consistent soundness of different algorithms. We show how modular analyses for different aspects of immutability mutually benefit each other. Finally, we show how the analysis of method purity can benefit from the use of other complex analyses in a collaborative manner and from exchanging different analysis implementations that exhibit different characteristics. Each of these case studies improves over the respective state of the art in terms of soundness, precision, and/or scalability and shows how our architecture enables experimenting with and fine-tuning trade-offs between these qualities

    Hadoop-Oriented SVM-LRU (H-SVM-LRU): An Intelligent Cache Replacement Algorithm to Improve MapReduce Performance

    Full text link
    Modern applications can generate a large amount of data from different sources with high velocity, a combination that is difficult to store and process via traditional tools. Hadoop is one framework that is used for the parallel processing of a large amount of data in a distributed environment, however, various challenges can lead to poor performance. Two particular issues that can limit performance are the high access time for I/O operations and the recomputation of intermediate data. The combination of these two issues can result in resource wastage. In recent years, there have been attempts to overcome these problems by using caching mechanisms. Due to cache space limitations, it is crucial to use this space efficiently and avoid cache pollution (the cache contains data that is not used in the future). We propose Hadoop-oriented SVM-LRU (HSVM- LRU) to improve Hadoop performance. For this purpose, we use an intelligent cache replacement algorithm, SVM-LRU, that combines the well-known LRU mechanism with a machine learning algorithm, SVM, to classify cached data into two groups based on their future usage. Experimental results show a significant decrease in execution time as a result of an increased cache hit ratio, leading to a positive impact on Hadoop performance

    Geographic information extraction from texts

    Get PDF
    A large volume of unstructured texts, containing valuable geographic information, is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although large progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss the recent advances, new ideas, and concepts but also identify research gaps in geographic information extraction

    Overview of Caching Mechanisms to Improve Hadoop Performance

    Full text link
    Nowadays distributed computing environments, large amounts of data are generated from different resources with a high velocity, rendering the data difficult to capture, manage, and process within existing relational databases. Hadoop is a tool to store and process large datasets in a parallel manner across a cluster of machines in a distributed environment. Hadoop brings many benefits like flexibility, scalability, and high fault tolerance; however, it faces some challenges in terms of data access time, I/O operation, and duplicate computations resulting in extra overhead, resource wastage, and poor performance. Many researchers have utilized caching mechanisms to tackle these challenges. For example, they have presented approaches to improve data access time, enhance data locality rate, remove repetitive calculations, reduce the number of I/O operations, decrease the job execution time, and increase resource efficiency. In the current study, we provide a comprehensive overview of caching strategies to improve Hadoop performance. Additionally, a novel classification is introduced based on cache utilization. Using this classification, we analyze the impact on Hadoop performance and discuss the advantages and disadvantages of each group. Finally, a novel hybrid approach called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental results show that our hybrid method achieves an average improvement of 31.2% in job execution time
    corecore