9,828 research outputs found

    EOS: Automatic In-vivo Evolution of Kernel Policies for Better Performance

    Today's monolithic kernels often implement a small, fixed set of policies such as disk I/O scheduling policies, while exposing many parameters to let users select a policy or adjust the specific setting of the policy. Ideally, the parameters exposed should be flexible enough for users to tune for good performance, but in practice, users lack domain knowledge of the parameters and are often stuck with bad, default parameter settings. We present EOS, a system that bridges the knowledge gap between kernel developers and users by automatically evolving the policies and parameters in vivo on users' real, production workloads. It provides a simple policy specification API for kernel developers to programmatically describe how the policies and parameters should be tuned, a policy cache to make in-vivo tuning easy and fast by memorizing good parameter settings for past workloads, and a hierarchical search engine to effectively search the parameter space. Evaluation of EOS on four main Linux subsystems shows that it is easy to use and effectively improves each subsystem's performance. Comment: 14 pages, technical report
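
    The policy-cache idea above lends itself to a compact illustration. Below is a minimal sketch, not the EOS implementation, of a cache that memorizes the best-known parameter setting per workload signature so that a previously seen workload skips a fresh search; the names (PolicyCache, tune, benchmark) and the flat candidate list standing in for EOS's hierarchical search engine are assumptions for illustration.

```python
# Hedged sketch (not the EOS implementation): a toy policy cache that memorizes
# good parameter settings per workload signature.
from dataclasses import dataclass, field


@dataclass
class PolicyCache:
    # Maps a workload signature (e.g. read/write mix, request-size bucket)
    # to the best-known parameter setting and its measured score.
    entries: dict = field(default_factory=dict)

    def lookup(self, signature):
        return self.entries.get(signature)

    def update(self, signature, params, score):
        best = self.entries.get(signature)
        if best is None or score > best[1]:
            self.entries[signature] = (params, score)


def tune(signature, candidates, benchmark, cache):
    """Return a parameter setting for this workload, reusing the cache when
    possible and otherwise searching the candidate list."""
    hit = cache.lookup(signature)
    if hit is not None:
        return hit[0]
    best_params, best_score = None, float("-inf")
    for params in candidates:       # stand-in for a hierarchical search
        score = benchmark(params)   # in vivo: measure the real workload
        if score > best_score:
            best_params, best_score = params, score
    cache.update(signature, best_params, best_score)
    return best_params
```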

    Technical Report: KNN Joins Using a Hybrid Approach: Exploiting CPU/GPU Workload Characteristics

    This paper studies finding the K nearest neighbors (KNN) of all points in a dataset. Typical solutions to KNN searches use indexing to prune the search, which reduces the number of candidate points that may be within the set of the nearest K points of each query point. In high dimensionality, index searches degrade, making the KNN self-join a prohibitively expensive operation in some scenarios. Furthermore, there are a significant number of distance calculations needed to determine which points are nearest to each query point. To address these challenges, we propose a hybrid CPU/GPU approach. Since the CPU and GPU are considerably different architectures that are best exploited using different algorithms, we advocate for splitting the work between both architectures based on the characteristic workloads defined by the query points in the dataset. As such, we assign dense regions to the GPU, and sparse regions to the CPU to most efficiently exploit the relative strengths of each architecture. Critically, we find that the relative performance gains over the reference implementation across four real-world datasets are a function of the data properties (size, dimensionality, distribution), and number of neighbors, K. Comment: 30 pages, 10 figures, 6 tables
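
    A minimal sketch of the work-splitting idea follows, not the paper's implementation: query points are routed to the GPU or CPU based on a coarse density estimate of their neighborhood. The grid-based density estimate and the density_threshold parameter are illustrative assumptions.

```python
# Hedged sketch: estimate the density around each query point via a coarse
# grid, then route dense regions to the GPU and sparse regions to the CPU.
import numpy as np


def split_queries(points, cell_width, density_threshold):
    """Return (gpu_idx, cpu_idx): indices of points in dense vs. sparse cells."""
    # Bin points into a coarse grid and count occupancy per cell.
    cells = np.floor(points / cell_width).astype(np.int64)
    _, inverse, counts = np.unique(cells, axis=0, return_inverse=True,
                                   return_counts=True)
    density = counts[np.ravel(inverse)]     # occupancy of each point's cell
    gpu_idx = np.where(density >= density_threshold)[0]
    cpu_idx = np.where(density < density_threshold)[0]
    return gpu_idx, cpu_idx


if __name__ == "__main__":
    pts = np.random.rand(10_000, 3)
    gpu_idx, cpu_idx = split_queries(pts, cell_width=0.05, density_threshold=8)
    print(len(gpu_idx), "points routed to the GPU,", len(cpu_idx), "to the CPU")
```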

    GPU Accelerated Self-join for the Distance Similarity Metric

    The self-join finds all objects in a dataset within a threshold of each other defined by a similarity metric. As such, the self-join is a building block for the field of databases and data mining, and is employed in Big Data applications. In this paper, we advance a GPU-efficient algorithm for the similarity self-join that uses the Euclidean distance metric. The search-and-refine strategy is an efficient approach for low dimensionality datasets, as index searches degrade with increasing dimension (i.e., the curse of dimensionality). Thus, we target the low dimensionality problem, and compare our GPU self-join to a search-and-refine implementation, and a state-of-the-art parallel algorithm. In low dimensionality, there are several unique challenges associated with efficiently solving the self-join problem on the GPU. Low dimensional data often results in higher data densities, causing a significant number of distance calculations and a large result set. As dimensionality increases, index searches become increasingly exhaustive, forming a performance bottleneck. We advance several techniques to overcome these challenges using the GPU. The techniques we propose include a GPU-efficient index that employs a bounded search, a batching scheme to accommodate large result set sizes, and a reduction in distance calculations through duplicate search removal. Our GPU self-join outperforms both search-and-refine and state-of-the-art algorithms. Comment: Accepted for Publication in the 4th IEEE International Workshop on High-Performance Big Data, Deep Learning, and Cloud Computing. To appear in the Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium Workshops
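
    The following sketch illustrates the bounded-search and duplicate-removal ideas on the CPU in plain Python, not the paper's GPU kernels: points are binned into epsilon-wide cells so each point only searches adjacent cells, and duplicate pairs are avoided by keeping only (i, j) with i < j. The cell layout and function names are assumptions.

```python
# Hedged sketch of a grid-index self-join on the CPU (not the GPU algorithm
# from the paper).
from collections import defaultdict
from itertools import product
import numpy as np


def self_join(points, eps):
    """Return all index pairs (i, j), i < j, with ||p_i - p_j|| <= eps."""
    cells = np.floor(points / eps).astype(np.int64)
    grid = defaultdict(list)
    for i, c in enumerate(map(tuple, cells)):
        grid[c].append(i)

    dim = points.shape[1]
    offsets = list(product((-1, 0, 1), repeat=dim))  # bounded neighborhood
    pairs = []
    for i, c in enumerate(map(tuple, cells)):
        for off in offsets:
            for j in grid.get(tuple(np.add(c, off)), ()):
                # Duplicate search removal: keep each pair once (i < j).
                if j > i and np.linalg.norm(points[i] - points[j]) <= eps:
                    pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    print(len(self_join(np.random.rand(2_000, 2), eps=0.02)), "pairs found")
```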

    Application-Driven Near-Data Processing for Similarity Search

    Similarity search is a key to a variety of applications including content-based search for images and video, recommendation systems, data deduplication, natural language processing, computer vision, databases, computational biology, and computer graphics. At its core, similarity search manifests as k-nearest neighbors (kNN), a computationally simple primitive consisting of highly parallel distance calculations and a global top-k sort. However, kNN is poorly supported by today's architectures because of its high memory bandwidth requirements. This paper proposes an application-driven near-data processing accelerator for similarity search: the Similarity Search Associative Memory (SSAM). By instantiating compute units close to memory, SSAM benefits from the higher memory bandwidth and density exposed by emerging memory technologies. We evaluate the SSAM design down to layout on top of the Micron hybrid memory cube (HMC), and show that SSAM can achieve up to two orders of magnitude improvement in area-normalized throughput and energy efficiency over multicore CPUs; we also show SSAM is faster and more energy efficient than competing GPUs and FPGAs. Finally, we show that SSAM is also useful for other data intensive tasks like kNN index construction, and can be generalized to semantically function as a high capacity content addressable memory. Comment: 15 pages, 8 figures, 7 tables
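
    A minimal sketch of the kNN primitive the accelerator targets, expressed as plain NumPy rather than SSAM hardware: highly parallel pairwise distance calculations followed by a top-k selection per query. The Euclidean metric and function names are assumptions for illustration.

```python
# Hedged sketch of the kNN primitive: distance calculations plus top-k.
import numpy as np


def knn(queries, database, k):
    """Return the indices of the k nearest database vectors for each query."""
    # Pairwise squared Euclidean distances: the highly parallel part.
    d2 = (np.sum(queries**2, axis=1)[:, None]
          + np.sum(database**2, axis=1)[None, :]
          - 2.0 * queries @ database.T)
    # Top-k per query: argpartition avoids a full global sort.
    return np.argpartition(d2, k, axis=1)[:, :k]


if __name__ == "__main__":
    q, db = np.random.rand(16, 128), np.random.rand(10_000, 128)
    print(knn(q, db, k=5).shape)  # (16, 5)
```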

    HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud

    Eliminating duplicate data in the primary storage of clouds increases the cost-efficiency of cloud service providers as well as reduces the cost to users of using cloud services. Existing primary deduplication techniques either use inline caching to exploit locality in primary workloads or use post-processing deduplication running in system idle time to avoid the negative impact on I/O performance. However, neither of them works well in cloud servers running multiple services or applications, for the following two reasons. Firstly, the temporal locality of duplicate data writes may not exist in some primary storage workloads, so inline caching often fails to achieve a good deduplication ratio. Secondly, post-processing deduplication allows duplicate data to be written to disk, and therefore provides no I/O deduplication benefit and requires high peak storage capacity. This paper presents HPDedup, a Hybrid Prioritized data Deduplication mechanism that targets storage systems shared by applications running in co-located virtual machines or containers, fusing an inline phase and a post-processing phase to achieve exact deduplication. In the inline deduplication phase, HPDedup provides a fingerprint caching mechanism that estimates the temporal locality of duplicates in the data streams from different VMs or applications and prioritizes cache allocation across these streams based on the estimate. HPDedup also allows different deduplication thresholds for streams based on their spatial locality to reduce disk fragmentation. The post-processing phase removes from disk those duplicates whose fingerprints could not be cached due to weak temporal locality. Our experimental results show that HPDedup clearly outperforms state-of-the-art primary storage deduplication techniques in terms of inline cache efficiency and primary deduplication efficiency. Comment: 14 pages, 11 figures, submitted to MSST201
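
    A minimal sketch of an inline fingerprint cache with per-stream prioritization follows; it is not HPDedup's implementation. Each stream receives a cache share proportional to its estimated duplicate locality, and fingerprints that miss the cache are left for a post-processing pass. Class and function names (StreamCache, allocate_cache) are assumptions.

```python
# Hedged sketch: per-stream LRU fingerprint caches sized by estimated locality.
import hashlib
from collections import OrderedDict


class StreamCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.fingerprints = OrderedDict()   # fingerprint -> block address (LRU)

    def contains(self, fp):
        if fp in self.fingerprints:
            self.fingerprints.move_to_end(fp)   # refresh LRU position
            return True
        return False

    def insert(self, fp, address):
        self.fingerprints[fp] = address
        self.fingerprints.move_to_end(fp)
        while len(self.fingerprints) > self.capacity:
            self.fingerprints.popitem(last=False)   # evict the oldest entry


def allocate_cache(total_slots, locality_estimates):
    """Split cache slots across streams in proportion to estimated locality."""
    total = sum(locality_estimates.values()) or 1.0
    return {stream: max(1, int(total_slots * score / total))
            for stream, score in locality_estimates.items()}


def fingerprint(block_bytes):
    """Content fingerprint of a data block (SHA-1 used here for illustration)."""
    return hashlib.sha1(block_bytes).hexdigest()
```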

    Cuttlefish: A Lightweight Primitive for Adaptive Query Processing

    Modern data processing applications execute increasingly sophisticated analysis that requires operations beyond traditional relational algebra. As a result, operators in query plans grow in diversity and complexity. Designing query optimizer rules and cost models to choose physical operators for all of these novel logical operators is impractical. To address this challenge, we develop Cuttlefish, a new primitive for adaptively processing online query plans that explores candidate physical operator instances during query execution and exploits the fastest ones using multi-armed bandit reinforcement learning techniques. We prototype Cuttlefish in Apache Spark and adaptively choose operators for image convolution, regular expression matching, and relational joins. Our experiments show Cuttlefish-based adaptive convolution and regular expression operators can reach 72-99% of the throughput of an all-knowing oracle that always selects the optimal algorithm, even when individual physical operators are up to 105x slower than the optimal. Additionally, Cuttlefish achieves join throughput improvements of up to 7.5x compared with Spark SQL's query optimizer.
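
    The bandit-based explore/exploit loop that Cuttlefish builds on can be sketched compactly. Below is a minimal epsilon-greedy version, not Cuttlefish's actual tuner, which explores candidate physical operators and then exploits the one with the best observed throughput; the class name and epsilon value are assumptions.

```python
# Hedged sketch of multi-armed-bandit operator selection (epsilon-greedy).
import random


class OperatorBandit:
    def __init__(self, operators, epsilon=0.1):
        self.operators = operators            # candidate physical operators
        self.epsilon = epsilon
        self.counts = {op: 0 for op in operators}
        self.mean_throughput = {op: 0.0 for op in operators}

    def choose(self):
        # Explore while any operator is untried or with probability epsilon.
        if random.random() < self.epsilon or not all(self.counts.values()):
            return random.choice(self.operators)
        # Exploit: pick the operator with the best observed mean throughput.
        return max(self.operators, key=self.mean_throughput.get)

    def record(self, op, throughput):
        self.counts[op] += 1
        n = self.counts[op]
        # Incremental mean update of observed throughput for this operator.
        self.mean_throughput[op] += (throughput - self.mean_throughput[op]) / n
```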

    Analyzes of the Distributed System Load with Multifractal Input Data Flows

    The paper proposes a solution to the practical problem of load balancing and efficient resource utilization in a distributed system. The proposed method is based on calculating the CPU, memory, and bandwidth load imposed by flows of different service classes, for each server and for the distributed system as a whole, while taking into account the multifractal properties of the input data flows. Weighting factors are introduced to express the relative significance of the server characteristics. The method thus allows the imbalance across all system servers and the overall system utilization to be calculated. Simulations of the proposed method for different multifractal parameters of the input flows showed that the characteristics of multifractal traffic have an appreciable effect on system imbalance. Using the proposed method, requests can be distributed across the servers so that the deviation of each server's load from the average is minimal, which yields higher system performance metrics and faster flow processing. Comment: 5 pages
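
    A minimal sketch of a weighted imbalance metric in the spirit of the method, not its exact formulas: per-server CPU, memory, and bandwidth utilizations are combined with weighting factors, and imbalance is measured as the deviation of each server's load from the system average. The weights and the deviation measure are illustrative assumptions.

```python
# Hedged sketch of a weighted server-load and imbalance calculation.

def server_load(cpu, mem, bw, weights=(0.5, 0.3, 0.2)):
    """Weighted scalar load of one server from its resource utilizations (0..1)."""
    w_cpu, w_mem, w_bw = weights
    return w_cpu * cpu + w_mem * mem + w_bw * bw


def imbalance(servers, weights=(0.5, 0.3, 0.2)):
    """Mean absolute deviation of per-server load from the system average."""
    loads = [server_load(*s, weights=weights) for s in servers]
    avg = sum(loads) / len(loads)
    return sum(abs(load - avg) for load in loads) / len(loads)


if __name__ == "__main__":
    # (cpu, memory, bandwidth) utilization per server
    print(imbalance([(0.8, 0.6, 0.4), (0.3, 0.5, 0.7), (0.5, 0.5, 0.5)]))
```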

    Machine Learning in Compiler Optimisation

    In the last decade, machine learning based compilation has moved from an obscure research niche to a mainstream activity. In this article, we describe the relationship between machine learning and compiler optimisation and introduce the main concepts of features, models, training and deployment. We then provide a comprehensive survey and a road map for the wide variety of different research areas. We conclude with a discussion on open issues in the area and potential research directions. This paper provides both an accessible introduction to the fast moving area of machine learning based compilation and a detailed bibliography of its main achievements. Comment: Accepted to be published at Proceedings of the IEEE
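
    The features/models/training/deployment pipeline the article introduces can be illustrated with a toy example. The sketch below, which is not taken from the article, trains a small scikit-learn classifier on static loop features to decide whether to unroll a loop; the feature set and the placeholder labels are assumptions for illustration only.

```python
# Hedged sketch of the features -> model -> training -> deployment pipeline.
from sklearn.tree import DecisionTreeClassifier

# Static program features: (loop trip count, body size in IR ops, has_branch).
features = [
    [1000, 4, 0],
    [8,   40, 1],
    [500,  6, 0],
    [16,  80, 1],
]
# Placeholder labels (1 = unrolling was profitable), for illustration only.
labels = [1, 0, 1, 0]

# Training: fit a simple model offline on the collected observations.
model = DecisionTreeClassifier(max_depth=2).fit(features, labels)

# Deployment: the compiler queries the trained model for a new loop.
new_loop = [[256, 8, 0]]
print("unroll" if model.predict(new_loop)[0] else "do not unroll")
```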

    A Comparative Exploration of ML Techniques for Tuning Query Degree of Parallelism

    There is a large body of recent work applying machine learning (ML) techniques to query optimization and query performance prediction in relational database management systems (RDBMSs). However, these works typically ignore the effect of intra-parallelism -- a key component used to boost the performance of OLAP queries in practice -- on query performance prediction. In this paper, we take a first step towards filling this gap by studying the problem of tuning the degree of parallelism (DOP) via ML techniques in Microsoft SQL Server, a popular commercial RDBMS that allows an individual query to execute using multiple cores. In our study, we cast the problem of DOP tuning as a regression task, and examine how several popular ML models can help with query performance prediction in a multi-core setting. We explore the design space and perform an extensive experimental study comparing different models against a list of performance metrics, testing how well they generalize in different settings: (i) to queries from the same template, (ii) to queries from a new template, (iii) to instances of different scale, and (iv) to different instances and queries. Our experimental results show that a simple featurization of the input query plan that ignores cost model estimations can accurately predict query performance, capture the speedup trend with respect to the available parallelism, and help with automatically choosing an optimal per-query DOP.
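
    The regression formulation can be sketched briefly. The example below, which is not the paper's feature set or model, trains a regressor on plan features plus a DOP value to predict latency and then picks the candidate DOP with the lowest prediction; the RandomForestRegressor choice and the placeholder training data are assumptions.

```python
# Hedged sketch of regression-based DOP selection.
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def choose_dop(model, plan_features, candidate_dops):
    """Return the DOP with the lowest predicted latency for this plan."""
    rows = np.array([plan_features + [dop] for dop in candidate_dops])
    predictions = model.predict(rows)
    return candidate_dops[int(np.argmin(predictions))]


# Training: rows are [plan features..., DOP], targets are observed latencies.
# Placeholder random data stands in for real training observations.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.random(200)
model = RandomForestRegressor(n_estimators=50).fit(X, y)

print(choose_dop(model, plan_features=[0.2, 0.4, 0.1, 0.9],
                 candidate_dops=[1, 2, 4, 8]))
```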

    GRACOS: Scalable and Load Balanced P3M Cosmological N-body Code

    We present a parallel implementation of the particle-particle/particle-mesh (P3M) algorithm for distributed memory clusters. The GRACOS (GRAvitational COSmology) code uses a hybrid method for both computation and domain decomposition. Long-range forces are computed using a Fourier transform gravity solver on a regular mesh; the mesh is distributed across parallel processes using a static one-dimensional slab domain decomposition. Short-range forces are computed by direct summation of close pairs; particles are distributed using a dynamic domain decomposition based on a space-filling Hilbert curve. A nearly-optimal method was devised to dynamically repartition the particle distribution so as to maintain load balance even for extremely inhomogeneous mass distributions. Tests using 800^3 simulations on a 40-processor Beowulf cluster showed good load balance and scalability up to 80 processes. We discuss the limits on scalability imposed by communication and extreme clustering and suggest how they may be removed by extending our algorithm to include adaptive mesh refinement. Comment: to be submitted to ApJ.
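
    The dynamic repartitioning idea can be sketched as ordering particles along a space-filling curve and cutting that ordering into contiguous chunks of near-equal work. The sketch below is not GRACOS itself and substitutes a Morton (Z-order) key for the Hilbert curve used in the paper; treating each particle as one unit of work is also an illustrative assumption.

```python
# Hedged sketch of space-filling-curve repartitioning with a Morton key.
import numpy as np


def morton_key(cells, bits=10):
    """Interleave the bits of integer 3D cell coordinates into one key."""
    key = np.zeros(len(cells), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            key |= ((cells[:, axis].astype(np.uint64) >> b) & 1) << (3 * b + axis)
    return key


def repartition(positions, n_procs, box_size=1.0, bits=10):
    """Assign each particle (positions in [0, box_size)) to a process by
    sorting along the curve and cutting into chunks of near-equal count."""
    cells = np.floor(positions / box_size * (1 << bits)).astype(np.int64)
    cells = np.clip(cells, 0, (1 << bits) - 1)
    order = np.argsort(morton_key(cells))
    owner = np.empty(len(positions), dtype=np.int64)
    owner[order] = np.arange(len(positions)) * n_procs // len(positions)
    return owner


if __name__ == "__main__":
    pos = np.random.rand(100_000, 3)
    print(np.bincount(repartition(pos, n_procs=8)))  # near-equal chunk sizes
```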