66,300 research outputs found

    Pay One, Get Hundreds for Free: Reducing Cloud Costs through Shared Query Execution

    Full text link
    Cloud-based data analysis is nowadays common practice because of the lower system management overhead as well as the pay-as-you-go pricing model. The pricing model, however, is not always suitable for query processing as heavy use results in high costs. For example, in query-as-a-service systems, where users are charged per processed byte, collections of queries accessing the same data frequently can become expensive. The problem is compounded by the limited options for the user to optimize query execution when using declarative interfaces such as SQL. In this paper, we show how, without modifying existing systems and without the involvement of the cloud provider, it is possible to significantly reduce the overhead, and hence the cost, of query-as-a-service systems. Our approach is based on query rewriting so that multiple concurrent queries are combined into a single query. Our experiments show the aggregated amount of work done by the shared execution is smaller than in a query-at-a-time approach. Since queries are charged per byte processed, the cost of executing a group of queries is often the same as executing a single one of them. As an example, we demonstrate how the shared execution of the TPC-H benchmark is up to 100x and 16x cheaper in Amazon Athena and Google BigQuery than using a query-at-a-time approach while achieving a higher throughput

    A Novel Workload Allocation Strategy for Batch Jobs

    Get PDF
    The distribution of computational tasks across a diverse set of geographically distributed heterogeneous resources is a critical issue in the realisation of true computational grids. Conventionally, workload allocation algorithms are divided into static and dynamic approaches. Whilst dynamic approaches frequently outperform static schemes, they usually require the collection and processing of detailed system information at frequent intervals - a task that can be both time consuming and unreliable in the real-world. This paper introduces a novel workload allocation algorithm for optimally distributing the workload produced by the arrival of batches of jobs. Results show that, for the arrival of batches of jobs, this workload allocation algorithm outperforms other commonly used algorithms in the static case. A hybrid scheduling approach (using this workload allocation algorithm), where information about the speed of computational resources is inferred from previously completed jobs, is then introduced and the efficiency of this approach demonstrated using a real world computational grid. These results are compared to the same workload allocation algorithm used in the static case and it can be seen that this hybrid approach comprehensively outperforms the static approach

    Many Task Learning with Task Routing

    Full text link
    Typical multi-task learning (MTL) methods rely on architectural adjustments and a large trainable parameter set to jointly optimize over several tasks. However, when the number of tasks increases so do the complexity of the architectural adjustments and resource requirements. In this paper, we introduce a method which applies a conditional feature-wise transformation over the convolutional activations that enables a model to successfully perform a large number of tasks. To distinguish from regular MTL, we introduce Many Task Learning (MaTL) as a special case of MTL where more than 20 tasks are performed by a single model. Our method dubbed Task Routing (TR) is encapsulated in a layer we call the Task Routing Layer (TRL), which applied in an MaTL scenario successfully fits hundreds of classification tasks in one model. We evaluate our method on 5 datasets against strong baselines and state-of-the-art approaches.Comment: 8 Pages, 5 Figures, 2 Table

    Distributed learning of CNNs on heterogeneous CPU/GPU architectures

    Get PDF
    Convolutional Neural Networks (CNNs) have shown to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception, and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs, leading to longer training times that not even the adoption of Graphics Processing Units (GPUs) could keep up to. This problem is partially solved by using more processing units and distributed training methods that are offered by several frameworks dedicated to neural network training. However, these techniques do not take full advantage of the possible parallelization offered by CNNs and the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, memory size, among others. This paper presents a new method for the parallel training of CNNs that can be considered as a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent from 6060-9090\% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices, including their processing capabilities, and other parameters. Results show that this technique is capable of diminishing the training time without affecting the classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers, and 500500 and 15001500 kernels, respectively, best speedups achieve 3.28×3.28\times using four CPUs and 2.45×2.45\times with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10 will certainly require more than 6060-9090\% of processing time calculating convolutions, and speedups will tend to increase accordingly
    • …
    corecore