    Speculative Approximations for Terascale Analytics

    Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. We argue that the inability to evaluate multiple parameter configurations simultaneously and the lack of support for quickly identifying sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are drawn from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to stop the execution accurately and promptly. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as 1/20th of the time.

    Optimal Pricing to Manage Electric Vehicles in Coupled Power and Transportation Networks

    We study the system-level effects of the introduction of large populations of Electric Vehicles (EVs) on the power and transportation networks. We assume that each EV owner solves a decision problem to pick a cost-minimizing charge and travel plan. This individual decision takes into account traffic congestion in the transportation network, which affects travel times, as well as congestion in the power grid, which results in spatial variations in electricity prices for battery charging. We show that this decision problem is equivalent to finding the shortest path on an "extended" transportation graph, with virtual arcs that represent charging options. Using this extended graph, we study the collective effects of a large number of EV owners individually solving this path planning problem. We propose a scheme in which independent power and transportation system operators can collaborate to manage each network towards a socially optimal operating point while keeping the operational data of each system private. We further study the optimal reserve capacity requirements for pricing in the absence of such collaboration. We show numerically that a lack of attention to interdependencies between the two infrastructures can have adverse operational effects. Comment: Submitted to IEEE Transactions on Control of Network Systems on June 1st 201
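The "extended graph" idea in the abstract above can be illustrated with a small sketch. This is not the paper's formulation: it assumes a discretized battery level, a hypothetical `roads` map of travel arcs, and a hypothetical `charge_price` map giving the cost of one unit of charge at each station. Each state is a (location, charge) pair, and a charging option becomes a virtual arc from (location, charge) to (location, charge + 1), so an ordinary Dijkstra search returns the cheapest combined charge and travel plan.

```python
import heapq

def cheapest_plan(roads, charge_price, start, goal, battery, capacity):
    """Minimum total cost from start to goal on an extended (node, charge) graph.

    roads: {u: [(v, travel_cost, energy_units), ...]}   (hypothetical input format)
    charge_price: {node: cost of one unit of charge}    (hypothetical input format)
    Returns the minimum cost, or None if the goal cannot be reached.
    """
    best = {(start, battery): 0.0}
    heap = [(0.0, start, battery)]
    while heap:
        cost, node, charge = heapq.heappop(heap)
        if node == goal:
            return cost
        if cost > best.get((node, charge), float("inf")):
            continue  # stale heap entry
        arcs = []
        # Travel arcs: usable only if the battery holds enough energy.
        for nxt, travel_cost, energy in roads.get(node, ()):
            if energy <= charge:
                arcs.append((nxt, charge - energy, travel_cost))
        # Virtual charging arc: stay in place, gain one unit of charge.
        if node in charge_price and charge < capacity:
            arcs.append((node, charge + 1, charge_price[node]))
        for nxt, new_charge, weight in arcs:
            new_cost = cost + weight
            if new_cost < best.get((nxt, new_charge), float("inf")):
                best[(nxt, new_charge)] = new_cost
                heapq.heappush(heap, (new_cost, nxt, new_charge))
    return None

roads = {"A": [("B", 1.0, 2)], "B": [("C", 1.0, 2)]}
plan_cost = cheapest_plan(roads, {"B": 0.5}, "A", "C", battery=2, capacity=4)
no_station = cheapest_plan(roads, {}, "A", "C", battery=2, capacity=4)
```

In this toy instance the vehicle must buy two units of charge at B before it can continue to C. Congestion-dependent travel times and prices, which the abstract treats as coupled across many EV owners, are fixed constants here.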

    Aggregate Analytic Window Query over Spatial Data

    The analytic window query is commonly used in relational databases. It computes aggregates of data over a sliding window, for example the average price of a stock for each day. However, it is not supported in spatial databases. Because spatial data do not lie in a one-dimensional space, there is no straightforward way to extend the original analytic window query to spatial databases. Yet such queries are useful and meaningful, for example finding the average number of visits over all POIs that fall within a fixed-radius circle centred on each POI. In this paper, we define the aggregate analytic window query over spatial data and propose algorithms for grid indexes and tree indexes. We also analyze the complexity of the algorithms to show that they are efficient and practical.
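The POI example in the abstract above can be sketched with a uniform grid index. This is only an illustration, not the paper's algorithm: the function name, the `(x, y, visits)` tuple layout, and the choice of a cell side equal to the query radius are all assumptions. With that cell size, every point within the radius of a POI lies in the POI's own cell or one of its eight neighbours, so only nine cells need to be scanned per window.

```python
from collections import defaultdict
from math import floor

def windowed_average(pois, radius):
    """For each POI, average the visits of all POIs within `radius` of it.

    pois: list of (x, y, visits) tuples (hypothetical layout); radius must be > 0.
    """
    # Build the grid: cell side equals the radius (an assumption of this sketch).
    grid = defaultdict(list)
    for i, (x, y, _) in enumerate(pois):
        grid[(floor(x / radius), floor(y / radius))].append(i)

    r2 = radius * radius
    averages = []
    for x, y, _ in pois:
        cx, cy = floor(x / radius), floor(y / radius)
        total, count = 0.0, 0
        # Scan the 3x3 block of cells around the POI's own cell.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    px, py, visits = pois[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= r2:
                        total += visits
                        count += 1
        averages.append(total / count)  # count >= 1: the POI itself is included
    return averages

sample = [(0, 0, 10), (1, 0, 20), (10, 10, 30)]
averages = windowed_average(sample, radius=2)
```

Each window includes its own centre POI, which is one possible reading of the query definition.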

    Pulling Out All the Tops with Computer Vision and Deep Learning

    We apply computer vision with deep learning -- in the form of a convolutional neural network (CNN) -- to build a highly effective boosted top tagger. Previous work (the "DeepTop" tagger of Kasieczka et al.) has shown that a CNN-based top tagger can achieve comparable performance to state-of-the-art conventional top taggers based on high-level inputs. Here, we introduce a number of improvements to the DeepTop tagger, including architecture, training, image preprocessing, sample size and color pixels. Our final CNN top tagger outperforms BDTs based on high-level inputs by a factor of ∼2--3 or more in background rejection, over a wide range of tagging efficiencies and fiducial jet selections. As reference points, we achieve a QCD background rejection factor of 500 (60) at 50% top tagging efficiency for fully-merged (non-merged) top jets with p_T in the 800--900 GeV (350--450 GeV) range. Our CNN can also be straightforwardly extended to the classification of other types of jets, and the lessons learned here may be useful to others designing their own deep NNs for LHC applications. Comment: 33 pages, 11 figures

    Improving Performance of Spatial Network Queries

    Spatial network queries, for example KNN or range, operate on systems where objects are constrained to locations on a network. Current spatial network query algorithms rely on forms of network traversal whose complexity grows with the size of the network, making them poor for large real-world networks. In this thesis, an alternative method of approximating the results of spatial network queries with a high level of accuracy is introduced. Distances between network points are stored in an M-Tree index, a balanced tree index where metric distance determines data ordering. The M-Tree uses the chessboard metric on network points embedded in a higher dimensional space using tRNE. Using the M-Tree, both KNN and range queries are computed more efficiently than with network traversal. Error rates of the M-Tree are low, with accuracies of 97% possible on KNN queries and perfect accuracy with 2% extra results on range queries.
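To make the chessboard metric in the abstract above concrete, here is a minimal sketch. The chessboard (Chebyshev) distance between two embedded points is the largest absolute difference over their coordinates. The sketch replaces the M-Tree with a plain linear scan, and the coordinates are made-up stand-ins for an embedding, so it shows only what the KNN and range queries compute, not how the index speeds them up.

```python
import heapq

def chessboard(a, b):
    """Chessboard (Chebyshev) distance: largest per-coordinate difference."""
    return max(abs(p - q) for p, q in zip(a, b))

def knn(points, query, k):
    """The k points closest to `query` (linear scan, standing in for the M-Tree)."""
    return heapq.nsmallest(k, points, key=lambda p: chessboard(p, query))

def range_query(points, query, radius):
    """All points within `radius` of `query` under the chessboard metric."""
    return [p for p in points if chessboard(p, query) <= radius]

# Hypothetical embedded coordinates, for illustration only.
embedded = [(0, 0, 0), (1, 5, 2), (3, 1, 1), (6, 6, 6)]
nearest = knn(embedded, (0, 0, 0), k=2)
within = range_query(embedded, (0, 0, 0), radius=5)
```

Because the chessboard distance satisfies the triangle inequality, it is a valid metric for a metric-tree index, which can then prune subtrees instead of scanning every point as this sketch does.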