Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to accurately and timely stop the execution. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time.
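
The early-halting idea in this abstract can be sketched in a few lines. This is a minimal illustration, not the GLADE PF-OLA implementation: it assumes a one-dimensional logistic model, a normal-approximation confidence interval, and hypothetical helper names (`make_data`, `logistic_loss`, `prune_configs`). Each sampled example is shared by all surviving configurations, and a configuration is dropped once its loss interval lies entirely above that of the current best one.

```python
import math
import random


def make_data(n, true_w, seed=0):
    """Synthetic 1-D logistic data (an assumption made for this sketch)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        p = 1.0 / (1.0 + math.exp(-true_w * x))
        data.append((x, 1 if rng.random() < p else -1))
    return data


def logistic_loss(w, x, y):
    """Logistic loss of weight w on example (x, y), with y in {-1, +1}."""
    return math.log1p(math.exp(-y * w * x))


def prune_configs(configs, data, batch=200, z=1.96):
    """Evaluate all configurations concurrently on a growing random sample.

    After every batch, each surviving configuration gets a confidence
    interval on its mean loss; a configuration is halted when its lower
    bound exceeds the smallest upper bound among the survivors.
    """
    stats = {w: [0, 0.0, 0.0] for w in configs}  # count, sum, sum of squares
    alive = set(configs)
    order = list(range(len(data)))
    random.Random(0).shuffle(order)  # sample without replacement

    for start in range(0, len(order), batch):
        for i in order[start:start + batch]:
            x, y = data[i]
            for w in alive:  # one pass over the sample serves every config
                loss = logistic_loss(w, x, y)
                s = stats[w]
                s[0] += 1
                s[1] += loss
                s[2] += loss * loss

        bounds = {}
        for w in alive:
            n, total, total_sq = stats[w]
            mean = total / n
            var = max(total_sq / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n)
            bounds[w] = (mean - half, mean + half)

        best_upper = min(hi for _, hi in bounds.values())
        alive = {w for w in alive if bounds[w][0] <= best_upper}
        if len(alive) == 1:
            break
    return alive, stats
```

Calling `prune_configs([-2.0, 0.5, 2.0], make_data(5000, 2.0))` returns the surviving configurations together with the per-configuration sample counts, which show how early each discarded configuration was halted.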
Optimal Pricing to Manage Electric Vehicles in Coupled Power and Transportation Networks
We study the system-level effects of the introduction of large populations of
Electric Vehicles on the power and transportation networks. We assume that each
EV owner solves a decision problem to pick a cost-minimizing charge and travel
plan. This individual decision takes into account traffic congestion in the
transportation network, affecting travel times, as well as congestion in the
power grid, resulting in spatial variations in electricity prices for battery
charging. We show that this decision problem is equivalent to finding the
shortest path on an "extended" transportation graph, with virtual arcs that
represent charging options. Using this extended graph, we study the collective
effects of a large number of EV owners individually solving this path planning
problem. We propose a scheme in which independent power and transportation
system operators can collaborate to manage each network towards a socially
optimum operating point while keeping the operational data of each system
private. We further study the optimal reserve capacity requirements for pricing
in the absence of such collaboration. We showcase numerically that a lack of
attention to interdependencies between the two infrastructures can have adverse
operational effects.
Comment: Submitted to IEEE Transactions on Control of Network Systems on June
1st 201
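
The shortest-path view of the individual decision problem can be illustrated with a small sketch. It rests on simplifying assumptions that are not in the abstract: travel costs and charging prices are fixed (no congestion feedback), charge is measured in integer units, and the function name `cheapest_plan` and its parameters are hypothetical. Each state is a (location, charge) pair, and charging is modelled as a virtual arc that stays at the same location.

```python
import heapq


def cheapest_plan(roads, charge_price, start, goal, start_charge, max_charge):
    """Dijkstra over an extended graph whose nodes are (location, charge).

    roads:        {u: [(v, travel_cost, energy_used), ...]}
    charge_price: {location: cost of one unit of charge}
    Returns the minimum total cost to reach goal, or None if unreachable.
    """
    dist = {(start, start_charge): 0.0}
    heap = [(0.0, start, start_charge)]
    while heap:
        cost, u, charge = heapq.heappop(heap)
        if u == goal:
            return cost
        if cost > dist.get((u, charge), float("inf")):
            continue  # stale heap entry

        arcs = []
        # Travel arcs: allowed only if the battery holds enough energy.
        for v, travel_cost, energy in roads.get(u, []):
            if energy <= charge:
                arcs.append((v, charge - energy, travel_cost))
        # Virtual charging arc: stay at u and buy one unit of charge.
        if u in charge_price and charge < max_charge:
            arcs.append((u, charge + 1, charge_price[u]))

        for v, new_charge, weight in arcs:
            new_cost = cost + weight
            if new_cost < dist.get((v, new_charge), float("inf")):
                dist[(v, new_charge)] = new_cost
                heapq.heappush(heap, (new_cost, v, new_charge))
    return None
```

With roads `{"A": [("B", 1.0, 2), ("C", 5.0, 1)], "B": [("C", 1.0, 2)]}`, a charger at `B`, and two units of initial charge, the plan through `B` wins when charging there is cheap and the direct road wins when it is expensive.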
Aggregate Analytic Window Query over Spatial Data
The analytic window query is commonly used in relational databases.
It computes aggregations of data over a sliding window, for example, the
average price of a stock for each day. However, it is not supported in
spatial databases. Because spatial data do not lie in a one-dimensional space,
there is no straightforward way to extend the original analytic window query to
spatial databases. Such queries are nevertheless useful and meaningful, for
example, finding the average number of visits over all the POIs within a circle
of fixed radius centred on each POI. In this paper, we define the aggregate
analytic window query over spatial data and propose algorithms for grid
indexes and tree indexes. We also analyze the complexity of the algorithms to
show that they are efficient and practical.
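
The POI example from this abstract can be sketched with a uniform grid. This is an illustrative baseline under assumed names (`window_average`, tuples of `(x, y, visits)`), not the algorithm proposed in the paper. Choosing the cell size equal to the radius means every point within the circle lies in the 3x3 block of cells around the centre's cell.

```python
import math
from collections import defaultdict


def window_average(pois, radius):
    """For each POI, average the visits of all POIs within `radius` of it.

    pois: list of (x, y, visits) tuples. The POI itself is always inside
    its own window, so every average is well defined.
    """
    # Bucket POI indices into square cells whose side equals the radius.
    grid = defaultdict(list)
    for idx, (x, y, _) in enumerate(pois):
        grid[(math.floor(x / radius), math.floor(y / radius))].append(idx)

    r2 = radius * radius
    averages = []
    for x, y, _ in pois:
        cx, cy = math.floor(x / radius), math.floor(y / radius)
        total = 0
        count = 0
        # Only the 3x3 neighbourhood of cells can contain points in range.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    px, py, visits = pois[j]
                    if (px - x) ** 2 + (py - y) ** 2 <= r2:
                        total += visits
                        count += 1
        averages.append(total / count)
    return averages
```

For `pois = [(0, 0, 10), (1, 0, 20), (5, 5, 30)]` and `radius = 1.5`, the first two POIs see each other and the third sees only itself.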
Pulling Out All the Tops with Computer Vision and Deep Learning
We apply computer vision with deep learning -- in the form of a convolutional
neural network (CNN) -- to build a highly effective boosted top tagger.
Previous work (the "DeepTop" tagger of Kasieczka et al.) has shown that a
CNN-based top tagger can achieve comparable performance to state-of-the-art
conventional top taggers based on high-level inputs. Here, we introduce a
number of improvements to the DeepTop tagger, including architecture, training,
image preprocessing, sample size and color pixels. Our final CNN top tagger
outperforms BDTs based on high-level inputs by a factor of ~2--3 or more
in background rejection, over a wide range of tagging efficiencies and fiducial
jet selections. As reference points, we achieve a QCD background rejection
factor of 500 (60) at 50% top tagging efficiency for fully-merged (non-merged)
top jets with p_T in the 800--900 GeV (350--450 GeV) range. Our CNN can also
be straightforwardly extended to the classification of other types of jets, and
the lessons learned here may be useful to others designing their own deep NNs
for LHC applications.
Comment: 33 pages, 11 figures
Improving Performance of Spatial Network Queries
Spatial network queries, such as KNN or range queries, operate on systems where objects are constrained to locations on a network. Current spatial network query algorithms rely on forms of network traversal whose complexity grows with the size of the network, making them poor for large real-world networks. In this thesis, an alternative method of approximating the results of spatial network queries with a high level of accuracy is introduced. Distances between network points are stored in an M-Tree index, a balanced tree index where metric distance determines data ordering. The M-Tree uses the chessboard metric on network points embedded in a higher dimensional space using tRNE. Using the M-Tree, both KNN and range queries are computed more efficiently than with network traversal. Error rates of the M-Tree are low, with accuracies of 97% possible on KNN queries and perfect accuracy with 2% extra results on range queries.
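
The chessboard (Chebyshev) metric used on the embedded points is simple to state in code. The sketch below is only an illustration: a linear scan stands in for the M-Tree, the embedding vectors are assumed to be given, and the names `chessboard`, `knn` and `range_query` are hypothetical.

```python
import heapq


def chessboard(p, q):
    """Chessboard (Chebyshev) distance: largest per-coordinate difference."""
    return max(abs(a - b) for a, b in zip(p, q))


def knn(points, query, k):
    """Names of the k embedded points closest to `query`.

    points: {name: embedding vector}. A linear scan replaces the M-Tree.
    """
    return heapq.nsmallest(k, points, key=lambda n: chessboard(points[n], query))


def range_query(points, query, radius):
    """Names of all embedded points within `radius` of `query`."""
    return sorted(n for n, p in points.items() if chessboard(p, query) <= radius)
```

An M-Tree would answer the same two queries while pruning whole subtrees through the triangle inequality, which the chessboard distance satisfies because it is a metric.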