Online Algorithms for Multi-Level Aggregation
In the Multi-Level Aggregation Problem (MLAP), requests arrive at the nodes
of an edge-weighted tree T, and have to be served eventually. A service is
defined as a subtree X of T that contains its root. This subtree X serves all
requests that are pending in the nodes of X, and the cost of this service is
equal to the total weight of X. Each request also incurs waiting cost between
its arrival and service times. The objective is to minimize the total waiting
cost of all requests plus the total cost of all service subtrees. MLAP is a
generalization of some well-studied optimization problems; for example, for
trees of depth 1, MLAP is equivalent to the TCP Acknowledgment Problem, while
for trees of depth 2, it is equivalent to the Joint Replenishment Problem.
Aggregation problems for trees of arbitrary depth arise in multicasting, sensor
networks, communication in organization hierarchies, and in supply-chain
management. The instances of MLAP associated with these applications are
naturally online, in the sense that aggregation decisions need to be made
without information about future requests.
Constant-competitive online algorithms are known for MLAP with one or two
levels. However, it has been open whether there exist constant-competitive
online algorithms for trees of depth more than 2. Addressing this open problem,
we give the first constant-competitive online algorithm for networks with an
arbitrary (fixed) number of levels. The competitive ratio is O(D^4 2^D), where
D is the depth of T. The algorithm works for arbitrary waiting cost functions,
including the variant with deadlines.
We also show several additional lower and upper bound results for some
special cases of MLAP, including the Single-Phase variant and the case when the
tree is a path.
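The cost model described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and does not come from the paper: the toy tree in PARENT, the node names, the helper names service_cost and total_cost, the choice of a linear waiting cost (one unit per unit of time), and the convention that the weight of a subtree is the sum of its edge weights.

```python
from typing import Dict, List, Set, Tuple

ROOT = "r"
# Hypothetical toy tree: child -> (parent, weight of the edge to the parent).
PARENT: Dict[str, Tuple[str, float]] = {
    "a": ("r", 3.0),
    "b": ("a", 2.0),
    "c": ("r", 4.0),
}


def service_cost(subtree: Set[str]) -> float:
    """Total edge weight of a service subtree X, which must contain the root."""
    if ROOT not in subtree:
        raise ValueError("a service subtree must contain the root")
    cost = 0.0
    for node in subtree:
        if node == ROOT:
            continue
        parent, weight = PARENT[node]
        if parent not in subtree:
            raise ValueError(f"{node} is disconnected from the root")
        cost += weight
    return cost


def total_cost(
    requests: List[Tuple[str, float]],
    services: List[Tuple[float, Set[str]]],
) -> float:
    """Service costs plus waiting costs (assumed linear in the waiting time).

    requests: (node, arrival_time) pairs.
    services: (service_time, subtree) pairs.
    """
    cost = sum(service_cost(subtree) for _, subtree in services)
    for node, arrival in requests:
        # A request is served by the earliest service, at or after its
        # arrival, whose subtree contains the request's node.
        times = [t for t, subtree in services if t >= arrival and node in subtree]
        if not times:
            raise ValueError(f"request at {node} is never served")
        cost += min(times) - arrival
    return cost


requests = [("b", 0.0), ("c", 1.0)]
services = [(2.0, {"r", "a", "b"}), (3.0, {"r", "c"})]
print(total_cost(requests, services))  # 5 + 4 service cost, 2 + 2 waiting cost
```

An online algorithm would have to choose the services without seeing future requests; the sketch only evaluates the objective for a schedule that is already fixed.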
Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involve massive data but they also often include online data and data
heterogeneity. Recently some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles, in a single and versatile
framework, regression problems as well as two-class and multi-class
classification problems. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks for random
forests in the Big Data context. Finally, we experiment with five variants on two
massive datasets (15 and 120 million observations), a simulated one as well
as real-world data. One variant relies on subsampling, while three others are
related to parallel implementations of random forests and involve either
various adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relates to online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations.
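The out-of-bag error mentioned in this abstract can be sketched on a toy problem. This is a deliberately simplified stand-in, not any of the variants the paper reviews: it bags one-dimensional decision stumps instead of growing full randomized trees, and the function names fit_stump and oob_error, the synthetic data, and the parameter values are all assumptions made for illustration.

```python
import random


def fit_stump(sample):
    """Fit a 1-D decision stump that predicts class 1 when x > threshold."""
    xs = sorted({x for x, _ in sample})
    midpoints = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    candidates = [float("-inf")] + midpoints + [float("inf")]

    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in sample)

    return min(candidates, key=errors)


def oob_error(data, n_trees=25, seed=0):
    """Out-of-bag error of a bagged ensemble of stumps."""
    rng = random.Random(seed)
    n = len(data)
    votes = [[0, 0] for _ in range(n)]  # per-point votes for class 0 and class 1
    for _ in range(n_trees):
        # Bootstrap: draw n indices with replacement.
        bag = [rng.randrange(n) for _ in range(n)]
        threshold = fit_stump([data[i] for i in bag])
        in_bag = set(bag)
        # Each stump votes only on the points it did not see during training.
        for i in range(n):
            if i not in in_bag:
                votes[i][int(data[i][0] > threshold)] += 1
    scored = [(v, data[i][1]) for i, v in enumerate(votes) if sum(v) > 0]
    wrong = sum((v[1] > v[0]) != bool(y) for v, y in scored)
    return wrong / len(scored)


# Synthetic, well-separated two-class data (assumed for the example).
data = [(float(x), 0) for x in range(20)] + [(float(x), 1) for x in range(100, 120)]
print(oob_error(data))
```

Because every point is scored only by the stumps whose bootstrap sample left it out, the estimate needs no separate validation set, which is part of what makes it attractive when the data are massive.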
Diffusion-limited deposition of dipolar particles
Deposits of dipolar particles are investigated by means of extensive Monte
Carlo simulations. We found that the effect of the interactions is described by
an initial, non-universal, scaling regime characterized by orientationally
ordered deposits. In the dipolar regime, the order and geometry of the clusters
depend on the strength of the interactions and the magnetic properties are
tunable by controlling the growth conditions. At later stages, the growth is
dominated by thermal effects and the diffusion-limited universal regime
obtains, at finite temperatures. At low temperatures the crossover size
increases exponentially as T decreases and at T=0 only the dipolar regime is
observed.
Comment: 5 pages, 4 figures
Isoelastic Agents and Wealth Updates in Machine Learning Markets
Recently, prediction markets have shown considerable promise for developing
flexible mechanisms for machine learning. In this paper, agents with isoelastic
utilities are considered. It is shown that the costs associated with
homogeneous markets of agents with isoelastic utilities produce equilibrium
prices corresponding to alpha-mixtures, with a particular form of mixing
component relating to each agent's wealth. We also demonstrate that wealth
accumulation for logarithmic and other isoelastic agents (through payoffs on
prediction of training targets) can implement both Bayesian model updates and
mixture weight updates by imposing different market payoff structures. An
iterative algorithm is given for market equilibrium computation. We demonstrate
that inhomogeneous markets of agents with isoelastic utilities outperform state
of the art aggregate classifiers such as random forests, as well as single
classifiers (neural networks, decision trees) on a number of machine learning
benchmarks, and show that isoelastic combination methods are generally better
than their logarithmic counterparts.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012)
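For readers unfamiliar with the term, the standard textbook form of an isoelastic (constant relative risk aversion) utility can be sketched as below. The function name isoelastic_utility and the parameter name eta are assumptions chosen for illustration, and the paper's own parameterization may differ. The logarithmic utility is the limiting case eta = 1, which is why the abstract treats logarithmic agents as a special case of isoelastic ones.

```python
import math


def isoelastic_utility(wealth: float, eta: float) -> float:
    """Isoelastic utility of wealth; eta is the relative risk aversion.

    U(W) = (W**(1 - eta) - 1) / (1 - eta) for eta != 1, and log(W) for eta == 1.
    """
    if wealth <= 0:
        raise ValueError("wealth must be positive")
    if math.isclose(eta, 1.0):
        return math.log(wealth)
    return (wealth ** (1.0 - eta) - 1.0) / (1.0 - eta)


# As eta approaches 1, the general form approaches the logarithmic utility.
print(isoelastic_utility(2.0, 1.0 + 1e-6), math.log(2.0))
```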
Multiple imputation for sharing precise geographies in public use data
When releasing data to the public, data stewards are ethically and often
legally obligated to protect the confidentiality of data subjects' identities
and sensitive attributes. They also strive to release data that are informative
for a wide range of secondary analyses. Achieving both objectives is
particularly challenging when data stewards seek to release highly resolved
geographical information. We present an approach for protecting the
confidentiality of data with geographic identifiers based on multiple
imputation. The basic idea is to convert geography to latitude and longitude,
estimate a bivariate response model conditional on attributes, and simulate new
latitude and longitude values from these models. We illustrate the proposed
methods using data describing causes of death in Durham, North Carolina. In the
context of the application, we present a straightforward tool for generating
simulated geographies and attributes based on regression trees, and we present
methods for assessing disclosure risks with such simulated data.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS506 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Regression and Learning to Rank Aggregation for User Engagement Evaluation
User engagement refers to the amount of interaction an instance (e.g., a tweet,
news item, or forum post) achieves. Ranking the items on social media websites
based on the amount of user participation in them can be used in different
applications, such as recommender systems. In this paper, we consider a tweet
containing a rating for a movie as an instance and focus on ranking the
instances of each user based on their engagement, i.e., the total number of
retweets and favorites they will gain.
For this task, we define several features which can be extracted from the
meta-data of each tweet. The features are partitioned into three categories:
user-based, movie-based, and tweet-based. We show that in order to obtain good
results, features from all categories should be considered. We exploit
regression and learning to rank methods to rank the tweets and propose to
aggregate the results of regression and learning to rank methods to achieve
better performance. We ran our experiments on an extended version of the
MovieTweetings dataset provided by the ACM RecSys Challenge 2014. The results show
that the learning to rank approach outperforms most of the regression models and
that the combination can improve the performance significantly.
Comment: In Proceedings of the 2014 ACM Recommender Systems Challenge,
RecSysChallenge '1
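The abstract does not say how the regression and learning to rank results are aggregated. One common way to combine several rankings is a Borda count, sketched below purely as an assumption; the function name borda_aggregate, the tweet identifiers, and the three example rankings are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def borda_aggregate(rankings: List[List[str]]) -> List[str]:
    """Combine best-first rankings of the same items with a Borda count."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # The top item earns n - 1 points, the last item earns 0.
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    # Highest total first; ties are broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of one user's tweets produced by three models.
regression_ranking = ["t1", "t2", "t3"]
ltr_ranking = ["t2", "t1", "t3"]
second_ltr_ranking = ["t2", "t3", "t1"]
print(borda_aggregate([regression_ranking, ltr_ranking, second_ltr_ranking]))
```

With these inputs t2 collects 5 points, t1 collects 3, and t3 collects 1, so the aggregated order is t2, t1, t3.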