    Online Algorithms for Multi-Level Aggregation

    In the Multi-Level Aggregation Problem (MLAP), requests arrive at the nodes of an edge-weighted tree T, and have to be served eventually. A service is defined as a subtree X of T that contains its root. This subtree X serves all requests that are pending in the nodes of X, and the cost of this service is equal to the total weight of X. Each request also incurs a waiting cost between its arrival and service times. The objective is to minimize the total waiting cost of all requests plus the total cost of all service subtrees. MLAP is a generalization of some well-studied optimization problems; for example, for trees of depth 1, MLAP is equivalent to the TCP Acknowledgment Problem, while for trees of depth 2, it is equivalent to the Joint Replenishment Problem. Aggregation problems for trees of arbitrary depth arise in multicasting, sensor networks, communication in organization hierarchies, and in supply-chain management. The instances of MLAP associated with these applications are naturally online, in the sense that aggregation decisions need to be made without information about future requests. Constant-competitive online algorithms are known for MLAP with one or two levels. However, it has been open whether there exist constant-competitive online algorithms for trees of depth more than 2. Addressing this open problem, we give the first constant-competitive online algorithm for networks with an arbitrary (fixed) number of levels. The competitive ratio is O(D^4 2^D), where D is the depth of T. The algorithm works for arbitrary waiting cost functions, including the variant with deadlines. We also show several additional lower and upper bound results for some special cases of MLAP, including the Single-Phase variant and the case when the tree is a path.

    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involve massive data but they also often include online data and data heterogeneity. Recently some statistical methods have been adapted to process Big Data, like linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles, in a single versatile framework, regression problems as well as two-class and multi-class classification problems. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities -- such as out-of-bag error and variable importance -- are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), a simulated one as well as real-world data. One variant relies on subsampling while three others are related to parallel implementations of random forests and involve either various adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relates to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.

    Diffusion-limited deposition of dipolar particles

    Deposits of dipolar particles are investigated by means of extensive Monte Carlo simulations. We found that the effect of the interactions is described by an initial, non-universal, scaling regime characterized by orientationally ordered deposits. In the dipolar regime, the order and geometry of the clusters depend on the strength of the interactions and the magnetic properties are tunable by controlling the growth conditions. At later stages, the growth is dominated by thermal effects and the diffusion-limited universal regime obtains, at finite temperatures. At low temperatures the crossover size increases exponentially as T decreases and at T=0 only the dipolar regime is observed. Comment: 5 pages, 4 figures

    Isoelastic Agents and Wealth Updates in Machine Learning Markets

    Recently, prediction markets have shown considerable promise for developing flexible mechanisms for machine learning. In this paper, agents with isoelastic utilities are considered. It is shown that the costs associated with homogeneous markets of agents with isoelastic utilities produce equilibrium prices corresponding to alpha-mixtures, with a particular form of mixing component relating to each agent's wealth. We also demonstrate that wealth accumulation for logarithmic and other isoelastic agents (through payoffs on prediction of training targets) can implement both Bayesian model updates and mixture weight updates by imposing different market payoff structures. An iterative algorithm is given for market equilibrium computation. We demonstrate that inhomogeneous markets of agents with isoelastic utilities outperform state-of-the-art aggregate classifiers such as random forests, as well as single classifiers (neural networks, decision trees) on a number of machine learning benchmarks, and show that isoelastic combination methods are generally better than their logarithmic counterparts. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)

    Multiple imputation for sharing precise geographies in public use data

    When releasing data to the public, data stewards are ethically and often legally obligated to protect the confidentiality of data subjects' identities and sensitive attributes. They also strive to release data that are informative for a wide range of secondary analyses. Achieving both objectives is particularly challenging when data stewards seek to release highly resolved geographical information. We present an approach for protecting the confidentiality of data with geographic identifiers based on multiple imputation. The basic idea is to convert geography to latitude and longitude, estimate a bivariate response model conditional on attributes, and simulate new latitude and longitude values from these models. We illustrate the proposed methods using data describing causes of death in Durham, North Carolina. In the context of the application, we present a straightforward tool for generating simulated geographies and attributes based on regression trees, and we present methods for assessing disclosure risks with such simulated data. Comment: Published at http://dx.doi.org/10.1214/11-AOAS506 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Regression and Learning to Rank Aggregation for User Engagement Evaluation

    User engagement refers to the amount of interaction an instance (e.g., tweet, news, and forum post) achieves. Ranking the items in social media websites based on the amount of user participation in them can be used in different applications, such as recommender systems. In this paper, we consider a tweet containing a rating for a movie as an instance and focus on ranking the instances of each user based on their engagement, i.e., the total number of retweets and favorites each will gain. For this task, we define several features which can be extracted from the meta-data of each tweet. The features are partitioned into three categories: user-based, movie-based, and tweet-based. We show that in order to obtain good results, features from all categories should be considered. We exploit regression and learning to rank methods to rank the tweets and propose to aggregate the results of regression and learning to rank methods to achieve better performance. We have run our experiments on an extended version of the MovieTweeting dataset provided by ACM RecSys Challenge 2014. The results show that the learning to rank approach outperforms most of the regression models and that the combination can improve the performance significantly. Comment: In Proceedings of the 2014 ACM Recommender Systems Challenge, RecSysChallenge '1