    Random Forests for Big Data

    Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it also often includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single and versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. Then, we formulate various remarks for random forests in the Big Data context. Finally, we experiment with five variants on two massive datasets (15 and 120 million observations), one simulated and one of real-world data. One variant relies on subsampling, while three others are related to parallel implementations of random forests and involve either adaptations of the bootstrap to Big Data or "divide-and-conquer" approaches. The fifth variant relates to online learning of random forests. These numerical experiments highlight the relative performance of the different variants, as well as some of their limitations.
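
    The bootstrap and out-of-bag ideas named in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the function name `bootstrap_split`, the sample size and the seed are hypothetical, and the code only shows how the in-bag sample and out-of-bag set of a single tree might be drawn.

```python
import random

def bootstrap_split(n, rng):
    """Draw a bootstrap sample of size n from the indices 0..n-1.

    Returns (in_bag, out_of_bag): in_bag holds n indices drawn with
    replacement; out_of_bag holds the indices that were never drawn.
    """
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(0)  # arbitrary seed, for repeatability only
in_bag, out_of_bag = bootstrap_split(10_000, rng)
oob_fraction = len(out_of_bag) / 10_000
print(f"out-of-bag fraction: {oob_fraction:.3f}")
```

    For large n the probability that an observation is left out of a bootstrap sample is (1 - 1/n)^n, which tends to 1/e (about 0.368), so each tree can be assessed on roughly a third of the data it did not see.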

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.
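
    As a rough illustration of node sampling combined with graph induction, the sketch below samples nodes uniformly from a static edge list and keeps only the edges whose endpoints were both sampled. This is an assumption-laden simplification, not the authors' streaming methods; the function name `induced_node_sample` and the toy graph are hypothetical.

```python
import random

def induced_node_sample(edges, fraction, rng):
    """Sample a fraction of the nodes uniformly, then induce the subgraph.

    edges is a list of (u, v) pairs. Returns (kept, induced) where kept
    is the set of sampled nodes and induced is the list of original
    edges with both endpoints in kept.
    """
    nodes = sorted({node for edge in edges for node in edge})
    k = max(1, round(fraction * len(nodes)))
    kept = set(rng.sample(nodes, k))
    induced = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, induced

# Toy graph: a triangle (0, 1, 2) with a path attached at node 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 5)]
kept, induced = induced_node_sample(edges, 0.5, random.Random(1))
print(sorted(kept), induced)
```

    The induction step is what lets the sample retain local structure: an edge survives only if both of its endpoints do, so no dangling edges appear in the result.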

    Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams

    Online social media are complementing and in some cases replacing person-to-person social interaction and redefining the diffusion of information. In particular, microblogs have become crucial grounds on which public relations, marketing, and political battles are fought. We introduce an extensible framework that will enable the real-time analysis of meme diffusion in social media by mining, visualizing, mapping, classifying, and modeling massive streams of public microblogging events. We describe a Web service that leverages this framework to track political memes in Twitter and help detect astroturfing, smear campaigns, and other misinformation in the context of U.S. political elections. We present some cases of abusive behaviors uncovered by our service. Finally, we discuss promising preliminary results on the detection of suspicious memes via supervised learning based on features extracted from the topology of the diffusion networks, sentiment analysis, and crowdsourced annotations.

    Graph Sample and Hold: A Framework for Big-Graph Analytics

    Sampling is a standard approach in big-graph analytics; the goal is to efficiently estimate the graph properties by consulting a sample of the whole population. A perfect sample is assumed to mirror every property of the whole population. Unfortunately, such a perfect sample is hard to collect in complex populations such as graphs (e.g., web graphs and social networks), where an underlying network connects the units of the population. Therefore, a good sample will be representative in the sense that graph properties of interest can be estimated with a known degree of accuracy. While previous work focused particularly on sampling schemes used to estimate certain graph properties (e.g., triangle count), much less is known for the case when we need to estimate various graph properties with the same sampling scheme. In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH). To begin, the proposed framework samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state. We then show how to produce unbiased estimators for various graph properties from the sample. Given that the graph analysis algorithms will run on a sample instead of the whole population, the runtime complexity of these algorithms is kept under control. Moreover, given that the estimators of graph properties are unbiased, the approximation error is kept under control. Finally, we show the performance of the proposed framework (gSH) on various types of graphs, such as social graphs, among others.
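
    The idea of unbiased estimation from a single-pass edge sample can be sketched with a deliberately simpler scheme than gSH: keep each streamed edge independently with a fixed probability p and rescale the sampled counts. This is not the gSH algorithm, which maintains state and adapts its decisions to edges already held; the names `sample_edge_stream` and `count_triangles`, the value of p and the test graph are all hypothetical. Under independent sampling each edge survives with probability p and each triangle with probability p^3, so dividing by those probabilities gives unbiased estimates.

```python
import random
from itertools import combinations

def sample_edge_stream(stream, p, rng):
    """Single pass over the stream: keep each edge with probability p."""
    return [edge for edge in stream if rng.random() < p]

def count_triangles(edges):
    """Count triangles in a simple undirected graph.

    Assumes each undirected edge appears exactly once and there are no
    self-loops.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

# Complete graph on 30 nodes: 435 edges and 4060 triangles.
edges = list(combinations(range(30), 2))
p = 0.5
sample = sample_edge_stream(edges, p, random.Random(7))

edge_estimate = len(sample) / p                       # rescale by 1/p
triangle_estimate = count_triangles(sample) / p ** 3  # rescale by 1/p^3
print(edge_estimate, triangle_estimate)
```

    The triangle count runs on the sample only, which is what keeps the cost bounded; the price is variance, which grows quickly as p shrinks because of the 1/p^3 factor.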

    ISER 2012 Working Paper No. 1

    Large resource development projects take years to plan. During that planning time, the public frequently debates the potential benefits and risks of a project, but with incomplete information. In these debates, some people might assert that a project would have great benefits, while others might assert that it would certainly harm the environment. At the same time, the developer will be assessing different designs, before finally submitting one to the government permitting agencies for evaluation and public scrutiny. For large mines in Alaska, the government permitting process takes years, and often includes an ecological risk assessment. This assessment is a data-intensive, scientific evaluation of the project’s potential ecological risks, based on the specific details of the project. Recently, some organizations have tried to bring scientific rigor to the pre-design public discussions, especially for mining projects, through a pre-design ecological risk assessment. This is a scientific assessment of the environmental risks a project might pose, before the details of project design, risk-prevention, and risk-mitigation measures are known. It is important to know whether pre-design risk assessment is a viable method for drawing conclusions about risks of projects. If valid risk predictions can be made at that stage, then people or governments would not have to wait for either a design or the detailed evaluation that is done during the permitting process. Such an approach could be used to shortcut permitting. It could affect project financing; it could affect the schedule, priority, or even the resources that governments put toward evaluating a project. But perhaps most important: in an age where public perceptions are an important influence on a project’s viability and government permitting decisions, a realistic risk assessment can be used to focus public attention on the facts.
    But if the methodology is flawed and results in poor quality information and unsupportable conclusions, then a pre-design risk assessment could unjustifiably either inflame or calm the public, depending on what it predicts.
    Executive Summary / Section 1. Introduction / Section 2. Overview of Ecological Risk / Section 3. Ecological Risk Assessment Methodology / Section 4. Examples of Post-Design Ecological Risk Assessments / Section 5. Pre-Design Ecological Risk Assessment: Risks of Large Scale Mining in the Bristol Bay Watershed / Section 6. Conclusion / Bibliography

    A monitoring strategy for application to salmon-bearing watersheds


    Alaska mining and water quality

    The Institute of Water Resources has sought financial assistance for some time in an attempt to initiate research relative to the impact of mining on water quality. Attempts were made as early as 1971 by Dr. Timothy Tilsworth and later by Dr. Donald Cook and Dr. Sage Murphy. These investigators anticipated growth in placer gold mining and the development of natural resources in Alaska during a period of national and environmental concern. The subsequent energy "crisis," the major increase in the price of gold on the world market, and dwindling nonrenewable resource supplies have resulted in large-scale mineral exploration in Alaska. This exploration, coupled with development of the trans-Alaska oil pipeline, has attracted considerable capital for potential investment and development in Alaska. Expected industrial growth has already started and major new projects are "just around the corner." Yet, as of 1976, no major research effort has occurred to determine the extent of or potential for water quality impacts from mining operations in Alaska. Recently a series of interdisciplinary research projects has been completed in Canada; however, the application of Canadian data to Alaskan problems is uncertain. Although state and federal government agencies have been advised and are aware of this potential problem and lack of baseline data, they have not sought out new information or rational solutions. Even now, with deadlines of Public Law 92-500 at hand, some regulatory agencies give the impression of attempting to ignore the situation. Interim limitations are proposed and permits are issued with no discernible rationale or basis. Data have not been obtained relative to the Alaskan mining operations and thus are not available for use in seeking solutions compatible with mining and environmental protection. Numbers appear to have been arbitrarily assigned to permits and water quality standards.
    When permits are issued, self-monitoring requirements are negligible or nonexistent. Nor have regulatory agencies demonstrated the ability or inclination to monitor mining operations or enforce permits and water quality standards. It was hoped that the project would bring together miners, environmentalists, and regulators in a cooperative effort to identify the problems and seek solutions. The investigators recognized the political sensitivity of the subject matter but proceeded optimistically. Relatively good cooperation, though not total, occurred early in the project. In April 1976, a symposium was held to exchange ideas and determine the state-of-the-art. Although the symposium had good attendance and an exchange of information occurred, the symposium itself was somewhat of a disappointment. With few exceptions, the participants aligned on one side or the other in preconceived fixed positions. Some even chose not to attend and were therefore able to avoid the issues. Little hard data was presented. Optimistically, some of the miners, environmentalists, and regulators are prepared to resolve their differences. This report, hopefully, will be of benefit to them. It is our experience that miners and environmentalists share a love of the land that is uniquely Alaska. We feel that technology is available for application to this problem for those who care about doing the job right in the "last frontier." Whether or not it will be effectively applied to protect Alaska's water resources is a question which remains unanswered.
    The work upon which this report is based was supported in part by funds provided by the United States Department of the Interior, Office of Water Resources Research Act of 1964, Public Law 88-379, as amended (Project A-055-ALAS).