Random Forests for Big Data
Big Data is one of the major challenges of statistical science and has
numerous consequences from algorithmic and theoretical viewpoints. Big Data
always involves massive data, but it also often includes online data and data
heterogeneity. Recently some statistical methods have been adapted to process
Big Data, like linear regression models, clustering methods and bootstrapping
schemes. Based on decision trees combined with aggregation and bootstrap ideas,
random forests were introduced by Breiman in 2001. They are a powerful
nonparametric statistical method that handles regression problems, as well as
two-class and multi-class classification problems, in a single versatile
framework. Focusing on classification problems, this paper
proposes a selective review of available proposals that deal with scaling
random forests to Big Data problems. These proposals rely on parallel
environments or on online adaptations of random forests. We also describe how
related quantities -- such as out-of-bag error and variable importance -- are
addressed in these methods. Then, we formulate various remarks for random
forests in the Big Data context. Finally, we experiment with five variants on
two massive datasets (15 and 120 million observations), one simulated and one
from real-world data. One variant relies on subsampling, while three others are
related to parallel implementations of random forests and involve either
various adaptations of the bootstrap to Big Data or "divide-and-conquer"
approaches. The fifth variant relates to online learning of random forests.
These numerical experiments highlight the relative performance of the
different variants, as well as some of their limitations.
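The out-of-bag error mentioned in this abstract rests on the observations that a bootstrap sample leaves out. The following is a minimal Python sketch of that split only, not code from the paper; the function name `bootstrap_split` and the sample size are assumptions chosen for illustration.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in-bag, out-of-bag) index lists."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    # Out-of-bag observations are those never drawn into the bootstrap sample.
    oob = [i for i in range(n) if i not in drawn]
    return in_bag, oob

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 10_000              # hypothetical number of observations
in_bag, oob = bootstrap_split(n, rng)

# For large n the expected out-of-bag share is (1 - 1/n)^n, close to exp(-1).
oob_fraction = len(oob) / n
```

In a forest, each tree would be evaluated on its own out-of-bag observations, which is why variants that change the bootstrap scheme also need to revisit how this error is computed.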
Network Sampling: From Static to Streaming Graphs
Network sampling is integral to the analysis of social, information, and
biological networks. Since many real-world networks are massive in size,
continuously evolving, and/or distributed in nature, the network structure is
often sampled in order to facilitate study. For these reasons, a more thorough
and complete understanding of network sampling is critical to support the field
of network science. In this paper, we outline a framework for the general
problem of network sampling, by highlighting the different objectives,
population and units of interest, and classes of network sampling methods. In
addition, we propose a spectrum of computational models for network sampling
methods, ranging from the traditionally studied model based on the assumption
of a static domain to a more challenging model that is appropriate for
streaming domains. We design a family of sampling methods based on the concept
of graph induction that generalize across the full spectrum of computational
models (from static to streaming) while efficiently preserving many of the
topological properties of the input graphs. Furthermore, we demonstrate how
traditional static sampling algorithms can be modified for graph streams for
each of the three main classes of sampling methods: node, edge, and
topology-based sampling. Our experimental results indicate that our proposed
family of sampling methods more accurately preserves the underlying properties
of the graph for both static and streaming graphs. Finally, we study the impact
of network sampling algorithms on the parameter estimation and performance
evaluation of relational classification algorithms.
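As an illustration of edge sampling on a stream, the sketch below uses standard reservoir sampling to keep a uniform sample of k edges in one pass, then collects the nodes those edges touch. It is a generic technique shown under stated assumptions, not the graph-induction method proposed in the paper; the names `reservoir_edge_sample` and `induced_nodes` and the toy path graph are hypothetical.

```python
import random

def reservoir_edge_sample(edge_stream, k, rng):
    """Keep a uniform random sample of up to k edges from a one-pass stream."""
    reservoir = []
    for t, edge in enumerate(edge_stream):
        if t < k:
            # Fill the reservoir with the first k edges.
            reservoir.append(edge)
        else:
            # Edge t replaces a random slot with probability k / (t + 1).
            j = rng.randrange(t + 1)
            if j < k:
                reservoir[j] = edge
    return reservoir

def induced_nodes(edges):
    """Return the set of nodes that appear in the sampled edges."""
    return {node for edge in edges for node in edge}

# Toy stream: a path graph on 1,001 nodes, seen one edge at a time.
stream = [(i, i + 1) for i in range(1000)]
sample = reservoir_edge_sample(iter(stream), 50, random.Random(1))
nodes = induced_nodes(sample)
```

The memory used is bounded by k regardless of stream length, which is the property that makes this kind of sampler usable when the graph cannot be stored.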
Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams
Online social media are complementing and in some cases replacing
person-to-person social interaction and redefining the diffusion of
information. In particular, microblogs have become crucial grounds on which
public relations, marketing, and political battles are fought. We introduce an
extensible framework that will enable the real-time analysis of meme diffusion
in social media by mining, visualizing, mapping, classifying, and modeling
massive streams of public microblogging events. We describe a Web service that
leverages this framework to track political memes in Twitter and help detect
astroturfing, smear campaigns, and other misinformation in the context of U.S.
political elections. We present some cases of abusive behaviors uncovered by
our service. Finally, we discuss promising preliminary results on the detection
of suspicious memes via supervised learning based on features extracted from
the topology of the diffusion networks, sentiment analysis, and crowdsourced
annotations.
Graph Sample and Hold: A Framework for Big-Graph Analytics
Sampling is a standard approach in big-graph analytics; the goal is to
efficiently estimate the graph properties by consulting a sample of the whole
population. A perfect sample is assumed to mirror every property of the whole
population. Unfortunately, such a perfect sample is hard to collect in complex
populations such as graphs (e.g., web graphs, social networks, etc.), where an
underlying network connects the units of the population. Therefore, a good
sample will be representative in the sense that graph properties of interest
can be estimated with a known degree of accuracy. While previous work focused
particularly on sampling schemes used to estimate certain graph properties
(e.g. triangle count), much less is known for the case when we need to estimate
various graph properties with the same sampling scheme. In this paper, we
propose a generic stream sampling framework for big-graph analytics, called
Graph Sample and Hold (gSH). To begin, the proposed framework samples from
massive graphs sequentially in a single pass, one edge at a time, while
maintaining a small state. We then show how to produce unbiased estimators for
various graph properties from the sample. Given that the graph analysis
algorithms will run on a sample instead of the whole population, the runtime
complexity of these algorithms is kept under control. Moreover, given that the
estimators of graph properties are unbiased, the approximation error is kept
under control. Finally, we show the performance of the proposed framework (gSH)
on various types of graphs, such as social graphs, among others.
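To make the idea of unbiased estimation from a single-pass edge sample concrete, here is a simplified Python sketch. It assumes every edge is kept independently with one fixed probability p and rescales the sampled counts by the inverse inclusion probability; this is a simplification for illustration and is not the gSH framework itself. The function names and the complete-graph example are assumptions.

```python
import random
from itertools import combinations

def sample_edges(edge_stream, p, rng):
    """Single pass: keep each edge independently with probability p."""
    return [edge for edge in edge_stream if rng.random() < p]

def count_triangles(edges):
    """Count triangles in an undirected simple graph given as unique edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Each triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

def estimate_counts(sample, p):
    """Rescale sampled counts by inverse inclusion probabilities."""
    edge_estimate = len(sample) / p
    # A triangle survives only if all three of its edges are kept (p ** 3).
    triangle_estimate = count_triangles(sample) / p ** 3
    return edge_estimate, triangle_estimate

# Hypothetical input: the complete graph on 30 nodes.
edges = list(combinations(range(30), 2))
sample = sample_edges(edges, 0.5, random.Random(0))
edge_estimate, triangle_estimate = estimate_counts(sample, 0.5)
```

Any single run can land above or below the true counts; unbiasedness means only that the estimates average out to the true values over repeated samples.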
ISER 2012 Working Paper No. 1
Large resource development projects take years to plan. During that planning time, the public
frequently debates the potential benefits and risks of a project, but with incomplete information.
In these debates, some people might assert that a project would have great benefits, while others
might assert that it would certainly harm the environment. At the same time, the developer will
be assessing different designs, before finally submitting one to the government permitting
agencies for evaluation and public scrutiny.
For large mines in Alaska, the government permitting process takes years, and often includes an
ecological risk assessment. This assessment is a data-intensive, scientific evaluation of the
project’s potential ecological risks, based on the specific details of the project.
Recently, some organizations have tried to bring scientific rigor to the pre-design public
discussions, especially for mining projects, through a pre-design ecological risk assessment.
This is a scientific assessment of the environmental risks a project might pose, before the details
of project design, risk-prevention, and risk-mitigation measures are known.
It is important to know whether pre-design risk assessment is a viable method for drawing
conclusions about risks of projects. If valid risk predictions can be made at that stage, then
people or governments would not have to wait for either a design or for the detailed evaluation
that is done during the permitting process. Such an approach could be used to shortcut
permitting. It could affect project financing; it could affect the schedule, priority, or even the
resources that governments put toward evaluating a project. But perhaps most important: in an
age where public perceptions are an important influence on a project’s viability and government
permitting decisions, a realistic risk assessment can be used to focus public attention on the facts.
But if the methodology is flawed and results in poor quality information and unsupportable
conclusions, then a pre-design risk assessment could unjustifiably either inflame or calm the
public, depending on what it predicts.
Executive Summary / Section 1. Introduction / Section 2. Overview of Ecological Risk / Section 3. Ecological Risk Assessment Methodology / Section 4. Examples of Post-Design Ecological Risk Assessments / Section 5. Pre-Design Ecological Risk Assessment: Risks of Large Scale Mining in the Bristol Bay Watershed / Section 6. Conclusion / Bibliography
Alaska mining and water quality
The Institute of Water Resources has sought financial assistance
for some time in an attempt to initiate research relative to the impact
of mining on water quality. Attempts were made as early as 1971 by Dr.
Timothy Tilsworth and later by Dr. Donald Cook and Dr. Sage Murphy.
These investigators anticipated growth in placer gold mining and the
development of natural resources in Alaska during a period of national
and environmental concern. The subsequent energy "crisis," the major
increase in the price of gold on the world market, and dwindling nonrenewable
resource supplies have resulted in large-scale mineral
exploration in Alaska. This exploration, coupled with development of
the trans-Alaska oil pipeline, has attracted considerable capital for
potential investment and development in Alaska. Expected industrial
growth has already started and major new projects are "just around the
corner."
Yet, as of 1976, no major research effort has occurred to determine
the extent of or potential for water quality impacts from mining operations
in Alaska. Recently a series of interdisciplinary research projects
have been completed in Canada; however, the application of Canadian data
to Alaskan problems is uncertain. Although state and federal government
agencies have been advised and are aware of this potential problem
and the lack of baseline data, they have not sought out new information or
rational solutions. Even now, with deadlines of Public Law 92-500 at
hand, some regulatory agencies give the impression of attempting to
ignore the situation. Interim limitations are proposed and permits
are issued with no discernible rationale or basis. Data have not been
obtained relative to the Alaskan mining operations and thus are not
available for use in seeking solutions compatible with mining and environmental protection. Numbers appear to have been arbitrarily
assigned to permits and water quality standards. When permits are
issued, self-monitoring requirements are negligible or nonexistent.
Nor have regulatory agencies demonstrated the ability or inclination
to monitor mining operations or enforce permits and water quality
standards.
It was hoped that the project would bring together miners, environmentalists, and regulators in a cooperative effort to identify the
problems and seek solutions. The investigators recognized the political
sensitivity of the subject matter but proceeded optimistically.
Relatively good cooperation, though not total, occurred early in the
project. In April 1976, a symposium was held to exchange ideas and
determine the state-of-the-art. Although the symposium had good
attendance and an exchange of information occurred, the symposium
itself was somewhat of a disappointment. With few exceptions, the
participants aligned on one side or the other in preconceived fixed
positions. Some even chose not to attend and were therefore able to
avoid the issues. Little hard data was presented.
Optimistically, some of the miners, environmentalists, and
regulators are prepared to resolve their differences. This report,
hopefully, will be of benefit to them. It is our experience that
miners and environmentalists share a love of the land that is uniquely
Alaska. We feel that technology is available for application to this
problem for those who care about doing the job right in the "last
frontier." Whether or not it will be effectively applied to protect
Alaska's water resources is a question which remains unanswered.
The work upon which this report is based was supported in part by
funds provided by the United States Department of the Interior, Office
of Water Resources Research Act of 1964, Public Law 88-379, as amended
(Project A-055-ALAS).