Distributed Robust Learning
We propose a framework for distributed robust statistical learning on {\em
big contaminated data}. The Distributed Robust Learning (DRL) framework can
reduce the computational time of traditional robust learning methods by several
orders of magnitude. We analyze the robustness property of DRL, showing that
DRL not only preserves the robustness of the base robust learning method, but
also tolerates contaminations on a constant fraction of results from computing
nodes (node failures). More precisely, even in the presence of the most
adversarial outlier distribution over computing nodes, DRL still achieves a
breakdown point of at least $\lambda^*/2$, where $\lambda^*$ is the breakdown
point of the corresponding centralized algorithm. This is in stark contrast
with the naive division-and-averaging implementation, which may reduce the
breakdown point by a factor of $k$ when $k$ computing nodes are used. We then specialize the
DRL framework for two concrete cases: distributed robust principal component
analysis and distributed robust regression. We demonstrate the efficiency and
the robustness advantages of DRL through comprehensive simulations and through
predicting image tags on a large-scale image set.
Comment: 18 pages, 2 figures
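As a rough illustration of the division-and-robust-aggregation idea, the Python sketch below shards contaminated data across simulated computing nodes, runs a robust base estimator on each shard, and combines the node results with a coordinate-wise median rather than a plain average. The abstract does not specify DRL's aggregation rule, so the median step and the `drl_estimate` helper are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def drl_estimate(data, base_robust_estimator, n_nodes):
    """Shard rows of `data` across nodes, run a robust estimator on each
    shard, then aggregate the node results with a coordinate-wise median.
    Median aggregation tolerates corrupted results from up to (roughly)
    half of the nodes, whereas plain averaging can be ruined by one."""
    shards = np.array_split(data, n_nodes)        # simulate k computing nodes
    local = np.array([base_robust_estimator(s) for s in shards])
    return np.median(local, axis=0)               # robust aggregation step

# Example: robust location estimation on grossly contaminated data.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(10_000, 3))
outliers = rng.normal(50.0, 1.0, size=(500, 3))   # ~5% gross outliers
data = np.vstack([clean, outliers])
rng.shuffle(data)

# Coordinate-wise median as the (robust) base estimator on each shard.
print(drl_estimate(data, lambda s: np.median(s, axis=0), n_nodes=16))
# Output is close to the true location (0, 0, 0) despite contamination.
```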
Communication Efficient Checking of Big Data Operations
We propose fast probabilistic algorithms with low (i.e., sublinear in the
input size) communication volume to check the correctness of operations in Big
Data processing frameworks and distributed databases. Our checkers cover many
of the commonly used operations, including sum, average, median, and minimum
aggregation, as well as sorting, union, merge, and zip. An experimental
evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the
low overhead and high failure detection rate predicted by theoretical analysis.
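The abstract does not give the checkers' constructions, but a standard route to sublinear communication volume is probabilistic fingerprinting. The hypothetical Python sketch below verifies that a sort's output is a permutation of its input by comparing polynomial multiset fingerprints at a random point; per-worker fingerprints combine by modular multiplication, so each worker would ship only one machine word. It illustrates the general idea, not the specific checkers implemented in Thrill.

```python
import random

P = (1 << 61) - 1  # large Mersenne prime for modular arithmetic

def fingerprint(values, r):
    """Polynomial multiset fingerprint: prod over v of (r - v) mod P.
    Equal multisets always agree; unequal multisets of size n collide
    with probability at most n/P (Schwartz-Zippel)."""
    f = 1
    for v in values:
        f = f * ((r - v) % P) % P
    return f

# Simulated check that a "sorted" output is a permutation of the input.
# (Sortedness itself can be checked locally plus one boundary exchange.)
inputs = [random.randrange(10**9) for _ in range(100_000)]
output = sorted(inputs)        # the operation under test

r = random.randrange(P)        # shared random challenge
assert fingerprint(inputs, r) == fingerprint(output, r)
```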
Split-domain calibration of an ecosystem model using satellite ocean colour data
The application of satellite ocean colour data to the calibration of plankton
ecosystem models for large geographic domains, over which their ideal parameters cannot be assumed to be invariant, is investigated. A method is presented for seeking the number and geographic scope of parameter sets which allows the best fit to validation data to be achieved. These are independent data not used in the parameter estimation process. The goodness-of-fit of the optimally calibrated model to the validation data is an objective measure of merit for the model, together with its external forcing data. Importantly, this is a statistic which can be used for comparative evaluation of different models. The method makes use of observations from multiple locations, referred to as stations, distributed across the geographic domain. It relies on a technique for finding groups of stations which can be aggregated for parameter estimation purposes with minimal increase in the resulting misfit between model and observations. The results of testing this split-domain calibration method for a simple zero-dimensional model, using observations from 30 stations in the North Atlantic, are presented. The stations are divided into separate calibration and validation sets.
One year of ocean colour data from each station was used in conjunction with a
climatological estimate of the station's annual nitrate maximum. The results
demonstrate the practical utility of the method and imply that an optimal fit of the model to the validation data would be given by two parameter sets. The corresponding division of the North Atlantic domain into two provinces achieves a misfit-based cost 25% lower than that of the single parameter set obtained using all of the calibration stations. In general, parameters are poorly constrained, contributing to a high degree of uncertainty in model output for unobserved variables. This suggests that limited progress towards a definitive model calibration can be made without including other types of observations.
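A minimal sketch of the station-grouping step, assuming a greedy agglomerative strategy: starting from singleton groups, repeatedly merge the pair of station groups whose joint calibration raises the total misfit the least. The `calibrate` and `misfit` callbacks are hypothetical model-specific placeholders, and the merge criterion is inferred from the description above rather than taken from the paper.

```python
import itertools

def group_stations(stations, calibrate, misfit, n_groups):
    """Greedily merge station groups for joint parameter estimation.
    `calibrate(group)` fits one parameter set to a group of stations;
    `misfit(params, group)` scores the resulting model-data misfit."""
    groups = [[s] for s in stations]
    while len(groups) > n_groups:
        best = None
        for a, b in itertools.combinations(range(len(groups)), 2):
            merged = groups[a] + groups[b]
            # Increase in misfit from sharing one parameter set
            # instead of calibrating the two groups separately.
            delta = (misfit(calibrate(merged), merged)
                     - misfit(calibrate(groups[a]), groups[a])
                     - misfit(calibrate(groups[b]), groups[b]))
            if best is None or delta < best[0]:
                best = (delta, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# Toy illustration: stations reduced to scalar "ideal parameter" values;
# a group's calibrated parameter is its mean, misfit is within-group SSE.
stations = [0.10, 0.12, 0.15, 0.90, 0.95]
cal = lambda g: sum(g) / len(g)
mis = lambda p, g: sum((s - p) ** 2 for s in g)
print(group_stations(stations, cal, mis, n_groups=2))
# -> [[0.10, 0.12, 0.15], [0.90, 0.95]]
```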
Multiple Approaches to Absenteeism Analysis
Absenteeism research has often been criticized for using inappropriate analysis. Characteristics of absence data, notably that it is usually truncated and skewed, violate assumptions of OLS regression; however, OLS and correlation analysis remain the dominant models of absenteeism research. This piece compares eight models that may be appropriate for analyzing absence data: OLS regression, OLS regression with a transformed dependent variable, the Tobit model, Poisson regression, overdispersed Poisson regression, the negative binomial model, ordinal logistic regression, and the ordinal probit model. A simulation methodology is employed to determine the extent to which each model is likely to produce false positives. Simulations vary with respect to the shape of the dependent variable's distribution, sample size, and the shape of the independent variables' distributions. Actual data, based on a sample of 195 manufacturing employees, are used to illustrate how these models might be applied to a real data set. Results from the simulation suggest that, despite methodological expectations, OLS regression does not produce significantly more false positives than expected at various alpha levels. However, the Tobit and Poisson models are often shown to yield too many false positives. A number of other models yield fewer than the expected number of false positives, suggesting that they may serve well as conservative hypothesis tests.
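For concreteness, the Python sketch below fits three of the eight candidate models (OLS, Poisson, and negative binomial) with statsmodels on simulated truncated, right-skewed counts of the kind the paper describes. The data-generating process and variable names are illustrative assumptions, not the paper's sample, and the Tobit and ordinal models are omitted for brevity.

```python
import numpy as np
import statsmodels.api as sm

# Simulated absence counts: bounded below at zero and right-skewed,
# the data shape that violates OLS assumptions.
rng = np.random.default_rng(1)
n = 195                                     # matches the sample size above
x = rng.normal(size=n)                      # e.g., a job-satisfaction score
mu = np.exp(0.5 - 0.4 * x)                  # expected absences fall with x
y = rng.negative_binomial(2, 2 / (2 + mu))  # overdispersed counts, mean mu

X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
poisson = sm.Poisson(y, X).fit(disp=False)
negbin = sm.NegativeBinomial(y, X).fit(disp=False)

for name, res in [("OLS", ols), ("Poisson", poisson), ("NegBin", negbin)]:
    print(f"{name:8s} slope={res.params[1]:.3f}  p={res.pvalues[1]:.4f}")
```

Comparing the slope estimates and p-values across the three fits mirrors, in miniature, the paper's question of which models over- or under-state significance on such data.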
Identifying Keystone Species in the Human Gut Microbiome from Metagenomic Timeseries using Sparse Linear Regression
Human associated microbial communities exert tremendous influence over human
health and disease. With modern metagenomic sequencing methods it is possible
to follow the relative abundance of microbes in a community over time. These
microbial communities exhibit rich ecological dynamics and an important goal of
microbial ecology is to infer the interactions between species from sequence
data. Any algorithm for inferring species interactions must overcome three
obstacles: 1) a correlation between the abundances of two species does not
imply that those species are interacting, 2) the sum constraint on the relative
abundances obtained from metagenomic studies makes it difficult to infer the
parameters in timeseries models, and 3) errors due to experimental uncertainty,
or mis-assignment of sequencing reads into operational taxonomic units, bias
inferences of species interactions. Here we introduce an approach, Learning
Interactions from MIcrobial Time Series (LIMITS), that overcomes these
obstacles. LIMITS uses sparse linear regression with bootstrap aggregation to
obstacles. LIMITS uses sparse linear regression with boostrap aggregation to
infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested
LIMITS on synthetic data and showed that it could reliably infer the topology
of the inter-species ecological interactions. We then used LIMITS to
characterize the species interactions in the gut microbiomes of two individuals
and found that the interaction networks varied significantly between
individuals. Furthermore, we found that the interaction networks of the two
individuals are dominated by distinct "keystone species", Bacteroides fragilis
and Bacteroides stercoris, that have a disproportionate influence on the
structure of the gut microbiome even though they are only found in moderate
abundance. Based on our results, we hypothesize that the abundances of certain
keystone species may be responsible for individuality in the human gut
microbiome.
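A condensed sketch of the bagged sparse-regression idea: regress each species' log growth ratio on the abundances of all species at the previous time point, resample time points with replacement, and keep only interactions selected in most bootstrap fits. scikit-learn's Lasso is substituted here for the paper's forward stepwise selection, and the `limits_like` function and its thresholds are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def limits_like(x, alpha=0.01, n_boot=100, keep_frac=0.5, seed=0):
    """Bagged sparse regression for a discrete-time Lotka-Volterra model.
    x: (T, S) array of strictly positive species abundances over T times.
    Returns an (S, S) matrix C; C[i, j] estimates the effect of species j
    on the per-capita growth rate of species i."""
    T, S = x.shape
    y = np.log(x[1:] / x[:-1])     # log growth ratios, shape (T-1, S)
    X = x[:-1]                     # predictors: abundances at time t
    rng = np.random.default_rng(seed)
    C = np.zeros((S, S))
    for i in range(S):
        coefs = []
        for _ in range(n_boot):
            idx = rng.integers(0, T - 1, size=T - 1)  # bootstrap time points
            coefs.append(Lasso(alpha=alpha).fit(X[idx], y[idx, i]).coef_)
        coefs = np.array(coefs)
        # Keep an interaction only if it is selected (nonzero) in at
        # least `keep_frac` of the bootstrap fits; use the median value.
        selected = (np.abs(coefs) > 1e-10).mean(axis=0) >= keep_frac
        C[i] = np.where(selected, np.median(coefs, axis=0), 0.0)
    return C
```

In such a matrix, a candidate keystone species would show up as a column with many large entries: a species whose abundance shifts the growth rates of many others even if its own abundance is only moderate.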