Distributed Robust Learning
We propose a framework for distributed robust statistical learning on {\em
big contaminated data}. The Distributed Robust Learning (DRL) framework can
reduce the computational time of traditional robust learning methods by several
orders of magnitude. We analyze the robustness property of DRL, showing that
DRL not only preserves the robustness of the base robust learning method, but
also tolerates contaminations on a constant fraction of results from computing
nodes (node failures). More precisely, even in the presence of the most
adversarial outlier distribution over computing nodes, DRL still achieves a
breakdown point of at least $\lambda^*/2$, where $\lambda^*$ is the breakdown
point of the corresponding centralized algorithm. This is in stark contrast
with the naive division-and-averaging implementation, which may reduce the
breakdown point by a factor of $k$ when $k$ computing nodes are used. We then specialize the
DRL framework for two concrete cases: distributed robust principal component
analysis and distributed robust regression. We demonstrate the efficiency and
the robustness advantages of DRL through comprehensive simulations and through
predicting image tags on a large-scale image set.
Comment: 18 pages, 2 figures
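As a rough illustration of the division-and-robust-aggregation idea, the Python sketch below shards contaminated data across simulated computing nodes, runs a robust base estimator on each shard, and combines the node results with a coordinate-wise median rather than a plain average. The abstract does not specify DRL's aggregation rule, so the median step and the `drl_estimate` helper are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def drl_estimate(data, base_robust_estimator, n_nodes):
    """Shard rows of `data` across nodes, run a robust estimator on each
    shard, then aggregate the node results with a coordinate-wise median.
    Median aggregation tolerates corrupted results from up to (roughly)
    half of the nodes, whereas plain averaging can be ruined by one."""
    shards = np.array_split(data, n_nodes)        # simulate k computing nodes
    local = np.array([base_robust_estimator(s) for s in shards])
    return np.median(local, axis=0)               # robust aggregation step

# Example: robust location estimation on grossly contaminated data.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(10_000, 3))
outliers = rng.normal(50.0, 1.0, size=(500, 3))   # ~5% gross outliers
data = np.vstack([clean, outliers])
rng.shuffle(data)

# Coordinate-wise median as the (robust) base estimator on each shard.
print(drl_estimate(data, lambda s: np.median(s, axis=0), n_nodes=16))
# Output is close to the true location (0, 0, 0) despite contamination.
```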
Communication Efficient Checking of Big Data Operations
We propose fast probabilistic algorithms with low (i.e., sublinear in the
input size) communication volume to check the correctness of operations in Big
Data processing frameworks and distributed databases. Our checkers cover many
of the commonly used operations, including sum, average, median, and minimum
aggregation, as well as sorting, union, merge, and zip. An experimental
evaluation of our implementation in Thrill (Bingmann et al., 2016) confirms the
low overhead and high failure detection rate predicted by theoretical analysis.
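The abstract does not give the checkers' constructions, but a standard route to sublinear communication volume is probabilistic fingerprinting. The hypothetical Python sketch below verifies that a sort's output is a permutation of its input by comparing polynomial multiset fingerprints at a random point; per-worker fingerprints combine by modular multiplication, so each worker would ship only one machine word. It illustrates the general idea, not the specific checkers implemented in Thrill.

```python
import random

P = (1 << 61) - 1  # large Mersenne prime for modular arithmetic

def fingerprint(values, r):
    """Polynomial multiset fingerprint: prod over v of (r - v) mod P.
    Equal multisets always agree; unequal multisets of size n collide
    with probability at most n/P (Schwartz-Zippel)."""
    f = 1
    for v in values:
        f = f * ((r - v) % P) % P
    return f

# Simulated check that a "sorted" output is a permutation of the input.
# (Sortedness itself can be checked locally plus one boundary exchange.)
inputs = [random.randrange(10**9) for _ in range(100_000)]
output = sorted(inputs)        # the operation under test

r = random.randrange(P)        # shared random challenge
assert fingerprint(inputs, r) == fingerprint(output, r)
```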
Split-domain calibration of an ecosystem model using satellite ocean colour data
The application of satellite ocean colour data to the calibration of plankton
ecosystem models for large geographic domains, over which their ideal parameters cannot be assumed to be invariant, is investigated. A method is presented for seeking the number and geographic scope of parameter sets which allows the best fit to validation data to be achieved. These are independent data not used in the parameter estimation process. The goodness-of-fit of the optimally calibrated model to the validation data is an objective measure of merit for the model, together with its external forcing data. Importantly, this is a statistic which can be used for comparative evaluation of different models. The method makes use of observations from multiple locations, referred to as stations, distributed across the geographic domain. It relies on a technique for finding groups of stations which can be aggregated for parameter estimation purposes with minimal increase in the resulting misfit between model and observations. The results of testing this split-domain calibration method for a simple zero-dimensional model, using observations from 30 stations in the North Atlantic, are presented. The stations are divided into separate calibration and validation sets.
One year of ocean colour data from each station was used in conjunction with a
climatological estimate of the station's annual nitrate maximum. The results
demonstrate the practical utility of the method and imply that an optimal fit of the model to the validation data would be given by two parameter sets. The corresponding division of the North Atlantic domain into two provinces achieves a misfit-based cost 25% lower than that of the single parameter set obtained using all of the calibration stations. In general, parameters are poorly constrained, contributing to a high degree of uncertainty in model output for unobserved variables. This suggests that limited progress towards a definitive model calibration can be made without including other types of observations.
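A minimal sketch of the station-grouping step, assuming a greedy agglomerative strategy: starting from singleton groups, repeatedly merge the pair of station groups whose joint calibration raises the total misfit the least. The `calibrate` and `misfit` callbacks are hypothetical model-specific placeholders, and the merge criterion is inferred from the description above rather than taken from the paper.

```python
import itertools

def group_stations(stations, calibrate, misfit, n_groups):
    """Greedily merge station groups for joint parameter estimation.
    `calibrate(group)` fits one parameter set to a group of stations;
    `misfit(params, group)` scores the resulting model-data misfit."""
    groups = [[s] for s in stations]
    while len(groups) > n_groups:
        best = None
        for a, b in itertools.combinations(range(len(groups)), 2):
            merged = groups[a] + groups[b]
            # Increase in misfit from sharing one parameter set
            # instead of calibrating the two groups separately.
            delta = (misfit(calibrate(merged), merged)
                     - misfit(calibrate(groups[a]), groups[a])
                     - misfit(calibrate(groups[b]), groups[b]))
            if best is None or delta < best[0]:
                best = (delta, a, b)
        _, a, b = best
        groups[a] = groups[a] + groups[b]
        del groups[b]
    return groups

# Toy illustration: stations reduced to scalar "ideal parameter" values;
# a group's calibrated parameter is its mean, misfit is within-group SSE.
stations = [0.10, 0.12, 0.15, 0.90, 0.95]
cal = lambda g: sum(g) / len(g)
mis = lambda p, g: sum((s - p) ** 2 for s in g)
print(group_stations(stations, cal, mis, n_groups=2))
# -> [[0.10, 0.12, 0.15], [0.90, 0.95]]
```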
Multiple Approaches to Absenteeism Analysis
Absenteeism research has often been criticized for using inappropriate analysis. Characteristics of absence data, notably that it is usually truncated and skewed, violate assumptions of OLS regression; however, OLS and correlation analysis remain the dominant models of absenteeism research. This piece compares eight models that may be appropriate for analyzing absence data: OLS regression, OLS regression with a transformed dependent variable, the Tobit model, Poisson regression, overdispersed Poisson regression, the negative binomial model, ordinal logistic regression, and the ordinal probit model. A simulation methodology is employed to determine the extent to which each model is likely to produce false positives. Simulations vary with respect to the shape of the dependent variable's distribution, sample size, and the shape of the independent variables' distributions. Actual data, based on a sample of 195 manufacturing employees, are used to illustrate how these models might be applied to a real data set. Results from the simulation suggest that, despite methodological expectations, OLS regression does not produce significantly more false positives than expected at various alpha levels. However, the Tobit and Poisson models are often shown to yield too many false positives. A number of other models yield fewer than the expected number of false positives, suggesting that they may serve well as conservative hypothesis tests.
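For concreteness, the Python sketch below fits three of the eight candidate models (OLS, Poisson, and negative binomial) with statsmodels on simulated truncated, right-skewed counts of the kind the paper describes. The data-generating process and variable names are illustrative assumptions, not the paper's sample, and the Tobit and ordinal models are omitted for brevity.

```python
import numpy as np
import statsmodels.api as sm

# Simulated absence counts: bounded below at zero and right-skewed,
# the data shape that violates OLS assumptions.
rng = np.random.default_rng(1)
n = 195                                     # matches the sample size above
x = rng.normal(size=n)                      # e.g., a job-satisfaction score
mu = np.exp(0.5 - 0.4 * x)                  # expected absences fall with x
y = rng.negative_binomial(2, 2 / (2 + mu))  # overdispersed counts, mean mu

X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
poisson = sm.Poisson(y, X).fit(disp=False)
negbin = sm.NegativeBinomial(y, X).fit(disp=False)

for name, res in [("OLS", ols), ("Poisson", poisson), ("NegBin", negbin)]:
    print(f"{name:8s} slope={res.params[1]:.3f}  p={res.pvalues[1]:.4f}")
```

Comparing the slope estimates and p-values across the three fits mirrors, in miniature, the paper's question of which models over- or under-state significance on such data.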
Identifying Keystone Species in the Human Gut Microbiome from Metagenomic Timeseries using Sparse Linear Regression
Human associated microbial communities exert tremendous influence over human
health and disease. With modern metagenomic sequencing methods it is possible
to follow the relative abundance of microbes in a community over time. These
microbial communities exhibit rich ecological dynamics and an important goal of
microbial ecology is to infer the interactions between species from sequence
data. Any algorithm for inferring species interactions must overcome three
obstacles: 1) a correlation between the abundances of two species does not
imply that those species are interacting, 2) the sum constraint on the relative
abundances obtained from metagenomic studies makes it difficult to infer the
parameters in timeseries models, and 3) errors due to experimental uncertainty,
or mis-assignment of sequencing reads into operational taxonomic units, bias
inferences of species interactions. Here we introduce an approach, Learning
Interactions from MIcrobial Time Series (LIMITS), that overcomes these
obstacles. LIMITS uses sparse linear regression with bootstrap aggregation to
obstacles. LIMITS uses sparse linear regression with boostrap aggregation to
infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested
LIMITS on synthetic data and showed that it could reliably infer the topology
of the inter-species ecological interactions. We then used LIMITS to
characterize the species interactions in the gut microbiomes of two individuals
and found that the interaction networks varied significantly between
individuals. Furthermore, we found that the interaction networks of the two
individuals are dominated by distinct "keystone species", Bacteroides fragilis
and Bacteroides stercoris, that have a disproportionate influence on the
structure of the gut microbiome even though they are only found in moderate
abundance. Based on our results, we hypothesize that the abundances of certain
keystone species may be responsible for individuality in the human gut
microbiome.
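A condensed sketch of the bagged sparse-regression idea: regress each species' log growth ratio on the abundances of all species at the previous time point, resample time points with replacement, and keep only interactions selected in most bootstrap fits. scikit-learn's Lasso is substituted here for the paper's forward stepwise selection, and the `limits_like` function and its thresholds are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def limits_like(x, alpha=0.01, n_boot=100, keep_frac=0.5, seed=0):
    """Bagged sparse regression for a discrete-time Lotka-Volterra model.
    x: (T, S) array of strictly positive species abundances over T times.
    Returns an (S, S) matrix C; C[i, j] estimates the effect of species j
    on the per-capita growth rate of species i."""
    T, S = x.shape
    y = np.log(x[1:] / x[:-1])     # log growth ratios, shape (T-1, S)
    X = x[:-1]                     # predictors: abundances at time t
    rng = np.random.default_rng(seed)
    C = np.zeros((S, S))
    for i in range(S):
        coefs = []
        for _ in range(n_boot):
            idx = rng.integers(0, T - 1, size=T - 1)  # bootstrap time points
            coefs.append(Lasso(alpha=alpha).fit(X[idx], y[idx, i]).coef_)
        coefs = np.array(coefs)
        # Keep an interaction only if it is selected (nonzero) in at
        # least `keep_frac` of the bootstrap fits; use the median value.
        selected = (np.abs(coefs) > 1e-10).mean(axis=0) >= keep_frac
        C[i] = np.where(selected, np.median(coefs, axis=0), 0.0)
    return C
```

In such a matrix, a candidate keystone species would show up as a column with many large entries: a species whose abundance shifts the growth rates of many others even if its own abundance is only moderate.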