New statistical method identifies cytokines that distinguish stool microbiomes
Regressing an outcome or dependent variable onto a set of input or independent variables allows the analyst to measure associations between the two, so that changes in the outcome can be described and predicted by changes in the inputs. Classical statistics offers many ways of doing this when the dependent variable has certain properties (e.g., a scalar, survival time, or count), but little progress has been made on regression where the dependent variables are microbiome taxa counts without imposing extremely strict conditions on the data. In this paper, we propose and apply a new regression model that combines the Dirichlet-multinomial distribution with recursive partitioning, providing a fully non-parametric regression model. This model, called DM-RPart, is applied to cytokine data and microbiome taxa count data. It is applicable to any microbiome taxa count/metadata, is automatically fit, and is intuitively interpretable. The model can be applied to any microbiome or other compositional data, and software (R package HMP) is available through the R CRAN website.
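As a rough illustration of the Dirichlet-multinomial component named in this abstract, the sketch below evaluates the log-probability of a taxa count vector under a Dirichlet-multinomial with a given parameter vector. It is not the HMP package's implementation and does not show the recursive partitioning step. The function names `dm_log_pmf` and `group_log_likelihood` are hypothetical, and the idea of scoring a candidate group of samples by its summed log-likelihood is an assumption made only for illustration.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of one count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts, one per taxon.
    alpha:  positive Dirichlet parameters, one per taxon.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)   # total reads in the sample
    a = sum(alpha)    # total concentration
    log_p = math.lgamma(a) + math.lgamma(n + 1) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        log_p += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return log_p


def group_log_likelihood(samples, alpha):
    """Summed log-likelihood of several samples sharing one alpha (hypothetical helper)."""
    return sum(dm_log_pmf(s, alpha) for s in samples)
```

With two taxa and `alpha = [1.0, 1.0]`, every split of two reads is equally likely, so `dm_log_pmf([1, 1], [1.0, 1.0])` equals `log(1/3)`.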
Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization
Undetected overfitting can occur when there are significant redundancies
between training and validation data. We describe AVE, a new measure of
training-validation redundancy for ligand-based classification problems that
accounts for the similarity amongst inactive molecules as well as active. We
investigated seven widely-used benchmarks for virtual screening and
classification, and show that the amount of AVE bias strongly correlates with
the performance of ligand-based predictive methods irrespective of the
predicted property, chemical fingerprint, similarity measure, or
previously-applied unbiasing techniques. Therefore, it may be that the
previously-reported performance of most ligand-based methods can be explained
by overfitting to benchmarks rather than good prospective accuracy.
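The abstract does not give the AVE formula, so the following is only a sketch of one nearest-neighbor style redundancy score of the kind it describes. It compares how close validation actives and inactives lie to training actives and inactives. The combination of four terms used here, the Tanimoto distance on fingerprint bit sets, and all function names are assumptions made for illustration. The paper itself should be consulted for the exact definition.

```python
def tanimoto_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def nn_fraction(valid, train, thresholds):
    """Average, over distance thresholds, of the fraction of validation
    molecules whose nearest training molecule lies within the threshold."""
    nearest = [min(tanimoto_distance(v, t) for t in train) for v in valid]
    fractions = [sum(d <= thr for d in nearest) / len(nearest) for thr in thresholds]
    return sum(fractions) / len(fractions)


def ave_style_bias(valid_act, valid_inact, train_act, train_inact, thresholds):
    """Assumed form: (actives near actives - actives near inactives)
    + (inactives near inactives - inactives near actives)."""
    active_term = (nn_fraction(valid_act, train_act, thresholds)
                   - nn_fraction(valid_act, train_inact, thresholds))
    inactive_term = (nn_fraction(valid_inact, train_inact, thresholds)
                     - nn_fraction(valid_inact, train_act, thresholds))
    return active_term + inactive_term
```

Under this assumed form, a score near zero means validation molecules are no closer to same-class training molecules than to opposite-class ones, while a large positive score suggests that nearest-neighbor memorization alone could separate the classes.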
Rule-based Machine Learning Methods for Functional Prediction
We describe a machine learning method for predicting the value of a
real-valued function, given the values of multiple input variables. The method
induces solutions from samples in the form of ordered disjunctive normal form
(DNF) decision rules. A central objective of the method and representation is
the induction of compact, easily interpretable solutions. This rule-based
decision model can be extended to search efficiently for similar cases prior to
approximating function values. Experimental results on real-world data
demonstrate that the new techniques are competitive with existing machine
learning and statistical methods and can sometimes yield superior regression
performance. Comment: See http://www.jair.org/ for any accompanying file.
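To make the representation concrete, the sketch below shows how an ordered list of decision rules can produce a real-valued prediction: each rule is a conjunction of conditions, the first rule that fires supplies the value, and a default applies when none fires. It covers prediction only, not the induction procedure the abstract refers to. The `Rule` class, the `predict` function, and the example thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Condition = Callable[[Dict[str, float]], bool]


@dataclass
class Rule:
    conditions: List[Condition]  # conjunction: all must hold for the rule to fire
    value: float                 # real-valued prediction when the rule fires

    def matches(self, x: Dict[str, float]) -> bool:
        return all(cond(x) for cond in self.conditions)


def predict(rules: List[Rule], default: float, x: Dict[str, float]) -> float:
    """Return the value of the first matching rule, in order, else the default."""
    for rule in rules:
        if rule.matches(x):
            return rule.value
    return default


# Hypothetical rule list over two input variables "a" and "b".
example_rules = [
    Rule([lambda x: x["a"] > 5, lambda x: x["b"] <= 1], 10.0),
    Rule([lambda x: x["a"] > 2], 4.0),
]
```

Because the rules are ordered, a later rule implicitly carries the negation of every earlier one, which is what keeps each individual rule short and readable.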
Testate amoebae as a proxy for reconstructing Holocene water table dynamics in southern Patagonian peat bogs
Acknowledgements. This work was supported by the Natural Environment Research Council (grant numbers NE/I022809/1, NE/I022981/1, NE/I022833/1 and NE/I023104/1). We thank Ricardo Muza and the Wildlife Conservation Society (WCS) Karukinka Park rangers for facilitating access to Karukinka Park. We also thank François De Vleeschouwer, Gaël Le Roux, Heleen Vanneste, Sébastien Bertrand, Zakaria Ghazoui and Jean-Yves De Vleeschouwer for fieldwork assistance. Nelson Bahamonde (INIA, Punta Arenas, Chile) and Ernesto Teneb (UMag, Punta Arenas, Chile) provided logistical support for the fieldwork in Chile. Dr Andrea Coronato (CADIC, Ushuaia) kindly provided logistical support for the research in Argentina. Thanks to Jenny Johnston for cartography, David Jolley for assistance in microscopic photography and Audrey Innes for laboratory assistance. We highly appreciate reviews by Matt Amesbury and an anonymous reviewer. R.P. is supported by an Impact Fellowship from the University of Stirling. Peer reviewed.
Bandwidth selection for kernel estimation in mixed multi-dimensional spaces
Kernel estimation techniques, such as mean shift, suffer from one major
drawback: the kernel bandwidth selection. The bandwidth can be fixed for the
whole data set or can vary at each point. Automatic bandwidth selection becomes
a real challenge in the case of multidimensional heterogeneous features. This paper
presents a solution to this problem. It is an extension of \cite{Comaniciu03a}
which was based on the fundamental property of normal distributions regarding
the bias of the normalized density gradient. The selection is done iteratively
for each type of feature, by looking for the stability of local bandwidth
estimates across a predefined range of bandwidths. Pseudo balloon mean shift
filtering and partitioning are introduced. The validity of the method is
demonstrated in the context of color image segmentation based on a
5-dimensional space.
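To show why bandwidth matters for mean shift, here is a minimal sketch of the plain fixed-bandwidth procedure with a Gaussian kernel. It is not the variable-bandwidth selection method this abstract proposes. The function name `mean_shift_point`, its parameters, and the toy data are assumptions made for illustration.

```python
import math


def mean_shift_point(start, data, bandwidth, max_iter=100, tol=1e-6):
    """Move one point uphill to a density mode using a fixed Gaussian bandwidth."""
    x = list(start)
    for _ in range(max_iter):
        # Gaussian weight of every data point relative to the current position.
        weights = [
            math.exp(-0.5 * sum((xi - pi) ** 2 for xi, pi in zip(x, p)) / bandwidth ** 2)
            for p in data
        ]
        total = sum(weights)
        if total == 0.0:
            break  # all weights underflowed: the point is too far from the data
        # The weighted mean of the data is the next position.
        new_x = [sum(w * p[k] for w, p in zip(weights, data)) / total
                 for k in range(len(x))]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_x, x)))
        x = new_x
        if shift < tol:
            break
    return x


# Toy one-dimensional data with two well-separated clusters.
toy_data = [(0.0,), (0.1,), (-0.1,), (5.0,), (5.1,), (4.9,)]
```

On this toy data, a bandwidth of 0.5 keeps the two clusters as separate modes, whereas a bandwidth of 10 merges them into a single mode near the overall mean. That sensitivity is the problem automatic bandwidth selection tries to address.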