
    New statistical method identifies cytokines that distinguish stool microbiomes

    Regressing an outcome (dependent variable) onto a set of input (independent) variables allows the analyst to measure associations between the two, so that changes in the outcome can be described and predicted by changes in the inputs. Classical statistics offers many ways of doing this when the dependent variable has certain properties (e.g., a scalar, survival time, or count), but little progress has been made on regression methods for microbiome taxa counts as the dependent variable that do not impose extremely strict conditions on the data. In this paper, we propose and apply a new regression model that combines the Dirichlet-multinomial distribution with recursive partitioning, providing a fully non-parametric regression model. This model, called DM-RPart, is applied to cytokine data and microbiome taxa count data; it is applicable to any microbiome taxa count data with metadata or to other compositional data, is fit automatically, and is intuitively interpretable. Software (R package HMP) is available through the R CRAN website.
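To make the combination of a Dirichlet-multinomial likelihood with recursive partitioning concrete, here is a minimal Python sketch of a single split step. It is not the DM-RPart algorithm or the HMP package's code: the `crude_alpha` estimator, the fixed `concentration` value, and the toy `cytokine` and `taxa` data are all assumptions made for illustration. The only standard ingredient is the Dirichlet-multinomial log-probability itself.

```python
import math

def dm_log_likelihood(counts, alpha):
    """Log-probability of one taxa-count vector under a Dirichlet-multinomial."""
    n = sum(counts)
    a = sum(alpha)
    ll = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for x, al in zip(counts, alpha):
        ll += math.lgamma(x + al) - math.lgamma(al) - math.lgamma(x + 1)
    return ll

def crude_alpha(samples, concentration=10.0):
    """Assumed shortcut (not a maximum-likelihood fit):
    alpha = concentration * smoothed pooled taxa proportions."""
    k = len(samples[0])
    pooled = [sum(s[j] for s in samples) + 0.5 for j in range(k)]
    total = sum(pooled)
    return [concentration * p / total for p in pooled]

def group_log_likelihood(samples):
    """Total log-likelihood of a group of samples under its own crude alpha."""
    alpha = crude_alpha(samples)
    return sum(dm_log_likelihood(s, alpha) for s in samples)

def best_split(covariate, samples):
    """Return (threshold, gain) for the best binary split on one covariate."""
    parent = group_log_likelihood(samples)
    best = (None, 0.0)
    values = sorted(set(covariate))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2  # candidate threshold between adjacent values
        left = [s for c, s in zip(covariate, samples) if c <= t]
        right = [s for c, s in zip(covariate, samples) if c > t]
        gain = group_log_likelihood(left) + group_log_likelihood(right) - parent
        if gain > best[1]:
            best = (t, gain)
    return best

# Hypothetical data: one cytokine level and three taxa counts per sample.
cytokine = [0.1, 0.2, 0.3, 2.1, 2.4, 2.2]
taxa = [[40, 5, 5], [38, 7, 5], [42, 4, 4], [6, 40, 4], [5, 38, 7], [4, 41, 5]]
threshold, gain = best_split(cytokine, taxa)
print(threshold, gain)
```

A full tree would apply `best_split` recursively to each child group and over every available covariate, stopping when no split improves the likelihood enough.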

    Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization

    Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity among inactive molecules as well as active ones. We investigated seven widely-used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously-applied unbiasing techniques. Therefore, it may be that the previously-reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.

    Rule-based Machine Learning Methods for Functional Prediction

    We describe a machine learning method for predicting the value of a real-valued function, given the values of multiple input variables. The method induces solutions from samples in the form of ordered disjunctive normal form (DNF) decision rules. A central objective of the method and representation is the induction of compact, easily interpretable solutions. This rule-based decision model can be extended to search efficiently for similar cases prior to approximating function values. Experimental results on real-world data demonstrate that the new techniques are competitive with existing machine learning and statistical methods and can sometimes yield superior regression performance. Comment: See http://www.jair.org/ for any accompanying file.
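As an illustration of what an ordered list of DNF decision rules looks like when used for regression, the following Python sketch evaluates such a list. It covers only prediction, not the rule induction the abstract describes, and the feature names, thresholds, and output values are hypothetical.

```python
import operator

# Comparison operators allowed inside a condition.
OPS = {"<=": operator.le, ">": operator.gt}

def condition_holds(condition, x):
    """A condition is a (feature, op, threshold) triple tested on the dict x."""
    feature, op, threshold = condition
    return OPS[op](x[feature], threshold)

def dnf_matches(dnf, x):
    """A DNF fires if any conjunction has all of its conditions satisfied."""
    return any(all(condition_holds(c, x) for c in conj) for conj in dnf)

def predict(rules, default, x):
    """Rules are ordered: the first rule that fires determines the output."""
    for dnf, value in rules:
        if dnf_matches(dnf, x):
            return value
    return default

# Hypothetical rule list: (DNF, predicted value) pairs, checked top to bottom.
rules = [
    ([[("temp", ">", 30.0), ("humidity", "<=", 40.0)],
      [("wind", ">", 20.0)]], 8.5),
    ([[("temp", "<=", 10.0)]], 1.0),
]
default_value = 4.0

hot_dry = predict(rules, default_value, {"temp": 35.0, "humidity": 30.0, "wind": 5.0})
cold_windy = predict(rules, default_value, {"temp": 5.0, "humidity": 80.0, "wind": 25.0})
cold_calm = predict(rules, default_value, {"temp": 5.0, "humidity": 80.0, "wind": 5.0})
mild = predict(rules, default_value, {"temp": 20.0, "humidity": 50.0, "wind": 5.0})
```

The `cold_windy` case shows why the ordering matters: it satisfies both rules, and the earlier rule wins.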

    Testate amoebae as a proxy for reconstructing Holocene water table dynamics in southern Patagonian peat bogs

    Acknowledgements: This work was supported by the Natural Environment Research Council (grant numbers NE/I022809/1, NE/I022981/1, NE/I022833/1 and NE/I023104/1). We thank Ricardo Muza and the Wildlife Conservation Society (WCS) Karukinka Park rangers for facilitating access to Karukinka Park. We also thank François De Vleeschouwer, Gaël Le Roux, Heleen Vanneste, Sébastien Bertrand, Zakaria Ghazoui and Jean-Yves De Vleeschouwer for fieldwork assistance. Nelson Bahamonde (INIA, Punta Arenas, Chile) and Ernesto Teneb (UMag, Punta Arenas, Chile) provided logistical support for the fieldwork in Chile. Dr Andrea Coronato (CADIC, Ushuaia) kindly provided logistical support for the research in Argentina. Thanks to Jenny Johnston for cartography, David Jolley for assistance in microscopic photography and Audrey Innes for laboratory assistance. We highly appreciate reviews by Matt Amesbury and an anonymous reviewer. R.P. is supported by an Impact Fellowship from the University of Stirling. Peer reviewed.

    Bandwidth selection for kernel estimation in mixed multi-dimensional spaces

    Kernel estimation techniques, such as mean shift, suffer from one major drawback: the kernel bandwidth selection. The bandwidth can be fixed for the whole data set or can vary at each point. Automatic bandwidth selection becomes a real challenge in the case of multidimensional heterogeneous features. This paper presents a solution to this problem. It is an extension of \cite{Comaniciu03a}, which was based on the fundamental property of normal distributions regarding the bias of the normalized density gradient. The selection is done iteratively for each type of feature, by looking for the stability of local bandwidth estimates across a predefined range of bandwidths. A pseudo balloon mean shift filtering and partitioning are introduced. The validity of the method is demonstrated in the context of color image segmentation based on a 5-dimensional space.
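The sensitivity to bandwidth that motivates this work can be seen in a small Python sketch of plain fixed-bandwidth mean shift with a Gaussian kernel in one dimension. This is the basic technique only, not the paper's bandwidth selection procedure or its pseudo balloon variant; the data, tolerances, and function names are assumptions chosen for illustration.

```python
import math

def mean_shift_mode(start, data, bandwidth, tol=1e-6, max_iter=500):
    """Follow mean shift updates from `start` until the point stops moving."""
    x = start
    for _ in range(max_iter):
        # Gaussian kernel weight of every data point relative to x.
        weights = [math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data]
        new_x = sum(w * d for w, d in zip(weights, data)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

def find_modes(data, bandwidth, merge_tol=1e-2):
    """Run mean shift from every data point and merge coinciding modes."""
    modes = []
    for d in data:
        m = mean_shift_mode(d, data, bandwidth)
        if not any(abs(m - other) < merge_tol for other in modes):
            modes.append(m)
    return modes

# Hypothetical 1-D data with two well-separated clusters.
data = [1.0, 1.1, 0.9, 1.2, 5.0, 5.1, 4.9, 5.2]
narrow = find_modes(data, bandwidth=0.5)  # small bandwidth keeps both clusters
wide = find_modes(data, bandwidth=5.0)    # large bandwidth merges them
print(len(narrow), len(wide))
```

With the narrow bandwidth the two clusters stay separate, while the wide bandwidth smooths them into a single mode between the clusters, which is why a poorly chosen fixed bandwidth can change the resulting partition.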