245 research outputs found

    Multiway calibration in 3D QSAR


    To aggregate or not to aggregate high-dimensional classifiers

    Background: High-throughput functional genomics technologies generate large amounts of data, with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, on the order of tens or hundreds. This poses statistical challenges and calls for appropriate solutions for the analysis of this kind of data. Results: Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) for high-dimensional data, was selected as an example of a base learner. The multiple versions of PCDA models from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets. Conclusions: The aggregated PCDA learner can improve prediction performance, provide more stable results, and help to assess the variability of the models. The disadvantages and limitations of aggregating are also discussed.
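
The majority-voting step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each cross-validated model has already produced one class label per sample. The function names `majority_vote` and `vote_fractions`, and the tie-breaking rule (smallest label in sorted order), are illustrative choices and are not taken from the paper. The PCDA models themselves are not implemented here.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Aggregate class labels predicted by several models by majority voting.

    predictions_per_model: list of lists; each inner list holds one model's
    predicted label for every sample (all inner lists have equal length).
    Ties are broken by taking the smallest label in sorted order, which is
    an arbitrary choice made for this sketch.
    """
    if not predictions_per_model:
        raise ValueError("need at least one model")
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("all models must predict the same samples")

    aggregated = []
    for sample_votes in zip(*predictions_per_model):
        counts = Counter(sample_votes)
        top = max(counts.values())
        winners = sorted(label for label, c in counts.items() if c == top)
        aggregated.append(winners[0])
    return aggregated


def vote_fractions(predictions_per_model):
    """Fraction of models that agree with the winning label, per sample."""
    winners = majority_vote(predictions_per_model)
    n_models = len(predictions_per_model)
    return [
        sum(1 for v in votes if v == w) / n_models
        for votes, w in zip(zip(*predictions_per_model), winners)
    ]
```

The per-sample vote fraction is one simple way to look at how much the individual models disagree, which relates to the abstract's point about model variability. Whether the authors quantified variability this way is not stated in the abstract.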

    Critical evaluation of assessor difference correction approaches in sensory analysis

    In sensory data analysis, assessor-dependent scaling effects may hinder the analysis of product differences. Romano et al. (2008) compared several approaches to reduce scaling differences between assessors by their ability to maximise the product effect F-values in a mixed ANOVA. Their study on a sensory dataset of 14 cheese samples assessed by twelve assessors on a continuous scale showed that some of these approaches apparently improved the F-value of the product effect. However, this direct comparison is only legitimate if these F-values originate from the same null distribution. To obtain the null distributions of the different correction methods, we employed a permutation approach on the same cheese dataset used by Romano et al. (2008) and a random noise simulation approach. Based on the empirically obtained null distributions, we calculated the corrected product effect significance to directly compare the performance of the preprocessing methods. Our results show that the null distributions of some preprocessing methods do not correspond to the expected F-distribution. In particular, for the ten Berge method the null distribution is shifted towards higher F-values. Therefore, an observed increase of the product effect F-value, as compared to the F-value on raw data, does not necessarily lead to increased product effect significance. If p-values are calculated based on such inflated F-values, significance may thus be overestimated. In contrast, calculation of p-values directly from the empirical null distributions obtained by permutation provides a common ground to properly compare method performance. Moreover, we show that differences in reproducibility between assessors, as they exist in real-world sensory datasets, may lead to overestimation of product effect significance by the mixed assessor model (MAM).
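
The idea of reading significance from an empirical null distribution rather than from the theoretical F-distribution can be sketched as follows. This is a deliberately simplified illustration: it uses a one-way ANOVA F-value for the product effect instead of the mixed ANOVA and the assessor correction methods studied in the paper, and it shuffles product labels freely rather than within assessors. The function names, the number of permutations, and the seed are assumptions made for this sketch.

```python
import random


def f_statistic(values, labels):
    """One-way ANOVA F-value of `values` grouped by `labels`."""
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    ss_within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs
    )
    if ss_within == 0:
        return float("inf")  # degenerate case: no within-group variation
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(values, labels, n_permutations=999, seed=0):
    """Return (observed F, empirical p-value) from label permutations.

    The p-value is the share of permuted F-values at least as large as the
    observed one, with the usual +1 correction so it is never exactly zero.
    """
    rng = random.Random(seed)
    observed = f_statistic(values, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if f_statistic(values, shuffled) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_permutations + 1)
```

To compare preprocessing methods in the spirit of the abstract, the same permutation loop would be run with the preprocessing applied inside each permutation, so that each method is judged against its own null distribution.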

    Metabolic network discovery through reverse engineering of metabolome data

    Reverse engineering of high-throughput omics data to infer underlying biological networks is one of the challenges in systems biology. However, applications in the field of metabolomics are rather limited. We have focused on a systematic analysis of metabolic network inference from in silico metabolome data based on statistical similarity measures. Three different data types based on biological/environmental variability around steady state were analyzed to compare the relative information content of the data types for inferring the network. Comparing the inference power of different similarity scores indicated the clear superiority of conditioning- or pruning-based scores, as they have the ability to eliminate indirect interactions. We also show that a mathematical measure based on the Fisher information matrix gives clues on the information quality of different data types to better represent the underlying metabolic network topology. Results on several datasets of increasing complexity consistently show that metabolic variations observed at steady state, the simplest experimental analysis, are already informative to reveal the connectivity of the underlying metabolic network with a low false-positive rate when proper similarity-score approaches are employed. For experimental situations this implies that a single organism under slightly varying conditions may already generate more than enough information to correctly infer networks. Detailed examination of the strengths of interactions of the underlying metabolic networks demonstrates that the edges that cannot be captured by similarity scores mainly belong to metabolites connected with weak interaction strength.
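
Why a conditioning-based score can remove an indirect interaction is easiest to see on a toy chain A → B → C. The sketch below uses first-order partial correlation as one example of such a score; the abstract does not say which specific scores were compared, so this choice is an assumption. The data are synthetic Gaussian noise, not a metabolic model, and the function names and parameters (`simulate_chain`, noise level 0.3, `n=500`) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after conditioning on a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_chain(n=500, seed=0):
    """Synthetic chain A -> B -> C: C depends on A only through B."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [ai + 0.3 * rng.gauss(0, 1) for ai in a]
    c = [bi + 0.3 * rng.gauss(0, 1) for bi in b]
    return a, b, c


a, b, c = simulate_chain()
print("corr(A, C)            =", round(pearson(a, c), 3))
print("partial corr(A, C | B) =", round(partial_correlation(a, c, b), 3))
```

In this toy setting the plain correlation between A and C is high even though there is no direct link, while the partial correlation given B is close to zero, so a threshold on the conditioned score would drop the spurious A–C edge.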