
    Tellipsoid: Exploiting inter-gene correlation for improved detection of differential gene expression

    Motivation: Algorithms for differential analysis of microarray data are vital to modern biomedical research. Their accuracy depends strongly on effective treatment of inter-gene correlation. Correlation is ordinarily accounted for only in terms of its effect on significance cut-offs. In this paper it is shown that correlation can, in fact, be exploited to share information across tests, which, in turn, can increase statistical power. Results: Substantially improved differential analysis approaches result from combining identifiability (the fact that in most microarray data sets, a large proportion of genes can be identified a priori as non-differential) with optimization criteria that incorporate correlation. As a special case, we develop a method that builds upon the widely used two-sample t-statistic and uses the Mahalanobis distance as an optimality criterion. Results on the prostate cancer data of Singh et al. (2002) suggest that the proposed method outperforms all published approaches in terms of statistical power. Availability: The proposed algorithm is implemented in MATLAB and in R. The software, called Tellipsoid, and relevant data sets are available at http://www.egr.msu.edu/~desaikey. Comment: 19 pages. Submitted to Bioinformatics.
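    The core idea, scoring the vector of per-gene t-statistics jointly under an estimated inter-gene covariance rather than gene by gene, can be sketched as follows. This is an illustrative toy with random data and made-up dimensions, not the Tellipsoid implementation or the Singh et al. data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 genes, 10 samples per condition.
n_genes, n_a, n_b = 5, 10, 10
a = rng.normal(0.0, 1.0, size=(n_genes, n_a))
b = rng.normal(0.0, 1.0, size=(n_genes, n_b))
b[0] += 2.0  # make gene 0 differential

# Classical two-sample t-statistics, one per gene (unequal-variance form).
t = (a.mean(axis=1) - b.mean(axis=1)) / np.sqrt(
    a.var(axis=1, ddof=1) / n_a + b.var(axis=1, ddof=1) / n_b
)

# Inter-gene covariance estimated from the pooled, per-condition-centred samples.
pooled = np.hstack([a - a.mean(axis=1, keepdims=True),
                    b - b.mean(axis=1, keepdims=True)])
cov = np.cov(pooled)

# Mahalanobis distance of the t-statistic vector from the null (origin):
# large values indicate that the genes, jointly, deviate from non-differential.
d2 = float(t @ np.linalg.solve(cov, t))
print(round(d2, 2))
```

    Scoring the whole vector at once is what lets correlated non-differential genes share information with the tested gene, which is the effect the abstract describes.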

    Pattern Search Ranking and Selection Algorithms for Mixed-Variable Optimization of Stochastic Systems

    A new class of algorithms is introduced and analyzed for bound and linearly constrained optimization problems with stochastic objective functions and a mixture of design variable types. The generalized pattern search (GPS) class of algorithms is extended to a new problem setting in which objective function evaluations require sampling from a model of a stochastic system. The approach combines GPS with ranking and selection (R&S) statistical procedures to select new iterates. The derivative-free algorithms require only black-box simulation responses and are applicable over domains with mixed variables (continuous, discrete numeric, and discrete categorical), including bound and linear constraints on the continuous variables. A convergence analysis for the general class of algorithms establishes almost sure convergence of an iteration subsequence to stationary points appropriately defined in the mixed-variable domain. Additionally, specific algorithm instances are implemented that provide computational enhancements to the basic algorithm. Implementation alternatives include the use of modern R&S procedures designed to provide efficient sampling strategies and the use of surrogate functions that augment the search by approximating the unknown objective function with nonparametric response surfaces. In a computational evaluation, six variants of the algorithm are tested along with four competing methods on 26 standardized test problems. The numerical results validate the use of advanced implementations as a means to improve algorithm performance.
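    The poll-then-refine loop at the heart of pattern search on a noisy objective can be sketched in a few lines. This is a minimal continuous-variable toy in which "selection" is reduced to averaging replications and picking the best candidate; the objective, noise level, and parameters are invented, and the sketch omits the mixed-variable machinery and the formal R&S guarantees of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_f(x, reps=50):
    # Black-box simulation response: averaged noisy evaluations of a
    # hypothetical true objective (not one of the paper's test problems).
    true = (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2
    return float(np.mean(true + rng.normal(0.0, 0.05, size=reps)))

def pattern_search(x0, step=1.0, tol=0.05, max_iter=200):
    x = np.asarray(x0, dtype=float)
    fx = noisy_f(x)
    directions = np.vstack([np.eye(2), -np.eye(2)])  # coordinate poll set
    for _ in range(max_iter):
        if step <= tol:
            break
        candidates = [x + step * d for d in directions]
        values = [noisy_f(c) for c in candidates]
        best = int(np.argmin(values))
        if values[best] < fx:    # successful poll: accept the best candidate
            x, fx = candidates[best], values[best]
        else:                    # unsuccessful poll: refine the mesh
            step /= 2.0
    return x

x_star = pattern_search([0.0, 0.0])
print(x_star)  # approaches the true minimiser near (1, -2)
```

    The paper's contribution is precisely what this toy glosses over: replacing the naive argmin over sample means with statistically valid R&S procedures, and extending the poll set to discrete and categorical variables.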

    Quantlyzer: an R package for automated exploratory and predictive data analysis

    Machine learning (ML) and statistical algorithms are widely used in applications such as data classification, predictive regression, and feature selection. As the need for data-driven insights continues to grow, so does the demand for exploratory and predictive data analysis to support business decision-making, academic research, and other applications. Identifying the model with optimal performance for a specific dataset is usually time-consuming, depending on the purpose of the analysis. Although many packages provide pre-built machine learning or statistical models, users still need time to load suitable packages or functions, optimize hyperparameters, validate the model, examine the statistical relationships between variables, and so on. This paper presents an R package, "Quantlyzer", that contains various popular algorithms from machine learning and statistics. The tool aims to make automated data analysis more convenient for users of all levels, from those with no data analytics experience to domain experts, and thereby improve the efficiency of analyzing data. The package assembles a workflow pipeline of exploratory analytics covering popular descriptive techniques (e.g., the Pearson correlation coefficient, statistical summaries of each variable, data visualization), statistical algorithms (e.g., ridge regression, ordinary least squares (OLS), decision trees), machine learning models (e.g., support vector machines (SVM), random forests, eXtreme Gradient Boosting (XGBoost), gradient boosting machines (GBMs)), and automated machine learning (AutoML). Five-fold cross-validation is used for every machine learning model to guard against overfitting and selection bias.
    One dataset, with a soil organic carbon index as the dependent variable and near-infrared spectroscopy (NIRS) measurements as independent variables, was used to evaluate the package's predictive regressions. A second dataset, based on estimating different levels of dicamba damage on soybean plots with extracted image features as independent variables, was used to test classification performance with Quantlyzer. The results show that a pipeline assembling various machine learning and statistical algorithms can not only generate the report within two hours but also provide broader information, including data visualization, statistical analysis, and machine learning results, with a few lines of code. This study demonstrates one way to build an automated data analysis platform with various techniques that helps both academia and industry discover patterns in data. The project is designed to enhance users' data analysis efficiency by minimizing the need for manual code input, ultimately reducing effort and time consumption. The source code is freely available through GitHub (https://github.com/tianfengkai/quantlyzer). Includes bibliographical references.
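    The five-fold cross-validation the package applies to every model is a standard construction. The following is a self-contained sketch in plain NumPy with synthetic regression data (not the soil-carbon/NIRS dataset, and not Quantlyzer's API), using closed-form ridge regression as the model:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for a spectra-style regression problem.
X = rng.normal(size=(100, 8))
beta = np.array([1.5, -2.0, 0.0, 0.0, 0.5, 0.0, 0.0, 1.0])
y = X @ beta + rng.normal(0.0, 0.3, size=100)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X'X + lam*I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Five-fold cross-validation: shuffle once, hold out each fold in turn.
folds = np.array_split(rng.permutation(len(y)), 5)
rmse = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    b = ridge_fit(X[train], y[train])
    pred = X[test] @ b
    rmse.append(np.sqrt(np.mean((y[test] - pred) ** 2)))

cv_rmse = float(np.mean(rmse))
print(round(cv_rmse, 2))  # close to the noise level (0.3)
```

    Averaging the held-out error over all five folds is what protects the package's model comparisons from the selection bias of evaluating on the training data.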

    GMCM: Unsupervised Clustering and Meta-Analysis Using Gaussian Mixture Copula Models

    Methods for clustering in unsupervised learning are an important part of the statistical toolbox in numerous scientific disciplines. Tewari, Giering, and Raghunathan (2011) proposed to use so-called Gaussian mixture copula models (GMCM) for general unsupervised learning based on clustering. Li, Brown, Huang, and Bickel (2011) independently discussed a special case of these GMCMs as a novel approach to meta-analysis in high-dimensional settings. GMCMs have attractive properties which make them highly flexible and therefore interesting alternatives to other well-established methods. However, parameter estimation is hard because of intrinsic identifiability issues and intractable likelihood functions. Both aforementioned papers discuss similar expectation-maximization-like algorithms as their pseudo maximum likelihood estimation procedure. We present and discuss an improved implementation in R of both classes of GMCMs along with various alternative optimization routines to the EM algorithm. The software is freely available in the R package GMCM. The implementation is fast, general, and optimized for very large numbers of observations. We demonstrate the use of package GMCM through different applications.
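    The flexibility the abstract mentions comes from the copula construction: the model sees the data only through its marginal ranks, so any monotone transformation of the marginals leaves the fit unchanged. A toy NumPy sketch of that invariance (made-up latent mixture, not the GMCM package or its estimation algorithm):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two latent cluster components on a shared 2-d latent Gaussian scale.
z = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(3.0, 1.0, size=(50, 2))])

# Observed data with distorted (log-normal-like) marginals; exp is a
# strictly increasing transformation, so it preserves all ranks.
x = np.exp(z)

def pseudo_obs(x):
    """Rank-based pseudo-observations u_ij = rank / (n + 1), in (0, 1)."""
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return ranks / (n + 1.0)

u_from_z = pseudo_obs(z)
u_from_x = pseudo_obs(x)

# Identical pseudo-observations: only the copula part of the data survives,
# which is why GMCMs are insensitive to the marginal distributions.
print(bool(np.allclose(u_from_z, u_from_x)))  # True
```

    The hard part, which the R package addresses, is fitting the latent Gaussian mixture to such pseudo-observations despite the identifiability and likelihood issues the abstract describes.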

    On the design of R-based scalable frameworks for data science applications

    This thesis comprises three papers under the title "On the design of R-based scalable frameworks for data science applications". We discuss the design of conceptual and computational frameworks for the R language for statistical computing and graphics and build software artifacts for two typical data science use cases: optimization problem solving and large scale text analysis. Each part follows a design science approach. We use a verification method for the software frameworks introduced, i.e., prototypical instantiations of the designed artifacts are evaluated on the basis of real-world applications in mixed integer optimization (consensus journal ranking) and text mining (culturomics). The first paper introduces an extensible, object-oriented R Optimization Infrastructure (ROI). Methods from the field of optimization play an important role in many techniques routinely used in statistics, machine learning, and data science. Often, implementations of these methods rely on highly specialized optimization algorithms, designed to be applicable only within a specific application. However, in many instances recent advances, in particular in the field of convex optimization, make it possible to conveniently and straightforwardly use modern solvers instead, with the advantage of enabling broader usage scenarios and thus promoting reusability. With ROI one can formulate and solve optimization problems in a consistent way. It is capable of modeling linear, quadratic, conic, and general nonlinear optimization problems. Furthermore, the paper discusses how extension packages can add optimization solvers, read/write functions, and additional resources such as model collections. Selected examples from the field of statistics conclude the paper. With the second paper we aim to answer two questions. Firstly, it addresses the issue of how to construct suitable aggregates of individual journal rankings, using an optimization-based consensus ranking approach.
    Secondly, the presented application serves as an evaluation of the ROI prototype. Regarding the first research question, we apply the proposed method to a subset of marketing-related journals from a list of collected journal rankings. Next, the paper studies the stability of the derived consensus solution and the degeneration effects that occur when excluding journals and/or rankings. Finally, we investigate the similarities and dissimilarities of the consensus with a naive meta-ranking and with individual rankings. The results show that, even though journals are not uniformly ranked, one may derive a consensus ranking with considerable agreement with the individual rankings. In the third paper we examine how we can extend the text mining package tm to handle large (text) corpora. This enables statisticians to answer many interesting research questions via statistical analysis or modeling of data sets that cannot be analyzed easily otherwise, e.g., due to software- or hardware-induced data size limitations. Adequate programming models like MapReduce facilitate parallelization of text mining tasks and allow for processing large data sets by using a distributed file system, possibly spanning several machines, e.g., in a cluster of workstations. The paper presents a plug-in package to tm called tm.plugin.dc, implementing a distributed corpus class which can take advantage of the Hadoop MapReduce library for large scale text mining tasks. We evaluate the presented prototype on the basis of an application in culturomics and show that it can handle data sets of significant size efficiently.
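    The consensus-ranking problem of the second paper, finding the ranking that minimizes total pairwise disagreement with the individual rankings, can be illustrated with a brute-force toy. The four "journals" and their rankings below are invented, and enumeration stands in for the mixed integer program that ROI would formulate:

```python
import itertools
import numpy as np

# Hypothetical individual rankings of four journals A-D (rank 1 = best);
# toy numbers, not the thesis's journal data.
rankings = np.array([
    [1, 2, 3, 4],
    [2, 1, 3, 4],
    [1, 3, 2, 4],
])

def kendall_disagreements(r, s):
    """Number of item pairs the two rankings order differently."""
    n = len(r)
    return sum((r[i] - r[j]) * (s[i] - s[j]) < 0
               for i in range(n) for j in range(i + 1, n))

# Brute-force consensus: the ranking minimising total pairwise disagreement.
# (For realistic instance sizes this must be cast as an optimization
# problem instead of enumerated.)
best = min(
    (list(perm) for perm in itertools.permutations([1, 2, 3, 4])),
    key=lambda c: sum(kendall_disagreements(c, r) for r in rankings),
)
print(best)  # [1, 2, 3, 4]: consensus ranks for A, B, C, D
```

    Each individual ranking here disagrees with the consensus on at most one pair, which mirrors the thesis's finding that a consensus with considerable agreement can exist even when the inputs are not uniformly ranked.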