
    Shrunken p-Values for Assessing Differential Expression with Applications to Genomic Data Analysis

    In many scientific problems involving high-throughput technology, inference must be made for several hundreds or thousands of hypotheses. Recent attention has focused on how to address the multiple testing issue, with much of that focus devoted to the false discovery rate. In this article, we consider an alternative estimation procedure called shrunken p-values for assessing differential expression (SPADE). The estimators are motivated by risk considerations from decision theory and lead to a completely new method of adjustment for the multiple testing problem. In addition, the decision-theoretic framework can be used to derive a decision rule for controlling the number of false positive results. Some theoretical results are outlined. The proposed methodology is illustrated using simulation studies and an application to data from a prostate cancer gene expression profiling study. (Peer reviewed; full text at http://deepblue.lib.umich.edu/bitstream/2027.42/66004/1/j.1541-0420.2006.00616.x.pd)
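
    The exact SPADE estimator is defined in the article itself, not in this abstract, so the following R sketch is only a hypothetical illustration of the general idea of shrinking per-gene p-values toward the no-evidence point; the probit transform and the shrinkage weight w are assumptions made purely for illustration.

        ## Hypothetical illustration only, not the article's SPADE estimator.
        set.seed(1)
        n_genes <- 1000; n <- 10
        x <- matrix(rnorm(n_genes * 2 * n), nrow = n_genes)
        x[1:50, 1:n] <- x[1:50, 1:n] + 1.5          # 50 truly shifted genes
        group <- rep(c(0, 1), each = n)
        pvals <- apply(x, 1, function(g)
          t.test(g[group == 0], g[group == 1])$p.value)

        ## Shrink probit-scale p-values toward 0, i.e. p toward the
        ## no-evidence point 0.5; w = 0 recovers the raw p-values.
        w <- 0.3                        # assumed shrinkage weight
        z <- qnorm(pvals)
        p_shrunk <- pnorm((1 - w) * z)  # shrunken p-values (illustrative)
        c(raw = sum(pvals < 0.05), shrunk = sum(p_shrunk < 0.05))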

    Unsupervised Discovery and Representation of Subspace Trends in Massive Biomedical Datasets

    The goal of this dissertation is to develop unsupervised algorithms for discovering previously unknown subspace trends in massive multivariate biomedical data sets without the benefit of prior information. A subspace trend is a sustained pattern of gradual, progressive change within an unknown subset of feature dimensions. A fundamental challenge to subspace trend discovery is the presence of irrelevant data dimensions, noise, outliers, and confusion from multiple subspace trends driven by independent factors that are mixed in with each other. These factors can obscure the trends in traditional dimension reduction and projection based data visualizations. To overcome these limitations, we propose a novel graph-theoretic neighborhood similarity measure for sensing concordant progressive changes across data dimensions. Using this measure, we present an unsupervised algorithm for trend-relevant feature selection and visualization. Additionally, we propose an efficient online density-based representation to make the algorithm scalable for massive datasets. The representation not only assists in trend discovery, but also in cluster detection, including rare populations. Our method has been successfully applied to diverse synthetic and real-world biomedical datasets, such as gene expression microarrays and the arbor morphology of neurons and microglia in brain tissue. Derived representations revealed biologically meaningful hidden subspace trends that were obscured by irrelevant features and noise. Although our applications are mostly from the biomedical domain, the proposed algorithm is broadly applicable to exploratory analysis of high-dimensional data, including visualization, hypothesis generation, knowledge discovery, and prediction in diverse other applications.
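
    The dissertation's graph-theoretic similarity measure is not specified in the abstract; as a minimal sketch of the underlying intuition only, the R snippet below scores how concordantly two feature subsets arrange the same samples by comparing their k-nearest-neighbour sets with an average Jaccard overlap. The Jaccard score and the choice k = 10 are assumptions, not the dissertation's measure.

        ## Average Jaccard overlap of k-NN neighbourhoods computed in two
        ## feature subsets; high overlap suggests concordant structure.
        knn_sets <- function(d, k) {
          apply(as.matrix(d), 1, function(row) order(row)[2:(k + 1)])  # drop self
        }
        neighborhood_similarity <- function(x1, x2, k = 10) {
          n1 <- knn_sets(dist(x1), k)
          n2 <- knn_sets(dist(x2), k)
          mean(sapply(seq_len(ncol(n1)), function(i)
            length(intersect(n1[, i], n2[, i])) /
            length(union(n1[, i], n2[, i]))))
        }
        set.seed(2)
        x <- matrix(rnorm(200 * 4), ncol = 4)
        x[, 2] <- x[, 1] + rnorm(200, sd = 0.2)   # dimension 2 tracks dimension 1
        neighborhood_similarity(x[, 1, drop = FALSE], x[, 2, drop = FALSE])  # high
        neighborhood_similarity(x[, 1, drop = FALSE], x[, 3, drop = FALSE])  # near chance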

    Penalized Clustering of Large Scale Functional Data with Multiple Covariates

    In this article, we propose a penalized clustering method for large scale data with multiple covariates through a functional data approach. In the proposed method, responses and covariates are linked through nonparametric multivariate functions (fixed effects), which offer great flexibility in modeling a variety of function features, such as jump points, branching, and periodicity. Functional ANOVA is employed to further decompose the multivariate functions in a reproducing kernel Hilbert space and provide the associated notions of main effect and interaction. Parsimonious random effects are used to capture various correlation structures. The mixed-effect models are nested under a general mixture model, in which the heterogeneity of the functional data is characterized. We propose a penalized Henderson's likelihood approach for model fitting and design a rejection-controlled EM algorithm for the estimation. Our method selects smoothing parameters through generalized cross-validation. Furthermore, Bayesian confidence intervals are used to measure the clustering uncertainty. Simulation studies and real-data examples are presented to investigate the empirical performance of the proposed method. Open-source code is available in the R package MFDA.
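
    As a package-free illustration of just one ingredient named above, selecting a smoothing parameter by generalized cross-validation, base R's smooth.spline() picks its penalty by GCV when cv = FALSE. This is a minimal sketch of the GCV step, not the MFDA package's full penalized-likelihood fit; the simulated curve with a jump point is an assumption for the example.

        ## GCV-selected smoothing for a noisy curve with a jump point.
        set.seed(3)
        x <- seq(0, 1, length.out = 200)
        y <- sin(2 * pi * x) + (x > 0.5) * 1.5 + rnorm(200, sd = 0.3)
        fit <- smooth.spline(x, y, cv = FALSE)  # cv = FALSE requests GCV
        fit$lambda                              # smoothing parameter chosen by GCV
        plot(x, y, col = "grey"); lines(fit, lwd = 2)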

    Wrapper algorithms and their performance assessment on high-dimensional molecular data

    Prediction problems on high-dimensional molecular data, e.g. the classification of microarray samples into normal and cancer tissues, are complex and ill-posed since the number of variables usually exceeds the number of observations by orders of magnitude. Recent research in the area has produced a variety of new statistical models to handle these biological datasets. In practice, however, these models are always applied in combination with preprocessing and variable selection methods, as well as model selection, which is mostly performed by cross-validation. Varma and Simon (2006) used the term 'wrapper algorithm' for this integration of preprocessing and model selection into the construction of statistical models. They also proposed nested cross-validation (NCV) for estimating the prediction error of such wrapper algorithms, and NCV has since become the gold standard. In the first part, this thesis provides further theoretical and empirical justification for the use of NCV in the context of wrapper algorithms. Moreover, a computationally less intensive alternative to NCV is proposed, motivated within a decision-theoretic framework. The new method can be interpreted as a smoothed variant of NCV and, in contrast to NCV, guarantees intuitive bounds for the estimate of the prediction error. The second part focuses on the ranking of wrapper algorithms. Cross-study validation is proposed as an alternative to repeating separate within-study validations when several similar prediction problems are available. The concept is demonstrated using six different wrapper algorithms for survival prediction on censored data across a selection of eight breast cancer datasets. Additionally, a parametric bootstrap approach for simulating realistic data from such related prediction problems is described and then applied to illustrate cross-study validation for the ranking of wrapper algorithms. Finally, the last part addresses computational aspects of the analyses and simulations performed in the thesis. Preprocessing and evaluation of the prediction models require substantial computing resources. Parallel computing approaches are illustrated on cluster, cloud, and high-performance computing resources using the R programming language. The use of heterogeneous hardware and the processing of large datasets are covered, as is the implementation of the R package survHD for the analysis and evaluation of high-dimensional wrapper algorithms for survival prediction from censored data.
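
    A minimal R sketch of nested cross-validation in the sense of Varma and Simon: the tuning step is repeated inside every outer training fold, so the outer error estimate never sees data used for tuning. The wrapper here, a k-nearest-neighbour classifier with k tuned by inner CV, is an assumed stand-in chosen for brevity, not one of the thesis's survival wrappers.

        ## Nested CV: inner loop tunes k, outer loop estimates prediction error.
        library(class)                 # for knn()
        set.seed(4)
        x <- matrix(rnorm(100 * 5), ncol = 5)
        y <- factor(rbinom(100, 1, 0.5))
        ks <- c(1, 5, 15)
        outer_folds <- sample(rep(1:5, length.out = nrow(x)))
        outer_err <- sapply(1:5, function(f) {
          tr <- outer_folds != f
          xt <- x[tr, , drop = FALSE]; yt <- y[tr]
          inner_folds <- sample(rep(1:5, length.out = nrow(xt)))
          ## Inner CV: tune k on the outer training data only.
          inner_err <- sapply(ks, function(k)
            mean(sapply(1:5, function(g) {
              it <- inner_folds != g
              pred <- knn(xt[it, ], xt[!it, ], yt[it], k = k)
              mean(pred != yt[!it])
            })))
          best_k <- ks[which.min(inner_err)]
          ## Outer error: refit on all outer training data with the tuned k.
          mean(knn(xt, x[!tr, , drop = FALSE], yt, k = best_k) != y[!tr])
        })
        mean(outer_err)   # estimate of the whole wrapper's prediction error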

    Towards a dynamic view of genetic networks: A Kalman filtering framework for recovering temporally-rewiring stable networks from undersampled data

    It is widely accepted that cellular requirements and environmental conditions dictate the architecture of genetic regulatory networks. Nonetheless, the status quo in regulatory network modeling and analysis assumes an invariant network topology over time. We refocus on a dynamic perspective of genetic networks, one that can uncover substantial topological changes in network structure during biological processes such as developmental growth and cancer progression. We propose a novel outlook on the inference of time-varying genetic networks from a limited number of noisy observations, formulating the network estimation as a target tracking problem. Assuming linear dynamics, we formulate a constrained Kalman filtering framework, which recursively computes the minimum mean-square, sparse and stable estimate of the network connectivity at each time point. The sparsity constraint is enforced using the weighted l1-norm, and the stability constraint is incorporated using the Lyapunov stability condition. The proposed constrained Kalman filter is formulated to preserve the convex nature of the problem. The algorithm is applied to estimate the time-varying networks during the life cycle of Drosophila melanogaster (the fruit fly).
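
    The paper's filter adds the sparsity (weighted l1) and Lyapunov-stability constraints to the standard recursion; those constrained updates are beyond a short sketch, so the R snippet below shows only the unconstrained Kalman recursion that the constrained estimator builds on. The constant-velocity toy model and all matrices are assumptions for illustration.

        ## Plain (unconstrained) Kalman filter recursion.
        kalman_filter <- function(y, A, C, Q, R, x0, P0) {
          p <- length(x0); n <- ncol(y)
          xs <- matrix(0, p, n); x <- x0; P <- P0
          for (t in 1:n) {
            x <- A %*% x                                      # predict state
            P <- A %*% P %*% t(A) + Q                         # predict covariance
            K <- P %*% t(C) %*% solve(C %*% P %*% t(C) + R)   # Kalman gain
            x <- x + K %*% (y[, t] - C %*% x)                 # measurement update
            P <- (diag(p) - K %*% C) %*% P
            xs[, t] <- x
          }
          xs
        }
        ## Toy usage: track a 2-d hidden state from noisy 1-d observations.
        set.seed(5)
        A <- matrix(c(1, 0, 1, 1), 2)   # constant-velocity dynamics
        C <- matrix(c(1, 0), 1)         # observe position only
        Q <- 0.01 * diag(2); R <- matrix(0.5)
        y <- matrix(cumsum(rnorm(50)) + rnorm(50, sd = 0.7), 1)
        xs <- kalman_filter(y, A, C, Q, R, x0 = c(0, 0), P0 = diag(2))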

    Data based identification and prediction of nonlinear and complex dynamical systems

    We thank Dr. R. Yang (formerly at ASU), Dr. R.-Q. Su (formerly at ASU), and Mr. Zhesi Shen for their contributions to a number of original papers on which this Review is partly based. This work was supported by ARO under Grant No. W911NF-14-1-0504. W.-X. Wang was also supported by NSFC under Grants No. 61573064 and No. 61074116, as well as by the Fundamental Research Funds for the Central Universities, Beijing Nova Programme. (Peer reviewed; postprint.)