
    Challenges of Big Data Analysis

    Big Data bring new opportunities to modern society and challenges to data scientists. On the one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive a paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in high-confidence sets and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity; this can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.
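
    The spurious-correlation issue mentioned above can be made concrete with a short simulation: even when a response is independent of every predictor, the maximum absolute sample correlation grows as the number of predictors increases. A minimal sketch, assuming Python with NumPy (illustrative only, not taken from the article):

    # Illustrative simulation: spurious correlation grows with dimension.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50                                   # small sample size
    for p in (10, 1_000, 100_000):           # increasingly many irrelevant predictors
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)           # independent of every column of X
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = (y - y.mean()) / y.std()
        max_corr = np.abs(Xc.T @ yc / n).max()
        print(f"p={p:>7}: max |sample corr| with pure noise = {max_corr:.3f}")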

    Integration and visualisation of clinical-omics datasets for medical knowledge discovery

    In recent decades, the rise of various omics fields has flooded life sciences with unprecedented amounts of high-throughput data, which have transformed the way biomedical research is conducted. This trend will only intensify in the coming decades, as the cost of data acquisition will continue to decrease. Therefore, there is a pressing need to find novel ways to turn this ocean of raw data into waves of information and finally distil those into drops of translational medical knowledge. This is particularly challenging because of the incredible richness of these datasets, the humbling complexity of biological systems and the growing abundance of clinical metadata, which makes the integration of disparate data sources even more difficult. Data integration has proven to be a promising avenue for knowledge discovery in biomedical research. Multi-omics studies allow us to examine a biological problem through different lenses using more than one analytical platform. These studies not only present tremendous opportunities for the deep and systematic understanding of health and disease, but they also pose new statistical and computational challenges. The work presented in this thesis aims to alleviate this problem with a novel pipeline for omics data integration. Modern omics datasets are extremely feature-rich, and in multi-omics studies this complexity is compounded by a second or even third dataset. However, many of these features might be completely irrelevant to the studied biological problem or redundant in the context of others. Therefore, in this thesis, clinical-metadata-driven feature selection is proposed as a viable option for narrowing down the focus of analyses in biomedical research. Our visual cortex has been fine-tuned through millions of years to become an outstanding pattern recognition machine. To leverage this incredible resource of the human brain, we need to develop advanced visualisation software that enables researchers to explore these vast biological datasets through illuminating charts and interactivity. Accordingly, a substantial portion of this PhD was dedicated to implementing truly novel visualisation methods for multi-omics studies.
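
    One simple way to realise clinical-metadata-driven feature selection is to rank omics features by a univariate association test against a clinical variable before any integration step. The sketch below is an illustrative assumption (the variable names and the F-test criterion are not taken from the thesis pipeline):

    # Illustrative sketch: keep only omics features associated with a clinical outcome.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(1)
    n_samples, n_features = 120, 5_000
    omics = rng.standard_normal((n_samples, n_features))    # e.g. a transcriptomics matrix
    diagnosis = rng.integers(0, 2, size=n_samples)          # clinical metadata: binary diagnosis

    # Rank features by a univariate F-test against the clinical variable; keep the 100 strongest.
    selector = SelectKBest(score_func=f_classif, k=100).fit(omics, diagnosis)
    kept = selector.get_support(indices=True)
    print("number of features retained for integration:", kept.size)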

    Estimation and Inference for High-Dimensional Gaussian Graphical Models with Structural Constraints.

    This work discusses several aspects of estimation and inference for high-dimensional Gaussian graphical models and consists of two main parts. The first part considers network-based pathway enrichment analysis based on incomplete network information. Pathway enrichment analysis has become a key tool for biomedical researchers to gain insight into the underlying biology of differentially expressed genes, proteins and metabolites. We propose a constrained network estimation framework that combines network estimation based on cell- and condition-specific high-dimensional omics data with interaction information from existing databases. The resulting pathway topology information is subsequently used to provide a framework for simultaneous testing of differences in expression levels of pathway members, as well as their interactions. We study the asymptotic properties of the proposed network estimator and of the test for pathway enrichment, investigate their small-sample performance in simulated experiments, and illustrate them on two cancer data sets. The second part of the thesis is devoted to reconstructing multiple graphical models simultaneously from high-dimensional data. We develop methodology that jointly estimates multiple Gaussian graphical models, assuming that there exists prior information on how they are structurally related. The proposed method consists of two steps: in the first, we employ neighborhood selection to obtain estimated edge sets of the graphs using a group lasso penalty. In the second step, we estimate the nonzero entries in the inverse covariance matrices by maximizing the corresponding Gaussian likelihood. We establish the consistency of the proposed method for sparse high-dimensional Gaussian graphical models and illustrate its performance using simulation experiments. An application to a climate data set is also discussed. PhD thesis, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113495/1/mjing_1.pd
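
    The neighborhood-selection step can be illustrated, for a single graph, with node-wise lasso regressions (the Meinshausen-Buhlmann approach); the group-lasso coupling across related graphs and the second-stage likelihood refit proposed in the thesis are omitted, so this is only a simplified sketch:

    # Sketch: node-wise lasso neighborhood selection for one Gaussian graphical model.
    import numpy as np
    from sklearn.linear_model import LassoCV

    def neighborhood_selection(X, or_rule=True):
        """Estimate an edge set by regressing each variable on all the others."""
        n, p = X.shape
        adj = np.zeros((p, p), dtype=bool)
        for j in range(p):
            others = np.delete(np.arange(p), j)
            coef = LassoCV(cv=5).fit(X[:, others], X[:, j]).coef_
            adj[j, others] = coef != 0
        # Symmetrise: OR rule keeps an edge if either regression selects it.
        return adj | adj.T if or_rule else adj & adj.T

    X = np.random.default_rng(2).standard_normal((200, 15))
    print("estimated edges:", int(neighborhood_selection(X).sum() // 2))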

    Sparse Model Selection using Information Complexity

    This dissertation studies and applies information complexity for statistical model selection through three different projects. Specifically, we design statistical models that incorporate sparsity features to make the models more explanatory and computationally efficient. In the first project, we propose a Sparse Bridge Regression model for variable selection when the number of variables is much greater than the number of observations, even in the presence of model misspecification. The model is demonstrated to have excellent explanatory power in high-dimensional data analysis through numerical simulations and real-world data analysis. The second project proposes a novel hybrid modeling method, a mixture of sparse principal component regression (MIX-SPCR), to segment high-dimensional time series data. Using the MIX-SPCR model, we empirically analyze S&P 500 index data (from 1999 to 2019) and identify two key change points. The third project investigates the use of nonlinear features in the Sparse Kernel Factor Analysis (SKFA) method to derive the information criterion. Using a variety of wide datasets, we demonstrate the benefits of SKFA for the nonlinear representation and classification of data. The results obtained show the flexibility and the utility of information complexity in such data modeling problems.
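
    As background for the second project, plain sparse principal component regression (without the mixture segmentation or the information-complexity criterion that the dissertation contributes) can be sketched as follows; the dataset and parameter values are illustrative assumptions:

    # Sketch: sparse principal component regression (no mixture, no ICOMP criterion).
    import numpy as np
    from sklearn.decomposition import SparsePCA
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    X = rng.standard_normal((300, 40))
    y = X[:, :3].sum(axis=1) + 0.1 * rng.standard_normal(300)   # depends on a few variables

    spca = SparsePCA(n_components=5, alpha=1.0, random_state=0).fit(X)
    scores = spca.transform(X)                  # sparse component scores
    model = LinearRegression().fit(scores, y)   # regress the response on the scores
    print("R^2 on training data:", round(model.score(scores, y), 3))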

    Extending principal covariates regression for high-dimensional multi-block data

    This dissertation addresses the challenge of deciphering extensive datasets collected from multiple sources, such as health habits and genetic information, in the context of studying complex issues like depression. A data analysis method known as Principal Covariates Regression (PCovR) provides a strong basis for addressing this challenge. Yet, analyzing these intricate datasets is far from straightforward. The data often contain redundant and irrelevant variables, making it difficult to extract meaningful insights. Furthermore, these data may involve different types of outcome variables (for instance, the variable pertaining to depression could manifest as a score on a depression scale or as a binary diagnosis (yes/no) from a medical professional), adding another layer of complexity. To overcome these obstacles, novel adaptations of PCovR are proposed in this dissertation. The methods automatically select important variables, categorize insights into those originating from a single source or from multiple sources, and accommodate various outcome variable types. The effectiveness of these methods is demonstrated in predicting outcomes and revealing the subtle relationships within data from multiple sources. Moreover, the dissertation offers a glimpse of future directions for enhancing PCovR. Implications of extending the method so that it selects important variables are critically examined. Also, an algorithm that has the potential to yield optimal results is suggested. In conclusion, this dissertation proposes methods to tackle the complexity of large data from multiple sources, and points towards where opportunities may lie in the next line of research.
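
    For orientation, standard PCovR balances reconstructing the predictors against predicting the outcome, and the baseline (non-sparse, single-block) solution can be computed by an eigendecomposition. A minimal sketch, assuming the usual weighted-projection formulation rather than the extensions developed in the dissertation:

    # Sketch: plain Principal Covariates Regression via its eigendecomposition solution.
    import numpy as np

    def pcovr(X, Y, n_components, alpha=0.5):
        """X, Y assumed column-centred. Returns scores T, loadings Px, Py, weights W."""
        n = X.shape[0]
        Y = Y.reshape(n, -1)
        Hx = X @ np.linalg.pinv(X)                    # projection onto the column space of X
        G = (alpha / np.sum(X**2)) * (X @ X.T) \
            + ((1 - alpha) / np.sum(Y**2)) * (Hx @ Y @ Y.T @ Hx)
        eigvals, eigvecs = np.linalg.eigh(G)
        T = eigvecs[:, -n_components:]                # top eigenvectors = orthonormal scores
        Px, Py = X.T @ T, Y.T @ T                     # loadings for X and Y
        W = np.linalg.pinv(X) @ T                     # scores for new data: T_new = X_new @ W
        return T, Px, Py, W

    rng = np.random.default_rng(4)
    X = rng.standard_normal((100, 20)); X -= X.mean(axis=0)
    Y = X[:, :2] @ np.array([1.0, -1.0]) + 0.1 * rng.standard_normal(100); Y -= Y.mean()
    T, Px, Py, W = pcovr(X, Y, n_components=3, alpha=0.5)
    resid = Y.reshape(-1, 1) - T @ Py.T
    print("share of Y variance explained:", round(float(1 - resid.var() / Y.var()), 3))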

    Dimensionality and Structure in Cancer Genomics: A Statistical Learning Perspective

    Computational analysis of genomic data has transformed research and clinical practice in oncology. Advances in machine learning and AI hold promise for answering both theoretical and practical questions. While the modern researcher has access to a catalogue of tools from disciplines such as natural language processing and image recognition, before browsing for our favourite off-the-shelf technique it is worth asking a sequence of questions. What sort of data are we dealing with in cancer genomics? Do we have enough of it to be successful without designing into our models what we already know about its structure? If our methods do work, will we understand why? Are our tools robust enough to be applied in clinical practice? If so, are the technologies upon which they rely economically viable? While we will not answer all of these questions, we will provide language with which to discuss them. Understanding how much information we can expect to extract from data is, at heart, a statistical question.

    Building Information Filtering Networks with Topological Constraints: Algorithms and Applications

    We propose a new methodology for learning the structure of sparse networks from data; in doing so we adopt a dual perspective in which we consider networks both as weighted graphs and as simplicial complexes. The proposed learning methodology belongs to the family of preferential attachment algorithms, where a network is extended by iteratively adding new vertices. In the conventional preferential attachment algorithm a new vertex is added to the network by adding a single edge to another existing vertex; in our approach a new vertex is added to a set of vertices by adding one or more new simplices to the simplicial complex. We propose the use of a score function to quantify the strength of the association between the new vertex and the attachment points. The methodology performs a greedy optimisation of the total score by selecting, at each step, the new vertex and the attachment points that maximise the gain in the score. Sparsity is enforced by restricting the space of feasible configurations through the imposition of topological constraints on the candidate networks; the constraint is fulfilled by allowing only topological operations that are invariant with respect to the required property. For instance, if the topological constraint requires the constructed network to be planar, then only planarity-invariant operations are allowed; if the constraint is that the network must be a clique forest, then only simplicial vertices can be added. At each step of the algorithm, the vertex to be added and the attachment points are those that provide the maximum increase in score while maintaining the topological constraints. As a concrete but general realisation we propose the clique forest as a possible topological structure for the representation of sparse networks, and we allow further constraints to be specified, such as the allowed range of clique sizes and the saturation of the attachment points. In this thesis we introduce the Maximally Filtered Clique Forest (MFCF) algorithm: the MFCF builds a clique forest by repeated application of a suitably invariant operation that we call the Clique Expansion operator, and it adds vertices according to a strategy that greedily maximises the gain in a local score function. The gains produced by the Clique Expansion operator can be validated in a number of ways, including statistical testing, cross-validation or value thresholding. The algorithm does not prescribe a specific form for the gain function, but allows the use of any number of gain functions as long as they are consistent with the Clique Expansion operator. We describe several examples of gain functions suited to different problems. As a specific practical realisation we study the extraction of planar networks with the Triangulated Maximally Filtered Graph (TMFG). The TMFG, in its simplest form, is a specialised version of the MFCF, but it can be made more powerful by allowing the use of specialised planarity-invariant operators that are not based on the Clique Expansion operator. We provide applications to two well-known problems: the Maximum Weight Planar Subgraph Problem (MWPSP) and the Covariance Selection problem. With regard to the Covariance Selection problem we compare our results to the state-of-the-art solution (the Graphical Lasso) and highlight the benefits of our methodology.
We also study the geometry of clique trees as simplicial complexes and note that statistics based on cliques and separators provide information equivalent to what can be obtained with homological methods, such as the analysis of Betti numbers, while being computationally more efficient and conceptually simpler. Finally, we use the geometric tools developed here to propose a methodology for inferring the size of a dataset generated by a factor model, and we illustrate it with an example.
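
    A simplified sketch of the TMFG-style greedy construction described above, assuming a symmetric similarity matrix as the score and omitting the MFCF's general gain functions and validation steps:

    # Sketch: greedy TMFG-style construction of a maximal planar filtered graph.
    import itertools
    import numpy as np

    def tmfg_sketch(W):
        """W: symmetric similarity matrix (larger = stronger). Returns a set of edges."""
        n = W.shape[0]
        # Seed tetrahedron: the four vertices with the largest total similarity (a common heuristic).
        seed = [int(i) for i in np.argsort(-W.sum(axis=1))[:4]]
        edges = set(itertools.combinations(sorted(seed), 2))
        faces = [tuple(f) for f in itertools.combinations(seed, 3)]
        remaining = set(range(n)) - set(seed)
        while remaining:
            # Greedy step: pick the vertex/face pair with the largest gain in total edge weight.
            _, v, face = max((W[v, list(f)].sum(), v, f)
                             for v in remaining for f in faces)
            remaining.discard(v)
            edges.update(tuple(sorted((v, u))) for u in face)   # three new edges, planarity preserved
            faces.remove(face)                                  # the chosen face is split into three
            faces.extend([(v, face[0], face[1]),
                          (v, face[0], face[2]),
                          (v, face[1], face[2])])
        return edges

    rng = np.random.default_rng(5)
    A = rng.random((30, 30))
    W = (A + A.T) / 2
    np.fill_diagonal(W, 0.0)
    print(len(tmfg_sketch(W)), "edges; a maximal planar graph on 30 vertices has 3*30 - 6 =", 3 * 30 - 6)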