
    OPENMENDEL: A Cooperative Programming Project for Statistical Genetics

    Statistical methods for genomewide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project. Comment: 16 pages, 2 figures, 2 tables

    A Provable Smoothing Approach for High Dimensional Generalized Regression with Applications in Genomics

    In many applications, linear models fit the data poorly. This article studies an appealing alternative, the generalized regression model. This model only assumes that there exists an unknown monotonically increasing link function connecting the response $Y$ to a single index $X^T\beta^*$ of explanatory variables $X\in\mathbb{R}^d$. The generalized regression model is flexible and covers many widely used statistical models. It fits the data generating mechanisms well in many real problems, which makes it useful in a variety of applications where regression models are regularly employed. In low dimensions, rank-based M-estimators are recommended to deal with the generalized regression model, giving root-$n$ consistent estimators of $\beta^*$. Applications of these estimators to high dimensional data, however, are questionable. This article studies, both theoretically and practically, a simple yet powerful smoothing approach to handle the high dimensional generalized regression model. Theoretically, a family of smoothing functions is provided, and the amount of smoothing necessary for efficient inference is carefully calculated. Practically, our study is motivated by an important and challenging scientific problem: decoding gene regulation by predicting transcription factors that bind to cis-regulatory elements. Applying our proposed method to this problem shows substantial improvement over the state-of-the-art alternative in real data. Comment: 53 pages
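
    For orientation, the kind of rank-based M-estimator the abstract refers to can be written down explicitly. The abstract names neither the estimator nor the smoothing family, so the following is only an illustrative sketch: the first expression is the classical maximum rank correlation objective, and the second replaces one indicator with a smooth function $K$ (for example a sigmoid) with bandwidth $h > 0$. The symbols $K$ and $h$, and the normalization $\|\beta\|_2 = 1$ used for identifiability, are assumptions and not taken from the paper.

```latex
\hat\beta_{\mathrm{MRC}}
  = \arg\max_{\|\beta\|_2 = 1}
    \frac{1}{n(n-1)} \sum_{i \neq j}
    \mathbb{1}(Y_i > Y_j)\,\mathbb{1}\!\left(X_i^T\beta > X_j^T\beta\right),
\qquad
\hat\beta_{h}
  = \arg\max_{\|\beta\|_2 = 1}
    \frac{1}{n(n-1)} \sum_{i \neq j}
    \mathbb{1}(Y_i > Y_j)\,K\!\left(\frac{(X_i - X_j)^T\beta}{h}\right)
```

    As $h \to 0$ a sigmoid-type $K$ approaches the indicator, so the smoothed objective approaches the unsmoothed one; the paper's contribution concerns how much smoothing is appropriate in high dimensions.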

    Iterative Visual Analytics and its Applications in Bioinformatics

    Indiana University-Purdue University Indianapolis (IUPUI). You, Qian. Ph.D., Purdue University, December 2010. Iterative Visual Analytics and its Applications in Bioinformatics. Major Professors: Shiaofen Fang and Luo Si. Visual Analytics is a new and developing field that addresses the challenges of knowledge discovery from the massive amount of available data. It facilitates humans' reasoning capabilities with interactive visual interfaces for exploratory data analysis tasks, where automatic data mining methods fall short due to the lack of pre-defined objective functions. Analyzing large volumes of data for biological discoveries raises similar challenges. The domain knowledge of biologists and bioinformaticians is critical in hypothesis-driven discovery tasks. Yet the development of visual analytics frameworks for bioinformatic applications is still in its infancy. In this dissertation, we propose a general visual analytics framework, Iterative Visual Analytics (IVA), to address some of the challenges in current research. The framework consists of three progressive steps for exploring data sets of increasing complexity: Terrain Surface Multi-dimensional Data Visualization, a new multi-dimensional technique that highlights the global patterns from the profile of a large scale network and can direct users' attention to characteristic regions for discovering otherwise hidden knowledge; Correlative Multi-level Terrain Surface Visualization, a new visual platform that provides an overview and boosts the major signals of the numeric correlations among nodes in interconnected networks of different contexts, enabling users to gain critical insights and perform data analytical tasks in the context of multiple correlated networks; and the Iterative Visual Refinement Model, an innovative process that treats users' perceptions as the objective functions and guides users to form the optimal hypothesis by improving the desired visual patterns. It is a formalized model for interactive explorations to converge to optimal solutions. We also showcase our approach with bio-molecular data sets and demonstrate its effectiveness in several biomarker discovery applications.

    Unsupervised Algorithms for Microarray Sample Stratification

    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. In order to be able to discover hidden patterns, the most disparate analytical techniques have been proposed. Here, we describe the basic methodologies to approach the analysis of microarray datasets that focus on the task of (sub)group discovery. Peer reviewed

    MACHINE LEARNING APPROACHES FOR BIOMARKER IDENTIFICATION AND SUBGROUP DISCOVERY FOR POST-TRAUMATIC STRESS DISORDER

    Post-traumatic stress disorder (PTSD) is a psychiatric disorder caused by environmental and genetic factors resulting from alterations in genetic variation, epigenetic changes and neuroimaging characteristics. There is a pressing need to identify reliable molecular and physiological biomarkers for accurate diagnosis, prognosis, and treatment, as well as to deepen the understanding of PTSD pathophysiology. Machine learning methods are widely used to infer patterns from biological data, identify biomarkers, and make predictions. The objective of this research is to apply machine learning methods for the accurate classification of human diseases from genome-scale datasets, focusing primarily on PTSD. The DoD-funded Systems Biology of PTSD Consortium has recruited combat veterans with and without PTSD for measurement of molecular and physiological data from blood or urine samples with the goal of identifying accurate and specific PTSD biomarkers. As a member of the Consortium with access to these PTSD multiple omics datasets, we first completed a project titled Clinical Subgroup-Specific PTSD Classification and Biomarker Discovery. We applied machine learning approaches to these data to build classification models consisting of molecular and clinical features to predict PTSD status. We also identified candidate biomarkers for diagnosis, which improves our understanding of PTSD pathogenesis. In a second project, entitled Multi-Omic PTSD Subgroup Identification and Clinical Characterization, we applied methods for integrating multiple omics datasets to investigate the complex, multivariate nature of the biological systems underlying PTSD. We identified an optimal solution of two PTSD subgroups using two different machine learning approaches from 82 PTSD positive samples, and we found that the subgroups exhibited different remitting behavior as inferred from subjects recalled at a later time point. The results from our association, differential expression, and classification analyses demonstrated the distinct clinical and molecular features characterizing these subgroups. Taken together, our work has advanced our understanding of PTSD biomarkers and subgroups through the use of machine learning approaches. Results from our work should strongly contribute to the precise diagnosis and eventual treatment of PTSD, as well as other diseases. Future work will involve continuing to leverage these results to enable precision medicine for PTSD.

    Biomarker Prioritisation and Power Estimation Using Ensemble Gene Regulatory Network Inference

    Inferring the topology of a gene regulatory network (GRN) from gene expression data is a challenging but important undertaking for gaining a better understanding of gene regulation. Key challenges include working with noisy data and dealing with a higher number of genes than samples. Although a number of different methods have been proposed to infer the structure of a GRN, there are large discrepancies among the different inference algorithms they adopt, rendering their meaningful comparison challenging. In this study, we used two methods, namely the MIDER (Mutual Information Distance and Entropy Reduction) and the PLSNET (Partial least square based feature selection) methods, to infer the structure of a GRN directly from data and computationally validated our results. Both methods were applied to different gene expression datasets resulting from inflammatory bowel disease (IBD), pancreatic ductal adenocarcinoma (PDAC), and acute myeloid leukaemia (AML) studies. For each case, gene regulators were successfully identified. For example, in the IBD dataset the UGT1A family genes were identified as key regulators, while in the PDAC dataset the SULF1 and THBS2 genes were highlighted. We further demonstrate that an ensemble-based approach that combines the output of the MIDER and PLSNET algorithms can infer the structure of a GRN from data with higher accuracy. We have also estimated the number of samples required for potential future validation studies. Here, we present our proposed analysis framework, which supports not only the prediction of candidate regulator genes for potential validation experiments but also the estimation of the number of samples required for these experiments.

    Spatial statistics from hyperplexed immunofluorescence images: to elucidate tumor microenvironment, to characterize intratumor heterogeneity, and to predict metastatic potential

    The composition of the tumor microenvironment (TME), that is, the malignant, immune, and stromal cells implicated in tumor biology as well as the extracellular matrix and noncellular elements, and the spatial relationships between its constituents are important diagnostic biomarkers for cancer progression, proliferation, and therapeutic response. In this thesis, we develop methods to quantify spatial intratumor heterogeneity (ITH). We apply a novel pattern recognition framework to phenotype cells, encode spatial information, and calculate pairwise association statistics between cell phenotypes in the tumor using pointwise mutual information. These association statistics are summarized in a heterogeneity map, used to compare and contrast cancer subtypes and identify interaction motifs that may underlie signaling pathways and functional heterogeneity. Additionally, we test the prognostic power of spatial protein expression and association profiles for predicting clinical cancer staging and recurrence, using multivariate modeling techniques. By demonstrating the relationship between spatial ITH and outcome, we advocate this method as a novel source of information for cancer diagnostics. To this end, we have released an open-source analysis and visualization platform, THRIVE (Tumor Heterogeneity Research Image Visualization Environment), to segment and quantify multiplexed imaging samples, and assess underlying heterogeneity of those samples. The quantification of spatial ITH will uncover key spatial interactions, which contribute to disease proliferation and progression, and may confer metastatic potential in the primary neoplasm.
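
    The pairwise association statistic mentioned above, pointwise mutual information (PMI), can be illustrated with a small sketch. This is not the THRIVE implementation: the function name, the way neighbor pairs are represented, and the toy data are all hypothetical, and the thesis may define neighborhoods and normalization differently. The sketch only shows the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))) applied to counts of neighboring cell phenotypes.

```python
import math
from collections import Counter


def pointwise_mutual_information(pairs):
    """Return PMI (in bits) for every observed (a, b) pair.

    `pairs` is a list of (phenotype_of_cell, phenotype_of_neighbor) tuples.
    Marginals are taken over the first and second positions respectively.
    Pairs that never occur are simply absent from the result.
    """
    total = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)

    pmi = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / total
        p_a = first_counts[a] / total
        p_b = second_counts[b] / total
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Hypothetical toy data: each tuple is one (cell, neighboring cell) observation.
neighbor_pairs = [
    ("tumor", "tumor"),
    ("tumor", "tumor"),
    ("immune", "immune"),
    ("tumor", "immune"),
]
pmi_scores = pointwise_mutual_information(neighbor_pairs)
for pair, score in sorted(pmi_scores.items()):
    print(pair, round(score, 3))
```

    Positive values indicate phenotype pairs that co-occur as neighbors more often than their marginal frequencies would suggest, and negative values indicate pairs that co-occur less often; arranging these scores in a phenotype-by-phenotype matrix gives something in the spirit of the heterogeneity map described in the abstract.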

    Surface fluid registration of conformal representation: Application to detect disease burden and genetic influence on hippocampus

    In this paper, we develop a new automated surface registration system based on surface conformal parameterization by holomorphic 1-forms, inverse consistent surface fluid registration, and multivariate tensor-based morphometry (mTBM). First, we conformally map a surface onto a planar rectangle space with holomorphic 1-forms. Second, we compute the surface conformal representation by combining its local conformal factor and mean curvature, and linearly scale the dynamic range of the conformal representation to form the feature image of the surface. Third, we align the feature image with a chosen template image via the fluid image registration algorithm, which has been extended into curvilinear coordinates to adjust for the distortion introduced by surface parameterization. The inverse consistent image registration algorithm is also incorporated in the system to jointly estimate the forward and inverse transformations between the study and template images. This alignment induces a corresponding deformation on the surface. We tested the system on the Alzheimer's Disease Neuroimaging Initiative (ADNI) baseline dataset to study AD symptoms on the hippocampus. In our system, by modeling a hippocampus as a 3D parametric surface, we nonlinearly registered each surface with a selected template surface. Then we used mTBM to analyze the morphometry difference between diagnostic groups. Experimental results show that the new system has better performance than two publicly available subcortical surface registration tools: FIRST and SPHARM. We also analyzed the genetic influence of the Apolipoprotein E ε4 allele (ApoE4), which is considered the most prevalent risk factor for AD. Our work successfully detected statistically significant differences between ApoE4 carriers and non-carriers in both patients with mild cognitive impairment (MCI) and healthy control subjects. The results show evidence that the ApoE genotype may be associated with accelerated brain atrophy, so our work provides a new MRI analysis tool that may help presymptomatic AD research. NOTICE: this is the author's version of a work that was accepted for publication in NEUROIMAGE. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. A definitive version was subsequently published in Neuroimage, 78, 111-134 [2013] http://dx.doi.org/10.1016/j.neuroimage.2013.04.01