305 research outputs found

    Evolving Spatially Aggregated Features from Satellite Imagery for Regional Modeling

    Satellite imagery and remote sensing provide explanatory variables at relatively high resolutions for modeling geospatial phenomena, yet regional summaries are often desirable for analysis and actionable insight. In this paper, we propose a novel method of inducing spatial aggregations as a component of the machine learning process, yielding regional model features whose construction is driven by model prediction performance rather than prior assumptions. Our results demonstrate that Genetic Programming is particularly well suited to this type of feature construction because it can automatically synthesize appropriate aggregations and incorporate them into predictive models better than the other regression methods we tested. In our experiments we consider a specific problem instance and a real-world dataset relevant to predicting snow properties in high-mountain Asia.

    Visual Similarity Perception of Directed Acyclic Graphs: A Study on Influencing Factors

    While visual comparison of directed acyclic graphs (DAGs) is commonly encountered in various disciplines (e.g., finance, biology), knowledge about humans' perception of graph similarity is currently quite limited. By graph similarity perception we mean how humans perceive commonalities and differences in graphs and thereby come to a similarity judgment. As a step toward filling this gap, the study reported in this paper strives to identify factors that influence the similarity perception of DAGs. In particular, we conducted a card-sorting study employing a qualitative and quantitative analysis approach to identify 1) groups of DAGs that are perceived as similar by the participants and 2) the reasons behind their choice of groups. Our results suggest that similarity is mainly influenced by the number of levels, the number of nodes on a level, and the overall shape of the graph. Comment: Graph Drawing 2017 - arXiv Version; Keywords: Graphs, Perception, Similarity, Comparison, Visualization

    Survival prediction from clinico-genomic models - a comparative study

    Background: Survival prediction from high-dimensional genomic data is an active field in today's medical research. Most of the proposed prediction methods make use of genomic data alone without considering established clinical covariates that often are available and known to have predictive value. Recent studies suggest that combining clinical and genomic information may improve predictions, but there is a lack of systematic studies on the topic. Also, for the widely used Cox regression model, it is not obvious how to handle such combined models. Results: We propose a way to combine classical clinical covariates with genomic data in a clinico-genomic prediction model based on the Cox regression model. The prediction model is obtained by a simultaneous use of both types of covariates, but applying dimension reduction only to the high-dimensional genomic variables. We describe how this can be done for seven well-known prediction methods: variable selection, unsupervised and supervised principal components regression and partial least squares regression, ridge regression, and the lasso. We further perform a systematic comparison of the performance of prediction models using clinical covariates only, genomic data only, or a combination of the two. The comparison is done using three survival data sets containing both clinical information and microarray gene expression data. Matlab code for the clinico-genomic prediction methods is available at http://www.med.uio.no/imb/stat/bmms/software/clinico-genomic/. Conclusions: Based on our three data sets, the comparison shows that established clinical covariates will often lead to better predictions than what can be obtained from genomic data alone. In the cases where the genomic models are better than the clinical, ridge regression is used for dimension reduction. We also find that the clinico-genomic models tend to outperform the models based on only genomic data. Further, clinico-genomic models combined with ridge regression give better predictions for all three data sets than models based on the clinical covariates alone.
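For orientation, the idea of reducing only the genomic part of a Cox model can be written as a penalized partial log-likelihood. The formula below is a sketch for the ridge case under assumed notation that does not come from the paper: $z_i$ are the clinical covariates with coefficients $\gamma$, $x_i$ the genomic covariates with coefficients $\beta$, $\delta_i$ the event indicator, $R(t_i)$ the risk set at event time $t_i$ (no tied event times assumed), and $\lambda \ge 0$ a tuning parameter.

```latex
\ell_{\lambda}(\gamma, \beta)
  = \sum_{i:\,\delta_i = 1}
    \Big[ z_i^{\top}\gamma + x_i^{\top}\beta
          - \log \sum_{j \in R(t_i)} \exp\big( z_j^{\top}\gamma + x_j^{\top}\beta \big) \Big]
    - \lambda \lVert \beta \rVert_2^2
```

Only $\beta$ appears in the penalty term, so the clinical coefficients $\gamma$ are estimated without shrinkage while the high-dimensional genomic coefficients are regularized.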

    A multimodal approach to cardiovascular risk stratification in patients with type 2 diabetes incorporating retinal, genomic and clinical features

    Cardiovascular diseases are a public health concern; they remain the leading cause of morbidity and mortality in patients with type 2 diabetes. Phenotypic information available from retinal fundus images and clinical measurements, in addition to genomic data, can identify relevant biomarkers of cardiovascular health. In this study, we assessed whether such biomarkers stratified risks of major adverse cardiac events (MACE). A retrospective analysis was carried out on an extract from the Tayside GoDARTS bioresource of participants with type 2 diabetes (n = 3,891). A total of 519 features were incorporated, summarising morphometric properties of the retinal vasculature, various single nucleotide polymorphisms (SNPs), as well as routine clinical measurements. After imputing missing features, a predictive model was developed on a randomly sampled set (n = 2,918) using L1-regularised logistic regression (lasso). The model was evaluated on an independent set (n = 973) and its performance was associated with overall hazard rate after censoring (log-rank p < 0.0001), suggesting that multimodal features were able to capture important knowledge for MACE risk assessment. We further showed through a bootstrap analysis that all three sources of information (retinal, genetic, routine clinical) offer robust signal. Particularly robust features included: tortuosity, width gradient, and branching point retinal groupings; SNPs known to be associated with blood pressure and cardiovascular phenotypic traits; age at imaging; clinical measurements such as blood pressure and high density lipoprotein. This novel approach could be used for fast and sensitive determination of future risks associated with MACE.
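As a rough illustration of what L1-regularised logistic regression does, the sketch below fits such a model by proximal gradient descent in plain Python. It is not the study's pipeline: the function names, the `penalty`, `step` and `n_iter` defaults, and the toy data in the usage note are all assumptions made for this example.

```python
import math


def soft_threshold(value, amount):
    """Shrink value toward zero by amount (the proximal operator of the L1 norm)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, penalty=0.05, step=0.1, n_iter=500):
    """Fit weights w and intercept b; only w is L1-penalised (hypothetical defaults)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(n_iter):
        grad_w = [0.0] * p
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Gradient of the mean logistic loss with respect to the linear predictor.
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        b -= step * grad_b
        # Gradient step followed by soft-thresholding, which sets weak weights to exactly zero.
        w = [soft_threshold(wj - step * gj, step * penalty)
             for wj, gj in zip(w, grad_w)]
    return w, b
```

Calling `fit_l1_logistic([[1.0, 0.0], [-1.0, 0.0]], [1, 0])` on this toy input returns a positive weight for the first feature and leaves the uninformative second feature at zero, which is the feature-selection behaviour that makes the lasso attractive when many candidate features are available.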

    Stepwise classification of cancer samples using clinical and molecular data

    Background: Combining clinical and molecular data types may potentially improve prediction accuracy of a classifier. However, currently there is a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages: First, coarse combination may cause subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and (cost) inefficient. Results: We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to two data types independently, starting with the traditional clinical risk factors. We only turn to relatively expensive molecular data when the uncertainty of the prediction result from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples. Conclusions: Our method renders a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may lead to reduced waiting times (for diagnosis) and hence lower the patients' distress. Stepwise classification is implemented in the R package stepwiseCM and available at the Bioconductor website.
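The stepwise decision rule described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the stepwiseCM implementation: the function name, the uncertainty measure (distance of the clinical probability from 0.5, rescaled to [0, 1]) and the default limit are hypothetical choices made here.

```python
from typing import Callable, Tuple


def stepwise_classify(clinical_prob: float,
                      molecular_prob: Callable[[], float],
                      uncertainty_limit: float = 0.5) -> Tuple[int, str]:
    """Classify from clinical data first; fall back to molecular data if too uncertain.

    clinical_prob     -- predicted probability of class 1 from the clinical classifier
    molecular_prob    -- callable that runs the (expensive) molecular classifier on demand
    uncertainty_limit -- hypothetical threshold on the assumed uncertainty measure
    """
    # Assumed uncertainty measure: 0 when the probability is 0 or 1, 1 when it is 0.5.
    uncertainty = 1.0 - abs(2.0 * clinical_prob - 1.0)
    if uncertainty <= uncertainty_limit:
        # The clinical prediction is confident enough, so the molecular test is skipped.
        return int(clinical_prob >= 0.5), "clinical"
    # Only now is the molecular classifier evaluated.
    return int(molecular_prob() >= 0.5), "molecular"
```

Passing the molecular classifier as a callable mirrors the cost argument in the abstract: the expensive measurement is only requested for samples whose clinical prediction is too uncertain.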

    Determining Frequent Patterns of Copy Number Alterations in Cancer

    Cancer progression is often driven by an accumulation of genetic changes but also accompanied by increasing genomic instability. These processes lead to a complicated landscape of copy number alterations (CNAs) within individual tumors and great diversity across tumor samples. High resolution array-based comparative genomic hybridization (aCGH) is being used to profile CNAs of ever larger tumor collections, and better computational methods for processing these data sets and identifying potential driver CNAs are needed. Typical studies of aCGH data sets take a pipeline approach, starting with segmentation of profiles, calls of gains and losses, and finally determination of frequent CNAs across samples. A drawback of pipelines is that choices at each step may produce different results, and biases are propagated forward. We present a mathematically robust new method that exploits probe-level correlations in aCGH data to discover subsets of samples that display common CNAs. Our algorithm is related to recent work on maximum-margin clustering. It does not require pre-segmentation of the data and also provides grouping of recurrent CNAs into clusters. We tested our approach on a large cohort of glioblastoma aCGH samples from The Cancer Genome Atlas and recovered almost all CNAs reported in the initial study. We also found additional significant CNAs missed by the original analysis but supported by earlier studies, and we identified significant correlations between CNAs.

    Sparse canonical correlation analysis for identifying, connecting and completing gene-expression networks

    Background: We generalized penalized canonical correlation analysis for analyzing microarray gene-expression measurements, with the aim of checking the completeness of known metabolic pathways and identifying candidate genes for incorporation in the pathway. We used Wold's method for calculation of the canonical variates, and we applied ridge penalization to the regression of pathway genes on canonical variates of the non-pathway genes, and the elastic net to the regression of non-pathway genes on the canonical variates of the pathway genes. Results: We performed a small simulation to illustrate the model's capability to identify new candidate genes to incorporate in the pathway: in our simulations it appeared that a gene was correctly identified if the correlation with the pathway genes was 0.3 or more. We applied the methods to a gene-expression microarray data set of 12,209 genes measured in 45 patients with glioblastoma, and we considered genes to incorporate in the glioma-pathway: we identified more than 25 genes that correlated > 0.9 with canonical variates of the pathway genes. Conclusion: We concluded that penalized canonical correlation analysis is a powerful tool to identify candidate genes in pathway analysis.

    Bayesian lasso binary quantile regression

    In this paper, a Bayesian hierarchical model for variable selection and estimation in the context of binary quantile regression is proposed. Existing approaches to variable selection in a binary classification context are sensitive to outliers, heteroskedasticity or other anomalies of the latent response. The method proposed in this study overcomes these problems in an attractive and straightforward way. A Laplace likelihood and Laplace priors for the regression parameters are proposed and estimated with Bayesian Markov Chain Monte Carlo. The resulting model is equivalent to the frequentist lasso procedure. A conceptual result is that by doing so, the binary regression model is moved from a Gaussian to a full Laplacian framework without sacrificing much computational efficiency. In addition, an efficient Gibbs sampler to estimate the model parameters is proposed that is superior to the Metropolis algorithm that is used in previous studies on Bayesian binary quantile regression. Both the simulation studies and the real data analysis indicate that the proposed method performs well in comparison to the other methods. Moreover, as the base model is binary quantile regression, a much more detailed insight into the effects of the covariates is provided by the approach. An implementation of the lasso procedure for binary quantile regression models is available in the R package bayesQR.

    ICE COLD ERIC – International collaborative effort on chronic obstructive lung disease: exacerbation risk index cohorts – Study protocol for an international COPD cohort study

    Background: Chronic Obstructive Pulmonary Disease (COPD) is a systemic disease; morbidity and mortality due to COPD are on the increase, and it has great impact on patients' lives. Most COPD patients are managed by general practitioners (GPs). Too often, GPs base their initial assessment of a patient's disease severity mainly on lung function. However, lung function correlates poorly with COPD-specific health-related quality of life and exacerbation frequency. A validated COPD disease risk index that better represents the clinical manifestations of COPD and is feasible in primary care therefore seems useful. The objective of this study is to develop and validate a practical COPD disease risk index that predicts the clinical course of COPD in primary care patients with GOLD stages 2–4. Methods/Design: We will conduct 2 linked prospective cohort studies with COPD patients from GPs in Switzerland and the Netherlands. We will perform a baseline assessment including detailed patient history, questionnaires, lung function, history of exacerbations, measurement of exercise capacity and blood sampling. During the follow-up of at least 2 years, we will update the patients' profiles by registering exacerbations, health-related quality of life and any changes in the use of medication. The primary outcome will be health-related quality of life. Secondary outcomes will be exacerbation frequency and mortality. Using multivariable regression analysis, we will identify the best combination of variables predicting these outcomes over one and two years and, depending on funding, even more years. Discussion: Despite the diversity of clinical manifestations and available treatments, assessment and management today do not reflect the multifaceted character of the disease. This is in contrast to preventive cardiology where, nowadays, the treatment in primary care is based on a patient-specific and fairly refined cardiovascular risk profile corresponding to differences in prognosis. After completion of this study, we will have a practical COPD disease risk index that predicts the clinical course of COPD in primary care patients with GOLD stages 2–4. In a second step we will incorporate evidence-based treatment effects into this model, such that the instrument may guide physicians in selecting treatment based on the individual patient's prognosis. Trial registration: ClinicalTrials.gov Archive NCT00706602

    Identification of Yeast Transcriptional Regulation Networks Using Multivariate Random Forests

    The recent availability of whole-genome scale data sets that investigate complementary and diverse aspects of transcriptional regulation has spawned an increased need for new and effective computational approaches to analyze and integrate these large scale assays. Here, we propose a novel algorithm, based on random forest methodology, to relate gene expression (as derived from expression microarrays) to sequence features residing in gene promoters (as derived from DNA motif data) and transcription factor binding to gene promoters (as derived from tiling microarrays). We extend the random forest approach to model a multivariate response as represented, for example, by time-course gene expression measures. An analysis of the multivariate random forest output reveals complex regulatory networks, which consist of cohesive, condition-dependent regulatory cliques. Each regulatory clique features homogeneous gene expression profiles and common motifs or synergistic motif groups. We apply our method to several yeast physiological processes: cell cycle, sporulation, and various stress conditions. Our technique displays excellent performance with regard to identifying known regulatory motifs, including high order interactions. In addition, we present evidence of the existence of an alternative MCB-binding pathway, which we confirm using data from two independent cell cycle studies and two other physiological processes. Finally, we have uncovered elaborate transcription regulation refinement mechanisms involving PAC and mRRPE motifs that govern essential rRNA processing. These include intriguing instances of differing motif dosages and differing combinatorial motif control that promote regulatory specificity in rRNA metabolism under differing physiological processes.