    Gene Function Prediction from Functional Association Networks Using Kernel Partial Least Squares Regression

    With the growing availability of large-scale biological datasets, automated methods of extracting functionally meaningful information from this data are becoming increasingly important. Data relating to functional association between genes or proteins, such as co-expression or functional association, is often represented in terms of gene or protein networks. Several methods of predicting gene function from these networks have been proposed. However, evaluating the relative performance of these algorithms may not be trivial: concerns have been raised over biases in different benchmarking methods and datasets, particularly relating to non-independence of functional association data and test data. In this paper we propose a new network-based gene function prediction algorithm using a commute-time kernel and partial least squares regression (Compass). We compare Compass to GeneMANIA, a leading network-based prediction algorithm, using a number of different benchmarks, and find that Compass outperforms GeneMANIA on these benchmarks. We also explicitly explore problems associated with the non-independence of functional association data and test data. We find that a benchmark based on the Gene Ontology database, which, directly or indirectly, incorporates information from other databases, may considerably overestimate the performance of algorithms exploiting functional association data for prediction.
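As a rough illustration of the commute-time kernel mentioned in this abstract, the sketch below assumes the standard textbook definition (the Moore-Penrose pseudoinverse of the graph Laplacian). The abstract does not give the paper's exact construction or how the kernel feeds into partial least squares regression, so the function names and the 4-gene toy network here are hypothetical.

```python
import numpy as np

def commute_time_kernel(adjacency):
    """Pseudoinverse of the graph Laplacian L = D - A (assumed definition)."""
    A = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(A.sum(axis=1)) - A
    return np.linalg.pinv(laplacian)

def commute_time(K, adjacency, i, j):
    """Expected round-trip random-walk time between nodes i and j."""
    volume = np.asarray(adjacency, dtype=float).sum()  # sum of node degrees
    return volume * (K[i, i] + K[j, j] - 2.0 * K[i, j])

# Hypothetical toy network: a path g0 - g1 - g2 - g3 with unit edge weights.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

K = commute_time_kernel(A)
print(commute_time(K, A, 0, 1))  # adjacent genes
print(commute_time(K, A, 0, 3))  # genes at opposite ends of the path
```

Genes that are close in the network get a short commute time, which is the kind of similarity a kernel-based regression could then exploit.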

    Network connectivity and structural correlates of survival in progressive supranuclear palsy and corticobasal syndrome

    There is a pressing need to understand the factors that predict prognosis in progressive supranuclear palsy (PSP) and corticobasal syndrome (CBS), with high heterogeneity over the poor average survival. We test the hypothesis that the magnitude and distribution of connectivity changes in PSP and CBS predict the rate of progression and survival time, using datasets from the Cambridge Centre for Parkinson-plus and the UK National PSP Research Network (PROSPECT-MR). Resting-state functional MRI images were available from 146 participants with PSP, 82 participants with CBS, and 90 healthy controls. Large-scale networks were identified through independent component analyses, with correlations taken between component time series. Independent component analysis was also used to select between-network connectivity components to compare with baseline clinical severity, longitudinal rate of change in severity, and survival. Transdiagnostic survival predictors were identified using partial least squares regression for Cox models, with connectivity compared to patients' demographics, structural imaging, and clinical scores using five-fold cross-validation. In PSP and CBS, between-network connectivity components were identified that differed from controls, were associated with disease severity, and were related to survival and rate of change in clinical severity. A transdiagnostic component predicted survival beyond demographic and motion metrics but with lower accuracy than an optimal model that included the clinical and structural imaging measures. Cortical atrophy enhanced the connectivity changes that were most predictive of survival. Between-network connectivity is associated with variability in prognosis in PSP and CBS but does not improve predictive accuracy beyond clinical and structural imaging metrics.

    Supervised Dimension Reduction for Large-scale Omics Data with Censored Survival Outcomes Under Possible Non-proportional Hazards

    The past two decades have witnessed significant advances in high-throughput "omics" technologies such as genomics, proteomics, metabolomics, transcriptomics and radiomics. These technologies have enabled simultaneous measurement of the expression levels of tens of thousands of features from individual patient samples and have generated enormous amounts of data that require analysis and interpretation. One specific area of interest has been in studying the relationship between these features and patient outcomes, such as overall and recurrence-free survival, with the goal of developing a predictive "omics" profile. Large-scale studies often suffer from the presence of a large fraction of censored observations and potential time-varying effects of features, and methods for handling them have been lacking. In this paper, we propose supervised methods for feature selection and survival prediction that simultaneously deal with both issues. Our approach utilizes continuum power regression (CPR), a framework that includes a variety of regression methods, in conjunction with the parametric or semi-parametric accelerated failure time (AFT) model. Both CPR and AFT fall within the linear models framework and, unlike black-box models, the proposed prognostic index has a simple yet useful interpretation. We demonstrate the utility of our methods using simulated and publicly available cancer genomics data.

    Dimension reduction methods with applications to high dimensional data with a censored response

    Dimension reduction methods have come to the forefront of many applications where the number of covariates, p, far exceeds the sample size, N. For example, in survival analysis studies using microarray gene expression data, 10-30K expressions per patient are collected, but only a few hundred patients are available for the study. The focus of this work is on linear dimension reduction methods. Attention is given to the dimension reduction method of Random Projection (RP), in which the original p-dimensional data matrix X is projected onto a k-dimensional subspace using a random matrix Gamma. The motivation of RP is the Johnson-Lindenstrauss (JL) Lemma, which states that a set of N points in p-dimensional Euclidean space can be projected onto a $k \geq 24 \ln N / (3\epsilon^2 - 2\epsilon^3)$ dimensional Euclidean space such that the pairwise distances between the points are preserved within a factor $1 \pm \epsilon$. In this work, the JL Lemma is revisited when the random matrix Gamma is defined as standard Gaussian and Achlioptas-type. An improvement on the lower bound for k is provided by working directly with the distributions of the random distances rather than resorting to the moment generating function technique used in the literature. An improvement on the lower bound for k is also provided when using pairwise L2 distances in the space of the original points and pairwise L1 distances in the space of the projected points. Another popular dimension reduction method is Partial Least Squares. In this work, a variant of Partial Least Squares is proposed, denoted by Rank-based Modified Partial Least Squares (RMPLS). The weight vectors of RMPLS can be seen to be the solution to an optimization problem. The method is insensitive to outlying values of both the response and the covariates, and takes into account the censoring information in the construction of its weight vectors. Results from simulation and real datasets under the Cox and Accelerated Failure Time (AFT) models indicate that RMPLS outperforms other leading methods for various measures when outliers are present in the response, and is comparable to other methods in the absence of outliers in the response.
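To make the Johnson-Lindenstrauss bound in this abstract concrete, the sketch below evaluates the classical lower bound k ≥ 24 ln N / (3ε² − 2ε³) and applies a standard Gaussian random projection. It illustrates only the classical bound, not the improved bounds the thesis derives; the helper names and the 1/sqrt(k) scaling convention are assumptions made for this example.

```python
import math
import numpy as np

def jl_min_dim(n_points, eps):
    """Smallest integer k satisfying the classical JL bound."""
    return math.ceil(24 * math.log(n_points) / (3 * eps**2 - 2 * eps**3))

def gaussian_random_projection(X, k, rng):
    """Project the rows of X onto k dimensions with a standard Gaussian matrix."""
    p = X.shape[1]
    gamma = rng.standard_normal((p, k))   # random matrix Gamma (p x k)
    return X @ gamma / math.sqrt(k)       # scaling keeps squared norms unbiased

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 1000))       # hypothetical data: N = 50, p = 1000
k = jl_min_dim(X.shape[0], eps=0.5)
Z = gaussian_random_projection(X, k, rng)
print(k, Z.shape)
```

Note that the bound depends only on N and ε, not on the original dimension p, which is why random projection is attractive when p is far larger than N.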

    New covariates selection approaches in high dimensional or functional regression models

    In a Big Data context, the number of covariates used to explain a variable of interest, p, is likely to be high, sometimes even higher than the available sample size (p > n). Ordinary procedures for fitting regression models perform poorly in this situation, so other approaches are needed. A preliminary covariate selection step is useful for retaining only the relevant terms and reducing the dimensionality of the problem. The purpose of this thesis is the study and development of covariate selection techniques for regression models in complex settings. In particular, we focus on recent high dimensional or functional data contexts of interest. When some model structure is assumed, regularization techniques are widely employed to perform model estimation and covariate selection simultaneously. Specifically, an extensive and critical review of penalization techniques for covariate selection is carried out. This is developed in the context of the high dimensional linear model of the vectorial framework. Conversely, if no model structure is to be assumed, state-of-the-art dependence measures based on distances are an attractive option for covariate selection. New specification tests using these ideas are proposed for the functional concurrent model. Both versions are considered separately: the synchronous and the asynchronous case. These approaches are based on novel dependence measures derived from the distance covariance coefficient.
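For readers unfamiliar with distance-based dependence measures, the following is a minimal sketch of the ordinary sample distance correlation for vector data, computed from double-centered pairwise distance matrices. It is not the novel measures or the functional concurrent model tests developed in the thesis; the function names and the simulated data are assumptions for illustration only.

```python
import numpy as np

def _centered_distances(x):
    """Pairwise Euclidean distance matrix, double-centered."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between two samples of equal length."""
    a, b = _centered_distances(x), _centered_distances(y)
    dcov2 = (a * b).mean()                          # squared distance covariance
    dvar2 = np.sqrt((a * a).mean() * (b * b).mean())
    return 0.0 if dvar2 == 0 else float(np.sqrt(max(dcov2, 0.0) / dvar2))

rng = np.random.default_rng(0)
x = rng.standard_normal(300)
y = x**2 + 0.1 * rng.standard_normal(300)   # nonlinear dependence on x

print(np.corrcoef(x, y)[0, 1])       # Pearson correlation
print(distance_correlation(x, y))    # distance correlation
```

Because distance correlation does not assume a linear model, it can flag covariates like the one above whose relationship with the response a Pearson-based screen may miss.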

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels from Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking is established for these four hotel units, all located in Portugal, using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, so that the main causes of inefficiency can be investigated. Several suggestions for improving efficiency are offered for each hotel studied.