407 research outputs found

    A novel weighted rank aggregation algorithm with applications in gene prioritization

    We propose a new family of algorithms for bounding/approximating the optimal solution of rank aggregation problems based on weighted Kendall distances. The algorithms represent linear programming relaxations of integer programs that involve variables reflecting partial orders of three or more candidates. Our simulation results indicate that the linear programs give near-optimal performance for a number of important voting parameters, and outperform methods based on PageRank and Weighted Bipartite Matching. Finally, we illustrate the performance of the aggregation method on a set of test genes pertaining to the Bardet-Biedl syndrome, schizophrenia, and HIV and show that the combinatorial method matches or outperforms state-of-the-art algorithms such as ToppGene.
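
To make the idea of a weighted Kendall distance concrete, the sketch below sums a weight over every pair of candidates that two rankings order differently. It is a simplified pairwise variant written for illustration only, not the definition or the linear programs used in the paper; the function name `weighted_kendall` and the weight function are assumptions.

```python
from itertools import combinations


def weighted_kendall(ref, other, weight=None):
    """Sum a weight over item pairs that `ref` and `other` order differently.

    `weight` maps the reference position (0 = top) of the higher-ranked item
    of a pair to a cost. With the default constant weight of 1.0 this is the
    ordinary Kendall tau distance. The weighting scheme is hypothetical.
    """
    if weight is None:
        weight = lambda top_pos: 1.0
    pos_ref = {item: i for i, item in enumerate(ref)}
    pos_other = {item: i for i, item in enumerate(other)}
    total = 0.0
    # combinations() follows the order of `ref`, so a precedes b in `ref`.
    for a, b in combinations(ref, 2):
        if pos_other[a] > pos_other[b]:  # the pair is discordant
            total += weight(pos_ref[a])
    return total


reference = ["g1", "g2", "g3", "g4"]
swap_top = ["g2", "g1", "g3", "g4"]
swap_bottom = ["g1", "g2", "g4", "g3"]
top_heavy = lambda top_pos: 1.0 / (1 + top_pos)

print(weighted_kendall(reference, swap_top, top_heavy))
print(weighted_kendall(reference, swap_bottom, top_heavy))
```

With a decreasing weight function, a swap near the top of the list costs more than the same swap near the bottom, which is the behaviour a top-sensitive distance is meant to capture.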

    Gene prioritization through hybrid distance-score rank aggregation

    This thesis is concerned with developing novel rank aggregation methods for gene prioritization. Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: a) First, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. b) Second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-vs-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list, rather than at the bottom, since only top candidates are tested experimentally. Specifically, we combine score-based Borda and Kendall permutation distance aggregation methods with TvB weightings. c) Third, we propose an iterative procedure for gene discovery that operates via successful augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency. We tested HyDRA on a number of gene sets, including Autism, Breast cancer, Colorectal cancer, Endometriosis, Ischaemic stroke, Leukemia, Lymphoma, and Osteoarthritis. Furthermore, we performed iterative gene discovery for Glioblastoma, Meningioma and Breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. 
The methods outperform state-of-the-art software tools such as ToppGene and Endeavour.
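
As a rough illustration of score-based aggregation with extra emphasis on the top of the list, the sketch below implements classic Borda counting with an optional per-position score. It is not the HyDRA implementation; the function name `borda_aggregate` and the top-heavy score `1 / (pos + 1)` are assumptions chosen only to show the idea.

```python
from collections import defaultdict


def borda_aggregate(rankings, position_score=None):
    """Aggregate rankings (best item first) by summing per-position scores.

    By default an item at position `pos` in a list of length n earns
    n - 1 - pos points (classic Borda). `position_score(pos, n)` can replace
    this with a hypothetical top-weighted score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            if position_score is None:
                scores[item] += n - 1 - pos
            else:
                scores[item] += position_score(pos, n)
    # Highest total first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))


rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
print(borda_aggregate(rankings))
print(borda_aggregate(rankings, position_score=lambda pos, n: 1.0 / (pos + 1)))
```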

    Rank-Similarity Measures for Comparing Gene Prioritizations: A Case Study in Autism

    We discuss the challenge of comparing three gene prioritization methods: network propagation, integer linear programming rank aggregation (RA), and statistical RA. These methods are based on different biological categories and estimate disease-gene association. Previously proposed comparison schemes are based on three measures of performance: receiver operating characteristic curve, area under the curve, and median rank ratio. Although they may capture important aspects of gene prioritization performance, they may fail to capture important differences in the rankings of individual genes. We suggest that comparison schemes could be improved by also considering recently proposed measures of similarity between gene rankings. We tested this suggestion on comparison schemes for prioritizations of genes associated with autism that were obtained using brain- and tissue-specific data. Our results show the effectiveness of our measures of similarity in clustering brain regions based on their relevance to autism.

    Stability and aggregation of ranked gene lists

    Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change of the data set usually affects the obtained gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of the increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for the assessment of stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.
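
One simple way to quantify how much two ranked gene lists agree is the fraction of genes shared by their top-k sections. The sketch below is a generic illustration of that idea, not a function from GeneSelector and not necessarily one of the measures reviewed in the article; the name `top_k_overlap` is an assumption.

```python
def top_k_overlap(list_a, list_b, k):
    """Fraction of the top-k genes of `list_a` that also appear in the
    top-k genes of `list_b` (1.0 = identical top-k sets, 0.0 = disjoint)."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_a = set(list_a[:k])
    top_b = set(list_b[:k])
    return len(top_a & top_b) / k


original = ["g1", "g2", "g3", "g4", "g5"]
perturbed = ["g2", "g1", "g5", "g3", "g4"]  # e.g. ranking after resampling
print(top_k_overlap(original, perturbed, 2))
print(top_k_overlap(original, perturbed, 3))
```

Repeating this over many perturbed versions of a data set and averaging the overlap gives a crude stability estimate for a ranking procedure.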

    Integrative multi-omic network strategies for unraveling complex disease biology and the identification of novel phenotype associated genes

    Identifying the genetic risk factors underlying a given disease is an essential step for informing effective drug targets, understanding disease architecture, and predicting at-risk individuals. A commonly applied approach for identifying novel disease-associated genes is the Genome Wide Association Study (GWAS) approach, in which a high number of individuals are sequenced and genetic variants are then tested for an association with disease status. While the GWAS approach has identified countless disease-associated genes, there remain plenty of diseases for which our genetic understanding is still incomplete. One strategy for augmenting the GWAS approach is to incorporate additional omics data in order to prioritize biologically plausible candidate genes. In this thesis work, we integrate network-based strategies with existing genetic analysis pipelines in order to identify novel Alzheimer’s disease (AD) genes. Two types of biological data inform the underlying structure of the networks: a) protein-protein interactions and b) gene expression in the human brain. Genes that interact or are co-expressed across similar conditions have been shown to have a higher probability of being functionally related. Using a set of previously known AD genes, we apply a network propagation strategy to score genes based upon their proximity to the known AD genes within these networks. Then we integrate the network score of each gene with its risk score from GWAS to identify novel candidates. To affirm the reproducibility of findings, we further incorporate additional information in the form of knockout models in flies, bootstrap aggregation, and external genetic datasets. In addition to predicting novel genes, we are able to utilize regional co-expression networks to further understand how the known AD genes behave within the various sub-divisions of the brain. 
We find that regions of the brain which are known to have the earliest vulnerability to AD-induced neurodegeneration also tend to be where AD genes are highly correlated.
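
Network propagation from a set of known disease genes can be sketched as a random walk with restart on an undirected gene network. The code below is a minimal illustration under that assumption; the thesis may use a different propagation scheme, and the toy network, the function name `propagate`, and the `restart` value are hypothetical.

```python
def propagate(adjacency, seeds, restart=0.5, iterations=50):
    """Random walk with restart on a network given as {node: [neighbours]}.

    Scores start on the seed genes, spread along edges, and are pulled back
    to the seeds with probability `restart` at every step. Higher final
    scores indicate closer network proximity to the seed set.
    """
    nodes = sorted(adjacency)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iterations):
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                continue  # isolated node: nothing to spread
            share = (1.0 - restart) * p[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        p = nxt
    return p


# Toy path network A - B - C - D with a single known disease gene "A".
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = propagate(network, seeds={"A"})
for gene in sorted(scores, key=scores.get, reverse=True):
    print(gene, round(scores[gene], 3))
```

Genes closer to the seed receive higher scores, so sorting by score gives a proximity-based prioritization that could then be combined with other evidence such as a GWAS risk score.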

    Genome-wide association study of thyroid-stimulating hormone highlights new genes, pathways and associations with thyroid disease.

    Thyroid hormones play a critical role in regulation of multiple physiological functions and thyroid dysfunction is associated with substantial morbidity. Here, we use electronic health records to undertake a genome-wide association study of thyroid-stimulating hormone (TSH) levels, with a total sample size of 247,107. We identify 158 novel genetic associations, more than doubling the number of known associations with TSH, and implicate 112 putative causal genes, of which 76 were not previously implicated. A polygenic score for TSH is associated with TSH levels in African, South Asian, East Asian, Middle Eastern and admixed American ancestries, and associated with hypothyroidism and other thyroid disease in South Asians. In Europeans, the TSH polygenic score is associated with thyroid disease, including thyroid cancer and age-of-onset of hypothyroidism and hyperthyroidism. We develop pathway-specific genetic risk scores for TSH levels and use these in phenome-wide association studies to identify potential consequences of pathway perturbation. Together, these findings demonstrate the potential utility of genetic associations to inform future therapeutics and risk prediction for thyroid diseases.

    Rank Aggregation via Heterogeneous Thurstone Preference Models

    We propose the Heterogeneous Thurstone Model (HTM) for aggregating ranked data, which can take the accuracy levels of different users into account. By allowing different noise distributions, the proposed HTM model maintains the generality of Thurstone's original framework, and as such, also extends the Bradley-Terry-Luce (BTL) model for pairwise comparisons to heterogeneous populations of users. Under this framework, we also propose a rank aggregation algorithm based on alternating gradient descent to estimate the underlying item scores and accuracy levels of different users simultaneously from noisy pairwise comparisons. We theoretically prove that the proposed algorithm converges linearly up to a statistical error which matches that of the state-of-the-art method for the single-user BTL model. We evaluate the proposed HTM model and algorithm on both synthetic and real data, demonstrating that it outperforms existing methods. Comment: 36 pages, 2 figures, 8 tables. In AAAI 202
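
The sketch below illustrates how a per-user accuracy level can enter a BTL-style pairwise comparison model: the probability that a user prefers item i over item j is a logistic function of the score gap, scaled by that user's accuracy. This parameterization is an assumption made for illustration, and the function names are hypothetical; the paper's exact model and its alternating gradient descent algorithm are not reproduced here.

```python
import math


def preference_probability(score_i, score_j, accuracy=1.0):
    """Probability that a user with the given accuracy prefers item i to j.

    accuracy = 1.0 gives the standard BTL probability for log-scale scores;
    accuracy = 0.0 describes a user who answers at random.
    """
    return 1.0 / (1.0 + math.exp(-accuracy * (score_i - score_j)))


def log_likelihood(comparisons, scores, accuracies):
    """Log-likelihood of observed (user, winner, loser) triples."""
    return sum(
        math.log(preference_probability(scores[w], scores[l], accuracies[u]))
        for u, w, l in comparisons
    )


scores = {"x": 1.0, "y": 0.0}
accuracies = {"expert": 3.0, "novice": 0.5}
data = [("expert", "x", "y"), ("novice", "y", "x")]
print(preference_probability(scores["x"], scores["y"], accuracies["expert"]))
print(log_likelihood(data, scores, accuracies))
```

An estimation procedure would maximize this log-likelihood over both the item scores and the user accuracies.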

    Gene co-expression analyses: an overview from microarray collections in Arabidopsis thaliana

    Bioinformatics web-based resources and databases are precious references for most biological laboratories worldwide. However, the quality and reliability of the information they provide depends on them being used in an appropriate way that takes into account their specific features. Huge collections of gene expression data are currently publicly available, ready to support the understanding of gene and genome functionalities. In this context, tools and resources for gene co-expression analyses have flourished to exploit the ‘guilty by association’ principle, which assumes that genes with correlated expression profiles are functionally related. In the case of Arabidopsis thaliana, the reference species in plant biology, the resources available mainly consist of microarray results. After a general overview of such resources, we tested and compared the results they offer for gene co-expression analysis. We also discuss the effect on the results when using different data sets, as well as different data normalization approaches and parameter settings, which often consider different metrics for establishing co-expression. A dedicated example analysis of different gene pools, implemented by including/excluding mutant samples in a reference data set, showed significant variation of gene co-expression occurrence, magnitude and direction. We conclude that, as the heterogeneity of the resources and methods may produce different results for the same query genes, the exploration of more than one of the available resources is strongly recommended. The aim of this article is to show how best to integrate data sources and/or merge outputs to achieve robust analyses and reliable interpretations, thereby making use of diverse data resources an opportunity for added value. Di Salle, Pasquale; Incerti, Guido; Colantuono, Chiara; Chiusano, Maria Luisa

    Rank-based Bayesian clustering via covariate-informed Mallows mixtures

    Data in the form of rankings, ratings, pair comparisons or clicks are frequently collected in diverse fields, from marketing to politics, to understand assessors' individual preferences. Combining such preference data with features associated with the assessors can lead to a better understanding of the assessors' behaviors and choices. The Mallows model is a popular model for rankings, as it flexibly adapts to different types of preference data, and the previously proposed Bayesian Mallows Model (BMM) offers a computationally efficient framework for Bayesian inference that can also capture the users' heterogeneity via a finite mixture. We develop a Bayesian Mallows-based finite mixture model that performs clustering while also accounting for assessor-related features, called the Bayesian Mallows model with covariates (BMMx). BMMx is based on a similarity function that a priori favours the aggregation of assessors into a cluster when their covariates are similar, using the Product Partition models (PPMx) proposal. We present two approaches to measure the covariate similarity: one based on a novel deterministic function measuring the covariates' goodness-of-fit to the cluster, and one based on an augmented model as in PPMx. We investigate the performance of BMMx in both simulation experiments and real-data examples, showing the method's potential for advancing the understanding of assessor preferences and behaviors in different applications.
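
For readers unfamiliar with the Mallows model, the sketch below computes its probabilities exactly for a small number of items: each ranking receives probability proportional to `exp(-alpha * d)`, where `d` is its distance to a consensus ranking. The Kendall distance and the unscaled parameter `alpha` are assumptions made for illustration; the mixture, covariate, and Bayesian inference parts of BMMx are not shown, and the function names are hypothetical.

```python
import math
from itertools import combinations, permutations


def kendall_distance(r1, r2):
    """Number of item pairs that rankings r1 and r2 order differently."""
    pos = {item: i for i, item in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2) if pos[a] > pos[b])


def mallows_probabilities(consensus, alpha):
    """Exact Mallows probabilities over all permutations of `consensus`.

    Enumerates n! rankings, so this is only feasible for small n.
    """
    weights = {
        perm: math.exp(-alpha * kendall_distance(perm, consensus))
        for perm in permutations(consensus)
    }
    normalizer = sum(weights.values())
    return {perm: w / normalizer for perm, w in weights.items()}


probs = mallows_probabilities(("A", "B", "C"), alpha=1.0)
for ranking, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(ranking, round(prob, 3))
```

Larger values of `alpha` concentrate the probability mass on the consensus ranking, while `alpha = 0` gives the uniform distribution over rankings.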