
    Identification of genetic variants that impact gene co-expression relationships using large-scale single-cell data

    Background: Expression quantitative trait loci (eQTL) studies show how genetic variants affect downstream gene expression. Single-cell data allows reconstruction of personalized co-expression networks and therefore the identification of SNPs altering co-expression patterns (co-expression QTLs, co-eQTLs), as well as the affected upstream regulatory processes, using a limited number of individuals. Results: We conduct a co-eQTL meta-analysis across four scRNA-seq peripheral blood mononuclear cell datasets using a novel filtering strategy followed by a permutation-based multiple testing approach. Before the analysis, we evaluate the co-expression patterns required for co-eQTL identification using different external resources. We identify a robust set of cell-type-specific co-eQTLs for 72 independent SNPs affecting 946 gene pairs. These co-eQTLs are replicated in a large bulk cohort and provide novel insights into how disease-associated variants alter regulatory networks. One co-eQTL SNP, rs1131017, which is associated with several autoimmune diseases, affects the co-expression of RPS26 with other ribosomal genes. Interestingly, specifically in T cells, the SNP additionally affects co-expression of RPS26 and a group of genes associated with T cell activation and autoimmune disease. Among these genes, we identify enrichment for targets of five T-cell-activation-related transcription factors whose binding sites harbor rs1131017. This reveals a previously overlooked process and pinpoints potential regulators that could explain the association of rs1131017 with autoimmune diseases. Conclusion: Our co-eQTL results highlight the importance of studying context-specific gene regulation to understand the biological implications of genetic variation. With the expected growth of sc-eQTL datasets, our strategy and technical guidelines will facilitate future co-eQTL identification, further elucidating unknown disease mechanisms.
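    The per-individual co-expression idea in this abstract can be sketched numerically: compute each donor's gene-pair correlation across their cells, then test whether that correlation depends on SNP dosage. A minimal illustration in Python, assuming simulated data and a plain OLS slope test; the variable names and simulation are stand-ins, not the meta-analysis pipeline described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def co_eqtl_test(expr_a, expr_b, genotypes):
    """Test whether a SNP's allele dosage (0/1/2) shifts the per-individual
    co-expression of two genes.

    expr_a, expr_b : lists of per-individual 1-D arrays of single-cell
                     expression values for gene A and gene B.
    genotypes      : allele dosage per individual.
    Returns the OLS slope of correlation ~ dosage and its t statistic.
    """
    # Per-individual correlation across that individual's cells
    r = np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(expr_a, expr_b)])
    g = np.asarray(genotypes, dtype=float)
    g_c = g - g.mean()
    slope = (g_c @ (r - r.mean())) / (g_c @ g_c)       # simple OLS
    resid = r - r.mean() - slope * g_c
    se = np.sqrt(resid @ resid / (len(r) - 2) / (g_c @ g_c))
    return slope, slope / se

# Toy data: 30 individuals, 200 cells each; dosage strengthens co-expression
n_ind, n_cells = 30, 200
genotypes = rng.integers(0, 3, n_ind)
expr_a, expr_b = [], []
for g in genotypes:
    shared = rng.normal(size=n_cells)
    w = 0.2 + 0.3 * g                     # genotype-dependent coupling
    expr_a.append(shared + rng.normal(size=n_cells))
    expr_b.append(w * shared + rng.normal(size=n_cells))

slope, t = co_eqtl_test(expr_a, expr_b, genotypes)
print(f"slope={slope:.3f}  t={t:.2f}")
```

    A positive, significant slope would flag the SNP as a candidate co-eQTL for this gene pair; the paper's actual analysis additionally uses filtering, meta-analysis across datasets, and permutation-based multiple testing.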

    propr: An R-package for identifying proportionally abundant features using compositional data analysis

    In the life sciences, many assays measure only the relative abundances of components in each sample. Such data, called compositional data, require special treatment to avoid misleading conclusions. Awareness of the need for caution in analyzing compositional data is growing, including the understanding that correlation is not appropriate for relative data. Recently, researchers have proposed proportionality as a valid alternative to correlation for calculating pairwise association in relative data. Although the question of how best to measure proportionality remains open, we present here a computationally efficient R package that implements three measures of proportionality. In an effort to advance the understanding and application of proportionality analysis, we review the mathematics behind proportionality, demonstrate its application to genomic data, and discuss some ongoing challenges in the analysis of relative abundance data.
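    propr itself is an R package; as a language-agnostic illustration of the underlying idea, here is a sketch in Python of one proportionality statistic, the log-ratio-variance measure commonly called phi. The function name and toy data are illustrative, not the package's API:

```python
import numpy as np

def phi(x, y):
    """One proportionality statistic (phi): the variance of the log-ratio
    log(x/y), scaled by the variance of log(x). Values near 0 mean x and y
    track each other proportionally across samples; large values mean they
    do not. Unlike correlation, this is meaningful for relative abundances."""
    lx, ly = np.log(x), np.log(y)
    return np.var(lx - ly) / np.var(lx)

x = np.array([1.0, 2.0, 4.0, 8.0])
y = 3.0 * x                          # perfectly proportional to x
z = np.array([8.0, 4.0, 2.0, 1.0])   # moves opposite to x

print(phi(x, y))   # near 0: proportional
print(phi(x, z))   # large: not proportional
```

    Because phi depends only on ratios between features, it is invariant to per-sample scaling, which is exactly the property needed for compositional (relative-abundance) data.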

    Evaluation of clustering results and novel cluster algorithms

    Cluster analysis is frequently performed in many application fields to find groups in data. For example, in medicine, researchers have used gene expression data to cluster patients suffering from a particular disease (e.g., breast cancer), in order to detect new disease subtypes. Many cluster algorithms and methods for cluster validation, i.e., methods for evaluating the quality of cluster analysis results, have been proposed in the literature. However, open questions about the evaluation of both clustering results and novel cluster algorithms remain. It has rarely been discussed whether a) interesting clustering results or b) promising performance evaluations of newly presented cluster algorithms might be over-optimistic, in the sense that these good results cannot be replicated on new data or in other settings. Such questions are relevant in light of the so-called "replication crisis"; in various research disciplines such as medicine, biology, psychology, and economics, many results have turned out to be non-replicable, casting doubt on the trustworthiness and reliability of scientific findings. This crisis has led to increasing popularity of "metascience". Metascientific studies analyze problems that have contributed to the replication crisis (e.g., questionable research practices), and propose and evaluate possible solutions. So far, metascientific studies have mainly focused on issues related to significance testing. In contrast, this dissertation addresses the reliability of a) clustering results in applied research and b) results concerning newly presented cluster algorithms in the methodological literature. Different aspects of this topic are discussed in three Contributions. The first Contribution presents a framework for validating clustering results on validation data. Using validation data is vital to examine the replicability and generalizability of results. 
While applied researchers sometimes use validation data to check their clustering results, our article is the first to review the different approaches in the literature and to structure them in a systematic manner. We demonstrate that many classical cluster validation techniques, such as internal and external validation, can be combined with validation data. Our framework provides guidance to applied researchers who wish to evaluate their own clustering results or the results of other teams on new data. The second Contribution applies the framework from Contribution 1 to quantify over-optimistic bias in the context of a specific application field, namely unsupervised microbiome research. We analyze over-optimism effects which result from the multiplicity of analysis strategies for cluster analysis and network learning. The plethora of possible analysis strategies poses a challenge for researchers who are often uncertain about which method to use. Researchers might be tempted to try different methods on their dataset and look for the method yielding the "best" result. If only the "best" result is selectively reported, this may cause "overfitting" of the method to the dataset and the result might not be replicable on validation data. We quantify such over-optimism effects for four illustrative types of unsupervised research tasks (clustering of bacterial genera, hub detection in microbial association networks, differential network analysis, and clustering of samples). Contributions 1 and 2 consider the evaluation of clustering results and thus adopt a metascientific perspective on applied research. In contrast, the third Contribution is a metascientific study about methodological research on the development of new cluster algorithms. This Contribution analyzes the over-optimistic evaluation and reporting of novel cluster algorithms. 
    As an illustrative example, we consider the recently proposed cluster algorithm "Rock"; initially deemed promising, it later turned out to be not generally better than its competitors. We demonstrate how Rock can nevertheless appear to outperform competitors via optimization of the evaluation design, namely the data types and data characteristics used, the algorithm's parameters, and the choice of competing algorithms. The study is a cautionary tale that illustrates how easy it can be for researchers to claim apparent "superiority" of a new cluster algorithm. This, in turn, stresses the importance of strategies for avoiding the problems of over-optimism, such as neutral benchmark studies.
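    The validation-data idea behind the first Contribution can be made concrete: transfer a clustering found on discovery data to a fresh sample and score the agreement with an independent clustering of that sample, e.g. via the adjusted Rand index. A toy sketch, in which the nearest-centroid transfer, the simulated subgroups, and the use of the generating labels as the "independent" clustering are all illustrative stand-ins for the framework's many variants:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(1)

def ari(a, b):
    """Adjusted Rand index: chance-corrected agreement of two labelings."""
    a, b = np.asarray(a), np.asarray(b)
    ct = np.zeros((a.max() + 1, b.max() + 1), dtype=int)
    for i, j in zip(a, b):
        ct[i, j] += 1
    sum_ij = sum(comb(int(v), 2) for v in ct.ravel())
    sum_a = sum(comb(int(v), 2) for v in ct.sum(axis=1))
    sum_b = sum(comb(int(v), 2) for v in ct.sum(axis=0))
    expected = sum_a * sum_b / comb(len(a), 2)
    maximum = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (maximum - expected)

# Two well-separated "patient subgroups", sampled twice:
def sample(n):
    labels = rng.integers(0, 2, n)
    points = rng.normal(size=(n, 2)) + 10.0 * labels[:, None]
    return points, labels

X_disc, lab_disc = sample(80)   # clustering found on discovery data
X_val, lab_val = sample(80)     # fresh validation sample

# Transfer the discovery clustering to validation data via nearest centroid
centroids = np.array([X_disc[lab_disc == k].mean(axis=0) for k in (0, 1)])
transferred = np.argmin(((X_val[:, None] - centroids) ** 2).sum(-1), axis=1)

# Agreement with an independent clustering of the validation data
print(f"ARI on validation data: {ari(transferred, lab_val):.2f}")
```

    A high ARI on the validation sample suggests the discovered structure replicates; a low ARI is a warning sign that the clustering may be over-optimistic.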

    Statistical physics approaches to protein translation

    In this work we present an integrated approach to the study of protein translation, based on statistical physics. We adopted three different but complementary perspectives: building hypotheses up from the data, modeling down from reasonable assumptions, and using computer simulations when everything else fails. In particular, we first analyze the mRNA sequences by means of information theory. We focus on the way the redundancy of the genetic code (the 61 sense triplets of nucleotides, the codons, encode for 20 amino acids) is utilized in the actual sequences, a phenomenon known as the codon bias. We observe that it is not completely random, and encodes information in the frequencies and in the order of the codons. With the scope of explaining these anomalies, we develop and analyze a family of stochastic models. Translation emerges as a systemic process, where the limited amount of resources in the cell couples the expression of the genes at a global level. We also suggest a game-theoretical interpretation of the codon bias. We finally attack the problem of cotranslational folding (i.e., the folding of the protein while it is still being translated). Specifically, we ask how this process depends on the order of the codons. This question is computationally very cumbersome. We thus propose a framework based on Markov chains, which allows the efficient simulation of arbitrarily complicated cotranslational folding mechanisms.
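    The codon-bias analysis rests on a simple information-theoretic quantity: how evenly a coding sequence uses the synonymous codons for a given amino acid. A minimal sketch with toy sequences (the thesis's actual statistics are richer, covering both codon frequencies and codon order):

```python
import math
from collections import Counter

def codon_usage_entropy(seq, synonymous):
    """Shannon entropy (bits) of usage among a set of synonymous codons.
    Uniform usage maximizes the entropy; codon bias lowers it."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in synonymous)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# The four codons GGU/GGC/GGA/GGG all encode glycine
GLY = {"GGU", "GGC", "GGA", "GGG"}
balanced = "GGUGGCGGAGGG" * 5   # uniform usage: entropy = 2 bits
biased = "GGCGGCGGCGGU" * 5     # skewed usage: lower entropy

print(codon_usage_entropy(balanced, GLY))
print(codon_usage_entropy(biased, GLY))
```

    The gap between the maximal entropy (log2 of the number of synonymous codons) and the observed entropy is one simple way to quantify how much information codon bias encodes.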

    Detection of recombination in DNA multiple alignments with hidden Markov models

    Conventional phylogenetic tree estimation methods assume that all sites in a DNA multiple alignment have the same evolutionary history. This assumption is violated in data sets from certain bacteria and viruses due to recombination, a process that leads to the creation of mosaic sequences from different strains and, if undetected, causes systematic errors in phylogenetic tree estimation. In the current work, a hidden Markov model (HMM) is employed to detect recombination events in multiple alignments of DNA sequences. The emission probabilities in a given state are determined by the branching order (topology) and the branch lengths of the respective phylogenetic tree, while the transition probabilities depend on the global recombination probability. The present study improves on an earlier heuristic parameter optimization scheme and shows how the branch lengths and the recombination probability can be optimized in a maximum likelihood sense by applying the expectation maximization (EM) algorithm. The novel algorithm is tested on a synthetic benchmark problem and is found to clearly outperform the earlier heuristic approach. The paper concludes with an application of this scheme to a DNA sequence alignment of the argF gene from four Neisseria strains, where a likely recombination event is clearly detected.
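    The decoding side of such an HMM can be sketched with the Viterbi algorithm: hidden states are candidate tree topologies, transitions encode the global recombination probability, and per-site emissions come from each tree's likelihood. In this toy sketch the emission scores are random stand-ins rather than true phylogenetic likelihoods, and the EM fitting of branch lengths and the recombination rate described in the paper is omitted:

```python
import numpy as np

def viterbi(log_emis, log_trans, log_init):
    """Most probable hidden-state path (here: tree topology per site)."""
    n_sites, n_states = log_emis.shape
    dp = np.zeros((n_sites, n_states))
    back = np.zeros((n_sites, n_states), dtype=int)
    dp[0] = log_init + log_emis[0]
    for t in range(1, n_sites):
        scores = dp[t - 1][:, None] + log_trans    # [from_state, to_state]
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emis[t]
    path = np.zeros(n_sites, dtype=int)
    path[-1] = dp[-1].argmax()
    for t in range(n_sites - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Two candidate topologies; transition probabilities from a small global
# recombination probability, as in the model described above.
rec = 0.05
log_trans = np.log([[1 - rec, rec], [rec, 1 - rec]])
log_init = np.log([0.5, 0.5])

# Synthetic mosaic: a recombinant block (topology 1) inside topology 0.
rng = np.random.default_rng(2)
true = np.array([0] * 40 + [1] * 20 + [0] * 40)
log_emis = rng.normal(-1.4, 0.1, size=(100, 2))
log_emis[np.arange(100), true] += 0.8   # true topology fits each site better

path = viterbi(log_emis, log_trans, log_init)
print("inferred breakpoints at sites:", np.flatnonzero(np.diff(path)) + 1)
```

    Changes of state along the decoded path mark putative recombination breakpoints; the small transition probability acts as a penalty that suppresses spurious single-site switches.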

    Risk-conscious correction of batch effects: maximising information extraction from high-throughput genomic datasets

    Contains additional information and discussion on gPCA (Reese et al., 2013). Table S1. Demonstrates the inverse proportionality between the gPCA p-value and the associated ‘delta’ score, reflecting the unadjusted relative magnitude of batch effects (Reese et al., 2013). The table shows the scores for all three datasets. Figure S1. Contains an illustration to further help interpret gPCA p-value vs. preserved data variance plots. (DOCX 60 kb)