183 research outputs found

    Identification of intracellular bacteria from multiple single-cell RNA-seq platforms using CSI-Microbes

    Get PDF
    The study of the tumor microbiome has been garnering increased attention. We developed a computational pipeline (CSI-Microbes) for identifying microbial reads from single-cell RNA sequencing (scRNA-seq) data and for analyzing differential abundance of taxa. Using a series of controlled experiments and analyses, we performed the first systematic evaluation of the efficacy of recovering microbial unique molecular identifiers by multiple scRNA-seq technologies, which identified the newer 10x chemistries (3' v3 and 5') as the best suited approach. We analyzed patient esophageal and colorectal carcinomas and found that reads from distinct genera tend to co-occur in the same host cells, testifying to possible intracellular polymicrobial interactions. Microbial reads are disproportionately abundant within myeloid cells that up-regulate proinflammatory cytokines like IL1Β and CXCL8, while infected tumor cells up-regulate antigen processing and presentation pathways. These results show that myeloid cells with bacteria engulfed are a major source of bacterial RNA within the tumor microenvironment (TME) and may inflame the TME and influence immunotherapy response

    A self-organized model for cell-differentiation based on variations of molecular decay rates

    Get PDF
    Systemic properties of living cells are the result of molecular dynamics governed by so-called genetic regulatory networks (GRN). These networks capture all possible features of cells and are responsible for the immense levels of adaptation characteristic to living systems. At any point in time only small subsets of these networks are active. Any active subset of the GRN leads to the expression of particular sets of molecules (expression modes). The subsets of active networks change over time, leading to the observed complex dynamics of expression patterns. Understanding of this dynamics becomes increasingly important in systems biology and medicine. While the importance of transcription rates and catalytic interactions has been widely recognized in modeling genetic regulatory systems, the understanding of the role of degradation of biochemical agents (mRNA, protein) in regulatory dynamics remains limited. Recent experimental data suggests that there exists a functional relation between mRNA and protein decay rates and expression modes. In this paper we propose a model for the dynamics of successions of sequences of active subnetworks of the GRN. The model is able to reproduce key characteristics of molecular dynamics, including homeostasis, multi-stability, periodic dynamics, alternating activity, differentiability, and self-organized critical dynamics. Moreover the model allows to naturally understand the mechanism behind the relation between decay rates and expression modes. The model explains recent experimental observations that decay-rates (or turnovers) vary between differentiated tissue-classes at a general systemic level and highlights the role of intracellular decay rate control mechanisms in cell differentiation.Comment: 16 pages, 5 figure

    Phylogenetic tree information aids supervised learning for predicting protein-protein interaction based on distance matrices

    Get PDF
    BACKGROUND: Protein-protein interactions are critical for cellular functions. Recently developed computational approaches for predicting protein-protein interactions utilize co-evolutionary information of the interacting partners, e.g., correlations between distance matrices, where each matrix stores the pairwise distances between a protein and its orthologs from a group of reference genomes. RESULTS: We proposed a novel, simple method to account for some of the intra-matrix correlations in improving the prediction accuracy. Specifically, the phylogenetic species tree of the reference genomes is used as a guide tree for hierarchical clustering of the orthologous proteins. The distances between these clusters, derived from the original pairwise distance matrix using the Neighbor Joining algorithm, form intermediate distance matrices, which are then transformed and concatenated into a super phylogenetic vector. A support vector machine is trained and tested on pairs of proteins, represented as super phylogenetic vectors, whose interactions are known. The performance, measured as ROC score in cross validation experiments, shows significant improvement of our method (ROC score 0.8446) over that of using Pearson correlations (0.6587). CONCLUSION: We have shown that the phylogenetic tree can be used as a guide to extract intra-matrix correlations in the distance matrices of orthologous proteins, where these correlations are represented as intermediate distance matrices of the ancestral orthologous proteins. Both the unsupervised and supervised learning paradigms benefit from the explicit inclusion of these intermediate distance matrices, and particularly so in the latter case, which offers a better balance between sensitivity and specificity in the prediction of protein-protein interactions

    Effect of promoter architecture on the cell-to-cell variability in gene expression

    Get PDF
    According to recent experimental evidence, the architecture of a promoter, defined as the number, strength and regulatory role of the operators that control the promoter, plays a major role in determining the level of cell-to-cell variability in gene expression. These quantitative experiments call for a corresponding modeling effort that addresses the question of how changes in promoter architecture affect noise in gene expression in a systematic rather than case-by-case fashion. In this article, we make such a systematic investigation, based on a simple microscopic model of gene regulation that incorporates stochastic effects. In particular, we show how operator strength and operator multiplicity affect this variability. We examine different modes of transcription factor binding to complex promoters (cooperative, independent, simultaneous) and how each of these affects the level of variability in transcription product from cell-to-cell. We propose that direct comparison between in vivo single-cell experiments and theoretical predictions for the moments of the probability distribution of mRNA number per cell can discriminate between different kinetic models of gene regulation.Comment: 35 pages, 6 figures, Submitte

    Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

    Get PDF
    BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms

    Accelerated Profile HMM Searches

    Get PDF
    Profile hidden Markov models (profile HMMs) and probabilistic inference methods have made important contributions to the theory of sequence database homology search. However, practical use of profile HMM methods has been hindered by the computational expense of existing software implementations. Here I describe an acceleration heuristic for profile HMMs, the “multiple segment Viterbi” (MSV) algorithm. The MSV algorithm computes an optimal sum of multiple ungapped local alignment segments using a striped vector-parallel approach previously described for fast Smith/Waterman alignment. MSV scores follow the same statistical distribution as gapped optimal local alignment scores, allowing rapid evaluation of significance of an MSV score and thus facilitating its use as a heuristic filter. I also describe a 20-fold acceleration of the standard profile HMM Forward/Backward algorithms using a method I call “sparse rescaling”. These methods are assembled in a pipeline in which high-scoring MSV hits are passed on for reanalysis with the full HMM Forward/Backward algorithm. This accelerated pipeline is implemented in the freely available HMMER3 software package. Performance benchmarks show that the use of the heuristic MSV filter sacrifices negligible sensitivity compared to unaccelerated profile HMM searches. HMMER3 is substantially more sensitive and 100- to 1000-fold faster than HMMER2. HMMER3 is now about as fast as BLAST for protein searches

    Detection of Heteroplasmic Mitochondrial DNA in Single Mitochondria

    Get PDF
    BACKGROUND: Mitochondrial DNA (mtDNA) genome mutations can lead to energy and respiratory-related disorders like myoclonic epilepsy with ragged red fiber disease (MERRF), mitochondrial myopathy, encephalopathy, lactic acidosis and stroke (MELAS) syndrome, and Leber's hereditary optic neuropathy (LHON). It is not well understood what effect the distribution of mutated mtDNA throughout the mitochondrial matrix has on the development of mitochondrial-based disorders. Insight into this complex sub-cellular heterogeneity may further our understanding of the development of mitochondria-related diseases. METHODOLOGY: This work describes a method for isolating individual mitochondria from single cells and performing molecular analysis on that single mitochondrion's DNA. An optical tweezer extracts a single mitochondrion from a lysed human HL-60 cell. Then a micron-sized femtopipette tip captures the mitochondrion for subsequent analysis. Multiple rounds of conventional DNA amplification and standard sequencing methods enable the detection of a heteroplasmic mixture in the mtDNA from a single mitochondrion. SIGNIFICANCE: Molecular analysis of mtDNA from the individually extracted mitochondrion demonstrates that a heteroplasmy is present in single mitochondria at various ratios consistent with the 50/50 heteroplasmy ratio found in single cells that contain multiple mitochondria

    Identifying Cognate Binding Pairs among a Large Set of Paralogs: The Case of PE/PPE Proteins of Mycobacterium tuberculosis

    Get PDF
    We consider the problem of how to detect cognate pairs of proteins that bind when each belongs to a large family of paralogs. To illustrate the problem, we have undertaken a genomewide analysis of interactions of members of the PE and PPE protein families of Mycobacterium tuberculosis. Our computational method uses structural information, operon organization, and protein coevolution to infer the interaction of PE and PPE proteins. Some 289 PE/PPE complexes were predicted out of a possible 5,590 PE/PPE pairs genomewide. Thirty-five of these predicted complexes were also found to have correlated mRNA expression, providing additional evidence for these interactions. We show that our method is applicable to other protein families, by analyzing interactions of the Esx family of proteins. Our resulting set of predictions is a starting point for genomewide experimental interaction screens of the PE and PPE families, and our method may be generally useful for detecting interactions of proteins within families having many paralogs

    Enchytraeus albidus Microarray: Enrichment, Design, Annotation and Database (EnchyBASE)

    Get PDF
    Enchytraeus albidus (Oligochaeta) is an ecologically relevant species used as standard test organisms for risk assessment. Effects of stressors in this species are commonly determined at the population level using reproduction and survival as endpoints. The assessment of transcriptomic responses can be very useful e.g. to understand underlying mechanisms of toxicity with gene expression fingerprinting. In the present paper the following is being addressed: 1) development of suppressive subtractive hybridization (SSH) libraries enriched for differentially expressed genes after metal and pesticide exposures; 2) sequencing and characterization of all generated cDNA inserts; 3) development of a publicly available genomic database on E. albidus. A total of 2100 Expressed Sequence Tags (ESTs) were isolated, sequenced and assembled into 1124 clusters (947 singletons and 177 contigs). From these sequences, 41% matched known proteins in GenBank (BLASTX, e-value≤10-5) and 37% had at least one Gene Ontology (GO) term assigned. In total, 5.5% of the sequences were assigned to a metabolic pathway, based on KEGG. With this new sequencing information, an Agilent custom oligonucleotide microarray was designed, representing a potential tool for transcriptomic studies. EnchyBASE (http://bioinformatics.ua.pt/enchybase/) was developed as a web freely available database containing genomic information on E. albidus and will be further extended in the near future for other enchytraeid species. The database so far includes all ESTs generated for E. albidus from three cDNA libraries. This information can be downloaded and applied in functional genomics and transcription studies

    Discovering functional linkages and uncharacterized cellular pathways using phylogenetic profile comparisons: a comprehensive assessment

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>A widely-used approach for discovering functional and physical interactions among proteins involves phylogenetic profile comparisons (PPCs). Here, proteins with similar profiles are inferred to be functionally related under the assumption that proteins involved in the same metabolic pathway or cellular system are likely to have been co-inherited during evolution.</p> <p>Results</p> <p>Our experimentation with <it>E. coli </it>and yeast proteins with 16 different carefully composed reference sets of genomes revealed that the phyletic patterns of proteins in prokaryotes alone could be adequate enough to make reasonably accurate functional linkage predictions. A slight improvement in performance is observed on adding few eukaryotes into the reference set, but a noticeable drop-off in performance is observed with increased number of eukaryotes. Inclusion of most parasitic, pathogenic or vertebrate genomes and multiple strains of the same species into the reference set do not necessarily contribute to an improved sensitivity or accuracy. Interestingly, we also found that evolutionary histories of individual pathways have a significant affect on the performance of the PPC approach with respect to a particular reference set. For example, to accurately predict functional links in carbohydrate or lipid metabolism, a reference set solely composed of prokaryotic (or bacterial) genomes performed among the best compared to one composed of genomes from all three super-kingdoms; this is in contrast to predicting functional links in translation for which a reference set composed of prokaryotic (or bacterial) genomes performed the worst. We also demonstrate that the widely used random null model to quantify the statistical significance of profile similarity is incomplete, which could result in an increased number of false-positives.</p> <p>Conclusion</p> <p>Contrary to previous proposals, it is not merely the number of genomes but a careful selection of informative genomes in the reference set that influences the prediction accuracy of the PPC approach. We note that the predictive power of the PPC approach, especially in eukaryotes, is heavily influenced by the primary endosymbiosis and subsequent bacterial contributions. The over-representation of parasitic unicellular eukaryotes and vertebrates additionally make eukaryotes less useful in the reference sets. Reference sets composed of highly non-redundant set of genomes from all three super-kingdoms fare better with pathways showing considerable vertical inheritance and strong conservation (e.g. translation apparatus), while reference sets solely composed of prokaryotic genomes fare better for more variable pathways like carbohydrate metabolism. Differential performance of the PPC approach on various pathways, and a weak positive correlation between functional and profile similarities suggest that caution should be exercised while interpreting functional linkages inferred from genome-wide large-scale profile comparisons using a single reference set.</p
    corecore