The power of phylogenetic approaches to detect horizontally transferred genes
BACKGROUND: Horizontal gene transfer plays an important role in evolution because it sometimes allows recipient lineages to adapt to new ecological niches. High gene transfer frequencies were inferred for prokaryotic and early eukaryotic evolution. Does horizontal gene transfer also impact phylogenetic reconstruction of the evolutionary history of genomes and organisms? The answer to this question depends at least in part on the actual gene transfer frequencies and on the ability to weed out transferred genes from further analyses. Are the detected transfers mainly false positives, or are they the tip of an iceberg of many transfer events, most of which go undetected by current methods? RESULTS: Phylogenetic detection methods appear to be the method of choice to infer gene transfers, especially for ancient transfers and those followed by orthologous replacement. Here we explore how well some of these methods perform using in silico transfers between the terminal branches of a gamma proteobacterial, genome-based phylogeny. For the experiments performed here, on average the AU test at a 5% significance level detects 90.3% of the transfers and 91% of the exchanges as significant. Using the Robinson-Foulds distance, only 57.7% of the exchanges and 60% of the donations were identified as significant. Analyses using bipartition spectra appeared most successful in our test case. The power of detection was on average 97% using a 70% cut-off and 94.2% with a 90% cut-off for identifying conflicting bipartitions, while the rate of false positives was below 4.2% and 2.1% for the two cut-offs, respectively. For all methods the detection rates improved when more intervening branches separated donor and recipient. CONCLUSION: Rates of detected transfers should not be mistaken for the actual transfer rates; most analyses of gene transfers remain anecdotal.
The method and significance level to identify potential gene transfer events represent a trade-off between the frequency of erroneous identification (false positives) and the power to detect actual transfer events
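The Robinson-Foulds comparison mentioned in this abstract can be illustrated with a minimal sketch. It assumes trees are given as nested Python tuples of taxon names and treats them as unrooted; the function names and the six-taxon example are hypothetical and are not taken from the study, which does not describe its implementation here.

```python
def bipartitions(tree):
    """Return the non-trivial bipartitions (splits) of a tree given as nested tuples."""
    def leaf_set(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    all_taxa = leaf_set(tree)
    reference = min(all_taxa)  # each split is stored as the side WITHOUT this taxon
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if reference in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_taxa) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Count the bipartitions found in exactly one of the two trees."""
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


# Hypothetical reference tree and a gene tree in which taxon B groups with E,
# as it might after a simulated transfer.
species_tree = ((("A", "B"), "C"), ("D", ("E", "F")))
gene_tree = (("A", "C"), ("D", (("E", "B"), "F")))
distance = robinson_foulds(species_tree, gene_tree)
```

A distance of zero means the two trees share all bipartitions; a bipartition-spectrum analysis would additionally weight each conflicting split by its support value, which this sketch does not model.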
Impact of constitutional copy number variants on biological pathway evolution
Background: Inherited Copy Number Variants (CNVs) can modulate the expression levels of individual genes. However, little is known about how CNVs alter biological pathways and how this varies across different populations. To trace potential evolutionary changes of well-described biological pathways, we jointly queried the genomes and the transcriptomes of a collection of individuals with Caucasian, Asian or Yoruban descent, combining high-resolution array and sequencing data. Results: We implemented an enrichment analysis of pathways accounting for CNVs and gene sizes and detected significant enrichment not only in signal transduction and extracellular biological processes, but also in metabolism pathways. Upon the estimation of CNV population differentiation (CNVs with different polymorphism frequencies across populations), we estimated that 22% of the pathways contain at least one gene that is proximal to a CNV (CNV-gene pair) that shows significant population differentiation. The majority of these CNV-gene pairs belong to signal transduction pathways, and 6% of the CNV-gene pairs show statistical association between the copy number states and the transcript levels. Conclusions: The analysis suggested possible examples of positive selection within individual populations including NF-kB, MAPK signaling pathways, and Alu/L1 retrotransposition factors. Altogether, our results suggest that constitutional CNVs may modulate subtle pathway changes through specific pathway enzymes, which may become fixed in some populations
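As a rough illustration of what a pathway enrichment test computes, the sketch below uses a plain hypergeometric upper-tail probability. This is a simplification and an assumption: the abstract states that the authors' analysis also accounts for CNV and gene sizes, which this sketch does not, and the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, pathway_genes, cnv_genes, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of pathway genes among `cnv_genes` genes drawn at random,
    without replacement, from `total_genes` genes.
    """
    max_overlap = min(pathway_genes, cnv_genes)
    denominator = comb(total_genes, cnv_genes)
    tail = sum(
        comb(pathway_genes, k) * comb(total_genes - pathway_genes, cnv_genes - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / denominator


# Hypothetical counts: 20,000 genes, a 100-gene pathway, 500 CNV-proximal
# genes, 10 of which fall in the pathway (about 2.5 expected by chance).
p_value = enrichment_p_value(20000, 100, 500, 10)
```

Because longer genes are more likely to overlap a CNV by chance, a size-aware analysis would replace the uniform draw assumed here with one weighted by gene length.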
Unsupervised domain adaptation methods for cross-species transfer of regulatory code signals
Due to advances in NGS technologies, whole-genome maps of various functional genomic elements have been generated for a dozen species; however, experiments are still expensive and are not available for many species of interest. Deep learning methods have become the state-of-the-art computational methods to analyze the available data, but the focus is often only on the species studied. Here we take advantage of progress in Transfer Learning in the area of Unsupervised Domain Adaptation (UDA) and test nine UDA methods for prediction of regulatory code signals for genomes of other species. We tested each deep learning implementation by training the model on experimental data from one species, then refining the model using the genome sequence of the target species for which we wanted to make predictions. Among the nine tested domain adaptation architectures, the non-adversarial methods Minimum Class Confusion (MCC) and Deep Adaptation Network (DAN) significantly outperformed the others. Conditional Domain Adversarial Network (CDAN) was the third best architecture. Here we provide an empirical assessment of each approach using real-world data. The different approaches were tested on ChIP-seq data for transcription factor binding sites and histone marks on human and mouse genomes, but are generalizable to any cross-species transfer of interest. We tested the efficiency of each method using species where experimental data was available for both. The results allow us to assess how well each implementation will work for species for which only limited experimental data is available and will inform the design of future experiments in these understudied organisms. Overall, our results support the validity of UDA methods for generation of missing experimental data for histone marks and transcription factor binding sites in various genomes and highlight how robust the various approaches are to data that is incomplete, noisy and susceptible to analytic bias
A Computational Framework Discovers New Copy Number Variants with Functional Importance
Structural variants which cause changes in copy numbers constitute an important component of genomic variability. They account for 0.7% of genomic differences in two individual genomes, of which copy number variants (CNVs) are the largest component. A recent population-based CNV study revealed the need for better characterization of CNVs, especially the small ones (<500 bp). We propose a three-step computational framework (Identification of germline Changes in Copy Number or IgC2N) to discover and genotype germline CNVs. First, we detect candidate CNV loci by combining information across multiple samples without imposing restrictions on the number of coverage markers or on the variant size. Second, we fine-tune the detection of rare variants and infer the putative copy number classes for each locus. Last, for each variant we combine the relative distance between consecutive copy number classes with genetic information in a novel attempt to estimate the reference model bias. This computational approach is applied to genome-wide data from 1250 HapMap individuals. Novel variants were discovered and characterized in terms of size, minor allele frequency, type of polymorphism (gains, losses or both), and mechanism of formation. Using data generated for a subset of individuals by a 42 million marker platform, we validated the majority of the variants; the highest validation rate (66.7%) was for variants larger than 1 kb. Finally, we queried transcriptomic data from 129 individuals determined by RNA-sequencing as further validation and to assess the functional role of the new variants. We investigated the possible enrichment for variants' regulatory effect and found that smaller variants (<1 kb) are more likely to regulate gene transcripts than larger variants (p-value = 2.04e-08).
Our results support the validity of the computational framework to detect novel variants relevant to disease susceptibility studies and provide evidence of the importance of genetic variants in regulatory network studies
BranchClust: a phylogenetic algorithm for selecting gene families
BACKGROUND: Automated methods for assembling families of orthologous genes include those based on sequence similarity scores and those based on phylogenetic approaches. The former are easy to automate, but they usually do not distinguish between paralogs and orthologs or are restricted in the number of taxa. Phylogenetic methods are often based on reconciliation of a gene tree with a known rooted species tree; a limitation of this approach, especially in the case of prokaryotes, is that the species tree is often unknown, and that the branching order between related organisms frequently is unresolved in analyses of single gene families. RESULTS: Here we describe an algorithm for the automated selection of orthologous genes that recognizes orthologous genes from different species in a phylogenetic tree for any number of taxa. The algorithm is capable of distinguishing complete (containing all taxa) and incomplete (not containing all taxa) families and recognizes in- and outparalogs. The BranchClust algorithm is implemented in Perl with the use of the BioPerl module for parsing trees and is freely available at . CONCLUSION: BranchClust outperforms the Reciprocal Best Blast hit method in selecting more sets of putatively orthologous genes. In the test cases examined, the correctness of the selected families and of the identified in- and outparalogs was confirmed by inspection of the pertinent phylogenetic trees
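For context, the Reciprocal Best Blast hit baseline that BranchClust is compared against can be sketched as follows. This is a minimal illustration and not the BranchClust algorithm itself; it assumes similarity scores are already available as dictionaries keyed by (query, subject) pairs, and the gene names and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query gene to its single highest-scoring subject gene."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores for two genomes with two genes each. Gene a2 is a
# paralog whose best hit, b1, already pairs with a1.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 240, ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 250, ("b1", "a2"): 120, ("b2", "a1"): 240, ("b2", "a2"): 90}
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

In this toy case only one pair survives the reciprocity filter and the genes a2 and b2 are left unassigned, which illustrates why a pairwise score-based method can recover fewer families than a tree-based one.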
Comprehensive analysis of cancer breakpoints reveals signatures of genetic and epigenetic contribution to cancer genome rearrangements.
Understanding mechanisms of cancer breakpoint mutagenesis is a difficult task, and predictive models of cancer breakpoint formation have so far failed to achieve even moderate predictive power. Here we take advantage of a machine learning approach that can gather important features from big data and quantify the contribution of different factors. We performed a comprehensive analysis of almost 630,000 cancer breakpoints and quantified the contribution of genomic and epigenomic features: non-B DNA structures, chromatin organization, transcription factor binding sites and epigenetic markers. The results showed that transcription and formation of non-B DNA structures are two major processes responsible for cancer genome fragility. Epigenetic factors, such as chromatin organization in TADs, open/closed regions, DNA methylation and histone marks, are less informative but do make a contribution. As a general trend, individual features inside the groups show a relatively high contribution of G-quadruplexes and repeats and of the CTCF, GABPA, RXRA, SP1, MAX and NR2F2 transcription factors. Overall, the cancer breakpoint landscape can be represented by well-predicted hotspots and poorly predicted individual breakpoints scattered across genomes. We demonstrated that hotspot mutagenesis has genomic and epigenomic factors, and that not all individual cancer breakpoints are just random noise but have a definite mutation signature. In addition, we found a long-range action of some features on breakpoint mutagenesis. By combining omics data and cancer-specific individual feature importance, and by adding distant features to local ones, predictive models for cancer breakpoint formation achieved 70-90% ROC AUC for different cancer types; however, precision remained low at 2% and recall did not exceed 50%.
On the one hand, the power of the models strongly correlates with the size of available cancer breakpoint and epigenomic data; on the other hand, finding strong determinants of cancer breakpoint formation remains a challenge. The strength of predictive signals of each group and of each feature inside a group can be converted into cancer-specific breakpoint mutation signatures. Overall, our results add to the understanding of cancer genome rearrangement processes
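The combination of a high ROC AUC with low precision reported in this abstract is typical when positives are rare. The sketch below shows the arithmetic, assuming a classifier described by its recall and false positive rate at one threshold; the prevalence and error rates are hypothetical values chosen for illustration and are not taken from the study.

```python
def precision_from_rates(prevalence, recall, false_positive_rate):
    """Precision implied by class prevalence, recall and false positive rate."""
    true_positives = recall * prevalence
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Hypothetical setting: 1 in 1,000 genomic windows contains a breakpoint,
# half of them are recovered, and 2.5% of the negative windows are flagged.
precision = precision_from_rates(prevalence=0.001, recall=0.5, false_positive_rate=0.025)
```

Under these assumed rates the precision comes out at roughly 2%, even though a 2.5% false positive rate looks small: negatives outnumber positives so heavily that false alarms dominate the flagged set.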
Additional file 8: of Conserved 3′ UTR stem-loop structure in L1 and Alu transposons in human genome: possible role in retrotransposition
Coordinates of the 3′-end stem-loop for consensus sequences for 213 Alu subfamilies. (CSV 28 kb)
Mono a Mano: ZBP1’s Love–Hate Relationship with the Kissing Virus
Z-DNA binding protein (ZBP1) very much represents the nuclear option. By initiating inflammatory cell death (ICD), ZBP1 activates host defenses to destroy infectious threats. ZBP1 is also able to induce noninflammatory regulated cell death via apoptosis (RCD). ZBP1 senses the presence of left-handed Z-DNA and Z-RNA (ZNA), including that formed by expression of endogenous retroelements. Viruses such as the Epstein–Barr “kissing virus” inhibit ICD, RCD and other cell death signaling pathways to produce persistent infection. EBV undergoes lytic replication in plasma cells, which maintain detectable levels of basal ZBP1 expression, leading us to suggest a new role for ZBP1 in maintaining EBV latency, one of benefit for both host and virus. We provide an overview of the pathways that are involved in establishing latent infection, including those regulated by MYC and NF-κB. We describe and provide a synthesis of the evidence supporting a role for ZNA in these pathways, highlighting the positive and negative selection of ZNA forming sequences in the EBV genome that underscores the coadaptation of host and virus. Instead of a fight to the death, a state of détente now exists where persistent infection by the virus is tolerated by the host, while disease outcomes such as death, autoimmunity and cancer are minimized. Based on these new insights, we propose actionable therapeutic approaches to unhost EBV
Additional file 4: of Conserved 3′ UTR stem-loop structure in L1 and Alu transposons in human genome: possible role in retrotransposition
Stem-loop profiles for active and highly conserved L1 transposons: (A) set of 6 hottest L1 transposons reported as active in (Brouha, Schustak et al. [3]); (B) set of 33 active L1 transposons reported as active in (Brouha, Schustak et al. [3]); (C) set of 6622 highly conserved L1 transposons (see Methods for selection criteria). (PDF 221 kb)
Additional file 5: of Conserved 3′ UTR stem-loop structure in L1 and Alu transposons in human genome: possible role in retrotransposition
Stem-loop profiles for 5′UTR regions of groups of L1 subfamilies having one type of 5′UTR, as proposed in (Khan, Smit et al. [53]). (PDF 703 kb)