    Investigating selection on viruses: a statistical alignment approach

    Background: Two problems complicate the study of selection in viral genomes: Firstly, the presence of genes in overlapping reading frames implies that selection in one reading frame can bias our estimates of neutral mutation rates in another reading frame. Secondly, the high mutation rates we are likely to encounter complicate the inference of a reliable alignment of genomes. To address these issues, we develop a model that explicitly models selection in overlapping reading frames. We then integrate this model into a statistical alignment framework, enabling us to estimate selection while explicitly dealing with the uncertainty of individual alignments. We show that in this way we obtain un-biased selection parameters for different genomic regions of interest, and can improve in accuracy compared to using a fixed alignment.
    Results: We run a series of simulation studies to gauge how well we do in selection estimation, especially in comparison to the use of a fixed alignment. We show that the standard practice of using a ClustalW alignment can lead to considerable biases and that estimation accuracy increases substantially when explicitly integrating over the uncertainty in inferred alignments. We even manage to compete favourably for general evolutionary distances with an alignment produced by GenAl. We subsequently run our method on HIV2 and Hepatitis B sequences.
    Conclusion: We propose that marginalizing over all alignments, as opposed to using a fixed one, should be considered in any parametric inference from divergent sequence data for which the alignments are not known with certainty. Moreover, we discover in HIV2 that double coding regions appear to be under less stringent selection than single coding ones. Additionally, there appears to be evidence for differential selection, where one overlapping reading frame is under positive and the other under negative selection.

    Informative regions in viral genomes

    Viruses, far from being just parasites affecting hosts' fitness, are major players in any microbial ecosystem. In spite of their broad abundance, viruses, in particular bacteriophages, remain largely unknown since only about 20% of sequences obtained from viral community DNA surveys could be annotated by comparison with public databases. In order to shed some light on this genetic dark matter we expanded the search of orthologous groups as potential markers to viral taxonomy from bacteriophages and included eukaryotic viruses, establishing a set of 31,150 ViPhOGs (Eukaryotic Viruses and Phages Orthologous Groups). To do this, we examined the non-redundant viral diversity stored in public databases, predicted proteins in genomes lacking such information, and used all annotated and predicted proteins to identify potential protein domains. The clustering of domains and unannotated regions into orthologous groups was done using cogSoft. Finally, we employed a random forest implementation to classify genomes into their taxonomy and found that the presence or absence of ViPhOGs is significantly associated with their taxonomy. Furthermore, we established a set of 1457 ViPhOGs that, given their importance for the classification, could be considered as markers or signatures for the different taxonomic groups defined by the ICTV at the order, family, and genus levels.

    A Novel Bioinformatic Approach to Understanding Addiction

    Finding the genetic markers that influence complex, multigenic substance addiction phenotypes has been an area of significant medical study. Understanding complex disease traits like addiction has been hampered by the lack of functional insights into novel variants in the human genome. We hypothesized that gene location plays a role in functional genomic neighborhoods. To test whether there is a relationship between opiate, dopamine, and GABA disease and population allele frequencies, we used genes obtained from addiction literature curated by the National Center for Biotechnology Information (NCBI). These addiction- and metabolism-focused search terms generated opiate, dopamine, and GABA addiction results (N=587 genes). These genes were then projected onto the genome to identify cluster regions of genetic importance for substance addiction. Clusters were defined as regions of the genome with more than six genes within a 1.5Mb linear genomic window. We identified seven hotspots located on chromosomes 4, 6 (2 clusters), 10, 11, and 19. Human polymorphism data was surveyed from the 1148 individuals comprising the 11 sample populations of the HapMap Project dataset. When human populations were assessed, our analyses identified ten candidate addiction alleles. Finally, assessments of public genome wide association studies show long range linkages to canonical addiction genes. This study delineates a novel method to identify candidate addiction variants using a systems biology approach that relies on an interdisciplinary set of data, including genomic, pathway data, and population variation. Important connections to sociological and environmental data are discussed to contextualize addiction data.
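
    The cluster definition in this abstract (more than six genes within a 1.5Mb linear window) can be illustrated with a small sliding-window sketch. This is not the authors' pipeline: the function name `find_clusters`, the use of a single coordinate per gene, and the merging of overlapping qualifying windows are all assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_BP = 1_500_000  # 1.5 Mb window, as stated in the abstract
MIN_GENES = 7          # "more than six genes"


def find_clusters(genes, window_bp=WINDOW_BP, min_genes=MIN_GENES):
    """Return (chromosome, start, end, n_genes) for each gene cluster.

    genes: iterable of (chromosome, position_bp) pairs. Representing each
    gene by one coordinate is an assumption of this sketch.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in genes:
        by_chrom[chrom].append(pos)

    clusters = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        left = 0
        current = None  # [first_index, last_index] of the cluster being grown
        for right in range(len(positions)):
            # Shrink the window until it spans at most window_bp.
            while positions[right] - positions[left] > window_bp:
                left += 1
            if right - left + 1 >= min_genes:
                if current is not None and left <= current[1]:
                    current[1] = right  # overlaps the previous window: extend
                else:
                    if current is not None:
                        s, e = current
                        clusters.append((chrom, positions[s], positions[e], e - s + 1))
                    current = [left, right]
        if current is not None:
            s, e = current
            clusters.append((chrom, positions[s], positions[e], e - s + 1))
    return clusters


# Hypothetical toy input: seven genes packed within 0.6 Mb on chr4,
# one distant chr4 gene, and only six nearby genes on chr6.
toy_genes = [("chr4", i * 100_000) for i in range(7)]
toy_genes.append(("chr4", 50_000_000))
toy_genes += [("chr6", i * 100_000) for i in range(6)]
print(find_clusters(toy_genes))
```

    Because overlapping qualifying windows are merged, a reported cluster may span more than 1.5Mb; whether the original study merged windows this way is not stated in the abstract.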

    Is There a Twelfth Protein-Coding Gene in the Genome of Influenza A? A Selection-Based Approach to the Detection of Overlapping Genes in Closely Related Sequences

    Protein-coding genes often contain long overlapping open-reading frames (ORFs), which may or may not be functional. Current methods that utilize the signature of purifying selection to detect functional overlapping genes are limited to the analysis of sequences from divergent species, thus rendering them inapplicable to genes found only in closely related sequences. Here, we present a method for the detection of selection signatures on overlapping reading frames by using closely related sequences, and apply the method to several known overlapping genes, and to an overlapping ORF on the negative strand of segment 8 of influenza A virus (NEG8), for which the suggestion has been made that it is functional. We find no evidence that NEG8 is under selection, suggesting that the intact reading frame might be non-functional, although we cannot fully exclude the possibility that the method is not sensitive enough to detect the signature of selection acting on this gene. We present the limitations of the method using known overlapping genes and suggest several approaches to improve it in future studies. Finally, we examine alternative explanations for the sequence conservation of NEG8 in the absence of selection. We show that overlap type and genomic context affect the conservation of intact overlapping ORFs and should therefore be considered in any attempt to estimate the signature of selection in overlapping genes.

    Computational methods for inferring location and genealogy of overlapping genes in virus genomes: approaches and applications

    Viruses may evolve to increase the amount of encoded genetic information by means of overlapping genes, which utilize several reading frames. Such overlapping genes may be especially impactful for genomes of small size, often serving as a source of novel accessory proteins, some of which play a crucial role in viral pathogenicity or in promoting the systemic spread of the virus. Diverse genome-based metrics were proposed to facilitate recognition of overlapping genes that otherwise may be overlooked during genome annotation. They can detect the atypical codon bias associated with the overlap (e.g. a statistically significant reduction in variability at synonymous sites) or other sequence-composition features peculiar to overlapping genes. In this review, I compare nine computational methods, discuss their strengths and limitations, and survey how they were applied to detect candidate overlapping genes in the genome of SARS-CoV-2, the etiological agent of the COVID-19 pandemic.

    Computational tools for viral metagenomics and their application in clinical research

    There are 100 times more virions than eukaryotic cells in a healthy human body. The characterization of human-associated viral communities in a non-pathological state and the detection of viral pathogens in cases of infection are essential for medical care and epidemic surveillance. Viral metagenomics, the sequence-based analysis of the complete collection of viral genomes directly isolated from an organism or an ecosystem, bypasses the “single-organism-level” point of view of clinical diagnostics and thus the need to isolate and culture the targeted organism. The first part of this review is dedicated to a presentation of past research in viral metagenomics with an emphasis on human-associated viral communities (eukaryotic viruses and bacteriophages). In the second part, we review more precisely the computational challenges posed by the analysis of viral metagenomes, and we illustrate the problem of sequences that do not have homologs in public databases and the possible approaches to characterize them.

    Integrating Human Population Genetics And Genomics To Elucidate The Etiology Of Brain Disorders

    Brain disorders present a significant burden on affected individuals, their families and society at large. Existing diagnostic tests suffer from a lack of genetic biomarkers, particularly for substance use disorders, such as alcohol dependence (AD). Numerous studies have demonstrated that AD has a genetic heritability of 40-60%. The existing genetics literature of AD has primarily focused on linkage analyses in small family cohorts and more recently on genome-wide association analyses (GWAS) in large case-control cohorts, fueled by rapid advances in next generation sequencing (NGS). Numerous AD-associated genomic variations are present at a common frequency in the general population, making these variants of public health significance. However, known AD-associated variants explain only a fraction of the expected heritability. In this dissertation, we demonstrate that systems biology applications that integrate evolutionary genomics, rare variants and structural variation can dissect the genetic architecture of AD and elucidate its heritability. We identified several complex human diseases, including AD and other brain disorders, as potential targets of natural selection forces in diverse world populations. Further evidence of natural selection forces affecting AD was revealed when we identified an association between eye color, a trait under strong selection, and AD. These findings provide strong support for conducting GWAS on brain disorder phenotypes. However, with the ever-increasing abundance of rare genomic variants and large cohorts of multi-ethnic samples, population stratification becomes a serious confounding factor for GWAS. To address this problem, we designed a novel approach to identify ancestry informative single nucleotide polymorphisms (SNPs) for population stratification adjustment in association analyses. 
    Furthermore, to leverage untyped variants from genotyping arrays – particularly rare variants – for GWAS and meta-analysis through rapid imputation, we designed a tool that converts genotype definitions across various array platforms. To further elucidate the genetic heritability of brain disorders, we designed approaches aimed at identifying Copy Number Variations (CNVs) and viral insertions into the human genome. We conducted the first CNV-based whole genome meta-analysis for AD. We also designed an integrated approach to estimate the sensitivity of NGS-based methods of viral insertion detection. For the first time in the literature, we identified herpesvirus in NGS data from an Alzheimer’s disease brain sample. The work in this dissertation represents a three-faceted advance in our understanding of brain disease etiology: 1) evolutionary genomic insights, 2) novel resources and tools to leverage rare variants, and 3) the discovery of disease-associated structural genomic aberrations. Our findings have broad implications on the genetics of complex human disease and hold promise for delivering clinically useful knowledge and resources.