4,905 research outputs found
A genome-wide study of Hardy–Weinberg equilibrium with next generation sequence data
Statistical tests for Hardy–Weinberg equilibrium have been an important tool for detecting genotyping errors in the past, and remain important in the quality control of next generation sequence data. In this paper, we analyze complete chromosomes of the 1000 genomes project by using exact test procedures for autosomal and X-chromosomal variants. We find that the rate of disequilibrium largely exceeds what might be expected by chance alone for all chromosomes. Observed disequilibrium is, in about 60% of the cases, due to heterozygote excess. We suggest that most excess disequilibrium can be explained by sequencing problems, and hypothesize mechanisms that can explain exceptional heterozygosities. We report higher rates of disequilibrium for the MHC region on chromosome 6, regions flanking centromeres and p-arms of acrocentric chromosomes. We also detected long-range haplotypes and areas with incidental high disequilibrium. We report disequilibrium to be related to read depth, with variants having extreme read depths being more likely to be out of equilibrium. Disequilibrium rates were found to be 11 times higher in segmental duplications and simple tandem repeat regions. The variants with significant disequilibrium are seen to be concentrated in these areas. For next generation sequence data, Hardy–Weinberg disequilibrium seems to be a major indicator for copy number variation. Peer reviewed. Postprint (published version).
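As an illustration of the kind of exact test the abstract refers to, the following is a minimal sketch for a single biallelic autosomal variant. It assumes the standard conditional distribution of the heterozygote count given the allele counts, and sums the probabilities of all outcomes no more likely than the observed one. The function name `hwe_exact_pvalue` and its arguments are hypothetical; the paper's own procedures, including the X-chromosomal test, are not reproduced here.

```python
from math import exp, lgamma, log


def _log_factorial(n):
    # log(n!) via the log-gamma function, to avoid overflow for large counts
    return lgamma(n + 1)


def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Two-sided exact HWE p-value for genotype counts AA, AB, BB (illustrative)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotyped individual is required")
    n_a = 2 * n_aa + n_ab  # count of allele A
    n_b = 2 * n_bb + n_ab  # count of allele B

    def log_prob(het):
        # log P(het heterozygotes | n individuals, n_a copies of allele A)
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * log(2)
                + _log_factorial(n)
                - _log_factorial(hom_a) - _log_factorial(het) - _log_factorial(hom_b)
                + _log_factorial(n_a) + _log_factorial(n_b)
                - _log_factorial(2 * n))

    # The heterozygote count must have the same parity as the allele counts.
    possible_hets = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = {het: exp(log_prob(het)) for het in possible_hets}
    p_observed = probs[n_ab]
    # Sum all outcomes at most as likely as the observed one (small tolerance for rounding).
    p_value = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)
```

A call such as `hwe_exact_pvalue(0, 100, 0)` describes a sample made up entirely of heterozygotes, the extreme form of the heterozygote excess discussed in the abstract, and yields a very small p-value.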
Neutral genomic microevolution of a recently emerged pathogen, Salmonella enterica serovar Agona
Salmonella enterica serovar Agona has caused multiple food-borne outbreaks of gastroenteritis since it was first isolated in
1952. We analyzed the genomes of 73 isolates from global sources, comparing five distinct outbreaks with sporadic
infections as well as food contamination and the environment. Agona consists of three lineages with minimal mutational
diversity: only 846 single nucleotide polymorphisms (SNPs) have accumulated in the non-repetitive, core genome since
Agona evolved in 1932 and subsequently underwent a major population expansion in the 1960s. Homologous
recombination with other serovars of S. enterica imported 42 recombinational tracts (360 kb) in 5/143 nodes within the
genealogy, which resulted in 3,164 additional SNPs. In contrast to this paucity of genetic diversity, Agona is highly diverse
according to pulsed-field gel electrophoresis (PFGE), which is used to assign isolates to outbreaks. PFGE diversity reflects a
highly dynamic accessory genome associated with the gain or loss (indels) of 51 bacteriophages, 10 plasmids, and 6
integrative conjugational elements (ICE/IMEs), but did not correlate uniquely with outbreaks. Unlike the core genome, indels
occurred repeatedly in independent nodes (homoplasies), resulting in inaccurate PFGE genealogies. The accessory genome
contained only a few cargo genes relevant to infection, other than antibiotic resistance. Thus, most of the genetic diversity
within this recently emerged pathogen reflects changes in the accessory genome, or is due to recombination, but these
changes seemed to reflect neutral processes rather than Darwinian selection. Each outbreak was caused by an independent
clade, without universal, outbreak-associated genomic features, and none of the variable genes in the pan-genome seemed
to be associated with an ability to cause outbreaks.
Identifying statistical dependence in genomic sequences via mutual information estimates
Questions of understanding and quantifying the representation and amount of
information in organisms have become a central part of biological research, as
they potentially hold the key to fundamental advances. In this paper, we
demonstrate the use of information-theoretic tools for the task of identifying
segments of biomolecules (DNA or RNA) that are statistically correlated. We
develop a precise and reliable methodology, based on the notion of mutual
information, for finding and extracting statistical as well as structural
dependencies. A simple threshold function is defined, and its use in
quantifying the level of significance of dependencies between biological
segments is explored. These tools are used in two specific applications. First,
we identify correlations between different parts of the maize
zmSRp32 gene, where we find significant dependencies between the 5'
untranslated region in zmSRp32 and its alternatively spliced exons. This
observation may indicate the presence of as-yet unknown alternative splicing
mechanisms or structural scaffolds. Second, using data from the FBI's Combined
DNA Index System (CODIS), we demonstrate that our approach is particularly well
suited for the problem of discovering short tandem repeats, an application of
importance in genetic profiling.
Comment: Preliminary version. Final version in EURASIP Journal on
Bioinformatics and Systems Biology. See http://www.hindawi.com/journals/bsb
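To make the central quantity of this abstract concrete, here is a minimal sketch of a plug-in (empirical frequency) estimate of the mutual information between two equal-length symbol sequences, paired position by position. The function name `mutual_information` is hypothetical, and the authors' estimator and threshold function are not specified in the abstract, so neither is reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired symbols (illustrative)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of symbol pairs (x, y)
    count_x = Counter(xs)         # marginal counts for the first sequence
    count_y = Counter(ys)         # marginal counts for the second sequence
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

Two identical sequences that use the four nucleotides equally often give 2 bits, while sequences whose paired symbols are empirically independent give 0. Plug-in estimates are biased upward on short sequences, which is one reason a significance threshold of the kind the abstract mentions is needed.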
Comparative Analysis of Tandem Repeats from Hundreds of Species Reveals Unique Insights into Centromere Evolution
Centromeres are essential for chromosome segregation, yet their DNA sequences
evolve rapidly. In most animals and plants that have been studied, centromeres
contain megabase-scale arrays of tandem repeats. Despite their importance, very
little is known about the degree to which centromere tandem repeats share
common properties between different species across different phyla. We used
bioinformatic methods to identify high-copy tandem repeats from 282 species
using publicly available genomic sequence and our own data. The assumption that
the most abundant tandem repeat is the centromere DNA was true for most species
whose centromeres have been previously characterized, suggesting this is a
general property of genomes. Our methods are compatible with all current
sequencing technologies. Long Pacific Biosciences sequence reads allowed us to
find tandem repeat monomers up to 1,419 bp. High-copy centromere tandem repeats
were found in almost all animal and plant genomes, but repeat monomers were
highly variable in sequence composition and in length. Furthermore,
phylogenetic analysis of sequence homology showed little evidence of sequence
conservation beyond ~50 million years of divergence. We find that despite an
overall lack of sequence conservation, centromere tandem repeats from diverse
species showed similar modes of evolution, including the appearance of higher
order repeat structures in which several polymorphic monomers make up a larger
repeating unit. While centromere position in most eukaryotes is epigenetically
determined, our results indicate that tandem repeats are highly prevalent at
centromeres of both animals and plants. This suggests a functional role for
such repeats, perhaps in promoting concerted evolution of centromere DNA across
chromosomes.
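For readers unfamiliar with the term, a tandem repeat is a short unit (monomer) copied back to back. The following is a deliberately naive sketch that scans for exact, uninterrupted copies of a unit of fixed length. It is not the method used in the paper above, nor the Tandem Repeats Finder program; real tools tolerate mismatches and indels and search over many unit lengths. The function name `find_tandem_repeats` and its parameters are hypothetical.

```python
def find_tandem_repeats(seq, unit_len, min_copies=3):
    """Return (start, unit, copies) for exact tandem repeats of a fixed unit length."""
    if unit_len <= 0:
        raise ValueError("unit_len must be positive")
    hits = []
    i = 0
    n = len(seq)
    while i + unit_len <= n:
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend while the next window is an exact copy of the unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i = j  # skip past the whole repeat array
        else:
            i += 1
    return hits
```

For example, `find_tandem_repeats("TTGATAGATAGATAGATACC", 4)` reports four copies of the unit `GATA` starting at index 2.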
Systems Biology and the Development of Vaccines and Drugs for Malaria Treatments
The sequencing race has ended and the functional race has already begun. Microarray technology enables
simultaneous gene expression analysis of thousands of genes, providing a snapshot of an organism's
transcriptome at an unprecedented resolution. The close correlation between gene transcription and
function allows the inference of biological processes from the assessed transcriptome profile. Among the
sophisticated analytical problems in microarray technology, at the front and back ends respectively, are the
selection of optimal DNA oligos and the computational analysis of gene expression. In this review paper,
we analyse important methods in use today in customized oligo design. In the course of this work,
we discovered that the oligo design algorithm hung on gene PFA0135w of chromosome 1 while
designing oligos for the gene sequences of Plasmodium falciparum. We do not know the reason for this
yet, as the algorithm runs on other sequences such as those of yeast (Saccharomyces cerevisiae) and Neurospora
crassa. We conclude the paper by highlighting the procedures encompassing the back-end phase and discuss
their application to the development of vaccines and drugs for malaria treatment. Note that malaria is a
cause of significant global morbidity and mortality, with 300-500 million cases annually. Our aims are not
ends in themselves but a means to reiterate the need for experimental biologists to (i) know how to
design their customized oligos and (ii) have some idea about gene expression analysis, and the need for
cooperation between experimental biologists and their counterparts, the computational biologists. This
will help experimental biologists to coordinate the front and back ends of the systems
biology analysis of the whole genome effectively.
Detecting CRISPR Arrays Using Long-Short Term Memory Network
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeat) is a sequence found in the DNA sequence of an organism. It provides immunity to the organism. Recently, it was found that the CRISPR-based immunity mechanism can be manipulated to perform genome editing. The problem is that it is hard to know the specificity of this system and, in turn, making it highly specific is difficult. More research is required to improve CRISPR-based genome editing. Detecting CRISPR arrays in the DNA sequence is the first step towards this research. In this work, a CRISPR array detection pipeline, CRISPRLstm, is proposed. CRISPRLstm leverages the power of artificial intelligence to improve its performance over existing CRISPR array detection programs. Why and how artificial intelligence, or specifically Long Short-Term Memory (LSTM) models, can be used to tackle this problem effectively is explained in this report. The CRISPR arrays detected by CRISPRLstm are in good agreement with other widely used and freely available CRISPR array detection tools. CRISPRLstm is available in the form of a web tool. It visualizes the detected CRISPR arrays in a highly interactive interface with options to view the secondary structure of the repeat and spacer sequences, BLAST them, create sequence logos of repeat sequences, and more.
TRDB—The Tandem Repeats Database
Tandem repeats in DNA have been under intensive study for many years, first, as a consequence of their usefulness as genomic markers and DNA fingerprints and more recently as their role in human disease and regulatory processes has become apparent. The Tandem Repeats Database (TRDB) is a public repository of information on tandem repeats in genomic DNA. It contains a variety of tools for repeat analysis, including the Tandem Repeats Finder program, query and filtering capabilities, repeat clustering, polymorphism prediction, PCR primer selection, data visualization and data download in a variety of formats. In addition, TRDB serves as a centralized research workbench. It provides user storage space and permits collaborators to privately share their data and analysis. TRDB is available at
- …