
    Enabling high-throughput sequencing data analysis with MOSAIK

    Thesis advisor: Gabor T. Marth
    During the last few years, numerous new sequencing technologies have emerged that require tools that can process large amounts of read data quickly and accurately. Regardless of the downstream methods used, reference-guided aligners are at the heart of all next-generation analysis studies. I have developed a general reference-guided aligner, MOSAIK, to support all current sequencing technologies (Roche 454, Illumina, Applied Biosystems SOLiD, Helicos, and Sanger capillary). The calibrated alignment qualities calculated by MOSAIK allow the user to fine-tune the alignment accuracy for a given study. MOSAIK is a highly configurable and easy-to-use suite of alignment tools that is used in hundreds of labs worldwide. MOSAIK is an integral part of our genetic variant discovery pipeline. From SNP and short-INDEL discovery to structural variation discovery, alignment accuracy is an essential requirement that enables our downstream analyses to provide accurate calls. In this thesis, I present three major studies that were formative during the development of MOSAIK and our analysis pipeline. In addition, I present a novel algorithm that identifies mobile element insertions (non-LTR retrotransposons) in the human genome using split-read alignments in MOSAIK. This algorithm has a low false discovery rate (4.4%) and enabled our group to be the first to determine the number of mobile elements that differentially occur between any two individuals.
    Thesis (PhD), Boston College, 2010. Submitted to: Boston College, Graduate School of Arts and Sciences. Discipline: Biology.
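    The split-read idea behind the mobile element caller can be sketched simply: a read spanning an insertion breakpoint fails to align end-to-end to the reference, but its two parts map separately, one to the reference flank and one to a mobile element consensus. Below is a minimal illustration of that logic; the two mapping helpers are hypothetical stand-ins and do not reflect MOSAIK's actual API or alignment machinery.

```python
# Minimal split-read sketch for mobile element insertion detection.
# `map_to_reference` and `map_to_element` are hypothetical callables
# returning an alignment position or None; they stand in for a real
# aligner and are NOT MOSAIK's actual interface.

def find_insertion_candidates(reads, map_to_reference, map_to_element):
    """Yield (read_id, ref_pos, element_hit) where one half of an
    otherwise-unmappable read anchors to the reference and the other
    half matches a mobile element consensus (e.g. Alu, L1)."""
    for read_id, seq in reads:
        if map_to_reference(seq) is not None:
            continue  # read aligns end-to-end; no breakpoint here
        half = len(seq) // 2
        prefix, suffix = seq[:half], seq[half:]
        ref_pos = map_to_reference(prefix)
        element_hit = map_to_element(suffix)
        if ref_pos is not None and element_hit is not None:
            yield read_id, ref_pos, element_hit
```

    A real caller would try multiple split points and both read orientations, and would cluster candidate breakpoints across reads before calling an insertion.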

    Efficient error correction for next-generation sequencing of viral amplicons

    Background: Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.
    Results: In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH) in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
    Conclusions: Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses.
    The implementations of the algorithms and the data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm
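    The k-mer intuition behind KEC-style correction is that true haplotype k-mers recur many times in deep amplicon data, while k-mers containing sequencing errors are rare. A minimal sketch of that idea follows; the choice of k and the count threshold here are illustrative placeholders, not the calibration described in the paper.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def suspect_positions(read, counts, k, threshold):
    """Return read positions covered only by low-frequency k-mers,
    i.e. the likely locations of sequencing errors."""
    low, trusted = set(), set()
    for i in range(len(read) - k + 1):
        span = range(i, i + k)
        if counts[read[i:i + k]] < threshold:
            low.update(span)
        else:
            trusted.update(span)
    return sorted(low - trusted)
```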

    Calculating the structure-based phylogenetic relationship of distantly related homologous proteins utilizing maximum likelihood structural alignment combinatorics and a novel structural molecular clock hypothesis

    A dissertation in Molecular Biology and Biochemistry and Cell Biology and Biophysics. Includes bibliographical references (pages 113-116).
    Dendrograms establish the evolutionary relationships and homology of species, proteins, or genes. Homology modeling, ligand binding, and pharmaceutical testing all depend upon the homology ascertained by dendrograms. Regardless of the specific algorithm, all dendrograms that ascertain protein evolutionary homology are generated from polypeptide sequences. However, because protein structures conserve homology better and contain more biochemical information than their associated protein sequences, I hypothesize that utilizing the structure of a protein instead of its sequence will generate a superior dendrogram. Generating a dendrogram from protein structure requires a unique methodology and novel bioinformatic programs to implement it. Contained within this dissertation is an original methodology that enables the aforementioned structure-based dendrogram generation. Additionally, I have scripted three novel bioinformatics programs required by this methodology: a protein structure alignment program that proficiently superimposes distant homologs, an accurate structure-dependent sequence alignment program, and a dendrogram generation program that employs a novel structural molecular clock hypothesis. The results from this methodology support the proposed hypothesis by demonstrating that dendrograms generated from protein structures are superior to those generated exclusively from protein sequences.
    Contents: Introduction -- Sable: Structural alignment by maximum likelihood -- UNITS: Universal true SDSA (structure-dependent sequence alignment) -- Push: Phylogenetic tree using structural homology -- Push discussion and general conclusion -- Generic sorting algorithm -- Template protein selection -- Units and Chimera SDSAs -- References -- Vita
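    Whatever program produces the pairwise structural distances, the final step of turning a distance matrix into a dendrogram is standard hierarchical clustering. A minimal sketch using average linkage (UPGMA-style) over hypothetical pairwise RMSD values follows; the distances and protein names are placeholders, not output of the programs described above.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical pairwise structural distances (e.g. RMSD in angstroms)
# between four homologous proteins; placeholder values only.
labels = ["protA", "protB", "protC", "protD"]
dist = np.array([
    [0.0, 1.2, 4.5, 4.8],
    [1.2, 0.0, 4.3, 4.6],
    [4.5, 4.3, 0.0, 2.1],
    [4.8, 4.6, 2.1, 0.0],
])

# Average linkage on the condensed distance matrix corresponds to UPGMA.
tree = linkage(squareform(dist), method="average")
dendrogram(tree, labels=labels, no_plot=True)  # set no_plot=False to draw
```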

    Introducing deep learning-based methods into the variant calling analysis pipeline

    Biological interpretation of genetic variation enhances our understanding of normal and pathological phenotypes and may lead to the development of new therapeutics. However, it depends heavily on genomic data analysis, which can be inaccurate due to various sequencing errors and the inconsistencies these errors cause. Modern analysis pipelines already utilize heuristic and statistical techniques, but the rate of falsely identified mutations remains high and varies with the particular sequencing technology, settings, and variant type. Recently, several tools based on deep neural networks have been published. These neural networks are expected to find motifs in the data that were not previously captured. The performance of these novel tools is assessed in terms of precision and recall, as well as computational efficiency. Following established best practices in both variant detection and benchmarking, the discussed tools demonstrate accuracy and computational efficiency that spur further discussion.
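    Benchmarking a variant caller ultimately reduces to intersecting its call set with a truth set and computing precision and recall. A minimal sketch follows; real benchmarks following the GA4GH best practices also normalize variant representation and stratify by genomic region, which is omitted here.

```python
def precision_recall(called, truth):
    """Precision/recall of a variant call set against a truth set.
    Variants are keyed as (chrom, pos, ref, alt); representation
    normalization is assumed to have happened upstream."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

calls = {("chr1", 101, "A", "G"), ("chr1", 250, "C", "T")}
truth = {("chr1", 101, "A", "G"), ("chr2", 77, "G", "A")}
print(precision_recall(calls, truth))  # (0.5, 0.5)
```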

    Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication

    Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of "living fossils." As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes, and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses. Here we use a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers and 5,775 candidate conserved protein coding genes. Comparison to other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications (WGDs) ~300 MYA, followed by extensive chromosome fusion.
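    Two-point linkage analysis underlies any such map: for each marker pair, the recombination fraction theta is estimated from the observed meioses, and support for linkage is expressed as a LOD score, the log10 likelihood ratio against free recombination (theta = 0.5). A minimal phase-known sketch follows; a real map built from 1.1x coverage would also have to model genotype uncertainty, which is omitted here.

```python
import math

def lod_score(recombinants, non_recombinants):
    """Two-point LOD score for phase-known meioses: log10 likelihood
    ratio of the ML recombination fraction against theta = 0.5."""
    n = recombinants + non_recombinants
    if n == 0:
        return 0.0
    theta = recombinants / n  # maximum-likelihood estimate
    log_l = 0.0
    if recombinants:
        log_l += recombinants * math.log10(theta)
    if non_recombinants:
        log_l += non_recombinants * math.log10(1 - theta)
    return log_l - n * math.log10(0.5)

# 2 recombinant and 30 non-recombinant meioses: LOD ~ 6.4,
# well above the conventional threshold of 3 for declaring linkage.
print(round(lod_score(2, 30), 1))
```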

    Alternative applications of whole genome de novo assembly in animal genomics

    Genome sequencing is the process by which the sequence of deoxyribonucleic acid (DNA) residues that comprise the genome, or complete set of genetic material of an organism or individual, is determined. Downstream analysis of genome sequencing data requires that short reads be compiled into contiguous sequences. These methods, called de novo assembly, are based on statistical methods and graph theory. In addition to genome assembly, the research presented in this dissertation demonstrates alternative uses of these methods. Using these novel approaches, de novo assembly algorithms can be utilized to gain insight into commensal and parasitic organisms of livestock, genes containing candidate mutations for genetic defects, and population-level and species-level variation in poorly studied organisms.
    Dr. Jared E. Decker, Dissertation Advisor. Includes bibliographical references (pages 101-127).
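    Most short-read de novo assemblers are built on the de Bruijn graph: reads are decomposed into k-mers, nodes are (k-1)-mers, edges record observed overlaps, and contigs fall out as unambiguous paths. A minimal sketch of the construction and path walking follows; error correction, tip clipping, and bubble popping, which real assemblers depend on, are omitted.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers observed after it."""
    graph = defaultdict(set)
    for seq in reads:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk forward from `start` while the path is unambiguous
    (exactly one outgoing edge), emitting one base per step."""
    contig, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break  # stop rather than loop around a cycle
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

g = de_bruijn(["ACGTACGGT", "CGTACGGTA"], k=4)
print(extend_contig(g, "CGT"))  # CGTACG: the walk stops at a branch
```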

    The fate of Arabidopsis thaliana homeologous CNSs and their motifs in the Paleohexaploid Brassica rapa.

    Following polyploidy, duplicate genes are often deleted, and if they are not, duplicate regulatory regions are sometimes lost. By what mechanism does this loss occur, and what is the chance that such a loss removes function? To explore these questions, we followed individual Arabidopsis thaliana-A. thaliana conserved noncoding sequences (CNSs) into the Brassica ancestor, through a paleohexaploidy, and into Brassica rapa. Thus, a single Brassicaceae CNS has six potential orthologous positions in B. rapa; a single Arabidopsis CNS has three potential homeologous positions. We reasoned that a CNS present on a singlet Brassica gene would be unlikely to lose function compared with a more redundant CNS, and this is the case: redundant CNSs often become nondetectable. Using this logic, each mechanism of CNS loss was assigned a metric of functionality. By definition, proved deletions do not function as sequence. Our results indicate that CNSs that become nondetectable by base substitution or large insertion are almost certainly still functional (redundancy matters little to their detectability frequency), whereas those lost by inferred deletion or indels are approximately 75% likely to be nonfunctional. Overall, an average nondetectable, once-redundant CNS more than 30 bp in length has a 72% chance of being nonfunctional, which makes sense because 97% of them sort to a molecular mechanism with deletion in its description (0.97 × 0.75 ≈ 0.73), though base substitutions do cause some loss. Similarly, proved-functional G-boxes go undetectable by deletion 82% of the time. Fractionation mutagenesis is a procedure that uses polyploidy as a mutagenic agent to alter RNA expression profiles genetically and then to construct testable hypotheses as to the function of the lost regulatory site. We show fractionation mutagenesis to be a deletion machine in the Brassica lineage.

    Algebraic Distribution of Segmental Duplication Lengths in Whole-Genome Sequence Self-Alignments

    Distributions of duplicated sequences from genome self-alignment are characterized, including forward and backward alignments, in bacteria and eukaryotes. A Markovian process without auto-correlation should generate an exponential length distribution, as expected from local effects of point mutation and selection on localised function. However, the observed distributions deviate substantially from exponential form: they are roughly algebraic instead, suggesting a novel kind of long-distance correlation that must be non-local in origin.
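    The exponential-versus-algebraic question is directly testable by maximum likelihood: for lengths x >= x_min, the exponential model is fit by its rate (the reciprocal of the mean excess length), the power-law model by the standard continuous MLE for its exponent, and the two log-likelihoods compared. A minimal sketch follows; the input lengths are placeholder values, not data from the study.

```python
import math

def compare_fits(lengths, x_min):
    """Return (log-likelihood of exponential fit, log-likelihood of
    power-law fit) for duplication lengths x >= x_min."""
    xs = [x for x in lengths if x >= x_min]
    n = len(xs)
    # Exponential tail: p(x) = lam * exp(-lam * (x - x_min))
    lam = n / sum(x - x_min for x in xs)
    ll_exp = sum(math.log(lam) - lam * (x - x_min) for x in xs)
    # Power-law (algebraic) tail: p(x) = ((a-1)/x_min) * (x/x_min)**(-a)
    a = 1.0 + n / sum(math.log(x / x_min) for x in xs)
    ll_pow = sum(math.log((a - 1) / x_min) - a * math.log(x / x_min)
                 for x in xs)
    return ll_exp, ll_pow

ll_e, ll_p = compare_fits([120, 45, 300, 80, 2000, 60, 150], x_min=40)
print("exponential:", round(ll_e, 1), "power law:", round(ll_p, 1))
```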