10,353 research outputs found

    Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement

    Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46 Mbp of genomic content conserved among all taxa and total unique content of 15.2 Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve. Comment: Revision dated June 19, 200

    Characterization of statistical features for plant microRNA prediction

    Background: Several tools are available to identify miRNAs from deep-sequencing data; however, only a few of them, such as miRDeep, can identify novel miRNAs and are also available as standalone applications. Given the differences between plant and animal miRNAs, particularly in the distribution of hairpin length and the nature of complementarity with the duplex partner (the miRNA star), the statistical features underlying miRDeep and other tools that use similar features are likely to be affected. Results: The potential effects on features such as minimum free energy, stability of secondary structures, and excision length were examined, and the parameters of those displaying sizable changes were estimated for plant-specific miRNAs. We found that most of these features acquired a new set of values or distributions for plant-specific miRNAs. While the stretch of conserved positions (nucleus) in mature miRNAs was relatively longer in plants, the difference in the distribution of minimum free energy between real and background hairpins was marginal. However, the choice of source (species) of background sequences was found to affect both the minimum free energy and miRNA hairpin stability. The new parameters were tested on an Illumina dataset from maize seedlings, and the results were compared with those obtained using default parameters. The newly parameterized model was found to have much improved specificity and sensitivity over its default counterpart. Conclusions: In summary, the present study reports the behavior of a few general and tool-specific statistical features for improving the prediction accuracy of plant miRNAs from deep-sequencing data.

    Statistical modeling of RNA structure profiling experiments enables parsimonious reconstruction of structure landscapes.

    RNA plays key regulatory roles in diverse cellular processes, where its functionality often derives from folding into and converting between structures. Many RNAs further rely on co-existence of alternative structures, which govern their response to cellular signals. However, characterizing heterogeneous landscapes is difficult, both experimentally and computationally. Recently, structure profiling experiments have emerged as powerful and affordable structure characterization methods, which improve computational structure prediction. To date, efforts have centered on predicting one optimal structure, with much less progress made on multiple-structure prediction. Here, we report a probabilistic modeling approach that predicts a parsimonious set of co-existing structures and estimates their abundances from structure profiling data. We demonstrate robust landscape reconstruction and quantitative insights into structural dynamics by analyzing numerous data sets. This work establishes a framework for data-directed characterization of structure landscapes to aid experimentalists in performing structure-function studies

    NOVEL COMPUTATIONAL METHODS FOR CANCER GENOMICS DATA ANALYSIS

    Cancer is a genetic disease responsible for one in eight deaths worldwide. The advancement of next-generation sequencing (NGS) technology has revolutionized cancer research, allowing comprehensive profiling of the cancer genome at great resolution. Large-scale cancer genomics research has sparked the need for efficient and accurate bioinformatics methods to analyze the data. The research presented in this dissertation focuses on three areas in cancer genomics: cancer somatic mutation detection, cancer driver gene identification, and transcriptome profiling at the single-cell level. NGS data analysis involves a series of complicated data transformations that convert raw sequencing data into information that is interpretable by cancer researchers. The first project in the dissertation established a robust, reproducible, and scalable cancer genomics data analysis workflow management system that automates best-practice mutation calling pipelines to detect somatic single nucleotide polymorphisms, insertions, deletions, and copy number variation from NGS data. It integrates mutation annotation, clinically actionable therapy prediction, and data visualization, streamlining the sequence-to-report data transformation. To differentiate driver mutations from the vast pool of passenger mutations produced by a somatic mutation calling project, we developed MEScan in the second project, a novel method that enables genome-scale identification of driver mutations based on a mutual exclusivity test using cancer somatic mutation data. MEScan implements an efficient statistical framework to screen de novo for mutually exclusive patterns while taking into account the patient-specific and gene-specific background mutation rates and adjusting for heterogeneous mutation frequency. It outperforms several existing methods in simulation studies and on real-world datasets. Genome-wide screening using existing TCGA somatic mutation data discovers novel cancer-specific and pan-cancer mutually exclusive patterns. Bulk RNA sequencing (RNA-Seq) has become one of the most commonly used techniques for transcriptome profiling in a wide spectrum of biomedical and biological research. Analyzing bulk RNA-Seq reads to quantify expression at each gene locus is the first step towards the identification of differentially expressed genes for downstream biological interpretation. Recent advances in single-cell RNA-seq (scRNA-seq) technology allow cancer biologists to profile gene expression at higher, cellular-level resolution. Preprocessing scRNA-seq data to quantify UMI-based gene counts is key to characterizing intra-tumor cellular heterogeneity and identifying rare cells that govern tumor progression, metastasis, and treatment resistance. Despite its popularity, summarizing gene counts from raw sequencing reads remains one of the most time-consuming steps with existing tools. Current pipelines do not balance efficiency and accuracy in large-scale gene count summarization in both bulk and scRNA-seq experiments. In the third project, we developed a lightweight k-mer-based gene counting algorithm, FastCount, to accurately and efficiently quantify gene-level abundance using bulk RNA-seq or UMI-based scRNA-seq data. It achieves at least an order-of-magnitude speed improvement over the current gold standard pipelines while providing competitive accuracy
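
    The general idea of k-mer-based gene counting mentioned for FastCount can be illustrated with a minimal sketch. This is not FastCount's implementation, which the abstract does not describe; the function names `build_kmer_index` and `count_reads`, the majority-vote assignment rule, and the choice to skip tied reads are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_kmer_index(genes, k):
    """Map every k-mer to the set of gene names whose sequence contains it."""
    index = defaultdict(set)
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def count_reads(reads, index, k):
    """Assign each read to the gene sharing the most k-mers with it.

    Hypothetical rule: reads with no matching k-mer, or with a tie
    between the two best genes, are left unassigned.
    """
    counts = Counter()
    for read in reads:
        votes = Counter()
        for i in range(len(read) - k + 1):
            # .get avoids inserting empty entries into the defaultdict
            for gene in index.get(read[i:i + k], ()):
                votes[gene] += 1
        if not votes:
            continue
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        counts[ranked[0][0]] += 1
    return counts
```

    The index is built once per reference and each read is then resolved by dictionary lookups only, which is the usual reason k-mer approaches avoid the cost of full alignment. Handling of UMIs and of reads from the reverse strand is left out of this sketch.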

    A Monte Carlo-based framework enhances the discovery and interpretation of regulatory sequence motifs

    Background: Discovery of functionally significant short, statistically overrepresented subsequence patterns (motifs) in a set of sequences is a challenging problem in bioinformatics. Oftentimes, not all sequences in the set contain a motif. These non-motif-containing sequences complicate the algorithmic discovery of motifs. Filtering the non-motif-containing sequences from the larger set of sequences while simultaneously determining the identity of the motif is, therefore, desirable and a non-trivial problem in motif discovery research. Results: We describe MotifCatcher, a framework that extends the sensitivity of existing motif-finding tools by employing random sampling to effectively remove non-motif-containing sequences from the motif search. We developed two implementations of our algorithm, each built around a commonly used motif-finding tool, and applied our algorithm to three diverse chromatin immunoprecipitation (ChIP) data sets. In each case, the motif finder with the MotifCatcher extension demonstrated improved sensitivity over the motif finder alone. Our approach organizes candidate functionally significant discovered motifs into a tree, which allowed us to gain additional insights. In all cases, we were able to support our findings with experimental work from the literature. Conclusions: Our framework demonstrates that additional processing at the sequence entry level can significantly improve the performance of existing motif-finding tools. For each biological data set tested, we were able to propose novel biological hypotheses supported by experimental work from the literature. Specifically, in Escherichia coli, we suggested binding site motifs for 6 non-traditional LexA protein binding sites; in Saccharomyces cerevisiae, we hypothesized 2 disparate mechanisms for novel binding sites of the Cse4p protein; and in Halobacterium sp. NRC-1, we discovered subtle differences in a general transcription factor (GTF) binding site motif across several data sets. We suggest that small differences in our discovered motif could confer specificity for one or more homologous GTF proteins. We offer a free implementation of the MotifCatcher software package at http://www.bme.ucdavis.edu/facciotti/resources_data/software/ . http://deepblue.lib.umich.edu/bitstream/2027.42/112965/1/12859_2012_Article_5570.pd
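
    The random-sampling idea in this abstract, running a motif finder on many random subsets and then keeping only the sequences that carry the recurring motif, can be sketched as follows. This is an illustrative toy and not MotifCatcher's algorithm: `toy_motif_finder` is a hypothetical stand-in for a real motif-finding tool (it just returns the exact k-mer present in the most sequences), and `sample_and_filter`, its parameters, and the exact-match filtering rule are assumptions.

```python
import random
from collections import Counter


def toy_motif_finder(seqs, k):
    """Stand-in for a real motif finder: the k-mer found in the most sequences.

    Each sequence contributes at most one vote per k-mer; ties are broken
    alphabetically so the result is deterministic.
    """
    support = Counter()
    for s in seqs:
        support.update({s[i:i + k] for i in range(len(s) - k + 1)})
    if not support:
        raise ValueError("no sequence is at least k characters long")
    best = max(support.values())
    return min(m for m, c in support.items() if c == best)


def sample_and_filter(seqs, k, subset_size, n_rounds, seed=0):
    """Run the motif finder on random subsets and keep motif-bearing sequences."""
    rng = random.Random(seed)
    motif_votes = Counter()
    for _ in range(n_rounds):
        subset = rng.sample(seqs, subset_size)
        motif_votes[toy_motif_finder(subset, k)] += 1
    # The motif recovered most often across subsets is taken as the consensus.
    consensus = motif_votes.most_common(1)[0][0]
    kept = [s for s in seqs if consensus in s]
    return consensus, kept
```

    A real motif finder would return a position weight matrix rather than an exact string, so the final filtering step would score sequences against that matrix instead of testing for a substring.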

    A survey of DNA motif finding algorithms

    Background: Unraveling the mechanisms that regulate gene expression is a major challenge in biology. An important task in this challenge is to identify regulatory elements, especially the binding sites in deoxyribonucleic acid (DNA) for transcription factors. These binding sites are short DNA segments that are called motifs. Recent advances in genome sequence availability and in high-throughput gene expression analysis technologies have allowed for the development of computational methods for motif finding. As a result, a large number of motif finding algorithms have been implemented and applied to various motif models over the past decade. This survey reviews the latest developments in DNA motif finding algorithms. Results: Earlier algorithms use promoter sequences of coregulated genes from a single genome and search for statistically overrepresented motifs. Recent algorithms are designed to use phylogenetic footprinting or orthologous sequences, as well as an integrated approach in which promoter sequences of coregulated genes and phylogenetic footprinting are used together. All the algorithms studied have been reported to correctly detect motifs that were previously detected by laboratory experimental approaches, and some algorithms were able to find novel motifs. However, most of these motif finding algorithms have been shown to work successfully in yeast and other lower organisms, but perform significantly worse in higher organisms. Conclusion: Despite considerable efforts to date, DNA motif finding remains a complex challenge for biologists and computer scientists. Researchers have taken many different approaches in developing motif discovery tools, and the progress made in this area of research is very encouraging. Comparing the performance of different motif finding tools and identifying the best tools have proven difficult, because the tools are based on diverse and complex algorithms and motif models, and because our incomplete understanding of the biology of regulatory mechanisms does not always permit adequate evaluation of the underlying algorithms over motif models. Peer reviewed. Computer Scienc