9,020 research outputs found

    Multiple organism algorithm for finding ultraconserved elements

    Background: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) in the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with micro RNAs, mRNA processing, development and transcription regulation. The identification and characterization of these elements among genomes is necessary for the further understanding of their functionality.
    Results: We describe an algorithm and provide freely available software which can find all of the ultraconserved sequences between genomes of multiple organisms. Our algorithm takes a combinatorial approach that finds all sequences without requiring the genomes to be aligned. The algorithm is significantly faster than BLAST and is designed to handle very large genomes efficiently. We ran our algorithm on several large comparative analyses to evaluate its effectiveness; one compared 17 vertebrate genomes where we find 123 ultraconserved elements longer than 40 bps shared by all of the organisms, and another compared the human body louse, Pediculus humanus humanus, against itself and select insects to find thousands of non-coding, potentially functional sequences.
    Conclusion: Whole genome comparative analysis for multiple organisms is both feasible and desirable in our search for biological knowledge. We argue that bioinformatic programs should be forward thinking by assuming analysis on multiple (and possibly large) genomes in the design and implementation of algorithms. Our algorithm shows how a compromise design with a trade-off of disk space versus memory space allows for efficient computation while only requiring modest computer resources, and at the same time providing benefits not available with other software.
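    The definition in this abstract (100% identity across all compared sequences) can be illustrated with a deliberately naive sketch. This is not the authors' combinatorial algorithm, which is not described here; the function name `shared_exact_matches` and the toy sequences are hypothetical, and the quadratic substring search would not scale to real genomes. Reverse-complement matches are ignored.

```python
def shared_exact_matches(seqs, min_len):
    """Naive search for maximal substrings of seqs[0] found verbatim in every other sequence."""
    ref, others = seqs[0], seqs[1:]
    n = len(ref)
    matches = []
    prev_end = 0  # furthest right end reached by any earlier start position
    for i in range(n):
        j = i
        # Extend to the right while the substring still occurs in every other sequence.
        while j < n and all(ref[i:j + 1] in s for s in others):
            j += 1
        # Keep matches that are long enough and not contained in an earlier match.
        if j - i >= min_len and j > prev_end:
            matches.append(ref[i:j])
        prev_end = max(prev_end, j)
    return matches


# Hypothetical toy sequences, made up for illustration only.
toy = ["ACGTACGTTTGACCA", "GGACGTACGAATTGACC", "TTGACCTTACGTACG"]
print(shared_exact_matches(toy, 5))  # ['ACGTACG', 'TTGACC']
```

    The containment check relies on the fact that, if a substring starting at position i is shared, the same substring with its first character dropped is shared as well, so right ends never decrease as i advances.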

    Design of small molecule-responsive microRNAs based on structural requirements for Drosha processing

    MicroRNAs (miRNAs) are prevalent regulatory RNAs that mediate gene silencing and play key roles in diverse cellular processes. While synthetic RNA-based regulatory systems that integrate regulatory and sensing functions have been demonstrated, the lack of detail on miRNA structure–function relationships has limited the development of integrated control systems based on miRNA silencing. Using an elucidated relationship between Drosha processing and the single-stranded nature of the miRNA basal segments, we developed a strategy for designing ligand-responsive miRNAs. We demonstrate that ligand binding to an aptamer integrated into the miRNA basal segments inhibits Drosha processing, resulting in titratable control over gene silencing. The generality of this control strategy was shown for three aptamer–small molecule ligand pairs. The platform can be extended to the design of synthetic miRNA clusters, cis-acting miRNAs and self-targeting miRNAs that act both in cis and trans, enabling fine-tuning of the regulatory strength and dynamics. The ability of our ligand-responsive miRNA platform to respond to user-defined inputs, undergo regulatory performance tuning and display scalable combinatorial control schemes will help advance applications in biological research and applied medicine.

    Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

    For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; thus, we hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

    Drosophila Spastin Regulates Synaptic Microtubule Networks and Is Required for Normal Motor Function

    Nina Tang Sherwood is with California Institute of Technology, Qi Sun is with California Institute of Technology, Mingshan Xue is with UT Austin, Bing Zhang is with UT Austin, Kai Zinn is with California Institute of Technology.
    The most common form of human autosomal dominant hereditary spastic paraplegia (AD-HSP) is caused by mutations in the SPG4 (spastin) gene, which encodes an AAA ATPase closely related in sequence to the microtubule-severing protein Katanin. Patients with AD-HSP exhibit degeneration of the distal regions of the longest axons in the spinal cord. Loss-of-function mutations in the Drosophila spastin gene produce larval neuromuscular junction (NMJ) phenotypes. NMJ synaptic boutons in spastin mutants are more numerous and more clustered than in wild-type, and transmitter release is impaired. spastin-null adult flies have severe movement defects. They do not fly or jump, they climb poorly, and they have short lifespans. spastin hypomorphs have weaker behavioral phenotypes. Overexpression of Spastin erases the muscle microtubule network. This gain-of-function phenotype is consistent with the hypothesis that Spastin has microtubule-severing activity, and implies that spastin loss-of-function mutants should have an increased number of microtubules. Surprisingly, however, we observed the opposite phenotype: in spastin-null mutants, there are fewer microtubule bundles within the NMJ, especially in its distal boutons. The Drosophila NMJ is a glutamatergic synapse that resembles excitatory synapses in the mammalian spinal cord, so the reduction of organized presynaptic microtubules that we observe in spastin mutants may be relevant to an understanding of human Spastin's role in maintenance of axon terminals in the spinal cord.
    Biological Sciences, School o

    The first peptides: the evolutionary transition between prebiotic amino acids and early proteins

    The issues we attempt to tackle here are what the first peptides looked like when they emerged on the primitive earth, and what simple catalytic activities they fulfilled. We conjecture that the early functional peptides were short (3 to 8 amino acids long), were made of those amino acids, Gly, Ala, Val and Asp, that are abundantly produced in many prebiotic synthesis experiments and observed in meteorites, and that the neutralization of Asp's negative charge was achieved by metal ions. We further assume that some traces of these prebiotic peptides still exist, in the form of active sites in present-day proteins. Searching these proteins for prebiotic peptide candidates led us to identify three main classes of motifs, bound mainly to Mg^{2+} ions: D(F/Y)DGD corresponding to the active site in RNA polymerases, DGD(G/A)D present in some kinds of mutases, and DAKVGDGD in dihydroxyacetone kinase. All three motifs contain a DGD submotif, which is suggested to be the common ancestor of all active peptides. Moreover, all three manipulate phosphate groups, which was probably a very important biological function in the very first stages of life. The statistical significance of our results is supported by the frequency of these motifs in today's proteins, which is three times higher than expected by chance, with a P-value of 3 × 10^{-2}. The implications of our findings in the context of the appearance of life and the possibility of an experimental validation are discussed.
    Comment: 22 pages, 2 figures, J. Theor. Biol. (2009) in press
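    The motif notation in this abstract, where (F/Y) means either residue is accepted at that position, maps directly onto regular-expression character classes. The following is a minimal sketch of that mapping only, not the authors' search procedure; the helper name `find_motifs` and the toy sequence are hypothetical.

```python
import re

# Motif classes quoted in the abstract; (F/Y) becomes the character class [FY].
MOTIFS = {
    "D(F/Y)DGD": "D[FY]DGD",
    "DGD(G/A)D": "DGD[GA]D",
    "DAKVGDGD": "DAKVGDGD",
}


def find_motifs(sequence):
    """Return {motif name: [0-based start positions]} for each motif present in sequence."""
    hits = {}
    for name, pattern in MOTIFS.items():
        # Wrapping the pattern in a lookahead reports overlapping occurrences too.
        starts = [m.start() for m in re.finditer(f"(?={pattern})", sequence)]
        if starts:
            hits[name] = starts
    return hits


# Hypothetical toy sequence, made up for illustration; not a real protein.
toy = "MKDYDGDAAADGDGDLLDAKVGDGD"
print(find_motifs(toy))
# {'D(F/Y)DGD': [2], 'DGD(G/A)D': [10], 'DAKVGDGD': [17]}
```

    Every pattern contains the literal DGD, which mirrors the shared submotif the abstract points out.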

    Algorithms for efficient sequence comparison without alignment (Algoritmi za učinkovitu usporedbu sekvenci bez korištenja sravnjivanja)

    Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

    Table of Contents
    1. GENERAL INTRODUCTION
      1.1. Suffix trees and other index data structures used in biological sequence analysis
        1.1.1. Suffix Tree
        1.1.2. The space and the time complexity of the algorithms for the suffix tree construction
        1.1.3. Suffix Array
        1.1.4. The space and the time complexity of the algorithms for suffix array construction
        1.1.5. Enhanced Suffix Array
        1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm
        1.1.7. Self-index
        1.1.8. Burrows-Wheeler transform
        1.1.9. The FM-Index and the backward search algorithm
        1.1.10. The space and the time-complexity of the FM-index
    2. EFFICIENT ESTIMATION OF PAIRWISE DISTANCES BETWEEN GENOMES
      2.1. Introduction
      2.2. Methods
        2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr
        2.2.2. Problem statement
        2.2.3. Time complexity analysis of the previous approach (kr 1)
        2.2.4. Time complexity analysis of the new approach (kr 2)
        2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences
        2.2.6. The implementation of kr version 2
      2.3. Analysis of Kr on simulated data sets
        2.3.1. Auxiliary programs
        2.3.2. Consistency of Kr
        2.3.3. The effect of horizontal gene transfer on the accuracy of Kr
        2.3.4. The effect of genome duplication on the accuracy of Kr
        2.3.5. Run time comparison of kr 1 and kr 2
      2.4. Application of kr version 2
        2.4.1. Auxiliary software used for the analysis of real data sets
        2.4.2. The analysis of 12 Drosophila genomes
        2.4.3. The analysis of 13 Escherichia coli and Shigella genomes
        2.4.4. The analysis of 825 HIV-1 pure subtype genomes
      2.5. Discussion
    3. EFFICIENT ALIGNMENT-FREE DETECTION OF LOCAL SEQUENCE HOMOLOGY
      3.1. Introduction
      3.2. Methods
        3.2.1. Problem statement – determining subtype(s) of a query sequence
        3.2.2. Construction of locally homologous segments
        3.2.3. Time complexity of computing a list of intervals Ii
        3.2.4. Algorithm 2: Construction of an interval tree
        3.2.5. Computing a list of segments Gi
      3.3. Analysis of st on simulated data sets
        3.3.1. Run-time and memory usage analysis of st
        3.3.2. Consistency of st
        3.3.3. Comparison to SCUEAL on simulated data sets
      3.4. Application of st
        3.4.1. The analysis of Neisseria meningitidis
        3.4.2. The analysis of a recombinant form of HIV-1
        3.4.3. The analysis of circulating recombinant forms of HIV-1
        3.4.4. The analysis of an avian pathogenic Escherichia coli strain
      3.5. Discussion
    4. CONCLUSION
    5. REFERENCES
    6. ELECTRONIC SOURCES
    7. LIST OF ABBREVIATIONS AND SYMBOLS
    ABSTRACT
    SUMMARY (in Croatian)
    CURRICULUM VITAE
    CURRICULUM VITAE (in Croatian)

    Sequence queries on temporal graphs

    Graphs that evolve over time are called temporal graphs. They can be used to describe and represent real-world networks, including transportation networks, social networks, and communication networks, with higher fidelity and accuracy. However, research is still limited on how to manage large-scale temporal graphs and execute queries over these graphs efficiently and effectively. This thesis investigates the problems of temporal graph data management related to node and edge sequence queries. In temporal graphs, nodes and edges can evolve over time. Therefore, sequence queries on nodes and edges can be key components in managing temporal graphs. In this thesis, the node sequence query decomposes into two parts: graph node similarity and subsequence matching. For node similarity, this thesis proposes a modified tree edit distance that is metric and polynomially computable and has a natural, intuitive interpretation. Note that the proposed node similarity works even for inter-graph nodes and therefore can be used for graph de-anonymization, network transfer learning, and cross-network mining, among other tasks. The subsequence matching query proposed in this thesis is a framework that can be adopted to index generic sequence and time-series data, including trajectory data and even DNA sequences for subsequence retrieval. For edge sequence queries, this thesis proposes an efficient storage and optimized indexing technique that allows for efficient retrieval of temporal subgraphs that satisfy certain temporal predicates. For this problem, this thesis develops a lightweight data management engine prototype that can support time-sensitive temporal graph analytics efficiently even on a single PC.

    Origin of biological information: Inherent occurrence of intron-rich split genes, coding for complex extant proteins, within pre-biotic random genetic sequences

    The origin of biological information is an unexplained phenomenon. Prior research on the origin of proteins, based on the assumption that the first genes were contiguous prokaryotic sequences, has not succeeded. Rather, it has been established that contiguous protein-coding genes do not exist in practically any amount of random genetic sequences. We found that complex eukaryotic proteins could be inherently encoded in split genes that could exist by chance within mere micrograms to milligrams of random DNA. Using protein amino acid sequence variability, codon degeneracy, and stringent exon-length restriction, we demonstrate that split genes for proteins of extant eukaryotes occur extensively in random genetic sequences. The results provide evidence that an abundance of split genes encoding advanced proteins in a small amount of prebiotic genetic material could have ignited the evolution of the eukaryotic genome.

    Algorithms for the analysis of molecular sequences
