
    Bioinformatic approaches for genome finishing

    Husemann P, Tauch A. Bioinformatic approaches for genome finishing. Bielefeld: Universitätsbibliothek Bielefeld; 2011.

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects. We conclude with the current challenges in this fascinating field.
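A small illustration of the point the survey makes about repetitiveness (a sketch of ours, not code from the survey): the number of equal-letter runs in the Burrows-Wheeler transform, the quantity that run-length FM-indexes exploit, collapses for a repetitive string but not for a statistically identical shuffled one.

```python
# Sketch: contrast a repetitive and a random string by the number of
# equal-letter runs r in their Burrows-Wheeler transform (BWT).
# Naive O(n^2 log n) BWT construction, fine for a demo.
import random

def bwt_runs(text: str) -> int:
    """Count equal-letter runs in the BWT of text."""
    s = text + "\0"  # sentinel, smaller than any real character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)

random.seed(42)
base = "".join(random.choice("ACGT") for _ in range(200))
repetitive = base * 10  # ten identical copies: highly repetitive, 2000 chars
shuffled = "".join(random.sample(repetitive, len(repetitive)))  # same statistics

print(bwt_runs(repetitive))  # few runs: repetition is captured
print(bwt_runs(shuffled))    # many runs: statistical compressors see no difference
```

Both strings have exactly the same character frequencies, so a statistical (order-0) compressor treats them alike; the run count, by contrast, drops sharply for the repetitive one.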

    Computing MEMs and Relatives on Repetitive Text Collections

    We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1..m]$ on a large repetitive text collection $T[1..n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta \log\frac{n}{\delta})$, the time decreases to $O(m \log m(\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$, and $\Omega(\delta \log\frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, rare MEMs, and applications.
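To make the problem statement concrete, here is a naive quadratic baseline (ours, not the grammar-based structure of the paper): a MEM is a substring P[i..j] that occurs in T and cannot be extended in either direction while still occurring in T.

```python
# Naive MEM computation: report (start, length) pairs of Maximal Exact
# Matches of pattern P with respect to text T. Roughly O(m^2 * n) time;
# the paper's point is doing this in compressed space, far faster.

def mems(P: str, T: str):
    out = []
    m = len(P)
    for i in range(m):
        # longest match starting at i: right-maximal by construction
        j = i
        while j < m and P[i:j + 1] in T:
            j += 1
        length = j - i
        if length == 0:
            continue
        # keep only left-maximal matches (extending left fails too)
        if i == 0 or P[i - 1:j] not in T:
            out.append((i, length))
    return out

print(mems("GATTACA", "ATTAGACCATG"))  # → [(0, 2), (1, 4), (4, 2), (5, 2)]
```

For example, (1, 4) above is the substring "ATTA": it occurs in T, but neither "GATTA" nor "ATTAC" does.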

    Document retrieval hacks

    Publisher Copyright: © Simon J. Puglisi and Bella Zhukova; licensed under Creative Commons License CC-BY 4.0. 19th International Symposium on Experimental Algorithms (SEA 2021). Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) in which a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory, and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics, to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space-consuming than the best specialized indexes for the problem. Peer reviewed.
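A hedged sketch of the scan-plus-hash flavour of baseline the abstract describes (the names, the q-gram parameter, and the exact filtering scheme here are ours, not the authors'): index each document by the hashed q-grams it contains, use those sets to prune candidates, then confirm each survivor by a plain scan.

```python
# Document listing baseline: q-gram hash filter + verification scan.
Q = 4  # q-gram length (an assumption for this sketch)

def qgrams(s):
    return {s[i:i + Q] for i in range(len(s) - Q + 1)}

def build_index(docs):
    """Map each q-gram to the set of document ids containing it."""
    index = {}
    for doc_id, d in enumerate(docs):
        for g in qgrams(d):
            index.setdefault(g, set()).add(doc_id)
    return index

def document_listing(pattern, docs, index):
    if len(pattern) >= Q:
        # any document containing the pattern contains all its q-grams
        candidates = None
        for g in qgrams(pattern):
            hits = index.get(g, set())
            candidates = hits if candidates is None else candidates & hits
        candidates = candidates or set()
    else:
        candidates = set(range(len(docs)))  # pattern too short to filter
    # verification: a plain scan of the surviving candidates
    return sorted(d for d in candidates if pattern in docs[d])

docs = ["the cat sat", "the dog sat", "a cat nap"]
idx = build_index(docs)
print(document_listing("cat", docs, idx))   # → [0, 2]
print(document_listing(" sat", docs, idx))  # → [0, 1]
```

The filter can only produce false positives, never false negatives, so the final scan guarantees correctness while the hash index keeps the scanned set small.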

    Whole-genome assembly of the coral reef Pearlscale Pygmy Angelfish (Centropyge vrolikii)

    The diversity of DNA sequencing methods and algorithms for genome assemblies presents scientists with a bewildering array of choices. Here, we construct and compare eight candidate assemblies combining overlapping shotgun read data, mate-pair and Chicago libraries, and four different genome assemblers to produce a high-quality draft genome of the iconic coral reef Pearlscale Pygmy Angelfish, Centropyge vrolikii (family Pomacanthidae). The best candidate assembly combined all four data types and had a scaffold N50 127.5 times higher than the candidate assembly obtained from shotgun data only. Our best candidate assembly had a scaffold N50 of 8.97 Mb and a contig N50 of 189,827, and was 97.4% complete for BUSCO v2 (Actinopterygii set) and 95.6% complete for CEGMA matches. These contiguity and accuracy scores are higher than those of any other fish assembly released to date that did not apply linkage map information, including those based on more expensive long-read sequencing data. Our analysis of how different data types improve assembly quality will help others choose the most appropriate de novo genome sequencing strategy based on resources and target applications. Furthermore, the draft genome of the Pearlscale Pygmy Angelfish will play an important role in future studies of coral reef fish evolution, diversity and conservation. UC Berkeley | Ref. S10RR029668; UC Berkeley | Ref. S10RR02730
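The N50 values quoted above follow the standard definition: the length N such that scaffolds (or contigs) of length at least N together cover at least half the total assembly length. A minimal sketch of the computation:

```python
# N50: sort sequence lengths in decreasing order and walk down until the
# running total reaches half the assembly size; return that length.

def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy assembly of total length 100: 40 + 25 = 65 >= 50, so N50 is 25.
print(n50([10, 40, 25, 15, 10]))  # → 25
```

A higher N50 means fewer, longer pieces, which is why the 127.5-fold scaffold N50 improvement above is a meaningful contiguity gain.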

    Unveiling Human Non-Random Genome Editing Mechanisms Activated in Response to Chronic Environmental Changes: I. Where Might These Mechanisms Come from and What Might They Have Led To?

    This article challenges the notion of the randomness of mutations in eukaryotic cells by unveiling stress-induced human non-random genome editing mechanisms. To account for the existence of such mechanisms, I have developed molecular concepts of the cell environment and cell environmental stressors and, making use of a large quantity of published data, hypothesised the origin of some crucial biological leaps along the evolutionary path of life on Earth under the pressure of natural selection, in particular: (1) virus-cell mating as a primordial form of sexual recombination and symbiosis; (2) Lamarckian CRISPR-Cas systems; (3) eukaryotic gene development; (4) antiviral activity of retrotransposon-guided mutagenic enzymes; and finally, (5) the exaptation of antiviral mutagenic mechanisms to stress-induced genome editing mechanisms directed at "hyper-transcribed" endogenous genes. Genes transcribed at their maximum rate (hyper-transcribed), yet still unable to meet new chronic environmental demands generated by "pollution", are inadequate and generate more and more intronic retrotransposon transcripts. In this scenario, RNA-guided mutagenic enzymes (e.g., Apolipoprotein B mRNA editing catalytic polypeptide-like enzymes, APOBECs), which have been shown to bind to retrotransposon RNA-repetitive sequences, would be surgically targeted by intronic retrotransposons to opened chromatin regions of the same "hyper-transcribed" genes. RNA-guided mutagenic enzymes may therefore "Lamarckianly" generate single nucleotide polymorphisms (SNPs) and gene copy number variations (CNVs), as well as transposon transpositions and chromosomal translocations, in the restricted areas of hyper-functional and inadequate genes, leaving the rest of the genome intact. CNVs and SNPs of hyper-transcribed genes may allow cells to surgically explore a new fitness scenario, which increases their adaptability to stressful environmental conditions.
    Like the mechanisms of immunoglobulin somatic hypermutation, non-random genome editing mechanisms may generate several cell mutants, and those coding for the most environmentally adequate proteins would have a survival advantage and would therefore be Darwinianly selected. Non-random genome editing mechanisms represent tools of evolvability leading to organismal adaptation, including transgenerational non-Mendelian gene transmission, or to death of environmentally inadequate genomes. They are a link between environmental changes and biological novelty and plasticity, finally providing a molecular basis to reconcile gene-centred and "ecological" views of evolution.

    Transposable Element Populations Shed Light on the Evolutionary History of Wheat and the Complex Co-Evolution of Autonomous and Non-Autonomous Retrotransposons

    Wheat has one of the largest and most repetitive genomes among major crop plants, containing over 85% transposable elements (TEs). TEs populate genomes much in the way that individuals populate ecosystems, diversifying into different lineages, sub-families and sub-populations. The recent availability of high-quality, chromosome-scale genome sequences from ten wheat lines enables a detailed analysis of how TEs evolved in allohexaploid wheat, its diploid progenitors, and various chromosomal haplotype segments. LTR retrotransposon families evolved into distinct sub-populations and sub-families that were active in waves lasting several hundred thousand years. Furthermore, it is shown that different retrotransposon sub-families were active in the three wheat sub-genomes, making them useful markers to study and date polyploidization events and chromosomal rearrangements. Additionally, haplotype-specific TE sub-families are used to characterize chromosomal introgressions in different wheat lines. Moreover, populations of non-autonomous TEs co-evolved over millions of years with their autonomous partners, leading to complex systems with multiple types of autonomous, semi-autonomous and non-autonomous elements. Phylogenetic and TE population analyses revealed the relationships between non-autonomous elements and their mobilizing autonomous partners. TE population analysis provided insights into the genome evolution of allohexaploid wheat and the genetic diversity of the species, and may have implications for future crop breeding.

    Reevaluation of the Toxoplasma gondii and Neospora caninum genomes reveals misassembly, karyotype differences, and chromosomal rearrangements

    Neospora caninum primarily infects cattle, causing abortions, with an estimated annual impact of a billion dollars on the worldwide economy. However, the study of its biology has been neglected, owing to the established paradigm that it is virtually identical to its close relative, the widely studied human pathogen Toxoplasma gondii. By revisiting the genome sequence, assembly, and annotation using third-generation sequencing technologies, here we show that the N. caninum genome was originally incorrectly assembled under the presumption of synteny with T. gondii. We show that major chromosomal rearrangements have occurred between these species. Importantly, we show that the chromosomes originally named Chr VIIb and VIII are indeed fused, reducing the karyotype of both N. caninum and T. gondii to 13 chromosomes. We reannotate the N. caninum genome, revealing more than 500 new genes. We sequence and annotate the nonphotosynthetic plastid and mitochondrial genomes and show that although the apicoplast genomes are virtually identical, high levels of gene fragmentation and reshuffling exist between species and strains. Our results correct assembly artifacts that are currently widely distributed in the genome databases of N. caninum and T. gondii and, more importantly, highlight the mitochondria as a previously overlooked source of variability, paving the way for a change in the paradigm of synteny and encouraging a rethinking of the genome as the basis of the unique comparative biology of these pathogens. INIA: FSSA_X_2014_1_10602