
    Bioinformatic approaches for genome finishing

    Husemann P, Tauch A. Bioinformatic approaches for genome finishing. Bielefeld: Universitätsbibliothek Bielefeld; 2011.

    Indexing Highly Repetitive String Collections

    Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outperforms Moore's Law and challenges our ability to handle them even in compressed form. It turns out, fortunately, that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed in order to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this survey we cover the algorithmic developments that have led to these data structures. We describe the distinct compression paradigms that have been used to exploit repetitiveness, the fundamental algorithmic ideas that form the base of all the existing indexes, and the various structures that have been proposed, comparing them in both theoretical and practical aspects. We conclude with the current challenges in this fascinating field.
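A small illustration of the point the survey makes about repetitiveness (a sketch of ours, not code from the survey): the number of equal-letter runs in the Burrows-Wheeler transform, the quantity that run-length FM-indexes exploit, collapses for a repetitive string but not for a statistically identical shuffled one.

```python
# Sketch: contrast a repetitive and a random string by the number of
# equal-letter runs r in their Burrows-Wheeler transform (BWT).
# Naive O(n^2 log n) BWT construction, fine for a demo.
import random

def bwt_runs(text: str) -> int:
    """Count equal-letter runs in the BWT of text."""
    s = text + "\0"  # sentinel, smaller than any real character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    bwt = "".join(rot[-1] for rot in rotations)
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)

random.seed(42)
base = "".join(random.choice("ACGT") for _ in range(200))
repetitive = base * 10  # ten identical copies: highly repetitive, 2000 chars
shuffled = "".join(random.sample(repetitive, len(repetitive)))  # same statistics

print(bwt_runs(repetitive))  # few runs: repetition is captured
print(bwt_runs(shuffled))    # many runs: statistical compressors see no difference
```

Both strings have exactly the same character frequencies, so a statistical (order-0) compressor treats them alike; the run count, by contrast, drops sharply for the repetitive one.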

    Computing MEMs and Relatives on Repetitive Text Collections

    We consider the problem of computing the Maximal Exact Matches (MEMs) of a given pattern $P[1..m]$ on a large repetitive text collection $T[1..n]$, which is represented as a (hopefully much smaller) run-length context-free grammar of size $g_{rl}$. We show that the problem can be solved in time $O(m^2 \log^\epsilon n)$, for any constant $\epsilon > 0$, on a data structure of size $O(g_{rl})$. Further, on a locally consistent grammar of size $O(\delta \log\frac{n}{\delta})$, the time decreases to $O(m \log m(\log m + \log^\epsilon n))$. The value $\delta$ is a function of the substring complexity of $T$, and $\Omega(\delta \log\frac{n}{\delta})$ is a tight lower bound on the compressibility of repetitive texts $T$, so our structure has optimal size in terms of $n$ and $\delta$. We extend our results to several related problems, such as finding $k$-MEMs, MUMs, rare MEMs, and applications.
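To make the problem statement concrete, here is a naive quadratic baseline (ours, not the grammar-based structure of the paper): a MEM is a substring P[i..j] that occurs in T and cannot be extended in either direction while still occurring in T.

```python
# Naive MEM computation: report (start, length) pairs of Maximal Exact
# Matches of pattern P with respect to text T. Roughly O(m^2 * n) time;
# the paper's point is doing this in compressed space, far faster.

def mems(P: str, T: str):
    out = []
    m = len(P)
    for i in range(m):
        # longest match starting at i: right-maximal by construction
        j = i
        while j < m and P[i:j + 1] in T:
            j += 1
        length = j - i
        if length == 0:
            continue
        # keep only left-maximal matches (extending left fails too)
        if i == 0 or P[i - 1:j] not in T:
            out.append((i, length))
    return out

print(mems("GATTACA", "ATTAGACCATG"))  # → [(0, 2), (1, 4), (4, 2), (5, 2)]
```

For example, (1, 4) above is the substring "ATTA": it occurs in T, but neither "GATTA" nor "ATTAC" does.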

    Document retrieval hacks

    Publisher Copyright: © Simon J. Puglisi and Bella Zhukova; licensed under Creative Commons License CC-BY 4.0. 19th International Symposium on Experimental Algorithms (SEA 2021). Given a collection of strings, document listing refers to the problem of finding all the strings (or documents) in which a given query string (or pattern) appears. Index data structures that support efficient document listing for string collections have been the focus of intense research in the last decade, with dozens of papers published describing exotic and elegant compressed data structures. The problem is now quite well understood in theory, and many of the solutions have been implemented and evaluated experimentally. A particular recent focus has been on highly repetitive document collections, which have become prevalent in many areas (such as version control systems and genomics, to name just two very different sources). The aim of this paper is to describe simple and efficient document listing algorithms that can be used in combination with more sophisticated techniques, or as baselines against which the performance of new document listing indexes can be measured. Our approaches are based on simple combinations of scanning and hashing, which we show to combine very well with dictionary compression to achieve small space usage. Our experiments show these methods to be often much faster and less space-consuming than the best specialized indexes for the problem. Peer reviewed.
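A hedged sketch of the scan-plus-hash flavour of baseline the abstract describes (the names, the q-gram parameter, and the exact filtering scheme here are ours, not the authors'): index each document by the hashed q-grams it contains, use those sets to prune candidates, then confirm each survivor by a plain scan.

```python
# Document listing baseline: q-gram hash filter + verification scan.
Q = 4  # q-gram length (an assumption for this sketch)

def qgrams(s):
    return {s[i:i + Q] for i in range(len(s) - Q + 1)}

def build_index(docs):
    """Map each q-gram to the set of document ids containing it."""
    index = {}
    for doc_id, d in enumerate(docs):
        for g in qgrams(d):
            index.setdefault(g, set()).add(doc_id)
    return index

def document_listing(pattern, docs, index):
    if len(pattern) >= Q:
        # any document containing the pattern contains all its q-grams
        candidates = None
        for g in qgrams(pattern):
            hits = index.get(g, set())
            candidates = hits if candidates is None else candidates & hits
        candidates = candidates or set()
    else:
        candidates = set(range(len(docs)))  # pattern too short to filter
    # verification: a plain scan of the surviving candidates
    return sorted(d for d in candidates if pattern in docs[d])

docs = ["the cat sat", "the dog sat", "a cat nap"]
idx = build_index(docs)
print(document_listing("cat", docs, idx))   # → [0, 2]
print(document_listing(" sat", docs, idx))  # → [0, 1]
```

The filter can only produce false positives, never false negatives, so the final scan guarantees correctness while the hash index keeps the scanned set small.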

    Whole-genome assembly of the coral reef Pearlscale Pygmy Angelfish (Centropyge vrolikii)

    The diversity of DNA sequencing methods and algorithms for genome assemblies presents scientists with a bewildering array of choices. Here, we construct and compare eight candidate assemblies combining overlapping shotgun read data, mate-pair and Chicago libraries, and four different genome assemblers to produce a high-quality draft genome of the iconic coral reef Pearlscale Pygmy Angelfish, Centropyge vrolikii (family Pomacanthidae). The best candidate assembly combined all four data types and had a scaffold N50 127.5 times higher than the candidate assembly obtained from shotgun data only. Our best candidate assembly had a scaffold N50 of 8.97 Mb and a contig N50 of 189,827, and was 97.4% complete for BUSCO v2 (Actinopterygii set) and 95.6% complete for CEGMA matches. These contiguity and accuracy scores are higher than those of any other fish assembly released to date that did not apply linkage map information, including those based on more expensive long-read sequencing data. Our analysis of how different data types improve assembly quality will help others choose the most appropriate de novo genome sequencing strategy based on resources and target applications. Furthermore, the draft genome of the Pearlscale Pygmy Angelfish will play an important role in future studies of coral reef fish evolution, diversity and conservation. UC Berkeley | Ref. S10RR029668; UC Berkeley | Ref. S10RR02730
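The N50 values quoted above follow the standard definition: the length N such that scaffolds (or contigs) of length at least N together cover at least half the total assembly length. A minimal sketch of the computation:

```python
# N50: sort sequence lengths in decreasing order and walk down until the
# running total reaches half the assembly size; return that length.

def n50(lengths):
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Toy assembly of total length 100: 40 + 25 = 65 >= 50, so N50 is 25.
print(n50([10, 40, 25, 15, 10]))  # → 25
```

A higher N50 means fewer, longer pieces, which is why the 127.5-fold scaffold N50 improvement above is a meaningful contiguity gain.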

    Unveiling Human Non-Random Genome Editing Mechanisms Activated in Response to Chronic Environmental Changes: I. Where Might These Mechanisms Come from and What Might They Have Led To?

    This article challenges the notion of the randomness of mutations in eukaryotic cells by unveiling stress-induced human non-random genome editing mechanisms. To account for the existence of such mechanisms, I have developed molecular concepts of the cell environment and cell environmental stressors and, making use of a large quantity of published data, hypothesised the origin of some crucial biological leaps along the evolutionary path of life on Earth under the pressure of natural selection, in particular: (1) virus-cell mating as a primordial form of sexual recombination and symbiosis; (2) Lamarckian CRISPR-Cas systems; (3) eukaryotic gene development; (4) antiviral activity of retrotransposon-guided mutagenic enzymes; and finally, (5) the exaptation of antiviral mutagenic mechanisms to stress-induced genome editing mechanisms directed at "hyper-transcribed" endogenous genes. Genes transcribed at their maximum rate (hyper-transcribed), yet still unable to meet new chronic environmental demands generated by "pollution", are inadequate and generate more and more intronic retrotransposon transcripts. In this scenario, RNA-guided mutagenic enzymes (e.g., Apolipoprotein B mRNA editing catalytic polypeptide-like enzymes, APOBECs), which have been shown to bind to retrotransposon RNA-repetitive sequences, would be surgically targeted by intronic retrotransposons to opened chromatin regions of the same "hyper-transcribed" genes. RNA-guided mutagenic enzymes may therefore "Lamarckianly" generate single nucleotide polymorphisms (SNPs) and gene copy number variations (CNVs), as well as transposon transpositions and chromosomal translocations, in the restricted areas of hyper-functional and inadequate genes, leaving the rest of the genome intact. CNVs and SNPs of hyper-transcribed genes may allow cells to surgically explore a new fitness scenario, which increases their adaptability to stressful environmental conditions.
    Like the mechanisms of immunoglobulin somatic hypermutation, non-random genome editing mechanisms may generate several cell mutants, and those coding for the most environmentally adequate proteins would have a survival advantage and would therefore be Darwinianly selected. Non-random genome editing mechanisms represent tools of evolvability leading to organismal adaptation, including transgenerational non-Mendelian gene transmission, or to death of environmentally inadequate genomes. They are a link between environmental changes and biological novelty and plasticity, finally providing a molecular basis to reconcile gene-centred and "ecological" views of evolution.

    Transposable Element Populations Shed Light on the Evolutionary History of Wheat and the Complex Co-Evolution of Autonomous and Non-Autonomous Retrotransposons

    Wheat has one of the largest and most repetitive genomes among major crop plants, containing over 85% transposable elements (TEs). TEs populate genomes much in the way that individuals populate ecosystems, diversifying into different lineages, sub-families and sub-populations. The recent availability of high-quality, chromosome-scale genome sequences from ten wheat lines enables a detailed analysis of how TEs evolved in allohexaploid wheat, its diploid progenitors, and various chromosomal haplotype segments. LTR retrotransposon families evolved into distinct sub-populations and sub-families that were active in waves lasting several hundred thousand years. Furthermore, it is shown that different retrotransposon sub-families were active in the three wheat sub-genomes, making them useful markers to study and date polyploidization events and chromosomal rearrangements. Additionally, haplotype-specific TE sub-families are used to characterize chromosomal introgressions in different wheat lines. Moreover, populations of non-autonomous TEs co-evolved over millions of years with their autonomous partners, leading to complex systems with multiple types of autonomous, semi-autonomous and non-autonomous elements. Phylogenetic and TE population analyses revealed the relationships between non-autonomous elements and their mobilizing autonomous partners. TE population analysis provided insights into the genome evolution of allohexaploid wheat and the genetic diversity of the species, and may have implications for future crop breeding.

    Reevaluation of the Toxoplasma gondii and Neospora caninum genomes reveals misassembly, karyotype differences, and chromosomal rearrangements

    Neospora caninum primarily infects cattle, causing abortions, with an estimated annual impact of a billion dollars on the worldwide economy. However, the study of its biology has been neglected, owing to the established paradigm that it is virtually identical to its close relative, the widely studied human pathogen Toxoplasma gondii. By revisiting the genome sequence, assembly, and annotation using third-generation sequencing technologies, here we show that the N. caninum genome was originally incorrectly assembled under the presumption of synteny with T. gondii. We show that major chromosomal rearrangements have occurred between these species. Importantly, we show that the chromosomes originally named Chr VIIb and VIII are indeed fused, reducing the karyotype of both N. caninum and T. gondii to 13 chromosomes. We reannotate the N. caninum genome, revealing more than 500 new genes. We sequence and annotate the nonphotosynthetic plastid and mitochondrial genomes and show that although the apicoplast genomes are virtually identical, high levels of gene fragmentation and reshuffling exist between species and strains. Our results correct assembly artifacts that are currently widely distributed in the genome databases of N. caninum and T. gondii and, more importantly, highlight the mitochondria as a previously overlooked source of variability, paving the way for a change in the paradigm of synteny and encouraging a rethinking of the genome as the basis of the unique comparative biology of these pathogens. INIA: FSSA_X_2014_1_10602