Dualities in tree representations
A characterization is given of the tree T* such that BP(T*) equals the reversal of DFUDS(T). An immediate consequence is a rigorous characterization of the tree T^ such that BP(T^) = DFUDS(T^). In summary, BP and DFUDS are unified within an encompassing framework, which may enable future simplifications of queries in BP and/or DFUDS. The immediate benefits shown here are the identification of previously unnoted commonalities in the most recent work on the Range Minimum Query problem, and improvements for the Minimum Length Interval Query problem.
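For readers unfamiliar with the two encodings, the following is a minimal sketch of how BP (balanced parentheses) and DFUDS (depth-first unary degree sequence) are usually defined for an ordinal tree. The nested-list tree representation and the function names `bp` and `dfuds` are assumptions made for illustration; they do not come from the paper, and the sketch does not implement the duality it describes.

```python
# A tree is modelled as a nested list: each node is the list of its children.
# This representation is an illustrative assumption, not the paper's.

def bp(node):
    """Balanced parentheses: '(' on entering a node, ')' on leaving it."""
    return "(" + "".join(bp(child) for child in node) + ")"

def dfuds(tree):
    """DFUDS: in preorder, write one '(' per child followed by ')';
    a single leading '(' makes the whole sequence balanced."""
    def visit(node):
        return "(" * len(node) + ")" + "".join(visit(child) for child in node)
    return "(" + visit(tree)

# Root with two children: a leaf, and an internal node with two leaves.
example = [[], [[], []]]
print(bp(example))     # (()(()()))
print(dfuds(example))  # ((())(()))
```

Both strings use two parentheses per node, which is why the two encodings can be compared position by position.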
Draft genome of the lowland anoa (Bubalus depressicornis) and comparison with buffalo genome assemblies (Bovidae, Bubalina)
Genomic data for wild species of the genus Bubalus (Asian buffaloes) are still lacking, whereas several whole genomes are available for domestic water buffaloes. To address this, we sequenced the genome of a wild endangered dwarf buffalo, the lowland anoa (Bubalus depressicornis), produced a draft genome assembly, and compared it with published buffalo genomes. The lowland anoa genome assembly was 2.56 Gbp long and contained 103,135 contigs, the longest being 337.39 kbp. N50 and L50 values were 38.73 kbp and 19.83k contigs, respectively; mean coverage was 44x and GC content was 41.74%. Two strategies were adopted to evaluate genome completeness: (i) determination of genomic features with de novo and homology-based predictions using annotations of the chromosome-level genome assembly of the river buffalo, and (ii) benchmarking against universal single-copy orthologs (BUSCO). Homology-based predictions identified 94.51% complete and 3.65% partial genomic features. De novo gene predictions identified 32,393 genes, representing 97.14% of the reference's annotated genes, whilst a BUSCO search against the mammalian orthologues database identified 71.1% complete, 11.7% fragmented and 17.2% missing orthologues, indicating a good level of completeness for downstream analyses. Repeat analyses indicated that 42.12% of the lowland anoa genome consists of repetitive regions. The genome assembly of the lowland anoa is expected to contribute to comparative genome analyses among bovid species. [Abstract copyright: © The Author(s) 2022. Published by Oxford University Press on behalf of Genetics Society of America.]
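As a reminder of what the N50 and L50 statistics measure, here is a small sketch under the usual definitions: N50 is the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the assembly size, and L50 is the number of contigs needed to get there. The function name `n50_l50` and the toy contig lengths are hypothetical and are not taken from the anoa assembly.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            # 'length' is N50 (a length); 'count' is L50 (a number of contigs).
            return length, count
    raise ValueError("contig_lengths must not be empty")

# Toy example: total length 290, half is 145; 80 + 70 = 150 reaches it.
print(n50_l50([80, 70, 50, 40, 30, 20]))  # (70, 2)
```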
Using cascading Bloom filters to improve the memory usage for de Brujin graphs
De Bruijn graphs are widely used in bioinformatics for processing
next-generation sequencing data. Due to a very large size of NGS datasets, it
is essential to represent de Bruijn graphs compactly, and several approaches to
this problem have been proposed recently. In this work, we show how to reduce
the memory required by the algorithm of [3] that represents de Bruijn graphs
using Bloom filters. Our method requires 30% to 40% less memory than
the method of [3], with insignificant impact on construction time. At the same
time, our experiments showed a better query time compared to [3]. This is, to
our knowledge, the best practical representation for de Bruijn graphs. Comment: 12 pages, submitte
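To make the underlying idea concrete, the sketch below stores the k-mers of a read in a plain Bloom filter and queries it to find the possible successors of a node in the de Bruijn graph. It is only a baseline illustration: it is neither the method of [3] nor the cascading scheme of this paper, and the class name, the filter size, the number of hash functions, and the SHA-256-based hashing are all assumptions. A Bloom filter never misses an inserted k-mer but can report false positives, and handling those is the problem the compact representations address.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter over strings (illustrative parameters)."""

    def __init__(self, size_bits, num_hashes):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive several hash positions by prefixing a seed to the item.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def kmers(read, k):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def successors(bloom, kmer):
    """Candidate out-neighbours of a k-mer; may include false positives."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]

bloom = BloomFilter(size_bits=1024, num_hashes=3)
for km in kmers("ACGTAC", 3):
    bloom.add(km)

print("CGT" in successors(bloom, "ACG"))  # True: inserted k-mers are always found
```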
Safe and complete contig assembly via omnitigs
Contig assembly is the first stage that most assemblers solve when
reconstructing a genome from a set of reads. Its output consists of contigs --
a set of strings that are promised to appear in any genome that could have
generated the reads. From the introduction of contigs 20 years ago, assemblers
have tried to obtain longer and longer contigs, but the following question was
never solved: given a genome graph (e.g. a de Bruijn, or a string graph),
what are all the strings that can be safely reported from it as contigs? In
this paper we finally answer this question, and also give a polynomial time
algorithm to find them. Our experiments show that these strings, which we call
omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of
dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
Seedability: optimizing alignment parameters for sensitive sequence comparison
Data availability:
The data underlying this article are available either in https://github.com/lorrainea/Seedability or in the Ensembl database at https://www.ensembl.org, where they can be accessed using the gene names ENSPTRG00000044036 and ENSG00000174236, or in the NCBI database at https://www.ncbi.nlm.nih.gov, where they can be found using the reference sequence NC_000001.11.
Motivation:
Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences.
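The effect of k on seeding can be illustrated with a toy sketch that lists exact k-mer hits shared by two sequences. This is not Seedability's estimation procedure and not Minimap2's seeding (which relies on minimizers); the function name `shared_seeds` and the two short sequences are hypothetical, chosen only to show that a smaller k yields more seeds on divergent sequences.

```python
def shared_seeds(a, b, k):
    """Return (position in a, position in b) for every exact k-mer hit."""
    index = {}
    for i in range(len(a) - k + 1):
        index.setdefault(a[i:i + k], []).append(i)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]

a = "ACGTACGT"
b = "ACGAACGT"  # differs from a by one substitution
for k in (3, 4, 5):
    print(k, len(shared_seeds(a, b, k)))
# 3 6
# 4 2
# 5 0
```

With k = 5 the single substitution leaves no shared seed at all, so a seed-based aligner using that k would have no anchor for this pair.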
Results:
The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments.
Availability and implementation:
https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
R.C. was supported by ANR Full-RNA, SeqDigger, Inception, and PRAIRIE grants (ANR-22-CE45-0007, ANR-19-CE45-0008, PIA/ANR16-CONV-0005, ANR-19-P3IA-0001). This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No. 872539 (PANGAIA) and 956229 (ALPACA).
A framework for space-efficient string kernels
String kernels are typically used to compare genome-scale sequences whose
length makes alignment impractical, yet their computation is based on data
structures that are either space-inefficient, or incur large slowdowns. We show
that a number of exact string kernels, like the k-mer kernel, the substrings
kernels, a number of length-weighted kernels, the minimal absent words kernel,
and kernels with Markovian corrections, can all be computed in time and
in bits of space in addition to the input, using just a
data structure on the Burrows-Wheeler transform of the
input strings, which takes time per element in its output. The same
bounds hold for a number of measures of compositional complexity based on
multiple values of k, like the k-mer profile and the k-th order empirical
entropy, and for calibrating the value of k using the data.
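For orientation, a k-mer kernel between two strings can be computed naively as the inner product of their k-mer count vectors. The sketch below does exactly that with hash-based counters; it is the straightforward, space-hungry baseline and not the Burrows-Wheeler-based method of the paper, and the function names are assumptions.

```python
from collections import Counter

def kmer_profile(s, k):
    """Count every k-mer occurring in s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def kmer_kernel(s, t, k):
    """Inner product of the k-mer count vectors of s and t."""
    p, q = kmer_profile(s, k), kmer_profile(t, k)
    return sum(count * q[word] for word, count in p.items())

# "ABAB" has AB x2, BA x1; "BABA" has BA x2, AB x1; so 2*1 + 1*2 = 4.
print(kmer_kernel("ABAB", "BABA", 2))  # 4
```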
STRONG: metagenomics strain resolution on assembly graphs
We introduce STrain Resolution ON assembly Graphs (STRONG), which identifies strains de novo, from multiple metagenome samples. STRONG performs coassembly, and binning into metagenome assembled genomes (MAGs), and stores the coassembly graph prior to variant simplification. This enables the subgraphs and their unitig per-sample coverages, for individual single-copy core genes (SCGs) in each MAG, to be extracted. A Bayesian algorithm, BayesPaths, determines the number of strains present, their haplotypes or sequences on the SCGs, and abundances. STRONG is validated using synthetic communities and for a real anaerobic digestor time series generates haplotypes that match those observed from long Nanopore reads
Consequences of breed formation on patterns of genomic diversity and differentiation: the case of highly diverse peripheral Iberian cattle
Iberian primitive breeds exhibit remarkable phenotypic diversity over a very limited geographical space. While genomic data are accumulating for most commercial cattle, they are still lacking for these primitive breeds. Whole-genome data are key to understanding the consequences of historic breed formation and the putative role of earlier admixture events in the observed diversity patterns.
The value of the spineless monkey orange tree (Strychnos madagascariensis) for conservation of northern sportive lemurs (Lepilemur milanoii and L. ankaranensis)
Tree hollows provide shelter for a large number of forest-dependent vertebrate species worldwide. In Madagascar, where high historical and ongoing rates of deforestation and forest degradation are responsible for a major environmental crisis, reduced availability of tree hollows may lead to declines in hollow-dwelling species such as sportive lemurs, one of the most species-rich groups of lemurs. The identification of native tree species used by hollow-dwelling lemurs may facilitate targeted management interventions to maintain or improve habitat quality for these lemurs. During an extensive survey of sportive lemurs in northern Madagascar, we identified one tree species, Strychnos madagascariensis (Loganiaceae), the spineless monkey orange tree, as a principal sleeping site of two species of northern sportive lemurs, Lepilemur ankaranensis and L. milanoii (Lepilemuridae). This tree species represented 32.5% (n=150) of the 458 sleeping sites recorded. This result suggests that S. madagascariensis may be valuable for the conservation of hollow-dwelling lemurs.