211 research outputs found
ntLink: a toolkit for de novo genome assembly scaffolding and mapping using long reads
With the increasing affordability and accessibility of genome sequencing
data, de novo genome assembly is an important first step to a wide variety of
downstream studies and analyses. Therefore, bioinformatics tools that enable
the generation of high-quality genome assemblies in a computationally efficient
manner are essential. Recent developments in long-read sequencing technologies
have greatly benefited genome assembly work, including scaffolding, by
providing long-range evidence that can aid in resolving the challenging
repetitive regions of complex genomes. ntLink is a flexible and
resource-efficient genome scaffolding tool that utilizes long-read sequencing
data to improve upon draft genome assemblies built from any sequencing
technologies, including the same long reads. Instead of using read alignments
to identify candidate joins, ntLink utilizes minimizer-based mappings to infer
how input sequences should be ordered and oriented into scaffolds. Recent
improvements to ntLink have added important features such as overlap detection,
gap-filling and in-code scaffolding iterations. Here, we present three basic
protocols demonstrating how to use each of these new features to yield highly
contiguous genome assemblies, while still maintaining ntLink's proven
computational efficiency. Further, as we illustrate in the alternate protocols,
the lightweight minimizer-based mappings that enable ntLink scaffolding can
also be utilized for other downstream applications, such as misassembly
detection. With its modularity and multiple modes of execution, ntLink has
broad benefit to the genomics community, from genome scaffolding and beyond.
ntLink is an open-source project and is freely available from
https://github.com/bcgsc/ntLink.
Comment: 23 pages, 2 figures
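The idea of minimizer-based mapping mentioned in the abstract above can be illustrated with a minimal sketch. This is not ntLink's code: the function names, the tiny k and w values, and the use of lexicographic ordering are assumptions made for readability (production tools typically order k-mers by a hash value and handle reverse complements). The sketch picks the smallest k-mer in each window of w consecutive k-mers, then reports minimizers shared between a contig and a read, which is the kind of lightweight evidence that can stand in for a full alignment.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return (position, k-mer) for the smallest k-mer in each window of
    w consecutive k-mers. Ordering is lexicographic (an assumption for
    this sketch); ties are broken by the leftmost position."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked: List[Tuple[int, str]] = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        pos = start + offset
        # Adjacent windows often select the same k-mer; record it once.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked


def shared_minimizers(contig: str, read: str, k: int, w: int) -> List[Tuple[int, int, str]]:
    """Return (contig_pos, read_pos, k-mer) for minimizers found in both
    sequences. If a k-mer is a minimizer at several contig positions,
    only the last position is kept (a simplification)."""
    contig_index = {kmer: pos for pos, kmer in minimizers(contig, k, w)}
    return [
        (contig_index[kmer], pos, kmer)
        for pos, kmer in minimizers(read, k, w)
        if kmer in contig_index
    ]


if __name__ == "__main__":
    # The shared minimizer anchors the read 2 bases into the contig.
    print(shared_minimizers("TTACGG", "ACGTT", k=3, w=3))
```

Because only a sparse subset of k-mers is stored and compared, a mapping of this kind needs far less memory and time than base-level alignment, which is consistent with the efficiency claim made in the abstract.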
DIDA: Distributed Indexing Dispatched Alignment
One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is the sequence alignment problem, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When there are too many queries and/or the targets are too large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for easy access while searching for matches. When the target is static, such as in an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed in large-scale alignments to draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency (BCCA) License, and is free for academic use.
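The partition, index, and dispatch workflow described in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not DIDA's implementation: the function names are hypothetical, the targets are split with a simple greedy longest-first rule, each partition's "index" is a plain k-mer set, and a query is routed to every partition with which it shares at least one k-mer. In a real deployment each partition would be indexed and aligned on its own compute node and the per-partition results merged afterwards.

```python
from typing import Dict, Iterable, List, Set


def partition_targets(targets: Dict[str, str], n_parts: int) -> List[Dict[str, str]]:
    """Greedy longest-first split so that partitions hold a similar
    total sequence length (a hypothetical balancing rule)."""
    parts: List[Dict[str, str]] = [dict() for _ in range(n_parts)]
    loads = [0] * n_parts
    for name, seq in sorted(targets.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        i = loads.index(min(loads))  # least-loaded partition so far
        parts[i][name] = seq
        loads[i] += len(seq)
    return parts


def kmer_set(seqs: Iterable[str], k: int) -> Set[str]:
    """All k-mers of the given sequences; stands in for a real index."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def dispatch(queries: Dict[str, str], parts: List[Dict[str, str]], k: int) -> List[List[str]]:
    """For each partition, list the queries sharing at least one k-mer
    with it. Only those queries need to be aligned on that node."""
    indexes = [kmer_set(p.values(), k) for p in parts]
    routed: List[List[str]] = [[] for _ in parts]
    for qname, qseq in queries.items():
        q_kmers = kmer_set([qseq], k)
        for i, index in enumerate(indexes):
            if q_kmers & index:
                routed[i].append(qname)
    return routed


if __name__ == "__main__":
    targets = {"t1": "AAAAAAAA", "t2": "CCCC", "t3": "GGGG"}
    queries = {"q1": "AAAA", "q2": "TTCCCC", "q3": "TTTT"}
    parts = partition_targets(targets, 2)
    print(dispatch(queries, parts, k=4))
```

Splitting the targets keeps each index small enough for a single node's memory, and dispatching avoids aligning every query against every partition.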
Swarm v3: towards tera-scale amplicon clustering
Motivation: Previously we presented swarm, an open-source amplicon clustering programme that produces fine-scale molecular operational taxonomic units (OTUs) that are free of arbitrary global clustering thresholds. Here, we present swarm v3 to address issues of contemporary datasets that are growing towards terabyte sizes.
Results: When compared with previous swarm versions, swarm v3 has modernized C++ source code, a memory footprint reduced by up to 50%, and optimized CPU usage and multithreading (more than 7 times faster with default parameters), and it has been extensively tested for its robustness and logic.
Conifers Concentrate Large Numbers of NLR Immune Receptor Genes on One Chromosome
Nucleotide-binding domain and leucine-rich repeat (NLR) immune receptor genes form a major line of defense in plants, acting in both pathogen recognition and resistance machinery activation. NLRs are reported to form large gene clusters in limber pine (Pinus flexilis), but it is unknown how widespread this genomic architecture may be among the extant species of conifers (Pinophyta). We used comparative genomic analyses to assess patterns in the abundance, diversity, and genomic distribution of NLR genes. Chromosome-level whole genome assemblies and high-density linkage maps in the Pinaceae, Cupressaceae, Taxaceae, and other gymnosperms were scanned for NLR genes using existing and customized pipelines. The discovered genes were mapped across chromosomes and linkage groups and analyzed phylogenetically for evolutionary history. Conifer genomes are characterized by dense clusters of NLR genes, highly localized on one chromosome. These clusters are rich in TNL-encoding genes, which seem to have formed through multiple tandem duplication events. In contrast to angiosperms and nonconiferous gymnosperms, genomic clustering of NLR genes is ubiquitous in conifers. NLR-dense genomic regions are likely to influence a large part of the plant's resistance, informing our understanding of adaptation to biotic stress and the development of genetic resources through breeding.
A descriptive marker gene approach to single-cell pseudotime inference
Motivation: Pseudotime estimation from single-cell gene expression data allows the recovery of temporal information from otherwise static profiles of individual cells. Conventional pseudotime inference methods emphasize an unsupervised transcriptome-wide approach and use retrospective analysis to evaluate the behaviour of individual genes. However, the resulting trajectories can only be understood in terms of abstract geometric structures and not in terms of interpretable models of gene behaviour.
Results: Here we introduce an orthogonal Bayesian approach termed ‘Ouija’ that learns pseudotimes from a small set of marker genes that might ordinarily be used to retrospectively confirm the accuracy of unsupervised pseudotime algorithms. Crucially, we model these genes in terms of switch-like or transient behaviour along the trajectory, allowing us to understand why the pseudotimes have been inferred and learn informative parameters about the behaviour of each gene. Since each gene is associated with a switch or peak time, the genes are effectively ordered along with the cells, allowing each part of the trajectory to be understood in terms of the behaviour of certain genes. We demonstrate that this small panel of marker genes can recover pseudotimes that are consistent with those obtained using the entire transcriptome. Furthermore, we show that our method can detect differences in the regulation timings between two genes and identify ‘metastable’ states (discrete cell types along the continuous trajectories) that recapitulate known cell types.
Availability and implementation: An open source implementation is available as an R package at http://www.github.com/kieranrcampbell/ouija and as a Python/TensorFlow package at http://www.github.com/kieranrcampbell/ouijaflow.
Supplementary information: Supplementary data are available at Bioinformatics online.
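To make "switch-like" and "transient" gene behaviour concrete, the following is a minimal sketch of two mean-expression curves over pseudotime. The functional forms (a logistic curve for a switch, a Gaussian-shaped bump for a transient peak), the function names, and the parameter names are assumptions chosen for illustration; they are not taken from the Ouija package, and the sketch omits the Bayesian inference entirely. It only shows why a per-gene switch time or peak time gives the genes a natural ordering along the trajectory.

```python
import math
from typing import Dict, List


def switch_mean(t: float, peak_expr: float, strength: float, t0: float) -> float:
    """Logistic curve: expression moves from 0 towards peak_expr around
    pseudotime t0; a negative strength would model a gene switching off."""
    return peak_expr / (1.0 + math.exp(-strength * (t - t0)))


def transient_mean(t: float, peak_expr: float, width: float, p: float) -> float:
    """Gaussian-shaped bump: expression peaks at pseudotime p and decays
    on either side with the given width."""
    return peak_expr * math.exp(-((t - p) ** 2) / (2.0 * width ** 2))


def order_genes(event_times: Dict[str, float]) -> List[str]:
    """Order genes by their switch or peak time along the trajectory."""
    return sorted(event_times, key=lambda gene: event_times[gene])


if __name__ == "__main__":
    # Hypothetical genes and event times, for illustration only.
    print(order_genes({"geneB": 0.7, "geneA": 0.2, "geneC": 0.5}))
    print(switch_mean(0.5, peak_expr=2.0, strength=10.0, t0=0.5))
```

At the switch time the logistic curve sits at half of its maximum, and at the peak time the transient curve reaches its maximum, so each fitted parameter has a direct reading in terms of gene behaviour.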
The genetic landscape of high-risk neuroblastoma
Neuroblastoma is a malignancy of the developing sympathetic nervous system that often presents with widespread metastatic disease, resulting in survival rates of less than 50%. To determine the spectrum of somatic mutation in high-risk neuroblastoma, we studied 240 cases using a combination of whole exome, genome and transcriptome sequencing as part of the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative. Here we report a low median exonic mutation frequency of 0.60 per megabase (0.48 non-silent), and remarkably few recurrently mutated genes in these tumors. Genes with significant somatic mutation frequencies included ALK (9.2% of cases), PTPN11 (2.9%), ATRX (2.5%, an additional 7.1% had focal deletions), MYCN (1.7%, a recurrent p.Pro44Leu alteration), and NRAS (0.83%). Rare, potentially pathogenic germline variants were significantly enriched in ALK, CHEK2, PINK1, and BARD1. The relative paucity of recurrent somatic mutations in neuroblastoma challenges current therapeutic strategies reliant upon frequently altered oncogenic drivers.
- …