BISER: Fast Characterization of Segmental Duplication Structure in Multiple Genome Assemblies
The increasing availability of high-quality genome assemblies has raised interest in the characterization of genomic architecture. Major architectural parts, such as common repeats and segmental duplications (SDs), increase genome plasticity, which stimulates further evolution by changing the genomic structure. However, optimal computation of SDs through standard local alignment algorithms is impractical due to the size of most genomes. A cross-genome evolutionary analysis of SDs is even harder, as one needs to characterize SDs in multiple genomes and find relations between those SDs and unique segments in other genomes. Thus, there is a need for fast and accurate algorithms to characterize SD structure in multiple genome assemblies to better understand the evolutionary forces that shaped the genomes of today.
Here we introduce a new tool, BISER, to quickly detect SDs in multiple genomes and identify elementary SDs and core duplicons that drive the formation of such SDs. BISER improves on earlier tools by (i) scaling the detection of SDs with low homology (75%) to multiple genomes while achieving a further 8-24x speed-up over existing tools, and by (ii) characterizing elementary SDs and detecting core duplicons to help trace the evolutionary history of duplications to as far as 90 million years
CoViT: Real-time phylogenetics for the SARS-CoV-2 pandemic using Vision Transformers
Real-time viral genome detection, taxonomic classification and phylogenetic
analysis are critical for efficient tracking and control of viral pandemics
such as Covid-19. However, the unprecedented and still growing amounts of viral
genome data create a computational bottleneck, which effectively prevents
real-time pandemic tracking. For genomic tracing to work effectively, each new
viral genome sequence must be placed in its pangenomic context. Re-inferring
the full phylogeny of SARS-CoV-2, with datasets containing millions of samples,
is prohibitively slow even using powerful computational resources. We are
attempting to alleviate the computational bottleneck by modifying and applying
Vision Transformer, a recently developed neural network model for image
recognition, to taxonomic classification and placement of viral genomes, such
as SARS-CoV-2. Our solution, CoViT, places SARS-CoV-2 genome accessions onto
the SARS-CoV-2 phylogenetic tree with an accuracy of 94.2%. Since CoViT is a
classification neural network, it provides more than one likely placement.
Specifically, one of the two most likely placements suggested by CoViT is
correct with a probability of 97.9%. The probability that the correct placement
is among the five most likely placements generated by CoViT is 99.8%.
The placement time is 0.055s per individual genome running on an NVIDIA GeForce
RTX 2080 Ti GPU. We make CoViT available to the research community through GitHub:
https://github.com/zuherJahshan/covit.
Comment: 11 pages, 4 figures, 2 table
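The top-1, top-2, and top-5 figures quoted in the CoViT abstract are instances of top-k accuracy: a sample counts as correct when its true class is among the k classes the classifier scores highest. The following is a minimal sketch of that metric only, not CoViT's code; the function name `top_k_accuracy` and the toy probabilities are hypothetical and chosen for illustration.

```python
from typing import Sequence


def top_k_accuracy(
    probs: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scored classes.

    probs[i][c] is the (hypothetical) score a classifier assigns to class c
    for sample i; labels[i] is the true class index of sample i.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    hits = 0
    for row, label in zip(probs, labels):
        # Rank class indices from most to least likely for this sample.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (made up): three samples, three candidate placements each.
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.5, 0.4],
    [0.2, 0.3, 0.5],
]
toy_labels = [0, 2, 0]

top1 = top_k_accuracy(toy_probs, toy_labels, k=1)
top2 = top_k_accuracy(toy_probs, toy_labels, k=2)
top3 = top_k_accuracy(toy_probs, toy_labels, k=3)
```

Top-k accuracy can only stay the same or increase as k grows, which is consistent with the reported progression from 94.2% to 97.9% to 99.8%.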
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance as they should overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. Also, with the help of bottlenecks we have found, developers can
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology.
Comment: To appear in Briefings in Bioinformatics (BIB), 201
Pairwise sequence alignment with block and character edit operations
Pairwise sequence comparison is one of the most fundamental problems in
string processing. The most common metric to quantify the similarity between
sequences S and T is edit distance, d(S,T), which corresponds to the number of
characters that need to be substituted, deleted from, or inserted into S to
generate T. However, fewer edit operations may be sufficient for some string
pairs to transform one string to the other if larger rearrangements are
permitted. Block edit distance accounts for such changes at the substring
(i.e., block) level and "penalizes" entire block removals, insertions, copies,
and reversals with the same cost as single-character edits (Lopresti & Tomkins,
1997). Most studies of block edit distance to date have aimed only to
characterize the distance itself for applications in sequence nearest neighbor
search, without reporting the full alignment details. Although a few tools try
to solve block edit distance for genomic sequences, such as GR-Aligner, they
have limited functionality and are no longer maintained.
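For reference, the classical character-level edit distance d(S,T) described above can be computed with the standard dynamic-programming recurrence. The sketch below covers only single-character substitutions, deletions, and insertions; it does not handle block operations and is not the SABER algorithm. The function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Classical (Levenshtein) edit distance between s and t.

    Counts the minimum number of single-character substitutions,
    deletions, and insertions needed to turn s into t.
    Runs in O(|s| * |t|) time and O(|t|) extra space.
    """
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            substitution = prev[j - 1] + (0 if cs == ct else 1)
            deletion = prev[j] + 1      # drop cs from s
            insertion = curr[j - 1] + 1  # insert ct into s
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


d = edit_distance("kitten", "sitting")
```

Under this character-level metric, moving or reversing a long substring is charged as many individual edits, which is the gap that block edit distance is meant to close.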
Here, we present SABER, an algorithm to solve block edit distance that
supports block deletions, block moves, and block reversals in addition to the
classical single-character edit operations. Our algorithm runs in
O(m^2 · n · l_range) time for |S| = m, |T| = n, and a permitted block size
range of l_range, and can report all breakpoints for the block operations. We also
provide an implementation of SABER currently optimized for genomic sequences
(i.e., generated by the DNA alphabet), although the algorithm can theoretically
be used for any alphabet.
SABER is available at http://github.com/BilkentCompGen/sabe
FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies
A genome read data set can be quickly and efficiently remapped from one
reference to another similar reference (e.g., between two reference versions or
two similar species) using a variety of tools, e.g., the commonly-used CrossMap
tool. With the explosion of available genomic data sets and references,
high-performance remapping tools will be even more important for keeping up
with the computational demands of genome assembly and analysis.
We provide FastRemap, a fast and efficient tool for remapping reads between
genome assemblies. FastRemap provides up to a 7.82x speedup
(6.47x, on average) and uses as little as 61.7% (80.7%, on average) of the
peak memory consumption compared to the state-of-the-art remapping tool,
CrossMap.
FastRemap is written in C++. The source code and user manual are freely
available at: github.com/CMU-SAFARI/FastRemap. Docker image available at:
https://hub.docker.com/r/alkanlab/fast. Also available in Bioconda at:
https://anaconda.org/bioconda/fastremap-bio.
Comment: FastRemap is open source and all scripts needed to replicate the
results in this paper can be found at https://github.com/CMU-SAFARI/FastRema
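To illustrate what remapping a coordinate between two similar references involves, the sketch below looks up a position in a list of aligned blocks and shifts it into the target reference. It is a simplified illustration, not FastRemap's or CrossMap's implementation: the block representation, the function name `remap_position`, and the example coordinates are all assumptions, and the blocks are assumed to be sorted by source start, non-overlapping, and on the same strand.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical aligned block: (src_start, src_end_exclusive, dst_start).
# Every source position in [src_start, src_end_exclusive) maps to
# dst_start + (position - src_start) in the target reference.
Block = Tuple[int, int, int]


def remap_position(blocks: List[Block], pos: int) -> Optional[int]:
    """Map a source coordinate to the target reference, or None if unaligned.

    Assumes blocks are sorted by src_start and do not overlap.
    """
    starts = [block[0] for block in blocks]
    # Index of the last block whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i < 0:
        return None  # pos lies before the first aligned block
    src_start, src_end, dst_start = blocks[i]
    if pos >= src_end:
        return None  # pos falls in a gap between aligned blocks
    return dst_start + (pos - src_start)


# Made-up example: two aligned blocks with an unaligned gap in between.
example_blocks: List[Block] = [(0, 100, 1000), (150, 300, 1100)]

inside_first = remap_position(example_blocks, 50)
in_gap = remap_position(example_blocks, 120)
inside_second = remap_position(example_blocks, 299)
```

A production tool would precompute the start index once instead of rebuilding it on every lookup, and would also have to handle strand flips and reads that span several blocks.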