Belgrade: Institute of Molecular Genetics and Genetic Engineering
Abstract
The use of long-read DNA sequencing technologies is producing an explosion of high-quality
de novo genome assemblies. The availability of these genomes represents a major step
forward for evolution, population genomics, and epidemiology, among other applications. A major
bottleneck for many research groups continues to be the availability of tools to build and
analyze the large datasets of genomes that can be produced using these technologies. In this
talk, I summarize the functionalities developed by my research group in version 4 of
the Next Generation Sequencing Experience Platform (NGSEP) to perform a comprehensive
analysis of long and short DNA sequencing reads. First, we designed new algorithms for
assembly of haploid and diploid samples from long DNA sequencing reads. A minimizer table
is constructed from the reads, using k-mer hash codes calculated from rankings relative to
the mode of the k-mer counts distribution. Statistics collected during this process are used as
features to build layout paths. For diploid samples, we integrated a reimplementation of the
ReFHap algorithm to perform molecular phasing. Benchmark experiments using PacBio HiFi
and Nanopore sequencing data for different species show that our solution has competitive
contiguity and efficiency, as well as superior accuracy in some cases, compared to other
currently used software. We also developed a functionality to perform ortholog identification
and gene-based alignment of assembled genomes. Proteomes for each genome are extracted,
and homology relationships are efficiently predicted by building indexes of amino acid sequences
by k-mer occurrence. Then, genes are clustered into orthogroups based on the topology of the
graph induced by the predicted relationships. Gene presence/absence matrices are derived
from these orthogroups. If genome assemblies are provided as input, synteny relationships
are identified for each pair of genomes. We also implemented algorithms to perform alignment
of short and long reads to a reference genome. Based on aligned long reads, we improved the
classical variant detector to detect long structural variants. With these developments,
NGSEP is a comprehensive tool to perform de novo and reference-based analysis of DNA
sequencing reads in a wide variety of experimental settings to address different research goals.
Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 202
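
The abstract describes clustering genes into orthogroups from the graph of predicted homology relationships and then deriving gene presence/absence matrices. The sketch below illustrates that idea on toy data. It is not the NGSEP implementation: it assumes the simplest possible topology criterion (each connected component of the homology graph is one orthogroup), and the function names `cluster_orthogroups` and `presence_absence_matrix`, the gene identifiers, and the genome labels are all hypothetical.

```python
from collections import defaultdict


def cluster_orthogroups(genes, homology_pairs):
    """Group genes into connected components of the homology graph.

    Assumption: connected components stand in for the topology-based
    clustering mentioned in the abstract.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Walk up to the root, shortening the path along the way.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in homology_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for gene in genes:
        groups[find(gene)].add(gene)
    # Sort members and groups so the output order is deterministic.
    return sorted(sorted(members) for members in groups.values())


def presence_absence_matrix(orthogroups, gene_to_genome, genomes):
    """One row per orthogroup, one column per genome (1 = present)."""
    matrix = []
    for members in orthogroups:
        present = {gene_to_genome[gene] for gene in members}
        matrix.append([1 if genome in present else 0 for genome in genomes])
    return matrix


# Toy input: three genomes (A, B, C) with hypothetical gene identifiers.
gene_to_genome = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}
homology_pairs = [("A1", "B1"), ("B1", "C1"), ("A2", "B2")]

orthogroups = cluster_orthogroups(list(gene_to_genome), homology_pairs)
matrix = presence_absence_matrix(orthogroups, gene_to_genome, ["A", "B", "C"])
```

In this toy case the first orthogroup has a member in every genome, while the second is absent from genome C, which is the kind of signal a presence/absence matrix is meant to expose. Real data would need a stricter criterion than connected components, because a single spurious homology edge merges two unrelated gene families.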