Search CORE

11 research outputs found

Genome Assembly, from Practice to Theory: Safe, Complete and Linear-Time

Author: Cairo Massimo
Rizzi Romeo
Tomescu Alexandru
Zirondelli Elia C.
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 08/11/2020
Field of study

Peer reviewe

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Safe and Complete Contig Assembly Through Omnitigs

Author: Medvedev Paul
Tomescu Alexandru I.
Publication venue
Publication date: 01/06/2017
Field of study

Peer reviewe

Helsingin yliopiston digitaalinen arkisto

Image Representations of DNA allow Classification by Convolutional Neural Networks

Author: Hope Joshua
Publication venue
Publication date: 01/12/2020
Field of study

In metagenomic analyses the rapid and accurate identification of DNA sequences is important. This is confounded by the existence of novel species not contained in databases. There exist many methods to identify sequences, but with the increasing amounts of sequencing data from high-throughput technologies, the use of new deep learning methods are made more viable. In an attempt to address this it was decided to use Convolutional Neural Networks (CNNs) to classify DNA sequences of archaea, which are important in anaerobic digestion. CNNs were trained on two different image representations of DNA sequences, Chaos Game Representation (CGR) and Reshape. Three phyla of archaea and randomly generated sequences were used. These were compared against simpler machine learning models trained on the 4-mer and 7-mer frequencies of the same sequences. It was found that the simpler models performed better than CNNs trained on either image representation, and that Reshape was the poorest representation. However, by shuffling sequences whilst preserving 4-mer count it was found that the Reshape model had learnt 4-mers as an important feature. It was also found that the Reshape model was able to perform equally well without depending on the use of 4-mers, indicating that certain training regimes may uncover novel features. The errors of these models were also random or in weak disagreement, suggesting ensemble methods would be viable and help to identify problematic sequences

An Optimal O(nm) Algorithm for Enumerating All Walks Common to All Closed Edge-covering Walks of a Graph

Author: Acosta Nidia Obscura
Cairo Massimo
Medvedev Paul
Rizzi Romeo
Tomescu Alexandru I.
Publication venue
Publication date: 01/01/2019
Field of study

In this article, we consider the following problem. Given a directed graph G, output all walks of G that are sub-walks of all closed edge-covering walks of G. This problem was first considered by Tomescu and Medvedev (RECOMB 2016), who characterized these walks through the notion of omnitig. Omnitigs were shown to be relevant for the genome assembly problem from bioinformatics, where a genome sequence must be assembled from a set of reads from a sequencing experiment. Tomescu and Medvedev (RECOMB 2016) also proposed an algorithm for listing all maximal omnitigs, by launching an exhaustive visit from every edge. In this article, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Omega(nm). We implement this algorithm arid show that it is 9-12 times faster in practice than the one of Tomescu and Medvedev (RECOMB 2016).Peer reviewe

Helsingin yliopiston digitaalinen arkisto