11 research outputs found

    Image Representations of DNA allow Classification by Convolutional Neural Networks

    In metagenomic analyses the rapid and accurate identification of DNA sequences is important. This is confounded by the existence of novel species not contained in databases. There exist many methods to identify sequences, but with the increasing amounts of sequencing data from high-throughput technologies, the use of new deep learning methods has become more viable. To address this, Convolutional Neural Networks (CNNs) were used to classify DNA sequences of archaea, which are important in anaerobic digestion. CNNs were trained on two different image representations of DNA sequences, Chaos Game Representation (CGR) and Reshape. Three phyla of archaea and randomly generated sequences were used. These were compared against simpler machine learning models trained on the 4-mer and 7-mer frequencies of the same sequences. It was found that the simpler models performed better than CNNs trained on either image representation, and that Reshape was the poorest representation. However, by shuffling sequences whilst preserving 4-mer counts it was found that the Reshape model had learnt 4-mers as an important feature. It was also found that the Reshape model was able to perform equally well without depending on the use of 4-mers, indicating that certain training regimes may uncover novel features. The errors of these models were also random or in weak disagreement, suggesting ensemble methods would be viable and help to identify problematic sequences.
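    The two kinds of input features named in the abstract above, k-mer frequencies and the Chaos Game Representation, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the corner assignment for the four bases, and the choice to skip non-ACGT characters are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalised k-mer frequencies over ACGT, in lexicographic k-mer order."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[km] for km in kmers)
    return [counts[km] / total if total else 0.0 for km in kmers]


# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos-game points for seq, starting from the square's centre."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # assumption: ambiguous bases such as N are skipped
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the base's corner.
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points
```

    Binning the points from `cgr_points` into a 2D histogram would give an image-like array of the sort a CNN could take as input, while `kmer_frequencies` gives the flat vector that a simpler model could use.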

    An Optimal O(nm) Algorithm for Enumerating All Walks Common to All Closed Edge-covering Walks of a Graph

    In this article, we consider the following problem. Given a directed graph G, output all walks of G that are sub-walks of all closed edge-covering walks of G. This problem was first considered by Tomescu and Medvedev (RECOMB 2016), who characterized these walks through the notion of omnitig. Omnitigs were shown to be relevant for the genome assembly problem from bioinformatics, where a genome sequence must be assembled from a set of reads from a sequencing experiment. Tomescu and Medvedev (RECOMB 2016) also proposed an algorithm for listing all maximal omnitigs, by launching an exhaustive visit from every edge. In this article, we prove new insights about the structure of omnitigs and solve several open questions about them. We combine these to achieve an O(nm)-time algorithm for outputting all the maximal omnitigs of a graph (with n nodes and m edges). This is also optimal, as we show families of graphs whose total omnitig length is Omega(nm). We implement this algorithm and show that it is 9-12 times faster in practice than the one of Tomescu and Medvedev (RECOMB 2016). Peer reviewed