Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification

Author: Câmara Gabriel B. M.
De Melo Barbosa Raquel
Publication venue: 'MDPI AG'
Publication date: 31/07/2022

COVID-19, the illness caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of the Coronaviridae family with a single-stranded positive-sense RNA genome, has been spreading around the world and has been declared a pandemic by the World Health Organization. By 17 January 2022, there were more than 329 million cases, with more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacities for contamination, spread, and mutation worry the authorities, especially after the emergence of the Omicron variant, which has a high transmission capacity and can more easily contaminate even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus (SARS-CoV-2) from the genomic sequence for strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike other approaches in the literature, the proposed approach does not limit the length of the genome sequence. The results show that the novel proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. As a CNN has several changeable parameters, the tests were performed with forty-eight different architectures; the best of these had an accuracy of 91.94 +/- 2.62% in classifying viruses into their realms correctly, in addition to 100% accuracy in classifying SARS-CoV-2 into its respective realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which shows that the proposed architecture may be viable in the classification of the virus that causes COVID-19. Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES); High-Performance Computing Center at UFRN (NPAD/UFRN)

Repositorio Institucional Universidad de Granada

Improved K-mer Based Prediction of Protein-Protein Interactions With Chaos Game Representation, Deep Learning and Reduced Representation Bias

Author: MacLean Dan
Veevers Ruth
Publication date: 23/10/2023

Protein-protein interactions drive many biological processes, including the detection of phytopathogens by plants' R-Proteins and cell surface receptors. Many machine learning studies have attempted to predict protein-protein interactions but performance is highly dependent on training data; models have been shown to accurately predict interactions when the proteins involved are included in the training data, but achieve consistently poorer results when applied to previously unseen proteins. In addition, models that are trained using proteins that take part in multiple interactions can suffer from representation bias, where predictions are driven not by learned biological features but by learning of the structure of the interaction dataset. We present a method for extracting unique pairs from an interaction dataset, generating non-redundant paired data for unbiased machine learning. After applying the method to datasets containing _Arabidopsis thaliana_ and pathogen effector interactions, we developed a convolutional neural network model capable of learning and predicting interactions from Chaos Game Representations of proteins' coding genes.

arXiv.org e-Print Archive

Interpretable detection of novel human viruses from genome sequencing data

Author: Bartoszewicz Jakub M.
Renard Bernhard Y.
Seidel Anja
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/02/2021

Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics. Peer Reviewed

Publikationsserver des Robert Koch-Instituts

DNA Sequence Classification: It’s Easier Than You Think: An open-source k-mer based machine learning tool for fast and accurate classification of a variety of genomic datasets

Author: Solis-Reyes Stephen
Publication venue: Scholarship@Western
Publication date: 09/10/2018

Supervised classification of genomic sequences is a challenging, well-studied problem with a variety of important applications. We propose an open-source, supervised, alignment-free, highly general method for sequence classification that operates on k-mer proportions of DNA sequences. This method was implemented in a fully standalone general-purpose software package called Kameris, publicly available under a permissive open-source license. Compared to competing software, ours provides key advantages in terms of data security and privacy, transparency, and reproducibility. We perform a detailed study of its accuracy and performance on a wide variety of classification tasks, including virus subtyping, taxonomic classification, and human haplogroup assignment. We demonstrate the success of our method on whole mitochondrial, nuclear, plastid, plasmid, and viral genomes, as well as randomly sampled eukaryote genomes and transcriptomes. Further, we perform head-to-head evaluations on the tasks of HIV-1 virus subtyping and bacterial taxonomic classification with a number of competing state-of-the-art software solutions, and show that we match or exceed all other tested software in terms of accuracy and speed.
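
As an illustration of the "k-mer proportions" feature mentioned in this abstract, the following is a minimal sketch in Python. It is not taken from Kameris; the function name `kmer_proportions`, the default `k=3`, and the choice to skip windows containing non-ACGT characters are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def kmer_proportions(seq, k=3, alphabet="ACGT"):
    """Return the proportion of each possible k-mer in seq.

    Hypothetical helper for illustration only. The output is a list in
    lexicographic k-mer order (AAA, AAC, ..., TTT for k=3), so vectors
    from different sequences are directly comparable.
    """
    seq = seq.upper()
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter()
    # Slide a window of length k across the sequence.
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are ignored.
        if all(base in alphabet for base in window):
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_kmers)
    # Normalise counts so the vector sums to 1 regardless of length.
    return [counts[kmer] / total for kmer in all_kmers]


if __name__ == "__main__":
    vector = kmer_proportions("ACGTACGTNACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

Because the counts are normalised, sequences of very different lengths map to vectors of the same size (4^k entries), which any standard supervised classifier can then consume.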

Scholarship@Western

SARS-CoV-2 virus classification based on stacked sparse autoencoder

Author: Coutinho Maria G. F.
De Melo Barbosa Raquel
Publication venue: 'Elsevier BV'
Publication date: 09/12/2022

Since December 2019, the world has been intensely affected by the COVID-19 pandemic, caused by SARS-CoV-2. When a novel virus is identified, the early elucidation of the taxonomic classification and origin of the virus genomic sequence is essential for strategic planning, containment, and treatments. Deep learning techniques have been successfully used in many viral classification problems associated with viral infection diagnosis, metagenomics, phylogenetics, and analysis. Considering that motivation, the authors proposed an efficient viral genome classifier for SARS-CoV-2 using a deep neural network based on the stacked sparse autoencoder (SSAE). For the best performance of the model, we explored the utilization of image representations of the complete genome sequences as the SSAE input to provide a classification of SARS-CoV-2. For that, a dataset based on k-mers image representation was applied. We performed four experiments to provide different levels of taxonomic classification of SARS-CoV-2. The SSAE technique provided great performance results in all experiments, achieving classification accuracy between 92% and 100% for the validation set and between 98.9% and 100% when the SARS-CoV-2 samples were applied for the test set. In this work, samples of SARS-CoV-2 were not used during the training process, only during subsequent tests, in which the model was able to infer the correct classification of the samples in the vast majority of cases. This indicates that our model can be adapted to classify other emerging viruses. Finally, the results indicated the applicability of this deep learning technique in genome classification problems. Coordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES) 00

Repositorio Institucional Universidad de Granada

Annotated Bibliography: Anticipation

Author: Nadin Mihai
Publication date: 01/01/2010

PhilPapers

Beyond Accuracy: Measuring Representation Capacity of Embeddings to Preserve Structural and Contextual Information

Author: Ali Sarwan
Publication date: 20/09/2023

Effective representation of data is crucial in various machine learning tasks, as it captures the underlying structure and context of the data. Embeddings have emerged as a powerful technique for data representation, but evaluating their quality and capacity to preserve structural and contextual information remains a challenge. In this paper, we address this need by proposing a method to measure the _representation capacity_ of embeddings. The motivation behind this work stems from the importance of understanding the strengths and limitations of embeddings, enabling researchers and practitioners to make informed decisions in selecting appropriate embedding models for their specific applications. By combining extrinsic evaluation methods, such as classification and clustering, with t-SNE-based neighborhood analysis, such as neighborhood agreement and trustworthiness, we provide a comprehensive assessment of the representation capacity. Additionally, the use of optimization techniques (Bayesian optimization) for weight optimization (for classification, clustering, neighborhood agreement, and trustworthiness) ensures an objective and data-driven approach in selecting the optimal combination of metrics. The proposed method not only contributes to advancing the field of embedding evaluation but also empowers researchers and practitioners with a quantitative measure to assess the effectiveness of embeddings in capturing structural and contextual information. For the evaluation, we use 3 real-world biological sequence (proteins and nucleotide) datasets and performed representation capacity analysis of 4 embedding methods from the literature, namely Spike2Vec, Spaced k-mers, PWM2Vec, and AutoEncoder. Comment: Accepted at ISBRA 202

arXiv.org e-Print Archive

Image Representations of DNA allow Classification by Convolutional Neural Networks

Author: Hope Joshua
Publication date: 01/12/2020

In metagenomic analyses the rapid and accurate identification of DNA sequences is important. This is confounded by the existence of novel species not contained in databases. There exist many methods to identify sequences, but with the increasing amounts of sequencing data from high-throughput technologies, the use of new deep learning methods has become more viable. In an attempt to address this it was decided to use Convolutional Neural Networks (CNNs) to classify DNA sequences of archaea, which are important in anaerobic digestion. CNNs were trained on two different image representations of DNA sequences, Chaos Game Representation (CGR) and Reshape. Three phyla of archaea and randomly generated sequences were used. These were compared against simpler machine learning models trained on the 4-mer and 7-mer frequencies of the same sequences. It was found that the simpler models performed better than CNNs trained on either image representation, and that Reshape was the poorest representation. However, by shuffling sequences whilst preserving 4-mer count it was found that the Reshape model had learnt 4-mers as an important feature. It was also found that the Reshape model was able to perform equally well without depending on the use of 4-mers, indicating that certain training regimes may uncover novel features. The errors of these models were also random or in weak disagreement, suggesting ensemble methods would be viable and help to identify problematic sequences.
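
For readers unfamiliar with Chaos Game Representation, the sketch below shows one common way to turn a DNA sequence into a CGR point set and a small count image in Python. It is an illustration only, not the thesis's implementation: the corner assignment (A bottom-left, C top-left, G top-right, T bottom-right), the function names `cgr_points` and `cgr_image`, the grid size, and the handling of ambiguous bases are all assumptions.

```python
# Assumed corner layout of the unit square; other conventions exist.
CORNERS = {
    "A": (0.0, 0.0),
    "C": (0.0, 1.0),
    "G": (1.0, 1.0),
    "T": (1.0, 0.0),
}


def cgr_points(seq):
    """Return the CGR trajectory of seq as a list of (x, y) points.

    Starting at the centre of the unit square, each base moves the
    current point halfway towards that base's corner.
    """
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            # Assumption: ambiguous bases (e.g. N) are simply skipped.
            continue
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_image(seq, size=8):
    """Bin CGR points into a size x size grid of counts (row = y, col = x)."""
    grid = [[0] * size for _ in range(size)]
    for x, y in cgr_points(seq):
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_image("ACGTTGCAACGT", size=4):
        print(row)
```

A grid like this can be fed to a CNN as a single-channel image. With a grid side of 2^k, each cell corresponds to one k-mer, which is one way to see why CGR images and k-mer frequency vectors carry closely related information.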

White Rose E-theses Online

DeepCOVID-19: A model for identification of COVID-19 virus sequences with genomic signal processing and deep learning

Author: Abayomi Abdultaofeek
Abolarinwa A.
Adegoke Anthony A.
Adetiba E.
Adetiba Joy N.
Ajayi Oluwaseun T.
Badejo J. A.
Taiwo Tunmike B.
Publication venue: Cogent Engineering
Publication date: 01/01/2022

The spread of Coronavirus Disease-2019 worldwide necessitates the development of accurate identification methods and the determination of genetic relatedness. The result of genomic methods involving nucleotide alignment informed the considerations of several alignment-free techniques for virus detection. This paper presents a genomic sequence identification model, developed based on Genomic Signal Processing (GSP), deep learning, and genomic datasets of Coronavirus 2 (SARS-CoV-2), Severe Acute Respiratory Syndrome CoV (SARS-CoV), and Middle East Respiratory Syndrome CoV (MERS-CoV). Our results showed that the Z-Curve images for the three viral strains depicted high visual similarities in texture and color, thus making it difficult to differentiate the strains by visual inspection. However, the homogeneity distance showed that SARS-CoV-2 is closer to SARS-CoV than MERS-CoV. Following a validation accuracy of 98.33%, it became clear that Z-Curve images for MERS-CoV, SARS-CoV and SARS-CoV-2 have distinct features after transformation by the Convolutional Neural Network (CNN) classifier. The divergence in texture and color reflects genetic variation among the strains, which is too insignificant for differentiation via visual inspection. Our results showed that higher layers of CNN amplify aspects of input images that are critical for discrimination, thereby confirming the importance of deep learning and GSP in accurate viral detection.

Covenant University Repository

Entropy in Image Analysis II

Publication venue: 'MDPI AG'
Publication date: 01/05/2021

Image analysis is a fundamental task for any application where extracting information from images is required. The analysis requires highly sophisticated numerical and analytical methods, particularly for those applications in medicine, security, and other fields where the results of the processing consist of data of vital importance. This fact is evident from all the articles composing the Special Issue "Entropy in Image Analysis II", in which the authors used widely tested methods to verify their results. In the process of reading the present volume, the reader will appreciate the richness of their methods and applications, in particular for medical imaging and image security, and a remarkable cross-fertilization among the proposed research areas.

Directory of Open Access Books (DOAB)