A Neural Network Classifier for the COI Barcode Gene
Mitochondrial cytochrome c oxidase subunit I (COI, read as "see-oh-one") is a gene, a 658 base pair region of which is proposed as the standard barcode for animals. In other words, this COI region of animal DNA is studied to identify the species of the animal. An existing algorithm, ARBitrator, identifies and extracts COI sequences from GenBank, a very large genetic sequence database. ARBitrator extracts COI sequences with better specificity and accuracy than other existing algorithms for COI sequence identification [1][2]. This project aims to train a neural network to learn the features of the COI sequences extracted by ARBitrator, so that the network can later be used to recognize further COI sequences. In effect, we aim to design, train, and use a deep neural network that learns to recognize COI sequences in a supervised way. This is the first time that a neural network has been explored and used for this purpose.
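As a rough illustration of how a nucleotide sequence might be presented to a neural network, the sketch below one-hot encodes a sequence into a fixed-length matrix. The abstract does not say which encoding the project uses, so the function name `one_hot_encode`, the zero-vector treatment of ambiguous bases, and the padding scheme are assumptions; only the default length of 658 comes from the barcode region described above.

```python
# Hypothetical helper, not taken from the project described above.
BASES = "ACGT"

def one_hot_encode(seq, length=658):
    """Return a length x 4 matrix of 0/1 values for a DNA sequence.

    Assumptions: sequences longer than `length` are truncated, shorter
    ones are padded with all-zero rows, and any character outside ACGT
    (for example N) is also encoded as an all-zero row.
    """
    seq = seq.upper()[:length]
    matrix = [[0, 0, 0, 0] for _ in range(length)]
    for i, base in enumerate(seq):
        j = BASES.find(base)
        if j >= 0:
            matrix[i][j] = 1
    return matrix

# Small usage example with a short length so the result is easy to inspect.
encoded = one_hot_encode("ACGTN", length=6)
```

A matrix like this could serve as the input layer of a supervised classifier, with positive examples drawn from sequences extracted by a tool such as ARBitrator.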
Pairwise alignment incorporating dipeptide covariation
Motivation: Standard algorithms for pairwise protein sequence alignment make
the simplifying assumption that amino acid substitutions at neighboring sites
are uncorrelated. This assumption allows implementation of fast algorithms for
pairwise sequence alignment, but it ignores information that could conceivably
increase the power of remote homolog detection. We examine the validity of this
assumption by constructing extended substitution matrixes that encapsulate the
observed correlations between neighboring sites, by developing an efficient and
rigorous algorithm for pairwise protein sequence alignment that incorporates
these local substitution correlations, and by assessing the ability of this
algorithm to detect remote homologies. Results: Our analysis indicates that
local correlations between substitutions are not strong on the average.
Furthermore, incorporating local substitution correlations into pairwise
alignment did not lead to a statistically significant improvement in remote
homology detection. Therefore, the standard assumption that individual residues
within protein sequences evolve independently of neighboring positions appears
to be an efficient and appropriate approximation.
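The independence assumption discussed in this abstract can be made concrete with a minimal sketch of standard global alignment (Needleman-Wunsch), in which each aligned pair of residues contributes a score that depends only on that one column. The match, mismatch, and gap values below are toy assumptions rather than a real substitution matrix, and the sketch does not implement the authors' dipeptide-aware algorithm, which would additionally score pairs of neighboring columns.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Each column is scored independently of its neighbors, which is the
    simplifying assumption the abstract examines. The scoring values are
    illustrative placeholders, not a published substitution matrix.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]

identical = global_alignment_score("ACDE", "ACDE")
one_deletion = global_alignment_score("ACDE", "ADE")
```

Because the substitution term `sub` looks at a single column, the dynamic program needs only the previous row; a model with correlated neighboring sites has to carry extra state about the preceding column, which is where the additional algorithmic cost arises.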
Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA.
High-throughput short-read sequencing has revolutionized how transcriptomes are quantified and annotated. However, while Illumina short-read sequencers can be used to analyze entire transcriptomes down to the level of individual splicing events with great accuracy, they fall short of analyzing how these individual events are combined into complete RNA transcript isoforms. Because of this shortfall, long-distance information is required to complement short-read sequencing to analyze transcriptomes on the level of full-length RNA transcript isoforms. While long-read sequencing technology can provide this long-distance information, there are issues with both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long-read sequencing technologies that prevent their widespread adoption. Briefly, PacBio sequencers produce low numbers of reads with high accuracy, while ONT sequencers produce higher numbers of reads with lower accuracy. Here, we introduce and validate a long-read ONT-based sequencing method. At the same cost, our Rolling Circle Amplification to Concatemeric Consensus (R2C2) method generates more accurate reads of full-length RNA transcript isoforms than any other available long-read sequencing method. These reads can then be used to generate isoform-level transcriptomes for both genome annotation and differential expression analysis in bulk or single-cell samples.
Interpretable detection of novel human viruses from genome sequencing data
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Peer Reviewed
Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error
Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and to an increasing extent also in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and, 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and, NBC runs significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.