A Neural Network Classifier for the COI Barcode Gene

Author: Marathe Saurabh
Publication venue: SJSU ScholarWorks
Publication date: 01/04/2018

Mitochondrial Cytochrome C Oxidase subunit I (CO I, read as "see-oh-one") is a 658 base pair region of the gene that has been proposed as the standard barcode for animals. In other words, CO I is a region of animal DNA that is studied to identify the species of the animal. An existing algorithm, ARBitrator, identifies and extracts CO I sequences from GenBank, a very large gene database. ARBitrator extracts CO I sequences with better specificity and accuracy than other existing algorithms for CO I sequence identification[1][2]. This project trains a neural network to learn the features of the CO I sequences extracted by ARBitrator, so that the network can later be used to recognize further CO I sequences. In effect, we aim to design, train, and use a deep learning neural network that learns to recognize CO I sequences in a supervised way. This is the first time that a neural network has been explored and used for this purpose.

SJSU ScholarWorks

Pairwise alignment incorporating dipeptide covariation

Author: Altschul
Bailey
Bishop
Brenner
Cline
Crooks
Doolittle
Frith
Fukami-Kobayashi
G. E. Crooks
Goldman
Gonnet
Henikoff
Jung
Karplus
Lin
Muller
Murzin
Park
Pearson
R. E. Green
Rodionov
S. E. Brenner
Sander
Smith
Thorne
Topham
Weiss
Zachariah
Publication venue: Oxford University Press (OUP)
Publication date: 28/07/2005

Motivation: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations, and by assessing the ability of this algorithm to detect remote homologies. Results: Our analysis indicates that local correlations between substitutions are not strong on average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
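
The independence assumption described in this abstract can be illustrated with a minimal sketch of a standard dynamic-programming alignment score. This is not the authors' algorithm: it is a plain Needleman-Wunsch global alignment, and the function name and the toy scoring values (match, mismatch, and linear gap penalty) are assumptions chosen for illustration rather than a real substitution matrix. The point to notice is that each aligned pair is scored on its own, without reference to its neighbors.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not a published matrix.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            # Each site is scored independently of its neighbors.
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

Because the substitution term depends only on the two residues being paired, the recurrence needs just the previous row, which is what makes such algorithms fast. Incorporating correlations between neighboring sites, as the abstract describes, would require the score at one position to depend on adjacent positions as well.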

arXiv.org e-Print Archive

Crossref

Proceedings of the 1st Computer Science Student Workshop: Koc University Istinye Campus, Istanbul, Turkey, February 21, 2010

Publication venue: Sabancı University
Publication date: 01/01/2010

Sabanci University Research Database

Improving nanopore read accuracy with the R2C2 method enables the sequencing of highly multiplexed full-length single-cell cDNA.

Author: Byrne Ashley
Cole Charles
Green Richard E
Palmer Theron
Schmitz Robert J
Volden Roger
Vollmers Christopher
Publication venue: eScholarship, University of California
Publication date: 01/09/2018

High-throughput short-read sequencing has revolutionized how transcriptomes are quantified and annotated. However, while Illumina short-read sequencers can be used to analyze entire transcriptomes down to the level of individual splicing events with great accuracy, they fall short of analyzing how these individual events are combined into complete RNA transcript isoforms. Because of this shortfall, long-distance information is required to complement short-read sequencing to analyze transcriptomes on the level of full-length RNA transcript isoforms. While long-read sequencing technology can provide this long-distance information, there are issues with both Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) long-read sequencing technologies that prevent their widespread adoption. Briefly, PacBio sequencers produce low numbers of reads with high accuracy, while ONT sequencers produce higher numbers of reads with lower accuracy. Here, we introduce and validate a long-read ONT-based sequencing method. At the same cost, our Rolling Circle Amplification to Concatemeric Consensus (R2C2) method generates more accurate reads of full-length RNA transcript isoforms than any other available long-read sequencing method. These reads can then be used to generate isoform-level transcriptomes for both genome annotation and differential expression analysis in bulk or single-cell samples

eScholarship - University of California

Computational analysis of proteomes from parasitic nematodes

Author: Wasmuth James D.
Publication venue: The University of Edinburgh
Publication date: 01/01/2006

Edinburgh Research Archive

Interpretable detection of novel human viruses from genome sequencing data

Author: Bartoszewicz Jakub M.
Renard Bernhard Y.
Seidel Anja
Publication venue: Oxford University Press (OUP)
Publication date: 01/02/2021

Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.

Publikationsserver des Robert Koch-Instituts

Factors That Affect Large Subunit Ribosomal DNA Amplicon Sequencing Studies of Fungal Communities: Classification Method, Primer Choice, and Error

Author: A Holst-Jensen
A Rosling
AE Arnold
AS Amend
B Michot
C Lozupone
C Quince
C Stubben
CP Kurtzman
CW Schadt
D Begerow
D Hibbett
D van Tuinen
DH Huson
DL Hawksworth
DL Lindner
DL Taylor
DM Simon
DP Faith
DS Hibbett
E Bellemain
E Lara
E Pruesse
F Lutzoni
FA Matsen
G. Brian Golding
GL Rosen
GM Veldman
H Stockinger
J Kuczynski
J Reeder
J-M Moncalvo
Jason E. Stajich
JE Stajich
JR Bray
JR Cole
JW Fell
JW Spatafora
K Abarenkov
K Munch
K O'Donnell
K-L Liu
KA Seifert
L Tedersoo
LB Koski
LG Nagy
M Krüger
MD Jones
ME Smith
MN Schnare
N Hassouna
O Kårén
O Ovaskainen
P Meinicke
PM Brock
Q Wang
R Kjøller
R Vilgalys
RC Edgar
S John
SA Berger
SA Rehner
SC Goslee
SF Altschul
SG Acinas
SM Huse
T Nagahama
T Urich
TD Bruns
Teresita M. Porter
TJ White
TM Gihring
TM Porter
TY James
TZ DeSantis
V Kunin
W Ludwig
Z Liu
Publication venue: Public Library of Science
Publication date: 27/04/2012

Nuclear large subunit ribosomal DNA is widely used in fungal phylogenetics and, to an increasing extent, in amplicon-based environmental sequencing. The relatively short reads produced by next-generation sequencing, however, make primer choice and sequence error important variables for obtaining accurate taxonomic classifications. In this simulation study we tested the performance of three classification methods: 1) a similarity-based method (BLAST + Metagenomic Analyzer, MEGAN); 2) a composition-based method (Ribosomal Database Project naïve Bayesian classifier, NBC); and 3) a phylogeny-based method (Statistical Assignment Package, SAP). We also tested the effects of sequence length, primer choice, and sequence error on classification accuracy and perceived community composition. Using a leave-one-out cross validation approach, results for classifications to the genus rank were as follows: BLAST + MEGAN had the lowest error rate and was particularly robust to sequence error; SAP accuracy was highest when long LSU query sequences were classified; and NBC ran significantly faster than the other tested methods. All methods performed poorly with the shortest 50–100 bp sequences. Increasing simulated sequence error reduced classification accuracy. Community shifts were detected due to sequence error and primer selection even though there was no change in the underlying community composition. Short read datasets from individual primers, as well as pooled datasets, appear to only approximate the true community composition. We hope this work informs investigators of some of the factors that affect the quality and interpretation of their environmental gene surveys.
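
The leave-one-out cross validation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names are hypothetical, and the toy classifier (assign the label of the training sequence sharing the most k-mers with the query) is an assumption standing in for BLAST + MEGAN, NBC, or SAP.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def nearest_label(train_seqs, train_labels, query, k=3):
    """Toy classifier (assumption): label of the training sequence that
    shares the most k-mers with the query; ties go to the earliest one."""
    q = kmers(query, k)
    best = max(range(len(train_seqs)),
               key=lambda i: len(q & kmers(train_seqs[i], k)))
    return train_labels[best]


def leave_one_out_error(seqs, labels, classify=nearest_label):
    """Hold out each sequence in turn, classify it against the rest,
    and return the fraction of held-out sequences that are misclassified."""
    errors = 0
    for i in range(len(seqs)):
        train_seqs = seqs[:i] + seqs[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        if classify(train_seqs, train_labels, seqs[i]) != labels[i]:
            errors += 1
    return errors / len(seqs)


if __name__ == "__main__":
    seqs = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG"]
    labels = ["x", "x", "y", "y"]
    print(leave_one_out_error(seqs, labels))
```

A label represented by a single sequence can never be recovered once that sequence is held out, which is one reason error rates from this procedure depend on how well each taxon is represented in the reference set.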

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central