
INRIA a CCSD electronic archive server

Hal-Diderot

HAL-Rennes 1

A Database and Evaluation for Classification of RNA Molecules Using Graph Methods

Author: A Rybarczyk
B Shabash
G Chojnowski
J Huang
L Chen
M Antczak
N Shervashidze
P Klosterman
RC Wilson
RPW Duin
SB Needleman
SS Ray
Z Miao
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 13/06/2019

In this paper, we introduce a new graph dataset based on the representation of RNA. The RNA dataset includes 3178 RNA chains which are labelled in 8 classes according to their reported biological functions. The goal of this database is to provide a platform for investigating the classification of RNA using graph-based methods. The molecules are represented by graphs that encode the sequence and base pairs of the RNA, with a number of labelling schemes using base labels and local shape. We report the results of a number of state-of-the-art graph-based methods on this dataset as a baseline comparison and investigate how these methods can be used to categorise RNA molecules by their type and function. The methods applied are the Weisfeiler-Lehman and optimal assignment kernels, the shortest paths kernel, and the all paths and cycle methods. We also compare to the standard Needleman-Wunsch algorithm used in bioinformatics for DNA and RNA comparison, and demonstrate the superiority of graph kernels even on a string representation. The highest classification rate is obtained by the WL-OA algorithm using base labels and base-pair connections.

White Rose Research Online

How and why DNA barcodes underestimate the diversity of microbial eukaryotes

Author: Adam Eyre-Walker
AR Boyko
AZ Worden
AZ Worden
B Charlesworth
B Palenik
DT Jones
F Not
G Piganeau
Gwenael Piganeau
Hervé Moreau
J Coyne
J Crow
JJ Welch
K Romari
M Viprey
ML Cuvelier
Nigel Grimsley
P Flicek
P Lopez-Garcia
PD Keightley
Purification Lopez-Garcia
S Gourbiere
S Jancek
S Proost
SB Needleman
SJ Williamson
SL Baldauf
SY Moon-van der Staay
Z Yang
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/02/2011

Background: Because many picoplanktonic eukaryotic species cannot currently be maintained in culture, direct sequencing of PCR-amplified 18S ribosomal gene DNA fragments from filtered sea-water has been successfully used to investigate the astounding diversity of these organisms. The recognition of many novel planktonic organisms is thus based solely on their 18S rDNA sequence. However, a species delimited by its 18S rDNA sequence might contain many cryptic species, which are highly differentiated in their protein coding sequences. Principal Findings: Here, we investigate the issue of species identification from one gene to the whole genome sequence. Using 52 whole genome DNA sequences, we estimated the global genetic divergence in protein coding genes between organisms from different lineages and compared this to their ribosomal gene sequence divergences. We show that this relationship between proteome divergence and 18S divergence is lineage dependent. Unicellular lineages have especially low 18S divergences relative to their protein sequence divergences, suggesting that 18S ribosomal genes are too conservative to assess planktonic eukaryotic diversity. We provide an explanation for this lineage dependency, which suggests that most species with large effective population sizes will show far less divergence in 18S than in protein coding sequences. Conclusions: There is therefore a trade-off between using genes that are easy to amplify in all species, but which by their nature are highly conserved and underestimate the true number of species, and using genes that give a better description of the number of species, but which are more difficult to amplify. We have shown that this trade-off differs between unicellular and multicellular organisms as a likely consequence of differences in effective population sizes.
We anticipate that the biodiversity of microbial eukaryotic species is underestimated and that numerous "cryptic species" will become discernible with the future acquisition of genomic and metagenomic sequences.

Public Library of Science (PLOS)

Sussex Research Online

Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets

Author: Adam Hughes
Geoffrey Fox
J Ekanayake
J Ekanayake
Judy Qiu
Mina Rho
Qunfeng Dong
S Bae
Saliya Ekanayake
SB Needleman
Seung-Hee Bae
X Qiu
Y Sun
Y Ye
Yang Ruan
Publication venue: BioMed Central
Publication date: 01/03/2012

Background: Modern pyrosequencing techniques make it possible to study complex bacterial populations, such as 16S rRNA, directly from environmental or clinical samples without the need for laboratory purification. Alignment of sequences across the resultant large data sets (100,000+ sequences) is of particular interest for the purpose of identifying potential gene clusters and families, but such analysis represents a daunting computational task. The aim of this work is the development of an efficient pipeline for the clustering of large sequence read sets. Methods: Pairwise alignment techniques are used here to calculate genetic distances between sequence pairs. These methods are pleasingly parallel and have been shown to reflect genetic distances in highly variable regions of rRNA genes more accurately than traditional multiple sequence alignment (MSA) approaches. By utilizing Needleman-Wunsch (NW) pairwise alignment in conjunction with novel implementations of interpolative multidimensional scaling (MDS), we have developed an effective method for visualizing massive biosequence data sets and quickly identifying potential gene clusters. Results: This study demonstrates the use of interpolative MDS to obtain clustering results that are qualitatively similar to those obtained through full MDS, but with substantial cost savings. In particular, the wall clock time required to cluster a set of 100,000 sequences has been reduced from seven hours to less than one hour through the use of interpolative MDS. Conclusions: Although work remains to be done in selecting the optimal training set size for interpolative MDS, substantial computational cost savings will allow us to cluster much larger sequence sets in the future.
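The pairwise-distance step described in the abstract above can be illustrated with a minimal Python sketch of Needleman-Wunsch global alignment. This is not the authors' pipeline: the scoring values (`match=1`, `mismatch=-1`, `gap=-1`), the linear gap penalty, and the distance definition (one minus the fraction of identical columns in one optimal alignment) are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, matches, columns) for one optimal alignment.
    Scoring parameters are illustrative assumptions, not values from the paper.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # prefix of a aligned against gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # prefix of b aligned against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back one optimal path, counting identical columns.
    i, j, matches, columns = n, m, 0, 0
    while i > 0 or j > 0:
        columns += 1
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                matches += a[i - 1] == b[j - 1]
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score[n][m], matches, columns


def nw_distance(a, b):
    """Assumed distance: 1 - (identical columns / alignment length)."""
    _, matches, columns = needleman_wunsch(a, b)
    return 1.0 - matches / columns if columns else 0.0
```

Each pair of sequences is scored independently of every other pair, which is why the all-against-all distance computation parallelizes so easily; the resulting distance matrix is the kind of input an MDS step would consume.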

UNT Digital Library

CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

Author: A Bairoch
Giorgio Valle
KM Chao
M Farrar
M Gribskov
O Gotoh
S Henikoff
SB Needleman
SF Altschul
Svetlin A Manavski
T Rognes
TF Smith
W Liu
W Pearson
W Pearson
WR Pearson
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphics cards to develop high performance solutions for sequence alignment. Results: In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware.
Conclusions: The results show that graphics cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the widely adopted heuristic approaches.
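To make the cost argument in the abstract above concrete, here is a minimal CPU-side Python sketch of the Smith-Waterman recurrence. It is not the CUDA implementation described in the paper: it assumes a linear gap penalty and illustrative scoring values, and the function name is hypothetical. The nested loops visit every cell of a `len(a)` by `len(b)` matrix once, which is the "product of the lengths" cost, and each visit is one "cell update" in the GCUPS sense.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Linear gap penalty; scoring values are illustrative assumptions.
    Keeps only two rows of the matrix, so memory is O(len(b)).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0,                    # start a new local alignment
                          prev[j - 1] + sub,    # extend diagonally
                          prev[j] + gap,        # gap in b
                          curr[j - 1] + gap)    # gap in a
            best = max(best, curr[j])
        prev = curr
    return best
```

The floor at zero is what makes the alignment local rather than global: a poorly matching prefix is discarded instead of dragging the score down.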

Archivio istituzionale della ricerca - Università di Padova

An optimized TOPS+ comparison method for enhanced TOPS models

Author: A Brazma
A Harrison
A Harrison
CA Orengo
CA Orengo
CA Orengo
CJ van Rijsbergen
D Gilbert
D Gilbert
D Westhead
David Gilbert
G Valiente
Gabriel Valiente
GJ Barton
GM Torrance
HM Berman
HM Grindley
I Koch
I Michalopoulos
IN Shindyalov
J Handl
J Viksna
K Mizuguchi
L Holm
LP Chew
M Veeramalai
M Veeramalai
M Veeramalai
Mallika Veeramalai
N Krasnogor
RB Russell
S Goldsmith-Fischman
SB Needleman
SS Krishna
T Madej
T Madej
TF Smith
VI Levenshtein
WR Taylor
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

This article has been made available through the Brunel Open Access Publishing Fund. Background: Although methods based on highly abstract descriptions of protein structures, such as VAST and TOPS, can perform very fast protein structure comparison, the results can lack a high degree of biological significance. Previously we have discussed the basic mechanisms of our novel method for structure comparison based on our TOPS+ model (Topological descriptions of Protein Structures Enhanced with Ligand Information). In this paper we show how these results can be significantly improved using parameter optimization, and we call the resulting optimised TOPS+ method the advanced TOPS+ comparison method, i.e. advTOPS+. Results: We have developed a TOPS+ string model as an improvement to the TOPS [1-3] graph model by considering loops as secondary structure elements (SSEs) in addition to helices and strands, representing ligands as first class objects, and describing interactions between SSEs, and between SSEs and ligands, by incoming and outgoing arcs, annotating SSEs with the interaction direction and type. Benchmarking results of an all-against-all pairwise comparison using a large dataset of 2,620 non-redundant structures from the PDB40 dataset [4] demonstrate the biological significance, in terms of SCOP classification at the superfamily level, of our TOPS+ comparison method. Conclusions: Our advanced TOPS+ comparison shows better performance on the PDB40 dataset [4] compared to our basic TOPS+ method, giving 90 percent accuracy for SCOP alpha+beta, a 6 percent increase in accuracy compared to the TOPS and basic TOPS+ methods. It also outperforms the TOPS, basic TOPS+ and SSAP comparison methods on the Chew-Kedem dataset [5], achieving 98 percent accuracy. Software Availability: The TOPS+ comparison server is available at http://balabio.dcs.gla.ac.uk/mallika/WebTOPS/.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Brunel University Research Archive

FAAST: Flow-space Assisted Alignment Search Tool

Author: Bengt Persson
Björn Andersson
DJ Lipman
Fredrik Lysholm
J Jerlström-Hultqvist
M Droege
M Margulies
MO Dayhoff
O Gotoh
R Kofler
S Balzer
SB Needleman
SF Altschul
SF Altschul
TF Smith
V Vacic
WR Pearson
Z Ning
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: High throughput pyrosequencing (454 sequencing) is the major sequencing platform for producing long read high throughput data. While most other sequencing techniques produce reading errors mainly comparable with substitutions, pyrosequencing produces errors mainly comparable with gaps. These errors are less efficiently detected by most conventional alignment programs and may produce inaccurate alignments. Results: We suggest a novel algorithm for calculating the optimal local alignment which utilises flowpeak information in order to improve alignment accuracy. Flowpeak information can be retained from a 454 sequencing run through interpretation of the binary SFF-file format. This novel algorithm has been implemented in a program named FAAST (Flow-space Assisted Alignment Search Tool). Conclusions: We present and discuss the results of simulations that show that FAAST, through the use of the novel algorithm, can gain several percentage points of accuracy compared to Smith-Waterman-Gotoh alignments, depending on the 454 data quality. Furthermore, through an efficient multi-thread aware implementation, FAAST is able to perform these high quality alignments at high speed. The tool is available at http://www.ifm.liu.se/bioinfo/

Publikationer från Linköpings universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Aston Publications Explorer

Real-time selective sequencing using nanopore technology

Author: AL Greninger
AR Quinlan
AR Quinlan
CLC Ip
H Skutkova
J Quick
J Quick
M Watson
Matthew Loose
Michael Stout
NJ Loman
NJ Loman
PM Ashton
S Goodwin
SB Needleman
Sunir Malla
TF Smith
VI Levenshtein
W Timp
Z Miodonska
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 03/02/2016

The Oxford Nanopore Technologies MinION sequencer enables the selection of specific DNA molecules for sequencing by reversing the driving voltage across individual nanopores. To directly select molecules for sequencing, we used dynamic time warping to match reads to reference sequences. We demonstrate our open-source Read Until software in real-time selective sequencing of regions within small genomes, individual amplicon enrichment and normalization of an amplicon set.
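Dynamic time warping, which the abstract above names as the matching technique, can be sketched in Python as follows. This is the textbook global DTW between two numeric sequences, not the Read Until software itself: how that tool prepares signals or restricts the search is not stated in the abstract, and the function name and absolute-difference cost are assumptions made for illustration.

```python
import math


def dtw_distance(query, reference):
    """Classic dynamic time warping cost between two numeric sequences.

    Uses absolute difference as the local cost (an assumption) and allows
    each sample to be stretched to cover several samples of the other series.
    """
    n, m = len(query), len(reference)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - reference[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch the reference sample
                                 cost[i][j - 1],      # stretch the query sample
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because samples may be repeated along the warping path, two signals that trace the same shape at different speeds get a cost of zero, which suits data whose timing varies from read to read.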

Nottingham ePrints

Nottingham eTheses

Repository@Nottingham

An evolutionary technique to approximate multiple optimal alignments

Author: A Adriansyah
B van Dongen
B Vázquez-Barreiros
D Reißner
D Ruppert
F Mannhardt
F Taymouri
F Taymouri
J Munoz-Gama
M Koorneef
M de Leoni
R Neapolitan
SB Needleman
SJJ Leemans
T Murata
WMP van der Aalst
WMP van der Aalst
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

The alignment of observed and modeled behavior is an essential aid for organizations, since it opens the door for root-cause analysis and enhancement of processes. The state-of-the-art technique for computing alignments has exponential time and space complexity, hindering its applicability for medium and large instances. Moreover, the fact that there may be multiple optimal alignments is perceived as a negative situation, while in reality it may provide a more comprehensive picture of the model’s explanation of observed behavior, from which other techniques may benefit. This paper presents a novel evolutionary technique for approximating multiple optimal alignments. Remarkably, the memory footprint of the proposed technique is bounded, representing an unprecedented guarantee with respect to the state-of-the-art methods for the same task. The technique is implemented into a tool, and experiments on several benchmarks are provided.