59 research outputs found

Fast algorithms for computing sequence distances by exhaustive substring composition

Author: A Apostolico
A Kolmogorov
A Lempel
Alberto Apostolico
B Blaisdell
B Hao
H Otu
I Ulitsky
J Na
J Qi
JV Helden
L Brillouin
LL Gatlin
M Höhl
M Li
Olgert Denas
P Ferragina
R Edgar
R von Mises
S Vinga
TJ Wu
TM Cover
Publication venue: BioMed Central
Publication date: 01/10/2008

The increasing throughput of sequencing raises growing needs for methods of sequence analysis and comparison on a genomic scale, notably, in connection with phylogenetic tree reconstruction. Such needs are hardly fulfilled by the more traditional measures of sequence similarity and distance, like string edit and gene rearrangement, due to a mixture of epistemological and computational problems. Alternative measures, based on the subword composition of sequences, have emerged in recent years and proved to be both fast and effective in a variety of tested cases. The common denominator of such measures is an underlying information theoretic notion of relative compressibility. Their viability depends critically on computational cost. The present paper describes as a paradigm the extension and efficient implementation of one of the methods in this class. The method is based on the comparison of the frequencies of all subwords in the two input sequences, where frequencies are suitably adjusted to take into account the statistical background
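
As a rough illustration of a subword-composition distance, the sketch below compares two sequences by the relative frequencies of their overlapping k-mers. It is a simplification under stated assumptions, not the paper's method: it uses a single fixed k rather than all subwords, applies no statistical background correction, and the function names `kmer_frequencies` and `composition_distance` are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Return the relative frequency of every overlapping k-mer in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {kmer: n / total for kmer, n in counts.items()}


def composition_distance(a, b, k=3):
    """Euclidean distance between the k-mer frequency profiles of a and b.

    Assumption: a fixed k and raw frequencies; the abstract describes
    background-adjusted frequencies over all subword lengths.
    """
    fa = kmer_frequencies(a, k)
    fb = kmer_frequencies(b, k)
    keys = set(fa) | set(fb)
    return sqrt(sum((fa.get(x, 0.0) - fb.get(x, 0.0)) ** 2 for x in keys))
```

Identical sequences give a distance of zero, and sequences with no k-mer in common give the largest values, which is the behaviour an alignment-free measure of this kind relies on.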

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs

Author: Altschul
Arun S. Konagurthu
Arunachalam
Bandyopadhyay
Bansal
Calabrese
Dehal
Dice
Edgar
Flicek
Fukuhara
Geoffrey I. Webb
Gordân
Haas
Hachiya
James C. Whisstock
Jiangning Song
Jun
Khalid Mahmood
Koohy
Koonin
Kriventseva
Kuhn
Kärkkäinen
Li
Mahmood
Needleman
Papadimitriou
Pearson
Pruess
Remm
Sakarya
Sankoff
Santini
Sjolander
Smith
Sonnhammer
Sorensen
Swidan
Vandepoele
Vinga
Vingron
Widmann
Woolfe
Xu
Yu
Zhi
Publication venue: Oxford University Press
Publication date: 01/01/2012

Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set to identify putative homologs. Second, the availability of complex genomes containing large gene families with a prevalence of complex evolutionary events, such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy where, at each iteration, the best gene assignments are identified, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2
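
The abstract does not spell out how afree works internally, so the following is only a generic sort-then-join sketch of the idea of comparing all sequence pairs through shared k-mers. The function name `shared_kmer_counts`, the choice of k, and the use of distinct k-mers per sequence are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations, groupby


def shared_kmer_counts(seqs, k=3):
    """Count distinct k-mers shared by each pair of sequences.

    seqs maps a sequence id to its residue string. This is a hypothetical
    illustration of a sort-join over k-mers, not the afree implementation.
    """
    # One (k-mer, id) record per distinct k-mer in each sequence, sorted so
    # that records carrying the same k-mer become adjacent.
    records = sorted(
        {(s[i:i + k], sid) for sid, s in seqs.items()
         for i in range(len(s) - k + 1)}
    )
    shared = Counter()
    # Join step: every group of adjacent records with the same k-mer
    # contributes one shared k-mer to each pair of ids in the group.
    for _, group in groupby(records, key=lambda rec: rec[0]):
        ids = [sid for _, sid in group]
        for a, b in combinations(ids, 2):
            shared[(a, b)] += 1
    return shared
```

Pairs that never co-occur in a group are never enumerated, which is why a join over sorted k-mer records can avoid scoring every pair of sequences explicitly.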

Crossref

PubMed Central

Monash University Research Portal

University of Melbourne Institutional Repository

Automated smoother for the numerical decoupling of dynamics models

Author: A Benveniste
A Ramos
AJ Bell
Ana Tereza R Vasconcelos
BW Silverman
Carlos CH Borges
D Erdogmus
Eberhard O Voit
EO Voit
ET Whittaker
Helena Santos
I Santamaria
IC Chou
J Principe
JM Santos
Jonas S Almeida
JS Almeida
K Hornik
L Sardo
MA Savageau
Marco Vilela
PH Eilers
S Kikuchi
S Kullback
Susana Vinga
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Structure identification of dynamic models for complex biological systems is the cornerstone of their reverse engineering. Biochemical Systems Theory (BST) offers a particularly convenient solution because its parameters are kinetic-order coefficients which directly identify the topology of the underlying network of processes. We have previously proposed a numerical decoupling procedure that allows the identification of multivariate dynamic models of complex biological processes. While described here within the context of BST, this procedure has a general applicability to signal extraction. Our original implementation relied on artificial neural networks (ANN), which caused slight, undesirable bias during the smoothing of the time courses. As an alternative, we propose here an adaptation of Whittaker's smoother and demonstrate its role within a robust, fully automated structure identification procedure.
Results: In this report we propose a robust, fully automated solution for signal extraction from time series, which is the prerequisite for the efficient reverse engineering of biological systems models. Whittaker's smoother is reformulated within the context of information theory and extended by the development of adaptive signal segmentation to account for heterogeneous noise structures. The resulting procedure can be used on arbitrary time series with a nonstationary noise process; it is illustrated here with metabolic profiles obtained from in vivo NMR experiments. The smoothed solution, which is free of parametric bias, permits differentiation, which is crucial for the numerical decoupling of systems of differential equations.
Conclusion: The method is applicable to signal extraction from time series with nonstationary noise structure and can be applied in the numerical decoupling of systems of differential equations into algebraic equations, and thus constitutes a rather general tool for the reverse engineering of mechanistic model descriptions from multivariate experimental time series.
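
For orientation, the classical Whittaker smoother chooses the smoothed series z that minimises the sum of squared residuals plus lam times the sum of squared second differences of z, which reduces to solving (I + lam * D^T D) z = y. The sketch below implements only that textbook form with a fixed, user-chosen `lam`; the information-theoretic reformulation and adaptive segmentation described in the abstract are not reproduced, and the function name `whittaker_smooth` is hypothetical.

```python
def whittaker_smooth(y, lam=10.0):
    """Classical Whittaker smoother with a second-difference penalty.

    Solves (I + lam * D^T D) z = y, where D is the second-difference
    operator. Dense O(n^3) solve for clarity; assumption: short series.
    """
    n = len(y)
    # Start from the identity matrix.
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Add lam * D^T D; each row of D has coefficients (1, -2, 1).
    coef = (1.0, -2.0, 1.0)
    for r in range(n - 2):
        idx = (r, r + 1, r + 2)
        for i, ci in zip(idx, coef):
            for j, cj in zip(idx, coef):
                A[i][j] += lam * ci * cj
    z = [float(v) for v in y]
    # Forward elimination; A is symmetric positive definite, so the
    # diagonal pivots are safe without row exchanges.
    for c in range(n):
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            if f != 0.0:
                for j in range(c, n):
                    A[r][j] -= f * A[c][j]
                z[r] -= f * z[c]
    # Back substitution.
    for r in range(n - 1, -1, -1):
        s = sum(A[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (z[r] - s) / A[r][r]
    return z
```

Larger values of `lam` give a smoother curve, and a straight line passes through unchanged because its second differences are zero. For long time series a banded or sparse solver would be the more usual choice than the dense elimination shown here.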

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Automatic structure classification of small proteins using random forest

Author: A Andreeva
AG Murzin
AV Levitin
C Hadley
CHQ Ding
E Ie
G Zhanga
H Shen
HM Berman
I Chung
I Melvin
IH Witten
J Cheng
J Wu
JE Gewehr
JF Gibrat
Jonathan D Hirst
JR Quinlan
K Chen
KC Chou
L Breiman
L Holm
L Kurgan
M Gerstein
MB Swindells
MTA Shamim
O Çamoğlu
P Baldi
P Han
P Jain
P Klein
Pooja Jain
S Kim
S Mile
S Vinga
SE Brenner
SE Hamby
SF Altschul
SP Kanaan
SS Krishna
U Hobohm
V Sam
W Kabsch
X Chen
XM Zhao
Y Cai
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs.
Results: Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to different structural levels decreases.
Conclusions: The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as the introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB which are yet to be assigned to the SCOP classification hierarchy.
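
Since the results are reported as a Matthews correlation coefficient, a minimal sketch of the binary form of that statistic may help when reading the 0.61 to 0.83 range. The abstract concerns multi-level SCOP classification and does not say how MCC was aggregated across classes, so the binary formula and the function name `mcc_binary` are assumptions for illustration only.

```python
from math import sqrt


def mcc_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns a value in [-1, 1]; by the usual convention 0.0 is returned
    when any marginal total is zero and the ratio is undefined.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

A value of 1 corresponds to perfect agreement, 0 to a prediction no better than chance, and -1 to complete disagreement, so unlike raw accuracy the statistic stays informative when classes are unbalanced.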

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Organizational Heterogeneity of Vertebrate Genomes

Author: A Nekrutenko
A Paz
A Porceddu
A Woolfe
Abraham Korol
AE Vinogradov
AV Smith
B Deng
B-Y Liao
BS Weir
C Chapus
C Dufraigne
C McLean
C Melodelima
C Nusbaum
C Schmegner
CM Malcom
CP Ponting
D Sellis
E Bingham
E Buschiazzo
E Lieberman-Aiden
EN Trifonov
ET Dermitzakis
F Larsen
G Bejerano
G Bernardi
G Rosen
GE Sims
GL Rosen
H Caron
H Wu
J van Helden
I Dunham
J Grimwood
J Healy
J Jurka
JR Chubb
K Jabbari
K Sivaraman
K Yamada
KJ Meaburn
L Chen
L Duret
L Eory
L Mariño-Ramírez
LW Hillier
M Costantini
M Csurös
M Gardiner-Garden
M Hattori
M Höhl
M Sémon
M Touchon
MC Zody
MI Jensen-Seaman
MJ Lercher
P Carpena
R Nussinov
R Versteeg
RK Azad
S De
S Karlin
S Katzman
S Pietrokovski
S Vinga
SB Hedges
SJ Bell
Svetlana Frenkel
T Abe
T Cremer
T Ryba
V Kirzhner
Valery Kirzhner
Vincent Laudet
W Li
WJ Kent
Publication venue: Public Library of Science
Publication date: 01/01/2012

Genomes of higher eukaryotes are mosaics of segments with various structural, functional, and evolutionary properties. The availability of whole-genome sequences allows the investigation of their structure as “texts” using different statistical and computational methods. One such method, referred to as Compositional Spectra (CS) analysis, is based on scoring the occurrences of fixed-length oligonucleotides (k-mers) in the target DNA sequence. CS analysis allows generating species- or region-specific characteristics of the genome, regardless of their length and the presence of coding DNA. In this study, we consider the heterogeneity of vertebrate genomes as a joint effect of regional variation in sequence organization superimposed on the differences in nucleotide composition. We estimated compositional and organizational heterogeneity of genome and chromosome sequences separately and found that both heterogeneity types vary widely among genomes as well as among chromosomes in all investigated taxonomic groups. The high correspondence of heterogeneity scores obtained on three genome fractions, coding, repetitive, and the remaining part of the noncoding DNA (the genome dark matter - GDM) allows the assumption that CS-heterogeneity may have functional relevance to genome regulation. Of special interest for such interpretation is the fact that natural GDM sequences display the highest deviation from the corresponding reshuffled sequences

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

BacHBerry: BACterial Hosts for production of Bioactive phenolics from bERRY fruits

Author: A Chaovanalikit
A Fernandes
A Frelet-Barrand
A Hartmann
A Julien-Laferrière
A Koesnandar
A Kårlund
A Ruiz
A. Filipa Almeida
AA Song
Adelaide Braga
Alberto Marchetti-Spaccamela
Alexandre Foito
Alexey Dudnik
Alice Julien-Laferriere
Ana Rita Silva
Ana Rute Neves
Ana Solopova
Ana Vila-Santa
András Hartmann
André Veríssimo
AP Burgard
AP Oliveira
AR Neves
AR Zomorrodi
Armando Fernandes
Arnaud Mary
Artem Sorokin
B Bergdahl
B Catalgol
B Miladinović
Barbara Avila
BE Logan
Björn Hamberger
BL Halvorsen
BVV Ratnam
C Fredes
C Vance
C Yu
Camillo Meinhart
Carolina Jardim
Caroline Rousseau
Cathie Martin
CB Jendresen
Celine Chanforan
Chengyong Feng
CK Blomstedt
Claudia Nunes dos Santos
CNS Santos
CW Shults
D Ghosh
D Machado
D Segrè
D Vazquez-Albacete
D-K Ro
Dario Breitel
David Méndez Sevillano
Delphine Parrot
Derek Stewart
DF Tardiff
Diane Barbay
DM Linares
DS Pontes
E Carvalho
E Goldberg
E Morgera
E Simeonidis
ERS Kunji
F Rusnak
Finn T. Okkels
FJ Dumont
G Garcia
G McDougall
GA Manganaris
GF Pauli
GG Moulton
GM Cragg
Gonçalo Garcia
GS Waldo
H Dvora
H Li
H Zhang
H-W Heldt
Harald Heider
HJ Kim
I Aranaz
I Figueira
I Hernández
Inês Costa
Isabel Rocha
J Becker
J Beekwilder
J Caldas
J Luo
J Marhuenda
J Marienhagen
J Milivojević
J Vera
J Wang
J Zheng
JA Jones
JA Kritzer
JA Stavang
Jan Marienhagen
Jean-Etienne Bassard
JK Hellström
JM Landete
JO Krömer
Joana Godinho-Pereira
Joana Oliveira
Jochen Forster
JR Lenihan
K Bodvard
K Duarte
K Goszcz
K Herrmann
K Patil
K Prince
KM Kasiotis
KR Määttä
KR Watts
L Camont
L Miller-Fleming
L Pei
Laurent Bulteau
Leen Stougie
Lei Pei
Liangsheng Wang
Lijin Wang
LJ Su
Louise Shepherd
M Cox
M Funakoshi-Tago
M Hujanen
M Josuttis
M Kortmann
M Kula
M Pátek
M Vagiri
M-G Pan
Mahdi Doostmohammadi
Marcel Ottens
Marcelo Henriques da Silva
Marie-France Sagot
Markus Schmidt
Martin Trick
MB Pedersen
MG Weller
Michael Bott
Michael Naesby
Michael Vogt
MJ MacDonald
ML Falcone Ferreyra
MM Giusti
Morten H. H. Nørholm
Mounir Benkoulouche
MU Rani
N Bhan
N Jain
N Kallscheuer
N Palmieri
N Tepper
Nicola Love
Nicolai Kallscheuer
NJ Stanford
Nuno Faria
O Choi
O Paredes-López
Olga Tikhonova
Olivier Simon
OP Kuipers
Oscar P. Kuipers
P Bharadwaj
P Borrill
P Duwat
P Gaspar
P Jeandet
P Maher
P Xu
Patricia Ferreira
Paula Gaspar
Philippe Vain
Pilar Bañados
PV Milreu
PV van Summeren-Wesenhagen
R Andrade
R Dobson
R Menezes
R Stracke
R Törrönen
R Zadernowski
RA Moyer
Rafael S. Costa
Regina Menezes
Rex Brennan
Ricardo Andrade
Rita Rosado-Ramos
RJ O’Brien
RM Zelle
Roberto Ferro
RP Pandey
RS Costa
S Bak
S Ju
S Krobitsch
S Maeda
S Quideau
S Renaud
Sabine Freitag
Sandra Youssef
SE Rasmussen
SG Stahlhut
SH Häkkinen
Shang Su
Shanshan Li
SK Jash
SP Mazur
Steen Gustav Stahlhut
Susana Vinga
T Jojima
T Tohge
T Vogt
TAK Prescott
Tatiana Shelenga
TF Outeiro
TJ Kwiatkowski
U Hartmann
Vera Thole
Vincent Mazurek
W Cao
W Weckwerth
Wei Wei
Wolfgang Kerbe
X Chen
X Li
X Shen
X Yang
X-H Shen
Y Shinfuku
Y Suh
Y Wang
Y-B Liu
Z Tayarani-Najaran
Z Wen
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

BACterial Hosts for production of Bioactive phenolics from bERRY fruits (BacHBerry) was a 3-year project funded by the Seventh Framework Programme (FP7) of the European Union that ran between November 2013 and October 2016. The overall aim of the project was to establish a sustainable and economically-feasible strategy for the production of novel high-value phenolic compounds isolated from berry fruits using bacterial platforms. The project aimed at covering all stages of the discovery and pre-commercialization process, including berry collection, screening and characterization of their bioactive components, identification and functional characterization of the corresponding biosynthetic pathways, and construction of Gram-positive bacterial cell factories producing phenolic compounds. Further activities included optimization of polyphenol extraction methods from bacterial cultures, scale-up of production by fermentation up to pilot scale, as well as societal and economic analyses of the processes. This review article summarizes some of the key findings obtained throughout the duration of the project

Universidade do Minho: RepositoriUM

VU Research Portal

University of Strathclyde Institutional Repository

University of Groningen

HAL Descartes

Edinburgh Research Explorer

Juelich Shared Electronic Resources

of Botany, Chinese Academy of Sciences

Online Research Database In Technology

Hal-Diderot

Crossref

Heriot Watt Pure