92 research outputs found

XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis

Author: Altmann Curtis R
Beckstette Michael
Brivanlou Ali H
Giegerich Robert
Sczyrba Alexander
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large scale sequencing efforts have provided an increasingly important resource for researchers. To provide full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full length protein encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems. DESCRIPTION: Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contig and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full length proteins and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined. CONCLUSION: The results of the analysis have been stored in a publicly available database, XenDB. A unique capability of the database is the ability to batch upload cross species queries to identify potential Xenopus homologues and their associated full length clones. Examples are provided including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches. Supplementary material can be found at

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publications at Bielefeld University

Significant speedup of database searches with HMMs by search space reduction with PSSM family models

Author: Beckstette Michael
Giegerich Robert
Homann Robert
Kurtz Stefan
Publication venue: Oxford University Press
Publication date: 01/01/2009

Motivation: Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive

CiteSeerX

PubMed Central

Publications at Bielefeld University

Structator: fast index-based search for RNA sequence-structure patterns

Author: Backofen Rolf
Beckstette Michael
Kurtz Stefan
Meyer Fernando
Will Sebastian
Publication venue: Springer Science and Business Media LLC
Publication date: 01/12/2010

Background: The secondary structure of RNA molecules is intimately related to their function and often more conserved than the sequence. Hence, the important task of searching databases for RNAs requires matching sequence-structure patterns. Unfortunately, current tools for this task have, in the best case, a running time that is only linear in the size of sequence databases. Furthermore, established index data structures for fast sequence matching, like suffix trees or arrays, cannot benefit from the complementarity constraints introduced by the secondary structure of RNAs. Results: We present a novel method and readily applicable software for time efficient matching of RNA sequence-structure patterns in sequence databases. Our approach is based on affix arrays, a recently introduced index data structure, preprocessed from the target database. Affix arrays support bidirectional pattern search, which is required for efficiently handling the structural constraints of the pattern. Structural patterns like stem-loops can be matched inside out, such that the loop region is matched first and then the pairing bases on the boundaries are matched consecutively. This makes it possible to exploit base pairing information for search space reduction and leads to an expected running time that is sublinear in the size of the sequence database. The incorporation of a new chaining approach in the search of RNA sequence-structure patterns enables the description of molecules folding into complex secondary structures with multiple ordered patterns. The chaining approach removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our method runs up to two orders of magnitude faster than previous methods. Conclusions: The presented method's sublinear expected running time makes it well suited for RNA sequence-structure pattern matching in large sequence databases. RNA molecules containing several stem-loop substructures can be described by multiple sequence-structure patterns and their matches are efficiently handled by a novel chaining method. Beyond our algorithmic contributions, we provide with Structator a complete and robust open-source software solution for index-based search of RNA sequence-structure patterns. The Structator software is available at http://www.zbh.uni-hamburg.de/Structator. Deutsche Forschungsgemeinschaft (grant WI 3628/1-1
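The inside-out idea from the abstract above (match the loop first, then extend over the pairing bases) can be illustrated with a small sketch. This is a naive online scan over a plain string, not the affix-array search that Structator implements; the function name `find_stem_loops`, its parameters, and the choice to allow G-U wobble pairs are assumptions made for illustration only.

```python
# Illustrative only: a naive online scan, NOT the affix-array index used by Structator.
# Allowed base pairs (Watson-Crick plus G-U wobble); this pairing rule is an assumption.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def find_stem_loops(seq, loop, stem_len):
    """Return half-open (start, end) spans of stem-loops in `seq` whose loop
    equals `loop` and whose flanks form `stem_len` consecutive base pairs."""
    hits = []
    start = seq.find(loop)
    while start != -1:
        left, right = start - 1, start + len(loop)
        paired = 0
        # Extend outward from the loop, one base pair at a time.
        while (paired < stem_len and left >= 0 and right < len(seq)
               and (seq[left], seq[right]) in PAIRS):
            left -= 1
            right += 1
            paired += 1
        if paired == stem_len:
            hits.append((left + 1, right))
        # Continue with the next (possibly overlapping) loop occurrence.
        start = seq.find(loop, start + 1)
    return hits


if __name__ == "__main__":
    print(find_stem_loops("AUGGCGAAAGCCAU", "GAAA", 3))
```

Because the loop is matched first, a candidate is abandoned as soon as one flanking pair fails to be complementary, which is the pruning effect the abstract attributes to bidirectional search. An index would replace the linear `str.find` scan used here.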

DSpace@MIT

Crossref

Springer - Publisher Connector

PubMed Central

Publications at Bielefeld University

Impact of process temperature and organic loading rate on cellulolytic/hydrolytic biofilm microbiomes during biomethanation of ryegrass silage revealed by genome-centered metagenomics and metatranscriptomics

Author: Beckstette Michael
Blom Jochen
Derenkó Jaqueline
Henke Christian
Jost Carsten
Klocke Michael
Maus Irena
Pühler Alfred
Rademacher Antje
Rumming Madis
Schlüter Andreas
Sczyrba Alexander
Stolze Yvonne
Wibberg Daniel
Willenbücher Katharina
Publication venue: London : BioMed Central
Publication date: 01/01/2020

Background: Anaerobic digestion (AD) of protein-rich grass silage was performed in experimental two-stage two-phase biogas reactor systems at low vs. increased organic loading rates (OLRs) under mesophilic (37 °C) and thermophilic (55 °C) temperatures. To follow the adaptive response of the biomass-attached cellulolytic/hydrolytic biofilms at increasing ammonium/ammonia contents, genome-centered metagenomics and transcriptional profiling based on metagenome assembled genomes (MAGs) were conducted. Results: In total, 78 bacterial and archaeal MAGs representing the most abundant members of the communities, and featuring defined quality criteria were selected and characterized in detail. Determination of MAG abundances under the tested conditions by mapping of the obtained metagenome sequence reads to the MAGs revealed that MAG abundance profiles were mainly shaped by the temperature but also by the OLR. However, the OLR effect was more pronounced for the mesophilic systems as compared to the thermophilic ones. In contrast, metatranscriptome mapping to MAGs subsequently normalized to MAG abundances showed that under thermophilic conditions, MAGs respond to increased OLRs by shifting their transcriptional activities mainly without adjusting their proliferation rates. This is a clear difference compared to the behavior of the microbiome under mesophilic conditions. Here, the response to increased OLRs involved adjusting of proliferation rates and corresponding transcriptional activities. The analysis led to the identification of MAGs positively responding to increased OLRs. The most outstanding MAGs in this regard, obviously well adapted to higher OLRs and/or associated conditions, were assigned to the order Clostridiales (Acetivibrio sp.) for the mesophilic biofilm and the orders Bacteroidales (Prevotella sp. and an unknown species), Lachnospirales (Herbinix sp. and Kineothrix sp.) and Clostridiales (Clostridium sp.) for the thermophilic biofilm. 
Genome-based metabolic reconstruction and transcriptional profiling revealed that positively responding MAGs mainly are involved in hydrolysis of grass silage, acidogenesis and/or acetogenesis. Conclusions: An integrated-omics approach enabled the identification of new AD biofilm keystone species featuring outstanding performance under stress conditions such as increased OLRs. Genome-based knowledge on the metabolic potential and transcriptional activity of responsive microbiome members will contribute to the development of improved microbiological AD management strategies for biomethanation of renewable biomass. © 2020 The Author(s)

Repositorium für Naturwissenschaften und Technik

Fast index based algorithms and software for matching position specific scoring matrices

Author: Michael Beckstette
Robert Giegerich
Robert Homann
Stefan Kurtz
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. RESULTS: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. 
CONCLUSION: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than $|\mathcal{A}|^m + m - 1$, where $m$ is the length of the PSSM and $\mathcal{A}$ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript
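The abstract above mentions deriving a score threshold for a PSSM from a given p-value by dynamic programming. A minimal sketch of that idea follows. It evaluates the full score distribution eagerly, so it is not the lazy-evaluation method of the paper; integer scores, the function names, and the dictionary-based PSSM layout are assumptions chosen for illustration.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of PSSM match scores under an i.i.d. background.

    pssm: list of dicts (one per position) mapping symbol -> integer score.
    background: dict mapping symbol -> probability.
    Returns a dict mapping total score -> probability.
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        # Convolve the distribution so far with the scores of this column.
        for score, prob in dist.items():
            for symbol, s in column.items():
                nxt[score + s] += prob * background[symbol]
        dist = dict(nxt)
    return dist


def threshold_for_pvalue(pssm, background, p_value):
    """Smallest score t with P(score >= t) <= p_value, or None if no score qualifies."""
    dist = score_distribution(pssm, background)
    tail = 0.0
    best = None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p_value:
            break
        best = score
    return best


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    toy_pssm = [
        {"A": 2, "C": 0, "G": 0, "T": -1},
        {"A": 1, "C": 1, "G": 0, "T": 0},
    ]
    print(threshold_for_pvalue(toy_pssm, uniform, 0.25))
```

Summing the tail from the highest score downward means only the upper part of the distribution determines the threshold, which is the observation that motivates evaluating the dynamic programming matrix lazily rather than in full.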

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Publications at Bielefeld University

Lightweight comparison of RNAs based on exact sequence–structure matches

Author: Michael Beckstette
Rolf Backofen
Sebastian Will
Steffen Heyne
Publication venue: Oxford University Press
Publication date: 01/01/2009

Motivation: Specific functions of ribonucleic acid (RNA) molecules are often associated with different motifs in the RNA structure. The key feature that forms such an RNA motif is the combination of sequence and structure properties. In this article, we introduce a new RNA sequence–structure comparison method which maintains exact matching substructures. Existing common substructures are treated as a whole unit while variability is allowed between such structural motifs

CiteSeerX

Crossref

PubMed Central

Publications at Bielefeld University

Statistical significance of cis-regulatory modules

Author: Andrew D Smith
Dustin E Schones
Michael Q Zhang
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: It is becoming increasingly important for researchers to be able to scan through large genomic regions for transcription factor binding sites or clusters of binding sites forming cis-regulatory modules. Correspondingly, there has been a push to develop algorithms for the rapid detection and assessment of cis-regulatory modules. While various algorithms for this purpose have been introduced, most are not well suited for rapid, genome scale scanning. RESULTS: We introduce methods designed for the detection and statistical evaluation of cis-regulatory modules, modeled as either clusters of individual binding sites or as combinations of sites with constrained organization. In order to determine the statistical significance of module sites, we first need a method to determine the statistical significance of single transcription factor binding site matches. We introduce a straightforward method of estimating the statistical significance of single site matches using a database of known promoters to produce data structures that can be used to estimate p-values for binding site matches. We next introduce a technique to calculate the statistical significance of the arrangement of binding sites within a module using a max-gap model. If the module scanned for has defined organizational parameters, the probability of the module is corrected to account for organizational constraints. The statistical significance of single site matches and the architecture of sites within the module can be combined to provide an overall estimation of statistical significance of cis-regulatory module sites. CONCLUSION: The methods introduced in this paper allow for the detection and statistical evaluation of single transcription factor binding sites and cis-regulatory modules. The features described are implemented in the Search Tool for Occurrences of Regulatory Motifs (STORM) and MODSTORM software
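The max-gap model named in the abstract above treats a module as a run of binding sites in which no two neighbouring sites are further apart than a fixed gap. The grouping step alone can be sketched as follows; this does not compute any statistical significance and is not the STORM or MODSTORM implementation. The function name `max_gap_modules` and the `min_sites` parameter are hypothetical.

```python
def max_gap_modules(positions, max_gap, min_sites=2):
    """Group site positions into candidate modules under a max-gap rule.

    Consecutive sites (after sorting) belong to the same module when they are
    at most `max_gap` apart. Groups with fewer than `min_sites` sites are dropped.
    """
    modules = []
    current = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current group.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                modules.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        modules.append(current)
    return modules


if __name__ == "__main__":
    sites = [10, 25, 40, 200, 210, 500]
    print(max_gap_modules(sites, max_gap=20))
```

In the approach described in the abstract, each candidate group would then be assessed by combining the p-values of its single-site matches with the probability of the observed arrangement; that scoring step is left out here.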

Crossref

Cold Spring Harbor Laboratory Institutional Repository

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Critical Assessment of Metagenome Interpretation: A benchmark of metagenomics software

Author: Aaron E Darling
Adrian Fritz
Alexander Sczyrba
Alexey Gurevich
Alice C McHardy
Andreas Bremges
Bernhard Y Renard
Bertrand Denis
Burton K H Chia
Charles Deltel
Chirag Jain
Christopher Quince
Claire Lemaitre
Daniel A Cuevas
David Koslicki
Dmitrij Turaev
Dominique Lavenier
Dongwan Don Kang
Edward M Rubin
Eik Dahms
Fernando Meyer
Genivaldo Gueiros Z Silva
Guillaume Rizk
Hans-Peter Klenk
Heiner Klingenberg
Hsin-Hung Lin
Ivan Gregor
Jeff L Froula
Jeffrey J Cook
Jessika Fiedler
Johannes Dröge
Julia A Vorholt
Lars Hestbjerg Hansen
Marc Strous
Markus Göker
Matthew Z DeMaere
Michael Beckstette
Michael D Barton
Mihai Pop
Monika Balvočiūtė
Nicole Shapiro
Nikos C Kyrpides
Niranjan Nagarajan
Paul Schulze-Lefert
Peter Belmann
Peter Hofmann
Peter Meinicke
Philip D Blood
Pierre Peterlongo
Rayan Chikhi
Robert A Edwards
Robert Egan
Ruben Garrido-Oter
Stefan Janssen
Stephan Majda
Steven W Singer
Surya Saha
Søren J Sørensen
Tanja Woyke
Thomas Lingner
Thomas Rattei
Tue Sparholt Jørgensen
Vitor C Piro
Yang Bai
Yu-Chieh Liao
Yu-Wei Wu
Zhong Wang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

In metagenome analysis, computational methods for assembly, taxonomic profiling and binning are key components facilitating downstream biological data interpretation. However, a lack of consensus about benchmarking datasets and evaluation metrics complicates proper performance assessment. The Critical Assessment of Metagenome Interpretation (CAMI) challenge has engaged the global developer community to benchmark their programs on datasets of unprecedented complexity and realism. Benchmark metagenomes were generated from newly sequenced ~700 microorganisms and ~600 novel viruses and plasmids, including genomes with varying degrees of relatedness to each other and to publicly available ones and representing common experimental setups. Across all datasets, assembly and genome binning programs performed well for species represented by individual genomes, while performance was substantially affected by the presence of related strains. Taxonomic profiling and binning programs were proficient at high taxonomic ranks, with a notable performance decrease below the family level. Parameter settings substantially impacted performances, underscoring the importance of program reproducibility. While highlighting current challenges in computational metagenomics, the CAMI results provide a roadmap for software selection to answer specific research questions

Roskilde Universitet

HAL Descartes

Warwick Research Archives Portal Repository

MPG.PuRe

Hal-Diderot

Repository for Publications and Research Data

Crossref

National Health Research Institutes

OPUS - University of Technology Sydney

INRIA a CCSD electronic archive server

Copenhagen University Research Information System

eScholarship - University of California

Publications at Bielefeld University

University of East Anglia digital repository

ScholarBank@NUS

HAL-Rennes 1

Index-based algorithms for motif search and their integration in a system for differential genome analysis

Author: Beckstette Michael
Publication venue: Bielefeld University
Publication date: 01/01/2007

Beckstette M. Index-based algorithms for motif search and their integration in a system for differential genome analysis. Bielefeld (Germany): Bielefeld University; 2007.

In this thesis, we present new efficient index-based algorithms for searching with position specific scoring matrices (PSSMs for short), a well known motif model, in large sequence sets, and their integration into an interactive system capable of large-scale differential comparative genome analyses. The newly developed and implemented index-based algorithms for searching with PSSMs clearly outperform existing methods in terms of running time. We also demonstrate how index-based PSSM searching in combination with a fragment chaining approach can be used for efficient protein family classification, and for speeding up computation-intensive database searching with hidden Markov models. With the PoSSuM software distribution, we also provide implementations of the presented algorithms in the form of a flexible command line tool. We further integrated our newly developed algorithm possumsearch as a database search method in our integrated high-throughput sequence analysis system GENLIGHT, which is also a contribution of this work. GENLIGHT offers an interactive, biologist-compatible, and user-friendly environment for a variety of large-scale sequence analysis tasks with a special focus on (differential) comparative genome analyses. It employs a set-oriented operational model that allows generated results to be reused and complete analysis workflows to be performed interactively. The system integrates several widely used sequence analysis methods and databases in a common environment, and is capable of performing analyses on a complete genome or proteome scale by employing a distributed client-server approach, even for non index-based analysis methods. We demonstrate the practical usability of GENLIGHT with different case studies in which the system was used and which led to substantial new scientific findings

Publications at Bielefeld University

Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns

Author: Fernando Meyer
Michael Beckstette
Stefan Kurtz
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Meyer F, Kurtz S, Beckstette M. Fast online and index-based algorithms for approximate search of RNA sequence-structure patterns. BMC Bioinformatics. 2013;14(1): 226.

Background: It is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequence-structure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequence-structure alignment problem. Results: We present new fast index-based and online algorithms for approximate matching of RNA sequence-structure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semi-global alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a user-defined sequence-structure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of non-matching substrings. Our new index-based methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequence-structure patterns, we use fast algorithms for the local or global chaining of approximate sequence-structure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by a factor of up to 45. Our best new index-based algorithm achieves a speedup by a factor of 560. Conclusions: The presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented index-based algorithms, allows for the first time approximate matching of RNA sequence-structure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented open-source software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.uni-hamburg.de/ralignator

Crossref

Springer - Publisher Connector

PubMed Central

Publications at Bielefeld University