arXiv.org e-Print Archive

Systematic identification of gene families for use as markers for phylogenetic and phylogeny-driven ecological studies of bacteria and archaea and their major subgroups

Author: Eisen Jonathan A.
Jospin Guillaume
Wu Dongying
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 02/07/2013

With genomic and metagenomic sequence data sets accumulating at an astonishing rate, there are many reasons to constrain the data analyses. One approach to such constrained analyses is to focus on select subsets of gene families that are particularly well suited for the tasks at hand. Such gene families have generally been referred to as marker genes. We are particularly interested in identifying and using such marker genes for phylogenetic and phylogeny-driven ecological studies of microbes and their communities. We therefore refer to these as PhyEco (for phylogenetic and phylogenetic ecology) markers. The dual use of these PhyEco markers means that we needed to develop and apply a set of somewhat novel criteria for identifying the best candidates for such markers. The criteria we focused on included universality across the taxa of interest, the ability to produce robust phylogenetic trees that reflect as much as possible the evolution of the species from which the genes come, and low variation in copy number across taxa. We describe here an automated protocol for identifying potential PhyEco markers from a set of complete genome sequences. The protocol combines rapid searching, clustering and phylogenetic tree building algorithms to generate protein families that meet the criteria listed above. We report here the identification of PhyEco markers for different taxonomic levels, including 40 for all bacteria and archaea, 114 for all bacteria, and many more for some of the individual phyla of bacteria. This new list of PhyEco markers should allow much more detailed automated phylogenetic and phylogenetic ecology analyses of these groups than was previously possible.

Comment: 24 pages, 3 figures
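Two of the selection criteria named in the abstract above, universality across the taxa of interest and low variation in copy number, can be illustrated with a small filter. This is a minimal sketch under stated assumptions, not the authors' protocol: the function name, the thresholds (`min_universality`, `max_mean_copies`) and the family and genome labels are all hypothetical, and the tree-based criterion is not modelled.

```python
from statistics import mean


def select_marker_candidates(copy_counts, min_universality=0.9, max_mean_copies=1.1):
    """Return families that are near-universal and close to single-copy.

    copy_counts maps a family name to {genome name: copy number}.
    Both thresholds are illustrative assumptions, not published values.
    """
    # The set of genomes is taken to be every genome mentioned by any family.
    genomes = sorted({g for counts in copy_counts.values() for g in counts})
    selected = []
    for family, counts in copy_counts.items():
        per_genome = [counts.get(g, 0) for g in genomes]
        present = [c for c in per_genome if c > 0]
        if not present:
            continue  # family found in no genome
        universality = len(present) / len(genomes)
        if universality < min_universality:
            continue  # missing from too many genomes
        if mean(present) > max_mean_copies:
            continue  # copy number too high where the family is present
        selected.append(family)
    return sorted(selected)


# Hypothetical toy input: three families across four genomes.
example_counts = {
    "famA": {"g1": 1, "g2": 1, "g3": 1, "g4": 1},  # universal, single copy
    "famB": {"g1": 2, "g2": 2, "g3": 1, "g4": 2},  # universal, multi-copy
    "famC": {"g1": 1},                             # present in one genome only
}
candidates = select_marker_candidates(example_counts)
```

With this toy input only `famA` passes both filters; a real pipeline would also need the phylogenetic-tree check that the abstract describes.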

FigShare

Potential conservation of circadian clock proteins in the phylum Nematoda as revealed by bioinformatic searches

Author: Andrés Romanowski
Matías Javier Garavaglia
María Eugenia Goya
Pablo Daniel Ghiringhelli
Diego Andrés Golombek
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Although several circadian rhythms have been described in C. elegans, its molecular clock remains elusive. In this work we employed a novel bioinformatic approach, applying probabilistic methodologies, to search for circadian clock proteins of several of the best-studied circadian model organisms of different taxa (Mus musculus, Drosophila melanogaster, Neurospora crassa, Arabidopsis thaliana and Synechococcus elongatus) in the proteomes of C. elegans and other members of the phylum Nematoda. With this approach we found that the Nematoda contain proteins most related to the core and accessory proteins of the insect and mammalian clocks, a result that provides new insights into the nematode clock and the evolution of the circadian system.

Affiliation: Romanowski, Andrés. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Investigaciones Bioquímicas de Buenos Aires. Fundación Instituto Leloir. Instituto de Investigaciones Bioquímicas de Buenos Aires; Argentina. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Cronobiología; Argentina
Affiliation: Garavaglia, Matías Javier. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ing.genética y Biolog.molecular y Celular. Area Virus de Insectos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Goya, María Eugenia. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Cronobiología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Ghiringhelli, Pablo Daniel. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Ing.genética y Biolog.molecular y Celular. Area Virus de Insectos; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Golombek, Diego Andres. Universidad Nacional de Quilmes. Departamento de Ciencia y Tecnología. Laboratorio de Cronobiología; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina

CiteSeerX

CONICET Digital

arXiv.org e-Print Archive

FigShare

MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

Author: Ma Jianzhu
Wang Sheng
Wang Zhiyong
Xu Jinbo
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

Sequence-based protein homology detection has been extensively studied, and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from a multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model), and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method, MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring the similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm for aligning two MRFs. Compared to HMMs, which can only model very short-range residue correlation, MRFs can model long-range residue interaction patterns and thus encode information about the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection should be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM- or PSSM-based methods in terms of both alignment accuracy and remote homology detection, and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/.

Comment: Accepted by both RECOMB 2014 and PLOS Computational Biology
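The abstract above contrasts MRFs with the profile representations they are meant to improve on. As background only, the following minimal sketch shows how a PSSM can be derived from an MSA as per-column log-odds scores. It does not implement MRFalign or any published scoring scheme: the uniform background, the pseudocount and the function names are simplifying assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(msa, pseudocount=1.0):
    """Build a log-odds PSSM (one dict per alignment column) from an MSA.

    Assumes a uniform background distribution and a flat pseudocount;
    gap characters and unknown residues are ignored when counting.
    """
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(row[col] for row in msa if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_sequence(pssm, seq):
    """Sum the per-column scores of an ungapped sequence of matching length."""
    if len(seq) != len(pssm):
        raise ValueError("sequence length must equal the number of PSSM columns")
    return sum(column[aa] for column, aa in zip(pssm, seq))


# Hypothetical three-sequence alignment.
toy_msa = ["ACD", "ACE", "ACD"]
toy_pssm = build_pssm(toy_msa)
```

Each column here is scored independently of every other column, which is exactly the limitation the abstract points to: correlations between distant positions are not represented.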

Structural evolution drives diversification of the large LRR-RLK gene family

Author: Arendsee Z
Dufayard J‐F
Krishnakumar V
Lanfear R
Liang Z
Monnahan PJ
R Core Team
Song W
Wilke CO
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2020

Cells are continuously exposed to chemical signals that they must discriminate between and respond to appropriately. In embryophytes, the leucine‐rich repeat receptor‐like kinases (LRR‐RLKs) are signal receptors critical in development and defense. LRR‐RLKs have diversified to hundreds of genes in many plant genomes. Although the family has been intensively studied, a well‐resolved LRR‐RLK gene tree has remained elusive. To resolve the LRR‐RLK gene tree, we developed an improved gene discovery method based on iterative hidden Markov model searching and phylogenetic inference. We used this method to infer complete gene trees for each of the LRR‐RLK subclades and reconstructed the deepest nodes of the full gene family. We discovered that the LRR‐RLK gene family is even larger than previously thought, and that protein domain gains and losses are prevalent. These structural modifications, some of which likely predate embryophyte diversification, led to misclassification of some LRR‐RLK variants as members of other gene families. Our work corrects this misclassification. Our results reveal ongoing structural evolution generating novel LRR‐RLK genes. These new genes are raw material for the diversification of signaling in development and defense. Our methods also enable phylogenetic reconstruction in any large gene family.

ScholarWorks@UMass Amherst

Alignment of helical membrane protein sequences using AlignMe

Author: Forrest Lucy R.
Khafizov Kamil
Stamm Marcus
Staritzbichler René
Publication venue
Publication date: 01/01/2013

Few sequence alignment methods have been designed specifically for integral membrane proteins, even though these important proteins have distinct evolutionary and structural properties that might affect their alignments. Existing approaches typically consider membrane-related information either by using membrane-specific substitution matrices or by assigning distinct penalties for gap creation in transmembrane and non-transmembrane regions. Here, we ask whether favoring matching of predicted transmembrane segments within a standard dynamic programming algorithm can improve the accuracy of pairwise membrane protein sequence alignments. We tested various strategies using a specifically designed program called AlignMe. An updated set of homologous membrane protein structures, called HOMEP2, was used as a reference for optimizing the gap penalties. The best of the membrane-protein-optimized approaches were then tested on an independent reference set of membrane protein sequence alignments from the BAliBASE collection. When secondary structure (S) matching was combined with evolutionary information (using a position-specific substitution matrix (P)), in an approach we called AlignMePS, the resultant pairwise alignments were typically among the most accurate over a broad range of sequence similarities when compared to available methods. Matching transmembrane predictions (T), in addition to evolutionary information and secondary-structure predictions, in an approach called AlignMePST, generally reduces the accuracy of the alignments of closely related proteins in the BAliBASE set relative to AlignMePS, but may be useful in cases of extremely distantly related proteins for which sequence information is less informative. The open source AlignMe code is available at https://sourceforge.net/projects/alignme/, and at http://www.forrestlab.org, along with an online server and the HOMEP2 data set.
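The idea of favoring matches between predicted transmembrane segments inside a standard dynamic programming alignment can be sketched as a Needleman-Wunsch variant with an added bonus term. This is an illustration only, not AlignMe's scoring function: the identity-based match and mismatch scores, the linear gap penalty, the `tm_bonus` value and the `M`/`.` annotation strings are all assumptions made for the example.

```python
def align_with_tm_bonus(seq_a, seq_b, tm_a, tm_b,
                        match=1.0, mismatch=-1.0, gap=-2.0, tm_bonus=1.0):
    """Global alignment (Needleman-Wunsch, linear gap penalty).

    tm_a and tm_b annotate each residue: 'M' for predicted transmembrane,
    '.' otherwise. Pairing two 'M' positions earns an extra tm_bonus.
    Returns (score, aligned_a, aligned_b).
    """
    if len(seq_a) != len(tm_a) or len(seq_b) != len(tm_b):
        raise ValueError("each annotation must match its sequence length")
    n, m = len(seq_a), len(seq_b)

    def pair_score(i, j):
        s = match if seq_a[i] == seq_b[j] else mismatch
        if tm_a[i] == "M" and tm_b[j] == "M":
            s += tm_bonus
        return s

    # Fill the dynamic programming matrix.
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i - 1, j - 1),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i - 1, j - 1):
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# Hypothetical toy input: the two central residues are predicted transmembrane.
tm_score, tm_aln_a, tm_aln_b = align_with_tm_bonus("ACGT", "ACGT", ".MM.", ".MM.")
```

Setting `tm_bonus=0.0` reduces the sketch to a plain global alignment, which makes it easy to compare the two behaviors on the same input.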