Robust Algorithms for Detecting Hidden Structure in Biological Data
Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects its underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot of the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins' function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.
Clustering is an unsupervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuron cell.
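One common way to avoid relying on a single partitioning is to cluster the data many times (for example, on resampled or perturbed data) and record how often each pair of objects lands in the same cluster. The sketch below illustrates that general idea only; it is not taken from the thesis, and the function name `co_clustering_frequency` and the toy partitions are hypothetical.

```python
from itertools import combinations


def co_clustering_frequency(partitions):
    """Return, for each pair of objects, the fraction of partitions
    in which the two objects share a cluster label.

    `partitions` is a list of label lists, one label per object.
    Label values are arbitrary: only equality within a partition matters.
    """
    n_objects = len(partitions[0])
    counts = {pair: 0 for pair in combinations(range(n_objects), 2)}
    for labels in partitions:
        if len(labels) != n_objects:
            raise ValueError("every partition must label the same objects")
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: count / len(partitions) for pair, count in counts.items()}


# Hypothetical example: four objects clustered four times.
partitions = [
    [0, 0, 1, 1],
    [1, 1, 0, 0],  # same grouping as above, with labels swapped
    [0, 0, 0, 1],
    [0, 1, 1, 1],
]
freq = co_clustering_frequency(partitions)
```

Pairs with a frequency near 1 are grouped consistently across runs, while pairs near 0.5 co-cluster about as often as not and may reflect chance association rather than structure.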
In the second part of this thesis I describe an approach for identifying hidden
specificity determining positions (SDPs) from alignments of protein sequences descended
through gene duplication from a common ancestor (paralogs) and apply the
approach to identify numerous putative SDPs in bacterial transcription
factors in the LacI family. Finally, I describe and demonstrate a new
algorithm for reconstructing the history of duplications by which paralogs
descended from their common ancestor. This algorithm addresses the
complexity of such reconstruction due to indeterminate or erroneous
homology assignments made by sequence alignment algorithms and to the
vast prevalence of divergence through speciation over divergence through
gene duplication in protein evolution.
CaMKII binds both substrates and activators at the active site [preprint]
Ca2+/calmodulin dependent protein kinase II (CaMKII) is a signaling protein that is required for long-term memory formation. Ca2+/CaM activates CaMKII by binding to its regulatory segment, thereby freeing the substrate binding site. Despite having a large variety of interaction partners, the specificity of CaMKII interactions has not been structurally well-characterized. One exceptional feature of this kinase is that interaction with specific binding partners persistently activates CaMKII. To address the molecular details of this, we solved X-ray crystal structures of the CaMKII kinase domain bound to four different binding partners that modulate CaMKII activity in different ways. We show that all four partners bind in the same manner across the substrate binding site. We generated a sequence alignment based on our structural observations, which revealed conserved interactions. Using biochemistry and molecular dynamics simulations, we propose a mechanistic model in which persistent CaMKII activity is facilitated by high affinity binding partners, which compete with the regulatory segment to allow substrate phosphorylation.
Sloutsky and Naegle, High-resolution SDP Identification
Python module implementing group-conservation-weighted specificity determining position (SDP) scoring.
Specificity group definitions, reference sequences, and sub-sampled alignment sets for the AscG, CcpA, FruR, LacI, PurR, and TreR specificity groups of the LacI bacterial transcription factor family.
High-Resolution Identification of Specificity Determining Positions in the LacI Protein Family Using Ensembles of Sub-Sampled Alignments.
Since the advent of large-scale genomic sequencing, and the consequent availability of large numbers of homologous protein sequences, there has been burgeoning development of methods for extracting functional information from multiple sequence alignments (MSAs). One type of analysis seeks to identify specificity determining positions (SDPs) based on the assumption that such positions are highly conserved within groups of sequences sharing functional specificity, but conserved to different amino acids in different specificity groups. This unsupervised approach to utilizing evolutionary information may elucidate mechanisms of specificity in protein-protein interactions, catalytic activity of enzymes, sensitivity to allosteric regulation, and other types of protein functionality. We present an analysis of SDPs in the LacI family of transcriptional regulators in which we 1) relax the constraint that all specificity groups must contribute to SDP signal, and 2) use a novel approach to robust treatment of sequence alignment uncertainty based on sub-sampling. We find that the vast majority of SDP signal occurs at positions with a conservation pattern that significantly complicates detection by previously described methods. This pattern, which we term "partial SDP", consists of the commonly accepted SDP conservation pattern among a subset of specificity groups and strong degeneracy among the rest. An upshot of this fact is that the SDP complement of every specificity group appears to be unique. Additionally, sub-sampling gives us the ability to assign a confidence interval to the SDP score, as well as increase fidelity, as compared to analysis of a single, comprehensive alignment, which is the current standard in multiple sequence alignment methodologies.
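To make the confidence-interval idea concrete, the sketch below summarizes the scores that one position receives across an ensemble of sub-sampled alignments. It assumes the interval is the middle 95% of the ensemble scores and that the ensemble score is their mean; the function name `summarize_ensemble` and all numbers are hypothetical, and the published module may compute these quantities differently.

```python
from statistics import mean, quantiles


def summarize_ensemble(scores):
    """Summarize one position's SDP scores across sub-sampled alignments.

    Returns the ensemble mean and the bounds of the middle 95% of scores
    (2.5th and 97.5th percentiles, linearly interpolated).
    """
    if len(scores) < 2:
        raise ValueError("need at least two ensemble scores")
    # n=40 yields cut points every 2.5%; the first and last bound the middle 95%.
    cuts = quantiles(scores, n=40, method="inclusive")
    return {"mean": mean(scores), "low": cuts[0], "high": cuts[-1]}


# Hypothetical ensemble: 101 evenly spaced scores between 0 and 1.
ensemble_scores = [i / 100 for i in range(101)]
summary = summarize_ensemble(ensemble_scores)

# A hypothetical score from a single comprehensive alignment can then be
# checked against the interval instead of being taken at face value.
comprehensive_score = 0.99
outside_interval = not (summary["low"] <= comprehensive_score <= summary["high"])
```

A single-alignment score that falls outside the interval suggests that the position is sensitive to alignment variability.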
Simulation Materials for ASPEN
This supplement contains the phylogenies and the simulated family sequences used as a test set for phylogenetic reconstruction.
SDP score distributions vs comprehensive alignment scores.
Score distributions and the comprehensive alignment scores for specificity groups with a score falling in the top 1% are plotted for positions 29, 81, and 110. Score distributions are shown as box plots, with medians indicated by white lines and means indicated by yellow dots. Boxes cover the middle two quartiles of the score distributions, while whiskers cover the middle 95%. Comprehensive alignment scores are shown as red dots. These can fall below (position 29), above (position 81), or within (position 110) the middle two quartiles of the ensemble distributions. Some ortholog sets (IdnR, RbsR-A, ScrR-A at position 29, LacI at position 81) can be substantially more sensitive to alignment variability than other ortholog sets at the same position. This fact is reflected in their ensemble score (the distribution average, shown as a yellow dot), but not in the comprehensive alignment score.
Structural evidence of partial SDP at LacI position 101.
Interactions of TreR (A), FruR (B), CcpA (C), LacI (D), and AscG (E) positions corresponding to LacI position 101, according to the structural alignment. The side chain at the position homologous to LacI 101 is shown in light blue. Side chains at neighboring positions are shown in salmon, if those positions are SDPs, and in gray otherwise. Amino acid composition of the ortholog set is represented by sequence logo. Packing interaction of TreR F102 with F127 and hydrogen bonding interaction of FruR D101 with R149 are highly specific. CcpA Q101 and LacI R101 do not form specific interactions, although CcpA Q101 does participate in a single hydrogen bond. Glutamic acid and asparagine, capable of making the same interaction, also occur among CcpA orthologs. LacI R101 is exposed to solvent, and several other polar amino acids occur at the position. AscG H101 participates in two different interactions. (E), top: hydrogen bonding with cis-monomer backbone (gray) and coordinated water molecule (red dot). (E), bottom: hydrogen bond network with cis-monomeric N68, trans-monomeric E88 (light violet backbone), and another coordinated water.
Projection into conservation-agreement space.
In every panel, the color gradient represents strength of SDP signal, as quantified by average group-wise conservation minus average between-group agreement. Dark red (bottom right quadrant) represents maximal SDP signal. (A) Projections of hypothetical alignment columns for illustration: Column II has maximal SDP signal, while columns I and III have low signal. (B,C,D,E) Projections of LacI reference sequence positions with group-wise conservation and between-group agreement computed either (B,D) over every specificity group or (C,E) over conserved groups only, where group conservation is >0.6. (D) Points corresponding to LacI positions are colored in grayscale corresponding to the red color gradient of (B). (E) Points are positioned according to their SDP signal calculated over conserved groups only, but using the grayscale of (D) to illustrate the shift that individual sequence positions undergo as a result of the altered scoring scheme of (C).
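The signal described in this caption (average group-wise conservation minus average between-group agreement) can be sketched for a single alignment column as follows. The caption does not define conservation or agreement, so both measures here are assumptions: conservation is taken as the frequency of a group's most common residue, and agreement as the chance that one residue drawn from each of two groups matches. The function names and example column are hypothetical and are not the API of the published Python module; only the 0.6 cutoff comes from the caption.

```python
from collections import Counter
from itertools import combinations


def conservation(residues):
    """Assumed measure: frequency of the most common residue in one group."""
    return Counter(residues).most_common(1)[0][1] / len(residues)


def agreement(residues_a, residues_b):
    """Assumed measure: chance that one residue drawn from each group matches."""
    freq_a, freq_b = Counter(residues_a), Counter(residues_b)
    matches = sum(freq_a[aa] * freq_b[aa] for aa in freq_a)
    return matches / (len(residues_a) * len(residues_b))


def sdp_signal(column_by_group, min_conservation=None):
    """Mean group conservation minus mean pairwise between-group agreement.

    If `min_conservation` is given, only groups whose conservation exceeds
    it contribute (the "conserved groups only" variant).
    """
    groups = list(column_by_group.values())
    if min_conservation is not None:
        groups = [g for g in groups if conservation(g) > min_conservation]
    if len(groups) < 2:
        return 0.0  # no between-group comparison is possible
    mean_conservation = sum(conservation(g) for g in groups) / len(groups)
    pairs = list(combinations(groups, 2))
    mean_agreement = sum(agreement(a, b) for a, b in pairs) / len(pairs)
    return mean_conservation - mean_agreement


# Hypothetical column: two conserved groups and one degenerate group.
column = {"group1": "FFFF", "group2": "DDDD", "group3": "QENK"}
signal_all = sdp_signal(column)
signal_conserved = sdp_signal(column, min_conservation=0.6)
```

In this toy column the degenerate third group lowers the signal computed over all groups, while restricting the calculation to conserved groups recovers the full signal from the other two, which is the kind of shift panels (C) and (E) illustrate.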
Amino acid composition at heterogeneously conserved positions.
Histograms at left show group conservation distributions at position 88 over the MSA ensemble for each family member. The dotted line indicates the threshold for the “conserved” designation, separating high conservation in blue from low conservation in gray. Amino acid content of each of the 20 ortholog sets is represented by sequence logos for three positions that demonstrate heterogeneous conservation. Rows correspond to LacI family members. Sequence logos for ortholog sets with average group conservation above the conservation cutoff are outlined in maroon.
Group-specific SDP signal undetected by SDPPred or Speer.
Marker size and color correspond to the group-specific score according to the color bar in Fig 3 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0162579#pone.0162579.g003).