
    A horizontal alignment tool for numerical trend discovery in sequence data: application to protein hydropathy.

    PMC3794901. An algorithm is presented that returns the optimal pairwise gapped alignment of two sets of signed numerical sequence values. One distinguishing feature of this algorithm is a flexible comparison engine (based on both relative shape and absolute similarity measures) that does not rely on explicit gap penalties. Additionally, an empirical probability model is developed to estimate the significance of the returned alignment with respect to randomized data. The algorithm's utility for biological hypothesis formulation is demonstrated with test cases including database search and pairwise alignment of protein hydropathy. However, the algorithm and probability model could be extended to accommodate other types of protein or nucleic acid data, including positional thermodynamic stability and mRNA translation efficiency. The algorithm requires only numerical values as input and will readily compare data other than protein hydropathy. The tool is therefore expected to complement, rather than replace, existing sequence- and structure-based tools and may inform medical discovery, as exemplified by the proposed similarity between a chlamydial ORFan protein and the bacterial colicin pore-forming domain. The source code, documentation, and a basic web-server application are available. JH Libraries Open Access Fun

    Empirically determined probability model for protein hydropathy.

    <p><b>A. </b><b>Inverse Chi-Squared model for the distribution of observed scores.</b> Distributions of <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e004" target="_blank">Equation 4</a> scores for <i>HePCaT</i> alignments of length <i>L</i> = 100 obtained with parameters <i>W</i> = 5 residues, <i>GapMax</i> = 4 residues, <i>C</i> = 0.4. Pairs of random sequences were generated, their Kyte-Doolittle amino acid hydropathies were averaged over a 15-residue window, and the resulting profiles were subjected to optimal alignment using <i>HePCaT</i>, as described in the text. Binned data in each case were fit reasonably well by the Inverse Chi-Squared probability distribution function (PDF, <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e005" target="_blank">Equation 5</a>), as described in Methods and tabulated in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-t001" target="_blank">Table 1</a>. <b>B. </b><b>Analytical parameters to estimate statistical significance.</b> Parameters <i>ν</i> and <i>σ<sup>2</sup></i> for the PDF were observed to vary smoothly as a function of <i>HePCaT</i> alignment length, allowing the parameters, and thus alignment significance, to be analytically estimated for arbitrary alignment length using <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e006" target="_blank">Equations 6</a> and <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e007" target="_blank">7</a> and parameters in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-t002" target="_blank">Table 2</a>. Discrete best-fit parameters for <i>ν</i> and <i>σ<sup>2</sup></i> are given in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-t001" target="_blank">Table 1</a>. 
Equations for the displayed best-fit curves are as follows: y = 0.497609x (Hydropathy, <i>ν</i>), y = 0.160379 − 1.04167 ln(x + 38.9045) (Hydropathy, <i>σ<sup>2</sup></i>).</p>
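
As a rough illustration of the probability model named in this caption, the sketch below evaluates the textbook scaled inverse chi-squared density for given <i>ν</i> and <i>σ<sup>2</sup></i>. Equation 5 is not reproduced in this listing, so the standard form of the density is an assumption here, as is reading x in the quoted curve y = 0.497609x as the alignment length. The function names are hypothetical and are not taken from the HePCaT source code.

```python
import math


def scaled_inv_chi2_pdf(x: float, nu: float, sigma2: float) -> float:
    """Textbook scaled inverse chi-squared density (assumed form of Equation 5).

    f(x) = (sigma2*nu/2)^(nu/2) / Gamma(nu/2) * exp(-nu*sigma2 / (2x)) / x^(1 + nu/2)
    """
    if x <= 0.0:
        return 0.0  # the density is supported on x > 0 only
    half_nu = nu / 2.0
    # Work in log space so large nu does not overflow the gamma function.
    log_pdf = (
        half_nu * math.log(sigma2 * half_nu)
        - math.lgamma(half_nu)
        - half_nu * sigma2 / x
        - (1.0 + half_nu) * math.log(x)
    )
    return math.exp(log_pdf)


def nu_from_length(alignment_length: float) -> float:
    """Quoted hydropathy best-fit curve for nu, assuming x is the alignment length."""
    return 0.497609 * alignment_length


if __name__ == "__main__":
    nu = nu_from_length(100)
    # sigma2 is left as a user-supplied value; 1.0 here is purely illustrative.
    print(nu, scaled_inv_chi2_pdf(1.0, nu, 1.0))
```

The density is computed through its logarithm because <i>ν</i> grows with alignment length under the quoted linear fit, and the gamma function overflows quickly when evaluated directly.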

    Parameters used in Equations 6 and 7 to estimate length-dependent random protein data probability distributions based on the Inverse Chi-Squared Distribution.

    <p>Parameters used in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e006" target="_blank">Equations 6</a> and <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e007" target="_blank">7</a> to estimate length-dependent random protein data probability distributions based on the Inverse Chi-Squared Distribution.</p>

    Observed hydropathy and predicted structure similarity between ORFan <i>C. muridarum TC0624</i> and bacterial colicin pore-forming domain.

    <p><b>A. </b><b>Significant similarity between hydropathy of <i>TC0624</i> and <i>E. coli</i> colicin A (SCOP domain d1cola_).</b> The likelihood of obtaining this match by chance is <i>p</i> = 1.5×10<sup>−5</sup>. Blue cylinders indicate helical secondary structure of TC0624 confidently predicted by PSIPRED; red cylinders indicate the actual helical secondary structure of the d1cola_ domain as assessed by DSSP <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247-Kabsch1" target="_blank">[69]</a>. Numbers indicate the functionally important helical elements, as annotated by Cramer et al. <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247-Cramer1" target="_blank">[65]</a>. Reasonable correspondence between the type and locations of secondary structure elements is observed. Gapped regions of colicin helices are connected with dotted lines to guide the eye. <b>B. </b><b>Tertiary structure location of the hydrophobic similarity (left) and the sequence similarity (right) matches between </b><b><i>TC0624</i></b><b> and colicin.</b> In both molecular cartoons, helices are colored red, strands yellow, and loops green. Locations of a match between <i>TC0624</i> and colicin are colored blue. The left figure is based on d1cola_, colored according to the <i>HePCaT</i> alignment in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-g006" target="_blank">Figure 6A</a>, and the right figure is based on the homolog d1rh1a2 SCOP domain observed in the marginally significant <i>HHPred</i> <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247-Soeding2" target="_blank">[50]</a> hidden Markov model sequence match. Both matches independently link the sequence and hydrophobicity of the ORFan to the functionally important structural core region of colicin. 
The extensive structural, sequence, and chemical similarities suggest the medically important hypothesis that <i>TC0624</i> could also be a pore-forming protein facilitating chlamydial survival.</p>

    Pairwise sequence alignment does not detect significant similarity between human A2a and Taste Receptor Type 2, Member 19, yet a similar structure can be modeled based on the <i>HePCaT</i> match.

    <p><b>A. </b><b><i>FASTA</i> pairwise sequence alignment between human adenosine receptor A2a and its known homolog human adenosine receptor A2b.</b> Alignment was extracted from a sequence search of the human proteome. Sequence similarity is 59% over 330 amino acids, with a highly significant E-value of 6.6e-53. Note that the hydropathy similarity between these two proteins is also significant, as given in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-g004" target="_blank">Figure 4</a>. <b>B. </b><b><i>FASTA</i></b><b> pairwise sequence alignment between human A2a and human taste receptor type 2, member 19.</b> Sequence similarity is 21% over 305 amino acids. Although extensive, the similarity is not significant, with an E-value of 5.1e+3, in contrast to the significant hydropathy similarity displayed in <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-g004" target="_blank">Figure 4</a>. This result suggests that hydropathy similarity, as assessed by <i>HePCaT</i>, may be able to detect remote relationships in the absence of sequence similarity. <b>C. </b><b>Model of Taste Receptor Type 2, Member 19 is similar to the experimental structure of A2a.</b> Experimental structure of A2a (left panel) is based on PDB identifier 3rey. I-TASSER <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247-Roy1" target="_blank">[45]</a> model of Taste Receptor Type 2, Member 19 (right panel) achieved an I-TASSER C-score of 0.67 and a DALI Z-Score <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247-Holm3" target="_blank">[46]</a> of 24.9 against the 3rey structure, indicating a confident model that is significantly similar to A2a. 
Rainbow-colored helices follow the colors of <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi-1003247-g004" target="_blank">Figure 4</a>, indicating the seven structurally aligned transmembrane-spanning helices. The RMSD of the 269 DALI-aligned residues is 3.1 Å between modeled and experimental structures.</p>

    Overview of the Horizontal Protein Comparison Tool (<i>HePCaT</i>) algorithm.

    <p>The hydropathy profiles of two hypothetical proteins, each of length <i>M</i> = <i>N</i> = 20 residues, are shown (Step 1). Intraprotein signed distances are computed within each protein according to <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e001" target="_blank">Equation 1</a> in the main text (Step 2). Positive distances, <i>e.g.</i> measured from a residue with a local minimum value to a residue with a local maximum value, are indicated in red, negative distances in blue. The signed distance matrices are therefore square and symmetrically reflected across the diagonal. Distances for protein 1 and protein 2 correspond to matrices <b><i>D<sub>1</sub></i></b> and <b><i>D<sub>2</sub></i></b>, respectively. The similarity matrix <b><i>S</i></b> that ultimately compares the two proteins is constructed from the average absolute distance differences of <i>W</i> = 5 residue blocks between <b><i>D<sub>1</sub></i></b> and <b><i>D<sub>2</sub></i></b>, according to <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e002" target="_blank">Equation 2</a> (Step 3). In <b><i>S</i></b>, light colored squares indicate blocks of <i>W</i> = 5 residues starting at residue <i>i</i> in protein 1 and residue <i>j</i> in protein 2 with similarly shaped hydropathy; dark squares indicate dissimilar shapes. (<b><i>S</i></b><i><sub>i = 1,j = 1</sub></i> is the lower left corner in the figure.) As described in the text, <b><i>S</i></b> is exhaustively searched and all longest alignments with up to <i>GapMax</i> gaps, whose squares (average path distance, <i>APD</i>) pass a user-defined average similarity cutoff <i>C</i>, are kept in a list (set of colored arrows). The alignment in this list with the closest absolute shape (lowest <i>RMSD</i>) is defined as the optimal match (Step 5). 
An Optimal Path Score (<i>OPS</i>), defined by <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1003247#pcbi.1003247.e004" target="_blank">Equation 4</a>, is assigned to the alignment and its significance is computed with respect to the score distribution of random alignments of identical length (Step 6). Note that the example alignment, while a reasonable visual match, is only marginally significant with respect to random alignments of identical length, due to its short length of 10 residues.</p>
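
Steps 1 to 3 of this overview can be sketched in Python as follows. Equations 1 and 2 are not reproduced in this listing, so the forms used here are assumptions: the signed distance is taken as D[i][j] = h[j] − h[i], and S[i][j] as the mean absolute difference between the W×W blocks of D1 and D2 that start at residues i and j. The function names are hypothetical, the hydropathy values are the standard Kyte-Doolittle scale, and none of this is taken from the HePCaT source code. The path search, gap handling, and scoring of the later steps are not covered.

```python
from typing import List, Sequence

# Standard Kyte-Doolittle hydropathy scale.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_average(seq: str, window: int = 15) -> List[float]:
    """Step 1 (assumed): moving average of hydropathy over full windows only."""
    values = [KD_SCALE[aa] for aa in seq.upper()]
    return [
        sum(values[k:k + window]) / window
        for k in range(len(values) - window + 1)
    ]


def signed_distance_matrix(profile: Sequence[float]) -> List[List[float]]:
    """Step 2 (assumed form of Equation 1): D[i][j] = profile[j] - profile[i]."""
    return [[b - a for b in profile] for a in profile]


def similarity_matrix(
    p1: Sequence[float], p2: Sequence[float], w: int = 5
) -> List[List[float]]:
    """Step 3 (assumed form of Equation 2): mean |D1 - D2| over w-by-w blocks.

    S[i][j] compares the block of p1 starting at i with the block of p2
    starting at j; smaller values mean more similarly shaped profiles.
    """
    d1 = signed_distance_matrix(p1)
    d2 = signed_distance_matrix(p2)
    s = []
    for i in range(len(p1) - w + 1):
        row = []
        for j in range(len(p2) - w + 1):
            total = 0.0
            for a in range(w):
                for b in range(w):
                    total += abs(d1[i + a][i + b] - d2[j + a][j + b])
            row.append(total / (w * w))
        s.append(row)
    return s
```

Because the comparison uses differences within each protein, adding a constant offset to one profile leaves the assumed S unchanged, which is one way to read the "relative shape" comparison; the absolute-shape criterion (lowest <i>RMSD</i>) is applied separately in the later steps.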

    Goodness of fit statistics between Scaled Inverse Chi Squared probability distribution function (Equation 5) and <i>OPS</i> score distributions of various length optimal <i>HePCaT</i> alignments of random amino acid sequences.

    a<p>Blank rows for certain alignment lengths indicate that the null hypothesis (<i>i.e.</i> that the distribution of <i>OPS</i> scores for randomly generated sequences was drawn from an underlying inverse chi-squared distribution) was rejected at the <i>p</i> < 0.05 level.</p>