
    N-gram analysis of 970 microbial organisms reveals presence of biological language models

    Abstract

    Background: It has been suggested previously that genome and proteome sequences show characteristics typical of natural-language texts, such as "signature-style" word usage indicative of authors or topics, and that algorithms originally developed for natural language processing may therefore be applied to genome sequences to draw biologically relevant conclusions. Following this approach of "biological language modeling", statistical n-gram analysis was applied to compare the whole proteome sequences of 44 organisms. That work showed that a few particular amino acid n-grams are abundant in one organism but occur very rarely in other organisms, thereby serving as genome signatures. At that time the proteomes of only 44 organisms were available, which limited how far the hypothesis could be generalized. Today nearly 1,000 genome sequences and their corresponding translated sequences are available, making it feasible to test the existence of biological language models across the evolutionary tree.

    Results: We studied the whole proteome sequences of 970 microbial organisms using n-gram frequencies and cross-perplexity, employing the Biological Language Modeling Toolkit and the Patternix Revelio toolkit. Genus-specific signatures were observed even in a simple unigram distribution. Taking the statistical n-gram model of one organism as a reference and computing the cross-perplexity of all other microbial proteomes against it, we found cross-perplexity to be predictive of branch distance in the phylogenetic tree. For example, a 4-gram model built from the proteome of Shigella flexneri 2a, which belongs to the class Gammaproteobacteria, showed a self-perplexity of 15.34, while the cross-perplexity of other organisms ranged from 15.59 to 29.5 and was proportional to their branching distance from S. flexneri in the evolutionary tree. The organisms of this genus, which happen to be pathotypes of E. coli, also have the perplexity values closest to that of E. coli.

    Conclusion: Whole proteome sequences of microbial organisms contain particular n-gram sequences that are abundant in one organism but occur very rarely in other organisms, thereby serving as proteome signatures. Furthermore, perplexity, a statistical measure of the similarity of n-gram composition, can be used to predict evolutionary distance within a genus in the phylogenetic tree.
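
The cross-perplexity idea in this abstract can be illustrated with a minimal sketch. It is not the Biological Language Modeling Toolkit or Patternix Revelio; the function names `train_ngram` and `perplexity`, the add-one smoothing, and the 20-letter alphabet size are assumptions made here for illustration only. A model is trained on one set of protein sequences, and the perplexity of a second set under that model is computed.

```python
import math
from collections import Counter

def train_ngram(seqs, n, alphabet_size=20):
    """Count n-grams and their (n-1)-gram contexts over protein sequences.

    Hypothetical helper: alphabet_size=20 assumes the standard amino acids.
    """
    ngrams, contexts = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            ngrams[s[i:i + n]] += 1
            contexts[s[i:i + n - 1]] += 1
    return {"n": n, "ngrams": ngrams, "contexts": contexts, "V": alphabet_size}

def perplexity(model, seqs):
    """Perplexity of seqs under model, using add-one (Laplace) smoothing.

    The smoothing scheme is an assumption; the abstract does not state one.
    """
    n, vocab = model["n"], model["V"]
    log_sum, count = 0.0, 0
    for s in seqs:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            p = (model["ngrams"][gram] + 1) / (model["contexts"][gram[:-1]] + vocab)
            log_sum += math.log2(p)
            count += 1
    if count == 0:
        raise ValueError("sequences are shorter than the n-gram order")
    return 2 ** (-log_sum / count)

# Toy sequences, not real proteomes.
reference = ["ACDACDACDACD"]
other = ["WYWYWYWY"]
model = train_ngram(reference, n=2)
self_ppl = perplexity(model, reference)   # perplexity of the reference under its own model
cross_ppl = perplexity(model, other)      # cross-perplexity of a different sequence set
```

With these toy inputs the reference scores a lower perplexity under its own model than the unrelated sequence does, which is the qualitative pattern the abstract reports between self-perplexity and cross-perplexity.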

    VIBRATIONAL SPECTRA, CONFORMATIONAL STABILITY, AND AB INITIO CALCULATIONS OF FLUOROMETHYL PHOSPHONIC DIFLUORIDE

    Author Institution: Department of Chemistry and Biochemistry, University of South Carolina; Rijksuniversitair Centrum Antwerpen, Laboratorium voor Anorganische Scheikunde

    The Raman (3100 to 10 cm⁻¹) and infrared (3100 to 30 cm⁻¹) spectra of fluoromethyl phosphonic difluoride, FCH₂P(O)F₂, in the gas and solid phases have been recorded. Additionally, the Raman spectrum of the liquid, along with qualitative depolarization ratios, has been obtained. These data have been interpreted on the basis of an equilibrium between the trans (fluorine atom trans to the oxygen atom) and gauche conformers in the gas and liquid phases, with the trans conformer being the more stable form in both of these physical states and the only form present in the crystalline solid. A ΔH value has been determined from a study of the Raman spectrum of the liquid. Utilizing the trans torsional frequency, the gauche dihedral angle, and the enthalpy difference between the conformers, the potential function governing the interconversion of the rotamers has been calculated. A complete vibrational assignment is proposed for both conformers based on infrared band contours, Raman depolarization data, group frequencies, and normal coordinate calculations. The conformational stabilities, barriers to internal rotation, force constants, infrared and Raman intensities, and fundamental vibrational frequencies, along with the structural parameters, have been obtained from ab initio Hartree-Fock gradient calculations employing either the RHF/3-21G* or RHF/6-31G* basis sets. The Raman intensities calculated with the 3-21G* basis set reproduce the observed Raman spectrum remarkably well.
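
A potential function for rotamer interconversion is commonly written as a cosine series in the torsional angle, V(φ) = Σ (Vₙ/2)(1 − cos nφ). The abstract does not give the functional form or the fitted coefficients, so the sketch below assumes this common form, and the coefficients in the usage lines are placeholders rather than values from the study. The function name `torsional_potential` is likewise hypothetical.

```python
import math

def torsional_potential(phi_deg, coeffs):
    """Evaluate V(phi) = sum_n (V_n / 2) * (1 - cos(n * phi)).

    phi_deg: torsional angle in degrees, measured from the reference conformer.
    coeffs:  [V1, V2, V3, ...] in any consistent energy unit (e.g. cm^-1).
    The cosine-series form is an assumption, not taken from the abstract.
    """
    phi = math.radians(phi_deg)
    return sum(0.5 * v * (1.0 - math.cos(n * phi))
               for n, v in enumerate(coeffs, start=1))

# Placeholder coefficients for illustration only (not fitted values).
example_coeffs = [100.0, 50.0, 300.0]
curve = [(angle, torsional_potential(angle, example_coeffs))
         for angle in range(0, 181, 30)]
```

Scanning the angle in this way locates the minima (conformers) and maxima (barriers) of whatever coefficient set is supplied.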

    Ab Initio potential grid based docking: From High Performance Computing to In Silico Screening

    We present a new method for generating potential grids for protein-ligand docking. The potential of the docking target structure is obtained directly from the electron density derived through an ab initio computation. A large subregion was selected to allow full ab initio treatment of the isocitrate lyase enzyme. The electrostatic potential is tested by docking a small charged molecule (succinate) into the binding site. The ab initio grid yields a superior result: it produces the best binding orientation and position and recognizes that pose as the best. In contrast, the same docking procedure with a classical point-charge-based potential produces a number of additional incorrect binding poses and does not recognize the correct pose as the best solution.
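
For orientation, the classical point-charge baseline mentioned in this abstract can be sketched as a Coulomb sum evaluated on a regular grid. This is not the authors' ab initio method or their docking code; the function name `point_charge_grid`, the Coulomb constant of about 332.06 kcal·Å/(mol·e²), the vacuum dielectric, and the minimum-distance clamp are all assumptions chosen for illustration.

```python
import math

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2); an assumed convention.
COULOMB_KCAL = 332.06

def point_charge_grid(atoms, origin, spacing, shape, min_dist=0.5):
    """Electrostatic potential of point charges on a regular 3D grid.

    atoms:    list of ((x, y, z), charge) with coordinates in Angstrom
              and charges in units of e.
    origin:   (x, y, z) of grid point (0, 0, 0).
    spacing:  grid spacing in Angstrom.
    shape:    (nx, ny, nz) number of grid points per axis.
    min_dist: distances are clamped to this value to avoid singularities.
    Returns a dict mapping (i, j, k) to the potential in kcal/(mol*e).
    """
    grid = {}
    nx, ny, nz = shape
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                point = (origin[0] + i * spacing,
                         origin[1] + j * spacing,
                         origin[2] + k * spacing)
                total = 0.0
                for position, charge in atoms:
                    dist = max(math.dist(point, position), min_dist)
                    total += COULOMB_KCAL * charge / dist
                grid[(i, j, k)] = total
    return grid

# Toy system: a single unit charge at the origin, two grid points along x.
toy_grid = point_charge_grid([((0.0, 0.0, 0.0), 1.0)],
                             origin=(1.0, 0.0, 0.0), spacing=1.0, shape=(2, 1, 1))
```

A docking program would interpolate such a grid at ligand atom positions; the abstract's approach replaces the point-charge sum with a potential derived from the ab initio electron density.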