22 research outputs found
Statistical mechanics of transcription-factor binding site discovery using Hidden Markov Models
Hidden Markov Models (HMMs) are a commonly used tool for inference of
transcription factor (TF) binding sites from DNA sequence data. We exploit the
mathematical equivalence between HMMs for TF binding and the "inverse"
statistical mechanics of hard rods in a one-dimensional disordered potential to
investigate learning in HMMs. We derive analytic expressions for the Fisher
information, a commonly employed measure of confidence in learned parameters,
in the biologically relevant limit where the density of binding sites is low.
We then use techniques from statistical mechanics to derive a scaling principle
relating the specificity (binding energy) of a TF to the minimum amount of
training data necessary to learn it.
Comment: 25 pages, 2 figures, 1 table. V2: typos fixed and new references added.
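For readers unfamiliar with the quantity named in the abstract above, the standard textbook definition of the Fisher information for a single parameter, together with the Cramér-Rao bound it implies, can be written as follows. The symbols $\theta$ (a model parameter such as a binding energy), $D$ (one observed sequence), $N$ (the number of independent training sequences) and $\hat{\theta}$ (an unbiased estimator) are notation assumed here for illustration, not taken from the paper.

```latex
I(\theta) = -\,\mathbb{E}_{D \sim P(\cdot \mid \theta)}\!\left[ \frac{\partial^2}{\partial \theta^2} \log P(D \mid \theta) \right],
\qquad
\operatorname{Var}\bigl(\hat{\theta}\bigr) \;\ge\; \frac{1}{N\, I(\theta)}
```

Read this way, a larger Fisher information means fewer training sequences are needed to pin down the parameter to a given precision.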
Susceptibility calculations for alternating antiferromagnetic chains
Earlier work of Duffy and Barr consisting of exact calculations on alternating antiferromagnetic Heisenberg spin-1/2 chains is extended to longer chains of up to 12 spins, and subsequent extrapolations of thermodynamic properties, particularly the susceptibility, are extended to the weak alternation region close to the uniform limit. This is the region of interest in connection with the recent experimental discovery of spin-Peierls systems. The extrapolated susceptibility curves are compared with corresponding curves calculated from the model of Bulaevskii, which has been used extensively in approximate theoretical treatments of a variety of phenomena. Qualitative agreement is observed in the uniform limit and persists for all degrees of alternation, but quantitative differences of about 10% are present over the whole range, including the isolated dimer limit. Potential application of the new susceptibility calculations to experiment is discussed.
An FPT Approach for Predicting Protein Localization from Yeast Genomic Data
Accurately predicting the localization of proteins is of paramount importance in the quest to determine their respective functions within the cellular compartment. Because of the continuous and rapid progress in the fields of genomics and proteomics, more data are available now than ever before. At the same time, data mining methods have been developed and refined to handle this experimental windfall, allowing the scientific community to quantitatively address long-standing questions such as that of protein localization. Here, we develop a frequent pattern tree (FPT) approach to generate a minimum set of rules (mFPT) for predicting protein localization. We acquire a series of rules according to the features of yeast genomic data. The mFPT prediction accuracy is benchmarked against other commonly used methods such as Bayesian networks and logistic regression under various statistical measures. Our results show that mFPT performed better than the other approaches in predicting protein localization. Setting 0.65 as the minimum hit-rate, we obtained 138 proteins that mFPT predicted differently from the simple naive Bayesian method (SNB). In our analysis of these 138 proteins, we present novel predictions for the location of 17 proteins that currently have no defined localization. These predictions can serve as putative annotations and should provide preliminary clues for experimentalists. We also compared our predictions against the eukaryotic subcellular localization database and related predictions by others on protein localization. Our method is quite general and can thus be applied to discover the underlying rules for protein-protein interactions, genomic interactions, and structure-function relationships, as well as those of other fields of research.
A method to improve protein subcellular localization prediction by integrating various biological data sources
Background: Protein subcellular localization is crucial information for elucidating protein functions. Owing to the need for large-scale genome analysis, computational methods for efficiently predicting protein subcellular localization are in high demand. Although much previous work has addressed this task, the problem remains challenging for several reasons: the number of subcellular locations in practice is large; the distribution of proteins across locations is imbalanced, that is, the number of proteins in each location differs remarkably; and many proteins are located in multiple locations. It is therefore necessary to explore new features and appropriate classification methods to improve prediction performance. Results: In this paper we propose a new prediction method which combines two key ideas: 1) information on neighbour proteins in a probabilistic gene network is integrated to enrich the prediction features; 2) fuzzy k-NN, a classification method based on fuzzy set theory, is applied to predict proteins located in multiple sites. An experiment was conducted on a dataset of budding yeast proteins covering 22 locations, and significant improvement was observed. Conclusion: Our results suggest that the neighbourhood information from functional gene networks is predictive of subcellular localization. The proposed method can thus be integrated with, and is complementary to, other available prediction methods.
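The fuzzy k-NN idea named in the abstract above can be illustrated with a minimal Python sketch. It follows the generic textbook formulation (inverse-distance-weighted class memberships over the k nearest neighbours) and is not the authors' implementation; the function names `fuzzy_knn` and `predict_sites`, the fuzzifier `m`, the threshold, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def fuzzy_knn(train_points, train_memberships, query, k=3, m=2.0):
    """Return a {class: membership} dict for `query` (hypothetical helper).

    train_points      : list of feature tuples
    train_memberships : list of {class: membership} dicts, one per point
    m                 : fuzzifier (> 1) controlling the distance weighting
    """
    # Distance from the query to every training point, nearest first.
    dists = sorted((math.dist(p, query), i) for i, p in enumerate(train_points))
    neighbours = dists[:k]

    # Inverse-distance weights; eps guards against a zero distance.
    eps = 1e-12
    weights = [(1.0 / (d ** (2.0 / (m - 1.0)) + eps), i) for d, i in neighbours]
    total = sum(w for w, _ in weights)

    # Weighted average of the neighbours' class memberships.
    classes = {c for mem in train_memberships for c in mem}
    return {
        c: sum(w * train_memberships[i].get(c, 0.0) for w, i in weights) / total
        for c in classes
    }


def predict_sites(memberships, threshold=0.2):
    """Report every class whose membership reaches the threshold,
    so a protein can be assigned to more than one location."""
    return sorted(c for c, u in memberships.items() if u >= threshold)


# Toy data: two features per protein, invented for this example.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [
    {"nucleus": 1.0},
    {"nucleus": 0.5, "mitochondrion": 0.5},
    {"cytoplasm": 1.0},
    {"cytoplasm": 1.0},
]
result = fuzzy_knn(points, labels, (0.0, 0.5), k=2)
sites = predict_sites(result)
```

Because the output is a graded membership per location rather than a single label, thresholding it is one simple way to handle proteins found in several compartments.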
Integrative Identification of Arabidopsis Mitochondrial Proteome and Its Function Exploitation through Protein Interaction Network
Mitochondria are major players in the production of energy, and host several key reactions involved in basic metabolism and the biosynthesis of essential molecules. Currently, the majority of nucleus-encoded mitochondrial proteins are unknown even for the model plant Arabidopsis. We report a computational framework for predicting Arabidopsis mitochondrial proteins based on a probabilistic model, called Naive Bayesian Network, which integrates disparate genomic data generated from eight bioinformatics tools, multiple orthologous mappings, protein domain properties and co-expression patterns using 1,027 microarray profiles. Through this approach, we predicted 2,311 candidate mitochondrial proteins with 84.67% accuracy and a 2.53% false positive rate (FPR). Together with experimentally confirmed proteins, 2,585 mitochondrial proteins (named CoreMitoP) were identified. We explored the proteins of unknown function using a protein-protein interaction network (PIN) and annotated novel functions for 26.65% of the CoreMitoP proteins. Moreover, we found newly predicted mitochondrial proteins embedded in particular subnetworks of the PIN, mainly functioning in response to diverse environmental stresses such as salt, drought, cold, and wounding. Candidate mitochondrial proteins involved in those physiological activities provide useful targets for further investigation. The assigned functions also provide comprehensive information on the Arabidopsis mitochondrial proteome.
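The evidence-integration step described in the abstract above can be sketched in general terms as a naive Bayes combination of independent likelihood ratios. This is a generic illustration under the naive independence assumption, not the authors' Naive Bayesian Network; the function name `naive_bayes_posterior`, the prior, and the likelihood-ratio values are hypothetical.

```python
import math


def naive_bayes_posterior(prior, likelihood_ratios):
    """Combine a prior probability with per-source likelihood ratios.

    Each ratio is P(evidence | mitochondrial) / P(evidence | not mitochondrial)
    for one data source; sources are treated as conditionally independent.
    """
    # Work in log-odds so that many sources can be summed stably.
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    # Convert the final log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Invented example: an even prior and two supporting sources of evidence.
posterior = naive_bayes_posterior(0.5, [2.0, 3.0])
```

A ratio above 1 pushes the posterior up and a ratio below 1 pushes it down, which is how weak but complementary data sources can add up to a confident call.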
Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration
BACKGROUND: The automation of many common molecular biology techniques has resulted in the accumulation of vast quantities of experimental data. One of the major challenges now facing researchers is how to process this data to yield useful information about a biological system (e.g. knowledge of genes and their products, and the biological roles of proteins, their molecular functions, localizations and interaction networks). We present a technique called Global Mapping of Unknown Proteins (GMUP) which uses the Gene Ontology Index to relate diverse sources of experimental data by creation of an abstraction layer of evidence data. This abstraction layer is used as input to a neural network which, once trained, can be used to predict function from the evidence data of unannotated proteins. The method allows us to include in our evidence data almost any experimental data set related to protein function that incorporates the Gene Ontology, in order to seek relationships between the different sets. RESULTS: We have demonstrated the capabilities of this method in two ways. We first collected various experimental datasets associated with yeast (Saccharomyces cerevisiae) and applied the technique to a set of previously annotated open reading frames (ORFs). These ORFs were divided into training and test sets and were used to examine the accuracy of the predictions made by our method. Then we applied GMUP to previously unannotated ORFs and made 1980, 836 and 1969 predictions corresponding to the GO Biological Process, Molecular Function and Cellular Component sub-categories respectively. We found that GMUP was particularly successful at predicting ORFs with functions associated with the ribonucleoprotein complex, protein metabolism and transportation. CONCLUSION: This study presents a global and generic gene knowledge discovery approach based on evidence integration of various genome-scale data. It can be used to provide insight as to how certain biological processes are implemented by interaction and coordination of proteins, which may serve as a guide for future analysis. New data can be readily incorporated as it becomes available to provide more reliable predictions or further insights into processes and interactions.
One-Dimensional Model Systems: Theoretical Survey
In the early 1960s one-dimensional model systems were regarded as amusing toys with the advantage of being far more easily solvable than their "real" three-dimensional counterparts. Now essentially 1-D (quasi-1-D) magnets can be "tailor-made" in the laboratory. Even more popular is the field of organic conductors like TTF-TCNQ, which are naturally quasi-1-D. Currently solitons and related solutions of non-linear, dispersive 1-D differential equations are ubiquitous in physics, including the area of 1-D magnetism. These developments are discussed in the Introduction. The rest of this paper is concerned with model Hamiltonians, model comparisons, critical singularities in 1-D (quasi-1-D) systems, accuracy of numerical techniques in comparison with exact solutions, brief accounts of dilute and disordered 1-D systems, and 1-D spin dynamics. Finally, a comment is made on a variety of interesting isomorphisms between 1-D magnets and phenomena in several other areas of physics, for example 2-D ferroelectrics, field-theoretic models, and realistic fluids. Comparison of theory and experiment has been the subject of several excellent reviews and is therefore not discussed here.
Integrative Analysis of the Mitochondrial Proteome in Yeast
In this study yeast mitochondria were used as a model system to apply, evaluate, and integrate different genomic approaches to define the proteins of an organelle. Liquid chromatography mass spectrometry applied to purified mitochondria identified 546 proteins. By expression analysis and comparison to other proteome studies, we demonstrate that the proteomic approach identifies primarily highly abundant proteins. By expanding our evaluation to other types of genomic approaches, including systematic deletion phenotype screening, expression profiling, subcellular localization studies, protein interaction analyses, and computational predictions, we show that an integration of approaches moves beyond the limitations of any single approach. We report the success of each approach by benchmarking it against a reference set of known mitochondrial proteins, and predict approximately 700 proteins associated with the mitochondrial organelle from the integration of 22 datasets. We show that a combination of complementary approaches like deletion phenotype screening and mass spectrometry can identify over 75% of the known mitochondrial proteome. These findings have implications for choosing optimal genome-wide approaches for the study of other cellular systems, including organelles and pathways in various species. Furthermore, our systematic identification of genes involved in mitochondrial function and biogenesis in yeast expands the candidate genes available for mapping Mendelian and complex mitochondrial disorders in humans