Discovery of Functional Motifs from the Interface Region of Oligomeric Proteins using Frequent Subgraph Mining
Modeling the interface region of a protein complex paves the way for understanding its dynamics and functionality. Existing works model the interface region of a complex using different approaches, such as the residue composition at the interface, the geometry of the interface residues, or the structural alignment of interface regions. These approaches are useful for ranking a set of docked conformations or for building a scoring function for protein-protein docking, but they do not provide a generic and scalable technique for extracting interface patterns that lead to functional motif discovery. In this work, we model the interface region of a protein complex as a graph and extract interface patterns of the given complex in the form of frequent subgraphs. To achieve this, we develop a scalable algorithm for frequent subgraph mining. We show that a systematic review of the mined subgraphs provides an effective method for discovering functional motifs along the interface region of a given protein complex.
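As a toy illustration of this graph view (not the authors' actual algorithm, and with invented residue data), the first level of a frequent-subgraph miner — counting how many interface graphs contain each labeled contact edge — can be sketched as:

```python
from collections import Counter

# Hypothetical interface graphs: each edge is a pair of residue labels
# in spatial contact across the interface (illustrative data only).
graphs = [
    {("ARG", "ASP"), ("ARG", "GLU"), ("ASP", "TYR")},
    {("ARG", "ASP"), ("ASP", "TYR"), ("LEU", "VAL")},
    {("ARG", "ASP"), ("ARG", "GLU"), ("PHE", "TRP")},
]

def frequent_edges(graphs, min_support):
    """Single-edge subgraphs occurring in at least min_support graphs.

    Full miners (e.g. gSpan-style algorithms) grow such frequent seeds
    edge by edge; this sketch stops at level one.
    """
    support = Counter()
    for g in graphs:
        # a graph contributes at most once per distinct edge
        for edge in {tuple(sorted(e)) for e in g}:
            support[edge] += 1
    return {e: c for e, c in support.items() if c >= min_support}

print(frequent_edges(graphs, min_support=2))
# ('ARG', 'ASP') has support 3; ('ARG', 'GLU') and ('ASP', 'TYR') have 2
```

A real miner would then try to extend each surviving edge with further frequent edges, pruning extensions whose support drops below the threshold.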
Latent Representation and Sampling in Network: Application in Text Mining and Biology.
In classical machine learning, hand-designed features are used to learn a mapping from raw data. However, human involvement in feature design makes the process expensive. Representation learning aims to learn abstract features directly from data without direct human involvement. Raw data can take various forms; networks are one such form, encoding relational structure in many real-world domains. Therefore, learning abstract features for network units is an important task. In this dissertation, we propose models for incorporating temporal information given as a collection of networks from subsequent time-stamps. The primary objective of our models is to learn a better abstract feature representation of nodes and edges in an evolving network. We show that the temporal information in the abstract features substantially improves the performance of the link prediction task. Besides applying our models to network data, we also employ them to incorporate extra-sentential information in the text domain for learning better representations of sentences. We build a context network of sentences to capture extra-sentential information; including this information in the abstract feature representation of sentences substantially improves various text-mining tasks over a set of baseline methods. A problem with the abstract features that we learn is that they lack interpretability. In real-life applications on network data, for some tasks it is crucial to learn interpretable features in the form of graphical structures. For this, we need to mine important graphical structures along with their frequency statistics from the input dataset. However, exact algorithms for these tasks are computationally expensive, so scalable algorithms are urgently needed. To overcome this challenge, we provide efficient sampling algorithms for mining higher-order structures from networks. We show that our sampling-based algorithms are scalable.
They are also superior to a set of baseline algorithms in retrieving important graphical sub-structures and collecting their frequency statistics. Finally, we show that these frequent subgraph structures and statistics can be used as features in various real-life applications. We present one application in biology and another in security; in both cases, the structures and their statistics significantly improve the performance of knowledge discovery tasks in these domains.
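A classic instance of such sampling (shown here as a generic sketch, not one of the dissertation's algorithms) is estimating the triangle count of a graph by sampling wedges — paths of length two — and measuring how often they close:

```python
import random

def estimate_triangles(adj, samples=5000, seed=0):
    """Estimate the number of triangles by wedge sampling.

    A wedge centered at v is an unordered pair of v's neighbours; each
    triangle closes exactly three wedges, hence the division by 3.
    """
    rng = random.Random(seed)
    nodes = [v for v in adj if len(adj[v]) >= 2]
    # number of wedges centered at v: C(deg(v), 2)
    weights = [len(adj[v]) * (len(adj[v]) - 1) // 2 for v in nodes]
    total_wedges = sum(weights)
    closed = 0
    for _ in range(samples):
        # pick a wedge uniformly: center proportional to its wedge count
        v = rng.choices(nodes, weights=weights)[0]
        a, b = rng.sample(sorted(adj[v]), 2)
        if b in adj[a]:  # the wedge closes into a triangle
            closed += 1
    return closed / samples * total_wedges / 3

# Sanity check on the complete graph K5, which has C(5,3) = 10 triangles.
k5 = {v: {u for u in range(5) if u != v} for v in range(5)}
print(estimate_triangles(k5))  # every wedge is closed, so exactly 10.0
```

The estimate converges to the true count as the number of samples grows, while touching only a small fraction of the graph per sample.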
Protein functional features extracted from primary sequences: A focus on disordered sequences.
In this thesis we implement an ensemble of sequence analysis strategies aimed at identifying functional and structural protein features. The first part of this work was dedicated to two case studies of specific proteins analyzed to provide candidate func
Development of Computer-aided Concepts for the Optimization of Single-Molecules and their Integration for High-Throughput Screenings
In the field of synthetic biology, highly interdisciplinary approaches for the
design and modelling of functional molecules using computer-assisted methods
have become established in recent decades. These computer-assisted methods are
mainly used when experimental approaches reach their limits, since computer
models can, for example, elucidate the temporal behaviour of nucleic acid
polymers or proteins through single-molecule simulations, and illustrate the
functional relationships of amino acid residues or nucleotides to each other.
The knowledge gained from computer modelling can continuously inform the
further experimental process (screening), as well as the shape or function
(rational design) of the molecule under consideration. Such human-guided
optimization of biomolecules is often necessary, since the substrates of
interest for biocatalysts and enzymes are usually synthetic (``man-made
materials'', such as PET) and evolution has had no time to provide efficient
biocatalysts.
With regard to the computer-aided design of single molecules, two fundamental
paradigms dominate the field of synthetic biology: on the one hand,
probabilistic experimental methods (e.g., evolutionary design processes such as
directed evolution) used in combination with High-Throughput Screening (HTS);
on the other hand, rational, computer-aided single-molecule design methods.
For both topics, computer models and concepts were developed, evaluated and
published.
The first contribution in this thesis describes a computer-aided design
approach for the Fusarium solani cutinase (FsC). The loss of enzyme activity
during longer incubation with PET was investigated in molecular detail. For
this purpose, Molecular Dynamics (MD) simulations of the spatial structure of
FsC together with a water-soluble degradation product of the synthetic
substrate PET (ethylene glycol) were computed. The existing model was extended
by combining it with Reduced Models. This simulation study identified certain
areas of FsC that interact very strongly with PET (ethylene glycol) and thus
have a significant influence on the flexibility and structure of the enzyme.
The subsequent original publication establishes a new method for the selection
of High-Throughput assays for use in protein chemistry. The selection is made
via a meta-optimization of the assays to be analyzed. For this purpose, control
reactions are carried out for each assay. The distance between the control
distributions is evaluated using classical statistical methods such as the
Kolmogorov-Smirnov test, and a performance score is then assigned to each
assay. The described control experiments are performed before the actual
experiment (screening), and the assay with the highest performance is used for
further screening. By applying this generic method, high success rates can be
achieved, as we were able to demonstrate experimentally using lipases and
esterases as examples.
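A minimal sketch of this selection idea — with invented assay names and readouts, and the two-sample Kolmogorov-Smirnov statistic standing in for whatever performance measure the publication actually uses — could look like:

```python
import bisect

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical distribution functions of x and y."""
    xs, ys = sorted(x), sorted(y)
    d = 0.0
    for t in xs + ys:
        fx = bisect.bisect_right(xs, t) / len(xs)
        fy = bisect.bisect_right(ys, t) / len(ys)
        d = max(d, abs(fx - fy))
    return d

# Hypothetical (positive, negative) control readouts, arbitrary units.
assays = {
    "assay_A": ([0.9, 1.0, 1.1, 1.2], [0.10, 0.15, 0.20, 0.25]),
    "assay_B": ([0.5, 0.6, 0.7, 0.8], [0.40, 0.50, 0.60, 0.70]),
}
# Pick the assay whose positive/negative controls separate best.
best = max(assays, key=lambda a: ks_statistic(*assays[a]))
print(best)  # assay_A: its controls are perfectly separated (D = 1.0)
```

The higher the KS statistic D, the better an assay distinguishes active from inactive samples, which is exactly the property a screening campaign needs.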
In the area of green chemistry, the processes described above can help to find
enzymes for the degradation of synthetic materials more quickly, or to modify
naturally occurring enzymes so that, after successful optimization, they can
efficiently convert synthetic substrates. In the practical implementation, the
experimental effort (consumption of materials) is kept to a minimum. Especially
for large-scale screenings, a prior consideration or restriction of the
possible sequence space can contribute significantly to maximizing the success
rate of screenings and minimizing the total time they require.
In addition to classical methods such as MD simulations in combination with
reduced models, new graph-based methods for the representation and analysis of
MD simulations have been developed. For this purpose, simulations were
converted into distance-dependent dynamic graphs. Based on this reduced
representation, efficient algorithms for analysis were developed and tested.
In particular, network motifs were investigated to determine whether this type
of semantics is more suitable than spatial coordinates for describing molecular
structures and interactions within MD simulations. This concept was evaluated
on various MD simulations of molecules such as water, synthetic pores,
proteins, peptides and RNA structures. It was shown that this novel form of
semantics is an excellent way to describe (bio)molecular structures and their
dynamics.
Furthermore, an algorithm (StreAM-Tg) has been developed for the creation of
motif-based Markov models, especially for the analysis of single molecule
simulations of nucleic acids. This algorithm is used for the design of RNAs. The
insights obtained from the analysis with StreAM-Tg (Markov models) can
provide useful design recommendations for the (re)design of functional RNA.
In this context, a new method was developed to quantify the environment (i.e.,
water; the solvent context) and its influence on biomolecules in MD
simulations. For this purpose, three-vertex motifs were used to describe the
structure of individual water molecules. With this method, the structure and
dynamics of water can be described accurately; for example, we were able to
reproduce the thermodynamic entropy of water in the liquid and vapor phases
along the vapor-liquid equilibrium curve from the triple point to the critical
point.
Another major field covered in this thesis is the development of new
computer-aided approaches for HTS for the design of
functional RNA. For the production of functional RNA (e.g., aptamers and riboswitches), an experimental,
round-based HTS (like SELEX) is typically used. By using
Next Generation Sequencing (NGS) in combination with the SELEX process,
this design process can be studied at the nucleotide and secondary structure
levels for the first time. The special feature of small RNA molecules compared
to proteins is that the minimum-free-energy secondary structure (topology) can
be determined directly from the nucleotide sequence with a high degree of
certainty.
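The dynamic-programming shape behind such predictions can be illustrated with the simpler Nussinov algorithm — a hedged stand-in that maximizes the number of nested base pairs rather than minimizing a full thermodynamic energy, as M. Zuker's algorithm does:

```python
def nussinov_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (Nussinov DP).

    Simplified stand-in for minimum-free-energy folding: it counts
    Watson-Crick/GU pairs instead of summing energy terms, but shares
    the same O(n^3) recursion over subsequences [i, j].
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: base i stays unpaired
            # case 2: base i pairs with some k (hairpin loop >= min_loop)
            for k in range(i + min_loop + 1, j + 1):
                if (seq[i], seq[k]) in pairs:
                    right = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + dp[i + 1][k - 1] + right)
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

print(nussinov_pairs("GGGAAAUCCC"))  # 3: a GGG...CCC hairpin stem
```

Production tools replace the pair count with nearest-neighbour energy parameters, which is what makes the predicted structure a true minimum-free-energy topology.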
Using the combination of M. Zuker's algorithm, NGS and the SELEX method, it was
possible to quantify the structural diversity of individual RNA molecules while
taking the genetic context into account. This combination of methods allowed
the prediction of the rounds in which the first ciprofloxacin-riboswitch
emerged.
In this example, only a simple structural comparison (the Levenshtein distance)
was used to quantify the diversity of each round.
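As a small illustration (the dot-bracket structures below are invented, not taken from the dataset), the Levenshtein distance between two predicted secondary structures can be computed with the standard edit-distance dynamic program:

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    turning string a into string b (standard edit-distance DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

# Hypothetical dot-bracket structures of two sequenced SELEX variants;
# round diversity can then be summarized as the mean pairwise distance.
print(levenshtein("((((...))))", "((((....))))"))  # 1 (one inserted dot)
```

Averaging this distance over all pairs within a selection round gives a single diversity score per round, whose collapse signals the emergence of a dominant structure.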
To improve this, a new representation of the RNA structure as a directed graph
was modeled, which was then compared with a probabilistic subgraph isomorphism.
Finally, the NGS dataset (ciprofloxacin-riboswitch) was modeled as a dynamic
graph and analyzed for the occurrence of defined seven-vertex motifs. In doing
so, motif-based semantics were integrated into HTS for RNA molecules for the
first time. The identified motifs could be assigned to secondary structural
elements that were identified experimentally in the ciprofloxacin aptamer
R10k6.
Finally, all the algorithms presented were integrated into an R library, which
has been published and made available to scientists worldwide.
Selected Works in Bioinformatics
This book consists of nine chapters covering a variety of bioinformatics subjects, ranging from database resources for protein allergens, unravelling genetic determinants of complex disorders, characterization and prediction of regulatory motifs, and computational methods for identifying the best classifiers and key disease genes in large-scale transcriptomic and proteomic experiments, to functional characterization of inherently unfolded proteins/regions, protein interaction networks and flexible protein-protein docking. The computational algorithms are in general presented in a way that is accessible to advanced undergraduate students, graduate students and researchers in molecular biology and genetics. The book should also serve as a stepping stone for mathematicians, biostatisticians, and computational scientists to cross their academic boundaries into the dynamic and ever-expanding field of bioinformatics.
Delineating Structural Characteristics of Viral Capsid Proteins Critical for Their Functional Assembly.
Viral capsids exhibit elaborate and symmetrical architectures of defined sizes and remarkable mechanical properties not seen with cellular macromolecular complexes. The limited coding capacity of viral genome necessitates economization upon one or a few identical gene products known as capsid proteins for shell assembly. The functional uniqueness of this class of proteins prompts questions on structural features critically important for their higher order organization. In this thesis, I develop the statistical framework and computational tools to pinpoint the structural characteristics of viral capsid proteins exclusive to the virosphere by testing a series of hypotheses, providing understanding of the physical principles governing molecular self-association that can inform rational design of nanomaterials and therapeutics. In the first chapter, I compare the folds of capsid proteins with those of generic proteins, and establish that capsid proteins are segregated in structural fold space, highlighting the geometric constraints of these building blocks for tiling into a closed shell. Second, I develop a software program, PCalign, for quantifying the physicochemical similarity between protein-protein interfaces. This tool overcomes the major limitation of current methods by using a reduced representation of structural information, greatly expanding the structural interface space that can be investigated through inclusion of large macromolecular assemblies that are often not amenable to high resolution experimental techniques. As an application of this method, I propose a computational framework for
template-based protein inhibitor design, leading to the prediction of putative binders for a therapeutic target, the influenza hemagglutinin. In silico evaluations of these candidate drugs parallel those of known protein binders, offering great promise in expanding therapeutic options in the clinic. Lastly, I examine protein-protein interfaces using PCalign, and find strong statistical evidence for the disconnectivity between capsid proteins and cellular proteins in structural interface space. I thus conclude that the basic shape and the sticky edges of these Lego pieces act concertedly to create the sophisticated shell architecture. In summary, the novel tools contributed by this dissertation work lead to delineation of structural features of viral capsid proteins that make them functionally unique, providing an understanding that will serve as the basis for prediction and design.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/110375/1/sscheng_1.pd
Text Mining for Protein-Protein Docking
Scientific publications are a rich but underutilized source of structural and functional information on proteins and protein interactions. Although scientific literature is intended for a human audience, text mining makes it amenable to algorithmic processing. It can focus on extracting information relevant to protein binding modes, providing specific residues that are likely to be at the binding site for a given pair of proteins. The knowledge of such residues is a powerful guide for the structural modeling of protein-protein complexes. This work combines and extends two well-established areas of research: the non-structural identification of protein-protein interactors, and structure-based detection of functional (small-ligand) sites on proteins. Text-mining-based constraints for protein-protein docking are a unique research direction that had not been explored prior to this study. Although text mining by itself is unlikely to produce docked models, it is useful in scoring docking predictions. Our results show that, despite the presence of false positives, text mining significantly improves docking quality. To purge false positives from the mined residues, this work explores, along with basic text mining, enhanced text-mining techniques using various language-processing tools: from simple dictionaries to WordNet (a generic word ontology), parse trees, word vectors and deep recursive neural networks. The results significantly increase confidence in the generated docking constraints and provide guidelines for the future development of this modeling approach. With the rapid growth of the body of publicly available biomedical literature and new evolving text-mining methodologies, the approach will become more powerful and better suited to the needs of the biomedical community.
Knob-socket Investigation of Stability and Specificity in Alpha-helical Secondary and Quaternary Packing Structure
The novel knob-socket (KS) model provides a construct for interpreting and analyzing the direct contributions of amino acid residues to stability in α-helical protein structures. Based on residue preferences derived from a set of protein structures, the KS construct characterizes intra- and inter-helical packing into regular patterns of simple motifs. The KS model was used in the de novo design of an α-helical homodimer, KSα1.1. Using site-directed mutagenesis, KSα1.1 point mutants were designed to selectively increase and decrease stability by relating KS propensities to changes in α-helical structure. This study suggests that the sockets from the KS model can be used as a measure of α-helical structure and stability.
The KS model was also used to investigate coiled-coil specificity in bZIP proteins. Identifying and characterizing the interactions that determine the dimerization specificity between bZIP proteins is a crucial factor in better understanding disease formation and proliferation, as well as in developing drugs or therapeutics to combat these diseases. Knob-Socket mapping methods identified Asn residues at the a positions within the helices, which were determined to be crucial factors in coiled-coil specificity. Site-directed mutagenesis was conducted to investigate the role of these Asn residues, as well as the role played by the neighboring residues at the g and b positions. The results indicate that the Asn at the a position defines coiled-coil specificity, and that the Knob-Socket model can be used to determine bZIP protein quaternary interactions.
Knowledge derivation and data mining strategies for probabilistic functional integrated networks
PhD
One of the fundamental goals of systems biology is the experimental verification of the interactome: the entire complement of molecular interactions occurring in the cell. Vast amounts of high-throughput data have been produced to aid this effort. However, these data are incomplete and contain high levels of both false positives and false negatives. To combat these limitations in data quality, computational techniques have been developed to evaluate the datasets and integrate them in a systematic fashion using graph theory. The result is an integrated network which can be analysed using a variety of network analysis techniques to draw new inferences about biological questions and to guide laboratory experiments.
Individual research groups are interested in specific biological problems and, consequently, network analyses are normally performed with regard to a specific question. However, the majority of existing data integration techniques are global and do not focus on specific areas of biology. Currently this issue is addressed by using known annotation data (such as that from the Gene Ontology) to produce process-specific subnetworks. However, this approach discards useful information and is of limited use in poorly annotated areas of the interactome. Therefore, there is a need for network integration techniques that produce process-specific networks without loss of data. The work described here addresses this requirement by extending one of the most powerful integration techniques, probabilistic functional integrated networks (PFINs), to incorporate a concept of biological relevance.
Initially, the available functional data for the baker's yeast Saccharomyces cerevisiae were evaluated to identify areas of bias and specificity which could be exploited during network integration. This information was used to develop an integration technique which emphasises interactions relevant to specific biological questions, using yeast ageing as an exemplar. The integration method improves performance in network-based protein function prediction for this process. Further, the process-relevant networks complement classical network integration techniques and significantly improve network analysis across a wide range of biological processes. The method developed has been used to produce novel predictions for 505 Gene Ontology biological processes. Of these predictions, 41,610 are consistent with existing computational annotations, and 906 are consistent with known expert-curated annotations. The approach significantly reduces the hypothesis space for experimental validation of genes hypothesised to be involved in the oxidative stress response. Therefore, incorporation of biological relevance into network integration can significantly improve network analysis with regard to individual biological questions.
Scoring functions for protein docking and drug design
Predicting the structure of complexes formed by two interacting proteins is an important problem in computational structural biology. Proteins perform many of their functions by binding to other proteins. The structure of protein-protein complexes provides atomic details about protein function and biochemical pathways, and can help in designing drugs that inhibit binding. Docking computationally models the structure of protein-protein complexes, given three-dimensional structures of the individual chains. Protein docking methods have two phases. In the first phase, a comprehensive, coarse search is performed for optimally docked models. In the second refinement and reranking phase, the models from the first phase are refined and reranked, with the expectation of extracting a small set of accurate models from the pool of thousands obtained from the first phase. In this thesis, new algorithms are developed for the refinement and reranking phase of docking. New scoring functions, or potentials, that rank models are developed. These potentials are learnt using large-scale machine learning methods based on mathematical programming; the learning procedure involves examining hundreds of thousands of correct and incorrect models. In this thesis, hierarchical constraints were introduced into the learning algorithm. First, an atomic potential was developed using this learning procedure. A refinement procedure involving side-chain remodeling and conjugate gradient-based minimization was introduced; combined with the atomic potential, it was shown to improve docking accuracy significantly. Second, a hydrogen bond potential was developed. Molecular dynamics-based sampling combined with the hydrogen bond potential improved docking predictions. Third, mathematical programming compared favorably to SVMs and neural networks in terms of accuracy, training time and test time for the task of designing potentials to rank docking models.
The methods described in this thesis are implemented in the docking package DOCK/PIERR, which was shown to be among the best automated docking methods in community-wide assessments. Finally, DOCK/PIERR was extended to predict membrane protein complexes: a membrane-based score was added to the reranking phase and shown to improve docking accuracy. This docking algorithm for membrane proteins was used to study the dimers of the amyloid precursor protein, implicated in Alzheimer's disease.
Computer Science