31 research outputs found
RSLpred: an integrative system for predicting subcellular localization of rice proteins combining compositional and evolutionary information
The attainment of a complete map-based sequence for rice (Oryza sativa) is clearly a major milestone for the research community. Identifying the localization of encoded proteins is the key to understanding their functional characteristics and facilitating their purification. Our proposed method, RSLpred, is an effort in this direction for genome-scale subcellular prediction of encoded rice proteins. First, support vector machine (SVM)-based modules were developed using traditional amino acid, dipeptide (i+1) and four-part amino acid composition, achieving overall accuracies of 81.43, 80.88 and 81.10%, respectively. Second, a similarity search-based module was developed using the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) and achieved 68.35% accuracy. Another module, developed using evolutionary information of a protein sequence extracted from the position-specific scoring matrix, achieved an accuracy of 87.10%. In this study, a large number of modules were developed using various encoding schemes such as higher-order dipeptide composition, N- and C-terminal composition, split amino acid composition and hybrid information. To benchmark RSLpred, it was tested on an independent set of rice proteins, where it outperformed widely used prediction methods such as TargetP, Wolf-PSORT, PA-SUB, Plant-Ploc and ESLpred. To assist the plant research community, an online web tool 'RSLpred' has been developed for subcellular prediction of query rice proteins, which is freely accessible at http://www.imtech.res.in/raghava/rslpred.
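The compositional encodings named in this abstract can be illustrated with a minimal sketch. It is not the RSLpred implementation: the function names are hypothetical, and the way the sequence is cut into four parts is an assumption, since the abstract does not define it. The resulting vectors are the kind of fixed-length input an SVM could be trained on.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> list[float]:
    """Return the fraction of each standard residue (20 values)."""
    seq = seq.upper()
    total = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / total for aa in AMINO_ACIDS]


def dipeptide_composition(seq: str, gap: int = 1) -> list[float]:
    """Return the fraction of each residue pair (i, i+gap) (400 values)."""
    seq = seq.upper()
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - gap):
        pair = seq[i] + seq[i + gap]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


def split_composition(seq: str, parts: int = 4) -> list[float]:
    """Concatenate the composition of `parts` roughly equal segments.

    Assumption: segments are contiguous and of equal (ceiling) length.
    """
    size = max(1, -(-len(seq) // parts))  # ceiling division
    features: list[float] = []
    for k in range(parts):
        features.extend(amino_acid_composition(seq[k * size:(k + 1) * size]))
    return features
```

Setting `gap` to 2 or more gives a higher-order dipeptide composition in the same 400-dimensional layout.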
Cheminformatics Tools to Explore the Chemical Space of Peptides and Natural Products
Cheminformatics facilitates the analysis, storage, and collection of large quantities of chemical data, such as molecular structures and molecules' properties and biological activity, and it has revolutionized medicinal chemistry for small molecules. However, its application to larger molecules is still underrepresented. This thesis work attempts to fill this gap and extend the cheminformatics approach towards large molecules and peptides.
This thesis is divided into two parts. The first part presents the implementation and application of two new molecular descriptors: macromolecule extended atom pair fingerprint (MXFP) and MinHashed atom pair fingerprint of radius 2 (MAP4). MXFP is an atom pair fingerprint suitable for large molecules, and here, it is used to explore the chemical space of non-Lipinski molecules within the widely used PubChem and ChEMBL databases. MAP4 is a MinHashed hybrid of substructure and atom pair fingerprints suitable for encoding small and large molecules. MAP4 is first benchmarked against commonly used atom pair and substructure fingerprints, and then it is used to investigate the chemical space of microbial and plant natural products with the aid of machine learning and chemical space mapping.
The second part of the thesis focuses on peptides. It is introduced by a review chapter on approaches to discovering novel peptide structures and on the known peptide chemical space. Then, a genetic algorithm that uses MXFP in its fitness function is described and challenged to generate peptide analogs of peptidic or non-peptidic queries. Finally, supervised and unsupervised machine learning is used to generate novel antimicrobial and non-hemolytic peptide sequences.
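The MinHash idea underlying a fingerprint such as MAP4 can be sketched in a few lines. This is a generic illustration, not the MAP4 algorithm: the tokens below are placeholder strings standing in for whatever atom-pair features a real fingerprint would extract from a structure, and the hashing scheme and function names are assumptions.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.blake2b(
        token.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set[str], num_hashes: int = 128) -> list[int]:
    """Keep, for each seeded hash function, the minimum hash over all tokens."""
    if not tokens:
        raise ValueError("cannot MinHash an empty token set")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


# Hypothetical token sets; real features would come from a molecule.
mol_a = {"C|1|N", "C|2|O", "N|3|O", "C|4|C"}
mol_b = {"C|1|N", "C|2|O", "N|3|S", "C|4|C"}
similarity = estimated_jaccard(minhash_signature(mol_a), minhash_signature(mol_b))
```

The appeal of this design is that signatures have a fixed length regardless of molecule size, so small and large molecules can be compared on the same footing.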
Post-Translational Protein Modifications involved in Exo- and Endocytosis of Synaptic Vesicles
Neurotransmitter release is a key step that enables information flow between the pre- and post-synapse. However, regulation of the neurotransmitter release remains an intricate and widely unexplored matter despite recent advances in the understanding of the neurotransmitter release machinery and the analysis of the synaptic proteome and protein modifications. Indeed, post-translational protein modifications such as phosphorylation are suitable to quickly fine-tune the neurotransmitter release “in place” via affecting tertiary protein structures and protein-protein interactions, and globally, via modulating signaling pathways. Here, the investigations were focused on the dependence of protein phosphorylation in synaptosomes on the synaptic vesicle (SV) cycling, determining kinase-substrate interactions, and modulatory effects of selected sites on exo- and endocytosis.
The analysis of the synaptic phosphoproteome was conducted using TiO2-based enrichment of phosphorylated peptides with subsequent chemical labeling by isobaric mass tags (TMT) and mass spectrometry-based quantification. Synaptosomes were employed as a functional model of a synapse as they contain the required neurotransmitter release machinery and respond to stimulation. First, the applicability of electrical stimulation was tested. The field-stimulation evoked reproducible glutamate release that was significantly suppressed in the absence of Ca2+, though it remained uncertain to what degree the release is governed by exocytosis. Therefore, another approach using a KCl-induced depolarization and treatment with botulinum neurotoxins (BoNTs) was used to identify phosphorylation events that depend on SV cycling. BoNTs specifically cleave SNARE proteins and thus block exocytosis and SV cycling, but do not impede Ca2+-influx evoked by the plasma membrane depolarization. Comparison of phosphorylation events in synaptosomes stimulated in the presence of Ca2+, EGTA (0 net Ca2+) or pre-treated with BoNTs identified sites that were differentially phosphorylated following BoNT treatment, i.e., SV-cycling-dependent sites, and sites that were differentially phosphorylated when comparing Ca2+ and EGTA conditions, but did not change under BoNT treatment, i.e., primarily Ca2+-dependent sites. Further differential expression analysis revealed that BoNT-treatment mostly caused de-phosphorylation of synaptic proteins. A kinase-substrate analysis showed that >25% of BoNT-responsive sites are predicted MAPK substrates and 20% of primarily Ca2+-dependent sites are presumably regulated by CaMKII, which corroborates Ca2+-dependence of these phosphorylation events. SV-cycling-dependent phosphorylation sites on syntaxin-1 (T21/T23-Stx1), synaptobrevin (S75-Vamp2), and cannabinoid receptor-1 (S314/T322-Cnr1) were further investigated for their impact on exo- and endocytosis. In collaboration with Dr. Eugenio Fornasiero and Prof. Dr. Silvio O. Rizzoli, corresponding phosphomimetic and non-phosphorylatable variants of the proteins were expressed in cultured hippocampal neurons. Imaging of the pH-sensor pHluorin coupled to synaptobrevin-2 revealed that the expression of phosphomimetic and non-phosphorylatable sites affected exo- and endocytosis in neurons.
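The two-way grouping of phosphorylation sites described above reduces to a simple decision rule, sketched here under stated assumptions. How a site is judged "differentially phosphorylated" (statistics, thresholds) is not given in the abstract, so the sketch takes those calls as boolean inputs; the class and function names are hypothetical.

```python
from enum import Enum


class SiteClass(Enum):
    SV_CYCLING_DEPENDENT = "SV-cycling-dependent"
    PRIMARILY_CA_DEPENDENT = "primarily Ca2+-dependent"
    UNCLASSIFIED = "unclassified"


def classify_site(changed_by_bont: bool, changed_ca_vs_egta: bool) -> SiteClass:
    """Group a phosphorylation site by the comparisons in which it changed.

    changed_by_bont:    differentially phosphorylated after BoNT treatment
    changed_ca_vs_egta: differentially phosphorylated between Ca2+ and EGTA
    """
    if changed_by_bont:
        # Blocking exocytosis altered the site, so it depends on SV cycling.
        return SiteClass.SV_CYCLING_DEPENDENT
    if changed_ca_vs_egta:
        # Responds to Ca2+ but is unaffected by BoNT treatment.
        return SiteClass.PRIMARILY_CA_DEPENDENT
    return SiteClass.UNCLASSIFIED


# Hypothetical site labels, for illustration only.
calls = {"site_A": (True, True), "site_B": (False, True), "site_C": (False, False)}
groups = {name: classify_site(*flags) for name, flags in calls.items()}
```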
This work is the first to investigate the electrical stimulation in relation to the Ca2+-dependent neurotransmitter release and exocytosis in synaptosomes. It further provides a comprehensive draft of the synaptosomal phosphoproteome and is the first to demonstrate its global dependence on an active SV cycling. The analysis of cultured hippocampal neurons expressing non-phosphorylatable and phosphomimetic mutants of pre-synaptic proteins syntaxin-1, synaptobrevin-2, and cannabinoid receptor-1 further demonstrates that the identified SV-cycling-dependent sites affect exo- and endocytosis.
2021-11-0
VARIATIONS IN MICROARRAY BASED GENE EXPRESSION PROFILING: IDENTIFYING SOURCES AND IMPROVING RESULTS
Two major issues hinder the application of microarray based gene expression profiling in clinical laboratories as a diagnostic or prognostic tool. The first issue is the sheer volume and high-dimensionality of gene expression data from microarray experiments, which require advanced algorithms to extract meaningful gene expression patterns that correlate with biological impact. The second issue is the substantial amount of variation in microarray gene expression data, which impairs the performance of analysis methods and makes sharing or integrating microarray data very difficult. Variations can be introduced by many sources, including the DNA microarray technology itself and the experimental procedures. Many of these variations have not been characterized, measured, or linked to their sources. In the first part of this dissertation, a decision tree learning method was demonstrated to perform as well as more popularly accepted classification methods in partitioning cancer samples with microarray data. More importantly, results demonstrate that variation introduced into microarray data by tissue sampling and tissue handling compromised the performance of classification methods. In the second part of this dissertation, variations introduced by the T7 based in vitro transcription labeling methods were investigated in detail. Results demonstrated that individual amplification methods significantly biased gene expression data even though the methods compared in this study were all derivatives of the T7 RNA polymerase based in vitro transcription labeling approach. Variations observed can be partially explained by the number of biotinylated nucleotides used for labeling and the incubation time of the in vitro transcription experiments. These variations can generate discordant gene expression results even when using the same RNA samples and cannot be corrected by post-experiment analysis, including advanced normalization techniques.
Studies in this dissertation stress the concept that experimental and analytical methods must work together. This dissertation also emphasizes the importance of standardizing the DNA microarray technology and experimental procedures in order to optimize gene expression analysis and create quality standards compatible with the clinical application of this technology. These findings should be taken into account especially when comparing data from different platforms, and in standardizing protocols for clinical applications in pathology.
Protein Domain Linker Prediction: A Direction for Detecting Protein – Protein Interactions
Protein chains are generally long and consist of multiple domains. Domains are the basic elements of protein structures that can exist, evolve and function independently. The accurate and reliable identification of protein domains and their interactions has very important impacts in several protein research areas. The accurate prediction of protein domains is a fundamental stage in both experimental and computational proteomics. This knowledge is an initial stage of protein tertiary structure prediction, which can give insight into the way in which a protein works. The knowledge of domains is also useful in classifying proteins, understanding their structures, functions and evolution, and predicting protein-protein interactions (PPI). However, predicting structural domains within proteins is a challenging task in computational biology. A promising direction of domain prediction is detecting inter-domain linkers and then predicting the regions of the protein sequence in which the structural domains are located accordingly.
Protein-protein interactions occur at almost every level of cell function. The identification of interactions among proteins and their associated domains provides a global picture of cellular functions and biological processes. It is also an essential step in the construction of PPI networks for human and other organisms. PPI prediction has been considered a promising alternative to the traditional drug design techniques. The identification of possible viral-host protein interactions can lead to a better understanding of infection mechanisms and, in turn, to the development of drugs and to treatment optimization.
In this work, a compact and accurate approach for inter-domain linker prediction is developed based solely on protein primary structure information. Then, inter-domain linker knowledge is used in predicting structural domains and detecting PPI. The research work in this dissertation can be summarized in three main contributions. The first contribution is predicting protein inter-domain linker regions by introducing the concept of amino acid compositional index and refining the prediction by using the Simulated Annealing optimization technique. The second contribution is identifying structural domains based on inter-domain linker knowledge. The inter-domain linker knowledge, represented by the compositional index, is enhanced by the incorporation of biological knowledge, represented by amino acid physicochemical properties, to develop a well-optimized Random Forest classifier for predicting novel domains and inter-domain linkers. In the third contribution, the domain information knowledge is utilized to predict protein-protein interactions. This is achieved by characterizing structural domains within protein sequences, analyzing their interactions, and predicting protein interactions based on their interacting domains. The experimental studies and the higher accuracy achieved are a valid argument in favor of the proposed framework.
Computational Toxinology
Venoms are complex mixtures of biological macromolecules and other compounds that are used for predatory and defensive purposes by hundreds of thousands of known species worldwide. Throughout human history, venoms and venom components have been used to treat a vast array of illnesses, causing them to be of great clinical, economic, and academic interest to the drug discovery and toxinology communities. In spite of major computational advances that facilitate data-driven drug discovery, most therapeutic venom effects are still discovered via tedious trial-and-error, or simply by accident. In this dissertation, I describe a body of work that aims to establish a new subdiscipline of translational bioinformatics, which I name “computational toxinology”.
To accomplish this goal, I present three integrated components that span a wide range of informatics techniques: (1) VenomKB, (2) VenomSeq, and (3) VenomKB’s Semantic API. To provide a platform for structuring, representing, retrieving, and integrating venom data relevant to drug discovery, VenomKB provides a database-backed web application and knowledge base for computational toxinology. VenomKB is structured according to a fully-featured ontology of venoms, and provides data aggregated from many popular web resources. VenomSeq is a biotechnology workflow that is designed to generate new high-throughput sequencing data for incorporation into VenomKB. Specifically, we expose human cells to controlled doses of crude venoms, conduct RNA-Sequencing, and build profiles of differential gene expression, which we then compare to publicly-available differential expression data for known diseases and drugs with known effects, and use those comparisons to hypothesize ways that the venoms could act in a therapeutic manner, as well. These data are then integrated into VenomKB, where they can be effectively retrieved and evaluated using existing data and known therapeutic associations. VenomKB’s Semantic API further develops this functionality by providing an intelligent, powerful, and user-friendly interface for querying the complex underlying data in VenomKB in a way that reflects the intuitive, human-understandable meaning of those data. The Semantic API is designed to cater to the needs of advanced users as well as laypersons and bench scientists without previous expertise in computational biology and semantic data analysis.
In each chapter of the dissertation, I describe how we evaluated these 3 components through various approaches. We demonstrate the utility of VenomKB and the Semantic API by testing a number of practical use-cases for each, designed to highlight their ability to rediscover existing knowledge as well as suggesting potential areas for future exploration. We use statistics and data science techniques to evaluate VenomSeq on 25 diverse species of venomous animals, and propose biologically feasible explanations for significant findings. In evaluating the Semantic API, I show how observations on VenomSeq data can be interpreted and placed into the context of past research by members of the larger toxinology community.
Computational toxinology is a toolbox designed to be used by multiple stakeholders (toxinologists, computational biologists, and systems pharmacologists, among others) to improve the return rate of clinically-significant findings from manual experimentation. It aims to achieve this goal by enabling access to data, providing means for easy validation of results, and suggesting specific hypotheses that are preliminarily supported by rigorous inferential statistics. All components of the research I describe are open-access and publicly available, to improve reproducibility and encourage widespread adoption.
Relation Prediction over Biomedical Knowledge Bases for Drug Repositioning
Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since not all candidate drugs can be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying other essential relations (e.g., causation, prevention) between biomedical entities is also critical to understand biomedical processes. Hence, it is crucial to develop automated relation prediction systems that can yield plausible biomedical relations to expedite the discovery process. In this dissertation, we demonstrate three approaches to predict treatment relations between biomedical entities for the drug repositioning task using existing biomedical knowledge bases. Our approaches can be broadly labeled as link prediction or knowledge base completion in computer science literature. Specifically, first we investigate the predictive power of graph paths connecting entities in the publicly available biomedical knowledge base, SemMedDB (the entities and relations constitute a large knowledge graph as a whole). To that end, we build logistic regression models utilizing semantic graph pattern features extracted from SemMedDB to predict treatment and causative relations in the Unified Medical Language System (UMLS) Metathesaurus. Second, we study matrix and tensor factorization algorithms for predicting drug repositioning pairs in repoDB, a general purpose gold standard database of approved and failed drug–disease indications. The idea here is to predict repoDB pairs by approximating the given input matrix/tensor structure where the value of a cell represents the existence of a relation coming from SemMedDB and UMLS knowledge bases. The essential goal is to predict the test pairs that have a blank cell in the input matrix/tensor based on the shared biomedical context among existing non-blank cells.
Our final approach involves graph convolutional neural networks where entities and relation types are embedded in a vector space involving neighborhood information. Basically, we minimize an objective function to guide our model to concept/relation embeddings such that distance scores for positive relation pairs are lower than those for the negative ones. Overall, our results demonstrate that recent link prediction methods applied to automatically curated, and hence imprecise, knowledge bases can nevertheless result in high-accuracy drug candidate prediction with appropriate configuration of both the methods and datasets used.
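The shape of such an objective, where positive pairs should score a lower distance than negative ones, can be sketched with a margin-based ranking loss. This is an illustration only: the translation-style distance used here is a swapped-in assumption, not the graph convolutional model of the dissertation, and the function names and margin value are hypothetical.

```python
import math

Vector = list[float]
Triple = tuple[Vector, Vector, Vector]  # (head, relation, tail) embeddings


def distance(head: Vector, relation: Vector, tail: Vector) -> float:
    """Euclidean norm of head + relation - tail (assumed scoring function)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_ranking_loss(
    positives: list[Triple], negatives: list[Triple], margin: float = 1.0
) -> float:
    """Mean hinge loss over paired positive and negative triples.

    The loss is zero once every positive triple is closer than its
    paired negative triple by at least `margin`.
    """
    if not positives or len(positives) != len(negatives):
        raise ValueError("need equally many positive and negative triples")
    losses = [
        max(0.0, margin + distance(*pos) - distance(*neg))
        for pos, neg in zip(positives, negatives)
    ]
    return sum(losses) / len(losses)


# Toy embeddings: the positive triple fits exactly, the negative is 5 units off.
positive = ([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
negative = ([0.0, 0.0], [1.0, 0.0], [4.0, 4.0])
loss = margin_ranking_loss([positive], [negative])
```

Minimizing this quantity with respect to the embeddings pushes positive distances below negative ones, which is the ordering the abstract describes.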
Bioinformatic analysis of bacterial and eukaryotic amino-terminal signal peptides
Ph.D (DOCTOR OF PHILOSOPHY)
Functional prediction of bioactive toxins in scorpion venom through bioinformatics
Ph.D (DOCTOR OF PHILOSOPHY)