New Methods to Study Proline-Rich Disordered Regions and Their Structural Ensembles in Protein Signaling Pathways
Ph.D., Doctor of Philosophy
Solving Challenging Structures using Single-Particle Cryogenic Electron Microscopy
Single-particle cryogenic electron microscopy (cryo-EM) has become a powerful mainstay tool in high resolution structural biology thanks to advances in hardware, software and sample preparation technology. In my thesis, I utilized this technique to unravel the function of various challenging biological macromolecules.
My first focus was bacterial ribosomal biogenesis: understanding how bacteria assemble their ribosomes. Ribosomes are the factories of the cell, responsible for manufacturing all proteins. Ribosomes themselves are huge, with the bacterial version made of 52 proteins and 4566 RNA nucleotides. How these components assemble has long been a mystery. Early groundbreaking work sketched out a biogenesis pathway using purified components in vitro – but under non-physiological conditions. We sought to understand how the bacterial ribosome – specifically the large subunit 50S – is built inside the cell. To achieve this, we engineered a conditional knock-out bacterial strain that lacked one specific ribosomal protein (L17). This caused the cells to accumulate incomplete intermediates along the 50S biogenesis pathway. These intermediates were purified and examined with mass spectrometry and single-particle cryo-EM.
Two major hurdles arose in this project: firstly, the biogenesis intermediates exhibited a preferred orientation when vitrified for cryo-EM analysis. This means that instead of showing many different views required for reconstruction of the 3D structure, the intermediates only adopted one view on the cryo-EM grid. To overcome this problem, we engineered a method to induce additional views on the microscope by tilting the stage. Using another test protein that also exhibited preferred orientation (hemagglutinin), we optimized and characterized this new tilt methodology and showed it was generally applicable to overcoming preferred orientation, regardless of type of specimen. We also created a software tool, called 3DFSC (3dfsc.salk.edu), for other microscopists to calculate the degree of directional anisotropy in their structures due to preferred orientation. Using this tilt strategy finally enabled the structural elucidation of our 50S intermediates. The second challenge in the project was the large amount of heterogeneity present in the sample. Through hierarchical 3D classification schemes using the latest software tools, we obtained 14 different 50S intermediate structures, all from imaging a single cryo-EM grid. By analyzing the missing components of each intermediate, and corroborating these observations with mass spectrometry data, we outlined the first in vivo 50S assembly pathway, and showed that ribosome assembly occurs step-wise and in parallel pathways.
My second focus was on pushing the resolution limits of single-particle cryo-EM using adeno-associated virus (AAV) serotype 2 homogeneous virus-like particles (VLPs) that lack DNA. Exploiting several technical advances to improve resolution, including use of gold grids, per-particle CTF refinement, and correction for Ewald sphere curvature, we managed to obtain a 1.86 Å resolution reconstruction of the AAV2L336C variant VLP, the highest resolution icosahedral virus reconstruction solved by single-particle cryo-EM to date. Using our structure, we were able to show improvements using Ewald sphere curvature correction and shed light on the mechanistic basis as to why the L336C mutation resulted in defects in genome packaging and infectivity compared to the WT viral particles.
My third focus was the understanding of small membrane proteins involved in infectious diseases. Membrane proteins are challenging to work with because, unlike soluble proteins, they must be extracted from the lipid bilayer before they can be studied. Infectious diseases place a huge burden on society, with the top three infectious agents accounting for 2.7 million deaths in 2016. The third most deadly infectious disease is malaria, caused by a mosquito-borne parasite, which kills 450,000 people annually. One drug used early on for treating malaria was chloroquine, but its usefulness waned as resistance developed. Chloroquine resistance is mediated by the chloroquine resistance transporter (PfCRT). Although PfCRT is small (49 kDa) for single-particle cryo-EM, we solved its structure by using fragment antibody technology to add mass and help with image alignment and 3D reconstruction. The 3.2 Å structure resembles other drug metabolite transporters, and the chloroquine resistance mutations map to a ring around the central cavity, suggesting this central pore as the drug binding site.
Tuberculosis (TB) is the top killer, above malaria and HIV/AIDS, being responsible for 1.3 million deaths. In TB, a common antibiotic target is the bacterium’s cell wall synthesis machinery. One family of such enzymes is the arabinosyltransferases, which synthesize the critical arabinose sugars. Using single-particle cryo-EM, we solved two high resolution structures of one such essential enzyme, AftD. Due to the low yield of the protein, a picoliter automated sample dispensing robot was crucial to allow for initial cryo-EM analysis. We then performed mutagenesis studies in M. smegmatis, a TB model organism, which uncovered the critical amino acid residues in the active site and determined that a bound acyl-carrier-protein was likely involved in allosteric inhibition of AftD’s active site. Another member of the family, EmbB, is the target of a widely used frontline TB drug called ethambutol. We have solved the high resolution structures of the apo and putative drug-bound states of EmbB, allowing us to map out, for the first time, both the active site and drug-resistance mutations of this crucial enzyme. The atomic structures of the functional pockets of Mycobacterial AftD and malarial PfCRT will hopefully enable structure-based drug design to improve existing drugs or potentially even develop new treatments against these infectious maladies.
In conclusion, the continual and breathtaking improvements in single-particle cryo-EM methodology have been instrumental in allowing the elucidation of the aforementioned biological macromolecules, from ribosome biogenesis intermediates and the AAV2 vehicle to the Plasmodium drug resistance transporter and mycobacterial glycosyltransferases, structures of which help explain biological function.
The role of dynamic hydrogen bond networks in protonation coupled dynamics of retinal proteins
Hydrogen bonds (H-bonds) are an essential interaction in membrane proteins. For proteins embedded in complex hydrated lipid bilayers, intramolecular interactions through hydrogen-bonding networks are often crucial for function. Internal water molecules that occupy stable sites inside the protein, or water molecules that visit transiently from the bulk, can play an important role in shaping local conformational dynamics: they form complex networks that bridge regions of the protein via water-mediated hydrogen bonds, and these networks can function as wires for transferring protons as part of the protein’s function. For example, the membrane-embedded channelrhodopsins, which are found in archaea, are proteins that couple light-induced isomerization of a retinal chromophore with proton transfer reactions and passive flow of cations through their pore. I contributed to the development of a new algorithm package that features a unique approach to H-bond analyses. I performed analyses of long molecular dynamics (MD) trajectories of channelrhodopsin variants embedded in hydrated lipid membranes, and of large data sets of static structures, to detect and dissect dynamic hydrogen-bond networks. The photocycle of channelrhodopsins begins with absorption of light and isomerization of the retinal from an all-trans state to a 13-cis state, followed by deprotonation of the Schiff base. Thus, the retinal is at the center of the analyses. Using two-dimensional graphs of the protein H-bond networks, I identified protein groups potentially important for the proton transfer activity. Local dynamics are strongly affected by point mutations of amino acids important for function. The interior of channelrhodopsin C1C2 hosts extensive networks of protein groups and H-bonded water molecules, including a previously unreported network that can transiently bridge the two retinal chromophores in channelrhodopsin dimers.
In a recently identified inward proton pump, AntR, I applied centrality measures to MD trajectories of the homology model I generated, to assess the communication between amino acid residues within the networks. I detected a frequently sampled long water chain that connects the retinal with a candidate proton acceptor, and found that a conserved serine in the vicinity of the retinal chromophore plays a significant role in the connectivity and communication of the H-bond networks upon isomerization. A similar water bridge is sampled in independent simulations of ChR2, where a candidate proton donor group connects to the 13-cis,15-anti retinal. Proton transfer reactions often take place through certain amino acids, forming patterns. I analyzed H-bond patterns, or motifs, in large hand-curated datasets of static structures of α-helical transmembrane proteins, organized according to the superfamilies to which they belong, their function, and an alternative classification method. The presence of motifs in TM proteins is tightly related to the family or superfamily of the host protein and to the position of the motif along the membrane normal.
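The graph view of an H-bond network described in this abstract can be illustrated with a minimal sketch. It is not the algorithm package mentioned above: the node labels (`SchiffBase`, `W1`, `W2`, `Ser`, `Asp_acceptor`) and the bond list are hypothetical toy data, and degree centrality stands in for whichever centrality measures the thesis actually used.

```python
from collections import deque


def build_graph(hbonds):
    """Turn a list of (donor, acceptor) pairs into an undirected adjacency map."""
    graph = {}
    for a, b in hbonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search for the shortest H-bonded path between two nodes."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two nodes are not connected


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly H-bonded to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


# Hypothetical toy network: a two-water chain linking the retinal Schiff base
# to a candidate proton acceptor, with a serine H-bonded to the first water.
toy_hbonds = [
    ("SchiffBase", "W1"),
    ("W1", "W2"),
    ("W2", "Asp_acceptor"),
    ("W1", "Ser"),
]
toy_graph = build_graph(toy_hbonds)
water_wire = shortest_path(toy_graph, "SchiffBase", "Asp_acceptor")
centrality = degree_centrality(toy_graph)
```

In a trajectory analysis, one such graph would be built per frame, and the frequency with which a given path appears would indicate how often the water wire is sampled.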
New evolutionary approaches to protein structure prediction
Doctoral program in Biotechnology and Chemical Technology. The problem of Protein Structure Prediction (PSP) is one of the principal topics in bioinformatics, and multiple approaches have been developed to predict the structure of a protein. Determining the three-dimensional structure of proteins is necessary to understand protein function at the molecular level. A useful and commonly used representation of protein 3D structure is the protein contact map, which represents binary proximities (contact or non-contact) between each pair of amino acids of a protein. This thesis includes a compilation of soft computing techniques for the protein structure prediction problem (secondary and tertiary structures). A novel evolutionary secondary structure predictor is also described in detail. The results obtained confirm the validity of our proposal. Furthermore, we propose a multi-objective evolutionary approach for contact map prediction based on physico-chemical properties of amino acids. The evolutionary algorithm produces a set of decision rules that identify contacts between amino acids. The rules obtained by the algorithm impose a set of conditions based on amino acid properties in order to predict contacts. Results obtained by our approach on four different protein data sets are also presented. Finally, a statistical study was performed to extract valid conclusions from the set of prediction rules generated by our algorithm. Universidad Pablo de Olavide. Centro de Estudios de Postgrad
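The contact map representation mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's predictor: the 8 Å distance cutoff, the use of one coordinate per residue, and the toy coordinates are all assumptions chosen for the example.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map from one 3D coordinate per residue.

    Entry [i][j] is 1 when residues i and j lie within `threshold`
    (an assumed cutoff, in angstroms) of each other, else 0.
    The diagonal is left at 0.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= threshold:
                cmap[i][j] = cmap[j][i] = 1  # contacts are symmetric
    return cmap


# Hypothetical coordinates: three residues spaced 3.8 A apart on a line,
# plus a fourth residue far away from the rest.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_map = contact_map(toy_coords)
```

A contact predictor of the kind described above would output a matrix of this shape from sequence-derived features, which can then be compared entry by entry with the map computed from a known structure.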
Modelling the structure and interactions of leukocyte integrins
The structure of heterodimeric transmembrane proteins is complex, and insufficient structural information exists for leukocyte integrin proteins. To determine protein structure, homology modelling was conducted and modelling software was evaluated. Leukocyte integrin homologs were obtained from the PDB and models were generated using online servers and MODELLER. Template homologs were fewer in number and of lower quality in comparison to monomeric extracellular proteins. Models were docked using ClusPro, HADDOCK2.2 and AutoDock Vina. Models were evaluated using PROSA, Verify-3D and PROSESS. Higher quality models were generated when using MODELLER to separately model monomeric subunits in three defined domain regions (extracellular, transmembrane and cytoplasmic). Template selection for these proteins is critical, as an intricate relationship exists between model quality, template quality, template quantity, template resolution, target-template identity and template sequence coverage. Docking monomeric subunits was challenging when using ClusPro, and the best ligand docking procedures were completed using AutoDock Vina. PROSESS provided the most accurate evaluation of protein models, in comparison to PROSA and Verify-3D. These results indicate that although homology modelling is a powerful tool, there is much room for improvement. The set of experimentally obtained templates within the PDB should be expanded, and energy functions should cater for both monomeric and transmembrane heterodimeric proteins. Leukocyte integrins appear to adopt a closed conformation, which may still facilitate LDV ligand association within the α/β interface. The α3β1 integrin may interact with laminin-5 through the ELV sequence within the G-domain of the α laminin subunit.
Characterization of Coenzyme Q Biosynthesis Proteins through Integrative Modeling at the Protein-Membrane Interface
Integral and peripheral membrane proteins account for one-third of the human proteome, and they are estimated to represent the target for over 50% of modern medicinal drugs. Despite their central role in medicine, the complex, heterogeneous and dynamic nature of biological membranes complicates the investigation of their mechanism of action by both experimental and computational techniques. Among the different membrane bound compartments in eukaryotic cells, mitochondria are highly complex in form and function, and they harbor a unique proteome that remains largely unexplored. A growing number of inherited metabolic diseases are associated with mitochondrial dysfunction, which necessitates the structural and functional elucidation of mitochondrial proteins. In this thesis, we combine experimental and computational methods to explore the activity of COQ8 and COQ9, two functionally elusive proteins of the biosynthetic complex that produces coenzyme Q, a redox-active lipid component of the mitochondrial electron transport chain.
(i) Conserved Lipid Modulation of Ancient Kinase-Like UbiB Family Member COQ8.
We demonstrate that COQ8 has an ATPase function that is activated when it specifically associates with cardiolipin-containing membranes. We identify its interaction surface with the inner mitochondrial membrane, which gives hints about the possible interaction surfaces with other members of the coenzyme Q synthesis machinery and has implications on how it mediates functional interactions with lipids. Collectively, this work reveals how the positioning of COQ8 on the inner mitochondrial membrane is key to its activation, and therefore advances our understanding of the COQ8 function.
(ii) Membrane, Lipid, and Protein Interactions of Coenzyme Q Biosynthesis Protein COQ9.
We explore the lipid binding activity of COQ9, and we reveal that COQ9 repurposes an ancient bacterial fold to selectively bind aromatic isoprenes, including CoQ intermediates that reside within the bilayer. We elucidate the mechanistic details of its membrane binding process, by which COQ9 warps the membrane surface and creates a tightly sealed hydrophobic region to access its lipid cargo. Finally, we establish a potential molecular interface between COQ9 and COQ7, the enzyme that catalyzes the penultimate step in CoQ biosynthesis, suggesting a model whereby COQ9 presents intermediates to CoQ enzymes to overcome the hydrophobic barrier of the membrane. Collectively, our results provide a mechanism for how a lipid binding protein might access, select, and extract specific cargo from a membrane and present it to a peripheral membrane enzyme.
In conclusion, our work illustrates the interplay between experiment and modeling in protein research, and specifically in understanding how proteins perform their action in direct synergy with membrane environments. We anticipate our integrative methodologies and mechanistic findings will prove relevant to other membrane proteins, whose fine functional modulation at the membrane-water interface has been historically challenging to characterize.
Bionano-Interfaces through Peptide Design
The clinical success of restoring bone and tooth function through implants critically depends on the maintenance of an infection-free, integrated interface between the host tissue and the biomaterial surface. Surgical site infections, defined as infections occurring within one year of surgery, occur in approximately 160,000-300,000 cases in the US annually. Antibiotics are the conventional treatment for the prevention of infections, but they are becoming ineffective because their widespread use has led to bacterial antibiotic resistance. There is an urgent need both to combat bacterial drug resistance through new antimicrobial agents and to limit the spread of drug resistance by limiting their delivery to the implant site. This work aims to reduce surgical site infections from implants by designing chimeric antimicrobial peptides that integrate a novel and effective delivery method. In recent years, antimicrobial peptides (AMPs) have attracted interest as natural sources for new antimicrobial agents. As part of the immune system in all life forms, they are examples of antibacterial agents with successfully maintained efficacy across evolutionary time. Both natural and synthetic AMPs show significant promise for solving the antibiotic resistance problem. In this work, AMP1 and AMP2 were shown in Chapter 4 to be active against three different strains of pathogens. In the literature, these peptides have been shown to be effective against multi-drug resistant bacteria. However, the difficulty of delivering them effectively to the implantation site limits their clinical use. In recent years, different groups have adapted covalent chemistry-based or non-specific physical adsorption methods for antimicrobial peptide coatings on implant surfaces. Many of these procedures use harsh chemical conditions requiring multiple reaction steps. Furthermore, none of these methods allows control over the orientation of these molecules on the surfaces, which is an essential consideration for biomolecules.
In the last few decades, solid binding peptides attracted high interest due to their material specificity and self-assembly properties. These peptides offer robust surface adsorption and assembly in diverse applications. In this work, a design method for chimeric antimicrobial peptides that can self-assemble and self-orient onto biomaterial surfaces was demonstrated. Three specific aims used to address this two-fold strategy of self-assembly and self-orientation are: 1) Develop classification and design methods using rough set theory and genetic algorithm search to customize antibacterial peptides; 2) Develop chimeric peptides by designing spacer sequences to improve the activity of antimicrobial peptides on titanium surfaces; 3) Verify the approach as an enabling technology by expanding the chimeric design approach to other biomaterials. In Aim 1, a peptide classification tool was developed because the selection of an antimicrobial peptide for an application was difficult among the thousands of peptide sequences available. A rule-based rough-set theory classification algorithm was developed to group antimicrobial peptides by chemical properties. This work is the first time that rough set theory has been applied to peptide activity analysis. The classification method on benchmark data sets resulted in low false discovery rates. The novel rough set theory method was combined with a novel genetic algorithm search, resulting in a method for customizing active antibacterial peptides using sequence-based relationships. Inspired by the fact that spacer sequences play critical roles between functional protein domains, in Aim 2, chimeric peptides were designed to combine solid binding functionality with antimicrobial functionality. To improve how these functions worked together in the same peptide sequence, new spacer sequences were engineered. 
The rough set theory method from Aim 1 was used to find structure-based relationships and discover new spacer sequences, which improved the antimicrobial activity of the chimeric peptides. In Aim 3, the proposed approach is demonstrated as an enabling technology. Calcium phosphate was tested, and the results verified the modularity of the chimeric antimicrobial self-assembling peptide approach. Other chimeric peptides were designed for the common biomaterials zirconia and urethane polymer. Finally, an antimicrobial peptide was engineered for a dental adhesive system, applying spacer design concepts to optimize the antimicrobial activity.
Machine learning applications for the topology prediction of transmembrane beta-barrel proteins
The research topic for this PhD thesis focuses on the topology prediction of beta-barrel transmembrane proteins. Transmembrane proteins adopt various conformations that are related to the functions they perform. The two most predominant classes are alpha-helix bundles and beta-barrel transmembrane proteins. Alpha-helix proteins are present in larger numbers than beta-barrel transmembrane proteins in structure databases. Therefore, there is a need to find computational tools that can predict and detect the structure of beta-barrel transmembrane proteins. Transmembrane proteins are used for active transport across the membrane or signal transduction. Given the importance of these roles, it is essential to understand the structures of the proteins. Transmembrane proteins are also a significant focus for new drug discovery. Transmembrane beta-barrel proteins play critical roles in the translocation machinery, pore formation, membrane anchoring, and ion exchange. In bioinformatics, many years of research have been spent on the topology prediction of transmembrane alpha-helices. Efforts on TMB (transmembrane beta-barrel) protein topology prediction have been overshadowed, and the prediction accuracy could be improved with further research. Various methodologies have been developed in the past to predict TMB protein topology. Methods available in the literature include turn identification, hydrophobicity profiles, rule-based prediction, HMM (Hidden Markov Model), ANN (Artificial Neural Networks), radial basis function networks, or combinations of methods. The use of a cascading classifier has never been fully explored. This research presents and evaluates approaches such as ANN (Artificial Neural Networks), KNN (K-Nearest Neighbors), SVM (Support Vector Machines), and a novel approach to TMB topology prediction based on a cascading classifier. Computer simulations have been implemented in MATLAB, and the results have been evaluated.
Data were collected from various datasets and pre-processed for each machine learning technique. A deep neural network was built with an input layer, hidden layers, and an output layer. The cascading classifier was optimised mainly by optimising each machine learning algorithm used, starting from the parameters that gave the best results for each algorithm. The cascading classifier results show that the proposed methodology predicts transmembrane beta-barrel protein topologies with high accuracy for randomly selected proteins. Using the cascading classifier approach, the best overall accuracy is 76.3%, with a precision of 0.831 and a recall (probability of detection) of 0.799 for TMB topology prediction. The accuracy of 76.3% is achieved using a two-layer cascading classifier. By constructing and using various machine-learning frameworks, systems were developed to analyse the TMB topologies with significant robustness. We have presented several experimental findings that may be useful for future research. The cascading classifier represents a novel approach to the topology prediction of TMB proteins.
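For readers comparing the reported figures, the relationship between precision, recall, and their harmonic mean (F1) can be sketched as below. The F1 value is derived here from the reported precision and recall and is not a number stated in the abstract; the confusion counts in the toy example are hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


# Hypothetical confusion counts, for illustration only.
toy_precision, toy_recall, toy_f1 = precision_recall_f1(tp=8, fp=2, fn=2)

# F1 implied by the precision and recall reported in the abstract.
reported_f1 = f1_from(0.831, 0.799)  # roughly 0.81
```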
Biological Protein Patterning Systems across the Domains of Life: from Experiments to Modelling
Distinct localisation of macromolecular structures relative to cell shape is a common feature across the domains of life. One mechanism for achieving spatiotemporal intracellular organisation is the Turing reaction-diffusion system (e.g. the Min system in the bacterium Escherichia coli, which controls cell division). In this thesis, I explore potential Turing systems in archaea and eukaryotes as well as the effects of subdiffusion. Recently, a MinD homologue, MinD4, in the archaeon Haloferax volcanii was found to form a dynamic spatiotemporal pattern that is distinct from that of E. coli in its localisation and function. I investigate all four archaeal Min paralogue systems in H. volcanii by identifying four putative MinD activator proteins based on their genomic location, and show that they alter motility but do not control MinD4 patterning. Additionally, one of these proteins shows remarkably fast dynamic motion with speeds comparable to eukaryotic molecular motors, while its function appears to be to control motility via interaction with the archaellum. In metazoa, neurons are highly specialised cells whose functions rely on the proper segregation of proteins to the axonal and somatodendritic compartments. These compartments are bounded by a structure called the axon initial segment (AIS), which is precisely positioned in the proximal axonal region during early neuronal development. How neurons control these self-organised localisations is poorly understood. Using a top-down analysis of developing neurons in vitro, I show that the AIS lies at the nodal plane of the first non-homogeneous spatial harmonic of the neuron shape, while a key axonal protein, Tau, is distributed with a concentration that matches the same harmonic. These results are consistent with an underlying Turing patterning system which remains to be identified. The complex intracellular environment often gives rise to subdiffusive dynamics of molecules that may affect patterning.
To simulate the subdiffusive transport of biopolymers, I develop a stochastic simulation algorithm based on the continuous time random walk framework, which is then applied to a model of a dimeric molecular motor. This provides insight into the effects of subdiffusion on motor dynamics, where subdiffusion reduces motor speed while increasing the stall force. Overall, this thesis makes progress towards understanding intracellular patterning systems in different organisms, across the domains of life.
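A generic continuous time random walk (CTRW) can be sketched in a few lines. This is not the thesis's algorithm or its motor model: the one-dimensional lattice, the unbiased unit jumps, the Pareto waiting-time law, and the parameter names `alpha` and `tau0` are all assumptions made for illustration. Heavy-tailed waiting times with exponent `alpha` between 0 and 1 have no finite mean, which is the standard route to subdiffusion in this framework.

```python
import random


def ctrw_trajectory(alpha, t_max, tau0=1.0, seed=0):
    """Simulate an unbiased 1D CTRW up to time t_max.

    Waiting times follow a Pareto law, P(wait > w) = (w / tau0) ** -alpha
    for w >= tau0, sampled by inverse transform. Each jump moves the
    walker one lattice site left or right with equal probability.
    """
    rng = random.Random(seed)
    t, x = 0.0, 0
    times, positions = [0.0], [0]
    while True:
        u = 1.0 - rng.random()              # uniform on (0, 1], avoids u == 0
        wait = tau0 * u ** (-1.0 / alpha)   # heavy-tailed waiting time
        if t + wait > t_max:
            break                           # next jump would fall after t_max
        t += wait
        x += rng.choice((-1, 1))
        times.append(t)
        positions.append(x)
    return times, positions


# Example run with assumed parameters.
times, positions = ctrw_trajectory(alpha=0.6, t_max=1000.0)
```

Averaging the squared displacement over many seeds would be one way to check that it grows more slowly than linearly in time.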
Development of a deep learning-based computational framework for the classification of protein sequences
Master's dissertation in Bioinformatics. Proteins are among the most important biological structures in living organisms, since they perform multiple biological functions. Each protein has different characteristics and properties, which can be exploited in many fields, such as industrial biotechnology and clinical applications, with a positive impact.
Modern high-throughput methods allow protein sequencing, which provides protein sequence data. Machine learning methodologies are applied to characterize proteins using information from the protein sequence. However, a major problem associated with this approach is how to properly encode the protein sequences without losing the biological relationship between the amino acid residues. The transformation of the protein sequence into a numeric representation is done by encoder methods. The main objective of this project is therefore to study different encoders and identify the methods which yield the best biological representation of the protein sequences when used in machine learning (ML) models to predict different labels related to their function.
The methods were analyzed in two case studies. The first is related to enzymes, since they are a well-established case in the literature. The second used transporter sequences, a less studied case in the literature. In both cases, the data were collected from the curated database Swiss-Prot. The encoders that were tested include: calculated protein descriptors; substitution-matrix methods; position-specific scoring matrices; and encoding by pre-trained transformer methods. Encoding protein sequences with state-of-the-art pre-trained transformers proved to give a good biological representation for subsequent use in state-of-the-art ML methods. Namely, the ESM-1b transformer achieved a Matthews correlation coefficient above 0.9 for any multiclassification task of the transporter classification system.
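The Matthews correlation coefficient quoted above can be computed for multiclass labels with the standard generalisation of the binary formula. The sketch below is a plain-Python illustration under that assumption, not the dissertation's evaluation code, and the label lists are hypothetical toy data.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient.

    Uses MCC = (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2)),
    where c is the number of correct predictions, s the number of samples,
    t_k the number of samples truly in class k and p_k the number predicted
    as class k. Returns 0.0 when the denominator is zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    s = len(y_true)
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    cross = sum(pred_counts[k] * true_counts[k] for k in classes)
    denom = math.sqrt(
        (s * s - sum(v * v for v in pred_counts.values()))
        * (s * s - sum(v * v for v in true_counts.values()))
    )
    return (c * s - cross) / denom if denom else 0.0


# Hypothetical labels: a perfect three-class prediction and an imperfect
# two-class prediction.
perfect = multiclass_mcc(["a", "b", "c", "a"], ["a", "b", "c", "a"])
imperfect = multiclass_mcc([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

For two classes this expression reduces to the familiar binary coefficient, so the second toy case (two true positives, two true negatives, one false positive, one false negative) gives one third.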