51 research outputs found
PARROT is a flexible recurrent neural network framework for analysis of large protein datasets
The rise of high-throughput experiments has transformed how scientists approach biological questions. The ubiquity of large-scale assays that can test thousands of samples in a day has necessitated the development of new computational approaches to interpret this data. Among these tools, machine learning approaches are increasingly being utilized due to their ability to infer complex nonlinear patterns from high-dimensional data. Despite their effectiveness, machine learning (and in particular deep learning) approaches are not always accessible or easy to implement for those with limited computational expertise. Here we present PARROT, a general framework for training and applying deep learning-based predictors on large protein datasets. Using an internal recurrent neural network architecture, PARROT is capable of tackling both classification and regression tasks while only requiring raw protein sequences as input. We showcase the potential uses of PARROT on three diverse machine learning tasks: predicting phosphorylation sites, predicting transcriptional activation function of peptides generated by high-throughput reporter assays, and predicting the fibrillization propensity of amyloid beta with data generated by deep mutational scanning. Through these examples, we demonstrate that PARROT is easy to use, performs comparably to state-of-the-art computational tools, and is applicable for a wide array of biological problems
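The abstract above notes that the framework needs only raw protein sequences as input. As a minimal sketch of what feeding raw sequences to a recurrent network usually involves, the snippet below one-hot encodes a sequence over the 20 standard amino acids. This is an illustrative assumption about a common encoding scheme, not PARROT's actual API; the names `AMINO_ACIDS` and `one_hot_encode` are hypothetical.

```python
# Hypothetical illustration: one-hot encoding of a raw protein sequence.
# This is NOT PARROT's API; it only sketches a common input encoding.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq):
    """Return one 20-element 0/1 vector per residue in seq.

    Raises KeyError for characters outside the 20 standard amino acids.
    """
    rows = []
    for aa in seq.upper():
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1
        rows.append(row)
    return rows
```

Each sequence becomes a length-by-20 matrix, which a recurrent network can consume one residue at a time regardless of sequence length.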
SARS-CoV-2 requires cholesterol for viral entry and pathological syncytia formation
Many enveloped viruses induce multinucleated cells (syncytia), reflective of membrane fusion events caused by the same machinery that underlies viral entry. These syncytia are thought to facilitate replication and evasion of the host immune response. Here, we report that co-culture of human cells expressing the receptor ACE2 with cells expressing SARS-CoV-2 spike results in synapse-like intercellular contacts that initiate cell-cell fusion, producing syncytia resembling those we identify in lungs of COVID-19 patients. To assess the mechanism of spike/ACE2-driven membrane fusion, we developed a microscopy-based, cell-cell fusion assay to screen ~6000 drugs and >30 spike variants. Together with quantitative cell biology approaches, the screen reveals an essential role for biophysical aspects of the membrane, particularly cholesterol-rich regions, in spike-mediated fusion, which extends to replication-competent SARS-CoV-2 isolates. Our findings potentially provide a molecular basis for positive outcomes reported in COVID-19 patients taking statins and suggest new strategies for therapeutics targeting the membrane of SARS-CoV-2 and other fusogenic viruses
SHEPHARD: A modular and extensible software architecture for analyzing and annotating large protein datasets
MOTIVATION: The emergence of high-throughput experiments and high-resolution computational predictions has led to an explosion in the quality and volume of protein sequence annotations at proteomic scales. Unfortunately, sanity checking, integrating, and analyzing complex sequence annotations remains logistically challenging and introduces a major barrier to entry for even superficial integrative bioinformatics.
RESULTS: To address this technical burden, we have developed SHEPHARD, a Python framework that trivializes large-scale integrative protein bioinformatics. SHEPHARD combines an object-oriented hierarchical data structure with database-like features, enabling programmatic annotation, integration, and analysis of complex datatypes. Importantly, SHEPHARD is easy to use and enables a Pythonic interrogation of large-scale protein datasets with millions of unique annotations. We use SHEPHARD to examine three orthogonal proteome-wide questions relating protein sequence to molecular function, illustrating its ability to uncover novel biology.
AVAILABILITY AND IMPLEMENTATION: We provide SHEPHARD as both a stand-alone software package (https://github.com/holehouse-lab/shephard) and a Google Colab notebook with a collection of precomputed proteome-wide annotations (https://github.com/holehouse-lab/shephard-colab)
Sequence determinants of in cell condensate morphology, dynamics, and oligomerization as measured by number and brightness analysis
BACKGROUND: Biomolecular condensates are non-stoichiometric assemblies that are characterized by their capacity to spatially concentrate biomolecules and play a key role in cellular organization. Proteins that drive the formation of biomolecular condensates frequently contain oligomerization domains and intrinsically disordered regions (IDRs), both of which can contribute multivalent interactions that drive higher-order assembly. Our understanding of the relative and temporal contribution of oligomerization domains and IDRs to the material properties of in vivo biomolecular condensates is limited. Similarly, the spatial and temporal dependence of protein oligomeric state inside condensates has been largely unexplored in vivo.
METHODS: In this study, we combined quantitative microscopy with number and brightness analysis to investigate the aging, material properties, and protein oligomeric state of biomolecular condensates in vivo. Our work is focused on condensates formed by AUXIN RESPONSE FACTOR 19 (ARF19), a transcription factor integral to the auxin signaling pathway in plants. ARF19 contains a large central glutamine-rich IDR and a C-terminal Phox Bem1 (PB1) oligomerization domain and forms cytoplasmic condensates.
RESULTS: Our results reveal that the IDR amino acid composition can influence the morphology and material properties of ARF19 condensates. In contrast, the distribution of oligomeric species within condensates appears insensitive to the IDR composition. In addition, we identified a relationship between the abundance of higher- and lower-order oligomers within individual condensates and their apparent fluidity.
CONCLUSIONS: IDR amino acid composition affects condensate morphology and material properties. In ARF condensates, altering the amino acid composition of the IDR did not greatly affect the oligomeric state of proteins within the condensate.
Clustering of aromatic residues in prion-like domains can tune the formation, state, and organization of biomolecular condensates
In immature oocytes, Balbiani bodies are conserved membraneless condensates implicated in oocyte polarization, the organization of mitochondria, and long-term organelle and RNA storage.
Intrinsically disordered regions are poised to act as sensors of cellular chemistry
Intrinsically disordered proteins and protein regions (IDRs) are abundant in eukaryotic proteomes and play a wide variety of essential roles. Instead of folding into a stable structure, IDRs exist in an ensemble of interconverting conformations whose structure is biased by sequence-dependent interactions. The absence of a stable 3D structure, combined with high solvent accessibility, means that IDR conformational biases are inherently sensitive to changes in their environment. Here, we argue that IDRs are ideally poised to act as sensors and actuators of cellular physicochemistry. We review the physical principles that underlie IDR sensitivity, the molecular mechanisms that translate this sensitivity to function, and recent studies where environmental sensing by IDRs may play a key role in their downstream function
RNA-induced conformational switching and clustering of G3BP drive stress granule assembly by condensation
Stressed cells shut down translation, release mRNA molecules from polysomes, and form stress granules (SGs) via a network of interactions that involve G3BP. Here we focus on the mechanistic underpinnings of SG assembly. We show that, under non-stress conditions, G3BP adopts a compact auto-inhibited state stabilized by electrostatic intramolecular interactions between the intrinsically disordered acidic tracts and the positively charged arginine-rich region. Upon release from polysomes, unfolded mRNAs outcompete G3BP auto-inhibitory interactions, engendering a conformational transition that facilitates clustering of G3BP through protein-RNA interactions. Subsequent physical crosslinking of G3BP clusters drives RNA molecules into networked RNA/protein condensates. We show that G3BP condensates impede RNA entanglement and recruit additional client proteins that promote SG maturation or induce a liquid-to-solid transition that may underlie disease. We propose that condensation coupled to conformational rearrangements and heterotypic multivalent interactions may be a general principle underlying RNP granule assembly
Direct prediction of intrinsically disordered protein conformational properties from sequence
Intrinsically disordered regions (IDRs) are ubiquitous across all domains of life and play a range of functional roles. While folded domains are generally well described by a stable three-dimensional structure, IDRs exist in a collection of interconverting states known as an ensemble. This structural heterogeneity means that IDRs are largely absent from the Protein Data Bank, contributing to a lack of computational approaches to predict ensemble conformational properties from sequence. Here we combine rational sequence design, large-scale molecular simulations and deep learning to develop ALBATROSS, a deep-learning model for predicting ensemble dimensions of IDRs, including the radius of gyration, end-to-end distance, polymer-scaling exponent and ensemble asphericity, directly from sequences at a proteome-wide scale. ALBATROSS is lightweight, easy to use and accessible as both a locally installable software package and a point-and-click-style interface via Google Colab notebooks. We first demonstrate the applicability of our predictors by examining the generalizability of sequence-ensemble relationships in IDRs. Then, we leverage the high-throughput nature of ALBATROSS to characterize the sequence-specific biophysical behavior of IDRs within and between proteomes
Unfolded states under folding conditions accommodate sequence-specific conformational preferences with random coil-like dimensions
Proteins are marginally stable molecules that fluctuate between folded and unfolded states. Here, we provide a high-resolution description of unfolded states under refolding conditions for the N-terminal domain of the L9 protein (NTL9). We use a combination of time-resolved Förster resonance energy transfer (FRET) based on multiple pairs of minimally perturbing labels, time-resolved small-angle X-ray scattering (SAXS), all-atom simulations, and polymer theory. Upon dilution from high denaturant, the unfolded state undergoes rapid contraction. Although this contraction occurs before the folding transition, the unfolded state remains considerably more expanded than the folded state and accommodates a range of local and nonlocal contacts, including secondary structures and native and nonnative interactions. Paradoxically, despite discernible sequence-specific conformational preferences, the ensemble-averaged properties of unfolded states are consistent with those of canonical random coils, namely polymers in indifferent (theta) solvents. These findings are concordant with theoretical predictions based on coarse-grained models and inferences drawn from single-molecule experiments regarding the sequence-specific scaling behavior of unfolded proteins under folding conditions
Directed mutational scanning reveals a balance between acidic and hydrophobic residues in strong human activation domains
Acidic activation domains are intrinsically disordered regions of the transcription factors that bind coactivators. The intrinsic disorder and low evolutionary conservation of activation domains have made it difficult to identify the sequence features that control activity. To address this problem, we designed thousands of variants in seven acidic activation domains and measured their activities with a high-throughput assay in human cell culture. We found that strong activation domain activity requires a balance between the number of acidic residues and aromatic and leucine residues. These findings motivated a predictor of acidic activation domains that scans the human proteome for clusters of aromatic and leucine residues embedded in regions of high acidity. This predictor identifies known activation domains and accurately predicts previously unidentified ones. Our results support a flexible acidic exposure model of activation domains in which the acidic residues solubilize hydrophobic motifs so that they can interact with coactivators. A record of this paper's transparent peer review process is included in the supplemental information
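The scanning idea described in the abstract above (clusters of aromatic and leucine residues embedded in regions of high acidity) can be illustrated with a simple sliding-window sketch. This is not the authors' predictor: the function name `scan_candidates`, the window size, and both thresholds are hypothetical values chosen only for illustration.

```python
# Hypothetical sliding-window sketch; NOT the published predictor.
# Window size and thresholds below are illustrative assumptions.

AROMATIC_LEU = set("WFYL")  # aromatic residues plus leucine
ACIDIC = set("DE")
BASIC = set("KR")


def scan_candidates(seq, window=39, min_hydrophobic=7, max_net_charge=-9):
    """Return (start, end) index pairs for windows that contain at least
    min_hydrophobic aromatic/leucine residues and whose net charge
    (basic count minus acidic count) is at most max_net_charge."""
    hits = []
    # The range is empty when the sequence is shorter than the window.
    for start in range(0, max(len(seq) - window + 1, 0)):
        tile = seq[start:start + window]
        n_hydrophobic = sum(aa in AROMATIC_LEU for aa in tile)
        net_charge = (sum(aa in BASIC for aa in tile)
                      - sum(aa in ACIDIC for aa in tile))
        if n_hydrophobic >= min_hydrophobic and net_charge <= max_net_charge:
            hits.append((start, start + window))
    return hits
```

Requiring both conditions in the same window mirrors the balance the abstract describes: a window rich in hydrophobic residues but lacking acidity, or the reverse, is not reported.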
- …