Machine Learning Models to Interrogate Proteome-Wide Covalent Ligandabilities Directed at Cysteines
Machine learning (ML) identification of covalently ligandable sites
may accelerate targeted covalent inhibitor design and help expand
the druggable proteome space. Here, we report the rigorous development
and validation of tree-based models and convolutional neural networks
(CNNs) trained on a newly curated database (LigCys3D) of over 1000
liganded cysteines in nearly 800 proteins represented by over 10,000
three-dimensional structures in the Protein Data Bank (PDB). In tests
on unseen data, the tree models and CNNs achieved areas under the receiver
operating characteristic curve of 94% and 93%, respectively. Based on
AlphaFold2-predicted structures, the ML models recapitulated the newly liganded
cysteines in the PDB with recall values over 90%. To assist the covalent
drug discovery community, we report the predicted ligandable cysteines
in 392 human kinases and their locations in the sequence-aligned kinase
structure, including the PH and SH2 domains. Furthermore, we disseminate
a searchable online database, LigCys3D (https://ligcys.computchem.org/), and a web
prediction server, DeepCys (https://deepcys.computchem.org/), both of which will be
continuously updated and improved by including
newly published experimental data. The present work represents an
important step toward the ML-led integration of big genome data and
structure models to annotate the human proteome space for next-generation
covalent drug discovery.
