Massive experimental quantification allows interpretable deep learning of protein aggregation

Abstract

Protein aggregation is a pathological hallmark of more than 50 human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the aggregation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts aggregation from sequence. We adapt genomic neural network interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive experimental analysis of random sequence-spaces and provide an interpretable and robust neural network model to predict aggregation.

This work received support from the following: “La Caixa” Foundation (ID 100010434) under grant agreement LCF/PR/HR21/52410004 (B.L.); European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme grant agreement 883742 (B.L.); AXA Research Fund AXA Chair in Risk prediction in age-related diseases (B.L.); Secretariat of Universities and Research, Ministry of Enterprise and Knowledge of the Government of Catalonia and the European Social Funds 2017 SGR 1322 (B.L.); Bettencourt Schueller Foundation (B.L.); PID2023-146685NB-I00 funded by MCIN/AEI/10.13039/501100011033/FEDER, UE; Wellcome 220540/Z/20/A, “Wellcome Sanger Institute Quinquennial Review 2021-2026” (B.L.); Spanish Ministry of Science, Innovation and Universities PID2021-127761OB-I00 (B.B.); RYC2020-028861-I funded by MCIN/AEI/10.13039/501100011033 “ERDF A way of making Europe” and “ESF Investing in your future” (B.B.); European Union (ERC Consolidator, Glam-MAP, 101125484) (B.B.); EMBO Fellowship ALTF 266-2023 (M.T.); NIH grant R01HG012131 (P.K. and C.R.); and NIH grant R01GM149921 (P.K. and C.R.).

Last time updated on 08/11/2025

This paper was published in UPF Digital Repository.

Licence: info:eu-repo/semantics/openAccess