98 research outputs found

    Best practices for machine learning in antibody discovery and development

    Over the past 40 years, the discovery and development of therapeutic antibodies to treat disease has become common practice. However, as therapeutic antibody constructs are becoming more sophisticated (e.g., multi-specifics), conventional approaches to optimisation are increasingly inefficient. Machine learning (ML) promises to open up an in silico route to antibody discovery and help accelerate the development of drug products using a reduced number of experiments and hence cost. Over the past few years, we have observed rapid developments in the field of ML-guided antibody discovery and development (D&D). However, many of the results are difficult to compare or hard to assess for utility by other experts in the field due to the high diversity in the datasets, evaluation techniques, and metrics that are used across industry and academia. This limitation of the literature curtails the broad adoption of ML across the industry and slows down overall progress in the field, highlighting the need to develop standards and guidelines that may help improve the reproducibility of ML models across different research groups. To address these challenges, we set out in this perspective to critically review current practices, explain common pitfalls, and clearly define a set of method development and evaluation guidelines that can be applied to different types of ML-based techniques for therapeutic antibody D&D. Specifically, we address, in an end-to-end analysis, the challenges associated with all aspects of the ML process and recommend a set of best practices for each stage.

    Improving generalization of machine learning-identified biomarkers with causal modeling: an investigation into immune receptor diagnostics

    Machine learning is increasingly used to discover diagnostic and prognostic biomarkers from high-dimensional molecular data. However, a variety of factors related to experimental design may affect the ability to learn generalizable and clinically applicable diagnostics. Here, we argue that a causal perspective improves the identification of these challenges, and formalizes their relation to the robustness and generalization of machine learning-based diagnostics. To make for a concrete discussion, we focus on a specific, recently established high-dimensional biomarker - adaptive immune receptor repertoires (AIRRs). We discuss how the main biological and experimental factors of the AIRR domain may influence the learned biomarkers and provide easily adjustable simulations of such effects. In conclusion, we find that causal modeling improves machine learning-based biomarker robustness by identifying stable relations between variables and by guiding the adjustment of the relations and variables that vary between populations.

    ImmunoLingo: Linguistics-based formalization of the antibody language

    Apparent parallels between natural language and biological sequences have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, a lack of a rigorous linguistic formalization of biological sequence languages, which would define basic components, such as lexicon (i.e., the discrete units of the language) and grammar (i.e., the rules that link sequence well-formedness, structure, and meaning) has led to largely domain-unspecific applications of LMs, which do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically-informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as the ones underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish not only a foundation for the application of linguistic tools in adaptive immune receptor analysis but also for the systematic immunolinguistic studies of immune receptor specificity in general. Comment: 19 pages, 3 figures

    Linguistically inspired roadmap for building biologically reliable protein language models

    Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships. Comment: 27 pages, 4 figures

    A minimal model of peptide binding predicts ensemble properties of serum antibodies

    Background: The importance of peptide microarrays as a tool for serological diagnostics has strongly increased over the last decade. However, interpretation of the binding signals is still hampered by our limited understanding of the technology. This is in particular true for arrays probed with antibody mixtures of unknown complexity, such as sera. To gain insight into how signals depend on peptide amino acid sequences, we probed random-sequence peptide microarrays with sera of healthy and infected mice. We analyzed the resulting antibody binding profiles with regression methods and formulated a minimal model to explain our findings. Results: Multivariate regression analysis relating peptide sequence to measured signals led to the definition of amino acid-associated weights. Although these weights do not contain information on amino acid position, they predict up to 40-50% of the binding profiles' variation. Mathematical modeling shows that this position-independent ansatz is only adequate for highly diverse random antibody mixtures which are not dominated by a few antibodies. Experimental results suggest that sera from healthy individuals correspond to that case, in contrast to sera of infected ones. Conclusions: Our results indicate that position-independent amino acid-associated weights predict linear epitope binding of antibody mixtures only if the mixture is random, highly diverse, and contains no dominant antibodies. The discovered ensemble property is an important step towards an understanding of peptide-array serum-antibody binding profiles. It has implications for both serological diagnostics and B cell epitope mapping.
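
The position-independent ansatz in the abstract above can be illustrated with a minimal sketch: the predicted signal of a peptide is the sum of one weight per residue, so any reordering of the same residues gives the same prediction. The weight values, function names, and example peptides below are hypothetical and invented purely for illustration; the study itself estimates such weights by multivariate regression on measured array signals, which this sketch does not reproduce.

```python
from collections import Counter
from typing import Mapping

# Hypothetical per-residue weights, invented for illustration only.
# They are NOT values reported by the study.
EXAMPLE_WEIGHTS = {"A": 0.1, "K": 0.8, "W": 1.2, "G": -0.3}


def composition(peptide: str) -> Counter:
    """Count how often each amino acid occurs, discarding position."""
    return Counter(peptide)


def predicted_signal(peptide: str, weights: Mapping[str, float]) -> float:
    """Sum the weight of every residue; residues without a weight add 0."""
    counts = composition(peptide)
    return sum(weights.get(aa, 0.0) * n for aa, n in counts.items())


if __name__ == "__main__":
    # Two peptides with the same residues in a different order
    # receive the same predicted signal under this model.
    for peptide in ("KWAG", "GAWK"):
        print(peptide, round(predicted_signal(peptide, EXAMPLE_WEIGHTS), 6))
```

Because the model sees only residue counts, it cannot distinguish a dominant antibody that recognises a specific linear motif, which is consistent with the abstract's statement that the ansatz fits only highly diverse mixtures.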

    The TCR Repertoire Reconstitution in Multiple Sclerosis: Comparing One-Shot and Continuous Immunosuppressive Therapies

    Natalizumab (NTZ) and autologous hematopoietic stem cell transplantation (AHSCT) are two successful treatments for relapsing-remitting multiple sclerosis (RRMS), an autoimmune T-cell-driven disorder affecting the central nervous system that is characterized by relapses interspersed with periods of complete or partial recovery. Both RRMS treatments have been documented to impact T-cell subpopulations and the T-cell receptor (TCR) repertoire in terms of clone frequency, but, so far, the link between T-cell naive and memory populations, autoimmunity, and treatment outcome has not yet been established, hindering insight into the post-treatment TCR landscape of MS patients. To address this important knowledge gap, we tracked peripheral T-cell subpopulations (naive and memory CD4+ and CD8+) across 15 RRMS patients before and after two years of continuous treatment (NTZ) and a single treatment course (AHSCT) by high-throughput TCRβ sequencing. We found that the two MS treatments left treatment-specific multidimensional traces in patient TCRβ repertoire dynamics with respect to clonal expansion, clonal diversity, and repertoire architecture. Comparing MS TCR sequences with published datasets suggested that the majority of public TCRs belonged to virus-associated sequences. In summary, applying multidimensional computational immunology to a TCRβ dataset of treated MS patients, we show that qualitative changes of TCRβ repertoires encode treatment-specific information that may be relevant for future clinical trials monitoring and personalized MS follow-up, diagnosis, and treatment regimens.

    Toward real-world automated antibody design with combinatorial Bayesian optimization

    Antibodies are multimeric proteins capable of highly specific molecular recognition. The complementarity determining region 3 of the antibody variable heavy chain (CDRH3) often dominates antigen-binding specificity. Hence, it is a priority to design optimal antigen-specific CDRH3 to develop therapeutic antibodies. The combinatorial structure of CDRH3 sequences makes it impossible to query binding-affinity oracles exhaustively. Moreover, antibodies are expected to have high target specificity and developability. Here, we present AntBO, a combinatorial Bayesian optimization framework utilizing a CDRH3 trust region for an in silico design of antibodies with favorable developability scores. The in silico experiments on 159 antigens demonstrate that AntBO is a step toward practically viable in vitro antibody design. In under 200 calls to the oracle, AntBO suggests antibodies outperforming the best binding sequence from 6.9 million experimentally obtained CDRH3s. Additionally, AntBO finds very-high-affinity CDRH3 in only 38 protein designs while requiring no domain knowledge.
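
To make concrete why exhaustive querying is infeasible, and what restricting the search to a trust region can look like, here is a minimal sketch. It is not AntBO's implementation. It assumes, for illustration only, that a trust region is the set of sequences within a fixed Hamming distance of a centre sequence; the abstract does not define it. The sequence length of 11, the centre sequence, the radius, and all function names are hypothetical choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def search_space_size(length: int) -> int:
    """Number of distinct sequences of the given length."""
    return len(AMINO_ACIDS) ** length


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def sample_in_trust_region(center: str, radius: int, rng: random.Random) -> str:
    """Return a sequence differing from `center` at 1..radius positions."""
    n_mut = rng.randint(1, radius)
    positions = rng.sample(range(len(center)), n_mut)
    seq = list(center)
    for pos in positions:
        # Always substitute a different residue so the position really changes.
        alternatives = [aa for aa in AMINO_ACIDS if aa != center[pos]]
        seq[pos] = rng.choice(alternatives)
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    center = "ARDGYSSGWYF"  # arbitrary, made-up 11-residue sequence
    print(search_space_size(len(center)))  # 20**11 candidate sequences
    candidate = sample_in_trust_region(center, radius=3, rng=rng)
    print(candidate, hamming(center, candidate))
```

Under these assumptions, each proposed candidate stays close to the current centre. The abstract reports that AntBO needs fewer than 200 oracle calls, against the 20**11 candidates an exhaustive search would face at this illustrative length.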

    Apyrase-mediated amplification of secretory IgA promotes intestinal homeostasis

    Secretory immunoglobulin A (SIgA) interaction with commensal bacteria conditions microbiota composition and function. However, mechanisms regulating reciprocal control of microbiota and SIgA are not defined. Bacteria-derived adenosine triphosphate (ATP) limits T follicular helper (Tfh) cells in the Peyer's patches (PPs) via P2X7 receptor (P2X7R) and thereby SIgA generation. Here we show that hydrolysis of extracellular ATP (eATP) by apyrase results in amplification of the SIgA repertoire. The enhanced breadth of SIgA in mice colonized with apyrase-releasing Escherichia coli influences topographical distribution of bacteria and expression of genes involved in metabolic versus immune functions in the intestinal epithelium. SIgA-mediated conditioning of bacteria and enterocyte function is reflected by differences in nutrient absorption in mice colonized with apyrase-expressing bacteria. Apyrase-induced SIgA improves intestinal homeostasis and attenuates barrier impairment and susceptibility to infection by enteric pathogens in antibiotic-induced dysbiosis. Therefore, amplification of SIgA by apyrase can be leveraged to restore intestinal fitness in dysbiotic conditions.