98 research outputs found
Best practices for machine learning in antibody discovery and development
Over the past 40 years, the discovery and development of therapeutic
antibodies to treat disease has become common practice. However, as therapeutic
antibody constructs are becoming more sophisticated (e.g., multi-specifics),
conventional approaches to optimisation are increasingly inefficient. Machine
learning (ML) promises to open up an in silico route to antibody discovery and
help accelerate the development of drug products using a reduced number of
experiments and hence cost. Over the past few years, we have observed rapid
developments in the field of ML-guided antibody discovery and development
(D&D). However, many of the results are difficult to compare or hard to assess
for utility by other experts in the field due to the high diversity in the
datasets, evaluation techniques, and metrics that are used across industry and
academia. This limitation of the literature curtails the broad adoption of ML
across the industry and slows down overall progress in the field, highlighting
the need to develop standards and guidelines that may help improve the
reproducibility of ML models across different research groups. To address these
challenges, we set out in this perspective to critically review current
practices, explain common pitfalls, and clearly define a set of method
development and evaluation guidelines that can be applied to different types of
ML-based techniques for therapeutic antibody D&D. Specifically, in an
end-to-end analysis, we address challenges associated with all aspects of the
ML process and recommend a set of best practices for each stage.
Improving generalization of machine learning-identified biomarkers with causal modeling: an investigation into immune receptor diagnostics
Machine learning is increasingly used to discover diagnostic and prognostic
biomarkers from high-dimensional molecular data. However, a variety of factors
related to experimental design may affect the ability to learn generalizable
and clinically applicable diagnostics. Here, we argue that a causal perspective
improves the identification of these challenges, and formalizes their relation
to the robustness and generalization of machine learning-based diagnostics. To
make for a concrete discussion, we focus on a specific, recently established
high-dimensional biomarker - adaptive immune receptor repertoires (AIRRs). We
discuss how the main biological and experimental factors of the AIRR domain may
influence the learned biomarkers and provide easily adjustable simulations of
such effects. In conclusion, we find that causal modeling improves machine
learning-based biomarker robustness by identifying stable relations between
variables and by guiding the adjustment of the relations and variables that
vary between populations.
ImmunoLingo: Linguistics-based formalization of the antibody language
Apparent parallels between natural language and biological sequence have led
to a recent surge in the application of deep language models (LMs) to the
analysis of antibody and other biological sequences. However, a lack of a
rigorous linguistic formalization of biological sequence languages, which would
define basic components, such as lexicon (i.e., the discrete units of the
language) and grammar (i.e., the rules that link sequence well-formedness,
structure, and meaning) has led to largely domain-unspecific applications of
LMs, which do not take into account the underlying structure of the biological
sequences studied. A linguistic formalization, on the other hand, establishes
linguistically-informed and thus domain-adapted components for LM applications.
It would facilitate a better understanding of how differences and similarities
between natural language and biological sequences influence the quality of LMs,
which is crucial for the design of interpretable models with extractable
sequence-function relationship rules, such as the ones underlying the antibody
specificity prediction problem. Deciphering the rules of antibody specificity
is crucial to accelerating rational and in silico biotherapeutic drug design.
Here, we formalize the properties of the antibody language and thereby
establish not only a foundation for the application of linguistic tools in
adaptive immune receptor analysis but also for the systematic immunolinguistic
studies of immune receptor specificity in general.
Comment: 19 pages, 3 figures
Linguistically inspired roadmap for building biologically reliable protein language models
Deep neural-network-based language models (LMs) are increasingly applied to
large-scale protein sequence data to predict protein function. However, being
largely black-box models and thus challenging to interpret, current protein LM
approaches do not contribute to a fundamental understanding of
sequence-function mappings, hindering rule-based biotherapeutic drug
development. We argue that guidance drawn from linguistics, a field specialized
in analytical rule extraction from natural language data, can aid with building
more interpretable protein LMs that are more likely to learn relevant
domain-specific rules. Differences between protein sequence data and linguistic
sequence data require the integration of more domain-specific knowledge in
protein LMs compared to natural language LMs. Here, we provide a
linguistics-based roadmap for protein LM pipeline choices with regard to
training data, tokenization, token embedding, sequence embedding, and model
interpretation. Incorporating linguistic ideas into protein LMs enables the
development of next-generation interpretable machine-learning models with the
potential of uncovering the biological mechanisms underlying sequence-function
relationships.
Comment: 27 pages, 4 figures
A minimal model of peptide binding predicts ensemble properties of serum antibodies
Background: The importance of peptide microarrays as a tool for serological diagnostics has strongly increased over the last decade. However, interpretation of the binding signals is still hampered by our limited understanding of the technology. This is in particular true for arrays probed with antibody mixtures of unknown complexity, such as sera. To gain insight into how signals depend on peptide amino acid sequences, we probed random-sequence peptide microarrays with sera of healthy and infected mice. We analyzed the resulting antibody binding profiles with regression methods and formulated a minimal model to explain our findings. Results: Multivariate regression analysis relating peptide sequence to measured signals led to the definition of amino acid-associated weights. Although these weights do not contain information on amino acid position, they predict up to 40-50% of the binding profiles' variation. Mathematical modeling shows that this position-independent ansatz is only adequate for highly diverse random antibody mixtures which are not dominated by a few antibodies. Experimental results suggest that sera from healthy individuals correspond to that case, in contrast to sera of infected ones. Conclusions: Our results indicate that position-independent amino acid-associated weights predict linear epitope binding of antibody mixtures only if the mixture is random, highly diverse, and contains no dominant antibodies. The discovered ensemble property is an important step towards an understanding of peptide-array serum-antibody binding profiles. It has implications for both serological diagnostics and B cell epitope mapping.
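The position-independent ansatz described in this abstract can be illustrated with a small sketch. This is not the authors' code or data. It assumes equal-length peptides, purely synthetic signals, and ordinary least squares as the regression method. The names `composition_matrix`, `fit_weights`, and `explained_variance` are hypothetical and introduced only for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_matrix(peptides):
    """Count each amino acid per peptide; positional information is discarded."""
    index = {aa: j for j, aa in enumerate(AMINO_ACIDS)}
    counts = np.zeros((len(peptides), len(AMINO_ACIDS)))
    for i, peptide in enumerate(peptides):
        for aa in peptide:
            counts[i, index[aa]] += 1
    return counts


def fit_weights(peptides, signals):
    """Least-squares fit of one weight per amino acid (assumed regression method)."""
    X = composition_matrix(peptides)
    y = np.asarray(signals, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


def explained_variance(peptides, signals, weights):
    """Fraction of signal variance reproduced by the position-independent model."""
    y = np.asarray(signals, dtype=float)
    predicted = composition_matrix(peptides) @ weights
    return 1.0 - np.sum((y - predicted) ** 2) / np.sum((y - y.mean()) ** 2)


# Synthetic demonstration only: random peptides and made-up "true" weights.
rng = np.random.default_rng(0)
letters = rng.choice(list(AMINO_ACIDS), size=(500, 10))
peptides = ["".join(row) for row in letters]
true_weights = rng.normal(size=len(AMINO_ACIDS))
signals = composition_matrix(peptides) @ true_weights + rng.normal(scale=0.1, size=500)

weights = fit_weights(peptides, signals)
r_squared = explained_variance(peptides, signals, weights)
print(f"explained variance on synthetic data: {r_squared:.3f}")
```

Because the synthetic signals are generated from the same additive model, the fit is close to perfect here. The abstract reports that real serum data reach at most 40-50% explained variation.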
The TCR Repertoire Reconstitution in Multiple Sclerosis: Comparing One-Shot and Continuous Immunosuppressive Therapies
Natalizumab (NTZ) and autologous hematopoietic stem cell transplantation (AHSCT) are two successful treatments for relapsing-remitting multiple sclerosis (RRMS), an autoimmune T-cell-driven disorder affecting the central nervous system that is characterized by relapses interspersed with periods of complete or partial recovery. Both RRMS treatments have been documented to impact T-cell subpopulations and the T-cell receptor (TCR) repertoire in terms of clone frequency, but the link between T-cell naive and memory populations, autoimmunity, and treatment outcome has not yet been established, hindering insight into the post-treatment TCR landscape of MS patients. To address this important knowledge gap, we tracked peripheral T-cell subpopulations (naive and memory CD4+ and CD8+) across 15 RRMS patients before and after 2 years of continuous treatment (NTZ) and a single treatment course (AHSCT) by high-throughput TCRβ sequencing. We found that the two MS treatments left treatment-specific multidimensional traces in patient TCRβ repertoire dynamics with respect to clonal expansion, clonal diversity, and repertoire architecture. Comparing MS TCR sequences with published datasets suggested that the majority of public TCRs belonged to virus-associated sequences. In summary, applying multidimensional computational immunology to a TCRβ dataset of treated MS patients, we show that qualitative changes of TCRβ repertoires encode treatment-specific information that may be relevant for future clinical trial monitoring and personalized MS follow-up, diagnosis, and treatment regimens.
Toward real-world automated antibody design with combinatorial Bayesian optimization
Antibodies are multimeric proteins capable of highly specific molecular recognition. The complementarity determining region 3 of the antibody variable heavy chain (CDRH3) often dominates antigen-binding specificity. Hence, it is a priority to design optimal antigen-specific CDRH3 to develop therapeutic antibodies. The combinatorial structure of CDRH3 sequences makes it impossible to query binding-affinity oracles exhaustively. Moreover, antibodies are expected to have high target specificity and developability. Here, we present AntBO, a combinatorial Bayesian optimization framework utilizing a CDRH3 trust region for an in silico design of antibodies with favorable developability scores. The in silico experiments on 159 antigens demonstrate that AntBO is a step toward practically viable in vitro antibody design. In under 200 calls to the oracle, AntBO suggests antibodies outperforming the best binding sequence from 6.9 million experimentally obtained CDRH3s. Additionally, AntBO finds very-high-affinity CDRH3 in only 38 protein designs while requiring no domain knowledge
Apyrase-mediated amplification of secretory IgA promotes intestinal homeostasis.
Secretory immunoglobulin A (SIgA) interaction with commensal bacteria conditions microbiota composition and function. However, mechanisms regulating reciprocal control of microbiota and SIgA are not defined. Bacteria-derived adenosine triphosphate (ATP) limits T follicular helper (Tfh) cells in the Peyer's patches (PPs) via P2X7 receptor (P2X7R) and thereby SIgA generation. Here we show that hydrolysis of extracellular ATP (eATP) by apyrase results in amplification of the SIgA repertoire. The enhanced breadth of SIgA in mice colonized with apyrase-releasing Escherichia coli influences topographical distribution of bacteria and expression of genes involved in metabolic versus immune functions in the intestinal epithelium. SIgA-mediated conditioning of bacteria and enterocyte function is reflected by differences in nutrient absorption in mice colonized with apyrase-expressing bacteria. Apyrase-induced SIgA improves intestinal homeostasis and attenuates barrier impairment and susceptibility to infection by enteric pathogens in antibiotic-induced dysbiosis. Therefore, amplification of SIgA by apyrase can be leveraged to restore intestinal fitness in dysbiotic conditions
- …