    Detection of enriched T cell epitope specificity in full T cell receptor sequence repertoires

    High-throughput T cell receptor (TCR) sequencing allows the characterization of an individual's TCR repertoire and directly queries their immune state. However, it remains a non-trivial task to couple these sequenced TCRs to their antigenic targets. In this paper, we present a novel strategy to annotate full TCR sequence repertoires with their epitope specificities. The strategy is based on a machine learning algorithm to learn the TCR patterns common to the recognition of a specific epitope. These results are then combined with a statistical analysis to evaluate the occurrence of specific epitope-reactive TCR sequences per epitope in repertoire data. In this manner, we can directly study the capacity of full TCR repertoires to target specific epitopes of the relevant vaccines or pathogens. We demonstrate the usability of this approach on three independent datasets related to vaccine monitoring and infectious disease diagnostics by independently identifying the epitopes that are targeted by the TCR repertoire. The developed method is freely available as a web tool for academic use at tcrex.biodatamining.be
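The abstract does not say which statistical test is used for the enrichment step. As a minimal sketch only, and assuming a one-sided binomial test against a background hit rate, the idea can be illustrated as follows. The function name `binomial_tail` and the parameters `k`, `n` and `p0` are hypothetical and are not taken from the TCRex tool.

```python
import math


def binomial_tail(k, n, p0):
    """Return P(X >= k) for X ~ Binomial(n, p0), summed in log space.

    k  : number of repertoire TCRs predicted to recognise the epitope
    n  : total number of TCRs in the repertoire
    p0 : assumed background rate of such predictions (hypothetical input)
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p0), math.log1p(-p0)
    total = 0.0
    for i in range(k, n + 1):
        # log of the binomial probability mass at i, via log-gamma to avoid overflow
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


# Toy numbers: 12 predicted hits among 10000 TCRs, assumed background rate 0.05%.
p_value = binomial_tail(12, 10000, 0.0005)
```

A small p-value would suggest that the repertoire contains more epitope-reactive TCRs than the assumed background rate explains. In practice the background rate would have to be estimated, for example from a reference repertoire.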

    Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification

    The prediction of epitope recognition by T-cell receptors (TCRs) has seen many advancements in recent years, with several methods now available that can predict recognition for a specific set of epitopes. However, the generic case of evaluating all possible TCR-epitope pairs remains challenging, mainly due to the high diversity of the interacting sequences and the limited amount of currently available training data. In this work, we provide an overview of the current state of this unsolved problem. First, we examine appropriate validation strategies to accurately assess the generalization performance of generic TCR-epitope recognition models when applied to both seen and unseen epitopes. In addition, we present a novel feature representation approach, which we call ImRex (interaction map recognition). This approach is based on the pairwise combination of physicochemical properties of the individual amino acids in the CDR3 and epitope sequences, which provides a convolutional neural network with the combined representation of both sequences. Lastly, we highlight various challenges that are specific to TCR-epitope data and that can adversely affect model performance. These include the issue of selecting negative data, the imbalanced epitope distribution of curated TCR-epitope datasets and the potential exchangeability of TCR alpha and beta chains. Our results indicate that while extrapolation to unseen epitopes remains a difficult challenge, ImRex makes this feasible for a subset of epitopes that are not too dissimilar from the training data. We show that appropriate feature engineering methods and rigorous benchmark standards are required to create and validate TCR-epitope predictive models
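The interaction map idea can be sketched in a few lines. This is an illustration under stated assumptions, not the ImRex implementation: it uses a single property (the Kyte-Doolittle hydropathy index), combines residues by absolute difference, and zero-pads at the bottom and right. The abstract specifies none of these three choices. The 20 x 11 default is likewise an assumption, chosen to match the maximum CDR3 and epitope lengths given in the dataset description.

```python
# Kyte-Doolittle hydropathy index, used here as one example property.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def interaction_map(cdr3, epitope, max_cdr3=20, max_epitope=11, scale=HYDROPATHY):
    """Return a zero-padded max_cdr3 x max_epitope grid of pairwise values.

    Cell (i, j) holds |scale[cdr3[i]] - scale[epitope[j]]|; the absolute
    difference is an assumed combination operator, chosen for illustration.
    """
    if len(cdr3) > max_cdr3 or len(epitope) > max_epitope:
        raise ValueError("sequence is longer than the padded map")
    grid = [[0.0] * max_epitope for _ in range(max_cdr3)]
    for i, cdr3_residue in enumerate(cdr3):
        for j, epitope_residue in enumerate(epitope):
            grid[i][j] = abs(scale[cdr3_residue] - scale[epitope_residue])
    return grid


# One channel of the "image" for a toy CDR3-epitope pair.
channel = interaction_map("CASSIRSSYEQYF", "GILGFVFTL")
```

Stacking one such grid per physicochemical property gives a multi-channel, image-like input, which is what lets a convolutional neural network see both sequences in a single representation.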

    Data for "Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification" (ImRex)

    Repository containing the different experiments described in the manuscript titled: "Current challenges for epitope-agnostic TCR interaction prediction and a new perspective derived from image classification". Publication DOI: TBA. Originally appeared as a preprint on bioRxiv: https://doi.org/10.1101/2019.12.18.880146.

    Contains:
    Trained model files (.h5).
    Associated train and validation datasets for each model.
    Learning curves and evaluation metrics.
    Log files with training and data arguments (full training scripts are available in the GitHub repository).
    Comparisons between different models.
    Complete raw and processed datasets (also available in the associated GitHub repository).

    Please refer to the associated GitHub repository (https://github.com/pmoris/ImRex) for more information on the directory structure and contents, as well as the scripts that generated these output files. Note that any file paths described here are in reference to that repository.

    Contents:
    data.zip: contains raw and preprocessed datasets. READMEs in the subdirectories describe the data sources and preprocessing steps. Please refer to the associated GitHub repository for the specific scripts that generated these files. Note that the full training and test sets (i.e. containing both positive and negative examples) are stored separately for each model/CV iteration in the models archives.
    models-main.zip: contains the trained models and evaluation metrics for the main experiments described in the bash and pbs scripts in ./src/scripts/hpc_scripts. Log files for the experiments outlined here can be found in ./src/scripts/hpc_scripts.
    models-full.zip: contains models that were trained on the complete VDJdb dataset without cross-validation, filtered on human TRB data, without 10x data, and restricted to CDR3 sequences of 10-20 and epitopes of 8-11 amino acid residues, with negatives that were generated by shuffling (i.e. sampling a negative epitope for each positive CDR3 sequence). One set of models uses downsampling to reduce the most abundant epitopes to 400 pairs each; the other does not use any downsampling. These models were also used for evaluation on the external Adaptive dataset, as outlined in ./src/scripts/evaluate/evaluate_adaptive.sh, and on the TRA subset of sequences (./src/scripts/evaluate/evaluate_tra.sh).
    models-decoyfit.zip: contains models that were trained on true data, but evaluated on data where epitopes were replaced by decoys.
    models-padded-epitoperatio.zip: contains a quick test of trained models (padded/interaction map) that use a different type of negative shuffling; see the docstrings in ./src/processing/negative_sampler.py for more information.
    models-repeat-local.zip: contains a number of repeated runs from models-main, used to estimate the variability in model performance across multiple identical runs.
    comparisons.zip: contains comparison directories, each consisting of two or more model output directories, that contrast the performance metrics of the models. These outputs were generated by using the ./src/scripts/evaluate/visualize.py script, or by using the oneliners in ./src/scripts/evaluate/visualise.sh, which can operate on the entire comparisons directory at once.

    Overview of different experiments:
    Two main architectures were compared: the interaction map (or padded) CNN and a dual input CNN based on NetTCR (nettcr).
    Two different cross-validation strategies were used: a 5x repeated 5-fold CV (repeated5fold) and an epitope-grouped CV (epitope_grouped).
    Two different methods of generating negative TCR-epitope pairs were used: shuffling of positive pairs, i.e. sampling a single epitope from the positive pairs for each CDR3 sequence (shuffle), and sampling CDR3s from a reference repertoire (negref).
    The batch size is labelled as b32 (a batch size of 32).
    The learning rate was either 0.0001 (lre4) or 0.001 (lre3).

    The different dataset subsets are labelled as follows. Check the Makefile's preprocess-vdjdb-aug-2019 command (and the underlying script ./src/scripts/preprocessing/preprocess_vdjdb.py) for a more thorough overview of the different filtering options.
    mhci: only MHCI class presented epitopes.
    trb: only TRB CDR3 sequences.
    tra: only TRA CDR3 sequences.
    tratrb: both types of CDR3 sequences.
    down: moderate downsampling of most abundant epitopes to 1000 pairs.
    down400: strong downsampling of most abundant epitopes to 400 pairs.
    decoy: decoy epitope data.
    reg001: regularization factor 0.01 (only for padded/interaction type models, fixed value).
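The downsampling and shuffle-based negative generation described above can be sketched as follows. This is a simplified illustration and not the code in ./src/processing/negative_sampler.py. The function names `downsample` and `shuffled_negatives` are hypothetical, and excluding pairs that are already known positives is an assumption.

```python
import random
from collections import defaultdict


def downsample(pairs, cap, rng):
    """Keep at most `cap` (cdr3, epitope) pairs per epitope, chosen at random."""
    by_epitope = defaultdict(list)
    for cdr3, epitope in pairs:
        by_epitope[epitope].append((cdr3, epitope))
    kept = []
    for group in by_epitope.values():
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept


def shuffled_negatives(pairs, rng):
    """Pair each CDR3 with one epitope sampled from the positive pairs.

    Sampling from the positive list keeps the epitope frequencies of the
    positives; combinations that are known positives are skipped (assumption).
    """
    positives = set(pairs)
    epitopes = [epitope for _, epitope in pairs]
    negatives = []
    for cdr3, _ in pairs:
        candidates = [e for e in epitopes if (cdr3, e) not in positives]
        if candidates:  # a CDR3 with no valid partner gets no negative
            negatives.append((cdr3, rng.choice(candidates)))
    return negatives


# Toy example with made-up CDR3 sequences and three epitopes.
toy_pairs = [("CASS" + aa + "YEQYF", "GILGFVFTL") for aa in "ADEFGL"]
toy_pairs += [("CASSPGTEAFF", "NLVPMVATV"), ("CSARDRTGNGYTF", "GLCTLVAML")]
rng = random.Random(0)
positives = downsample(toy_pairs, cap=3, rng=rng)
negatives = shuffled_negatives(positives, rng)
```

The candidate list is rebuilt for every CDR3, so the sketch is quadratic in the number of pairs. That is fine for a toy example but too slow for a dataset the size of VDJdb without further optimisation.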