Data for "Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification" (ImRex)

Abstract

Repository containing the different experiments described in the manuscript titled "Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification". Publication DOI: TBA. Originally appeared as a preprint on bioRxiv: https://doi.org/10.1101/2019.12.18.880146.

Contains:

- Trained model files (.h5)
- Associated train and validation datasets for each model
- Learning curves and evaluation metrics
- Log files with training and data arguments (full training scripts are available in the GitHub repository)
- Comparisons between different models
- Complete raw and processed datasets (also available in the associated GitHub repository)

Please refer to the associated GitHub repository (https://github.com/pmoris/ImRex) for more information on the directory structure and contents, as well as the scripts that generated these output files.

Contents:

- data.zip: contains the raw and preprocessed datasets. READMEs in the subdirectories describe the data sources and preprocessing steps. Please refer to the associated GitHub repository for the specific scripts that generated these files. Note that the full training and test sets (i.e. containing both positive and negative examples) are stored separately for each model/CV iteration in the model archives.
- models-main.zip: contains the trained models and evaluation metrics for the main experiments, described in the bash and PBS scripts in ./src/scripts/hpc_scripts. Log files for these experiments can also be found in ./src/scripts/hpc_scripts.
- models-full.zip: contains models that were trained on the complete VDJdb dataset without cross-validation, filtered on human TRB data, excluding 10x data and restricted to 10-20 (CDR3) or 8-11 (epitope) amino acid residues, with negatives generated by shuffling (i.e. sampling a negative epitope for each positive CDR3 sequence). One set of models uses downsampling to reduce the most abundant epitopes to 400 pairs each; the other does not use any downsampling. These models were also used for the evaluation on the external Adaptive dataset, as outlined in ./src/scripts/evaluate/evaluate_adaptive.sh, and on the TRA subset of sequences (./src/scripts/evaluate/evaluate_tra.sh).
- models-decoyfit.zip: contains models that were trained on true data, but evaluated on data where the epitopes were replaced by decoys.
- models-padded-epitoperatio.zip: contains a quick test of trained models (padded/interaction map) that use a different type of negative shuffling; see the docstrings in ./src/processing/negative_sampler.py for more information.
- models-repeat-local.zip: contains a number of repeated runs from models-main, used to estimate the variability in model performance across multiple identical runs.
- comparisons.zip: contains comparison directories, each consisting of two or more model output directories, that contrast the performance metrics of those models. These outputs were generated by the ./src/scripts/evaluate/visualize.py script, or by the one-liners in ./src/scripts/evaluate/visualise.sh, which can operate on the entire comparisons directory at once.

Note that any file paths described here refer to the associated GitHub repository (https://github.com/pmoris/ImRex).

Overview of the different experiments:

Two main architectures were compared: the interaction map (or padded) CNN and a dual-input CNN based on NetTCR (nettcr).

Two different cross-validation strategies were used: a 5x repeated 5-fold CV (repeated5fold) and an epitope-grouped CV (epitope_grouped).

The different dataset subsets are labelled as follows. Check the Makefile's preprocess-vdjdb-aug-2019 command (and the underlying script ./src/scripts/preprocessing/preprocess_vdjdb.py) for a more thorough overview of the different filtering options.

- mhci: only MHC class I presented epitopes.
- trb: only TRB CDR3 sequences.
- tra: only TRA CDR3 sequences.
- tratrb: both types of CDR3 sequences.
- down: moderate downsampling of the most abundant epitopes to 1000 pairs.
- down400: strong downsampling of the most abundant epitopes to 400 pairs.
- decoy: decoy epitope data.
- reg001: regularization factor 0.01 (only for padded/interaction map models, fixed value).

Two different methods of generating negative TCR-epitope pairs were used: shuffling of positive pairs, i.e. sampling a different epitope from the positive pairs for each CDR3 sequence (shuffle), and sampling CDR3 sequences from a reference repertoire (negref).

The batch size is indicated as b32 (a batch size of 32). The learning rate was always 0.0001 (lre4) or 0.001 (lre3).
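The shuffle strategy for generating negatives can be illustrated with the following minimal sketch. This is not the repository's ./src/processing/negative_sampler.py implementation; the function name and the rejection-sampling loop are assumptions for illustration only.

```python
import random

def shuffle_negatives(positive_pairs, seed=42):
    """Create negative pairs by giving each CDR3 a different epitope.

    For every positive (cdr3, epitope) pair, sample an epitope from the
    pool of positive epitopes, rejecting any epitope that the CDR3 is
    already known to bind. This preserves the CDR3 and epitope
    distributions of the positive data in the negative set.
    Note: this toy loop assumes no CDR3 binds every epitope in the pool.
    """
    rng = random.Random(seed)
    epitope_pool = [epitope for _, epitope in positive_pairs]
    known_positives = set(positive_pairs)
    negatives = []
    for cdr3, _ in positive_pairs:
        candidate = rng.choice(epitope_pool)
        while (cdr3, candidate) in known_positives:
            candidate = rng.choice(epitope_pool)
        negatives.append((cdr3, candidate))
    return negatives

# Toy positive pairs (hypothetical sequences, not from the dataset).
positives = [("CASSLGQAYEQYF", "GILGFVFTL"),
             ("CASSIRSSYEQYF", "NLVPMVATV"),
             ("CASSPGTGELFF", "KLGGALQAK")]
negatives = shuffle_negatives(positives)
```

Each CDR3 thus appears once in the positives and once in the negatives, paired with an epitope it is not annotated as binding.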
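The down/down400 downsampling (capping the most abundant epitopes at 1000 or 400 pairs) can be sketched as follows. This is an illustrative reimplementation, not the repository's preprocessing code; the function name and data layout are assumptions.

```python
import random
from collections import defaultdict

def downsample_epitopes(pairs, max_pairs=400, seed=42):
    """Cap the number of retained pairs per epitope at max_pairs.

    Groups (cdr3, epitope) pairs by epitope and randomly keeps at most
    max_pairs of them for epitopes exceeding the cap, leaving rarer
    epitopes untouched. max_pairs=400 corresponds to down400,
    max_pairs=1000 to down.
    """
    rng = random.Random(seed)
    by_epitope = defaultdict(list)
    for cdr3, epitope in pairs:
        by_epitope[epitope].append((cdr3, epitope))
    kept = []
    for epitope, group in by_epitope.items():
        if len(group) > max_pairs:
            group = rng.sample(group, max_pairs)
        kept.extend(group)
    return kept

# Toy example: a dominant epitope with 1000 CDR3s and a rare one with 50.
pairs = [(f"CDR3_{i}", "GILGFVFTL") for i in range(1000)]
pairs += [(f"CDR3_{i}", "KLGGALQAK") for i in range(50)]
reduced = downsample_epitopes(pairs, max_pairs=400)  # 400 + 50 pairs remain
```

Capping abundant epitopes limits the degree to which a few heavily studied epitopes dominate the training signal.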
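The filtering applied for models-full (human TRB data, CDR3 length 10-20, epitope length 8-11) can be sketched like this. The real logic lives in ./src/scripts/preprocessing/preprocess_vdjdb.py; the record layout and field names below are simplified assumptions, not VDJdb's actual column names.

```python
def filter_pairs(records, cdr3_range=(10, 20), epitope_range=(8, 11)):
    """Keep only human TRB records within the given length ranges.

    Each record is a dict with 'species', 'gene', 'cdr3' and 'epitope'
    keys (an assumed layout). Exclusion of the 10x Genomics data would
    happen in a separate step based on the record's source study.
    """
    kept = []
    for rec in records:
        if rec["species"] != "HomoSapiens" or rec["gene"] != "TRB":
            continue
        if not (cdr3_range[0] <= len(rec["cdr3"]) <= cdr3_range[1]):
            continue
        if not (epitope_range[0] <= len(rec["epitope"]) <= epitope_range[1]):
            continue
        kept.append(rec)
    return kept

# Toy records (hypothetical sequences, not from the dataset).
records = [
    {"species": "HomoSapiens", "gene": "TRB",
     "cdr3": "CASSLGQAYEQYF", "epitope": "GILGFVFTL"},   # kept
    {"species": "HomoSapiens", "gene": "TRA",
     "cdr3": "CAVRDSNYQLIW", "epitope": "GILGFVFTL"},    # dropped: TRA
    {"species": "MusMusculus", "gene": "TRB",
     "cdr3": "CASSDAGGRNTLYF", "epitope": "SSLENFRAYV"}, # dropped: mouse
]
filtered = filter_pairs(records)
```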
