On TCR binding predictors failing to generalize to unseen peptides
Several recent studies investigate TCR-peptide/-pMHC binding prediction using machine learning or deep learning approaches. Many of these methods achieve impressive results on test sets that include peptide sequences also present in the training set. In this work, we investigate how state-of-the-art deep learning models for TCR-peptide/-pMHC binding prediction generalize to unseen peptides. We create a dataset including positive samples from IEDB, VDJdb, McPAS-TCR, and the MIRA set, as well as negative samples from both randomization and 10X Genomics assays. We name this collection of samples TChard. We propose the hard split, a simple heuristic for training/test splitting, which ensures that test samples exclusively present peptides that do not belong to the training set. We investigate the effect of different training/test splitting techniques on the models’ test performance, as well as the effect of training and testing the models using mismatched negative samples generated randomly, in addition to the negative samples derived from assays. Our results show that modern deep learning methods fail to generalize to unseen peptides. We provide an explanation of why this happens and verify our hypothesis on the TChard dataset. We then conclude that robust prediction of TCR recognition is still far from being solved.
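The constraint behind the hard split (no test peptide may appear in training) can be illustrated with a minimal Python sketch. This is one way to satisfy that constraint and not necessarily the authors' heuristic; the `(tcr, peptide, label)` record layout, the function name `hard_split`, and the `test_fraction` parameter are assumptions made for illustration.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (tcr_sequence, peptide_sequence, binding_label)
Sample = Tuple[str, str, int]


def hard_split(
    samples: List[Sample], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that no test peptide occurs in the training set.

    Peptides (not individual samples) are assigned to the test side, so
    every sample sharing a held-out peptide moves to the test set together.
    """
    peptides = sorted({peptide for _, peptide, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(peptides)

    # Hold out at least one peptide so the test set is never empty.
    n_test = max(1, round(len(peptides) * test_fraction))
    test_peptides = set(peptides[:n_test])

    train = [s for s in samples if s[1] not in test_peptides]
    test = [s for s in samples if s[1] in test_peptides]
    return train, test


# Toy placeholder records, not real TCR or peptide sequences.
demo = [
    ("TCR1", "PEPA", 1), ("TCR2", "PEPA", 0),
    ("TCR3", "PEPB", 1), ("TCR4", "PEPC", 0),
    ("TCR5", "PEPD", 1), ("TCR6", "PEPE", 0),
]
train_set, test_set = hard_split(demo, test_fraction=0.4)
```

Because the split is made at the peptide level, the resulting test fraction in terms of samples depends on how many TCRs are paired with each held-out peptide.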
The Anti-Social System Properties: Bitcoin Network Data Analysis
© 2018 IEEE. Bitcoin is a cryptocurrency and a decentralized semi-anonymous peer-to-peer payment system in which the transactions are verified by network nodes and recorded in a public massively replicated ledger called the blockchain. Bitcoin is currently considered one of the most disruptive technologies. Bitcoin represents a paradox of opposing forces. On one hand, it is fundamentally social, allowing people to transact in a peer-to-peer manner to create and exchange value. On the other hand, Bitcoin's core design philosophy and user base contain strong anti-social elements and constraints, emphasizing anonymity, privacy, and subversion of traditional centralized financial systems. We believe that the success of Bitcoin, and the financial ecosystem built around it, will likely rely on achieving an optimal balance between these social and anti-social forces. To elucidate the role of these forces, we analyze the evolution of the entire Bitcoin transaction graph from its inception, and quantify the evolution of its key structural properties. We observe that despite its different nature, the Bitcoin transaction graph exhibits many universal dynamics typical of social networks. However, we also find that Bitcoin deviates in important ways due to anonymity-seeking behavioral patterns of its users. As a result, the network exhibits a two-orders-of-magnitude larger diameter, sparse tree-like communities, and an overwhelming majority of transitional or intermediate accounts with incoming and outgoing edges but zero cumulative balances. These results illuminate the evolutionary dynamics of the most popular cryptocurrency, and provide an initial understanding of social networks rooted in and driven by anti-social constraints.
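The notion of a transitional or intermediate account (incoming and outgoing edges, zero cumulative balance) can be made concrete with a small Python sketch. It assumes a simplified edge list of `(sender, receiver, amount)` transfers with integer amounts; real Bitcoin transactions have multiple inputs and outputs plus fees, so this is an illustration of the definition and not the authors' analysis pipeline. The function name `intermediate_accounts` is hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# Hypothetical simplified edge: (sender, receiver, amount in integer units)
Transfer = Tuple[str, str, int]


def intermediate_accounts(transfers: Iterable[Transfer]) -> List[str]:
    """Return accounts that both receive and send, with zero net balance."""
    received = defaultdict(int)
    sent = defaultdict(int)
    for sender, receiver, amount in transfers:
        sent[sender] += amount
        received[receiver] += amount

    # An account qualifies if it has at least one incoming and one outgoing
    # edge and everything it received has been sent onward.
    return sorted(
        account
        for account in received
        if account in sent and received[account] == sent[account]
    )


# Toy transfers: B forwards everything it receives, C keeps part of it.
toy_transfers = [("A", "B", 50), ("B", "C", 50), ("C", "D", 20)]
pass_through = intermediate_accounts(toy_transfers)
```

In this toy graph only `B` is a pass-through account: `A` has no incoming edge, `D` has no outgoing edge, and `C` retains a positive balance.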
Microbiome-based disease prediction with multimodal variational information bottlenecks.
Scientific research is shedding light on the interaction of the gut microbiome with the human host and on its role in human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. Most of them leverage shotgun metagenomic sequencing to extract gut microbial species-relative abundances or strain-level markers. Each of these gut microbial profiling modalities showed diagnostic potential when tested separately; however, no existing approach combines them in a single predictive framework. Here, we propose the Multimodal Variational Information Bottleneck (MVIB), a novel deep learning model capable of learning a joint representation of multiple heterogeneous data modalities. MVIB achieves competitive classification performance while being faster than existing methods. Additionally, MVIB offers interpretable results. Our model adopts an information theoretic interpretation of deep neural networks and computes a joint stochastic encoding of different input data modalities. We use MVIB to predict whether human hosts are affected by a certain disease by jointly analysing gut microbial species-relative abundances and strain-level markers. MVIB is evaluated on human gut metagenomic samples from 11 publicly available disease cohorts covering 6 different diseases. We achieve high performance (0.80 < ROC AUC < 0.95) on 5 cohorts and at least medium performance on the remaining ones. We adopt a saliency technique to interpret the output of MVIB and identify the most relevant microbial species and strain-level markers to the model's predictions. We also perform cross-study generalisation experiments, where we train and test MVIB on different cohorts of the same disease, and overall we achieve comparable results to the baseline approach, i.e. the Random Forest. Further, we evaluate our model by adding metabolomic data derived from mass spectrometry as a third input modality. 
Our method is scalable with respect to input data modalities and has an average training time of less than 1.4 seconds. The source code and the datasets used in this work are publicly available.
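The abstract does not say how the joint stochastic encoding of the modalities is formed. One common choice in multimodal variational models is a product of experts over per-modality Gaussian encodings; the Python sketch below assumes that fusion rule, diagonal Gaussians, and a standard normal prior expert. The function name `product_of_experts` and the toy numbers are hypothetical.

```python
from typing import List, Tuple


def product_of_experts(
    means: List[List[float]], variances: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Fuse per-modality diagonal Gaussians into one joint Gaussian.

    Each modality contributes a mean and a variance per latent dimension.
    A standard normal prior N(0, 1) is included as an extra expert
    (an assumption of this sketch).
    """
    dim = len(means[0])
    fused_mean, fused_var = [], []
    for d in range(dim):
        precision = 1.0  # precision of the N(0, 1) prior expert
        weighted_mean = 0.0  # the prior mean is 0, so it adds nothing here
        for mu, var in zip(means, variances):
            precision += 1.0 / var[d]
            weighted_mean += mu[d] / var[d]
        fused_var.append(1.0 / precision)
        fused_mean.append(weighted_mean / precision)
    return fused_mean, fused_var


# Two toy modality encodings (e.g. abundances and markers), one latent dim.
joint_mean, joint_var = product_of_experts(
    means=[[1.0], [3.0]], variances=[[1.0], [1.0]]
)
```

Under this assumed rule, adding a further modality (such as metabolomic data) only appends one more mean and variance list, which is one way a model can scale with the number of input modalities.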