4,984 research outputs found

    T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification

    Full text link
    Cancer is a complex disease characterized by uncontrolled cell growth and proliferation. T cell receptors (TCRs) are essential proteins for the adaptive immune system, and their specific recognition of antigens plays a crucial role in the immune response against diseases, including cancer. The diversity and specificity of TCRs make them ideal for targeting cancer cells, and recent advancements in sequencing technologies have enabled the comprehensive profiling of TCR repertoires. This has led to the discovery of TCRs with potent anti-cancer activity and the development of TCR-based immunotherapies. In this study, we investigate the use of sparse coding for the multi-class classification of TCR protein sequences with cancer categories as target labels. Sparse coding is a popular technique in machine learning that enables the representation of data with a set of informative features and can capture complex relationships between amino acids and identify subtle patterns in the sequence that might be missed by low-dimensional methods. We first compute the k-mers from the TCR sequences and then apply sparse coding to capture the essential features of the data. To improve the predictive performance of the final embeddings, we integrate domain knowledge regarding different types of cancer properties. We then train different machine learning (linear and non-linear) classifiers on the embeddings of TCR sequences for the purpose of supervised analysis. Our proposed embedding method on a benchmark dataset of TCR sequences significantly outperforms the baselines in terms of predictive performance, achieving an accuracy of 99.8\%. Our study highlights the potential of sparse coding for the analysis of TCR protein sequences in cancer research and other related fields

    Protein interface prediction using graph convolutional networks

    Get PDF
    2017 Fall.Includes bibliographical references.Proteins play a critical role in processes both within and between cells, through their interactions with each other and other molecules. Proteins interact via an interface forming a protein complex, which is difficult, expensive, and time consuming to determine experimentally, giving rise to computational approaches. These computational approaches utilize known electrochemical properties of protein amino acid residues in order to predict if they are a part of an interface or not. Prediction can occur in a partner independent fashion, where amino acid residues are considered independently of their neighbor, or in a partner specific fashion, where pairs of potentially interacting residues are considered together. Ultimately, prediction of protein interfaces can help illuminate cellular biology, improve our understanding of diseases, and aide pharmaceutical research. Interface prediction has historically been performed with a variety of methods, to include docking, template matching, and more recently, machine learning approaches. The field of machine learning has undergone a revolution of sorts with the emergence of convolutional neural networks as the leading method of choice for a wide swath of tasks. Enabled by large quantities of data and the increasing power and availability of computing resources, convolutional neural networks efficiently detect patterns in grid structured data and generate hierarchical representations that prove useful for many types of problems. This success has motivated the work presented in this thesis, which seeks to improve upon state of the art interface prediction methods by incorporating concepts from convolutional neural networks. Proteins are inherently irregular, so they don't easily conform to a grid structure, whereas a graph representation is much more natural. Various convolution operations have been proposed for graph data, each geared towards a particular application. We adapted these convolutions for use in interface prediction, and proposed two new variants. Neural networks were trained on the Docking Benchmark Dataset version 4.0 complexes and tested on the new complexes added in version 5.0. Results were compared against the state of the art method partner specific method, PAIRpred [1]. Results show that multiple variants of graph convolution outperform PAIRpred, with no method emerging as the clear winner. In the future, additional training data may be incorporated from other sources, unsupervised pretraining such as autoencoding may be employed, and a generalization of convolution to simplicial complexes may also be explored. In addition, the various graph convolution approaches may be applied to other applications with graph structured data, such as Quantitative Structure Activity Relationship (QSAR) learning, and knowledge base inference

    Improved functional prediction of proteins by learning kernel combinations in multilabel settings

    Get PDF
    Background We develop a probabilistic model for combining kernel matrices to predict the function of proteins. It extends previous approaches in that it can handle multiple labels which naturally appear in the context of protein function. Results Explicit modeling of multilabels significantly improves the capability of learning protein function from multiple kernels. The performance and the interpretability of the inference model are further improved by simultaneously predicting the subcellular localization of proteins and by combining pairwise classifiers to consistent class membership estimates. Conclusion For the purpose of functional prediction of proteins, multilabels provide valuable information that should be included adequately in the training process of classifiers. Learning of functional categories gains from co-prediction of subcellular localization. Pairwise separation rules allow very detailed insights into the relevance of different measurements like sequence, structure, interaction data, or expression data. A preliminary version of the software can be downloaded from http://www.inf.ethz.ch/personal/vroth/KernelHMM/.ISSN:1471-210
    • …
    corecore