102 research outputs found
Large margin methods for partner specific prediction of interfaces in protein complexes
2014 Spring. The study of protein interfaces and binding sites is an important domain of research in bioinformatics. Information about the interfaces between proteins can be used not only to understand protein function but also directly in drug design and protein engineering. However, the experimental determination of protein interfaces is cumbersome, expensive and, in some cases, not possible with today's technology. As a consequence, the computational prediction of protein interfaces from sequence and structure has emerged as a very active research area. A number of machine learning based techniques have been proposed for this problem. However, the prediction accuracy of most such schemes is very low. In this dissertation we present large-margin classification approaches that have been designed to directly model different aspects of protein complex formation as well as the characteristics of available data. Most existing machine learning techniques for this task are partner-independent in nature, i.e., they ignore the fact that the propensity of a protein to bind to another protein depends on the characteristics of residues in both proteins. We have developed a pairwise support vector machine classifier called PAIRpred to predict protein interfaces in a partner-specific fashion. Due to its more detailed model of the problem, PAIRpred offers state of the art accuracy in predicting both binding sites at the protein level and inter-protein residue contacts at the complex level. PAIRpred uses sequence and structure conservation, local structural similarity and surface geometry, residue solvent exposure and template based features derived from the unbound structures of the proteins forming a complex.
We have investigated the impact of explicitly modeling the inter-dependencies between residues that are imposed by the overall structure of a protein during the formation of a protein complex through transductive and semi-supervised learning models. We also present a novel multiple instance learning scheme called MI-1 that explicitly models imprecision in sequence-level annotations of binding sites in proteins that bind calmodulin, achieving state of the art prediction accuracy for this task.
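To illustrate what a partner-specific (pairwise) classifier operates on, the sketch below shows one common way to build a kernel over residue pairs from a kernel over single residues. The abstract does not state which pairwise kernel PAIRpred uses, so the symmetric tensor-product form, the RBF base kernel, and the names `rbf` and `pairwise_kernel` are assumptions made for illustration only.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian (RBF) kernel between two residue feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def pairwise_kernel(pair1, pair2, base=rbf):
    """Symmetric tensor-product kernel between two residue pairs.

    pair1 = (a, b) and pair2 = (c, d), where each element is the feature
    vector of one residue. The two product terms make the value
    independent of the order of residues within a pair.
    This form is an assumption, not necessarily the one used by PAIRpred.
    """
    a, b = pair1
    c, d = pair2
    return base(a, c) * base(b, d) + base(a, d) * base(b, c)


# Hypothetical two-dimensional residue features.
res_a, res_b = (0.0, 0.0), (1.0, 0.0)
res_c, res_d = (0.5, 0.5), (1.0, 1.0)
k = pairwise_kernel((res_a, res_b), (res_c, res_d))
```

A kernel of this kind can be handed to any kernel-based large-margin classifier, so that each training example is a pair of residues (one from each protein) rather than a single residue.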
Machine Learning with Abstention for Automated Liver Disease Diagnosis
This paper presents a novel approach for detection of liver abnormalities in
an automated manner using ultrasound images. For this purpose, we have
implemented a machine learning model that can not only generate labels (normal
and abnormal) for a given ultrasound image but also detect when its
prediction is likely to be incorrect. The proposed model abstains from
generating the label of a test example if it is not confident about its
prediction. Such behavior is commonly practiced by medical doctors who, when
given insufficient information or a difficult case, can choose to carry out
further clinical or diagnostic tests before generating a diagnosis. However,
existing machine learning models are designed in a way to always generate a
label for a given example even when the confidence of their prediction is low.
We propose a novel stochastic gradient based solver for the learning with
abstention paradigm and use it to build a practical, state of the art method for
liver disease classification. The proposed method has been benchmarked on a
data set of approximately 100 patients from MINAR, Multan, Pakistan and our
results show that the proposed scheme offers state of the art classification
performance.
Comment: Preprint version before submission for publication. Complete version
published in Proc. 15th International Conference on Frontiers of Information
Technology (FIT 2017), December 18-20, 2017, Islamabad, Pakistan.
http://ieeexplore.ieee.org/document/8261064
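The abstention behaviour described in this abstract can be pictured with a simple confidence-threshold rule: the classifier outputs a label only when the magnitude of its score is large enough. The sketch below is not the paper's stochastic gradient solver; the threshold rule, the function names and the example scores are assumptions chosen only to show how coverage and accuracy on accepted cases interact.

```python
def predict_with_abstention(score, threshold=0.5):
    """Return +1 or -1, or None (abstain) when |score| is below threshold."""
    if abs(score) < threshold:
        return None
    return 1 if score > 0 else -1


def abstention_report(scores, labels, threshold=0.5):
    """Return (coverage, accuracy on the examples that were not rejected)."""
    decisions = [predict_with_abstention(s, threshold) for s in scores]
    accepted = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(accepted) / len(scores) if scores else 0.0
    correct = sum(1 for d, y in accepted if d == y)
    accuracy = correct / len(accepted) if accepted else float("nan")
    return coverage, accuracy


# Hypothetical classifier scores and true labels (+1 abnormal, -1 normal).
scores = [2.0, -1.5, 0.2, -0.1]
labels = [1, -1, -1, -1]
coverage, accuracy = abstention_report(scores, labels, threshold=0.5)
```

Raising the threshold typically lowers coverage while raising accuracy on the accepted cases, which is the trade-off a model with abstention has to balance.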
Deep and self-taught learning for protein accessible surface area prediction
Accessible surface area (ASA) captures the degree of burial or surface accessibility of a protein residue. It is also an important indicator of the behavior of amino acids within a protein. It can be used to find protein interactions, interfaces, folding states, etc. Calculation of the ASA requires the structure of the protein. However, structure determination for proteins is expensive and requires significant technical effort. As a consequence, the prediction of ASA is an important and fundamental problem in bioinformatics and proteomics. In this work, we have investigated self-taught machine learning methods along with deep neural networks to predict the residue level ASA of a protein. We have found that deep neural networks can predict the ASA of the residues in a protein accurately. Furthermore, the proposed deep learning based method does not require the use of computationally demanding features such as the position specific scoring matrix (PSSM), which have been used in previous works. A simple BLOSUM62 matrix based position dependent representation of amino acids in a sequence window gives comparable performance. This is particularly attractive for proteome wide prediction of ASA. We have used various self-taught learning schemes for obtaining an optimal feature representation from unlabeled data. These include a sparse and regularized autoencoder neural network and a dictionary based learning scheme. We have used unlabeled data from the protein universe in an attempt to improve the feature representation. We have also evaluated the performance of a stochastic gradient based predictor of accessible surface area for different feature representations.
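The position dependent sequence-window representation mentioned above can be sketched as follows: each residue is described by concatenating the substitution-matrix rows of its neighbours within a fixed window, with zero padding at the sequence ends. The window size, the padding scheme and the function name `window_encode` are assumptions, and the two-residue table below is a toy stand-in with made-up values, not the real BLOSUM62 matrix.

```python
def window_encode(sequence, table, half_width=2):
    """Encode each residue from the table rows of a window centred on it.

    `table` maps a residue letter to a fixed-length list of numbers
    (for example, one row of a substitution matrix). Positions outside
    the sequence, and residues missing from the table, are zero-padded.
    """
    dim = len(next(iter(table.values())))
    zero = [0.0] * dim
    features = []
    for i in range(len(sequence)):
        row = []
        for j in range(i - half_width, i + half_width + 1):
            if 0 <= j < len(sequence):
                row.extend(table.get(sequence[j], zero))
            else:
                row.extend(zero)
        features.append(row)
    return features


# Toy table with made-up values; a real run would use the full 20-letter matrix.
TOY_TABLE = {"A": [1.0, 0.0], "G": [0.0, 1.0]}
encoded = window_encode("AG", TOY_TABLE, half_width=1)
```

Each residue then yields a vector of length `(2 * half_width + 1) * dim`, which can be fed to a regressor that predicts its ASA.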
SynCLay: Interactive Synthesis of Histology Images from Bespoke Cellular Layouts
Automated synthesis of histology images has several potential applications in
computational pathology. However, no existing method can generate realistic
tissue images with a bespoke cellular layout or user-defined histology
parameters. In this work, we propose a novel framework called SynCLay
(Synthesis from Cellular Layouts) that can construct realistic and high-quality
histology images from user-defined cellular layouts along with annotated
cellular boundaries. Tissue image generation based on bespoke cellular layouts
through the proposed framework allows users to generate different histological
patterns from arbitrary topological arrangement of different types of cells.
SynCLay generated synthetic images can be helpful in studying the role of
different types of cells present in the tumor microenvironment. Additionally,
they can assist in balancing the distribution of cellular counts in tissue
images for designing accurate cellular composition predictors by minimizing the
effects of data imbalance. We train SynCLay in an adversarial manner and
integrate a nuclear segmentation and classification model in its training to
refine nuclear structures and generate nuclear masks in conjunction with
synthetic images. During inference, we combine the model with another
parametric model for generating colon images and associated cellular counts as
annotations given the grade of differentiation and cell densities of different
cells. We assess the generated images quantitatively and report on feedback
from trained pathologists who assigned realism scores to a set of images
generated by the framework. The average realism score across all pathologists
for synthetic images was as high as that for the real images. We also show that
augmenting limited real data with the synthetic data generated by our framework
can significantly boost prediction performance of the cellular composition
prediction task.