29 research outputs found

    Editorial Forward

    With the availability of next-generation sequencing technology, there is a tremendous need for novel tools, algorithms and methodologies for extracting useful information and knowledge from exponentially growing data. This need has catalyzed active research in the overlapping fields of Machine Learning (ML) and Artificial Intelligence (AI). The first issue of IJCB brings together some very good research articles with a detailed view of cutting-edge machine learning algorithms

    A Multi-class Machine Learning Framework to Predict Ampicillin-Sulbactam Resistance of Acinetobacter baumannii

    Acinetobacter baumannii is a serious pathogen responsible for many hospital-acquired infections. The emergence of multi-drug and pan-drug resistant strains of A. baumannii has been a growing concern. The ampicillin-sulbactam combination has proven effective in treating several resistant strains. However, strains resistant to the ampicillin-sulbactam combination have also emerged, necessitating other combination therapies. Rapid and accurate identification of the phenotype of the organism is essential for starting the right treatment. To this end, genome-based approaches have garnered much attention. In this work, we report a multi-class machine-learning approach to predict the ampicillin-sulbactam resistance phenotype and MIC of Acinetobacter baumannii based on the presence/absence of AMR genes in the genomes of strains isolated in the USA. Our model achieves an accuracy of about 94%, indicating that gene presence/absence alone can capture the resistance phenotype. Further, we show that our model, built on the USA strains, does not reliably predict the AMR phenotypes of Indian isolates, pointing to the need for building machine learning models from region-specific data.

    LIPOPREDICT: Bacterial lipoprotein prediction server

    Bacterial lipoproteins have many important functions owing to their essential nature and roles in pathogenesis, and they represent a class of possible vaccine candidates. The prediction of bacterial lipoproteins from sequence is thus an important task for computational vaccinology. A Support Vector Machine (SVM) based module for predicting bacterial lipoproteins, LIPOPREDICT, has been developed. The best-performing sequence model was generated using selected dipeptide composition, which gave a prediction accuracy of 97%. The results compared very well with those of previously developed methods.
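
    The dipeptide-composition features mentioned above can be illustrated with a minimal sketch. It assumes the standard 20-letter amino acid alphabet and computes all 400 dipeptide frequencies; the abstract does not say which dipeptides were selected or how the SVM was configured, so the function name `dipeptide_composition` and the normalisation used here are illustrative assumptions rather than the LIPOPREDICT implementation.

```python
from itertools import product

# Assumption: the 20 standard amino acids, giving 20 * 20 = 400 dipeptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-element list of dipeptide frequencies for a protein sequence.

    Overlapping pairs are counted; pairs containing a non-standard residue
    (for example 'X') are skipped. Frequencies sum to 1 when any valid pair exists.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


if __name__ == "__main__":
    features = dipeptide_composition("MKKLLA")
    print(len(features))
```

    A vector of this kind would then be passed, after feature selection, to a classifier such as an SVM.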

    A support vector machine-based method for predicting the propensity of a protein to be soluble or to form inclusion body on overexpression in Escherichia coli

    Motivation: Inclusion body formation has been a major deterrent for overexpression studies, since a large number of proteins form insoluble inclusion bodies when overexpressed in Escherichia coli. The formation of inclusion bodies is known to be an outcome of improper protein folding; thus the composition and arrangement of amino acids in the proteins would be a major influencing factor in deciding their aggregation propensity. There is a significant need for a prediction algorithm that would enable the rational identification of both mutants and the ideal protein candidates for mutations that would confer higher solubility-on-overexpression, instead of the presently used trial-and-error procedures. Results: Six physicochemical properties together with residue and dipeptide compositions have been used to develop a support vector machine-based classifier to predict the overexpression status in E. coli. The prediction accuracy is ~72%, suggesting that it performs reasonably well in predicting the propensity of a protein to be soluble or to form inclusion bodies. The algorithm could also correctly predict the change in solubility for most of the point mutations reported in the literature. This algorithm can be a useful tool in screening protein libraries to identify soluble variants of proteins.

    Machine Learning Heuristics on Gingivobuccal Cancer Gene Datasets Reveals Key Candidate Attributes for Prognosis

    Delayed cancer detection is one of the common causes of poor prognosis in many cancers, including cancers of the oral cavity. Despite the improvement and development of new and efficient gene therapy treatments, very little has been done to algorithmically assess the impedance of these carcinomas. In this work, we sought to train on the instances emerging from the attributes of NCBI’s oral cancer datasets, viz. (i) name, (ii) gene(s), (iii) protein change, (iv) condition(s), and (v) clinical significance (last reviewed). Further, we attempt to annotate viable attributes in oral cancer gene datasets for the identification of gingivobuccal cancer (GBC). We further apply supervised and unsupervised machine learning methods to the gene datasets, revealing key candidate attributes for GBC prognosis. Our work highlights the importance of automated identification of key genes responsible for GBC, which could perhaps be easily replicated in the detection of other forms of oral cancer.

    Review on lazy learning regressors and their applications in QSAR

    Building accurate quantitative structure-activity relationships (QSAR) is important in drug design, environmental modeling, toxicology, and chemical property prediction. QSAR methods can be used to solve mainly two types of problems: pattern recognition (or classification), where the output is discrete (i.e. class information), e.g., active or non-active molecule, binding or non-binding molecule, etc.; and function approximation (i.e. regression), where the output is continuous (e.g., actual activity prediction). The present review deals with the second type of problem (regression), with specific attention to one of the most effective machine learning procedures, viz. lazy learning. The methodologies of the algorithm, along with the relevant technical information, are discussed in detail. We also present three real-life case studies to briefly outline the typical characteristics of the modeling formalism.
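
    As a rough illustration of the lazy learning idea for regression, the sketch below defers all computation to query time and predicts from the nearest stored examples. It is a distance-weighted k-nearest-neighbour regressor, one simple member of the lazy learning family; the function name `lazy_predict`, the Euclidean distance, and the inverse-distance weighting are assumptions made for illustration and are not taken from the review.

```python
import math


def lazy_predict(train_X, train_y, query, k=3):
    """Predict a continuous value for `query` from its k nearest training points.

    Nothing is fitted in advance: the training data are simply stored and
    a local, distance-weighted average is built when a query arrives.
    """
    # Distance from the query to every stored example (Euclidean, by assumption).
    dists = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )
    neighbours = dists[:k]

    # If the query coincides with stored points, average their targets directly.
    exact = [y for d, y in neighbours if d == 0.0]
    if exact:
        return sum(exact) / len(exact)

    # Otherwise weight each neighbour by the inverse of its distance.
    weights = [1.0 / d for d, _ in neighbours]
    weighted_sum = sum(w * y for w, (_, y) in zip(weights, neighbours))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    X = [(0.0,), (1.0,), (2.0,), (10.0,)]
    y = [0.0, 1.0, 2.0, 10.0]
    print(lazy_predict(X, y, (1.5,), k=2))
```

    In a QSAR setting the tuples in `train_X` would be molecular descriptors and `train_y` the measured activities.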

    Modeling structure property relationships with Kernel recursive least squares

    Motivation: Modeling structure-property relationships accurately is a challenging task, and newly developed kernel-based methods may provide the accuracy needed for building these relationships. Method: A kernelized variant of the traditional recursive least squares algorithm is used to model two QSPR datasets. Results: All the datasets showed a good correlation between actual and predicted values of boiling points, with root mean squared errors (RMSEs) comparable to other conventional methods. For the datasets from Espinosa et al., KRLS showed good prediction statistics, with R values in the range 0.97-0.99 and S values in the range 5.5-8, compared to multiple linear regression (MLR) with R values in the range 0.85-0.88 and S values in the range 22-26. For the dataset from Trinajstić et al., KRLS performed consistently well, with R values in the range 0.95-0.99 and S in the range 5-10, compared to MLR with R values in the range 0.7-0.85 and S in the range 25-30. Conclusions: The KRLS method works better when a larger number of variables from the dataset are included, in contrast to other methods such as support vector learning or lazy learning techniques, which work better with a smaller, reduced set of relevant variables from the dataset.
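
    The kernel least squares fit underlying this approach can be sketched as follows. This is a batch version with a Gaussian (RBF) kernel and a small ridge term: it solves the same regularised least squares problem in one step and is not the recursive, sample-by-sample update described in the paper. The kernel choice, the parameters `gamma` and `lam`, and all function names are assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel between two descriptor vectors (kernel choice is an assumption)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq)


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_kernel_ls(X, y, lam=1e-3, gamma=1.0):
    """Return dual weights alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [
        [rbf_kernel(X[i], X[j], gamma) + (lam if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return solve_linear(K, y)


def predict_kernel_ls(X, alpha, query, gamma=1.0):
    """Prediction is a kernel-weighted sum over the training points."""
    return sum(a * rbf_kernel(x, query, gamma) for a, x in zip(alpha, X))


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


if __name__ == "__main__":
    X = [(0.0,), (1.0,), (2.0,), (3.0,)]
    y = [0.0, 1.0, 4.0, 9.0]
    alpha = fit_kernel_ls(X, y)
    preds = [predict_kernel_ls(X, alpha, x) for x in X]
    print(rmse(y, preds))
```

    A recursive variant would update `alpha` as each new sample arrives instead of re-solving the full system.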