92 research outputs found

    Active machine learning for transmembrane helix prediction

    Abstract
    Background: About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined structures of membrane proteins account for only about 1.7% of the protein structures deposited in the Protein Data Bank, because membrane proteins are difficult to crystallize. Algorithms that can identify proteins whose high-resolution structure would aid in predicting the structure of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach that suits this domain, where there are a large number of sequences but only very few have known corresponding structures. In essence, active learning seeks to identify proteins whose structure, if revealed experimentally, is maximally predictive of others.
    Results: An active learning approach is presented for selecting a minimal set of proteins whose structures can aid in the determination of transmembrane (TM) helices for the remaining proteins. TMpro, an algorithm for high-accuracy TM helix prediction that we previously developed, is coupled with active learning. We show that, with a well-designed selection procedure, high accuracy can be achieved with only a few proteins. TMpro trained with a single protein achieved an F-score of 94% on benchmark evaluation and 91% on the MPtopo dataset, which match the state-of-the-art accuracies for TM helix prediction that are usually achieved by training with over 100 proteins.
    Conclusion: Active learning is suitable for bioinformatics applications in which manually characterized data are not a comprehensive representation of all possible data and can in fact be a very sparse subset thereof. It aids in selecting data instances that, when characterized experimentally, can improve the accuracy of computational characterization of the remaining raw data. The results presented here also demonstrate that the feature extraction method of TMpro is well designed, achieving a very good separation between TM and non-TM segments.
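The selection idea in the abstract above (label only the instances that are most informative for the rest) can be illustrated with a generic pool-based loop. The following is a minimal sketch that uses uncertainty sampling on a one-dimensional toy problem. It is not the selection procedure used with TMpro, and every name in it (`fit_threshold`, `active_learn`, `oracle`, the toy pool) is hypothetical.

```python
def fit_threshold(labeled):
    """Toy classifier: the decision threshold is the midpoint between the
    largest labeled negative and the smallest labeled positive."""
    negatives = [x for x, y in labeled if y == 0]
    positives = [x for x, y in labeled if y == 1]
    return (max(negatives) + min(positives)) / 2.0


def active_learn(pool, oracle, seed, budget):
    """Pool-based active learning with uncertainty sampling.

    pool   -- all candidate instances (unlabeled at the start)
    oracle -- function returning the true label; stands in for an experiment
    seed   -- initial instances to label (must include both classes)
    budget -- number of additional labels we may request
    """
    labeled = [(x, oracle(x)) for x in seed]
    unlabeled = [x for x in pool if x not in seed]
    for _ in range(budget):
        if not unlabeled:
            break
        threshold = fit_threshold(labeled)
        # The most uncertain instance is the one closest to the boundary.
        query = min(unlabeled, key=lambda x: abs(x - threshold))
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
    return fit_threshold(labeled), len(labeled)


# Toy usage: 101 points, hidden class boundary at 0.37 (an assumption).
pool = [i / 100 for i in range(101)]
oracle = lambda x: int(x >= 0.37)
threshold, n_labeled = active_learn(pool, oracle, seed=[0.0, 1.0], budget=8)
agreement = all(int(x >= threshold) == oracle(x) for x in pool)
```

In this toy setting, ten labels out of 101 are enough for the learned threshold to classify the whole pool correctly, which mirrors the point that a well-chosen small training set can stand in for a much larger one.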

    Active Learning Strategies for Technology Assisted Sensitivity Review

    Government documents must be reviewed to identify and protect any sensitive information, such as personal information, before they can be released to the public. However, in the era of digital government documents such as e-mail, traditional sensitivity review procedures are no longer practical, for example because of the volume of documents to be reviewed. There is therefore a need for new technology-assisted review protocols that integrate automatic sensitivity classification into the sensitivity review process. To assist sensitivity review effectively, such technologies must also incorporate reviewer feedback so that sensitivity classifiers can quickly learn and adapt to the sensitivities within a collection when the types of sensitivity are not known a priori. In this work, we present a thorough evaluation of active learning strategies for sensitivity review. We also present an active learning strategy that integrates reviewer feedback, in the form of sensitive text annotations, to identify features of sensitivity. These features enable us to learn an effective sensitivity classifier (0.7 Balanced Accuracy) using significantly less reviewer effort, according to the sign test (p < 0.01). This approach results in a 51% reduction in the number of documents that must be reviewed to achieve the same level of classification accuracy, compared to when the approach is deployed without annotation features.
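The abstract above reports Balanced Accuracy, which is the mean of the true positive rate and the true negative rate and is therefore less flattering than plain accuracy when sensitive documents are rare. The sketch below shows the standard binary definition on made-up labels. The function name and the data are illustrative only and are not taken from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0).
    Assumes binary labels and that both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)


# Made-up example: 2 sensitive documents (label 1) among 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

Here the classifier finds one of the two sensitive documents and raises one false alarm, so plain accuracy is 0.8 while balanced accuracy is 0.6875.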