2 research outputs found

    Cancer Biomark

    BACKGROUND: As artificial intelligence and machine learning techniques are adopted in biomedical informatics, the security and privacy of the data and of subject identities have become an essential research topic. Without intentional safeguards, machine learning models may improve task performance by exploiting patterns and features that are associated with private personal information.
    OBJECTIVE: Deep learning models for information extraction from medical text are exposed to private health information and personally identifiable information, so their privacy vulnerability needs to be quantified. The objective of the study is to quantify the privacy vulnerability of deep learning models for natural language processing and to explore a proper way of securing patients' information to mitigate confidentiality breaches.
    METHODS: The target model is a multi-task convolutional neural network for information extraction from cancer pathology reports, trained on data from multiple participating state cancer registries. This study proposes two schemes for selecting vocabulary from the cancer pathology reports: (a) words appearing in multiple registries, and (b) words that have higher mutual information. We performed membership inference attacks on the models in high-performance computing environments.
    RESULTS: The comparison outcomes suggest that the proposed vocabulary selection methods resulted in lower privacy vulnerability while maintaining the same level of clinical task performance.
    Funding: HHSN261201800032C/CA/NCI NIH HHS/United States; HHSN261201800009C/CA/NCI NIH HHS/United States; NU58DP006344/DP/NCCDPHP CDC HHS/United States; HHSN261201800015I/CA/NCI NIH HHS/United States; HHSN261201800013C/CA/NCI NIH HHS/United States; HHSN261201800016I/CA/NCI NIH HHS/United States; HHSN261201800014I/CA/NCI NIH HHS/United States; HHSN261201800032I/CA/NCI NIH HHS/United States; U58 DP003907/DP/NCCDPHP CDC HHS/United States; HHSN261201800015C/CA/NCI NIH HHS/United States; HHSN261201800013I/CA/NCI NIH HHS/United States; HHSN261201800014C/CA/NCI NIH HHS/United States; HHSN261201800016C/CA/NCI NIH HHS/United States; P30 CA177558/CA/NCI NIH HHS/United States; HHSN261201300021C/CA/NCI NIH HHS/United States; HHSN261201800009I/CA/NCI NIH HHS/United States; HHSN261201800007C/CA/NCI NIH HHS/United States
    Date: 2022-08-15. PMID: 35213361. PMCID: PMC9377550.
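The two vocabulary selection schemes named in the Cancer Biomark abstract above, (a) words appearing in multiple registries and (b) words with higher mutual information, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `min_registries` threshold, the whitespace tokenizer, and the toy reports are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def shared_vocabulary(reports_by_registry, min_registries=2):
    """Scheme (a): keep words seen in at least `min_registries` registries.

    `reports_by_registry` maps a registry name to a list of report strings.
    The threshold and the whitespace tokenizer are illustrative assumptions.
    """
    registry_count = Counter()
    for reports in reports_by_registry.values():
        words_here = set()
        for text in reports:
            words_here.update(text.lower().split())
        # Count each word once per registry, however often it occurs there.
        registry_count.update(words_here)
    return {w for w, n in registry_count.items() if n >= min_registries}


def mutual_information(reports, labels, word):
    """Scheme (b): mutual information (in nats) between the presence of
    `word` in a report and the report's class label."""
    n = len(reports)
    joint = Counter()
    for text, label in zip(reports, labels):
        present = word in text.lower().split()
        joint[(present, label)] += 1
    word_marginal, label_marginal = Counter(), Counter()
    for (present, label), count in joint.items():
        word_marginal[present] += count
        label_marginal[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        expected = word_marginal[present] * label_marginal[label]
        mi += (count / n) * log(count * n / expected)
    return mi


# Toy data: "smith" and "jones" stand in for registry-specific private tokens.
toy_registries = {
    "registry_a": ["carcinoma of breast smith"],
    "registry_b": ["breast carcinoma jones"],
}
vocabulary = shared_vocabulary(toy_registries)
```

In this toy case the registry-specific tokens are dropped because each occurs in only one registry, which is the intuition behind scheme (a): words confined to a single source are more likely to carry identifying information than words used across sources.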

    J Biomed Inform

    Objective: In machine learning, it is well established that classification task performance improves when bootstrap aggregation (bagging) is applied. However, bagging deep neural networks takes tremendous amounts of computational resources and training time. The question we aimed to answer is whether we could achieve higher task performance scores and accelerate training by dividing a problem into sub-problems.
    Materials and Methods: The data used in this study consist of free text from electronic cancer pathology reports. We applied bagging and partitioned data training using Multi-Task Convolutional Neural Network (MT-CNN) and Multi-Task Hierarchical Convolutional Attention Network (MT-HCAN) classifiers. We split a big problem into 20 sub-problems, resampled the training cases 2,000 times, and trained the deep learning model for each bootstrap sample and each sub-problem, thus generating up to 40,000 models. We trained many models concurrently in a high-performance computing environment at Oak Ridge National Laboratory (ORNL).
    Results: We demonstrated that aggregation of the models improves task performance compared with the single-model approach, which is consistent with other research studies, and that the two proposed partitioned bagging methods achieved higher classification accuracy scores on four tasks. Notably, the improvements were significant for the extraction of cancer histology data, a task with more than 500 class labels; these results show that data partitioning may alleviate the complexity of the task. In contrast, the methods did not achieve superior scores for the tasks of site and subsite classification. Because data partitioning was based on the primary cancer site, the accuracy depended on how the partitions were determined, which needs further investigation and improvement.
    Conclusion: Results in this research demonstrate that (1) the data partitioning and bagging strategy achieved higher performance scores, and (2) training was accelerated by leveraging the high-performance Summit supercomputer at ORNL.
    Dates: 2020; 2021-01-13. PMID: 32919043. PMCID: PMC8276580.
    Funding: HHSN261201800013C/CA/NCI NIH HHS/United States; HHSN261201800016C/CA/NCI NIH HHS/United States; U58 DP003907/DP/NCCDPHP CDC HHS/United States; HHSN261201800007C/CA/NCI NIH HHS/United States; P30 CA177558/CA/NCI NIH HHS/United States; HHSN261201300021C/CA/NCI NIH HHS/United States; HHSN261201800013I/CA/NCI NIH HHS/United States; P30 CA042014/CA/NCI NIH HHS/United States
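The partitioned bagging idea in the J Biomed Inform abstract above (split the problem into sub-problems, bootstrap-resample within each, train one model per resample, then aggregate) can be sketched roughly as follows. This is a sketch under stated assumptions, not the study's code: the function names are hypothetical, and a trivial majority-label "model" stands in for the MT-CNN and MT-HCAN classifiers so that the example stays self-contained.

```python
import random
from collections import Counter, defaultdict


def train_majority_model(samples):
    """Stand-in for a real classifier (an assumption for illustration):
    always predicts the most frequent label in its training sample."""
    top_label = Counter(label for _, label in samples).most_common(1)[0][0]
    return lambda text: top_label


def partitioned_bagging(data, n_bootstrap=5, train=train_majority_model, seed=0):
    """Train `n_bootstrap` models per partition, each on a bootstrap resample.

    `data` is a list of (text, partition_key, label) triples; in the abstract
    the partition is derived from the primary cancer site.
    """
    rng = random.Random(seed)
    by_partition = defaultdict(list)
    for text, partition, label in data:
        by_partition[partition].append((text, label))
    return {
        partition: [
            # Sampling with replacement, same size as the partition.
            train(rng.choices(samples, k=len(samples)))
            for _ in range(n_bootstrap)
        ]
        for partition, samples in by_partition.items()
    }


def predict(ensembles, partition, text):
    """Aggregate one partition's models by majority vote."""
    votes = Counter(model(text) for model in ensembles[partition])
    return votes.most_common(1)[0][0]


toy_data = [
    ("report 1", "breast", "ductal"),
    ("report 2", "breast", "ductal"),
    ("report 3", "lung", "adeno"),
    ("report 4", "lung", "adeno"),
]
ensembles = partitioned_bagging(toy_data, n_bootstrap=3)
```

With 20 partitions and 2,000 resamples per partition, this scheme yields the 20 × 2,000 = 40,000 models mentioned in the abstract. Note that `predict` needs the partition of a new report as input, which reflects the abstract's caveat that accuracy depends on how the partitions are determined.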