5 research outputs found

    LDP-GAN : Generative Adversarial Network with Local Differential Privacy for Patient Data Synthesis

    Electronic medical records (EMR) are a type of medical data containing a patient's health condition, treatment results, and prescription information. Because they hold a great deal of information about patients, they can be used in many ways and have the potential to improve the quality of medical care. In particular, as machine learning, which has recently made great progress, has been introduced into the medical field, the use of EMR has increased. However, EMR contain a large amount of sensitive personal information, which makes them difficult to collect, use, and share. These characteristics make EMR research difficult and reduce the usefulness of the records. A generative model can be one solution. A generative model imitates real data in order to generate similar synthetic data, and using synthetic data produced by such a model avoids restrictions related to personal information. There are many types of generative models, but those based on deep learning have recently received the most attention. Deep learning generative models have made great strides in the image domain and can now generate high-resolution images whose authenticity is difficult to judge with the human eye. They have also been applied to medical data and were able to generate clinically meaningful data. Although deep learning generative models perform well, they do not completely solve the personal information problem. Several studies have examined attacks on deep learning models and found that training data can be inferred from a model's outputs. This means that a privacy risk remains even when a deep learning generative model is used, and that the model itself needs protection if it is used for the purpose of protecting personal information. The objective of this study is to develop a deep learning generative model that is safe from membership inference attacks. To this end, we used WGAN-GP, a type of generative adversarial network (GAN), as the base model and combined it with differential privacy. Differential privacy protects privacy through mathematically designed noise and uses ε, a parameter related to the noise intensity, to control the trade-off between utility and the level of privacy protection. In this study, we adopted local differential privacy and developed a method that trains the model only on perturbed data. Because training uses only perturbed data, the original data can be strongly protected from attacks on the model. The performance of the model trained in this way was evaluated in terms of both utility and privacy. Both evaluation metrics changed significantly with ε, and we showed that an optimal model can be obtained by appropriately adjusting the trade-off between the two. These experimental results indicate that training data can be protected from attacks on the model if appropriate noise is applied. The results of this study are expected to resolve, to some extent, the constraints caused by personal information issues in EMR.
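
    The abstract above does not say which perturbation mechanism was used, so the following is only a minimal sketch of local differential privacy, assuming binary patient features and the classic randomized-response mechanism. The function names (`randomized_response`, `perturb_records`, `estimate_rate`) are hypothetical and are not taken from the thesis. It illustrates how ε sets the trade-off: a small ε flips more bits (more privacy, less utility), while a large ε leaves the data almost unchanged.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Keep the true bit with probability e^eps / (e^eps + 1), else flip it."""
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    return bit if rng.random() < p_truth else 1 - bit


def perturb_records(records, epsilon, seed=0):
    """Perturb every binary feature of every record independently.

    Assumption: each feature is perturbed at the same epsilon, so a record
    with d features costs d * epsilon in total under basic composition.
    """
    rng = random.Random(seed)
    return [[randomized_response(b, epsilon, rng) for b in rec] for rec in records]


def estimate_rate(perturbed_bits, epsilon):
    """Unbiased estimate of the true fraction of 1s from perturbed bits."""
    p = math.exp(epsilon) / (math.exp(epsilon) + 1.0)
    observed = sum(perturbed_bits) / len(perturbed_bits)
    # E[observed] = rate * (2p - 1) + (1 - p), solved here for rate.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)
```

    In a setup like the one described, only the output of `perturb_records` would ever reach the generative model, so the raw records never take part in training. `estimate_rate` shows why perturbed data can still be useful in aggregate: population statistics can be recovered even though individual bits are unreliable.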

    Self-Training With Quantile Errors for Multivariate Missing Data Imputation for Regression Problems in Electronic Medical Records: Algorithm Development Study

    Background: When machine learning is applied in the real world, missing values are the first problem encountered. Methods for imputing missing values include statistical methods such as mean imputation, expectation-maximization, and multiple imputation by chained equations (MICE), as well as machine learning methods such as the multilayer perceptron, k-nearest neighbor, and decision tree. Objective: The objective of this study was to impute numeric medical data such as physical data and laboratory data. We aimed to impute data effectively using a progressive method called self-training in the medical field, where training data are scarce. Methods: In this paper, we propose a self-training method that gradually increases the available data. Models trained with complete data predict the missing values in incomplete data. Among the incomplete data, the records whose missing values are validly predicted are incorporated into the complete data. Using the predicted value as the actual value is called pseudolabeling. This process is repeated until a stopping condition is satisfied. The most important part of this process is how to evaluate the accuracy of pseudolabels. They can be evaluated by observing the effect of the pseudolabeled data on the performance of the model. Results: In self-training using random forest (RF), the mean squared error was up to 12% lower than that of pure RF, and the Pearson correlation coefficient was 0.1% higher. This difference was confirmed statistically. In the Friedman test performed on MICE and RF, self-training showed a P value between .003 and .02. A Wilcoxon signed-rank test performed on the mean imputation showed the lowest possible P value, 3.05e-5, in all situations. Conclusions: Self-training showed significant results when the predicted values were compared with the actual values, but it needs to be verified in an actual machine learning system. Self-training also has the potential to improve performance depending on the pseudolabel evaluation method, which will be the main subject of our future research.
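
    The self-training loop described in this abstract can be sketched as follows. This is a simplified illustration, not the paper's algorithm: it assumes a one-dimensional k-nearest-neighbor regressor in place of random forest, a single pass over the incomplete rows, and a plain "holdout error must not get worse" rule in place of the quantile-error criterion named in the title. All function names are hypothetical.

```python
def knn_predict(train, x, k=3):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def holdout_mse(train, holdout, k=3):
    """Mean squared error of the k-NN model on held-out complete rows."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in holdout) / len(holdout)


def self_train(complete, incomplete_x, holdout, k=3, batch_size=2):
    """Grow the training set with pseudolabels that do not hurt holdout error.

    complete:     list of (x, y) rows with y observed
    incomplete_x: list of x values whose y is missing
    holdout:      list of (x, y) rows kept aside to judge pseudolabels
    """
    train = list(complete)
    best = holdout_mse(train, holdout, k)
    for start in range(0, len(incomplete_x), batch_size):
        batch = incomplete_x[start:start + batch_size]
        # Pseudolabeling: treat the model's predictions as if they were observed.
        pseudo = [(x, knn_predict(train, x, k)) for x in batch]
        err = holdout_mse(train + pseudo, holdout, k)
        if err <= best:  # accept the batch only if the model does not get worse
            train, best = train + pseudo, err
    return train, best
```

    After the loop, any value that is still missing can be imputed with `knn_predict(train, x)` on the enlarged training set. Judging each batch by its effect on held-out error mirrors the abstract's idea that pseudolabels are evaluated through their effect on model performance.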

    CardioNet: a manually curated database for artificial intelligence-based research on cardiovascular diseases

    Background: Cardiovascular diseases (CVDs) are difficult to diagnose early and have risk factors that are easy to overlook. Early prediction and personalization of treatment through the use of artificial intelligence (AI) may help clinicians and patients manage CVDs more effectively. However, to apply AI approaches to CVD data, it is necessary to establish and curate a specialized database based on electronic health records (EHRs) and to include pre-processed unstructured data. Methods: We built a database (CardioNet) for CVDs that is suited to AI technology and intended to contribute to the overall care of patients with CVDs. First, we collected the anonymized records of 748,474 patients who had visited the Asan Medical Center (AMC) or Ulsan University Hospital (UUH) because of CVDs. Second, we set clinically plausible criteria to remove errors and duplication. Third, we integrated unstructured data, such as readings of medical examinations, with structured data sourced from EHRs to create CardioNet. Because most results of the principal CVD-related medical examinations are free-text readings, we subsequently performed natural language processing to structure the significant variables associated with CVDs. Additionally, to ensure interoperability for convergent multi-center research, we standardized the data using several codes that correspond to the common data model. Finally, we created a descriptive table (i.e., a dictionary of CardioNet) to simplify access to and utilization of the data for clinicians and engineers, and we continuously validated the data to ensure reliability. Results: CardioNet is a comprehensive database that can serve as a training set for AI models and assist in all aspects of clinical management of CVDs. It comprises information extracted from EHRs and results of readings of CVD-related digital tests. It consists of 27 tables, a code-master table, and a descriptive table. Conclusions: The CardioNet database, specialized in CVDs, was established, and data collection is continuing. We are actively supporting multi-center research, which may require further data processing depending on the subject of the study. CardioNet will serve as the fundamental database for future CVD-related research projects.