
    ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜ ์ƒ์กด๋ถ„์„์ด ์ ์šฉ๋œ ์‹ฌํ˜ˆ๊ด€์งˆํ™˜ ์œ„ํ—˜ ํ‰๊ฐ€ ๋ชจ๋ธ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ์œ„ํ•œ ์ฝ•์Šค ๋ชจํ˜•๊ณผ ๊ฒฐํ•ฉ๋œ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ์ ‘๊ทผ๋ฒ•: ํ—ฌ์Šค์ผ€์–ด-ํ™˜๊ฒฝ ์—ฐ๊ณ„ ๋ฐ์ดํ„ฐ ํ™œ์šฉ ์—ฐ๊ตฌ

    Doctoral dissertation -- Seoul National University Graduate School: College of Medicine, Department of Medicine, August 2020. Advisor: ๋ฐ•์ƒ๋ฏผ. Background and aims: The contribution of different cardiovascular disease (CVD) risk factors to risk evaluation and predictive modeling for incident CVD is often debated. It is also uncertain to what extent data on CVD risk factors from multiple data categories should be collected for comprehensive risk assessment and predictive modeling for CVD risk using survival analysis, despite the increasing availability of the relevant data sources. This study aimed to evaluate the contribution of different data categories, derived from integrated data on healthcare and environmental exposure, to risk evaluation and prediction models for CVD risk using deep learning based survival analysis combined with Cox proportional hazards regression (a hybrid approach), as well as Cox proportional hazards regression alone. Methods: Information on the comprehensive list of CVD risk factors was collected from systematic reviews of variables included in conventional CVD risk assessment tools and observational studies in medical literature databases (PubMed and Embase). Each risk factor was screened for availability in the National Health Insurance Service-National Sample Cohort (NHIS-NSC) linked, via residential area code, to environmental exposure data on cumulative particulate matter and urban green space. Individual records of 137,249 patients aged 40 years or older who underwent the biennial national health screening between 2009 and 2010 without a previous history of CVD were followed up for incident CVD events from January 1, 2011 to December 31, 2013 in the NHIS-NSC with data linkage to environmental exposure. Statistics-based variable selection methods were implemented as follows: statistical significance, the subset with the minimum (best) Akaike Information Criterion (AIC), variables selected by regularized Cox proportional hazards regression with an elastic net penalty, and finally a variable set that meets all the criteria of the abovementioned statistical methods. Prediction models using a Cox proportional hazards deep neural network (DeepSurv) and Cox proportional hazards regression were constructed in the training set (80% of the total sample) using input feature sets selected with the abovementioned strategies, progressively adding input features by data category to examine the relative contribution of each data type to the predictive performance for CVD risk. Performance of the DeepSurv and Cox proportional hazards regression models for CVD risk was evaluated in the test set (20% of the total sample) with Uno's concordance statistic (C-index), an up-to-date evaluation metric for survival models with right-censored data. Results: After the comprehensive review, data synthesis, and availability check, a total of 31 risk factors in the categories of sociodemographic factors, clinical laboratory tests and measurements, lifestyle behavior, family history, underlying medical conditions, dental health, medication, and environmental exposure were identified in the NHIS-NSC linked to environmental exposure data.
Among the models constructed with different variable selection methods, DeepSurv with statistically significant variables (Uno's C-index: 0.7069) and Cox proportional hazards regression with all variables (Uno's C-index: 0.7052) showed the best predictive performance for CVD risk, a statistically significant improvement (p-value for the difference in Uno's C-index < 0.0001 for both comparisons) over the respective models with basic clinical factors (age, sex, and body mass index). When all variables and statistically significant variables in each data category, from sociodemographic factors to environmental exposure, were progressively added as input features into DeepSurv and Cox proportional hazards regression, the DeepSurv model with statistically significant variables from the sociodemographic, clinical laboratory test and measurement, and lifestyle behavior categories showed notable performance, outperforming the Cox proportional hazards regression model with statistically significant variables added up to the medication category. Extensive data linkage to environmental exposure on cumulative particulate matter and urban green space offered only marginal improvement in the predictive performance of the DeepSurv and Cox proportional hazards regression models for CVD risk. Conclusion: To obtain the best predictive performance of the DeepSurv model for CVD risk with a minimum number of input features, information on sociodemographic factors, clinical laboratory tests and measurements, and lifestyle behavior should be primarily collected and used as input features in the NHIS-NSC. The overall performance of DeepSurv for CVD risk assessment was also improved by a hybrid approach using statistically significant variables from Cox proportional hazards regression as input features. When all the data categories in the NHIS-NSC linked to environmental exposure data are available, progressively adding variables from each data category can incrementally increase the predictive performance of the DeepSurv model for CVD risk with the hybrid approach. Data linkage to environmental exposure via residential area code in the NHIS-NSC offered only marginally improved performance for CVD risk in both the DeepSurv model with the hybrid approach and the Cox proportional hazards regression model.
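The hybrid approach lends itself to a compact illustration. The sketch below assumes a pandas DataFrame `df` with illustrative column names for follow-up time and the CVD event indicator; it uses lifelines for the significance-based Cox selection step and a PyTorch MLP trained with the negative Cox partial log-likelihood for the DeepSurv component. Hyperparameters and names are placeholders, not the thesis's code.

```python
# Minimal sketch of the hybrid approach (assumed names, not the thesis code):
# 1) select statistically significant covariates with a Cox model (lifelines),
# 2) feed only those covariates to a DeepSurv-style network trained with the
#    negative Cox partial log-likelihood (PyTorch).
import torch
import torch.nn as nn
from lifelines import CoxPHFitter

def select_significant(df, duration_col="time", event_col="cvd", alpha=0.05):
    """Return covariate names with Cox p-values below alpha."""
    cph = CoxPHFitter(penalizer=0.1)          # light regularisation for stability
    cph.fit(df, duration_col=duration_col, event_col=event_col)
    return cph.summary.index[cph.summary["p"] < alpha].tolist()

def cox_ph_loss(risk, time, event):
    """Negative Cox partial log-likelihood (Breslow handling of ties).
    Sorting by descending time makes the running logcumsumexp at row i
    equal to the log partial hazard summed over subject i's at-risk set."""
    order = torch.argsort(time, descending=True)
    risk, event = risk[order].squeeze(-1), event[order]
    log_at_risk = torch.logcumsumexp(risk, dim=0)
    return -((risk - log_at_risk) * event).sum() / event.sum()

class DeepSurv(nn.Module):
    """MLP that outputs a scalar log-risk score per subject."""
    def __init__(self, n_features, hidden=64, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden, 1))

    def forward(self, x):
        return self.net(x)
```

Uno's C-index for the resulting risk scores can then be computed with `concordance_index_ipcw` from scikit-survival.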

    Systems Analytics and Integration of Big Omics Data

    A "genotype" is essentially an organism's full hereditary information, which is obtained from its parents. A "phenotype" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This "Big Data" is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene-environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.

    Language modelling for clinical natural language understanding and generation

    One of the long-standing objectives of Artificial Intelligence (AI) is to design and develop algorithms for social good, including tackling public health challenges. In the era of digitisation, with an unprecedented amount of healthcare data being captured in digital form, the analysis of healthcare data at scale can lead to better research of diseases, better monitoring of patient conditions and, more importantly, improved patient outcomes. However, many AI-based analytic algorithms rely solely on structured healthcare data such as bedside measurements and test results, which only account for 20% of all healthcare data; the remaining 80% is unstructured, including textual data such as clinical notes and discharge summaries, and is still underexplored. Conventional Natural Language Processing (NLP) algorithms designed for clinical applications rely on shallow matching, templates and non-contextualised word embeddings, which leads to a limited understanding of contextual semantics. Though recent advances in NLP algorithms have demonstrated promising performance on a variety of NLP tasks in the general domain with contextualised language models, most of these generic algorithms struggle at specific clinical NLP tasks which require biomedical knowledge and reasoning. Besides, there is limited research on generative NLP algorithms that generate clinical reports and summaries automatically by considering salient clinical information. This thesis aims to design and develop novel NLP algorithms, especially clinical-driven contextualised language models, to understand textual healthcare data and generate clinical narratives which can potentially support clinicians, medical scientists and patients. The first contribution of this thesis focuses on capturing phenotypic information of patients from clinical notes, which is important to profile the patient situation and improve patient outcomes. The thesis proposes a novel self-supervised language model, named Phenotypic Intelligence Extraction (PIE), to annotate phenotypes from clinical notes with the detection of contextual synonyms and an enhancement to reason with numerical values. The second contribution is to demonstrate the utility and benefits of using phenotypic features of patients in clinical use cases by predicting patient outcomes in Intensive Care Units (ICU) and identifying patients at risk of specific diseases with better accuracy and model interpretability. The third contribution is to propose generative models to generate clinical narratives in order to automate and accelerate the process of report writing and summarisation by clinicians. The thesis first proposes a novel summarisation language model named PEGASUS, which surpasses or is on par with the state-of-the-art performance on 12 downstream datasets, including biomedical literature from PubMed. PEGASUS is further extended to generate medical scientific documents from input tabular data.
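As a concrete illustration of the summarisation setting, the sketch below runs a publicly released PEGASUS checkpoint fine-tuned on PubMed through the Hugging Face transformers API; this is a stand-in for the thesis's own models, and the input document is a placeholder.

```python
# Illustrative abstractive summarisation of a biomedical article with a
# public PEGASUS checkpoint; the thesis's exact models and weights may differ.
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "google/pegasus-pubmed"           # checkpoint fine-tuned on PubMed
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

document = "BACKGROUND: ..."                   # placeholder for a full article
batch = tokenizer(document, truncation=True, return_tensors="pt")
summary_ids = model.generate(**batch, max_length=256, num_beams=4)
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True)[0])
```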

    Predict, diagnose, and treat chronic kidney disease with machine learning: a systematic literature review

    Objectives: In this systematic review, we aimed to assess how artificial intelligence (AI), including machine learning (ML) techniques, has been deployed to predict, diagnose, and treat chronic kidney disease (CKD). We systematically reviewed the available evidence on these innovative techniques to improve CKD diagnosis and patient management. Methods: We included English-language studies retrieved from PubMed. The review is therefore to be classified as a "rapid review", since it includes only one database and has language restrictions; the novelty and importance of the issue make missing relevant papers unlikely. We extracted 16 variables, including: main aim, studied population, data source, sample size, problem type (regression, classification), predictors used, and performance metrics. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) approach; all main steps were done in duplicate. The review was registered on PROSPERO. Results: From a total of 648 studies initially retrieved, 68 articles met the inclusion criteria. Models, as reported by the authors, performed well, but the reported metrics were not homogeneous across articles, so direct comparison was not feasible. The most common aim was prediction of prognosis, followed by diagnosis of CKD. Algorithm generalizability and testing on diverse populations were rarely taken into account. Furthermore, when the clinical evaluation and validation of the models/algorithms was examined, only a fraction of the included studies, 6 out of 68, were performed in a clinical context. Conclusions: Machine learning is a promising tool for the prediction of risk, diagnosis, and therapy management for CKD patients. Nonetheless, future work is needed to address the interpretability, generalizability, and fairness of the models to ensure the safe application of such technologies in routine clinical practice.

    Relation Prediction over Biomedical Knowledge Bases for Drug Repositioning

    Identifying new potential treatment options for medical conditions that cause human disease burden is a central task of biomedical research. Since not all candidate drugs can be tested with animal and clinical trials, in vitro approaches are first attempted to identify promising candidates. Likewise, identifying other essential relations (e.g., causation, prevention) between biomedical entities is also critical to understanding biomedical processes. Hence, it is crucial to develop automated relation prediction systems that can yield plausible biomedical relations to expedite the discovery process. In this dissertation, we demonstrate three approaches to predicting treatment relations between biomedical entities for the drug repositioning task using existing biomedical knowledge bases. Our approaches can be broadly labeled as link prediction or knowledge base completion in the computer science literature. Specifically, we first investigate the predictive power of graph paths connecting entities in the publicly available biomedical knowledge base SemMedDB (the entities and relations constitute a large knowledge graph as a whole). To that end, we build logistic regression models utilizing semantic graph pattern features extracted from SemMedDB to predict treatment and causative relations in the Unified Medical Language System (UMLS) Metathesaurus. Second, we study matrix and tensor factorization algorithms for predicting drug repositioning pairs in repoDB, a general-purpose gold-standard database of approved and failed drug-disease indications. The idea here is to predict repoDB pairs by approximating the given input matrix/tensor structure, where the value of a cell represents the existence of a relation coming from the SemMedDB and UMLS knowledge bases. The essential goal is to predict the test pairs that have a blank cell in the input matrix/tensor based on the shared biomedical context among existing non-blank cells. Our final approach involves graph convolutional neural networks where entities and relation types are embedded in a vector space involving neighborhood information. Basically, we minimize an objective function to guide our model toward concept/relation embeddings such that distance scores for positive relation pairs are lower than those for negative ones. Overall, our results demonstrate that recent link prediction methods applied to automatically curated, and hence imprecise, knowledge bases can nevertheless result in highly accurate drug candidate prediction with appropriate configuration of both the methods and datasets used.
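The margin-based embedding objective can be sketched concisely. The snippet below is a TransE-style simplification of that idea, not the dissertation's graph convolutional model: entities and relations are embedded so that positive (head, relation, tail) triples score a lower distance than corrupted negatives; all sizes and names are illustrative.

```python
# Margin-ranking sketch of knowledge-base embedding for relation prediction:
# positive (head, relation, tail) triples should score a lower distance than
# corrupted negatives. TransE-style; illustrative only.
import torch
import torch.nn as nn

class TransE(nn.Module):
    def __init__(self, n_entities, n_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)

    def score(self, h, r, t):
        # Distance of translated head from tail; lower = more plausible.
        return (self.ent(h) + self.rel(r) - self.ent(t)).norm(p=2, dim=-1)

model = TransE(n_entities=10_000, n_relations=50)
loss_fn = nn.MarginRankingLoss(margin=1.0)

def step(pos, neg):
    # pos/neg: (head, relation, tail) index tensors; negatives are corrupted
    # copies of positives with a random head or tail swapped in.
    pos_d, neg_d = model.score(*pos), model.score(*neg)
    target = -torch.ones_like(pos_d)           # we want pos_d < neg_d
    return loss_fn(pos_d, neg_d, target)
```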

    Towards PACE-CAD Systems

    Despite phenomenal advancements in the availability of medical image datasets and the development of modern classification algorithms, Computer-Aided Diagnosis (CAD) has had limited practical exposure in the real-world clinical workflow. This is primarily because of the inherently demanding and sensitive nature of medical diagnosis, which can have far-reaching and serious repercussions in case of misdiagnosis. In this work, a paradigm called PACE (Pragmatic, Accurate, Confident, & Explainable) is presented as a set of must-have features for any CAD system. Diagnosis of glaucoma using Retinal Fundus Images (RFIs) is taken as the primary use case for the development of various methods that may enrich an ordinary CAD system with PACE. However, depending on the specific requirements of different methods, other application areas in ophthalmology and dermatology have also been explored. Pragmatic CAD systems refer to solutions that can perform reliably in the day-to-day clinical setup. In this research, two of the possibly many aspects of a pragmatic CAD are addressed. Firstly, observing that existing medical image datasets are small and not representative of images taken in the real world, a large RFI dataset for glaucoma detection is curated and published. Secondly, realising that a salient attribute of a reliable and pragmatic CAD is its ability to perform in a range of clinically relevant scenarios, classification of 622 unique cutaneous diseases in one of the largest publicly available datasets of skin lesions is successfully performed. Accuracy is one of the most essential metrics of any CAD system's performance. Domain knowledge relevant to three types of diseases, namely glaucoma, Diabetic Retinopathy (DR), and skin lesions, is utilised in an attempt to improve accuracy. For glaucoma, a two-stage framework for automatic Optic Disc (OD) localisation and glaucoma detection is developed, which set a new state-of-the-art for glaucoma detection and OD localisation. To identify DR, a model is proposed that combines coarse-grained classifiers with fine-grained classifiers and grades the disease in four stages with respect to severity. Lastly, different methods of modelling and incorporating metadata are also examined, and their effect on a model's classification performance is studied. Confidence in diagnosing a disease is as important as the diagnosis itself. One of the biggest reasons hampering the successful deployment of CAD in the real world is that medical diagnosis cannot be readily decided based on an algorithm's output. Therefore, a hybrid CNN architecture is proposed with the convolutional feature extractor trained using point estimates and a dense classifier trained using Bayesian estimates. Evaluation on 13 publicly available datasets shows the superiority of this method in terms of classification accuracy, and it also provides an estimate of uncertainty for every prediction. Explainability of AI-driven algorithms has become a legal requirement after Europe's General Data Protection Regulation came into effect. This research presents a framework for easy-to-understand textual explanations of skin lesion diagnosis. The framework is called ExAID (Explainable AI for Dermatology) and relies upon two fundamental modules. The first module uses any deep skin lesion classifier and performs a detailed analysis of its latent space to map human-understandable disease-related concepts to the latent representation learnt by the deep model. The second module proposes Concept Localisation Maps, which extend Concept Activation Vectors by locating significant regions corresponding to a learned concept in the latent space of a trained image classifier. This thesis probes many viable solutions to equip a CAD system with PACE. However, it is noted that some of these methods require specific attributes in datasets and, therefore, not all methods may be applied to a single dataset. Regardless, this work anticipates that consolidating PACE into a CAD system can not only increase the confidence of medical practitioners in such tools but also serve as a stepping stone for the further development of AI-driven technologies in healthcare.
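One lightweight way to realise such a hybrid point-estimate/Bayesian design is sketched below, under the assumption of a frozen ImageNet-pretrained backbone and Monte Carlo dropout as the approximate Bayesian inference in the dense head; the thesis's exact architecture and training procedure may differ.

```python
# Sketch of a hybrid CAD classifier: a point-estimate CNN feature extractor
# with a Bayesian-style head approximated by Monte Carlo dropout. This is a
# common lightweight stand-in, not necessarily the thesis's exact method.
import torch
import torch.nn as nn
from torchvision import models

backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()                    # expose 512-d features
head = nn.Sequential(nn.Dropout(0.5), nn.Linear(512, 2))  # e.g. glaucoma vs normal

@torch.no_grad()
def predict_with_uncertainty(x, n_samples=20):
    backbone.eval()
    head.train()                               # keep dropout active at test time
    feats = backbone(x)
    probs = torch.stack([head(feats).softmax(-1) for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)         # predictive mean and uncertainty
```

Averaging stochastic forward passes gives a predictive mean, while the spread across samples serves as the per-prediction uncertainty estimate mentioned in the abstract.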

    Addressing subjectivity in the classification of palaeoenvironmental remains with supervised deep learning convolutional neural networks

    Archaeological object identifications have traditionally been undertaken through a comparative methodology, where each artefact is identified through a subjective, interpretative act by a professional. Regarding palaeoenvironmental remains, this comparative methodology is given boundaries by using reference materials and codified sets of rules, but subjectivity is nevertheless present. The problem with this traditional archaeological methodology is that a higher level of subjectivity in the identification of artefacts leads to inaccuracies, which in turn increase the potential for Type I and Type II errors in the testing of hypotheses. Reducing the subjectivity of archaeological identifications would improve the statistical power of archaeological analyses, which would subsequently lead to more impactful research. In this thesis, it is shown that the level of subjectivity in palaeoenvironmental research can be reduced by applying deep learning convolutional neural networks within an image recognition framework. The primary aim of the presented research is therefore to further the ongoing paradigm shift in archaeology towards model-based object identifications, particularly within the realm of palaeoenvironmental remains. Although this thesis focuses on the identification of pollen grains and animal bones, with the latter being restricted to the astragalus of sheep and goats, there are wider implications for archaeology, as these methods can easily be extended beyond pollen and animal remains. The previously published POLEN23E dataset is used as the pilot study for applying deep learning to pollen grain classification. In contrast, an image dataset of modern bones was compiled for the classification of sheep and goat astragali, owing to a complete lack of available bone image datasets, and a double-blind study with inexperienced and experienced zooarchaeologists was performed to establish a benchmark against which image recognition models can be compared. In both classification tasks, the presented models outperform all previous formal modelling methods, and only the best human analysts match the performance of the deep learning model in the sheep and goat astragalus separation task. Throughout the thesis, there is a specific focus on increasing trust in the models through the visualization of the models' decision making, and avenues of improvement to Grad-CAM are explored. This thesis makes an explicit case for phasing out the comparative methods in favour of a formal modelling framework within archaeology, especially in palaeoenvironmental object identification.
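Since Grad-CAM is central to the visualization work, a minimal version is sketched below for a ResNet-18 using PyTorch hooks: the last convolutional block's activations are weighted by the spatial mean of their gradients with respect to the target class score. The thesis's improved variants are not reproduced here.

```python
# Minimal Grad-CAM sketch: weight the last conv layer's activations by the
# spatial mean of their gradients w.r.t. the target class score.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1").eval()
acts, grads = {}, {}
layer = model.layer4[-1]
layer.register_forward_hook(lambda m, i, o: acts.update(v=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(v=go[0]))

def grad_cam(x, class_idx):
    score = model(x)[0, class_idx]
    model.zero_grad()
    score.backward()
    weights = grads["v"].mean(dim=(2, 3), keepdim=True)   # per-channel weights
    cam = F.relu((weights * acts["v"]).sum(dim=1))        # (1, H, W) heatmap
    return F.interpolate(cam.unsqueeze(1), size=x.shape[-2:],
                         mode="bilinear", align_corners=False)[0, 0]
```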

    Advances in Artificial Intelligence: Models, Optimization, and Machine Learning

    The present book contains all the articles accepted and published in the Special Issue "Advances in Artificial Intelligence: Models, Optimization, and Machine Learning" of the MDPI Mathematics journal, which covers a wide range of topics connected to the theory and applications of artificial intelligence and its subfields. These topics include, among others, deep learning and classic machine learning algorithms, neural modelling, architectures and learning algorithms, biologically inspired optimization algorithms, algorithms for autonomous driving, probabilistic models and Bayesian reasoning, intelligent agents and multiagent systems. We hope that the scientific results presented in this book will serve as valuable sources of documentation and inspiration for anyone willing to pursue research in artificial intelligence, machine learning, and their widespread applications.

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has therefore attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST dataset with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
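The evaluation protocol can be sketched as follows: classify clean and noise-perturbed test images and compare accuracies with and without the preprocessing transform. The `preprocess` argument is a hypothetical hook standing in for the CORF delineation step, whose implementation is not shown here.

```python
# Sketch of the evaluation protocol: compare a classifier's accuracy on clean
# vs. noise-perturbed test images, with an optional preprocessing transform.
# `preprocess` is a placeholder for the CORF push-pull operator (not shown).
import torch

def add_noise(x, kind="gaussian", level=0.1):
    if kind == "gaussian":
        noisy = x + level * torch.randn_like(x)
    else:                                      # uniform noise in [-level, level]
        noisy = x + level * (2 * torch.rand_like(x) - 1)
    return noisy.clamp(0, 1)

@torch.no_grad()
def accuracy(model, loader, preprocess=None, noise_level=0.0):
    model.eval()
    correct = total = 0
    for x, y in loader:
        if noise_level > 0:
            x = add_noise(x, level=noise_level)
        if preprocess is not None:             # e.g. CORF delineation maps
            x = preprocess(x)
        correct += (model(x).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total
```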

    Washington University Senior Undergraduate Research Digest (WUURD), Spring 2018

    From the Washington University Office of Undergraduate Research Digest (WUURD), Vol. 13, 05-01-2018. Published by the Office of Undergraduate Research. Joy Zalis Kiefer, Director of Undergraduate Research and Associate Dean in the College of Arts & Sciences.
    • โ€ฆ
    corecore