398 research outputs found

    Spatiotemporal Distribution Analysis of PM2.5 Sources and Their Contributions Using Machine Learning and Receptor Models

    Doctoral dissertation (Ph.D.), Seoul National University Graduate School, College of Engineering, Department of Civil and Environmental Engineering, February 2023. 김재영.
    Particulate matter smaller than 2.5 micrometers in diameter (PM2.5) has been a pollutant of global interest for decades, owing to its adverse health effects. To develop effective PM2.5 management strategies, it is crucial to identify its sources and quantify how much each contributes to ambient PM2.5 concentrations in time and space. Source apportionment is therefore key to characterizing PM2.5, and receptor modeling, a statistical approach to source apportionment that takes the chemical constituents of PM2.5 as input data, is widely used to identify PM2.5 sources. This study aimed to characterize PM2.5 through source apportionment and spatiotemporal analysis of the estimated sources, in order to inform effective management strategies. Two types of modeling were performed for the source apportionment study. The first is positive matrix factorization (PMF) modeling, which identifies specific source types and their contributions to PM2.5 at a single site. The second is Bayesian spatial multivariate receptor modeling (BSMRM), which derives major sources and their contributions to PM2.5 over a wide area from multiple monitoring sites. In addition, machine learning models were used to predict the concentrations of PM2.5 chemical constituents, which are the most important input data for receptor modeling. The applicability of machine learning models to PM2.5 chemical constituent data was assessed, with the aim of improving the integrity of those data. The sources of PM2.5 and their contributions in Siheung, South Korea, were identified using PMF modeling.
The 10 sources were secondary nitrate (24.3%), secondary sulfate (18.8%), traffic (18.8%), combustion for heating (12.6%), biomass burning (11.8%), coal combustion (3.6%), heavy oil industry (1.8%), smelting industry (4.0%), sea salt (2.7%), and soil (1.7%). Based on the derived sources, the carcinogenic and non-carcinogenic health risks due to PM2.5 inhalation were estimated. The contributions of coal combustion, heavy oil industry, and traffic sources to PM2.5 mass concentration were low, but their carcinogenic health risks exceeded the benchmark value (1E-06). Therefore, countermeasures for PM2.5 emission sources should consider source-specific health risks, not only PM2.5 mass concentration. The feature extraction capabilities of four machine learning models for predicting the chemical constituents of PM2.5 were assessed by comparing prediction accuracy across input variables, target constituents, input data periods, missing ratios of the input data, and study sites. The concentrations of PM2.5 constituents were predicted at three sites in South Korea (Seoul, Ulsan, and Baengnyeong) for 2016 to 2018, using four machine learning models: generative adversarial imputation network (GAIN), fully connected deep neural network (FCDNN), random forest (RF), and k-nearest neighbor (kNN). The prediction accuracy, measured by the coefficient of determination (R2) between predictions and observations, was highest for GAIN, followed by FCDNN, RF, and kNN. As the missing ratio of the input data increased (20, 40, 60, and 80%), the prediction accuracy decreased in all four models, with a more noticeable decrease in GAIN and kNN, which are unsupervised models. As the input data period increased, the prediction accuracy of the two deep learning models, GAIN and FCDNN, improved more than that of the other two models, RF and kNN. Study sites with more local emission sources exhibited lower prediction accuracy, with the highest R2 at Baengnyeong Island and the lowest at Ulsan.
Among the target constituent groups, ions were predicted with the highest R2 and trace elements with the lowest. This study demonstrated that machine learning models can be extended to further air pollution studies, depending on model features, required performance, and experimental conditions such as data availability and time constraints. The spatial distributions of five PM2.5 sources in South Korea were estimated using Bayesian spatial multivariate receptor modeling (BSMRM). Secondary nitrate, secondary sulfate, motor vehicle emissions, industry, and sea salt were determined to be significant contributors to ambient PM2.5 concentrations in South Korea. The spatial surface of the daily average contribution of each source in South Korea was derived from measurement data at eight monitoring sites. The source contributions predicted by BSMRM were validated by holding out the data from each test site (Ansan, Daejeon, and Gwangju) in turn and comparing the results. These predicted source contributions can aid in developing effective PM2.5 control strategies in cities where no speciated PM2.5 monitoring stations are available. They can also be used as source-specific exposures in health effect studies, even in cities where no monitoring stations are available.
Contents:
CHAPTER 1. INTRODUCTION: 1.1. Background; 1.2. Objectives; 1.3. Dissertation structure; References
CHAPTER 2. LITERATURE REVIEW: 2.1. Source apportionment and receptor modeling of PM2.5; 2.2. Toxicity and health risk assessment of PM2.5; 2.3. Machine learning approaches in prediction of PM2.5; 2.4. Bayesian approach in source apportionment; References
CHAPTER 3. SOURCE APPORTIONMENT OF PM2.5 USING PMF MODEL AND HEALTH RISK ASSESSMENT BY INHALATION: 3.1. Introduction; 3.2. Materials and methods (3.2.1 Study site, sampling, and analysis; 3.2.2 Positive matrix factorization (PMF) modeling and combined analysis with meteorological data; 3.2.3 Health risk assessment); 3.3. Results and discussion (3.3.1 PM2.5 mass concentration and chemical speciation; 3.3.2 Source apportionment of PM2.5 by PMF modeling; 3.3.3 Carcinogenic and non-carcinogenic health risks; 3.3.4 Probable source areas or directions); 3.4. Summary; References
CHAPTER 4. FEATURE EXTRACTION AND PREDICTION OF PM2.5 CHEMICAL CONSTITUENTS USING MACHINE LEARNING MODELS: 4.1. Introduction; 4.2. Materials and methods (4.2.1. Study Sites and Data Collection; 4.2.2. Machine Learning Models and Hyperparameter Optimization; 4.2.3. Prediction Scenarios; 4.2.4. Model Validation and Error Estimation); 4.3. Results and discussion (4.3.1. Hyperparameter Optimization; 4.3.2. Prediction Results for Scenario #1; 4.3.3. Prediction Results for Scenario #2; 4.3.4. Features and Performance of Four ML Models); 4.4. Summary; Data Availability; Code Availability; References
CHAPTER 5. BAYESIAN SPATIAL MULTIVARIATE RECEPTOR MODELING FOR SPATIOTEMPORAL ANALYSIS OF PM2.5 SOURCES: 5.1. Introduction; 5.2. Materials and methods (5.2.1 Air pollution data; 5.2.2 Bayesian spatial multivariate receptor modeling (BSMRM); 5.2.3 Application of BSMRM to Korea PM2.5 speciation data); 5.3. Results and discussion (5.3.1 Bayesian spatial multivariate receptor modeling (BSMRM) results; 5.3.2 Model validation; 5.3.3 Spatial distribution of each source in South Korea); 5.4. Summary; References
CHAPTER 6. CONCLUSIONS AND FUTURE WORK: 6.1. Conclusions; 6.2. Future work
Abstract in Korean
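
For readers unfamiliar with the receptor model named in this abstract, positive matrix factorization is usually written as the bilinear model below. This is the standard textbook formulation, not an equation taken from the dissertation, and the symbols are assumptions introduced here: x_{ij} is the measured concentration of constituent j in sample i, g_{ik} the contribution of source k to sample i, f_{kj} the profile of source k, e_{ij} the residual, u_{ij} the measurement uncertainty, and p the number of sources (10 in the Siheung analysis described above).

```latex
\begin{align}
x_{ij} &= \sum_{k=1}^{p} g_{ik}\, f_{kj} + e_{ij},
  \qquad g_{ik} \ge 0,\quad f_{kj} \ge 0, \\
Q &= \sum_{i=1}^{n} \sum_{j=1}^{m} \left( \frac{e_{ij}}{u_{ij}} \right)^{2}
  \quad \text{(minimized over } g_{ik} \text{ and } f_{kj}\text{)}.
\end{align}
```

Under this formulation, percentage contributions such as those quoted in the abstract would correspond to each source's share of the reconstructed PM2.5 mass.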

    Current approaches used in epidemiologic studies to examine short-term multipollutant air pollution exposures

    Air pollution epidemiology traditionally focuses on the relationship between individual air pollutants and health outcomes (e.g., mortality). To account for potential copollutant confounding, individual pollutant associations are often estimated by adjusting or controlling for other pollutants in the mixture. Recently, the need to characterize the relationship between health outcomes and the larger multipollutant mixture has been emphasized in an attempt to better protect public health and inform more sustainable air quality management decisions.

    Statistical Methods in Integrative Genomics

    Statistical methods in integrative genomics aim to answer important biological questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.

    Multi-omics analysis of the ageing liver

    This dissertation presents three manuscripts that originated from my work on the ageing liver. The first two manuscripts concentrate on integrative multi-omics approaches, such as scATAC-seq and scRNA-seq, spatial transcriptomics, CUT&RUN sequencing and lipidomics. They reveal distinct ageing signatures within the murine liver, mainly in hepatocytes. The third manuscript introduces a new methodology for the analysis of spatial sequencing data, which was developed with the ageing liver as an intended application. The first manuscript, "Single-cell resolution unravels spatial alterations in metabolism, transcriptome, and epigenome of ageing liver", establishes how spatial location and microenvironmental changes impact the ageing trajectories of hepatocytes within liver tissue. Through the integration of spatial transcriptomics, single-cell ATAC- and RNA-seq, lipidomics, and functional assays, the study elucidates zonation-specific and age-related changes in the epigenome, transcriptome, and metabolic states. We identified a zonation-dependent shift in the epigenome and show that changing microenvironments within a tissue exert strong influences on their resident cells that can shape epigenetic, metabolic and phenotypic outputs. From a functional perspective, periportal hepatocytes exhibited diminished mitochondrial fitness, whereas pericentral hepatocytes demonstrated an increased accumulation of large lipid droplets. The second manuscript, "Ageing is associated with increased chromatin accessibility and reduced polymerase pausing in liver", examines the chromatin landscape of the ageing liver by using CUT&RUN for RNA polymerase mapping, integrated with ATAC-seq, RNA-seq, and NET-seq. The study reveals an increase in chromatin accessibility at promoter regions as a characteristic of ageing, which is not accompanied by a corresponding increase in transcriptional output.
Ageing is also found to be associated with a decrease in promoter-proximal pausing of RNA Polymerase II. Our observations suggest that alterations in transcriptional regulation associated with ageing may be due to decreased stability of the pausing complex. The third manuscript, "Dimension reduction by spatial components analysis improves pattern detection in multivariate spatial data", introduces SPACO, a new statistical approach designed to enhance pattern recognition in multivariate spatial sequencing data. SPACO stands out by focusing on gene co-regulation and maximising local covariance. It provides a more sensitive and accurate test for the identification of genes with a spatial expression pattern. Moreover, the use of spatial components for gene denoising by SPACO boosts the effective linkage of histological observations with gene expression patterns, even in high-noise conditions.

    Bayesian Statistical Modeling of Spatially Resolved Transcriptomics Data

    Spatially resolved transcriptomics (SRT) quantifies expression levels at different spatial locations, providing a new and powerful tool for gaining novel biological insights. As experimental technologies improve in both capacity and efficiency, there is a growing demand for the development of analytical methodologies. One question in SRT data analysis is how to identify genes whose expression exhibits spatially correlated patterns, called spatially variable (SV) genes. Most current methods to identify SV genes are built upon the geostatistical model with a Gaussian process, which could limit the models' ability to identify complex spatial patterns. In order to overcome this challenge and capture more types of spatial patterns, in Chapter 2, we introduce a Bayesian approach to identify SV genes via a modified Ising model. The key idea is to use the energy interaction parameter of the Ising model to characterize spatial expression patterns. We use auxiliary variable Markov chain Monte Carlo algorithms to sample from the posterior distribution with an intractable normalizing constant in the model. Simulation studies using both simulated and synthetic data showed that the energy-based modeling approach led to higher accuracy in detecting SV genes than kernel-based methods. When applied to two real SRT datasets, the proposed method discovered novel spatial patterns that shed light on the biological mechanisms. Spatial domain identification is another direction in SRT analysis, which enables the transcriptomic characterization of tissue structures and further contributes to the evaluation of heterogeneity across different tissue locations. Current spatial domain analysis of SRT data primarily relies on molecular information and fails to fully exploit the morphological features present in histology images, leading to compromised accuracy and interpretability.
To overcome these limitations, in Chapter 3, we develop a multi-stage statistical method called iIMPACT. It includes a finite mixture model to identify and define spatial domains based on AI-reconstructed histology images and the spatial context of gene expression measurements, and a negative binomial regression model to detect domain-specific spatially variable genes. Through multiple case studies, we demonstrated that iIMPACT outperformed existing methods, as confirmed by ground truth biological knowledge. These findings underscore the accuracy and interpretability of iIMPACT as a new clustering approach, providing valuable insights into the cellular spatial organization and landscape of functional genes within SRT data. Most next-generation sequencing-based SRT techniques are limited to measuring gene expression in a confined array of spots, capturing only a fraction of the spatial domain. Typically, these spots encompass gene expression from a few to hundreds of cells, underscoring a critical need for more detailed, single-cell resolution SRT data to enhance our understanding of biological functions within the tissue context. Addressing this challenge, in Chapter 4, we introduce BayesDeep, a novel Bayesian hierarchical model that leverages cellular morphological data from histology images, commonly paired with SRT data, to reconstruct SRT data at the single-cell resolution. BayesDeep effectively models count data from SRT studies via a negative binomial regression model. This model incorporates explanatory variables such as cell types and nuclei-shape information for each cell extracted from the paired histology image. A feature selection scheme is integrated to examine the association between the morphological and molecular profiles, thereby improving the model robustness. We applied BayesDeep to two real SRT datasets, successfully demonstrating its capability to reconstruct SRT data at the single-cell resolution.
This advancement not only yields new biological insights but also significantly enhances various downstream analyses, such as pseudotime and cell-cell communication.

    Unraveling the Thousand Word Picture: An Introduction to Super-Resolution Data Analysis

    Super-resolution microscopy provides direct insight into fundamental biological processes occurring at length scales smaller than light's diffraction limit. The analysis of data at such scales has brought statistical and machine learning methods into the mainstream. Here we provide a survey of data analysis methods starting from an overview of basic statistical techniques underlying the analysis of super-resolution and, more broadly, imaging data. We subsequently break down the analysis of super-resolution data into four problems: the localization problem, the counting problem, the linking problem, and what we've termed the interpretation problem.

    Book of Abstracts XVIII Congreso de Biometría CEBMADRID

    Abstracts of the XVIII Congreso de Biometría CEBMADRID, held from 25 to 27 May in Madrid.
    Interactive modelling and prediction of patient evolution via multistate models / Leire Garmendia Bergés, Jordi Cortés Martínez and Guadalupe Gómez Melis : This research was funded by the Ministerio de Ciencia e Innovación (Spain) [PID2019104830RBI00]; and the Generalitat de Catalunya (Spain) [2017SGR622 and 2020PANDE00148].
    Operating characteristics of a model-based approach to incorporate non-concurrent controls in platform trials / Pavla Krotka, Martin Posch, Marta Bofill Roig : EU-PEARL (EU Patient-cEntric clinicAl tRial pLatforms) project has received funding from the Innovative Medicines Initiative (IMI) 2 Joint Undertaking (JU) under grant agreement No 853966. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and EFPIA and Children's Tumor Foundation, Global Alliance for TB Drug Development non-profit organisation, Spring works Therapeutics Inc.
    Modeling COPD hospitalizations using variable domain functional regression / Pavel Hernández Amaro, María Durbán Reguera, María del Carmen Aguilera Morillo, Cristobal Esteban Gonzalez, Inma Arostegui : This work is supported by the grant ID2019-104901RB-I00 from the Spanish Ministry of Science, Innovation and Universities MCIN/AEI/10.13039/501100011033.
    Spatio-temporal quantile autoregression for detecting changes in daily temperature in northeastern Spain / Jorge Castillo-Mateo, Alan E. Gelfand, Jesús Asín, Ana C. Cebrián : This work was partially supported by the Ministerio de Ciencia e Innovación under Grant PID2020-116873GB-I00; Gobierno de Aragón under Research Group E46_20R: Modelos Estocásticos; and JC-M was supported by Gobierno de Aragón under Doctoral Scholarship ORDEN CUS/581/2020.
    Estimation of the area under the ROC curve with complex survey data / Amaia Iparragirre, Irantzu Barrio, Inmaculada Arostegui : This work was financially supported in part by IT1294-19, PID2020-115882RB-I00, KK-2020/00049. The work of AI was supported by PIF18/213.
    INLAMSM: Adjusting multivariate lattice models with R and INLA / Francisco Palmí Perales, Virgilio Gómez Rubio and Miguel Ángel Martínez Beneito : This work has been supported by grants PPIC-2014-001-P and SBPLY/17/180501/000491, funded by Consejería de Educación, Cultura y Deportes (Junta de Comunidades de Castilla-La Mancha, Spain) and FEDER, grant MTM2016-77501-P, funded by Ministerio de Economía y Competitividad (Spain), grant PID2019-106341GB-I00 from Ministerio de Ciencia e Innovación (Spain) and a grant to support research groups by the University of Castilla-La Mancha (Spain). F. Palmí-Perales has been supported by a Ph.D. scholarship awarded by the University of Castilla-La Mancha (Spain).

    Inferential stability in systems biology

    The modern biological sciences are fraught with statistical difficulties. Biomolecular stochasticity, experimental noise, and the "large p, small n" problem all contribute to the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful conclusions from observations. In this thesis, we explore methods for assessing the effects of data variability upon downstream inference, in an attempt to quantify and promote the stability of the inferences we make. We start with a review of existing methods for addressing this problem, focusing upon the bootstrap and similar methods. The key requirement for all such approaches is a statistical model that approximates the data generating process. We move on to consider biomarker discovery problems. We present a novel algorithm for proposing putative biomarkers on the strength of both their predictive ability and the stability with which they are selected. In a simulation study, we find our approach to perform favourably in comparison to strategies that select on the basis of predictive performance alone. We then consider the real problem of identifying protein peak biomarkers for HAM/TSP, an inflammatory condition of the central nervous system caused by HTLV-1 infection. We apply our algorithm to a set of SELDI mass spectral data, and identify a number of putative biomarkers. Additional experimental work, together with known results from the literature, provides corroborating evidence for the validity of these putative biomarkers. Having focused on static observations, we then make the natural progression to time course data sets. We propose a (Bayesian) bootstrap approach for such data, and then apply our method in the context of gene network inference and the estimation of parameters in ordinary differential equation models.
We find that the inferred gene networks are relatively unstable, and demonstrate the importance of finding distributions of ODE parameter estimates, rather than single point estimates.
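
The point about reporting a distribution of estimates instead of a single point estimate can be illustrated with a minimal sketch. It uses the ordinary nonparametric bootstrap on a sample mean, not the (Bayesian) bootstrap for time course data that the thesis proposes; the data values and the function names `bootstrap_means` and `percentile_interval` are hypothetical and chosen only for illustration.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Resample the data with replacement and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Return the empirical (lower, upper) percentile interval of the values."""
    ordered = sorted(values)
    lo_idx = int(lower * (len(ordered) - 1))
    hi_idx = int(upper * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


# Made-up observations standing in for repeated estimates of one parameter.
observations = [2.1, 2.4, 1.9, 3.2, 2.8, 2.2, 2.6, 3.0]

means = bootstrap_means(observations)
low, high = percentile_interval(means)
point_estimate = statistics.fmean(observations)

print(f"point estimate: {point_estimate:.3f}")
print(f"95% bootstrap percentile interval: ({low:.3f}, {high:.3f})")
```

The width of the interval, which the point estimate alone hides, is the kind of variability the abstract argues should be reported.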

    Multivariate Models and Algorithms for Systems Biology

    Rapid advances in high-throughput data acquisition technologies, such as microarrays and next-generation sequencing, have enabled scientists to interrogate the expression levels of tens of thousands of genes simultaneously. However, challenges remain in developing effective computational methods for analyzing data generated from such platforms. In this dissertation, we address some of these challenges. We divide our work into two parts. In the first part, we present a suite of multivariate approaches for the reliable discovery of gene clusters, often interpreted as pathway components, from molecular profiling data with replicated measurements. We translate our goal into learning an optimal correlation structure from replicated complete and incomplete measurements. In the second part, we focus on the reconstruction of signal transduction mechanisms in the signaling pathway components. We propose gene set based approaches for inferring the structure of a signaling pathway. First, we present a constrained multivariate Gaussian model, referred to as the informed-case model, for estimating the correlation structure from replicated and complete molecular profiling data. The informed-case model generalizes the previously known blind-case model by accommodating prior knowledge of replication mechanisms. Second, we generalize the blind-case model by designing a two-component mixture model. Our idea is to strike an optimal balance between a fully constrained correlation structure and an unconstrained one. Third, we develop an Expectation-Maximization algorithm to infer the underlying correlation structure from replicated molecular profiling data with missing (incomplete) measurements. We utilize our correlation estimators for clustering real-world replicated complete and incomplete molecular profiling data sets. The above three components constitute the first part of the dissertation.
For the structural inference of signaling pathways, we hypothesize a directed signal pathway structure as an ensemble of overlapping and linear signal transduction events. We then propose two algorithms to reverse engineer the underlying signaling pathway structure using unordered gene sets corresponding to signal transduction events. Throughout, we treat gene sets as variables and the associated gene orderings as random. The first algorithm has been developed under the Gibbs sampling framework and the second algorithm utilizes the framework of simulated annealing. Finally, we summarize our findings and discuss possible future directions.