20 research outputs found

    A study on the quantification of pathway activity using RNA-seq data

    Doctoral thesis (Ph.D.), Seoul National University Graduate School, College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2019. Sun Kim.
Measuring the dynamics of RNA transcripts using RNA-seq data has become routine in bioinformatics analyses. However, RNA-seq produces high-dimensional transcriptome data on more than 20,000 genes in humans, which makes the data extremely difficult to interpret given a relatively small set of samples. Therefore, it is desirable to use well-summarized and widely used information such as biological pathways for better biological comprehension. However, summarizing transcriptome data in terms of biological pathways is a very challenging task for several reasons. First, there is a huge information loss when transforming transcriptome data to pathway space. For example, in humans, only one third of the entire set of genes being analyzed are present in KEGG pathways. Second, each pathway consists of many genes; thus, measuring pathway activity requires a strategy to summarize the expression profiles of component genes into a single value while considering the relationships among the constituent genes. My doctoral study aimed to develop a new method for pathway activity measurement and to perform extensive evaluation experiments on existing pathway measurement tools in terms of multiple evaluation criteria. In addition, a cloud-based system was constructed to deploy such tools, making it easy for users to analyze their own data.
The first study develops a new method to summarize transcriptome data in terms of pathways by using explicit transcript quantity information and considering the relationships among genes in terms of their interactions. In this study, I propose a novel concept of decomposing biological pathways into subsystems by utilizing a protein interaction network, pathway information, and RNA-seq data. A subsystem activation score (SAS) was designed to measure the degree of activation for each subsystem and each patient. This method revealed distinctive genome-wide activation patterns, or landscapes, of subsystems that are differentially activated among samples as well as among breast cancer subtypes. Next, we used SAS information for prognostic modeling by classification and regression tree (CART) analysis. Eleven subgroups of patients, defined by the 10 most significant subsystems, were identified with maximal discrepancy in survival outcome. Our model not only defined patient subgroups with similar survival outcomes, but also provided patient-specific decision paths determined by SAS status, suggesting functionally informative gene sets in breast cancer. The second study aimed to systematically compare and evaluate thirteen different pathway activity inference tools based on five comparison criteria using a pan-cancer data set. Although many pathway activity tools are available, there is no comparative study on how effective these tools are in producing useful information at the cohort level, enabling comparison of many samples. This study makes two major contributions. First, it provides a comprehensive survey of the computational techniques used by existing pathway activity inference tools. Existing tools use different strategies and assume different requirements on data: input transformation, use of labels, necessity of cohort-level input data, use of gene relations, and scoring metrics.
Second, extensive evaluations were conducted using five comparison criteria concerning the performance of these tools. The evaluation starts by measuring how well a tool preserves the characteristics of the original gene expression profile, and robustness was then investigated by introducing noise into the gene expression data. Classification tasks on three clinical variables (tumor versus normal, survival, and cancer subtype) were performed to evaluate the utility of the tools. The third study builds a cloud-based system (PathwayCloud) in which a user provides transcriptome data and measures pathway activities using the tools from the comparative study. When a user uploads input data to the system and selects the analysis tools to run, the system automatically generates pathway activity values for each tool as well as a summary of the performance comparison for the selected tools. Users can also investigate which pathways are significant in terms of the given sample information and visually inspect genes within a pathway through the KEGG REST API. In conclusion, in my thesis, I sought to develop an analysis method for biological pathways using high-throughput gene expression data, to compare different types of tools with comprehensive criteria, and to arrange the tools in a cloud-based system that is easily accessible. As pathways aggregate various molecular events among genes into a single entity, the set of suggested approaches will aid interpretation of high-throughput data and facilitate integration of diverse data layers such as miRNA or DNA methylation profiles.
Chapter 1 Introduction
  1.1 Biological background
    1.1.1 Biological pathways
    1.1.2 Gene expression
    1.1.3 Pathway-based analysis
    1.1.4 Pathway activity measurement
  1.2 Challenges in pathway activity measurement
    1.2.1 Calculating effective pathway activity values from RNA-seq data
    1.2.2 Lack of comparative criteria to evaluate pathway activity tools
    1.2.3 Absence of a user-friendly environment of pathway activity inference tools
  1.3 Outline of the thesis
Chapter 2 Measuring pathway activity from RNA-seq data to identify breast cancer subsystems using protein-protein interaction network
  2.1 Related works
  2.2 Motivation
  2.3 Methods
    2.3.1 Breast cancer subsystems
    2.3.2 Subsystem Activation Score
    2.3.3 Prognostic modeling
    2.3.4 Hierarchical clustering of patients and subsystems
    2.3.5 Tools used in this study
  2.4 Results
    2.4.1 Pathways were decomposed into coherent functional units - subsystems
    2.4.2 Landscape of subsystems reflect the breast cancer biology
    2.4.3 SAS revealed patient clusters associated with PAM50 subtypes
    2.4.4 Prognostic modeling by subsystems showed 11 patient subgroups with distinct survival outcome
    2.4.5 Relapse rate and CNVs were enriched to worse prognostic subgroups
  2.5 Discussion
Chapter 3 Comprehensive evaluation of pathway activity measurement tools on pan-cancer data
  3.1 Related works
  3.2 Motivation
  3.3 Materials and methods
    3.3.1 Pathway activity inference Tools
    3.3.2 Data sets
    3.3.3 Pathway database
    3.3.4 Notations
  3.4 Comparative approach
    3.4.1 Radar chart criteria
    3.4.2 Similarity among the tools
  3.5 Results
    3.5.1 Distance preservation
    3.5.2 Robustness against noise
    3.5.3 Classification: Tumor vs Normal
    3.5.4 Classification: survival information
    3.5.5 Classification: cancer subtypes
    3.5.6 Similarity among the tools
  3.6 Discussion
Chapter 4 A cloud-based system of pathway activity inference tools using high-throughput gene expression data
  4.1 Related works
  4.2 Motivation
  4.3 Implementation
  4.4 Results
    4.4.1 Calculating pathway activity values
    4.4.2 Identification of significant pathways
    4.4.3 Visualization in KEGG pathways
    4.4.4 Comparison of the tools
  4.5 Discussion
Chapter 5 Conclusion
Abstract in Korean
Docto
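The abstract above describes pathway activity measurement as summarizing the expression of a pathway's member genes into a single value per sample. A minimal sketch of that general idea follows, assuming the simplest possible summary (the mean of per-gene z-scores). This is a generic baseline and not the thesis's Subsystem Activation Score; the function names, gene identifiers, and toy pathway are all hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """Standardize each gene's expression across samples (z-score)."""
    z = {}
    for gene, values in expr.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with constant expression carries no signal; map it to zeros.
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return z


def pathway_activity(expr, pathways):
    """Return {pathway: [activity per sample]} as the mean z-score of member genes."""
    z = zscore_rows(expr)
    n_samples = len(next(iter(expr.values())))
    activity = {}
    for name, genes in pathways.items():
        # Genes listed in the pathway but absent from the data are skipped.
        members = [z[g] for g in genes if g in z]
        if not members:
            continue
        activity[name] = [mean([row[s] for row in members]) for s in range(n_samples)]
    return activity


# Hypothetical toy data: 3 genes measured in 4 samples, and 1 made-up pathway.
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [5.0, 5.0, 5.0, 5.0],
}
pathways = {"toy_pathway": ["G1", "G2", "G9"]}
scores = pathway_activity(expr, pathways)
```

A plain mean ignores the relationships among member genes, which is exactly the limitation the abstract raises; network-aware scores such as the one the thesis proposes are meant to address it.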

    Novel Wavelet-Based Statistical Methods with Applications in Classification, Shrinkage, and Nano-Scale Image Analysis

    Given the recent popularity and clear evidence of wide applicability of wavelets, this thesis is devoted to several statistical applications of wavelet transforms. Statistical multiscale modeling has, in the most recent decade, become a well-established area in both theoretical and applied statistics, with impact on developments in statistical methodology. Wavelet-based methods are important in statistics in areas such as regression, density and function estimation, factor analysis, modeling and forecasting in time series analysis, assessing self-similarity and fractality in data, and spatial statistics. In this thesis we show the applicability of wavelets by considering three problems. First, we consider a binary wavelet-based linear classifier. Both consistency results and implementation issues are addressed. We show that under mild assumptions the wavelet-based classification rule is both weakly and strongly universally consistent. The proposed method is illustrated on synthetic data sets in which the truth is known and on applied classification problems from the industrial and bioengineering fields. Second, we develop a wavelet shrinkage methodology based on testing multiple hypotheses in the wavelet domain. The shrinkage/thresholding approach by implicit or explicit simultaneous testing of many hypotheses has been considered by many researchers and goes back to the early 1990s. We propose two new approaches to wavelet shrinkage/thresholding based on local False Discovery Rate (FDR), Bayes factors, and ordering of posterior probabilities. Finally, we propose a novel method for the analysis of straight-line alignment of features in images based on Hough and wavelet transforms. 
The new method is designed to work specifically with Transmission Electron Microscope (TEM) images taken at nanoscale to detect linear structure formed by the atomic lattice.
Ph.D. Committee Chair: Vidakovic, Brani; Committee Member: Hayter, Anthony; Committee Member: Heil, Chris; Committee Member: Huo, Xiaoming; Committee Member: Wang, Yan
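Wavelet shrinkage, as mentioned in the abstract above, transforms a signal into wavelet coefficients, shrinks the small detail coefficients toward zero, and transforms back. The sketch below illustrates that pipeline with a one-level Haar transform and soft thresholding. It assumes an even-length input and a threshold `t` supplied by the caller; the thesis selects coefficients through local FDR and Bayes factors, which this sketch does not implement. All function names are illustrative.

```python
import math


def haar_forward(x):
    """One-level orthonormal Haar transform: returns (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Invert haar_forward, interleaving the reconstructed pairs."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x.extend([(a + d) / s, (a - d) / s])
    return x


def soft_threshold(c, t):
    """Shrink coefficient c toward zero by t; coefficients below t become zero."""
    return math.copysign(max(abs(c) - t, 0.0), c)


def denoise(x, t):
    """Shrink the detail coefficients only, then reconstruct the signal."""
    approx, detail = haar_forward(x)
    return haar_inverse(approx, [soft_threshold(d, t) for d in detail])


# Hypothetical noisy step signal (even length, as the one-level transform requires).
noisy = [4.0, 4.2, 4.1, 3.9, 8.0, 8.1, 7.9, 8.2]
smooth = denoise(noisy, 0.5)
```

With `t = 0` the signal is reconstructed unchanged, which is a quick check that the transform pair is consistent.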

    Statistical power analysis for single-cell RNA-sequencing

    RNA-sequencing (RNA-seq) is an established method to quantify levels of gene expression genome-wide. The recent development of single-cell RNA sequencing (scRNA-seq) protocols opens up the possibility to systematically characterize cell transcriptomes and their underlying developmental and regulatory mechanisms. Since the first publication on single-cell transcriptomics a decade ago, hundreds of scRNA-seq datasets from a variety of sources have been released, profiling gene expression of sorted cells, tumors, whole dissociated organs and even complete organisms. Currently, it is also the main tool to systematically characterize human cells within the Human Cell Atlas Project. Given its wide applicability and increasing popularity, many experimental protocols and computational analysis approaches exist for scRNA-seq. However, the technology remains experimentally and computationally challenging. Firstly, single cells contain only minute mRNA amounts that need to be reliably captured and amplified for accurate quantification by sequencing. Importantly, the Polymerase Chain Reaction (PCR) is commonly used for amplification, which might introduce biases and increase technical variation. Secondly, once the sequencing results are obtained, finding the best computational processing pipeline can be a struggle. A number of comparison studies have already been conducted, especially for bulk RNA-seq, but they usually deal with only one aspect of the workflow. Furthermore, to what extent the conclusions and recommendations of these studies can be transferred to scRNA-seq is unknown. Related to the processing of RNA-sequencing, we investigate the effect of PCR amplification on differential expression analysis. We find that computational removal of duplicates has either a negligible or a negative impact on specificity and sensitivity of differential expression analysis, and we therefore recommend not to remove read duplicates by mapping position. 
In contrast, if duplicates are identified using unique molecular identifiers (UMIs) tagging RNA molecules, both specificity and sensitivity improve. The first integral step of any scRNA-seq experiment is the preparation of sequencing libraries from the cells. We conducted an independent benchmarking study of popular library preparation protocols in terms of detection sensitivity, accuracy and precision using the same mouse embryonic stem cells and exogenous mRNA spike-ins. We recapitulate our previous finding that technical variance is markedly decreased when using UMIs to remove duplicates. In order to assign a monetary value to the detected amounts of technical variance, we developed a simulation framework that enabled us to compare the power to detect differentially expressed (DE) genes across the scRNA-seq library preparation protocols. Our experiences during this comparison study led to the development of the sequencing data processing in zUMIs and the simulation framework and power analysis in powsimR. zUMIs is a pipeline for processing scRNA-seq data with flexible choices regarding UMI and cell barcode design. In addition, we showed with powsimR simulations that the inclusion of intronic reads for gene expression quantification increases the power to detect DE genes and added it as a unique feature to zUMIs. In powsimR, we present our simulation framework extending choices concerning data analysis, enabling researchers to assess experimental design and analysis plans of RNA-seq in terms of statistical power. Lastly, we conducted a systematic evaluation of scRNA-seq experimental and analytical pipelines. We found that choices made concerning normalisation and library preparation protocols have the biggest impact on the validity of scRNA-seq DE analysis. Choosing a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the cell sample size. 
Taken together, we have established and applied a simulation framework that allowed us to benchmark experimental and computational scRNA-seq protocols and hence inform the experimental design and method choices of this important technology.
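The UMI-based duplicate removal discussed above can be pictured as collapsing reads that share the same cell, gene, and UMI into a single molecule. The sketch below shows that counting step under simplifying assumptions: reads are already assigned to a cell barcode and a gene, and UMI sequencing errors are ignored. The tuple layout and function names are hypothetical, and this is not the zUMIs implementation.

```python
from collections import defaultdict


def count_reads(reads):
    """Raw read counts per (cell, gene), with PCR duplicates included."""
    counts = defaultdict(int)
    for cell, gene, _umi in reads:
        counts[(cell, gene)] += 1
    return dict(counts)


def count_molecules(reads):
    """Molecule counts per (cell, gene): reads sharing a UMI collapse to one."""
    umis = defaultdict(set)
    for cell, gene, umi in reads:
        umis[(cell, gene)].add(umi)
    return {key: len(seen) for key, seen in umis.items()}


# Hypothetical reads as (cell barcode, gene, UMI) tuples.
reads = [
    ("cellA", "Actb", "AACG"),
    ("cellA", "Actb", "AACG"),  # same UMI: treated as a PCR duplicate
    ("cellA", "Actb", "AACG"),
    ("cellA", "Actb", "TTGC"),
    ("cellA", "Gapdh", "AACG"),
    ("cellB", "Actb", "AACG"),
]
read_counts = count_reads(reads)
molecule_counts = count_molecules(reads)
```

In this toy input, the four reads of Actb in cellA reduce to two molecules, while the same UMI seen in a different cell or gene is counted separately.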

    Lossless compression algorithms for microarray images and whole genome alignments

    Doctorate in Informatics.
Nowadays, in the 21st century, the never-ending expansion of information is a major global concern. The pace at which storage and communication resources are evolving is not fast enough to compensate for this trend. To overcome this issue, sophisticated and efficient compression tools are required. The goal of compression is to represent information with as few bits as possible. There are two kinds of compression, lossy and lossless. In lossless compression, information loss is not tolerated, so the decoded information is exactly the same as the encoded one. In lossy compression, on the other hand, some loss is acceptable. In this work we focused on lossless methods. The goal of this thesis was to create lossless compression tools for two types of data. The first type is known in the literature as microarray images. These images have 16 bits per pixel and a high spatial resolution. The other data type is commonly called Whole Genome Alignments (WGA), here applied in particular to MAF files. Regarding the microarray images, we improved existing microarray-specific methods by using pre-processing techniques (segmentation and bitplane reduction). Moreover, we developed a compression method based on pixel value estimates and a mixture of finite-context models. Furthermore, an approach based on binary-tree decomposition was also considered. Two compression tools were developed to compress MAF files. The first is based on a mixture of finite-context models and arithmetic coding, where only the DNA bases and alignment gaps were considered. The second tool, designated MAFCO, is a complete compression tool that can handle all the information found in MAF files. MAFCO relies on several finite-context models and allows parallel compression/decompression of MAF files.
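A finite-context model of order k predicts each symbol from the k symbols before it, and an arithmetic coder driven by those predictions spends roughly -log2(p) bits per symbol. The sketch below computes that ideal code length for a single adaptive model, assuming a four-letter DNA alphabet and additive smoothing with parameter `alpha`. It is only an illustration of the principle: it does not mix several models or perform actual arithmetic coding as the tools described above do, and the function and parameter names are hypothetical.

```python
import math
from collections import defaultdict


def code_length_bits(seq, order, alphabet="ACGT", alpha=1.0):
    """Ideal adaptive code length (bits) of seq under one order-k finite-context model."""
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, symbol in enumerate(seq):
        # The context is the (up to) `order` symbols preceding position i.
        context = seq[max(0, i - order):i]
        seen = counts[context]
        n = sum(seen.values())
        # Smoothed probability estimate from the counts gathered so far.
        p = (seen[symbol] + alpha) / (n + alpha * len(alphabet))
        total_bits += -math.log2(p)
        # Update after coding, so a decoder can mirror the same model state.
        seen[symbol] += 1
    return total_bits


# Hypothetical highly repetitive sequence: longer contexts predict it much better.
repetitive = "ACGT" * 50
bits_order0 = code_length_bits(repetitive, 0)
bits_order2 = code_length_bits(repetitive, 2)
```

Because the model is updated only after each symbol is coded, a decoder can rebuild the same probabilities step by step, which is what keeps the scheme lossless.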

    Bayesian learning in bioinformatics

    Life sciences research is advancing in breadth and scope, affecting many areas of life including medical care and government policy. The field of Bioinformatics, in particular, is growing very rapidly with the help of computer science, statistics, applied mathematics, and engineering. New high-throughput technologies are making it possible to measure genomic variation across phenotypes in organisms at costs that were once inconceivable. In conjunction, and partly as a consequence, massive amounts of information about the genomes of many organisms are becoming accessible in the public domain. One of the important and exciting questions in the post-genomics era is how to integrate all of the information available from diverse sources. Learning in complex systems biology requires that information be shared in a natural and interpretable way, to integrate knowledge and data. The statistical sciences can support the advancement of learning in Bioinformatics in many ways, not the least of which is by developing methodologies that can support the synchronization of efforts across sciences, offering real-time learning tools that can be shared across many fields from basic science to clinical applications. This research introduces several current research problems in Bioinformatics that involve the integration of information, and discusses statistical methodologies from the Bayesian school of thought that may be applied. Bayesian statistical methodologies are proposed to integrate biological knowledge and improve statistical inference for three relevant Bioinformatics applications: gene expression arrays, BAC and aCGH arrays, and real-time gene expression experiments. A unified Bayesian model is proposed to perform detection of genes and gene classes, defined from historical pathways, with gene expression arrays. 
A novel Bayesian statistical method is proposed to infer chromosomal copy number aberrations in clinical populations with BAC or aCGH experiments. A theoretical model is proposed, motivated by historical work in mathematical biology, for inference with real-time gene expression experiments, and fit with Bayesian methods. Simulation and case studies show that Bayesian methodologies show great promise to improve the way we learn with high-throughput Bioinformatics experiments.

    Optimising gene expression profiling using RNA-seq


    A bioinformatics framework for management and analysis of high throughput CGH microarray projects

    High throughput experimental techniques have revolutionised biological research; these techniques enable researchers to survey, in an unbiased fashion, entire biological systems, such as all the somatic mutations in a tumour, in a single experiment. Due to the often complex informatics demands of these techniques, robust computational solutions are required to ensure high quality reproducible results are generated. The challenge of this thesis was to develop such a computational solution for the management and analysis of high throughput microarray Comparative Genomic Hybridisation (aCGH) projects. This task also provided an opportunity to test the hypothesis that agile software development approaches are well suited for bioinformatics projects and that formalised development practices produce better quality software. This is an important question as formalised software development practices have been underused so far in the field of bioinformatics. This thesis describes the development and application of a bioinformatics framework for the management and analysis of microarray CGH projects. The framework includes: a Laboratory Information Management System (LIMS) that manages and records all aspects of microarray CGH experimentation; a set of easy to use visualisation tools for aCGH experimental data; and a suite of object oriented Perl modules providing a flexible way to construct data pipelines quickly using the statistical programming language R for quality control, normalisation and analysis. In order to test the framework, it was successfully applied in the aCGH profiling of 94 ovarian tumour samples. Subsequent analysis of these data identified 4 well supported genomic regions which appear to influence patient survival. 
The evaluation of agile practices implemented in this thesis has demonstrated that they are well suited to the development of bioinformatics solutions as they enable developers to react to the changes of this rapidly evolving field, to create successful software solutions such as the bioinformatics framework presented here.

    Fully Bayesian T-probit Regression with Heavy-tailed Priors for Selection in High-Dimensional Features with Grouping Structure

    Feature selection is needed in many modern scientific research problems that use high-dimensional data. A typical example is to find the genes that are most related to a certain disease (e.g., cancer) from high-dimensional gene expression profiles. There are tremendous difficulties in eliminating a large number of useless or redundant features. The expression levels of genes have structure; for example, a group of co-regulated genes that have similar biological functions tend to have similar mRNA expression levels. Many statistical methods have been proposed to take the grouping structure into consideration in feature selection and regression, including Group LASSO, Supervised Group LASSO, and regression on group representatives. In this thesis, we propose to use a sophisticated Markov chain Monte Carlo method (Hamiltonian Monte Carlo with restricted Gibbs sampling) to fit T-probit regression with heavy-tailed priors in order to select features with grouping structure. We will refer to this method as fully Bayesian T-probit. The main feature of fully Bayesian T-probit is that it can select features within groups automatically, without a pre-specification of the grouping structure, and discard noise features more efficiently than LASSO (Least Absolute Shrinkage and Selection Operator). Therefore, the feature subsets selected by fully Bayesian T-probit are significantly sparser than subsets selected by many other methods in the literature. Such succinct feature subsets are much easier to interpret or understand based on existing biological knowledge and further experimental investigations. In this thesis, we use simulated and real datasets to demonstrate that the predictive performances of the sparser feature subsets selected by fully Bayesian T-probit are comparable with those of the much larger feature subsets selected by plain LASSO, Group LASSO, Supervised Group LASSO, random forest, penalized logistic regression and t-test. 
In addition, we demonstrate that the succinct feature subsets selected by fully Bayesian T-probit have significantly better predictive power than the feature subsets of the same size taken from the top features selected by the aforementioned methods.

    Detection and identification of elliptical structure arrangements in images: theory and algorithms

    This thesis deals with different aspects of the detection, fitting, and identification of elliptical features in digital images. We place geometric feature detection in the a contrario statistical framework in order to obtain a combined, parameter-free detector of line segments and circular/elliptical arcs that controls the number of false detections. To improve the accuracy of the detected features, especially in cases of occluded circles/ellipses, a simple closed-form technique for conic fitting is introduced, which efficiently combines the algebraic distance with the gradient orientation. Identifying a configuration of coplanar circles in images through a discriminant signature usually requires the Euclidean reconstruction of the plane containing the circles. We propose an efficient signature computation method that bypasses the Euclidean reconstruction; it relies exclusively on invariant properties of the projective plane and is therefore itself invariant under perspective.
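
    As a rough illustration of what closed-form conic fitting by algebraic distance can look like, the sketch below fits a conic a x^2 + b xy + c y^2 + d x + e y + f = 0 to points by linear least squares. It is not the thesis's method: the gradient-orientation term described in the abstract is omitted, the normalization a + c = 1 is an assumption chosen for simplicity, and the function names (solve_linear, fit_conic, conic_center) are hypothetical.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick the largest pivot in this column for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_conic(points):
    """Least-squares conic fit minimizing the algebraic distance.

    Assumed normalization: a + c = 1, so c = 1 - a and the problem is
    linear in the five unknowns (a, b, d, e, f).
    """
    rows = [[x * x - y * y, x * y, x, y, 1.0] for x, y in points]
    rhs = [-y * y for _, y in points]
    # Normal equations (A^T A) u = A^T rhs.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    Atb = [sum(r[i] * t for r, t in zip(rows, rhs)) for i in range(5)]
    a, b, d, e, f = solve_linear(AtA, Atb)
    return a, b, 1.0 - a, d, e, f


def conic_center(a, b, c, d, e, f):
    """Center of a central conic: the point where the gradient vanishes."""
    det = 4.0 * a * c - b * b
    return (b * e - 2.0 * c * d) / det, (b * d - 2.0 * a * e) / det


# Half of an ellipse centered at (1, -1) with semi-axes 3 and 2,
# mimicking a partially occluded ellipse (noise-free, for illustration).
arc = [(1.0 + 3.0 * math.cos(k * math.pi / 11),
        -1.0 + 2.0 * math.sin(k * math.pi / 11)) for k in range(12)]
coeffs = fit_conic(arc)
center = conic_center(*coeffs)
```

    With noise-free points the fit recovers the generating ellipse even from a partial arc; with noisy or heavily occluded data a purely algebraic fit is known to be biased, which is the kind of situation where additional cues such as gradient orientation are typically brought in.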

    Analytical Techniques for the Improvement of Mass Spectrometry Protein Profiling

    Bioinformatics is rapidly advancing through the "post-genomic" era following the sequencing of the human genome. In preparation for studying the inner workings of genes, proteins, and even smaller biological elements, several subdivisions of bioinformatics have developed. Proteomics, the subdivision concerned with the structure and function of proteins, has been aided by mass spectrometry as a data source. Biofluid or tissue samples are rapidly assayed for their protein composition. The resulting mass spectra are analyzed using machine learning techniques to discover reliable patterns that discriminate samples from two populations, for example, healthy versus diseased, or treatment responders versus non-responders. However, this data source is imperfect and faces several challenges: unwanted variability arising from the data collection process, obtaining a robust discriminative model that generalizes well to future data, and validating a predictive pattern statistically and biologically. This thesis presents several techniques that attempt to deal intelligently with the problems facing each stage of the analytical process. First, an automatic preprocessing method selection system is demonstrated. This system learns from data and selects the combination of preprocessing methods most appropriate for the task at hand, which reduces the noise affecting potential predictive patterns. Our results suggest that this method can help adapt to data from different technologies, improving downstream predictive performance. Next, the issues of feature selection and predictive modeling are revisited with respect to the unique challenges posed by proteomic profile data. Approaches to model selection through kernel learning are also investigated. Key insights are obtained for designing the feature selection and predictive modeling portion of the analytical framework. Finally, methods for interpreting the results of predictive modeling are demonstrated. These methods are used to assure the user of various desirable properties: validation of the strength of a predictive model, validation of reproducible signal across multiple data generation sessions, and generalizability of predictive models to future data. A method for labeling profile features with biological identities is also presented, which aids in the interpretation of the data. Overall, these novel techniques give the protein profiling community additional support and leverage to aid the predictive capability of the technology.