219 research outputs found

    Medoidshift clustering applied to genomic bulk tumor data.

    Despite the enormous medical impact of cancers and intensive study of their biology, detailed characterization of tumor growth and development remains elusive. This difficulty occurs in large part because of enormous heterogeneity in the molecular mechanisms of cancer progression, both tumor-to-tumor and cell-to-cell within single tumors. Advances in genomic technologies, especially at the single-cell level, are improving the situation, but these approaches are held back by limitations of the biotechnologies for gathering genomic data from heterogeneous cell populations and of the computational methods for making sense of those data. One popular way to gain the advantages of whole-genome methods without the cost of single-cell genomics has been the use of computational deconvolution (unmixing) methods to reconstruct clonal heterogeneity from bulk genomic data. These methods, too, are limited by the difficulty of inferring genomic profiles of rare or subtly varying clonal subpopulations from bulk data, a problem that can be computationally reduced to that of reconstructing the geometry of point clouds of tumor samples in a genome space. Here, we present a new method to improve that reconstruction by better identifying subspaces corresponding to tumors produced from mixtures of distinct combinations of clonal subpopulations. We develop a nonparametric clustering method based on medoidshift clustering for identifying subgroups of tumors expected to correspond to distinct trajectories of evolutionary progression. We show on synthetic and real tumor copy-number data that this new method substantially improves our ability to resolve discrete tumor subgroups, a key step in the process of accurately deconvolving tumor genomic data and inferring clonal heterogeneity from bulk data.
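As a rough illustration of the clustering idea named in the abstract above, the following is a minimal sketch of a basic medoidshift step in pure Python. It is not the authors' modified method: the Gaussian kernel, the `bandwidth` parameter, and the function name `medoidshift` are assumptions chosen for illustration. Each sample points to the data point that minimizes a kernel-weighted sum of squared distances, and following those pointers to a fixed point yields the cluster modes.

```python
import math


def medoidshift(points, bandwidth):
    """Assign each point to a mode (an index into `points`) with a basic medoidshift.

    Hypothetical helper for illustration only; assumes a Gaussian kernel.
    """
    n = len(points)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pairwise squared distances and kernel weights.
    dist = [[sq_dist(p, q) for q in points] for p in points]
    weight = [[math.exp(-dist[i][j] / bandwidth ** 2) for j in range(n)]
              for i in range(n)]

    # For each point k, the next medoid is the data point j that minimizes
    # sum_i dist(x_i, x_j) * weight(x_i, x_k).
    next_medoid = []
    for k in range(n):
        costs = [sum(dist[i][j] * weight[i][k] for i in range(n))
                 for j in range(n)]
        next_medoid.append(min(range(n), key=costs.__getitem__))

    # Follow the pointers until a fixed point is reached; the guard set
    # stops the walk if ties ever produce a cycle.
    labels = []
    for k in range(n):
        seen = set()
        while next_medoid[k] != k and k not in seen:
            seen.add(k)
            k = next_medoid[k]
        labels.append(k)
    return labels


samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels = medoidshift(samples, bandwidth=1.0)
```

Because the modes are always actual samples, only pairwise distances are needed, which is one reason a medoid-based variant can be convenient for point clouds of tumor profiles. The number of clusters is not fixed in advance; it depends on the bandwidth.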

    Machine learning techniques for decoding and utilizing RNA sequencing data

    Doctoral thesis (Ph.D.), Seoul National University Graduate School, College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2019. Advisor: Sun Kim.
    Korean abstract (translated): In eukaryotic cell systems, an mRNA molecule passes through several post-transcriptional regulation steps between its transcription and the point at which it is fully processed and translated into protein. These steps include RNA editing, alternative splicing, and alternative polyadenylation. Viewed at any single moment, the transcriptome is therefore a mixture of diverse intermediates, and this complex regulatory system makes it hard to understand the transcriptome as a whole. This thesis studies machine learning techniques for decoding and utilizing RNA sequencing data and consists of three studies carried out from the perspectives of RNA editing, alternative splicing, and gene expression. RNA editing is a post-transcriptional RNA sequence regulation mechanism catalyzed by two enzymes, ADAR (A=>I) and APOBEC (C=>U). It is an important intracellular regulatory system known to control a variety of cellular mechanisms, including protein activity, alternative splicing, and miRNA target regulation. Detecting RNA editing events with RNA sequencing is very important for understanding their biological function, but the detection process produces a considerable number of false positives. Because the tens of thousands or more RNA editing residues that arise per sample cannot all be validated experimentally, a computational model is required to filter them. RDDpred is a machine-learning-based model that identifies the false-positive residues produced when detecting RNA editing events from RNA sequencing data. RDDpred was validated using data from two previously published RNA editing studies.
    Another complex problem to which RNA sequencing can be applied is measuring intratumor heterogeneity (ITH) at the spliceome level. ITH is an indicator of the diversity of the cell populations that make up cancer tissue, and recently published studies suggest that transcriptome-level ITH measured from gene expression data is useful for predicting the prognosis of cancer patients. Together with gene expression, the spliceome is one of the main components of the transcriptome, so measuring ITH at the spliceome level is a natural step toward studying transcriptomic ITH more comprehensively. Measuring ITH at the cancer spliceome level from RNA sequencing data faces serious technical obstacles, such as complex splicing patterns, widespread intron-retention variants, and short sequencing read lengths. SpliceHetero is a tool that takes these problems into account to measure spliceome-level ITH (that is, sITH), and it uses information theory internally. SpliceHetero was extensively validated using simulated data, xenograft tumor data, and TCGA pan-cancer data, and was confirmed to reflect ITH well. In addition, sITH was found to correlate strongly with cancer progression, patient prognosis, and well-known molecular subtypes such as PAM50.
    The last research topic is the development of a machine learning algorithm that defines patient subspaces specific to particular cancer phenotypes based on gene expression data. RNA sequencing data are a useful tool for obtaining the gene expression profile of a cancer patient, but because they are very high-dimensional, with more than 20,000 dimensions, their dimensionality must be reduced before they can be put to practical use. Here one can exploit the fact that genes interact with one another in complex but characteristic ways. A collection of experimentally verified protein-protein interactions assembled into network form is called a protein interaction network (PIN). Using the PIN, the dimensionality of RNA sequencing data can be reduced while biologically meaningful features are extracted from the data. Tumor2Vec uses the PIN-level features extracted in this way to define patient subspaces specific to particular cancer phenotypes. Tumor2Vec was applied to a pilot study on predicting lymph node metastasis in early oral cancer. It produced a lymph node metastasis prediction model by reducing the dimensionality of the RNA sequencing data, and it also succeeded in preserving the PIN-level features that explain the cancer phenotype well.
    English abstract: In eukaryotic cells, mRNA molecules pass through several post-transcriptional modification steps, such as RNA editing and alternative splicing, before they fully mature and are translated into proteins. Thus, the transcriptome is a complex mixture of various intermediates that are processed in multiple steps. This complex regulatory structure makes it difficult to fully understand the transcriptome landscape. My doctoral study consists of three studies that enable RNA-seq to be decoded and utilized in terms of RNA editing, alternative splicing, and gene expression. RNA editing is a post-transcriptional RNA sequence modification performed by two catalytic enzymes, ADAR (A-to-I) and APOBEC (C-to-U). RNA editing is considered an important regulatory system that controls a variety of cellular functions such as protein activation, alternative splicing, and miRNA targeting. Therefore, detecting RNA editing events in RNA-seq data is important for understanding its biological functions.
However, detecting RNA editing in RNA-seq data is known to produce a significant number of false positives. Since it is not possible to experimentally validate all RNA editing residues extracted from RNA-seq, a computational model is needed to filter potential false-positive RNA editing calls. RDDpred, an RNA editing predictor based on machine learning techniques, was developed to filter out false-positive RNA editing calls in RNA-seq. It uses prior knowledge bases to collect training instances directly from the input data, and then trains random forest (RF) predictors that are specific to the input data. RDDpred was tested using two publicly available datasets of RNA editing studies and showed good performance. Another complex problem in RNA-seq decoding is spliceomic intratumor heterogeneity (i.e., sITH). Intratumor heterogeneity (ITH) represents the diversity of cell populations that make up the cancer tissue. Recent studies have identified ITH at the transcriptome level and suggested that ITH at the gene expression level is useful for predicting prognosis. Measuring ITH at the spliceome level is a natural extension. Measuring sITH from bulk tumor RNA-seq poses serious technical challenges, such as complex splicing patterns, widespread intron retention, and short sequencing read lengths. SpliceHetero, an information-theoretic method for measuring the sITH of a tumor, was developed to address these technical problems. SpliceHetero was extensively tested in experiments using synthetic data, xenograft tumor data, and TCGA pan-cancer data, and it measured sITH successfully. In addition, sITH was shown to be closely related to cancer progression and clonal heterogeneity, along with clinically significant features such as cancer stage, survival outcome, and PAM50 subtype. The last research topic is to develop a machine learning algorithm that defines patient subspaces specific to particular cancer phenotypes based on gene expression data.
Because RNA-seq data are high-dimensional, generally comprising 20,000 or more genes, applying a machine learning algorithm to them is not easy. A network that collects information on experimentally verified protein interactions is called a Protein Interaction Network (PIN). Tumor2Vec defines the patient subspace by applying a graph embedding technique to the PIN to define subnetwork communities that interact with each other. Tumor2Vec proposed a clinical model by defining a subspace for patients with different lymph node metastases in early oral cancer, and in the process it found biologically significant features at the level of PIN subnetworks.
    Table of contents:
    Chapter 1 Introduction
        1.1 Biological background
        1.2 Challenges in decoding and utilizing RNA-seq data
            1.2.1 False-positives in RNA editing calls
            1.2.2 Absence of a model for measuring spliceomic intratumor heterogeneity considering complex cancer spliceome
            1.2.3 Lack of biological interpretation of dimension reduction techniques using gene expression
        1.3 Machine learning techniques to solve difficulties in using RNA-seq
        1.4 Outline of thesis
    Chapter 2 RDDpred: A condition specific machine learning model for filtering false-positive RNA editing calls in RNAseq data
        2.1 Related works
        2.2 Motivation
        2.3 A preliminary study
        2.4 Methods
        2.5 Results
            2.5.1 Design of experiments for evaluation
            2.5.2 Evaluation using data from Bahn et al.
            2.5.3 Evaluation using data from Peng et al.
        2.6 Discussion
        2.7 Conclusion
    Chapter 3 SpliceHetero: An information-theoretic approach for measuring spliceomic intratumor heterogeneity from bulk tumor RNA-seq data
        3.1 Related works
        3.2 Motivation
        3.3 A preliminary study
        3.4 Methods
        3.5 Results & Discussion
            3.5.1 Synthetic data
            3.5.2 Xenograft tumor data
            3.5.3 TCGA pan-cancer data
        3.6 Conclusion
    Chapter 4 Tumor2Vec: A supervised learning algorithm for extracting subnetwork representations of cancer RNAseq data using protein interaction networks
        4.1 Related works
        4.2 Motivation
        4.3 Methods
        4.4 Results & Discussion
            4.4.1 Lymph node metastasis in early oral cancer
        4.5 Conclusion
    Chapter 5 Conclusion
    Abstract (in Korean)

    A survey on data integration for multi-omics sample clustering

    Due to the current high availability of omics data, data-driven biology has greatly expanded, and several papers have reviewed state-of-the-art technologies. Nowadays, two main types of investigation are available for a multi-omics dataset: extraction of relevant features for a meaningful biological interpretation, and clustering of the samples. In the latter case, a few reviews refer to some outdated or no longer available methods, whereas others lack a description of relevant clustering metrics for comparing the main approaches. This work provides a general overview of the major techniques in this area, divided into four groups: graph, dimensionality reduction, statistical, and neural-based. In addition, eight tools have been tested on both a synthetic and a real biological dataset. An extensive performance comparison has been provided using four clustering evaluation scores: Peak Signal-to-Noise Ratio (PSNR), Davies-Bouldin (DB) index, Silhouette value, and the harmonic mean of cluster purity and efficiency. The best results were obtained using dimensionality reduction, either explicitly or implicitly, as in the neural architecture.
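To make one of the evaluation scores named above concrete, here is a minimal pure-Python sketch of the Davies-Bouldin index using its standard definition: for each cluster, take the worst-case ratio of summed within-cluster scatter to between-centroid distance, then average over clusters. The function name and the toy data are illustrative assumptions and are not taken from the survey. Lower values indicate more compact, better separated clusters.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for a labelled point set (lower is better).

    Assumes at least two clusters with distinct centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid of each cluster, coordinate by coordinate.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter: mean Euclidean distance of the members to their centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    keys = list(clusters)
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys if j != i
        )
    return total / len(keys)


toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good_score = davies_bouldin(toy_points, [0, 0, 1, 1])  # left pair vs right pair
poor_score = davies_bouldin(toy_points, [0, 1, 0, 1])  # clusters straddle the gap
```

On the toy data, the labelling that groups the two left-hand points and the two right-hand points scores far lower than the labelling that mixes them, which is the behaviour a comparison of clustering tools relies on.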

    Sparse group sufficient dimension reduction and covariance cumulative slicing estimation

    This dissertation contains two main parts. In Part One, for regression problems with grouped covariates, we adapt the idea of the sparse group lasso (Friedman et al., 2010) to the framework of sufficient dimension reduction. We propose a method called sparse group sufficient dimension reduction (sgSDR) to conduct group and within-group variable selection simultaneously without assuming a specific model structure on the regression function. Simulation studies show that our method is comparable to the sparse group lasso under the regular linear model setting, and outperforms the sparse group lasso with higher true positive rates and substantially lower false positive rates when the regression function is nonlinear and/or the error distributions are non-Gaussian. One immediate application of our method is gene pathway data analysis, where genes naturally fall into groups (pathways). An analysis of a glioblastoma microarray dataset is included to illustrate our method. In Part Two, for many-valued or continuous Y, the standard practice of replacing the response Y by a discrete version of Y usually results in a loss of power because intra-slice information is ignored. Most existing slicing methods rely heavily on the selection of the number of slices h. Zhu et al. (2010) proposed a method called cumulative slicing estimation (CUME), which avoids the otherwise subjective selection of h. In this dissertation, we revisit CUME from a different perspective to gain more insights, and then refine its performance by incorporating the intra-slice covariances. The resulting new method, which we call covariance cumulative slicing estimation (COCUM), is comparable to CUME when the predictors are normally distributed, and outperforms CUME when the predictors are non-Gaussian, especially in the presence of outliers. The asymptotic results of COCUM are also proved. --Abstract, page iv
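For readers unfamiliar with the penalty that Part One builds on, the sparse group lasso criterion for a linear model is usually written as below. This is the commonly cited form, up to scaling conventions, and not the sgSDR objective itself, which the abstract does not state. The symbols are assumptions for illustration: the covariates are split into L groups with design blocks X^(l) and coefficient blocks beta^(l), and lambda_1, lambda_2 are tuning parameters.

```latex
\min_{\beta}\;
  \frac{1}{2}\Bigl\| y - \sum_{l=1}^{L} X^{(l)} \beta^{(l)} \Bigr\|_2^2
  \;+\; \lambda_1 \sum_{l=1}^{L} \bigl\| \beta^{(l)} \bigr\|_2
  \;+\; \lambda_2 \, \| \beta \|_1
```

The unsquared group norm can set whole groups to zero, while the elementwise L1 term zeroes individual coefficients inside the surviving groups, which is what allows group and within-group selection at the same time.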

    Integrated Graph Theoretic, Radiomics, and Deep Learning Framework for Personalized Clinical Diagnosis, Prognosis, and Treatment Response Assessment of Body Tumors

    Purpose: A new paradigm is beginning to emerge in radiology with the advent of increased computational capabilities and algorithms. The future of radiological reading rooms is heading towards a unique collaboration between computer scientists and radiologists. The goal of computational radiology is to probe the underlying tissue using advanced algorithms and imaging parameters and produce a personalized diagnosis that can be correlated to pathology. This thesis presents a complete computational radiology framework (I-GRAD) for personalized clinical diagnosis, prognosis and treatment planning using an integration of graph theory, radiomics, and deep learning. Methods: There are three major components of the I-GRAD framework: image segmentation, feature extraction, and clinical decision support. Image Segmentation: I developed the multiparametric deep learning (MPDL) tissue signature model for segmentation of normal and abnormal tissue from multiparametric (mp) radiological images. The segmentation MPDL network was constructed from stacked sparse autoencoders (SSAE) with five hidden layers. The MPDL network parameters were optimized using k-fold cross-validation. In addition, the MPDL segmentation network was tested on an independent dataset. Feature Extraction: I developed the radiomic feature mapping (RFM) and contribution scattergram (CSg) methods for characterization of spatial and inter-parametric relationships in multiparametric imaging datasets. The radiomic feature maps were created by filtering radiological images with first and second order statistical texture filters followed by the development of standardized features for radiological correlation to biology and clinical decision support. The contribution scattergram was constructed to visualize and understand the inter-parametric relationships of the breast MRI as a complex network. This multiparametric imaging complex network was modeled using manifold learning and evaluated using graph theoretic analysis. Feature Integration: The different clinical and radiological features extracted from multiparametric radiological images and clinical records were integrated using a hybrid multiview manifold learning technique termed the Informatics Radiomics Integration System (IRIS). IRIS uses hierarchical clustering in combination with manifold learning to visualize the high-dimensional patient space on a two-dimensional heatmap. The heatmap highlights the similarity and dissimilarity between different patients and variables. Results: All the algorithms and techniques presented in this dissertation were developed and validated using breast cancer as a model for diagnosis and prognosis using multiparametric breast magnetic resonance imaging (MRI). The deep learning MPDL method demonstrated excellent Dice similarity of 0.87 ± 0.05 and 0.84 ± 0.07 for segmentation of lesions in malignant and benign breast patients, respectively. Furthermore, each of the methods, MPDL, RFM, and CSg, demonstrated excellent results for breast cancer diagnosis, with an area under the receiver operating characteristic (ROC) curve (AUC) of 0.85, 0.91, and 0.87, respectively. In addition, IRIS classified patients with low risk of breast cancer recurrence from patients with medium and high risk with an AUC of 0.93 compared to OncotypeDX, a 21-gene assay for breast cancer recurrence. Conclusion: By integrating advanced computer science methods into the radiological setting, the I-GRAD framework presented in this thesis can be used to model radiological imaging data in combination with clinical and histopathological data and produce new tools for personalized diagnosis, prognosis or treatment planning by physicians.
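The segmentation results above are reported as Dice similarity. As a reminder of what that number measures, the following minimal sketch computes the standard Dice coefficient, 2|A ∩ B| / (|A| + |B|), for two binary masks given as flat sequences. The function name and the example masks are illustrative assumptions, not data from the thesis.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity between two equally sized binary masks (flat sequences)."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of elements")

    # Count voxels marked as lesion in both masks, and in each mask separately.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)

    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * overlap / total


predicted_mask = [1, 1, 1, 0, 0, 0]
reference_mask = [1, 1, 0, 0, 0, 0]
score = dice_coefficient(predicted_mask, reference_mask)
```

A value of 1 means the predicted and reference lesion masks coincide exactly, and 0 means they do not overlap at all, so values in the mid-0.8 range indicate substantial but imperfect overlap.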

    Unsupervised Algorithms for Microarray Sample Stratification

    The amount of data made available by microarrays gives researchers the opportunity to delve into the complexity of biological systems. However, the noisy and extremely high-dimensional nature of this kind of data poses significant challenges. Microarrays allow for the parallel measurement of thousands of molecular objects spanning different layers of interactions. To discover hidden patterns, a wide variety of analytical techniques have been proposed. Here, we describe the basic methodologies for analyzing microarray datasets, focusing on the task of (sub)group discovery.
    Peer reviewed

    Integrative Analysis Methods for Biological Problems Using Data Reduction Approaches

    The "big data" revolution of the past decade has allowed researchers to procure or access biological data at an unprecedented scale, in terms of both volume (low-cost high-throughput technologies) and variety (multi-platform genomic profiling). This has fueled the development of new integrative methods, which combine and consolidate multiple sources of data in order to gain generalizability, robustness, and a more comprehensive systems perspective. The key challenges faced by this new class of methods primarily relate to heterogeneity, whether it is across cohorts from independent studies or across the different levels of genomic regulation. While the different perspectives among data sources are invaluable in providing different snapshots of the global system, such diversity also brings forth many analytic difficulties, as each source introduces a distinctive element of noise. In recent years, many styles of data integration have appeared to tackle this problem, ranging from Bayesian frameworks to graphical models, a wide assortment as diverse as the biology they intend to explain. My focus in this work is dimensionality reduction-based methods of integration, which offer the advantages of efficiency in high dimensions (an asset among genomic datasets) and simplicity in allowing for elegant mathematical extensions. In the course of these chapters I will describe the biological motivations, the methodological directions, and the applications of three canonical reductionist approaches for relating information across multiple data groups.
    PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/138564/1/yangzi_1.pd