1,516 research outputs found

    Encounter complexes and dimensionality reduction in protein-protein association

    An outstanding challenge has been to understand the mechanism whereby proteins associate. We report here the results of exhaustively sampling the conformational space in protein-protein association using a physics-based energy function. The agreement between experimental intermolecular paramagnetic relaxation enhancement (PRE) data and the PRE profiles calculated from the docked structures shows that the method captures both specific and non-specific encounter complexes. To explore the energy landscape in the vicinity of the native structure, the nonlinear manifold describing the relative orientation of two solid bodies is projected onto a Euclidean space in which the shape of low-energy regions is studied by principal component analysis. Results show that the energy surface is canyon-like, with a smooth funnel within a two-dimensional subspace capturing over 75% of the total motion. Thus, proteins tend to associate along preferred pathways, similar to the sliding of a protein along DNA in the process of protein-DNA recognition.

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on the data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework.
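
The definition above (an observation that differs significantly from the other values) can be illustrated with a very small unsupervised rule. The following Python sketch is not taken from the paper or its taxonomy; it assumes a one-dimensional numeric data set and uses the common modified z-score based on the median absolute deviation (MAD). The function name `mad_outliers`, the sample readings, and the threshold of 3.5 are illustrative choices only.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    Illustrative rule only: modified z-score = 0.6745 * |x - median| / MAD.
    """
    med = median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = median(abs_dev)
    if mad == 0:
        # No spread around the median, so this rule cannot rank the points.
        return []
    return [i for i, d in enumerate(abs_dev) if 0.6745 * d / mad > threshold]


# Hypothetical sensor readings with one value far from the rest.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2, 9.7]
flagged = mad_outliers(readings)
print(flagged)
```

A median-based rule is used here because the mean and standard deviation are themselves distorted by the outliers being searched for; multivariate or mixed-type data would need one of the other technique classes the abstract refers to.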

    Measures of Analysis of Time Series (MATS): A MATLAB Toolkit for Computation of Multiple Measures on Time Series Data Bases

    In many applications, such as physiology and finance, large time series databases must be analyzed, requiring the computation of linear, nonlinear and other measures. Such measures have been developed and implemented in commercial and freeware software rather selectively and independently. The Measures of Analysis of Time Series (MATS) MATLAB toolkit is designed to handle an arbitrarily large set of scalar time series and compute a large variety of measures on them, allowing for the specification of varying measure parameters as well. The variety of options, with added facilities for visualization of the results, supports different settings of time series analysis, such as the detection of dynamics changes in long data records, resampling (surrogate or bootstrap) tests for independence and linearity with various test statistics, and the discrimination power of different measures and of different combinations of their parameters. The basic features of MATS are presented and the implemented measures are briefly described. The usefulness of MATS is illustrated on some empirical examples along with screenshots. Comment: 25 pages, 9 figures, two tables; the software can be downloaded at http://eeganalysis.web.auth.gr/indexen.ht
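
The resampling test for independence mentioned above can be sketched in a few lines. This is not MATS code and does not use the MATS interface (MATS is a MATLAB toolkit); it is a minimal Python illustration under stated assumptions: the test statistic is the lag-1 autocorrelation, surrogates are random shuffles of the series, and the names `lag1_autocorr` and `shuffle_surrogate_pvalue`, the surrogate count, and the synthetic AR(1) series are all hypothetical.

```python
import random


def lag1_autocorr(x):
    """Lag-1 sample autocorrelation of a sequence of numbers."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


def shuffle_surrogate_pvalue(x, n_surrogates=199, seed=0):
    """Two-sided p-value for the null hypothesis that `x` is independent.

    Shuffling destroys temporal order but keeps the value distribution,
    so each shuffled copy is one realisation of the null hypothesis.
    """
    rng = random.Random(seed)
    observed = abs(lag1_autocorr(x))
    exceed = 0
    for _ in range(n_surrogates):
        surrogate = list(x)
        rng.shuffle(surrogate)
        if abs(lag1_autocorr(surrogate)) >= observed:
            exceed += 1
    # Count the original series as one member of the ensemble.
    return (exceed + 1) / (n_surrogates + 1)


# Synthetic, strongly autocorrelated AR(1) series (assumed example data).
rng = random.Random(1)
series = [0.0]
for _ in range(299):
    series.append(0.9 * series[-1] + rng.gauss(0.0, 1.0))

p_value = shuffle_surrogate_pvalue(series)
print(p_value)
```

A small p-value means the observed statistic is rarely matched by the shuffled surrogates, which argues against independence. A linearity test would need surrogates that preserve the linear correlation structure rather than plain shuffles.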

    Unsupervised learning on social data


    Finding and Visualizing Relevant Subspaces for Clustering High-Dimensional Astronomical Data Using Connected Morphological Operators

    Data sets in many scientific areas are growing to enormous sizes. For example, modern astronomical surveys provide not only image data but also catalogues of millions of objects (stars, galaxies), each object with hundreds of associated parameters. Gene expression experiments produce data about the complete genome of an organism under different conditions and at a sequence of time points. Exploration of such very high-dimensional data spaces poses a huge challenge. Subspace clustering is one among several approaches which have been proposed for this purpose in recent years. However, many clustering algorithms require the user to set a large number of parameters without any guidelines. Some methods also do not provide a concise summary of the datasets, or, if they do, they lack additional important information such as the number of clusters present or the significance of the clusters.

    A bottom-up framework for analysing city-scale energy data using high dimension reduction techniques

    Cities worldwide are becoming more sustainable and are being monitored using data collection techniques at various geographical levels. Given the growing volume of data, there is a need to identify challenges associated with the processing, visualization, and analysis of the data generated at the urban scale. This study proposes a framework to investigate the capabilities of dimensionality reduction techniques (t-SNE and UMAP) applied to city-scale data to identify key features of high consumption and generation areas based on building characteristics. The analysis is performed on measured data from 2,735 postcodes consisting of 72,000 households/buildings from a city in the Netherlands. The evaluation results showed that the mean sigma of the UMAP algorithm quickly approaches a threshold of 0.6 at an n_neighbor value of 50, and the low-dimensional shape does not change with increasing values. By contrast, the mean sigma of t-SNE increases continuously with increasing perplexity, implying that t-SNE is significantly more sensitive to the perplexity parameter. The UMAP algorithm was used to extract information about the regions of high photovoltaic generation and consumption. The proposed framework will assist grid operators and energy planners in extracting information from energy consumption data at the neighbourhood level by utilizing dimensionality reduction techniques.

    Machine learning techniques for decoding and utilizing RNA sequencing data

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, August 2019. Advisor: Sun Kim.

    In eukaryotic cells, there are several post-transcriptional modification steps, such as RNA editing and alternative splicing, before mRNA molecules are fully matured and translated into proteins. Thus, the transcriptome is a complex mixture of various intermediates that are processed in multiple steps. This complex regulatory structure makes it difficult to fully understand the transcriptome landscape. My doctoral study consists of three studies that enable RNA-seq to be decoded and utilized in terms of RNA editing, alternative splicing, and gene expression.

    RNA editing is a post-transcriptional RNA sequence modification performed by two catalytic enzymes, ADAR (A-to-I) and APOBEC (C-to-U). RNA editing is considered an important regulatory system that controls a variety of cellular functions such as protein activation, alternative splicing, and miRNA targeting. Therefore, detecting RNA editing events in RNA-seq data is important for understanding its biological functions. However, it is known that a significant number of false positives occur when detecting RNA editing in RNA-seq. Since it is not possible to experimentally validate all RNA editing residues extracted from RNA-seq, a computational model is needed to filter potential false-positive RNA editing calls. RDDpred, an RNA editing predictor based on machine learning techniques, was developed to filter out false-positive RNA editing calls in RNA-seq. It uses prior knowledge bases to collect training instances directly from the input data, and then trains random forest (RF) predictors that are specific to the input data. RDDpred was tested using two publicly available datasets of RNA editing studies and has shown good performance.

    Another complex problem in RNA-seq decoding is spliceomic intratumor heterogeneity (i.e., sITH). Intratumor heterogeneity (ITH) represents the diversity of cell populations that make up the cancer tissue. Recent studies have identified ITH at the transcriptome level and suggested that ITH at gene expression levels is useful for predicting prognosis. Measuring ITH at the spliceome level is a natural extension. There are serious technical challenges in measuring sITH from bulk tumor RNA-seq, such as complex splicing patterns, widespread intron retention, and short sequencing read lengths. SpliceHetero, an information-theoretic method for measuring the sITH of a tumor, was developed to address these technical problems. SpliceHetero was extensively tested in experiments using synthetic data, xenograft tumor data and TCGA pan-cancer data, and measured sITH successfully. Also, sITH was shown to be closely related to cancer progression and clonal heterogeneity, along with clinically significant features such as cancer stage, survival outcome, and PAM50 subtype.

    The last research topic is to develop a machine learning algorithm that defines patient subspaces specific to particular cancer phenotypes based on gene expression data. Since RNA-seq data is generally high-dimensional, composed of 20,000 or more genes, it is not easy to apply a machine learning algorithm to it. A network that collects information on experimentally verified protein interactions is called a Protein Interaction Network (PIN). Tumor2Vec defines the patient subspace by defining the subnetwork communities that interact with each other, applying a graph embedding technique to the PIN. Tumor2Vec proposed a clinical model by defining a subspace for patients with different lymph node metastases in early oral cancer, and found biologically significant features at the level of PIN subnetworks in the process.

    Contents:
    Chapter 1 Introduction
        1.1 Biological background
        1.2 Challenges in decoding and utilizing RNA-seq data
            1.2.1 False positives in RNA editing calls
            1.2.2 Absence of a model for measuring spliceomic intratumor heterogeneity considering complex cancer spliceome
            1.2.3 Lack of biological interpretation of dimension reduction techniques using gene expression
        1.3 Machine learning techniques to solve difficulties in using RNA-seq
        1.4 Outline of thesis
    Chapter 2 RDDpred: A condition specific machine learning model for filtering false-positive RNA editing calls in RNAseq data
        2.1 Related works
        2.2 Motivation
        2.3 A preliminary study
        2.4 Methods
        2.5 Results
            2.5.1 Design of experiments for evaluation
            2.5.2 Evaluation using data from Bahn et al.
            2.5.3 Evaluation using data from Peng et al.
        2.6 Discussion
        2.7 Conclusion
    Chapter 3 SpliceHetero: An information-theoretic approach for measuring spliceomic intratumor heterogeneity from bulk tumor RNA-seq data
        3.1 Related works
        3.2 Motivation
        3.3 A preliminary study
        3.4 Methods
        3.5 Results & Discussion
            3.5.1 Synthetic data
            3.5.2 Xenograft tumor data
            3.5.3 TCGA pan-cancer data
        3.6 Conclusion
    Chapter 4 Tumor2Vec: A supervised learning algorithm for extracting subnetwork representations of cancer RNAseq data using protein interaction networks
        4.1 Related works
        4.2 Motivation
        4.3 Methods
        4.4 Results & Discussion
            4.4.1 Lymph node metastasis in early oral cancer
        4.5 Conclusion
    Chapter 5 Conclusion
    Abstract (in Korean)