1,827 research outputs found

    Feature selection with interactions in logistic regression models using multivariate synergies for a GWAS application

    Background: Genotype-phenotype association has been one of the long-standing problems in bioinformatics. Identifying both the marginal and epistatic effects among genetic markers, such as Single Nucleotide Polymorphisms (SNPs), has been extensively integrated in Genome-Wide Association Studies (GWAS) to help derive "causal" genetic risk factors and their interactions, which play critical roles in life and disease systems. Identifying "synergistic" interactions with respect to the outcome of interest can improve phenotypic prediction and help explain the underlying mechanism of system behavior. Many statistical measures for estimating synergistic interactions have been proposed in the literature for this purpose. However, beyond empirical performance, there is still no theoretical analysis of the power and limitations of these synergistic interaction measures. Results: In this paper, it is shown that the existing information-theoretic multivariate synergy depends on a small subset of the interaction parameters in the model, sometimes on only one interaction parameter. In addition, an adjusted version of multivariate synergy is proposed as a new measure to estimate the interactive effects, with experiments conducted over both simulated data sets and a real-world GWAS data set to show its effectiveness. Conclusions: We provide rigorous theoretical analysis and empirical evidence on why the information-theoretic multivariate synergy helps with identifying genetic risk factors via synergistic interactions. We further establish a rigorous sample complexity analysis for detecting interactive effects, confirmed by both simulated and real-world data sets. https://deepblue.lib.umich.edu/bitstream/2027.42/142802/1/12864_2018_Article_4552.pd
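
    As a rough illustration of what an information-theoretic synergy measures, the sketch below uses the commonly cited two-marker form, I(X1,X2;Y) - I(X1;Y) - I(X2;Y), estimated from empirical frequencies. This form and the function names are assumptions made for illustration; the paper's adjusted measure is not reproduced here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )


def bivariate_synergy(x1, x2, y):
    """Two-marker synergy: I(X1,X2;Y) - I(X1;Y) - I(X2;Y)."""
    pair = list(zip(x1, x2))
    return (
        mutual_information(pair, y)
        - mutual_information(x1, y)
        - mutual_information(x2, y)
    )


# Toy data (hypothetical): two binary markers, every genotype combination once.
x1 = [0, 0, 1, 1]
x2 = [0, 1, 0, 1]

# Purely interactive phenotype: XOR of the two markers.
y_xor = [a ^ b for a, b in zip(x1, x2)]
synergy_xor = bivariate_synergy(x1, x2, y_xor)

# Purely marginal phenotype: a copy of the first marker.
synergy_marginal = bivariate_synergy(x1, x2, list(x1))
```

    In the XOR case neither marker is informative alone, so all of the joint information counts as synergy; when the phenotype simply copies one marker, the synergy is zero.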

    Designing and sample size calculation in presence of heterogeneity in biological studies involving high-throughput data.

    The design and determination of sample size are important for conducting high-throughput biological experiments such as proteomics experiments and RNA-Seq expression studies, thus leading to a better understanding of the complex mechanisms underlying various biological processes. Variations in the biological data or in the technical approaches to data collection lead to heterogeneity among the samples under study. We critically examined the issues of technical and biological heterogeneity. Quantitative measurements based on liquid chromatography (LC) coupled with mass spectrometry (MS) often suffer from missing values (MVs) and data heterogeneity. We considered a proteomics data set generated from human kidney biopsy material to investigate the technical effects of sample preparation and quantitative MS. We studied the effect of tissue storage methods (TSMs) and tissue extraction methods (TEMs) on data analysis. There are two TSMs: frozen (FR) and formalin-fixed paraffin-embedded (FFPE); and three TEMs: MAX, TX followed by MAX, and SDS followed by MAX. We assessed the impact of different strategies for analyzing the data while considering heterogeneity and MVs. We found that FFPE is better than FR for tissue storage. We also found that the one-step TEM (MAX) is better than the two-step TEMs. Furthermore, we found that imputation is a better approach than excluding the proteins with MVs or using an unbalanced design. We introduce a web application, PWST (Proteomics Workflow Standardization Tool), to standardize the proteomics workflow. The tool will be helpful in deciding the most suitable choice for each step and in studying the variability associated with technical steps as well as the effects of continuous variables. We have used special cases of the general linear model, ANCOVA and ANOVA with fixed effects, to study the effects due to various sources of variability. 
We introduce an interactive tool, "SATP: Statistical Analysis Tool for Proteomics", for analyzing proteomics expression data that is scalable to large clinical proteomic studies. The user can perform differential expression analysis of proteomics data either at the protein or peptide level using multiple approaches. We have developed statistical approaches for calculating sample size for proteomics experiments under allocation and cost constraints. We have developed R programs and a Shiny app, "SSCP: Sample Size Calculator for Proteomics Experiment", for computing sample sizes. We have proposed statistical approaches for calculating sample size for RNA-Seq experiments considering allocation and cost. We have developed R programs and Shiny apps to calculate sample size for conducting RNA-Seq experiments.
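
    To make the idea of sample size under allocation and cost constraints concrete, here is a minimal sketch based on the textbook normal-approximation formula for comparing two group means. It is not the method implemented in SSCP or the RNA-Seq apps; the function names, the equal-variance assumption, and the example numbers are all hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def two_group_sample_size(delta, sigma, alpha=0.05, power=0.8, ratio=1.0):
    """Per-group sizes (n1, n2) for a two-sided, two-sample comparison of means.

    delta: difference in means to detect; sigma: common standard deviation
    (assumed equal in both groups); ratio: allocation ratio n2 / n1.
    """
    z = NormalDist().inv_cdf
    z_sum = z(1 - alpha / 2) + z(power)
    n1 = (1 + 1 / ratio) * (sigma * z_sum / delta) ** 2
    return ceil(n1), ceil(ratio * n1)


def cost_optimal_ratio(cost_per_sample_1, cost_per_sample_2):
    """Allocation ratio n2 / n1 that minimizes total cost for fixed power,
    assuming equal variances: the cheaper group receives more samples."""
    return sqrt(cost_per_sample_1 / cost_per_sample_2)


balanced = two_group_sample_size(delta=1.0, sigma=1.0)
unbalanced = two_group_sample_size(delta=1.0, sigma=1.0, ratio=2.0)
ratio_when_group1_costs_4x = cost_optimal_ratio(4.0, 1.0)
```

    With a standardized effect of one standard deviation, the balanced design needs 16 samples per group, while a 1:2 allocation needs 12 and 24.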

    Application and Extension of Weighted Quantile Sum Regression for the Development of a Clinical Risk Prediction Tool

    In clinical settings, the diagnosis of medical conditions is often aided by measurement of various serum biomarkers through the use of laboratory tests. These biomarkers provide information about different aspects of a patient's health and the overall function of different organs. In this dissertation, we develop and validate a weighted composite index that aggregates the information from a variety of health biomarkers covering multiple organ systems. The index can be used for predicting all-cause mortality and could also be used as a holistic measure of overall physiological health status. We refer to it as the Health Status Metric (HSM). Validation analysis shows that the HSM is predictive of long-term mortality risk and exhibits a robust association with concurrent chronic conditions, recent hospital utilization, and self-rated health. We develop the HSM using Weighted Quantile Sum (WQS) regression (Gennings et al., 2013; Carrico, 2013), a novel penalized regression technique that imposes nonnegativity and unit-sum constraints on the coefficients used to weight index components. In this dissertation, we develop a number of extensions to the WQS regression technique and apply them to the construction of the HSM. We introduce a new guided approach for the standardization of index components which accounts for potential nonlinear relationships with the outcome of interest. An extended version of the WQS that accommodates interaction effects among index components is also developed and implemented. In addition, we demonstrate that ensemble learning methods borrowed from the field of machine learning can be used to improve the predictive power of the WQS index. Specifically, we show that the use of techniques such as weighted bagging, the random subspace method and stacked generalization in conjunction with the WQS model can produce an index with substantially enhanced predictive accuracy. Finally, practical applications of the HSM are explored. 
A comparative study is performed to evaluate the feasibility and effectiveness of a number of 'real-time' imputation strategies in potential software applications for computing the HSM. In addition, the efficacy of the HSM as a predictor of hospital readmission is assessed in a cohort of emergency department patients.
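
    The nonnegativity and unit-sum constraints described above can be illustrated with a small sketch of how a weighted quantile sum index is assembled once weights are available. The weights, biomarker values, and function names below are hypothetical; in WQS regression the weights are estimated from data, and that fitting step is not shown.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_scores(values, n_bins=4):
    """Score each value by the quantile bin it falls in (0 .. n_bins - 1)."""
    cuts = quantiles(values, n=n_bins)  # n_bins - 1 cut points
    return [bisect_right(cuts, v) for v in values]


def normalize_weights(raw_weights):
    """Enforce the two constraints: every weight >= 0 and the weights sum to 1."""
    if any(w < 0 for w in raw_weights):
        raise ValueError("weights must be nonnegative")
    total = sum(raw_weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in raw_weights]


def wqs_index(biomarker_columns, weights):
    """Weighted sum of quantile scores, one index value per subject."""
    scored = [quantile_scores(col) for col in biomarker_columns]
    n_subjects = len(biomarker_columns[0])
    return [
        sum(w * col[i] for w, col in zip(weights, scored))
        for i in range(n_subjects)
    ]


# Hypothetical data: two biomarkers measured on eight subjects.
marker_a = [1, 2, 3, 4, 5, 6, 7, 8]
marker_b = [8, 7, 6, 5, 4, 3, 2, 1]
weights = normalize_weights([3, 1])  # becomes [0.75, 0.25]
index = wqs_index([marker_a, marker_b], weights)
```

    Because the weights are nonnegative and sum to one, each weight can be read as the share of the index attributable to its component.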

    Principal Component Analysis in the Era of «Omics» Data

    Multi-omic biomarker discovery and network analyses to elucidate the molecular mechanisms of lung cancer premalignancy

    Lung cancer (LC) is the leading cause of cancer death in the US, claiming over 160,000 lives annually. Although CT screening has been shown to be efficacious in reducing mortality, the limited access to screening programs among high-risk individuals and the high number of false positives contribute to low survival rates and increased healthcare costs. As a result, there is an urgent need for preventative therapeutics and novel interception biomarkers that would enhance current methods for detection of early-stage LC. This thesis addresses this challenge by examining the hypothesis that transcriptomic changes preceding the onset of LC can be identified by studying bronchial premalignant lesions (PMLs) and the normal-appearing airway epithelial cells altered in their presence (i.e., the PML-associated airway field of injury). PMLs are the presumed precursors of lung squamous cell carcinoma (SCC) whose presence indicates an increased risk of developing SCC and other subtypes of LC. Here, I leverage high-throughput mRNA and miRNA sequencing data from bronchial brushings and lesion biopsies to develop biomarkers of PML presence and progression, and to understand regulatory mechanisms driving early carcinogenesis. First, I utilized mRNA sequencing data from normal-appearing airway brushings to build a biomarker predictive of PML presence. After verifying the power of the 200-gene biomarker to detect the presence of PMLs, I evaluated its capacity to predict PML progression and detect the presence of LC (Aim 1). Next, I identified likely regulatory mechanisms associated with PML severity and progression by evaluating miRNA expression and gene coexpression modules containing their targets in bronchial lesion biopsies (Aim 2). Lastly, I investigated the preservation of the PML-associated miRNAs and gene modules in the airway field of injury, highlighting an emergent link between the airway field and the PMLs (Aim 3). 
Overall, this thesis suggests a multi-faceted utility of PML-associated genomic signatures as markers for stratification of high-risk smokers in chemoprevention trials, markers for early detection of lung cancer, and novel chemopreventive targets, and yields valuable insights into early lung carcinogenesis by characterizing mRNA and miRNA expression alterations that contribute to premalignant disease progression towards LC. 2020-01-2

    Selection of Transcriptomic Biomarkers Using Meta-Analysis Strategies

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Bioinformatics, 2019. 2. Kim Heebal. The Next Generation Sequencing (NGS) decade resulted in explosive advancements in technology and in knowledge in the bioinformatics area of science. The speed of sequencing together with its low cost supported the accumulation of a massive pool of biological data, which led to new findings. Much more complicated study designs along with advanced statistical analyses have been proposed, which are responsible for the rise of bioinformatics to one of the fastest growing fields of interdisciplinary science. Inevitably, determining appropriate statistical models and summary methods depends directly on the experimental designs. As the results of those studies have to be presented to and understood by many specialists in different communities, the summary techniques and presentations are also crucial. Meta-analytical approaches to complex study designs can simplify the statistical models and enable appropriate deduction techniques in candidate filtering. The most credible candidates can be detected via multiple testing correction and other guidelines on error pruning. However, suggesting study-specific candidates, understanding the employed models, and choosing presentation methods have so far been left solely to the analyst's discretion. In this thesis, the meta-analysis includes 1) multi-population data analysis that analyzes the populations separately (split data analysis), 2) the use of different test methods or statistical models on the same dataset, and 3) combining results from an independent study. The major objective is to curate the multiple results into a study-specific biomarker of interest, using meta-analytical approaches. Chapter 2 holds the idea of meta-analysis in the sense that the program itself is made for comparison and summarization of p-values from several test results. 
The study itself is the first step into the meta-analytical strategies in biomarker selection. It is the most primitive chapter of the thesis, but can be used to compare the meta-analytically defined biomarkers in Chapter 3, for example. A basic set of plots is employed to highlight the most concordant results in different statistical models and tests. The incorporated pairwise scatter plot of the first module simply illustrates the correlation of p-values between a pair of tests or models. In the next module, interactive p-value thresholds are shown in the selected scatter plot, and the results are summarized in a Venn diagram. In the final module, a heatmap-like plot shows comprehensive results of all models/tests used in the study and pinpoints which candidates are concordantly significant in those results. The GUI-program proposed in the chapter is applicable to all studies that generate p-values or other statistics, and is demonstrated under several platforms and designs: microarray, GWAS, RNA-Seq, and family-based study. In Chapter 3, the final candidate genes comprise significant DEGs between male and female cattle in two of the employed pipelines. In the RNA-seq protocol, selection of mRNA relies on the poly-A tails of the reads. Unfortunately, some non-coding RNAs, including the lncRNAs, can be transcribed and have poly-A tails. In this case, transcripts from the lncRNAs are not distinguishable from those of the mRNAs. The chapter elucidates that the inclusion of a lncRNA annotation in the upstream RNA-seq process results in a dramatic difference in significant candidate lists and that the conventional pipeline neglects the quantification of ambiguous gene expression, which may result in erroneous interpretation. The effect of lncRNA annotation is also different among tissues, and such tissue-specific patterns have been attested by the concordance of significance in two different DEG analysis pipelines. 
In conclusion, we suggest the genes that were unaffected by the annotation as the most credible, from the original candidates where only the mRNA annotation is used (conventional pipeline). In Chapter 4, a sugar substitute that displays anti-inflammatory and anti-obesity effects is analyzed at the gene level. A normal diet group (ND), a high-fat diet group (HFD), and a high-fat diet with D-allulose intake group (ALL) from two tissues, liver and epididymal fat (eWAT), are used for the study. The chapter describes crosstalk genes, which are inter-tissue co-expressed genes that are defined in this study as having a concordant regulation pattern between liver and eWAT. The two tissues are chosen for their known interaction. The meta-analytical approach here is to summarize the expression profiles in two different tissues, and to identify the genes whose expression is concordantly regulated between tissues. Furthermore, the study-specific candidates are the recovered genes (RecGs), which are initially up- or down-regulated in the high-fat diet group but revert to normal levels after D-allulose intake. These genes, selected from the pool of crosstalk genes, showed a correlation with the two inflammation-related genera: Lactobacillus and Coprococcus. For this study, many of the extraneous factors (e.g., exercise, food intake) are well controlled as it is a mouse study, and such a rebound of gene expression can be thought of as the outcome of D-allulose intake. The study employs 3 statistical models each for liver and eWAT, and a correlation test, to derive the recovered genes through meta-analysis of those models. The final 20 RecGs are concordantly expressed in technical validation by qRT-PCR in both tissues. In displaying the candidates, a modified version of the volcano plot has been proposed: the lava plot, which incorporates p-value, fold-change, and a factor in the statistical model (in this study, the tissue factor has been illustrated). 
The plot highlights the direction of expression regulation, with fold-change, and the significance of the statistical test with color-coded p-values of two tissues for each point (a gene). For Chapter 5, integration of trait-associated genes (TAGs) and differentially expressed genes (DEGs) requires 4 TAG models and 3 DEG models for each tissue. The study-specific biomarker in this chapter is defined as toggles genes, which are body weight-related in all diet groups, and have a specific expression pattern in the high-fat diet (HFD) group. Of the genes that have an HFD-specific expression pattern, those in direct relation or association with body weight are more plausible candidates for obesity. The chapter focuses on the TAGs (based on raw p-value) that are significant DEGs after multiple testing correction. By testing only the significant TAGs in the DEG analysis, I could gain statistical power. Such a hierarchical approach is advantageous only when the p-values are adjusted; raw p-values from the second analysis will be the same even if more genes are used. By reducing the number of tests in the second step of the hierarchical pipeline, statistical power is gained, and reliable candidates can be detected in larger numbers. From Chapters 2 to 5, various meta-analytical techniques have been suggested and illustrated through NGS datasets. By integrating multiple statistical models and multi-class biomarkers, I have simplified scientific ideas that are specific to the datasets, and derived candidate biomarkers by defining a pipeline to integrate the results. Simple variations in the pipeline and plot characteristics helped to fuse ideas that have not been handled before. 
Given the results, I anticipate that researchers conducting -omics analyses, with or without advanced knowledge in statistics or programming, can employ my meta-analytical approaches and plots to efficiently highlight and present their work to a broad spectrum of audiences.
    Contents: Abstract; Contents; List of Tables; List of Figures.
    Chapter 1. Literature Review: 1.1 Next Generation Sequencing (NGS); 1.2 RNA Sequencing or Whole Transcriptome Shotgun Sequencing; 1.3 Biomarker Selection.
    Chapter 2. GRACOMICS: Software for Graphical Comparison of Multiple Results with Omics Data: 2.1 Abstract; 2.2 Introduction; 2.3 Materials and Methods; 2.4 Results and Discussion; 2.5 GRACOMICS Instruction Manual (Downloaded).
    Chapter 3. Multi-Tissue Observation of the Long Non-Coding RNA Effects on Sexually Biased Gene Expression in Cattle: 3.1 Abstract; 3.2 Introduction; 3.3 Materials and Methods; 3.4 Results and Discussion.
    Chapter 4. Discovering/Tracing the Anti-Inflammatory Mechanism/Trigger of D-Allulose: Profile Study of Microbiome Composition and mRNA Expression in Diet-Induced Obese Mice: 4.1 Abstract; 4.2 Introduction; 4.3 Materials and Methods; 4.4 Results and Discussion.
    Chapter 5. Tracing the Inflammatory Effects of High Fat Diet in Obesity Related Traits in Diet-Induced Obese Mice via Trait Associated Gene Detection: 5.1 Abstract; 5.2 Introduction; 5.3 Materials and Methods; 5.4 Results and Discussion.
    Chapter 6. General Discussion.
    References. Korean Summary.
    Docto
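
    The two-step argument in the Chapter 5 summary (filter on raw trait-association p-values, then adjust the differential-expression p-values only for the survivors) can be sketched as follows. The Benjamini-Hochberg procedure is used here as one common adjustment and is an assumption, as are the thresholds, gene names, and p-values; the thesis's actual models are not reproduced. Such a filter is generally considered valid only when the first-stage statistic is independent of the second-stage test under the null.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def one_stage_select(deg_p, fdr=0.05):
    """Adjust the DEG p-values across every gene."""
    genes = list(deg_p)
    adj = bh_adjust([deg_p[g] for g in genes])
    return {g for g, q in zip(genes, adj) if q < fdr}


def two_stage_select(tag_p, deg_p, tag_alpha=0.05, fdr=0.05):
    """Keep genes with a raw TAG p-value below tag_alpha, then adjust only
    those genes' DEG p-values."""
    kept = [g for g in tag_p if tag_p[g] < tag_alpha]
    adj = bh_adjust([deg_p[g] for g in kept])
    return {g for g, q in zip(kept, adj) if q < fdr}


# Hypothetical p-values for ten genes.
genes = [f"g{i}" for i in range(10)]
deg_p = dict(zip(genes, [0.004, 0.012, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]))
tag_p = dict(zip(genes, [0.01, 0.02, 0.03, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))

hits_all_genes = one_stage_select(deg_p)
hits_two_stage = two_stage_select(tag_p, deg_p)
```

    In this toy example the raw DEG p-value of g1 is identical in both analyses, but its adjusted value drops below the threshold only when three genes, rather than ten, enter the correction.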

    Microarray Data Mining and Gene Regulatory Network Analysis

    The novel molecular biological technology, microarray, makes it feasible to obtain quantitative measurements of expression of thousands of genes present in a biological sample simultaneously. Genome-wide expression data generated from this technology are promising to uncover the implicit, previously unknown biological knowledge. In this study, several problems about microarray data mining techniques were investigated, including feature(gene) selection, classifier genes identification, generation of reference genetic interaction network for non-model organisms and gene regulatory network reconstruction using time-series gene expression data. The limitations of most of the existing computational models employed to infer gene regulatory network lie in that they either suffer from low accuracy or computational complexity. To overcome such limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, which enables the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier genes identification to solve the challenges of the curse of dimensionality problem as well as the huge search space. Using the selected classifier genes as seeds, a scale-up technique is applied to search through major databases of genetic interaction networks, metabolic pathways, etc. By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and model validation for GRN reconstructions. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed for elucidating the dynamics caused by perturbations. 
Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity