3,574 research outputs found

    Machine learning techniques for decoding information in RNA interactions and DNA sequences

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2020. 2. Kim Sun. Phenotypic differences among organisms are mainly due to differences in genetic information. As a result of changes in genetic information, an organism may evolve into a different species, and patients with the same disease may have different prognoses. This important biological information can be observed in the form of various omics data using high-throughput technologies such as sequencing instruments. However, interpreting such omics data is challenging because the data have very high dimensionality but a relatively small number of samples. Typically, the number of dimensions is higher than the number of samples, which makes the interpretation of omics data one of the most challenging machine learning problems. My doctoral study aims to develop new bioinformatics methods for decoding information in these high-dimensional data by utilizing machine learning algorithms. The first study analyzes the difference in the amount of information between different regions of the DNA sequence. To achieve this goal, a rank-based k-spectrum string kernel, the RKSS kernel, is developed for comparative and evolutionary comparison of sequences from various genomic regions among multiple species. The RKSS kernel extends the existing k-spectrum string kernel by utilizing rank information of k-mers and landmarks, sets of k-mers that represent what the species have in common. The number of k-mers grows rapidly with k, but a landmark consists of very few k-mers, so using a landmark as a reference point for comparison dramatically reduces the number of k-mers needed to calculate sequence similarities. In experiments on three different genomic regions, the RKSS kernel captured more reliable distances between species according to the genetic information content of the target region. The RKSS kernel was also able to order the regions by information content in a way that matches common biological knowledge. 
The second study aims to efficiently decode complex genetic interactions using biological networks and then to classify cancer subtypes by interpreting biological functions. To achieve this goal, a pathway-based deep learning model using a graph convolutional network and a multi-attention based ensemble (GCN+MAE) is developed for cancer subtype classification. To handle the relationships between genes efficiently using pathway information, GCN+MAE is designed as an explainable deep learning structure that combines a graph convolutional network with an attention mechanism. The extracted pathway-level information of cancer subtypes is then transferred back to the gene level by network propagation. In experiments on five cancer data sets, GCN+MAE showed better cancer subtype classification performance than existing models and captured subtype-specific pathways and their biological functions. The third study identifies sub-networks of a biological pathway. The goal is to dissect a biological pathway into multiple sub-networks, each of which forms a single functional unit. To achieve this goal, MIDAS (MIning Differentially Activated Subpaths), a condition-specific sub-module detection method for biological networks, is developed. From the pathway, edge activities are measured using gene expression and network topology. Using these activities, differentially activated subpaths are identified by a statistical approach. In addition, by extending this idea to a graph convolutional network, different sub-networks are highlighted by attention mechanisms. In the experiment with breast cancer data, MIDAS and the deep learning model successfully decomposed gene-level features into sub-modules of single functions. 
In summary, my doctoral study proposes new computational methods to compare genomic DNA sequences as information contents, to model pathway-based cancer subtype classifications and regulations, and to identify condition-specific sub-modules among multiple cancer subtypes.

    Table of contents:
    Chapter 1 Introduction
    1.1 Biological questions with genetic information
    1.1.1 Biological Sequences
    1.1.2 Gene expression
    1.2 Formulating computational problems for the biological questions
    1.2.1 Decoding biological sequences by k-mer vectors
    1.2.2 Interpretation of complex relationships between genes
    1.3 Three computational problems for the biological questions
    1.4 Outline of the thesis
    Chapter 2 Ranked k-spectrum kernel for comparative and evolutionary comparison of DNA sequences
    2.1 Motivation
    2.1.1 String kernel for sequence comparison
    2.1.2 Approach: RKSS kernel
    2.2 Methods
    2.2.1 Mapping biological sequences to k-mer space: the k-spectrum string kernel
    2.2.2 The ranked k-spectrum string kernel with a landmark
    2.2.3 Single landmark-based reconstruction of phylogenetic tree
    2.2.4 Multiple landmark-based distance comparison of exons, introns, CpG islands
    2.2.5 Sequence Data for analysis
    2.3 Results
    2.3.1 Reconstruction of phylogenetic tree on the exons, introns, and CpG islands
    2.3.2 Landmark space captures the characteristics of three genomic regions
    2.3.3 Cross-evaluation of the landmark-based feature space
    Chapter 3 Pathway-based cancer subtype classification and interpretation by attention mechanism and network propagation
    3.1 Motivation
    3.2 Methods
    3.2.1 Encoding biological prior knowledge using Graph Convolutional Network
    3.2.2 Re-producing comprehensive biological process by Multi-Attention based Ensemble
    3.2.3 Linking pathways and transcription factors by network propagation with permutation-based normalization
    3.3 Results
    3.3.1 Pathway database and cancer data set
    3.3.2 Evaluation of individual GCN pathway models
    3.3.3 Performance of ensemble of GCN pathway models with multi-attention
    3.3.4 Identification of TFs as regulator of pathways and GO term analysis of TF target genes
    Chapter 4 Detecting sub-modules in biological networks with gene expression by statistical approach and graph convolutional network
    4.1 Motivation
    4.1.1 Pathway based analysis of transcriptome data
    4.1.2 Challenges and Summary of Approach
    4.2 Methods
    4.2.1 Convert single KEGG pathway to directed graph
    4.2.2 Calculate edge activity for each sample
    4.2.3 Mining differentially activated subpath among classes
    4.2.4 Prioritizing subpaths by the permutation test
    4.2.5 Extension: graph convolutional network and class activation map
    4.3 Results
    4.3.1 Identifying 36 subtype specific subpaths in breast cancer
    4.3.2 Subpath activities have a good discrimination power for cancer subtype classification
    4.3.3 Subpath activities have a good prognostic power for survival outcomes
    4.3.4 Comparison with an existing tool, PATHOME
    4.3.5 Extension: detection of subnetwork on PPI network
    Chapter 5 Conclusions
    Abstract in Korean
    Docto
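The k-spectrum string kernel that the abstract above takes as its starting point can be illustrated with a short sketch. This shows only the plain k-spectrum kernel (the inner product of k-mer count vectors). It does not implement the rank information or landmarks of the RKSS kernel, whose details the abstract does not give. The function names are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def k_spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Only k-mers present in both sequences contribute to the sum.
    return sum(count * cy[kmer] for kmer, count in cx.items())


def normalized_kernel(x, y, k):
    """Cosine-style normalization so that identical sequences score 1."""
    denom = math.sqrt(k_spectrum_kernel(x, x, k) * k_spectrum_kernel(y, y, k))
    return k_spectrum_kernel(x, y, k) / denom if denom else 0.0
```

The number of possible k-mers over the DNA alphabet is 4^k, which is why the abstract stresses reducing the set of k-mers that enter the comparison.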

    Characterisation of chromatin modifiers in endometrial cancer

    Chromatin organization is a critical regulator of gene expression and cell phenotype, and is frequently dysregulated in cancer. Endometrial cancer (EC) is the most common gynecological malignancy, and causes significant morbidity and mortality. EC is notable for recurrent alterations in chromatin modifiers: the ARID1A gene, a key component of the SWI/SNF remodeling complex, has emerged as a prevalent driver in EC, along with other remodelers such as CHD4 and BCOR. However, a systematic analysis of chromatin modifier alterations and their functional consequences in EC has not been done. This thesis presents a comprehensive investigation of genomic alterations in chromatin modifiers using whole genome sequencing (WGS) data from the Genomics England (GEL) 100,000 Genomes Project, the largest EC cohort to date. I demonstrate that while mutational processes vary across molecular subtypes, numerous chromatin modifiers are consistently altered across all subtypes. These genomic alterations frequently occur in different subunits of the same complex, such as alterations in CHD3, CHD4 and MBD3, subunits of the ATP-dependent chromatin remodeling complex NuRD. Additionally, I examine the correlation between driver mutations and patient survival, revealing that mutations in PBRM1 and CHD4 are associated with an increased risk of death after accounting for age, molecular subtype, and tumor mutation and copy number alteration burden. To complement the correlative analysis, I employ CRISPR-Cas9 gene editing to study the functional consequences of perturbations in selected chromatin modifiers (ARID1A, ARID1B, ARID5B, EP300, KMT2C, and SETD1B) in normal and malignant endometrial cells using transcriptomic and chromatin accessibility data. Furthermore, I explore the implications of the N1459S BCOR mutation, a hotspot mutation near-unique to EC. 
Considering the frequent occurrence of ARID1A mutations in malignant tissues and their absence in normal endometrium, I investigate tumor heterogeneity in endometrial cancer. I discuss the limitations of current methodologies and propose a deep learning approach to uncover the hidden evolutionary trajectories. With additional research, this approach could potentially facilitate understanding the sequence in which driver alterations occur. In summary, this work presents a resource for investigating chromatin organization in EC. The functional analyses using gene editing techniques confirm that EC-associated drivers disrupt essential cellular processes involved in oncogenesis. By providing the first systematic correlative and functional analyses of chromatin modifiers in EC, this thesis offers novel insights into EC biology

    Medoidshift clustering applied to genomic bulk tumor data.

    Despite the enormous medical impact of cancers and intensive study of their biology, detailed characterization of tumor growth and development remains elusive. This difficulty occurs in large part because of enormous heterogeneity in the molecular mechanisms of cancer progression, both tumor-to-tumor and cell-to-cell in single tumors. Advances in genomic technologies, especially at the single-cell level, are improving the situation, but these approaches are held back by limitations of the biotechnologies for gathering genomic data from heterogeneous cell populations and the computational methods for making sense of those data. One popular way to gain the advantages of whole-genome methods without the cost of single-cell genomics has been the use of computational deconvolution (unmixing) methods to reconstruct clonal heterogeneity from bulk genomic data. These methods, too, are limited by the difficulty of inferring genomic profiles of rare or subtly varying clonal subpopulations from bulk data, a problem that can be computationally reduced to that of reconstructing the geometry of point clouds of tumor samples in a genome space. Here, we present a new method to improve that reconstruction by better identifying subspaces corresponding to tumors produced from mixtures of distinct combinations of clonal subpopulations. We develop a nonparametric clustering method based on medoidshift clustering for identifying subgroups of tumors expected to correspond to distinct trajectories of evolutionary progression. We show on synthetic and real tumor copy-number data that this new method substantially improves our ability to resolve discrete tumor subgroups, a key step in the process of accurately deconvolving tumor genomic data and inferring clonal heterogeneity from bulk data
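As a rough illustration of the medoidshift idea named in the abstract above, the sketch below runs one simplified pass. Each point is mapped to the data point that minimizes a kernel-weighted sum of squared distances, and the resulting pointers are followed until they reach a fixed point (a mode). The Gaussian weighting, the single pass, and the name `medoidshift_labels` are assumptions made for illustration. This is not the authors' method for tumor copy-number data.

```python
import math


def medoidshift_labels(points, bandwidth):
    """Assign each point the index of the mode it shifts to (simplified, one pass)."""
    n = len(points)
    # Pairwise squared Euclidean distances.
    d2 = [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    # Gaussian weights centred on each point (assumed kernel choice).
    w = [[math.exp(-d2[i][k] / bandwidth ** 2) for k in range(n)] for i in range(n)]

    # One shift step: point i moves to the data point j with the lowest weighted cost.
    nxt = []
    for i in range(n):
        costs = [sum(w[i][k] * d2[j][k] for k in range(n)) for j in range(n)]
        nxt.append(min(range(n), key=costs.__getitem__))

    # Follow the pointers until a fixed point; the visited set guards against cycles.
    labels = []
    for i in range(n):
        seen, j = set(), i
        while nxt[j] != j and j not in seen:
            seen.add(j)
            j = nxt[j]
        labels.append(j)
    return labels
```

Because the shifted position is always one of the input points, only pairwise distances are needed, which is what makes the approach nonparametric and applicable wherever a distance can be defined.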

    INVESTIGATING INVASION IN DUCTAL CARCINOMA IN SITU WITH TOPOGRAPHICAL SINGLE CELL GENOME SEQUENCING

    Synchronous Ductal Carcinoma in situ (DCIS-IDC) is an early stage of breast cancer invasion in which it is possible to delineate genomic evolution during invasion because both in situ and invasive regions are present within the same sample. While laser capture microdissection studies of DCIS-IDC examined the relationship between the paired in situ (DCIS) and invasive (IDC) regions, these studies were either confounded by bulk tissue or limited to a small set of genes or markers. To overcome these challenges, we developed Topographic Single Cell Sequencing (TSCS), which combines laser-catapulting with single cell DNA sequencing to measure genomic copy number profiles from single tumor cells while preserving their spatial context. We applied TSCS to sequence 1,293 single cells from 10 synchronous DCIS patients. We also applied deep-exome sequencing to the in situ, invasive and normal tissues for the DCIS-IDC patients. Previous bulk tissue studies had produced several conflicting models of tumor evolution. Our data support a multiclonal invasion model, in which genome evolution occurs within the ducts and gives rise to multiple subclones that escape the ducts into the adjacent tissues to establish the invasive carcinomas. In summary, we have developed a novel method for single cell DNA sequencing that preserves spatial context, and applied this method to understand clonal evolution during the transition from carcinoma in situ to invasive ductal carcinoma

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models, each addressing a biological problem. First, we present Zseq, a linear time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq estimates the complexity of each sequence by counting its unique k-mers and using that count as the sequence's score; it also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those with a z-score below the user-defined threshold. Zseq provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that a combination of proper clustering, a distance function, and index validation for clusters is suitable for identifying outlier transcripts, which trend differently from the majority of the transcripts; here the trend of a transcript is its abundance throughout the different stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. 
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease
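The Zseq filtering step described in the abstract above (score each sequence by its number of unique k-mers, then drop sequences whose z-score is below a threshold) can be sketched as follows. This is a minimal illustration under stated assumptions. It ignores the handling of ambiguous nucleotides and GC content that the abstract mentions, it uses the population standard deviation, and the function names are hypothetical rather than taken from the Zseq tool.

```python
from statistics import mean, pstdev


def unique_kmer_score(seq, k):
    """Complexity score: number of distinct k-mers in the sequence."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})


def filter_by_zscore(seqs, k, threshold):
    """Keep sequences whose complexity z-score is at least `threshold`."""
    if not seqs:
        return []
    scores = [unique_kmer_score(s, k) for s in seqs]
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All sequences are equally complex, so nothing stands out as low-complexity.
        return list(seqs)
    return [s for s, sc in zip(seqs, scores) if (sc - mu) / sigma >= threshold]
```

For example, with `k=3` the read `"AAAAAAAA"` has a single unique k-mer and falls well below reads such as `"ACGTTGCA"`, so a threshold of `-1.0` removes it from a small mixed set.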

    Integrated Machine Learning and Bioinformatics Approaches for Prediction of Cancer-Driving Gene Mutations

    Cancer arises from the accumulation of somatic mutations and genetic alterations in cell division checkpoints and apoptosis, which often leads to abnormal tumor proliferation. Proper classification of cancer-linked driver mutations will considerably help our understanding of the molecular dynamics of cancer. In this study, we compared several cancer-specific predictive models for prediction of driver mutations in cancer-linked genes; the models were validated on canonical data sets of functionally validated mutations and applied to raw cancer genomics data. By analyzing pathogenicity prediction and conservation scores, we have shown that evolutionary conservation scores play a pivotal role in the classification of cancer drivers and were the most informative features in the driver mutation classification. Through extensive comparative analysis with structure-functional experiments and multicenter mutational calling data from PanCancer Atlas studies, we have demonstrated the robustness of our models and addressed the validity of computational predictions. We evaluated the performance of our models using standard diagnostic metrics such as sensitivity, specificity, area under the curve and F-measure. To address the interpretability of cancer-specific classification models and obtain novel insights about molecular signatures of driver mutations, we have complemented machine learning predictions with structure-functional analysis of cancer driver mutations in several key tumor suppressor genes and oncogenes. Through the experiments carried out in this study, we found that evolutionary-based features have the strongest signal in the machine learning classification of driver mutations and provide orthogonal information to the ensemble-based scores that are prominent in the ranking of feature importance
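Three of the diagnostic metrics named in the abstract above (sensitivity, specificity, and F-measure) follow directly from the confusion counts of a binary driver/non-driver classifier. The sketch below uses the standard definitions. The function name and the 0/1 label encoding are assumptions for illustration, and area under the curve is omitted because it requires continuous scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity, precision and F1 for 0/1 labels (1 = driver)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Each ratio falls back to 0.0 when its denominator is empty.
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0  # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}
```

Reporting sensitivity and specificity together matters when the classes are imbalanced, since accuracy alone can look high for a classifier that rarely predicts the minority class.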

    Deep Learning Models for Predicting Phenotypic Traits and Diseases from Omics Data

    Computational analysis of high-throughput omics data, such as gene expression, copy number alterations and DNA methylation (DNAm), has become popular in disease studies in recent decades because such analyses can help predict whether a patient has a certain disease or one of its subtypes. However, because the data sets are high-dimensional, with hundreds of thousands of variables and a very small number of samples, traditional machine learning approaches, such as support vector machines (SVMs) and random forests, are limited in their ability to analyze these data efficiently. In this chapter, we review the progress in applying deep learning algorithms to solve some biological questions. The focus is on potential software tools and public data sources for the tasks. In particular, we show some case studies using deep neural network (DNN) models for classifying molecular subtypes of breast cancer and DNN-based regression models to account for interindividual variation in triglyceride concentrations measured at different visits of peripheral blood samples using DNAm profiles. We show that integration of multi-omics profiles into DNN-based learning methods could improve the prediction of the molecular subtypes of breast cancer. We also demonstrate the superiority of our proposed DNN models over the SVM model for predicting triglyceride concentrations

    Ductal carcinoma in situ of the breast: the importance of morphologic and molecular interactions.

    Ductal carcinoma in situ (DCIS) of the breast is a lesion characterized by significant heterogeneity, in terms of morphology, immunohistochemical staining, molecular signatures, and clinical expression. For some patients, surgical excision provides adequate treatment, but a subset of patients will experience recurrence of DCIS or progression to invasive ductal carcinoma (IDC). Recent years have seen extensive research aimed at identifying the molecular events that characterize the transition from normal epithelium to DCIS and IDC. Tumor epithelial cells, myoepithelial cells, and stromal cells undergo alterations in gene expression, which are most important in the early stages of breast carcinogenesis. Epigenetic modifications, such as DNA methylation, together with microRNA alterations, play a major role in these genetic events. In addition, tumor proliferation and invasion are facilitated by the lesional microenvironment, which includes stromal fibroblasts and macrophages that secrete growth factors and angiogenesis-promoting substances. Characterization of DCIS on a molecular level may better account for the heterogeneity of these lesions and how this manifests as differences in patient outcome and response to therapy. Molecular assays originally developed for assessing likelihood of recurrence in IDC have recently been applied to DCIS, with promising results. In the future, the classification of DCIS will likely incorporate molecular findings along with histologic and immunohistochemical features, allowing for personalized prognostic information and therapeutic options for patients with DCIS. This review summarizes current data regarding the molecular characterization of DCIS and discusses the potential clinical relevance