1,244 research outputs found

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models as contributions toward solving biological problems. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq scores the complexity of each sequence by counting its unique k-mers and also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those with a z-score below a user-defined threshold. Zseq provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning the reads and assembling the transcripts using the reference genome as well as de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that a combination of proper clustering, a distance function, and index validation for clusters is suitable for identifying outlier transcripts, whose trend differs from that of the majority of transcripts; here, the trend of a transcript is its abundance across the stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance. 
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species termed outlier transcripts that exhibit unique trending patterns as compared to most other transcripts during disease progression. This method is able to identify those outliers rather than finding patterns among the trending transcripts compared to the hierarchical clustering method based on Euclidean distance. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a lot of cancer cases and deaths in the world. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between the survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment, which is either hormone therapy, radiotherapy, or surgery will survive beyond five years after the treatment therapy. 
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease
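
The Zseq filtering step described in this abstract (score each sequence by its number of unique k-mers, then drop sequences whose z-score falls below a threshold) can be illustrated with a minimal Python sketch. This is not the Zseq implementation: the function names, the default `k`, and the default threshold are assumptions made for illustration, and the GC-content factor mentioned in the abstract is omitted. Ambiguous bases are handled here simply by not counting any k-mer that contains one.

```python
from statistics import mean, pstdev


def unique_kmer_score(seq, k=5):
    """Count distinct k-mers in seq, skipping k-mers with ambiguous bases.

    Hypothetical helper: Zseq's real scoring also weighs GC content,
    which this sketch leaves out.
    """
    seq = seq.upper()
    valid = set("ACGT")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:  # ignore k-mers containing N or other codes
            kmers.add(kmer)
    return len(kmers)


def filter_by_zscore(seqs, k=5, threshold=-1.0):
    """Keep sequences whose complexity z-score is at least `threshold`."""
    scores = [unique_kmer_score(s, k) for s in seqs]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:  # all sequences equally complex: nothing to filter
        return list(seqs)
    return [s for s, sc in zip(seqs, scores) if (sc - mu) / sigma >= threshold]


if __name__ == "__main__":
    reads = ["ACGTACGGTCAT", "AAAAAAAAAAAA", "ACGGTCATTGAC", "GATTACAGGCTT"]
    print(filter_by_zscore(reads, k=3, threshold=-1.0))
```

In the usage example, the low-complexity poly-A read has far fewer unique 3-mers than the other reads, so its z-score falls below the threshold and it is removed.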

    A Machine Learning Model for Discovery of Protein Isoforms as Biomarkers

    Prostate cancer is the most common cancer in men. One in eight Canadian men will be diagnosed with prostate cancer in their lifetime. Accurate detection of the disease's subtypes is critical for providing adequate therapy and, in turn, for increasing both survival rates and quality of life. Next generation sequencing can be beneficial when studying cancer. This technology generates a large amount of data that can be used to extract information about biomarkers. This thesis proposes a model that discovers protein isoforms for different stages of prostate cancer progression. A tool has been developed that utilizes RNA-Seq data to infer open reading frames (ORFs) corresponding to transcripts. These ORFs are used as features for classification. A quantification measurement, Adaptive Fragments Per Kilobase of transcript per Million mapped reads (AFPKM), is proposed to compute the expression level for ORFs. The new measurement considers the actual length of the ORF and the length of the transcript. Using these ORFs and the new expression measure, several classifiers were built using different machine learning techniques. This enabled the identification of some protein isoforms related to prostate cancer progression. The biomarkers have had a great impact on the discrimination of prostate cancer stages and are worth further investigation.
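
To make the idea of inferring an ORF from a transcript concrete, here is a generic Python sketch that scans the three forward reading frames for the longest stretch running from an ATG start codon to the first in-frame stop codon. It is not the tool described in this abstract: the function name `longest_orf` and the scanning rules are assumptions, and the AFPKM measure is not reproduced because the abstract does not give its formula.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript):
    """Return the longest ATG..stop ORF (stop codon included) found in the
    three forward reading frames, or "" if there is none.

    Hypothetical helper for illustration only.
    """
    seq = transcript.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close it and look for the next start codon
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATGGTAACC"))
```

A production ORF finder would typically also scan the reverse complement and handle ORFs that run off the end of the transcript; both are left out here to keep the sketch short.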

    USING MACHINE LEARNING TECHNIQUES FOR FINDING MEANINGFUL TRANSCRIPTS IN PROSTATE CANCER PROGRESSION

    Prostate cancer is one of the most common types of cancer among Canadian men. Next generation sequencing that uses RNA-Seq can be valuable in studying cancer, since it provides large amounts of data as a source of information about biomarkers. For these reasons, we have chosen RNA-Seq data on prostate cancer progression for our study. In this research, we propose a new method for finding transcripts that can be used as genomic features. To this end, we have gathered a very large number of transcripts. Many of these transcripts are not relevant, and we filter them out by applying a feature selection algorithm. The results are then passed to a machine learning classifier, such as a support vector machine, which is used to classify the different stages of prostate cancer. Finally, we have identified potential transcripts associated with prostate cancer progression. Ideally, these transcripts can be used for improving diagnosis, treatment, and drug development.

    Machine learning and computational methods to identify molecular and clinical markers for complex diseases – case studies in cancer and obesity

    In biomedical research, applied machine learning and bioinformatics are the essential disciplines heavily involved in translating data-driven findings into medical practice. This task is especially accomplished by developing computational tools and algorithms assisting in detection and clarification of underlying causes of the diseases. The continuous advancements in high-throughput technologies coupled with the recently promoted data sharing policies have contributed to presence of a massive wealth of data with remarkable potential to improve human health care. In concordance with this massive boost in data production, innovative data analysis tools and methods are required to meet the growing demand. The data analyzed by bioinformaticians and computational biology experts can be broadly divided into molecular and conventional clinical data categories. The aim of this thesis was to develop novel statistical and machine learning tools and to incorporate the existing state-of-the-art methods to analyze bio-clinical data with medical applications. The findings of the studies demonstrate the impact of computational approaches in clinical decision making by improving patients risk stratification and prediction of disease outcomes. This thesis is comprised of five studies explaining method development for 1) genomic data, 2) conventional clinical data and 3) integration of genomic and clinical data. With genomic data, the main focus is detection of differentially expressed genes as the most common task in transcriptome profiling projects. In addition to reviewing available differential expression tools, a data-adaptive statistical method called Reproducibility Optimized Test Statistic (ROTS) is proposed for detecting differential expression in RNA-sequencing studies. In order to prove the efficacy of ROTS in real biomedical applications, the method is used to identify prognostic markers in clear cell renal cell carcinoma (ccRCC). 
In addition to previously known markers, novel genes with a potential prognostic and therapeutic role in ccRCC are detected. For conventional clinical data, ensemble-based predictive models are developed to provide clinical decision support in the treatment of patients with metastatic castration-resistant prostate cancer (mCRPC). The proposed predictive models cover treatment and survival stratification tasks for both trial-based and real-world patient cohorts. Finally, genomic and conventional clinical data are integrated to demonstrate the importance of including genomic data for the predictive ability of clinical models. Again utilizing ensemble-based learners, a novel model is proposed to predict adulthood obesity using both genetic and social-environmental factors. Overall, the ultimate objective of this work is to demonstrate the importance of clinical bioinformatics and machine learning for bio-clinical marker discovery in complex diseases with high heterogeneity. In the case of cancer, the interpretability of clinical models strongly depends on predictive markers with high reproducibility supported by validation data. The discovery of these markers would increase the chance of early detection and improve prognosis assessment and treatment choice.

    Machine Learning Approaches for Identifying Cancer Biomarkers Using Next Generation Sequencing

    Identifying biomarkers that can be used to classify certain disease stages or predict when a disease becomes more aggressive is one of the most important applications of machine learning. Next generation sequencing (NGS) is a state-of-the-art method that enables fast sequencing of DNA or RNA samples. The output is usually a very large file that consists of base pairs of DNA or RNA. The generated data can be analyzed to provide, for example, gene expression, chromosome counts, detection of gene mutations, and levels of copy number variations or alterations in specific genes. NGS is leading the way to explore the human genome, enabling the future of personalized medicine. In this thesis, we demonstrate how machine learning can be used extensively to identify genes that predict prostate cancer stages with very high accuracy, using gene expression. We have also been successful in predicting the location of prostate tumors based on gene expression. In addition, traditional biomarker identification approaches typically use machine learning techniques to identify a number of genes and macromolecules as biomarkers that can be used to diagnose specific diseases or states of diseases with very high accuracy, using molecular measurements such as mutations, gene expression, copy number variations, and others. However, experts' opinions and knowledge are required to validate such findings. We therefore also introduce a new machine learning model that incorporates a knowledge-assisted system used to integrate the findings of the DisGeNET database, which is a framework that contains proven relationships among diseases and genes. The machine learning pipeline starts by reducing the number of features using a filter-based feature selection method. The DisGeNET database is then used to score each gene related to the given cancer name. 
Then, a wrapper-based feature selection algorithm picks the set of genes with the highest classification accuracy. The method has been able to retrieve key genes from multiple data sets that classify with very high accuracy while being biologically relevant, with no human intervention needed. Initial results provide a high area under the curve with a handful of genes that are already proven to be related to the relevant disease and state based on the latest published medical findings. The proposed method's results provide biomarkers that can be verified in wet-lab environments and can then be further analyzed and studied for diagnostic purposes.

    Computational functional prediction of novel long noncoding RNA in TCGA Glioblastoma multiforme sample

    According to the International Human Genome Sequencing Consortium (2004) [43], less than 2% of the human genome codes for proteins. This came as a surprise to the scientific community. Since then, many researchers have been drawn to the noncoding part of the genome, and there has been an explosion of research addressing the role of the 98% of the human genome that is not translated. This research shows that transcription is not limited to the protein-coding regions of the genome; rather, more than 90% of the genome is likely to be transcribed [43]. This results in the transcription of tens of thousands of long noncoding RNAs (lncRNAs) with little or no coding potential. However, the molecular mechanisms and functions of long noncoding RNAs are still an open research topic. Although the functions of a limited number of lncRNAs have been identified, there is still a gap in identifying the function of novel lncRNAs. This project implements different computational methods to predict the function of novel lncRNAs identified from TCGA glioblastoma multiforme samples. The methods used in this functional prediction include both expression-based and sequence-based analysis approaches. In expression-based analysis, the genes co-expressed with lncRNAs are used to predict possible functional relations. In sequence-based analysis, gene-protein and lncRNA-protein interactions, together with miRNA-lncRNA interactions, are considered in making functional predictions. The results of the integrated functional prediction show that the novel lncRNA TCGA_gbm3-153501, which is co-expressed with the THBS1 gene with a correlation coefficient of more than 0.5, is predicted to function in cell-cell and cell-to-matrix interactions, platelet aggregation, angiogenesis, and tumorigenesis [202]. 
MSI1, RBM3 and RBM8A are RNA binding proteins (RBPs) that have binding sites on both the top five differentially expressed lncRNAs, which are TCGA_gbm-2-104096501, TCGA_gbm-3-153501, TCGA_gbm-5-63687001 and TCGA_gbm-17-10671251, and IGF2, which is among the top 10 differentially expressed genes. Therefore, these lncRNAs are predicted to have a functional role in cell proliferation and maintenance of stem cells in the central nervous system.
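
The expression-based step in this abstract, linking a lncRNA to genes whose expression correlates with it above 0.5, can be sketched in Python as follows. The abstract does not say which correlation coefficient was used, so Pearson correlation is an assumption here, as are the function names and the toy gene labels in the usage example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:  # a flat profile has no defined correlation
        return 0.0
    return cov / (sx * sy)


def coexpressed_genes(lncrna_expr, gene_exprs, cutoff=0.5):
    """Map gene name -> correlation for genes correlated above `cutoff`."""
    result = {}
    for gene, expr in gene_exprs.items():
        r = pearson(lncrna_expr, expr)
        if r > cutoff:
            result[gene] = r
    return result


if __name__ == "__main__":
    lncrna = [1, 2, 3, 4, 5]          # toy expression across five samples
    genes = {
        "GENE_A": [2, 4, 6, 8, 10],   # hypothetical, rises with the lncRNA
        "GENE_B": [5, 4, 3, 2, 1],    # hypothetical, anti-correlated
        "GENE_C": [3, 3, 3, 3, 3],    # hypothetical, flat
    }
    print(coexpressed_genes(lncrna, genes))
```

Functions of the retained genes would then be used to suggest a function for the lncRNA, which is the "guilt by association" reasoning the abstract describes for THBS1.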

    Discovering cancer-associated transcripts by RNA sequencing

    High-throughput sequencing of poly-adenylated RNA (RNA-Seq) in human cancers shows remarkable potential to identify uncharacterized aspects of tumor biology, including gene fusions with therapeutic significance and disease markers such as long non-coding RNA (lncRNA) species. However, the analysis of RNA-Seq data places unprecedented demands upon computational infrastructures and algorithms, requiring novel bioinformatics approaches. To meet these demands, we present two new open-source software packages - ChimeraScan and AssemblyLine - designed to detect gene fusion events and novel lncRNAs, respectively. RNA-Seq studies utilizing ChimeraScan led to discoveries of new families of recurrent gene fusions in breast cancers and solitary fibrous tumors. Further, ChimeraScan was one of the key components of the repertoire of computational tools utilized in data analysis for MI-ONCOSEQ, a clinical sequencing initiative to identify potentially informative and actionable mutations in cancer patients’ tumors. AssemblyLine, by contrast, reassembles RNA sequencing data into full-length transcripts ab initio. In head-to-head analyses AssemblyLine compared favorably to existing ab initio approaches and unveiled abundant novel lncRNAs, including antisense and intronic lncRNAs disregarded by previous studies. Moreover, we used AssemblyLine to define the prostate cancer transcriptome from a large patient cohort and discovered myriad lncRNAs, including 121 prostate cancer-associated transcripts (PCATs) that could potentially serve as novel disease markers. Functional studies of two PCATs - PCAT-1 and SChLAP1 - revealed cancer-promoting roles for these lncRNAs. PCAT1, a lncRNA expressed from chromosome 8q24, promotes cell proliferation and represses the tumor suppressor BRCA2. SChLAP1, located in a chromosome 2q31 ‘gene desert’, independently predicts poor patient outcomes, including metastasis and cancer-specific mortality. 
Mechanistically, SChLAP1 antagonizes the genome-wide localization and regulatory functions of the SWI/SNF chromatin-modifying complex. Collectively, this work demonstrates the utility of ChimeraScan and AssemblyLine as open-source bioinformatics tools. Our applications of ChimeraScan and AssemblyLine led to the discovery of new classes of recurrent and clinically informative gene fusions, and established a prominent role for lncRNAs in coordinating aggressive prostate cancer, respectively. We expect that the methods and findings described herein will establish a precedent for RNA-Seq-based studies in cancer biology and assist the research community at large in making similar discoveries. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120814/1/mkiyer_1.pd

    Prognostic biomarker detection, machine learning bias correction, and differential coexpression module detection

    In this thesis, we present three projects on prognostic biomarker detection, machine learning bias correction, and identification of differential coexpression modules in complex diseases. In the first project, we aimed to identify fusion transcripts that have predictive value for prostate cancer prognosis, an important task for avoiding overtreatment of patients. We discovered eight fusion transcripts from 19 RNA-seq datasets and validated their predictive value on >200 patients from three sites (Pittsburgh, Stanford and Wisconsin). The constructed prediction model showed consistently high accuracy in predicting prostate cancer recurrence and aggressiveness in all three cohorts. In the second project, we consider the common practice of applying many (up to several hundred) machine learning classifiers to a dataset and reporting the best cross-validated accuracy. We demonstrated a downward bias using this approach and proposed an inverse power law (IPL) method to correct the bias. The method was compared with several existing methods using simulated and real datasets and showed superior performance. For the third study, we developed a computational algorithm (MetaDiffNetwork) to identify coexpression modules that are consistently differential across disease conditions in multiple transcriptomic studies. We demonstrated good performance of the algorithm using simulated data and applied it to combine eight major depressive disorder studies (cases vs. controls) and four breast cancer studies (ER+ vs. ER-). The identified modules were validated against existing knowledge of disease pathways. These modules can be used to help generate new hypotheses regarding suspected disease genes. In conclusion, the three areas of research covered in this thesis are critical bioinformatic elements for biomedical applications and can be used to help understand the underlying disease mechanisms and ultimately improve patient treatment.
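
The selection effect behind the second project can be seen in a toy Python simulation: when many classifiers are scored on the same small dataset and only the best accuracy is reported, that number is optimistic even if no classifier has any real skill. Read this way, the "downward bias" in the abstract would refer to the error estimate, which is an interpretation rather than something the abstract states. The sketch uses random guessers instead of cross-validated models and does not implement the IPL correction; the function name and defaults are assumptions.

```python
import random


def simulate_best_of_many(n_samples=40, n_classifiers=200, seed=0):
    """Score many random-guess 'classifiers' on random binary labels.

    Returns (best accuracy, average accuracy). Every classifier has a true
    accuracy of 0.5, so any excess in the best score is selection bias.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    accuracies = []
    for _ in range(n_classifiers):
        preds = [rng.randint(0, 1) for _ in range(n_samples)]
        correct = sum(p == y for p, y in zip(preds, labels))
        accuracies.append(correct / n_samples)
    return max(accuracies), sum(accuracies) / n_classifiers


if __name__ == "__main__":
    best, average = simulate_best_of_many()
    print(f"best of many: {best:.3f}, average: {average:.3f}")
```

The average stays near chance while the best score sits well above it, which is why a reported best-of-many accuracy needs a correction before it can be trusted on new data.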

    Characterizing and reassembling the COPD and ILD transcriptome using RNA-Seq

    Chronic Obstructive Pulmonary Disease (COPD) is the 3rd leading cause of death in the US, and idiopathic pulmonary fibrosis (IPF), a type of Interstitial Lung Disease (ILD), is a fast-acting, irreversible disease that leads to mortality within 3-5 years. RNA sequencing provides the opportunity to quantitatively examine the sequences of millions of mRNAs and offers the potential to gain unprecedented insights into the structure of the transcriptome in chronic non-malignant lung disease. By identifying changes in splicing and novel loci expression associated with disease, we may be able to gain a better understanding of their pathogenesis, identify novel disease-specific biomarkers, and find better targets for therapy. Using RNA-seq data that our group generated on 281 human lung tissue samples (47 control, 131 COPD, 103 ILD), I initially defined the transcriptomic landscape of lung tissue by identifying which genes were expressed in each tissue sample. I used a mixture model to separate genes into those with reliable and unreliable expression. Next, I employed reads that overlapped splice junctions in a linear model interaction term to identify disease-specific differential splicing. I identified alternatively spliced genes between control and disease tissues and validated three of these genes (PDGFA, NUMB, SCEL) with qPCR and NanoString (a hybridization-based barcoding technique used to quantify transcripts). Finally, I implemented and improved a pipeline to perform transcriptome assembly using Cufflinks that led to the identification of 1,855 novel loci that did not overlap with UCSC, Vega, and Ensembl annotations. The loci were classified into potential coding and non-coding loci (191 and 1,664, respectively). Expression analysis revealed that there were 120 IPF-associated and 10 emphysema-associated differentially expressed (q < 0.01) novel loci. RNA-seq provides a high-resolution transcript-level view of the pulmonary transcriptome and its modification in lung disease. 
It has enabled a new understanding of the lung transcriptome structure because it measures not only the transcripts we know but also the ones we do not know. The approaches and improvements I have employed have identified these novel targets and make possible further downstream functional analysis that could identify better targets for therapy and lead to an even better understanding of chronic lung disease pathogenesis.