142 research outputs found

    Machine Learning Approaches for Cancer Analysis

    In addition, we propose several machine learning models, each contributing to the solution of a biological problem. First, we present Zseq, a linear-time method that identifies the most informative genomic sequences and reduces the number of biased sequences, sequence duplications, and ambiguous nucleotides. Zseq measures the complexity of each sequence by counting its unique k-mers and using that count as the sequence's score; it also takes into account other factors, such as ambiguous nucleotides or a high GC-content percentage in k-mers. Zseq then sweeps through the sequences again and filters out those with a z-score less than the user-defined threshold. Zseq provides a better mapping rate and significantly reduces the number of ambiguous bases in comparison with other methods. The filtered reads were evaluated by aligning them and assembling the transcripts, both against the reference genome and by de novo assembly. The assembled transcripts show a better discriminative ability to separate cancer and normal samples in comparison with another state-of-the-art method. Studying the abundance of select mRNA species throughout prostate cancer progression may provide some insight into the molecular mechanisms that advance the disease. In the second contribution of this dissertation, we show that combining a proper clustering method, distance function, and cluster validity index is suitable for identifying outlier transcripts, which trend differently from the majority of the transcripts; here, a transcript's trend is its abundance across the different stages of prostate cancer. We compare this model with a standard hierarchical time-series clustering method based on Euclidean distance.
Using time-series profile hierarchical clustering methods, we identified stage-specific mRNA species, termed outlier transcripts, that exhibit unique trending patterns compared with most other transcripts during disease progression. Unlike the hierarchical clustering method based on Euclidean distance, which finds patterns among the trending transcripts, this method identifies those outliers. A wet-lab experiment on a biomarker (CAM2G gene) confirmed the result of the computational model. Genes related to these outlier transcripts were found to be strongly associated with cancer, and in particular, prostate cancer. Further investigation of these outlier transcripts in prostate cancer may identify them as potential stage-specific biomarkers that can predict the progression of the disease. Breast cancer, on the other hand, is a widespread type of cancer in females and accounts for a large share of cancer cases and deaths worldwide. Identifying the subtype of breast cancer plays a crucial role in selecting the best treatment. In the third contribution, we propose an optimized hierarchical classification model that is used to predict the breast cancer subtype. Suitable filter feature selection methods and new hybrid feature selection methods are utilized to find discriminative genes. Our proposed model achieves 100% accuracy for predicting the breast cancer subtypes using the same or even fewer genes. Studying breast cancer survivability among different patients who received various treatments may help understand the relationship between survivability and treatment therapy based on gene expression. In the fourth contribution, we have built a classifier system that predicts whether a given breast cancer patient who underwent some form of treatment (hormone therapy, radiotherapy, or surgery) will survive beyond five years after the treatment.
Our classifier is a tree-based hierarchical approach that partitions breast cancer patients based on survivability classes; each node in the tree is associated with a treatment therapy and finds a predictive subset of genes that can best predict whether a given patient will survive after that particular treatment. We applied our tree-based method to a gene expression dataset that consists of 347 treated breast cancer patients and identified potential biomarker subsets with prediction accuracies ranging from 80.9% to 100%. We have further investigated the roles of many biomarkers through the literature. Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. Discovery of gene indicators can be a crucial step in predicting survivability and handling of breast cancer patients. In the fifth contribution, we propose a hierarchical clustering method to separate dissimilar groups of genes in time-series data as outliers. These isolated outliers, genes that trend differently from other genes, can serve as potential biomarkers of breast cancer survivability. In the last contribution, we introduce a method that uses machine learning techniques to identify transcripts that correlate with prostate cancer development and progression. We have isolated transcripts that have the potential to serve as prognostic indicators and may have significant value in guiding treatment decisions. Our study also supports PTGFR, NREP, scaRNA22, DOCK9, FLVCR2, IK2F3, USP13, and CLASP1 as potential biomarkers to predict prostate cancer progression, especially between stage II and subsequent stages of the disease
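
The Zseq scoring and filtering steps described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default `k`, and the choice to simply skip k-mers containing `N` are hypothetical, and the GC-content adjustment mentioned in the abstract is not modelled.

```python
from statistics import mean, pstdev


def unique_kmer_count(seq, k=4):
    """Score a read by its number of distinct k-mers.

    Assumption: k-mers containing an ambiguous base (N) are not counted.
    """
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return sum(1 for kmer in kmers if "N" not in kmer)


def filter_by_zscore(seqs, k=4, threshold=-1.0):
    """Keep reads whose complexity z-score is at least `threshold`."""
    # First pass: score every read.
    scores = [unique_kmer_count(s, k) for s in seqs]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All reads have the same score, so none stands out as low-complexity.
        return list(seqs)
    # Second pass: drop reads whose z-score falls below the threshold.
    return [s for s, sc in zip(seqs, scores) if (sc - mu) / sigma >= threshold]
```

Both passes touch each read once, which is consistent with the linear-time behaviour claimed for the method.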

    An Adaptive Clustering Algorithm for Gene Expression Time-Series Data Analysis

    Studying gene expression through various time intervals of breast cancer survival may provide insights into the recovery of the patients. In this work, we propose a hierarchical clustering method used to separate dissimilar groups of genes in time-series data, which have the furthest distances from the rest of the genes throughout different time intervals. The isolated outliers (genes that trend differently from other genes) can serve as potential biomarkers of breast cancer survivability. We partition the time axis (time points) into bins of length six months starting from 1-6 up to 337-342 month intervals and, for each gene, we average its expression level over all patients who appear in a survival bin. Gene expressions throughout those time points are cubic spline interpolated to create a trending profile for each gene. First, we universally align the gene expression profiles to minimize the total area between them. Then, we cluster them using a sliding window approach and hierarchical clustering based on minimum vertical distances. To the best of our knowledge, this work is the first time-series model that is built on the survival time of patients after the treatment. With this approach, we identified 46 genes (including 24 oncogenes and 18 tumor suppressor genes) as potential biomarkers of breast cancer survivability
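
The six-month binning step in this abstract can be illustrated with a short sketch. The names `bin_index`, `bin_label`, and `mean_expression_per_bin` and the `(survival_month, expression)` input layout are assumptions made for illustration; the spline interpolation, alignment, and clustering steps are not shown.

```python
from collections import defaultdict

BIN_WIDTH = 6    # months per bin, as stated in the abstract
MAX_MONTH = 342  # last month of the final bin (337-342)


def bin_index(month):
    """Map a survival time in months (1..342) to a zero-based bin index."""
    if not 1 <= month <= MAX_MONTH:
        raise ValueError(f"month out of range: {month}")
    return (month - 1) // BIN_WIDTH


def bin_label(idx):
    """Human-readable label such as '1-6' or '337-342'."""
    start = idx * BIN_WIDTH + 1
    return f"{start}-{start + BIN_WIDTH - 1}"


def mean_expression_per_bin(patients):
    """Average one gene's expression over the patients in each survival bin.

    `patients` is an iterable of (survival_month, expression) pairs.
    Bins with no patients are omitted from the result.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for month, expression in patients:
        idx = bin_index(month)
        sums[idx] += expression
        counts[idx] += 1
    return {bin_label(i): sums[i] / counts[i] for i in sorted(sums)}
```

The per-bin means produced this way are the time points that the abstract then interpolates with a cubic spline to obtain each gene's trending profile.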

    Machine Learning Approaches for Breast Cancer Survivability Prediction

    Breast cancer is one of the leading causes of cancer death in women. If not diagnosed early, the 5-year survival rate of patients is just about 26%. Furthermore, patients with similar phenotypes can respond differently to the same therapies, which means the therapies might not work well for some of them. Identifying biomarkers that can help predict a cancer class with high accuracy is at the heart of breast cancer studies because they are targets of the treatments and drug development. Genomics data have been shown to carry useful information for breast cancer diagnosis and prognosis, as well as for uncovering the disease’s mechanism. Machine learning methods are powerful tools to find such information. Feature selection methods are often utilized in supervised and unsupervised learning tasks to deal with data containing a large number of features, only a small portion of which are useful to the classification task. On the other hand, analyzing only one type of data, without reference to the existing knowledge about the disease and the therapies, might lead to misleading findings. Effective data integration approaches are necessary to understand this complex disease. In this thesis, we apply and develop machine learning methods to identify meaningful biomarkers for breast cancer survivability prediction after a certain treatment. These include applying feature selection methods to gene-expression data to derive gene signatures, where the initial genes are collected with regard to the mechanism of some drugs used in breast cancer therapies. We also propose a new feature selection method, named PAFS, and apply it to discover accurate biomarkers. In addition, there is increasing support for the view that sub-network biomarkers are more robust and accurate than gene biomarkers. We propose two network-based approaches to identify sub-network biomarkers for breast cancer survivability prediction after a treatment.
They integrate gene-expression data with protein-protein interactions during the optimal sub-network searching process and use cancer-related genes and pathways to prioritize the extracted sub-networks. The sub-network search space is usually huge and many proteins interact with thousands of other proteins. Thus, we apply some heuristics to avoid generating and evaluating redundant sub-networks

    Computational Approaches to Assessing Clinical Relevance Of Preclinical Cancer Models

    Preclinical cancer models, such as tumour-derived cell lines and animal models, are essential in cancer research. Consistently used as a platform to investigate mechanism of action, they can identify potential biomarkers prior to clinical trials where similar exploration is more complicated and expensive. However, whilst cell lines are the most used preclinical model, their applicability in certain settings is questioned because of the difficulty of aligning the appropriate cell lines with a clinically relevant disease segment. I developed a methodology for systematic cancer cell line scoring based on patient sample subtypes and analysis of the causative elements of the subtype differentiation in cancer. Machine learning classifiers I tailored to the multi-omics nature of cancer have been highly accurate in predicting the subtype of new patient samples. Applying those models to cancer cell lines resulted in a clinically based cancer cell line relevance score. The majority of cell line scores were in line with the literature, but there were several misclassified cells. Exploring the causative elements of the underlying biology, I confirmed the oncogenic nature of the features driving the classification. Additionally, through differential expression analysis, the nature of some of the misclassified breast cancer cell lines was elucidated: they were poorly representative of their receptor-positive type despite having HER2 receptor expressed. One of those cell lines, JIMT-1, has been shown to be resistant to HER2-targeted treatment, thus making the misclassification of my model more clinically relevant than the receptor statuses of the cell line itself. Through several distance metrics I have expanded on the binary nature of the classifying methods and identified more and less suitable cell lines not just by their score, but also by how close they are to the patient samples.
The core aspects of my methodology have been implemented as an online tool, a Shiny application, in order to allow others to leverage my methods and findings

    Multi-omic biomarker discovery and network analyses to elucidate the molecular mechanisms of lung cancer premalignancy

    Lung cancer (LC) is the leading cause of cancer death in the US, claiming over 160,000 lives annually. Although CT screening has been shown to be efficacious in reducing mortality, the limited access to screening programs among high-risk individuals and the high number of false positives contribute to low survival rates and increased healthcare costs. As a result, there is an urgent need for preventative therapeutics and novel interception biomarkers that would enhance current methods for detection of early-stage LC. This thesis addresses this challenge by examining the hypothesis that transcriptomic changes preceding the onset of LC can be identified by studying bronchial premalignant lesions (PMLs) and the normal-appearing airway epithelial cells altered in their presence (i.e., the PML-associated airway field of injury). PMLs are the presumed precursors of lung squamous cell carcinoma (SCC) whose presence indicates an increased risk of developing SCC and other subtypes of LC. Here, I leverage high-throughput mRNA and miRNA sequencing data from bronchial brushings and lesion biopsies to develop biomarkers of PML presence and progression, and to understand regulatory mechanisms driving early carcinogenesis. First, I utilized mRNA sequencing data from normal-appearing airway brushings to build a biomarker predictive of PML presence. After verifying the power of the 200-gene biomarker to detect the presence of PMLs, I evaluated its capacity to predict PML progression and detect presence of LC (Aim 1). Next, I identified likely regulatory mechanisms associated with PML severity and progression by evaluating miRNA expression and gene coexpression modules containing their targets in bronchial lesion biopsies (Aim 2). Lastly, I investigated the preservation of the PML-associated miRNAs and gene modules in the airway field of injury, highlighting an emergent link between the airway field and the PMLs (Aim 3).
Overall, this thesis suggests a multi-faceted utility of PML-associated genomic signatures as markers for stratification of high-risk smokers in chemoprevention trials, markers for early detection of lung cancer, and novel chemopreventive targets, and yields valuable insights into early lung carcinogenesis by characterizing mRNA and miRNA expression alterations that contribute to premalignant disease progression towards LC. 2020-01-2

    Statistical Meta-Analysis of Risk Factors for Endometrial Cancer and Development of a Risk Prediction Model Using an Artificial Neural Network Algorithm

    Objectives: In this study we wished to determine the rank order of risk factors for endometrial cancer and calculate a pooled risk and percentage risk for each factor using a statistical meta-analysis approach. The next step was to design a neural network computer model to predict the overall increased or decreased risk of cancer for individual patients. This would help to determine whether this prediction could be used as a tool to decide if a patient should be considered for testing and to predict diagnosis, as well as to suggest prevention measures to patients. Design: A meta-analysis of existing data was carried out to calculate relative risk, followed by design and implementation of a risk prediction computational model based on a neural network algorithm. Setting: Meta-analysis data were collated from various settings from around the world. Primary data to test the model were collected from a hospital clinic setting. Participants: Data were collected from the notes of 40 patients currently suspected of having endometrial cancer and undergoing investigations and treatment; these data were used to test the software, with the patients' cancer diagnoses not revealed to the software developers. Main outcome measures: The forest plots allowed an overall relative risk and percentage risk to be calculated from all the risk data gathered from the studies. A neural network computational model to determine percentage risk for individual patients was developed, implemented, and evaluated. Results: The results show that the greatest percentage increased risk was due to BMI being above 25, with the risk increasing as BMI increases. A BMI of 25 or over gave an increased risk of 2.01%, a BMI of 30 or over gave an increase of 5.24%, and a BMI of 40 or over led to an increase of 6.9%. PCOS was the second highest increased risk at 4.2%.
Diabetes, which is incidentally also linked to an increased BMI, gave a significant increased risk along with null parity and noncontinuous HRT of 1.54%, 1.2%, and 0.56% respectively. Decreased risk due to contraception was greatest with IUD (intrauterine device) and IUPD (intrauterine progesterone device) at −1.34% compared to −0.9% with oral. Continuous HRT at −0.75% and parity at −0.9% also decreased the risk. Using open-source patient data to test our computational model to determine risk, our results showed that the model is 98.6% accurate with an algorithm sensitivity of 75% on average. Conclusions: In this study, we successfully determined the rank order of risk factors for endometrial cancer and calculated a pooled risk and risk percentage for each factor using a statistical meta-analysis approach. Then, using a computer neural network model system, we were able to model the overall increased or decreased risk of cancer and predict the cancer diagnosis for particular patients to an accuracy of over 98%. The neural network model developed in this study was shown to be a potentially useful tool in determining the percentage risk and predicting the possibility of a given patient developing endometrial cancer. As such, it could be a useful tool for clinicians to use in conjunction with other biomarkers in determining which patients warrant further preventative interventions to avert progression to endometrial cancer. This result would allow for a reduction in the number of unnecessary invasive tests on patients. The model may also be used to suggest interventions to decrease the risk for a particular patient. The sensitivity of the model limits it at this stage due to the small percentage of positive cases in the datasets; however, since this model utilizes a neural network machine learning algorithm, it can be further improved by providing the system with more and larger datasets to allow further refinement of the neural network
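
The abstract does not state which pooling model produced its overall relative risks. As one common possibility, a fixed-effect inverse-variance pooling on the log scale can be sketched as below; the function name and the `(relative_risk, standard_error_of_log_rr)` input layout are assumptions for illustration, and this may differ from the method the authors used.

```python
import math


def pooled_relative_risk(studies):
    """Fixed-effect inverse-variance pooling of relative risks.

    `studies` is a list of (relative_risk, standard_error_of_log_rr) pairs.
    Returns the pooled relative risk and the standard error of its logarithm.
    """
    # Each study is weighted by the inverse variance of its log relative risk.
    weights = [1.0 / se ** 2 for _, se in studies]
    log_rrs = [math.log(rr) for rr, _ in studies]
    total_weight = sum(weights)
    pooled_log = sum(w * l for w, l in zip(weights, log_rrs)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return math.exp(pooled_log), pooled_se
```

More precise studies (smaller standard errors) pull the pooled estimate towards their own value, which is the weighting a forest plot typically visualises.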

    2018 Symposium Brochure

    This dissertation explores the mean field Heisenberg spin system and its evolution in time. We first study the system in equilibrium, where we explore the tool known as Stein's method, used for determining convergence rates to thermodynamic limits, both in an example proof for a mean field Ising system and in tightening a previous result for the equilibrium mean field Heisenberg system. We then model the evolution of the mean field Heisenberg model using Glauber dynamics and use this method to test the equilibrium results of two previous papers, uncovering a typographical error in one. Agreement in other aspects between theory and our simulations validates our approach in the equilibrium case. Next, we compare the evolution of the Heisenberg system under Glauber dynamics to a number of forms of Brownian motion and determine that Brownian motion is a poor match in most situations. Turning back to Stein's method, we consider what sort of proof regarding the behavior of the mean field Heisenberg model over time is obtainable and look at several possible routes to that path. We finish up by offering a Stein's method approach to understanding the evolution of the mean field Heisenberg model and offer some insight into its convergence in time to a thermodynamic limit. This demonstrates the potential usefulness of Stein's method in understanding the finite time behavior of evolving systems. In our efforts, we encounter several holes in current mathematical and physical knowledge. In particular, we suggest the development of tools for Markov chains currently unavailable and the development of a more physically based algorithm for the evolution of Heisenberg systems. These projects lie beyond the scope of this dissertation but it is our hope that these ideas may be useful to future research