309 research outputs found

    Clinical data mining and classification

    Dissertation for the degree of Master in Informatics and Computer Engineering. Determining which genes contribute to the development of certain diseases, such as cancer, is an important goal at the forefront of today’s clinical research. This can provide insights into how diseases develop and can lead to new treatments and to diagnostic tests that detect diseases earlier in their development, increasing patients’ chances of recovery. Today, many gene expression datasets are publicly available. These generally consist of DNA microarray data with information on the activation (or not) of thousands of genes in specific patients who exhibit a certain disease. However, these clinical datasets consist of high-dimensional feature vectors, which makes clinical analysis and interpretation by humans difficult: given the large number of features and the comparatively small number of instances, it is difficult to identify the most relevant genes related to the presence of a particular disease. In this thesis, we explore the use of feature discretization, feature selection, and classification techniques to identify the most relevant set of features (genes), within DNA microarray datasets, that can predict the presence of a given disease. We build a machine learning pipeline that applies different feature discretization, feature selection, and classification techniques to different datasets, and we compare and interpret the results achieved with each combination of techniques. On most datasets, we were able to obtain lower classification errors by applying either feature discretization or feature selection techniques (but not both). When applying feature selection techniques, we were also able to reduce the number of features fed to each classifier while maintaining or improving the classification results. These smaller subsets of genes are thus easier for human clinical experts to interpret, improving the explainability of the results.

    Identifying genomic signatures for predicting breast cancer outcomes

    Predicting the risk for recurrence in breast cancer patients is a critical task in clinics. Recent developments in DNA microarrays have fostered tremendous advances in molecular diagnosis and prognosis of breast cancer. The first part of our study was based on a novel approach of considering the level of genomic instability as one of the most powerful predictors of clinical outcome. A systematic technique was presented to explore whether there is a linkage between the degree of genomic instability, gene expression patterns, and clinical outcomes by considering the following hypotheses: first, the degree of genomic instability is reflected by an aneuploidy-specific gene signature; second, this signature is robust and allows prediction of clinical outcomes in breast cancer. The first hypothesis was tested by gene expression profiling of 48 breast tumors with varying degrees of genomic instability. A supervised machine learning approach employing a combination of feature selection algorithms was used to identify a 12-gene genomic instability signature from a set of 7657 genes. The second hypothesis was tested by performing patient stratification on published breast cancer datasets using the genomic instability signature. The results showed that patients with genomically stable breast carcinomas had considerably longer disease-free survival times than those with genomically unstable tumors. The gene signature generated significant patient stratification with distinct relapse-free and overall survival (log-rank tests; p < 0.05; n = 469). It was independent of clinical-pathological parameters and provided additional prognostic information within sub-groups defined by each of them. The importance of selecting patients at high risk for recurrence for more aggressive therapy was addressed in the second part of the study, considering the fact that breast cancer patients with advanced stages receive chemotherapy, but only half of them benefit from it. 
The FDA recently approved the first gene test for cancer, MammaPrint, for node-negative primary breast cancer. Oncotype DX is a commercially available gene test for tamoxifen-treated, node-negative, and estrogen receptor-positive breast cancer. These signatures are specific for early stage breast cancers. A population-based approach to the molecular prognosis of breast cancer is needed for more rational therapy for breast cancer patients. A 28-gene expression signature was identified in our previous study using a population-based approach. Using this signature, a patient-stratification scheme was developed by employing the nearest centroid classification algorithm. It generated a significant stratification with distinct relapse-free survival (log-rank tests; p < 0.05; n = 1337) and overall survival (log-rank tests; p < 0.05; n = 806), based on the transcriptional profiles that were produced on a diverse range of microarray platforms. This molecular classification scheme could enable physicians to make treatment decisions based on specific characteristics of patients and their tumors, rather than population statistics. It could further refine subgroups defined by traditional clinical-pathological parameters into prognostic risk groups. It was unclear whether a common gene set could predict a poor outcome in breast and ovarian cancer, the most common malignancies in women. The 28-gene signature generated significant prognostic categorization in ovarian cancers (log-rank tests; p < 0.0001; n = 124), thus confirming the clinical applicability of the gene signature to predict breast and ovarian cancer recurrence.
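
The nearest centroid classification mentioned in this abstract can be illustrated with a minimal sketch. It assumes Euclidean distance and made-up three-gene profiles; the abstract does not state the distance measure or the data layout, so the function names and values below are hypothetical.

```python
import math

def fit_centroids(samples, labels):
    """Return {label: mean expression vector} computed per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance, an assumption)."""
    return min(centroids, key=lambda y: math.dist(centroids[y], x))

# Hypothetical toy profiles: three genes per patient, two risk groups.
train = [[1.0, 2.0, 0.5], [1.2, 1.8, 0.7], [3.0, 0.5, 2.5], [2.8, 0.7, 2.7]]
risk = ["low", "low", "high", "high"]

centroids = fit_centroids(train, risk)
print(predict(centroids, [2.9, 0.6, 2.4]))
```

Each class is summarized by a single mean profile, which is one reason this kind of classifier is often considered when samples are few relative to genes.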

    Kernel methods in genomics and computational biology

    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future

    Classification between normal and tumor tissues based on the pair-wise gene expression ratio

    BACKGROUND: Precise classification of cancer types is critically important for early cancer diagnosis and treatment. Numerous efforts have been made to use gene expression profiles to improve the precision of tumor classification. However, reliable cancer-related signals are generally lacking. METHOD: Using recent datasets on colon and prostate cancer, a data transformation procedure from single gene expression to pair-wise gene expression ratio is proposed. Making use of the internal consistency of each expression profiling dataset, this transformation improves the signal-to-noise ratio of the dataset and uncovers new relevant cancer-related signals (features). The efficiency of using the transformed dataset to perform normal/tumor classification was investigated using feature partitioning with informative features (gene annotation) as discriminating axes (single gene expression or pair-wise gene expression ratio). Classification results were compared to the original datasets for up to 10-feature model classifiers. RESULTS: 82 and 262 genes that have high correlation to tissue phenotype were selected from the colon and prostate datasets, respectively. Remarkably, data transformation of the highly noisy expression data lowered the coefficient of variation (CV) for the within-class samples and improved the correlation with tissue phenotypes. The transformed dataset exhibited lower CV than single gene expression. In the colon cancer set, the minimum CV decreased from 45.3% to 16.5%. In prostate cancer, comparable CV was achieved with and without transformation. This improvement in CV, coupled with the improved correlation between the pair-wise gene expression ratio and tissue phenotypes, yielded higher classification efficiency, especially with the colon dataset, from 87.1% to 93.5%. Over 90% of the top ten discriminating axes in both datasets showed significant improvement after data transformation. 
The high classification efficiency achieved suggested that some cancer-related signals exist in the form of pair-wise gene expression ratios. CONCLUSION: The results from this study indicated that: 1) when the pair-wise expression ratio transformation achieves lower CV and higher correlation to tissue phenotypes, a better classification of tissue type will follow; 2) the comparable classification accuracy achieved after data transformation suggested that the pair-wise gene expression ratio between some pairs of genes can identify reliable markers for cancer.
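
The effect of the pair-wise ratio transformation on the coefficient of variation can be sketched as follows. The expression values are hypothetical and built so that both genes share a per-sample scaling factor; that is one plausible reading of "internal consistency", not the authors' stated mechanism.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

def pairwise_ratio(gene_a, gene_b):
    """Per-sample ratio of the expression levels of two genes."""
    return [a / b for a, b in zip(gene_a, gene_b)]

# Hypothetical within-class measurements for two genes across four samples.
# Both genes vary together from sample to sample (shared scaling, an assumption).
gene_a = [10.0, 15.0, 7.0, 20.0]
gene_b = [5.0, 7.5, 3.5, 10.0]

ratio = pairwise_ratio(gene_a, gene_b)
print("CV of single gene:", round(cv(gene_a), 3))
print("CV of ratio:", round(cv(ratio), 3))
```

In this constructed case the shared variation cancels in the ratio, so the within-class CV of the ratio is lower than that of either gene alone. Real data would only partly behave this way.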

    A Survey of Machine Learning Approaches Applied to Gene Expression Analysis for Cancer Prediction

    Machine learning approaches are powerful techniques commonly employed for developing cancer prediction models using associated gene expression and mutation data. Our survey provides a comprehensive review of recent cancer studies that have employed gene expression data from several cancer types (breast, lung, kidney, ovarian, liver, central nervous system and gallbladder) for survival prediction, tumor identification and stratification. We also provide an overview of biomarker studies that are associated with these cancer types. The survey captures multiple aspects of machine learning associated cancer studies, including cancer classification, cancer prediction, identification of biomarker genes, microarray, and RNA-Seq data. We discuss the technical issues with current cancer prediction models and the corresponding measurement tools for determining the activity levels of gene expression between cancerous tissues and noncancerous tissues. Additionally, we investigate how identifying putative biomarker gene expression patterns can aid in predicting future risk of cancer and inform the provision of personalized treatment

    Explainable Artificial Intelligence based Ensemble Machine Learning for Ovarian Cancer Stratification using Electronic Health Records

    The purpose of this study is to show how ensemble learning-driven machine learning algorithms outperform individual machine learning algorithms at predicting ovarian cancer on a biomarker dataset. Additionally, this study provides model explanations using explainable Artificial Intelligence methods. The method involved gathering and combining 49 risk factors from 349 patients. We hypothesize that ensemble machine learning systems are superior to individual machine learning systems in predicting ovarian cancer. The machine learning system consists of five individual and five ensemble machine learning systems, which were trained using K-10 cross-validation protocols. These trained models were then used to predict the development of benign ovarian tumors and ovarian cancer tumors in patients. The AUC and accuracy metrics for ensemble machine learning increased by 19% and 16%, respectively. The MCC and Kappa scores for ensemble machine learning also increased over individual machine learning by 29% and 33%, respectively. As a result, we conclude that ensemble-based algorithms outperform individual machine learning algorithms in terms of ovarian carcinoma prediction

    Analysis of Microarray Data using Machine Learning Techniques on Scalable Platforms

    Microarray-based gene expression profiling has emerged as an efficient technique for the classification, diagnosis, prognosis, and treatment of cancer. Frequent changes in the behavior of this disease generate a huge volume of data. The data retrieved from microarrays cover its veracity and the changes observed over time (velocity). It is, however, a type of high-dimensional data with far more features than samples. Therefore, analyzing high-dimensional microarray datasets in a short period is essential. Such a dataset often contains a huge amount of data, only a fraction of which comprises significantly expressed genes. Identifying the precise and interesting genes that are responsible for causing cancer is imperative in microarray data analysis. Most of the existing schemes employ a two-phase process: feature selection/extraction followed by classification. Our investigation starts with the analysis of microarray data using kernel-based classifiers followed by feature selection using a statistical t-test. In this work, various kernel-based classifiers such as the extreme learning machine (ELM), the relevance vector machine (RVM), and a newly proposed method called the kernel fuzzy inference system (KFIS) are implemented. The proposed models are investigated using three microarray datasets: leukemia, breast cancer, and ovarian cancer. Finally, the performance of these classifiers is measured and compared with that of the support vector machine (SVM). The results reveal that the proposed models classify the datasets efficiently and that their performance is comparable to that of the existing kernel-based classifiers. As the data size increases, handling and processing these datasets becomes a bottleneck. Hence, a distributed and scalable cluster such as Hadoop is needed for storing (HDFS) and processing (MapReduce as well as Spark) the datasets in an efficient way. 
The next contribution in this thesis deals with the implementation of feature selection methods that are able to process the data in a distributed manner. Various statistical tests such as the ANOVA, Kruskal-Wallis, and Friedman tests are implemented using the MapReduce and Spark frameworks, which are executed on top of a Hadoop cluster. The performance of these scalable models is measured and compared with that of the conventional system. The results show that the proposed scalable models are very efficient at processing data of larger dimensions (GBs, TBs, etc.), which cannot be processed with the traditional implementations of those algorithms. After selecting the relevant features, the next contribution of this thesis is the scalable implementation of the proximal support vector machine classifier, which is an efficient variant of SVM. The proposed classifier is implemented on two scalable frameworks, MapReduce and Spark, and executed on the Hadoop cluster. The obtained results are compared with the results obtained using the conventional system. The results show that the scalable cluster is well suited for Big data. Furthermore, it is concluded that Spark is more efficient than both MapReduce and the conventional system for analyzing Big datasets, owing to its intelligent handling of datasets through resilient distributed datasets (RDDs) and in-memory processing. Therefore, the next contribution of the thesis is the implementation of various scalable classifiers based on Spark. In this work, various classifiers such as logistic regression (LR), support vector machine (SVM), naive Bayes (NB), K-nearest neighbor (KNN), artificial neural network (ANN), and radial basis function network (RBFN) with two variants (hybrid and gradient descent learning algorithms) are proposed and implemented using the Spark framework. The proposed scalable models are executed on a Hadoop cluster as well as on the conventional system, and the results are investigated. 
The obtained results show that the scalable algorithms execute far more efficiently than the conventional system when processing Big datasets. The efficacy of the proposed scalable algorithms in handling Big datasets is investigated and compared with that of the conventional system (where data are not distributed but kept on a standalone machine and processed in a traditional manner). The comparative analysis shows that the scalable algorithms process Big datasets far more efficiently on a Hadoop cluster than on the conventional system
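
The t-test-based feature selection mentioned in this abstract can be sketched on a single machine as a per-gene ranking. The sketch assumes Welch's unequal-variance form of the statistic and hypothetical gene names and values; the abstract does not specify which t-test variant or data layout the thesis uses.

```python
import math
from statistics import mean, variance

def t_statistic(group_a, group_b):
    """Welch's t-statistic for one gene measured in two classes (variant is an assumption)."""
    se = math.sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se

def rank_genes(expr_a, expr_b):
    """Rank gene names by |t|, largest first. Each argument maps gene name -> list of values."""
    scores = {g: abs(t_statistic(expr_a[g], expr_b[g])) for g in expr_a}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical expression values for three genes in two classes.
tumor = {"g1": [5.1, 5.3, 4.9, 5.2], "g2": [2.0, 2.4, 1.9, 2.2], "g3": [7.0, 3.0, 5.5, 4.1]}
normal = {"g1": [2.0, 2.2, 1.9, 2.1], "g2": [2.1, 2.3, 2.0, 2.2], "g3": [6.5, 3.2, 5.0, 4.4]}

ranking = rank_genes(tumor, normal)
print(ranking)
```

Because each gene is scored independently of the others, this kind of filter is naturally parallel, which is consistent with the distributed implementations the abstract describes.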

    Gene expression classifiers and out-of-class samples detection

    The proper application of statistics, machine learning, and data-mining techniques in routine clinical diagnostics to classify diseases using their genetic expression profile is still a challenge. One critical issue is the overall inability of most state-of-the-art classifiers to identify out-of-class samples, i.e., samples that do not belong to any of the available classes. This paper shows a possible explanation for this problem and suggests how, by analyzing the distribution of the class probability estimates generated by a classifier, it is possible to build decision rules able to significantly improve its performance
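
A decision rule built on class probability estimates, as this abstract suggests, might look like the following minimal sketch. It assumes the simplest possible rule, a fixed threshold on the highest class probability; the threshold value, function name, and class labels are hypothetical and are not taken from the paper.

```python
def classify_with_rejection(probabilities, min_confidence=0.7):
    """Return the most probable class, or None when no class is confident enough.

    probabilities maps class label -> estimated probability.
    None stands for a possible out-of-class sample.
    """
    label = max(probabilities, key=probabilities.get)
    return label if probabilities[label] >= min_confidence else None

# Hypothetical probability estimates from some upstream classifier.
print(classify_with_rejection({"class_1": 0.92, "class_2": 0.08}))
print(classify_with_rejection({"class_1": 0.55, "class_2": 0.45}))
```

The second call returns `None` because neither class reaches the threshold. In practice the threshold would have to be tuned on the distribution of estimates the classifier actually produces.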

    VARIATIONS IN MICROARRAY BASED GENE EXPRESSION PROFILING: IDENTIFYING SOURCES AND IMPROVING RESULTS

    Two major issues hinder the application of microarray-based gene expression profiling in clinical laboratories as a diagnostic or prognostic tool. The first issue is the sheer volume and high dimensionality of gene expression data from microarray experiments, which require advanced algorithms to extract meaningful gene expression patterns that correlate with biological impact. The second issue is the substantial amount of variation in microarray gene expression data, which impairs the performance of analysis methods and makes sharing or integrating microarray data very difficult. Variations can be introduced by all possible sources, including the DNA microarray technology itself and the experimental procedures. Many of these variations have not been characterized, measured, or linked to their sources. In the first part of this dissertation, a decision tree learning method was demonstrated to perform as well as more popularly accepted classification methods in partitioning cancer samples with microarray data. More importantly, results demonstrate that variation introduced into microarray data by tissue sampling and tissue handling compromised the performance of classification methods. In the second part of this dissertation, variations introduced by the T7-based in vitro transcription labeling methods were investigated in detail. Results demonstrated that individual amplification methods significantly biased gene expression data even though the methods compared in this study were all derivatives of the T7 RNA polymerase-based in vitro transcription labeling approach. The variations observed can be partially explained by the number of biotinylated nucleotides used for labeling and the incubation time of the in vitro transcription experiments. These variations can generate discordant gene expression results even when using the same RNA samples and cannot be corrected by post-experiment analysis, including advanced normalization techniques. 
Studies in this dissertation stress the concept that experimental and analytical methods must work together. This dissertation also emphasizes the importance of standardizing the DNA microarray technology and experimental procedures in order to optimize gene expression analysis and create quality standards compatible with the clinical application of this technology. These findings should be taken into account especially when comparing data from different platforms, and in standardizing protocols for clinical applications in pathology

    Identification of Biomarkers for Esophageal Squamous Cell Carcinoma Using Feature Selection and Decision Tree Methods

    Esophageal squamous cell cancer (ESCC) is one of the most common fatal human cancers. The identification of biomarkers for early detection could be a promising strategy to decrease mortality. Previous studies utilized microarray techniques to identify more than one hundred genes; however, it is desirable to identify a small set of biomarkers for clinical use. This study proposes a sequential forward feature selection algorithm to design decision tree models for discriminating ESCC from normal tissues. Two potential biomarkers, RUVBL1 and CNIH, were identified and validated based on two publicly available microarray datasets. To test the discrimination ability of the two biomarkers, 17 pairs of expression profiles of ESCC and normal tissues from Taiwanese male patients were measured using microarray techniques. The classification accuracies of the two biomarkers in all three datasets were higher than 90%. Interpretable decision tree models were constructed to analyze expression patterns of the two biomarkers. RUVBL1 was consistently overexpressed in all three datasets, although we found inconsistent CNIH expression, possibly affected by the diverse major risk factors for ESCC across different areas
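
Sequential forward feature selection, the search strategy named in this abstract, can be sketched generically as a greedy loop. The scoring function here is a stand-in lookup table of made-up accuracies for placeholder gene names; in the study the score would come from decision tree models, which this sketch does not implement.

```python
def sequential_forward_selection(features, score, max_features):
    """Greedily add the feature that most improves score(subset); stop when nothing helps."""
    selected, best = [], float("-inf")
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda f: score(selected + [f]))
        top_score = score(selected + [top])
        if top_score <= best:
            break  # no remaining candidate improves the current subset
        selected.append(top)
        best = top_score
    return selected

# Hypothetical accuracies for each subset of three placeholder genes (illustration only).
toy_accuracy = {
    ("gene_a",): 0.80, ("gene_b",): 0.70, ("gene_c",): 0.60,
    ("gene_a", "gene_b"): 0.93, ("gene_a", "gene_c"): 0.82, ("gene_b", "gene_c"): 0.75,
    ("gene_a", "gene_b", "gene_c"): 0.93,
}

def toy_score(subset):
    """Look up the made-up accuracy for a subset, independent of order."""
    return toy_accuracy[tuple(sorted(subset))]

chosen = sequential_forward_selection(["gene_a", "gene_b", "gene_c"], toy_score, max_features=3)
print(chosen)
```

The loop stops at two genes here because adding the third does not raise the score, which mirrors the abstract's goal of a small biomarker set.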