307 research outputs found

    Comparison of Two Output-Coding Strategies for Multi-Class Tumor Classification Using Gene Expression Data and Latent Variable Model as Binary Classifier

    Multi-class cancer classification based on microarray data is described. A generalized output-coding scheme based on One Versus One (OVO) combined with a Latent Variable Model (LVM) is used. Results from the proposed OVO output-coding strategy are compared with those obtained from the generalized One Versus All (OVA) method, and the efficiency of each for multi-class tumor classification is studied. This comparative study used two microarray gene expression datasets: the Global Cancer Map (GCM) dataset and a brain cancer (BC) dataset. Primary feature selection was based on fold change and penalized t-statistics. Evaluation was conducted with varying numbers of features. The OVO coding strategy worked quite well with the BC data, while the OVO and OVA results seemed to be similar for the GCM data. The choice of output-coding method for combining binary classifiers in multi-class tumor classification depends on the number of tumor types considered, the discrepancies between the tumor samples used for training, and the heterogeneity of expression within the cancer subtypes used as training data.
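
The two output-coding strategies compared in this abstract can be illustrated with a minimal sketch. It builds the OVA and OVO code matrices and decodes a vector of binary outputs by counting disagreements. The function names and the ±1 output convention are assumptions made for illustration; the LVM binary classifier itself is not reproduced, and the paper's own decoding rule may differ.

```python
from itertools import combinations


def ova_code_matrix(n_classes):
    """One column per class: +1 for that class, -1 for all others."""
    return [[1 if row == col else -1 for col in range(n_classes)]
            for row in range(n_classes)]


def ovo_code_matrix(n_classes):
    """One column per class pair (i, j): +1 for i, -1 for j, 0 if unused."""
    pairs = list(combinations(range(n_classes), 2))
    return [[1 if row == i else -1 if row == j else 0 for (i, j) in pairs]
            for row in range(n_classes)]


def decode(code_matrix, outputs):
    """Pick the class whose code word disagrees least with the outputs.

    A 0 entry means the binary classifier was not trained on that class,
    so that column is skipped for that class.
    """
    def disagreement(codeword):
        return sum(1 for c, o in zip(codeword, outputs) if c != 0 and c != o)

    return min(range(len(code_matrix)),
               key=lambda k: disagreement(code_matrix[k]))


# Hypothetical outputs of the 6 pairwise classifiers for a 4-class problem,
# in the pair order (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
ovo_outputs = [1, -1, 1, -1, -1, 1]
predicted = decode(ovo_code_matrix(4), ovo_outputs)
```

With K classes, OVA trains K binary classifiers and OVO trains K(K-1)/2, which is one reason the better choice can depend on the number of tumor types.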

    Computational models and approaches for lung cancer diagnosis

    The success of cancer treatment depends on establishing an accurate diagnosis. To this end, the aim of this study is to develop novel lung cancer diagnostic models. New algorithms are proposed to analyse the biological data and extract knowledge that assists in achieving accurate diagnostic results.

    Differential prioritization between relevance and redundancy in correlation-based feature selection techniques for multiclass gene expression data

    BACKGROUND: Due to the large number of genes in a typical microarray dataset, feature selection looks set to play an important role in reducing noise and computational cost in gene expression-based tissue classification while improving accuracy at the same time. Surprisingly, this does not appear to be the case for all multiclass microarray datasets. The reason is that many feature selection techniques applied to microarray datasets are either rank-based, and hence do not take into account correlations between genes, or wrapper-based, which incur high computational cost and often yield difficult-to-reproduce results. In studies where correlations between genes are considered, attempts to establish the merit of the proposed techniques are hampered by evaluation procedures that are less than meticulous, resulting in overly optimistic estimates of accuracy. RESULTS: We present two realistically evaluated correlation-based feature selection techniques which incorporate, in addition to the two existing criteria involved in forming a predictor set (relevance and redundancy), a third criterion called the degree of differential prioritization (DDP). DDP functions as a parameter that strikes the balance between relevance and redundancy, giving our techniques the novel ability to differentially prioritize the optimization of relevance against redundancy (and vice versa). This ability proves useful in producing optimal classification accuracy while using reasonably small predictor set sizes for nine well-known multiclass microarray datasets. CONCLUSION: For multiclass microarray datasets, especially the GCM and NCI60 datasets, DDP enables our filter-based techniques to produce accuracies better than those reported in previous studies which employed similarly realistic evaluation procedures.
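
The abstract does not give the scoring function, so the following is only a sketch of how a single parameter could trade relevance against redundancy in a greedy filter. It assumes a score of the form relevance^alpha × antiredundancy^(1 - alpha), with `alpha` standing in for the DDP. The function name, the precomputed `relevance` list, and the correlation matrix `corr` are hypothetical inputs, not the authors' definitions.

```python
def ddp_select(relevance, corr, size, alpha):
    """Greedy forward selection of `size` gene indices (1 <= size <= n genes).

    relevance: non-negative relevance score per gene (assumed precomputed).
    corr:      symmetric matrix of pairwise gene correlations in [-1, 1].
    alpha:     assumed DDP-like weight in (0, 1]; 1.0 ignores redundancy.
    """
    # Start from the single most relevant gene.
    selected = [max(range(len(relevance)), key=lambda g: relevance[g])]

    while len(selected) < size:
        def score(gene):
            cand = selected + [gene]
            # Mean relevance of the candidate set.
            v = sum(relevance[i] for i in cand) / len(cand)
            # Mean antiredundancy: 1 - |correlation| over all distinct pairs.
            pairs = [(i, j) for i in cand for j in cand if i < j]
            u = sum(1 - abs(corr[i][j]) for i, j in pairs) / len(pairs)
            return (v ** alpha) * (u ** (1 - alpha))

        remaining = [g for g in range(len(relevance)) if g not in selected]
        selected.append(max(remaining, key=score))

    return selected


# Toy data: genes 0 and 1 are both relevant but nearly duplicate each other.
relevance = [0.9, 0.85, 0.5]
corr = [[1.0, 0.95, 0.1],
        [0.95, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
relevance_first = ddp_select(relevance, corr, size=2, alpha=1.0)
redundancy_aware = ddp_select(relevance, corr, size=2, alpha=0.2)
```

In this toy case a high `alpha` keeps the two most relevant but redundant genes, while a low `alpha` swaps the second one for the weaker but uncorrelated gene.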

    Privacy Preserving Data Mining For Horizontally Distributed Medical Data Analysis

    To build reliable prediction models and identify useful patterns, it is increasingly common to assemble data sets from databases maintained by different sources such as hospitals. However, doing so might divulge sensitive information about individuals, which raises privacy concerns and in turn prevents different parties from sharing information. Privacy Preserving Distributed Data Mining (PPDDM) addresses this issue by mining without access to actual data values, so that no information beyond the final result is disclosed. In recent years, a number of state-of-the-art PPDDM approaches have been developed, most of which are based on Secure Multiparty Computation (SMC). SMC incurs high communication cost and requires sophisticated secure computation. In addition, the mining process inevitably slows down as the volume of aggregated data increases. In this work, a new framework named Privacy-Aware Non-linear SVM (PAN-SVM) is proposed to build a PPDDM model from multiple data sources. PAN-SVM employs the Secure Sum Protocol to protect privacy at the bottom layer, and reduces communication and computation complexity via Nyström matrix approximation and eigendecomposition methods at the middle layer. The top layer of PAN-SVM speeds up the whole algorithm for large-scale datasets. Based on the proposed PAN-SVM framework, a Privacy Preserving Multi-class Classifier is built, and experimental results on several benchmark datasets and microarray datasets show that it improves classification accuracy compared with a regular SVM. In addition, two Privacy Preserving Feature Selection methods based on PAN-SVM are proposed and tested on benchmark data and real-world data. PAN-SVM does not depend on a trusted third party; all participants collaborate equally. The experimental results show that PAN-SVM can not only effectively solve the problem of collaborative privacy-preserving data mining by building non-linear classification rules, but also significantly improve the performance of the resulting classifiers.
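
The Secure Sum Protocol named in this abstract is commonly described as a ring protocol: the initiating party masks its value with a random number, each party adds its own value modulo a public constant, and the initiator removes the mask at the end. The sketch below simulates that textbook version in a single process. How PAN-SVM applies it is not stated in the abstract, and the function name, modulus, and example counts are assumptions.

```python
import secrets


def secure_sum(private_values, modulus=2**61 - 1):
    """Simulate a ring-based secure sum over non-negative integers.

    Assumes honest, non-colluding parties and a true total below `modulus`.
    In a real deployment each loop step would be a network message; every
    party only ever sees a masked running total.
    """
    # Party 0 draws a random mask that hides its own contribution.
    mask = secrets.randbelow(modulus)
    running = (mask + private_values[0]) % modulus

    # Each remaining party adds its private value and passes the total on.
    for value in private_values[1:]:
        running = (running + value) % modulus

    # Back at party 0: remove the mask to reveal only the final sum.
    return (running - mask) % modulus


# Hypothetical per-hospital counts that should not be revealed individually.
total = secure_sum([12, 30, 7])
```

Because the mask is uniform over the modulus, an intermediate running total on its own tells a party nothing about the values added before it.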

    A novel computational framework for fast, distributed computing and knowledge integration for microarray gene expression data analysis

    The healthcare burden and suffering due to life-threatening diseases such as cancer would be significantly reduced by the design and refinement of computational interpretation of micro-molecular data collected by bioinformaticians. Rapid technological advancements in the field of microarray analysis, an important component in the design of in-silico molecular medicine methods, have generated enormous amounts of such data, a trend that has been increasing exponentially over the last few years. However, the analysis and handling of these data have become one of the major bottlenecks in the utilization of the technology. The rate of collection of these data has far surpassed our ability to analyze them for novel, non-trivial, and important knowledge. High-performance computing platforms, and algorithms that utilize their embedded computing capacity, have emerged as a leading technology for handling such data-intensive knowledge discovery applications. In this dissertation, we present a novel framework to achieve fast, robust, and accurate (biologically significant) multi-class classification of gene expression data using distributed knowledge discovery and integration routines, specifically for cancer genomics applications. The research presents a unique computational paradigm for the rapid, accurate, and efficient selection of relevant marker genes, while providing parametric controls to ensure flexibility of its application. The proposed paradigm consists of the following key computational steps: (a) preprocess and normalize the gene expression data; (b) discretize the data for knowledge mining; (c) partition the data using two proposed methods, partitioning with overlapped windows and adaptive selection; (d) perform knowledge discovery on the partitioned data spaces for association rule discovery; (e) integrate association rules from the partitioned data and knowledge spaces on distributed processor nodes using a novel knowledge integration algorithm; and (f) perform post-analysis and functional elucidation of the discovered gene rule sets. The framework is implemented in a shared-memory multiprocessor supercomputing environment, and several experimental results are presented to evaluate the algorithms. We conclude with a functional interpretation of the computational discovery routines for enhanced biological discovery from cancer genomics datasets, and suggest some directions for future research.
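
Step (c) mentions partitioning with overlapped windows without giving details. A generic sliding-window partition, in which consecutive windows share a fixed number of items, might look like the sketch below. The function name and the `window` and `overlap` parameters are hypothetical, and the dissertation's actual scheme may differ.

```python
def overlapped_windows(items, window, overlap):
    """Split `items` into consecutive windows that share `overlap` items.

    The last window may be shorter than `window` if the items run out.
    """
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")
    if not items:
        return []

    step = window - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(items[start:start + window])
        if start + window >= len(items):
            break  # this window already reaches the end
        start += step
    return windows


# Ten hypothetical gene indices, windows of 4 genes sharing 2 with the next.
partitions = overlapped_windows(list(range(10)), window=4, overlap=2)
```

Each partition could then be mined independently on its own processor node, with the overlap giving adjacent partitions some shared genes to link their rules.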

    Identifying genomic signatures for predicting breast cancer outcomes

    Predicting the risk of recurrence in breast cancer patients is a critical task in clinics. Recent developments in DNA microarrays have fostered tremendous advances in molecular diagnosis and prognosis of breast cancer. The first part of our study was based on a novel approach that considers the level of genomic instability as one of the most powerful predictors of clinical outcome. A systematic technique was presented to explore whether there is a linkage between the degree of genomic instability, gene expression patterns, and clinical outcomes by considering the following hypotheses: first, the degree of genomic instability is reflected by an aneuploidy-specific gene signature; second, this signature is robust and allows prediction of clinical outcomes in breast cancer. The first hypothesis was tested by gene expression profiling of 48 breast tumors with varying degrees of genomic instability. A supervised machine learning approach combining several feature selection algorithms was used to identify a 12-gene genomic instability signature from a set of 7657 genes. The second hypothesis was tested by performing patient stratification on published breast cancer datasets using the genomic instability signature. The results showed that patients with genomically stable breast carcinomas had considerably longer disease-free survival times than those with genomically unstable tumors. The gene signature generated significant patient stratification with distinct relapse-free and overall survival (log-rank tests; p < 0.05; n = 469). It was independent of clinical-pathological parameters and provided additional prognostic information within the sub-groups defined by each of them. The second part of the study addressed the importance of selecting patients at high risk of recurrence for more aggressive therapy, given that breast cancer patients with advanced stages receive chemotherapy but only half of them benefit from it. The FDA recently approved the first gene test for cancer, MammaPrint, for node-negative primary breast cancer. Oncotype DX is a commercially available gene test for tamoxifen-treated, node-negative, estrogen receptor-positive breast cancer. These signatures are specific to early-stage breast cancers. A population-based approach to the molecular prognosis of breast cancer is needed for more rational therapy for breast cancer patients. A 28-gene expression signature was identified in our previous study using a population-based approach. Using this signature, a patient-stratification scheme was developed by employing the nearest centroid classification algorithm. It generated a significant stratification with distinct relapse-free survival (log-rank tests; p < 0.05; n = 1337) and overall survival (log-rank tests; p < 0.05; n = 806), based on transcriptional profiles produced on a diverse range of microarray platforms. This molecular classification scheme could enable physicians to make treatment decisions based on specific characteristics of patients and their tumors, rather than population statistics. It could further refine subgroups defined by traditional clinical-pathological parameters into prognostic risk groups. It was unclear whether a common gene set could predict a poor outcome in breast and ovarian cancer, the most common malignancies in women. The 28-gene signature generated significant prognostic categorization in ovarian cancers (log-rank tests; p < 0.0001; n = 124), thus confirming the clinical applicability of the gene signature for predicting breast and ovarian cancer recurrence.
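
The nearest centroid classification algorithm mentioned in this abstract can be sketched in a few lines: each risk group is summarized by the mean expression profile of its training samples, and a new patient is assigned to the group with the closest centroid. The sketch assumes Euclidean distance and uses made-up two-gene profiles; the study's distance measure, preprocessing, and 28-gene data are not reproduced here.

```python
import math


def fit_centroids(samples, labels):
    """Return {label: mean profile} computed from the training samples."""
    sums, counts = {}, {}
    for profile, label in zip(samples, labels):
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], profile)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, profile):
    """Assign the label whose centroid is nearest (Euclidean, an assumption)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], profile))


# Hypothetical expression profiles (two genes) with known risk groups.
train_profiles = [[1.0, 0.2], [0.8, 0.0], [-1.0, 0.1], [-0.9, -0.1]]
train_labels = ["high", "high", "low", "low"]

centroids = fit_centroids(train_profiles, train_labels)
risk_group = predict(centroids, [0.7, 0.1])
```

The resulting groups would then be compared with a survival analysis such as the log-rank tests reported in the abstract.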