125 research outputs found

    Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection

    Lopez-Rincon A, Martinez-Archundia M, Martinez-Ruiz GU, Schönhuth A, Tonda A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC Bioinformatics. 2019;20(1): 480

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Ensemble feature learning of genomic data using support vector machine

    © 2016 Anaissi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Identifying a subset of genes that captures the information needed to distinguish classes of patients is crucial in bioinformatics applications. Ensemble and bagging methods have been shown to work effectively for gene selection and classification. Random forest, which combines random decision trees with bagging to improve overall feature selection and classification accuracy, is a testament to this. Surprisingly, the adoption of these methods in support vector machines has only recently received attention, and mostly for classification rather than gene selection. This paper introduces an ensemble SVM-Recursive Feature Elimination (ESVM-RFE) for gene selection that follows the concepts of ensemble and bagging used in random forest but adopts the backward elimination strategy on which the RFE algorithm is based. The rationale is that building ensemble SVM models from randomly drawn bootstrap samples of the training set produces different feature rankings, which are subsequently aggregated into one feature ranking. As a result, the decision to eliminate features is based on the ranking of multiple SVM models instead of one particular model. Moreover, this approach addresses the problem of imbalanced datasets by constructing a nearly balanced bootstrap sample. Our experiments show that ESVM-RFE for gene selection substantially increased classification performance on five microarray datasets compared to state-of-the-art methods. On the childhood leukaemia dataset, ESVM-RFE achieves on average 9% better accuracy than SVM-RFE and 5% better than the random forest based approach. The genes selected by the ESVM-RFE algorithm were further explored with Singular Value Decomposition (SVD), which reveals significant clusters in the selected data
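
    The aggregation idea in this abstract (balanced bootstrap samples, one feature ranking per model, rankings summed, worst feature eliminated) can be illustrated with a minimal sketch. This is not the authors' implementation: to stay dependency-free, the per-model ranker below is a stand-in that scores each feature by the absolute difference of class means, where ESVM-RFE would use the weights of a trained linear SVM. The function names, the choice to sum ranks, the one-feature-per-round elimination, and the two-class restriction are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Draw sample indices with replacement, the same number from each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    per_class = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.choice(indices) for _ in range(per_class))
    return sample


def mean_difference_scores(X, y, features, sample):
    """Stand-in ranker (assumption): |mean(class a) - mean(class b)| per feature.

    ESVM-RFE would rank features by linear SVM weights instead.
    """
    class_a, class_b = sorted(set(y))  # two-class problems only
    scores = {}
    for f in features:
        values_a = [X[i][f] for i in sample if y[i] == class_a]
        values_b = [X[i][f] for i in sample if y[i] == class_b]
        scores[f] = abs(sum(values_a) / len(values_a) - sum(values_b) / len(values_b))
    return scores


def ensemble_rfe(X, y, n_keep, n_models=25, seed=0):
    """Backward elimination driven by rankings summed over bootstrap models."""
    rng = random.Random(seed)
    features = list(range(len(X[0])))
    while len(features) > n_keep:
        rank_sum = {f: 0 for f in features}
        for _ in range(n_models):
            sample = balanced_bootstrap(y, rng)
            scores = mean_difference_scores(X, y, features, sample)
            ordered = sorted(features, key=lambda f: scores[f], reverse=True)
            for rank, f in enumerate(ordered):  # rank 0 = most informative
                rank_sum[f] += rank
        # Eliminate the feature with the worst aggregated rank.
        features.remove(max(features, key=lambda f: rank_sum[f]))
    return features


# Toy data: feature 0 separates the classes, features 1 and 2 do not.
X = [[0.0, 1.0, 5.0], [0.1, 2.0, 5.0], [0.2, 1.0, 5.1],
     [5.0, 2.0, 5.0], [5.1, 1.0, 5.1], [4.9, 2.0, 5.0],
     [5.2, 1.5, 5.0], [5.0, 1.2, 5.05]]
y = [0, 0, 0, 1, 1, 1, 1, 1]
print(ensemble_rfe(X, y, n_keep=1))
```

    Because elimination depends on the summed ranks rather than on any single model, one unlucky bootstrap sample cannot remove an informative feature on its own.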

    A Machine Learning Framework for Identifying Molecular Biomarkers from Transcriptomic Cancer Data

    Cancer is a complex molecular process driven by abnormal changes in the genome, such as mutation and copy number variation, and by epigenetic aberrations such as dysregulation of long non-coding RNA (lncRNA). These changes are reflected in the transcriptome, where oncogenes are turned on and tumor suppressor genes are turned off; such genes are considered cancer biomarkers. However, transcriptomic data is high dimensional, and finding the best subset of genes (features) related to cancer is computationally challenging and expensive. Thus, developing a feature selection framework to discover molecular biomarkers for cancer is critical. Traditional approaches for biomarker discovery calculate the fold change for each gene by comparing expression profiles between tumor and healthy samples, and so fail to capture the combined effect of the whole gene set. These approaches also do not always investigate how well the discovered biomarkers predict cancer type. In this work, we proposed a machine learning-based framework to address all of the above challenges in discovering lncRNA biomarkers. First, we developed a machine learning pipeline that takes lncRNA expression profiles of cancer samples as input and outputs a small set of key lncRNAs that can accurately predict multiple cancer types. A significant innovation of our work is its ability to identify biomarkers without using healthy samples. However, this initial framework cannot identify cancer-specific lncRNAs. Second, we extended our framework to identify cancer type-specific and subtype-specific lncRNAs. Third, we proposed to use a state-of-the-art deep learning algorithm, the concrete autoencoder (CAE), in an unsupervised setting, where it efficiently identifies a subset of the most informative features. However, because of its stochastic nature, CAE does not identify reproducible features across runs. To address this issue, we proposed a multi-run CAE (mrCAE) that identifies a stable set of features. Our deep learning-based pipeline significantly extended the previous state-of-the-art feature selection techniques. Finally, we showed that the discovered biomarkers are biologically relevant, using a literature review, and prognostically significant, using survival analyses. The discovered novel biomarkers could be used as a screening tool for different cancer diagnoses and as therapeutic targets

    Identification of potential biomarkers to differentially diagnose solid pseudopapillary tumors and pancreatic malignancies via a gene regulatory network

    Additional file 1: In-degree distribution for the GRN. The X-axis represents the in-degree of a node; a node of in-degree x is regulated by a total of x other nodes. The Y-axis represents the total number of network nodes that have an in-degree of x. The red curve is the fit to the power law distribution. (A): In-degree distribution for the sub-GRN in which only miRNAs are included as regulators; the in-degree of each node (miRNAs and protein coding genes) was calculated in this sub-GRN. The in-degree ranges from 0 to 27. (B): In-degree distribution for the sub-GRN in which only TFs are included as regulators
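
    As a rough illustration of how such an in-degree distribution can be tabulated, the sketch below counts, for each in-degree x, how many nodes are regulated by exactly x other nodes, and restricts the network to a chosen set of regulators to mimic the miRNA-only and TF-only sub-GRNs. The edge list, the node names, and both function names are hypothetical, and the power-law fit shown in the figure is not reproduced.

```python
from collections import Counter


def sub_network(edges, regulators):
    """Keep only the edges whose regulator belongs to the given set."""
    return [(reg, target) for reg, target in edges if reg in regulators]


def in_degree_distribution(edges, nodes=None):
    """Map each in-degree x to the number of nodes with that in-degree.

    `edges` holds (regulator, target) pairs. Nodes that only appear as
    regulators, or that are passed in `nodes`, count with in-degree 0.
    """
    in_degree = Counter()
    all_nodes = set(nodes or [])
    for regulator, target in edges:
        in_degree[target] += 1
        all_nodes.update((regulator, target))
    return dict(Counter(in_degree[node] for node in all_nodes))


# Hypothetical toy network; the names are placeholders, not data from the paper.
edges = [
    ("miR-A", "gene1"), ("miR-B", "gene1"),
    ("miR-A", "gene2"), ("TF-X", "gene2"),
    ("TF-X", "miR-A"),
]

full = in_degree_distribution(edges)
print(sorted(full.items()))        # [(0, 2), (1, 1), (2, 2)]

mirna_only = in_degree_distribution(sub_network(edges, {"miR-A", "miR-B"}))
print(sorted(mirna_only.items()))  # [(0, 2), (1, 1), (2, 1)]
```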

    Platelet Diagnostics: A novel liquid biomarker

    The aim of this thesis is to find a novel liquid biomarker for the detection of cancer and to optimize treatment. The first chapter gives an introduction to the oncology biomarker field and focuses on platelets and their role in cancer. In part 1, we evaluate extracellular vesicles (EVs). EVs are small vesicles released by all types of cells, including tumor cells, into the circulation. They carry protein kinases and can be isolated from plasma. We demonstrate that AKT and ERK kinase protein levels in EVs reflect the cellular expression levels and that treatment with kinase inhibitors alters their concentration, depending on the clinical response to the drug. Therefore, EVs may provide a promising biomarker source for monitoring treatment responses. Part 2 starts with reviews describing the function and role of platelets in greater depth. Chapter 3 focusses on thrombocytogenesis and several biological processes in which platelets play a role. Furthermore, the RNA processing machineries harboured by platelets are discussed. Chapters 3 and 4 both evaluate the changes platelets undergo after being exposed to a tumor and its environment. The exchange of biomolecules with tumor cells results in educated platelets, so-called tumor educated platelets (TEPs). TEPs play a role in several hallmarks of cancer and have the ability to respond to systemic alterations, making them an interesting biomarker. In chapter 5 the diagnostic potential of platelets is first discussed. We determine their potential by sequencing the RNA of 283 platelet samples, of which 228 are from patients with cancer and 55 are from healthy controls. We reach an accuracy of 96%. Furthermore, we are able to pinpoint the location of the primary tumor with an accuracy of 71%. In part 3, our developed thromboSeq platform is taken to the next level. Several potential confounding factors are taken into account, such as age and comorbidity. 
We show that particle-swarm optimization (PSO)-enhanced algorithms enable efficient selection of RNA biomarker panels. In a validation cohort we apply these algorithms to non-small-cell lung cancer and reach an accuracy of 88% in late-stage disease (n=518) and 81% in early-stage disease. Finally, in chapter 7 we describe our wet- and dry-lab protocols in detail. The wet-lab protocol covers platelet RNA isolation, mRNA amplification, and preparation for next-generation sequencing. The dry-lab protocol describes the automated pre-processing of FASTQ files into quantified gene counts, quality controls, data normalization and correction, and swarm intelligence-enhanced support vector machine (SVM) algorithm development. Part 4 focuses on central nervous system (CNS) malignancies, especially glioblastoma. Chapter 8 gives an overview of the different liquid biomarkers for diffuse glioma, the most common primary CNS malignancy. In chapter 9 we assess the specificity of platelet education due to glioblastoma by comparing the RNA profile of TEPs from glioblastoma patients with those from patients with a neuroinflammatory disease and from brain metastasis patients. This results in a detection accuracy of 80%. Secondly, analysis of patients with glioblastoma versus healthy controls in an independent validation series provides a detection accuracy of 95%. Furthermore, we describe the potential value of platelets as a monitoring biomarker for patients with glioma, distinguishing pseudoprogression from real tumor progression. In part 5 thromboSeq is applied to breast cancer diagnostics, both as a screening tool in the general population and in a high-risk population, women with BRCA mutations. In chapter 11 we first apply our technique to an inflammatory condition, multiple sclerosis (MS). Platelet RNA is used as input for the development of a diagnostic MS classifier capable of detecting MS with 80% accuracy in the independent validation series. 
In the final part we conclude this thesis with a general discussion of the main findings and suggestions for future research

    Systems Analytics and Integration of Big Omics Data

    A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome

    Microarray-based Multiclass Classification using Relative Expression Analysis

    Microarray gene expression profiling has led to a proliferation of statistical learning methods proposed for a variety of problems related to biological and clinical discoveries. One major problem is to identify gene expression-based biological markers for class discovery and prediction of complex diseases such as cancer. For example, expression patterns of genes are discovered to be associated with phenotypes (e.g., classes of disease) through statistical learning models. Early hopes that well-developed methods such as support vector machines would completely revolutionize the field have been moderated by the difficulties of analyzing microarray data. Hence, new and effective approaches need to be developed to address some common limitations encountered by current methods. This thesis is focused on improving statistical learning on microarray data through rank-based methodologies. The relative expression analysis introduced in Chapter 1 is the central concept for methodological development where the relative expression ordering (i.e., the relative ranks of expression levels) of genes is investigated instead of analyzing the actual expression values of individual genes. Supervised learning problems are studied where classification models are built for differentiating disease states. An unsupervised learning task is also examined in which subclasses are discovered by cluster analysis at the molecular level. Both types of problems under study consist of multiple classes. In Chapter 2, a novel rank-based classifier named Top Scoring Set (TSS) is developed for microarray classification of multiple disease states. It generalizes the Top Scoring Pair (TSP) method for binary classification problems to the multiclass case. Its main advantage lies in the simplicity and power of its decision rule, which provides transparent boundaries and allows for potential biological interpretations. 
Since TSS requires a dimension reduction in the training process, a greedy search algorithm is proposed to perform a fast search over the feature space. In addition, ensemble classification based on TSS is also investigated. In Chapter 3, an efficient and biologically meaningful dimension reduction for the TSS classifier is introduced using the publicly available pathway databases. Pre-defined functional gene groups are analyzed for microarray classification. The pathway-based TSS classifier is validated on an extremely large cohort of leukemia cancer patients. Also, the unsupervised learning ability of relative expression analysis is studied and a rank-based clustering approach is introduced to identify molecularly distinct subtypes of breast cancer patients. Based on the clustering results, the TSP classifier is used for predicting the subtypes of individual breast cancer tumors. These rank-based methods provide an independent validation for the current identification of breast cancer subtypes. Overall, this thesis provides developments and validations of statistical learning methods based on relative expression analysis
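
    To make the idea of relative expression ordering concrete, here is a minimal sketch of the binary Top Scoring Pair (TSP) rule that the thesis generalizes. It follows the rule as it is commonly described: for every gene pair (i, j), estimate how often gene i is expressed below gene j in each class, keep the pair whose frequencies differ most between the two classes, and classify a new sample by checking that single ordering. The function names and the toy data are illustrative assumptions, tie-breaking is simply first pair found, and the multiclass TSS classifier itself is not reproduced here.

```python
from itertools import combinations


def fit_tsp(X, y):
    """Find the gene pair whose ordering differs most between two classes.

    Returns (i, j, class_if_less, class_otherwise): predict `class_if_less`
    when x[i] < x[j], and `class_otherwise` when it is not.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("TSP as sketched here handles exactly two classes")
    class_a, class_b = classes
    rows_a = [row for row, label in zip(X, y) if label == class_a]
    rows_b = [row for row, label in zip(X, y) if label == class_b]

    best = None
    for i, j in combinations(range(len(X[0])), 2):
        # Fraction of samples in each class where gene i ranks below gene j.
        p_a = sum(row[i] < row[j] for row in rows_a) / len(rows_a)
        p_b = sum(row[i] < row[j] for row in rows_b) / len(rows_b)
        score = abs(p_a - p_b)
        if best is None or score > best[0]:
            if p_a > p_b:
                best = (score, i, j, class_a, class_b)
            else:
                best = (score, i, j, class_b, class_a)
    return best[1:]


def predict_tsp(model, x):
    """Classify one sample using only the ordering of the selected pair."""
    i, j, class_if_less, class_otherwise = model
    return class_if_less if x[i] < x[j] else class_otherwise


# Toy expression matrix (rows = samples, columns = genes); values are made up.
X = [[1, 5, 3], [2, 6, 1], [1, 4, 9],
     [7, 2, 3], [8, 1, 0], [9, 3, 5]]
y = ["A", "A", "A", "B", "B", "B"]

model = fit_tsp(X, y)
print(model)                            # (0, 1, 'A', 'B')
print(predict_tsp(model, [3, 10, 0]))   # A
print(predict_tsp(model, [10, 3, 0]))   # B
```

    Because the decision depends only on which of two genes is expressed more highly, the rule is unchanged by any monotone rescaling of a sample's expression values.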