262 research outputs found

    Unconventional machine learning of genome-wide human cancer data

    Recent advances in high-throughput genomic technologies, coupled with exponential increases in computer processing and memory, have allowed us to interrogate the complex aberrant molecular underpinnings of human disease from a genome-wide perspective. While the deluge of genomic information is expected to increase, a bottleneck in conventional high-performance computing is rapidly approaching. Inspired in part by recent advances in physical quantum processors, we evaluated several unconventional machine learning (ML) strategies on actual human tumor data. Here we show for the first time the efficacy of multiple annealing-based ML algorithms for classification of high-dimensional, multi-omics human cancer data from the Cancer Genome Atlas. To assess algorithm performance, we compared these classifiers to a variety of standard ML methods. Our results indicate the feasibility of using annealing-based ML to provide competitive classification of human cancer types and associated molecular subtypes, and superior performance with smaller training datasets, thus providing compelling empirical evidence for the potential future application of unconventional computing architectures in the biomedical sciences.

    Leukemia multiclass assessment and classification from Microarray and RNA-seq technologies integration at gene expression level

    In recent years, the number of available biological experiments has increased significantly owing to the widespread use of massive sequencing. Furthermore, continuous developments in machine learning and high-performance computing are allowing faster and more efficient analysis and processing of this type of data. However, biological information about a given disease is normally scattered, because different sequencing technologies from different manufacturers have been used in different experiments around the world over the years. Thus, it is now of paramount importance to integrate biologically related data correctly in order to obtain genuine benefits from them. For this purpose, this work presents an integration of multiple Microarray and RNA-seq platforms, which has led to the design of a multiclass study by collecting samples from the four main types of leukemia, quantified at the gene expression level. Subsequently, in order to find a set of differentially expressed genes with the highest discernment capability among different types of leukemia, an innovative parameter referred to as coverage is presented here. This parameter assesses the number of different pathologies that a given gene is able to discern. It has been evaluated together with other widely known parameters using an ANOVA statistical test, which corroborated its filtering power when the identified genes are subjected to a machine learning process at the multiclass level. The optimal tuning of the gene extraction parameters by means of this statistical test led to the selection of 42 highly relevant expressed genes. Using the minimum-Redundancy Maximum-Relevance (mRMR) feature selection algorithm, these genes were reordered and assessed with four different classification techniques. Outstanding results were achieved by considering exclusively the first ten genes of the ranking. Finally, specific literature was consulted on this last subset of genes, revealing that practically all of them are associated with biological processes related to leukemia. In light of these results, this study underlines the relevance of considering a new parameter that facilitates the identification of highly valid expressed genes for simultaneously discerning multiple types of leukemia. This work was supported by Project TIN2015-71873-R (Spanish Ministry of Economy and Competitiveness -MINECO- and the European Regional Development Fund -ERDF) and Junta de Andalucía (P12-TIC-2082).

    Multiplatform biomarker identification using a data-driven approach enables single-sample classification

    Background: High-throughput gene expression profiles have allowed the discovery of potential biomarkers enabling early diagnosis, prognosis and the development of individualized treatment. However, it remains a challenge to identify a set of reliable and reproducible biomarkers across various gene expression platforms and laboratories for single-sample diagnosis and prognosis. We address this need with our Data-Driven Reference (DDR) approach, which employs stably expressed housekeeping genes as references to eliminate platform-specific biases and non-biological variabilities. Results: Our method identifies biomarkers with “built-in” features, and these features can be interpreted consistently regardless of profiling technology, which enables classification of single samples independent of platform. Validation with RNA-seq data of blood platelets shows that DDR achieves superior performance in classification of six different tumor types as well as molecular target statuses (such as MET or HER2-positive, and mutant KRAS, EGFR or PIK3CA) with smaller sets of biomarkers. We demonstrate on three microarray datasets that our method is capable of identifying robust biomarkers for subgrouping medulloblastoma samples despite data perturbation due to different microarray platforms. In addition to identifying the majority of subgroup-specific biomarkers in the nanoString CodeSet, our method detected some potential new biomarkers for subgrouping medulloblastoma. Conclusions: In this study, we present a simple yet powerful data-driven method that contributes significantly to the identification of robust cross-platform gene signatures for single-patient disease classification to facilitate precision medicine. In addition, our method provides a new strategy for transcriptome analysis.

    Development of a simple artificial intelligence method to accurately subtype breast cancers based on gene expression barcodes

    Magister Scientiae - MSc. INTRODUCTION: Breast cancer is a highly heterogeneous disease. The complexity of achieving an accurate diagnosis and an effective treatment regimen lies within this heterogeneity. Subtypes of the disease are not simply molecular, i.e. hormone receptor over-expression or absence; the tumour itself is heterogeneous in terms of tissue of origin, metastases, and histopathological variability. Accurate tumour classification vastly improves treatment decisions, patient outcomes and 5-year survival rates. Gene expression studies based on transcriptomic technologies such as microarrays and next-generation sequencing (e.g. RNA-Sequencing) have aided oncology researchers' and clinicians' understanding of the complex molecular portraits of malignant breast tumours. Mechanisms governing cancers, which include tumorigenesis, gene fusions, gene over-expression and suppression, and cellular process and pathway involvement, have been elucidated through comprehensive analyses of the cancer transcriptome. Over the past 20 years, gene expression signatures discovered with both microarray and RNA-Seq have reached clinical and commercial application through the development of tests such as MammaPrint®, OncotypeDX®, and FoundationOne® CDx, all of which focus on chemotherapy sensitivity, prediction of cancer recurrence, and tumour mutational level. The Gene Expression Barcode (GExB) algorithm was developed to allow for easy interpretation and integration of microarray data through data normalization with frozen RMA (fRMA) preprocessing and conversion of relative gene expression to a sequence of 1's and 0's. Unfortunately, the algorithm has not yet been developed for RNA-Seq data. However, implementation of the GExB with feature selection would contribute to a robust machine-learning-based classifier of breast cancer and its subtypes.
METHODOLOGY: For microarray data, we applied the GExB algorithm to generate barcodes for normal breast and breast tumour samples. A two-class classifier for malignancy was developed through feature selection on barcoded samples by selecting for genes with 85% stable absence or presence within a tissue type, and differentially stable between tissues. A multi-class feature-selection method was employed to identify genes with variable expression in one subtype, but 80% stable absence or presence in all other subtypes, i.e. 80% in n-1 subtypes. For RNA-Seq data, a barcoding method needed to be developed which could mimic the GExB algorithm for microarray data. A z-score-to-barcode method was implemented, together with differential gene expression analysis in which the top 100 genes were selected as informative features for classification purposes. The accuracy and discriminatory capability of both the microarray-based and the RNA-Seq-based gene signatures were assessed through unsupervised and supervised machine-learning algorithms, i.e., K-means and hierarchical clustering, as well as binary and multi-class Support Vector Machine (SVM) implementations.
RESULTS: The GExB-FS method for microarray data yielded an 85-probe and a 346-probe informative set for the two-class and multi-class classifiers, respectively. The two-class classifier predicted samples as either normal or malignant with 100% accuracy, and the multi-class classifier predicted molecular subtype with 96.5% accuracy with SVM. Combining RNA-Seq DE analysis for feature selection with the z-score-to-barcode method resulted in a two-class classifier for malignancy, and a multi-class classifier for normal-from-healthy, normal-adjacent-tumour (from cancer patients), and breast tumour samples, with 100% accuracy. Most notably, a normal-adjacent-tumour gene expression signature emerged, which differentiated it from normal breast tissues in healthy individuals.
CONCLUSION: A potentially novel method for microarray and RNA-Seq data transformation, feature selection and classifier development was established. The universal application of the microarray signatures and the validity of the z-score-to-barcode method were demonstrated by 95% accurate classification of RNA-Seq barcoded samples with a microarray-discovered gene expression signature. The results from this comprehensive study into the discovery of robust gene expression signatures hold immense potential for further R&D towards implementation at the clinical endpoint, and translation to simpler and cost-effective laboratory methods such as qtPCR-based tests.
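
The barcoding and stability-based feature selection described in the abstract above can be illustrated with a minimal sketch. It is not the thesis's implementation. The function names, the z-score cut-off of 0, the use of the population standard deviation, and the toy data are all assumptions made for illustration. Only the 85% stability fraction comes from the abstract.

```python
from statistics import mean, pstdev


def zscore_barcode(expr, threshold=0.0):
    """Convert expression values to a 0/1 barcode per gene.

    expr maps a gene name to its expression values across samples.
    A value becomes 1 when its per-gene z-score exceeds `threshold`
    (the cut-off is an assumed parameter, not taken from the source).
    """
    barcodes = {}
    for gene, values in expr.items():
        mu = mean(values)
        sigma = pstdev(values)
        if sigma == 0:
            # No variation across samples: treat the gene as absent everywhere.
            barcodes[gene] = [0] * len(values)
            continue
        barcodes[gene] = [1 if (v - mu) / sigma > threshold else 0 for v in values]
    return barcodes


def stable_genes(barcodes, labels, min_fraction=0.85):
    """Select genes that are stably present or absent within every tissue
    type and whose stable state differs between at least two tissue types."""
    tissues = sorted(set(labels))
    selected = []
    for gene, bits in barcodes.items():
        states = []
        for tissue in tissues:
            group = [b for b, lab in zip(bits, labels) if lab == tissue]
            present = sum(group) / len(group)
            if present >= min_fraction:
                states.append(1)      # stably present in this tissue
            elif 1 - present >= min_fraction:
                states.append(0)      # stably absent in this tissue
            else:
                states.append(None)   # not stable in this tissue
        if None not in states and len(set(states)) > 1:
            selected.append(gene)
    return selected


# Toy data (hypothetical): three normal and three tumour samples.
expr = {
    "G1": [1, 1, 1, 9, 9, 9],   # low in normal, high in tumour
    "G2": [5, 5, 5, 5, 5, 5],   # constant
    "G3": [1, 9, 1, 9, 1, 9],   # unstable within each tissue
}
labels = ["normal"] * 3 + ["tumour"] * 3
barcodes = zscore_barcode(expr)
informative = stable_genes(barcodes, labels)
```

In this toy example only `G1` is stable within each tissue and differs between tissues, so it is the only gene retained. A real pipeline would then pass the selected genes to a classifier such as the SVM mentioned in the abstract.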

    Bioinformatics applied to human genomics and proteomics: development of algorithms and methods for the discovery of molecular signatures derived from omic data and for the construction of co-expression and interaction networks

    [EN] The present PhD dissertation develops and applies bioinformatic methods and tools to address key current problems in the analysis of human omic data. The dissertation is organised by main objectives into four chapters focused on: (i) development of an algorithm for the analysis of changes and heterogeneity in large-scale omic data; (ii) development of a method for non-parametric feature selection; (iii) integration and analysis of human protein-protein interaction networks; and (iv) integration and analysis of human co-expression networks derived from tissue expression data and evolutionary profiles of proteins. In the first chapter, we developed and tested a new robust algorithm in R, called DECO, for the discovery of subgroups of features and samples within large-scale omic datasets, exploring all possible feature differences and heterogeneity through the integration of both data dispersion and predictor-response information in a new statistical parameter called h (heterogeneity score). In the second chapter, we present a simple non-parametric statistic to measure the cohesiveness of categorical variables along any quantitative variable, applicable to feature selection in all types of big data sets. In the third chapter, we describe an analysis of the human interactome integrating two global datasets from high-quality proteomics technologies: HuRI (a human protein-protein interaction network generated by a systematic experimental screening based on Yeast-Two-Hybrid technology) and Cell-Atlas (a comprehensive map of subcellular localization of human proteins generated by antibody imaging). This analysis aims to create a framework for characterizing subcellular localization supported by the human protein-protein interactome. In the fourth chapter, we developed a full integration of three high-quality proteome-wide resources (Human Protein Atlas, OMA and TimeTree) to generate a robust human co-expression network across tissues, assigning each human protein a position along the evolutionary timeline. In this way, we investigate how old in evolution and how correlated the different human proteins are, and we place them all in a common interaction network. Overall, the work presented in this PhD uses and develops a wide variety of bioinformatic and statistical tools for the analysis, integration and elucidation of molecular signatures and biological networks using human omic data. Most of these data correspond to sample cohorts generated in recent biomedical studies on specific human diseases.

    Differential expression of long non-coding RNAs are related to proliferation and histological diversity in follicular lymphomas

    "This is the peer reviewed version of the following article: Roisman, Alejandro, et al. "Differential expression of long non-coding RNAs are related to proliferation and histological diversity in follicular lymphomas." British journal of haematology (2018), which has been published in final form at https://doi.org/10.1111/bjh.15656. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Self-Archiving." Long non-coding RNAs (lncRNAs) comprise a family of non-coding transcripts that are emerging as relevant gene expression regulators of different processes, including tumour development. To determine the possible contribution of lncRNAs to the pathogenesis of follicular lymphoma (FL), we performed high-depth RNA sequencing in primary FL samples ranging from grade 1-3A to aggressive grade 3B variants, using unpurified (n = 16) and purified (n = 12) tumour cell suspensions from nodal samples. FL grade 3B had a significantly higher number of differentially expressed lncRNAs (dif-lncRNAs) with potential target coding genes related to cell cycle regulation. Nine out of the 18 selected dif-lncRNAs were validated by quantitative real-time polymerase chain reaction in an independent series (n = 43) of FL. RP4-694A7.2 was identified as the top deregulated lncRNA potentially involved in cell proliferation. RP4-694A7.2 silencing in the WSU-FSCCL FL cell line reduced cell proliferation due to a block in the G1/S phase. The relationship between RP4-694A7.2 and proliferation was confirmed in primary samples, as its expression levels were positively related to the Ki-67 proliferation index. In summary, lncRNAs are differentially expressed across the clinico-biological spectrum of FL, and a subset of them, related to cell cycle, may participate in cell proliferation regulation in these tumours. Peer reviewed. Postprint (author's final draft).

