58 research outputs found

    En-PaFlower: An Ensemble Approach using PSO and Flower Pollination Algorithm for Cancer Diagnosis

    Machine learning is now used across many sectors and can provide consistently precise predictions. A machine learning system learns effectively because its training dataset contains examples of previously completed tasks. Researchers have shown that, once they have learned how to process the necessary data, machine learning algorithms can carry out the whole task autonomously. In recent years, cancer has become a major cause of the worldwide increase in mortality. Early detection of cancer improves the chance of a complete recovery, and machine learning (ML) plays a significant role here. Microarray datasets for cancer diagnosis and prognosis are available alongside biopsy datasets. Microarray data is important for diagnosing and classifying cancers, but it is massive in volume, and analyzing such large datasets can be challenging. Feature selection is therefore crucial: it chooses the relevant features that help build a more precise classification model. More accurate disease classification in turn aids disease prevention. This work aims to synthesize existing knowledge on cancer diagnosis using machine learning techniques into a compact report. It also proposes an ensemble-based machine learning model, En-PaFlower, which uses Particle Swarm Optimization (PSO) as the feature selection algorithm and the Flower Pollination Algorithm (FPA) as the optimization algorithm, combined by majority voting. Finally, the performance of the proposed algorithm is evaluated on three different cancer datasets, with accuracy, precision, recall, specificity, and F1 score among the evaluation metrics. The empirical analysis shows that the proposed methodology achieves a highest accuracy of 95.65%.
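
The majority-voting step and the evaluation metrics named in this abstract can be illustrated with a minimal sketch. It is not the authors' En-PaFlower implementation: it does not perform PSO feature selection or FPA optimization, and the function names, the three sets of base-classifier predictions, and the labels are all hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label lists from several models by hard majority voting."""
    combined = []
    for votes in zip(*predictions):  # one tuple of votes per sample
        # most_common(1) yields the most frequent label; with an odd number
        # of binary classifiers there can be no tie
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the confusion-matrix metrics listed in the abstract."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical predictions from three base classifiers on five samples
model_predictions = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
]
y_true = [1, 0, 1, 0, 0]

y_pred = majority_vote(model_predictions)
metrics = binary_metrics(y_true, y_pred)
print(y_pred)
print(metrics)
```

Using an odd number of base classifiers avoids ties in the binary case; a real pipeline would need an explicit tie-breaking rule for multi-class problems.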

    Statistical Methods in Cancer Genomics.

    Genomic and proteomic experiments have become widely applied in cancer profiling studies over the past decade. The genomics era is marked by the success of using DNA microarrays to delineate genome-scale gene expression patterns to pinpoint disease mechanism at the molecular level. An increasing number of studies have profiled tumor specimens using distinct microarray platforms and analysis techniques. With the accumulating amount of microarray data, integrative analysis has the potential to identify common gene expression patterns across data sets and tissue types. In this proposal, I introduce a Bayesian mixture model-based approach for meta-analysis of microarray studies. A probabilistic measure of gene differential expression is used as a scaleless quantity for an integrative analysis of DNA microarray data sets across platforms and laboratories. The role of DNA microarrays has been primarily on the discovery side to screen through thousands of genes for potential disease biomarkers. In this respect, Tissue Microarrays (TMAs) have provided a proteomic platform for downstream validation studies of these target discoveries. The other part of this proposal concerns an implementation of measurement error models for patient survival outcome analysis using TMA expression data. Two goals are explored: 1) in a two-stage approach, a Latent Expression Index (LEI) is introduced as a summary index for the TMA repeated expression measures; 2) a joint model of survival and TMA expression data is established via a shared random effect. Bayesian estimation is carried out using a Markov Chain Monte Carlo (MCMC) method. As an extension to the measurement error models, I further propose a Cell Mixture model to allow a wider range of inferences for TMA expression data. Ph.D., Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/57619/2/rlshen_1.pd

    A Multivariate Framework for Variable Selection and Identification of Biomarkers in High-Dimensional Omics Data

    In this thesis, we address the identification of biomarkers in high-dimensional omics data. The identification of valid biomarkers is especially relevant for personalized medicine that depends on accurate prediction rules. Moreover, biomarkers elucidate the provenance of disease, or molecular changes related to disease. From a statistical point of view the identification of biomarkers is best cast as variable selection. In particular, we refer to variables as the molecular attributes under investigation, e.g. genes, genetic variation, or metabolites; and we refer to observations as the specific samples whose attributes we investigate, e.g. patients and controls. Variable selection in high-dimensional omics data is a complicated challenge due to the characteristic structure of omics data. For one, omics data is high-dimensional, comprising cellular information in unprecedented detail. Moreover, there is an intricate correlation structure among the variables due to, e.g., internal cellular regulation or external, latent factors. Variable selection for uncorrelated data is well established. In contrast, there is no consensus on how to approach variable selection under correlation. Here, we introduce a multivariate framework for variable selection that explicitly accounts for the correlation among markers. In particular, we present two novel quantities for variable importance: the correlation-adjusted t (CAT) score for classification, and the correlation-adjusted (marginal) correlation (CAR) score for regression. The CAT score is defined as the Mahalanobis-decorrelated t-score vector, and the CAR score as the Mahalanobis-decorrelated correlation between the predictor variables and the outcome. We derive the CAT and CAR scores from a predictive point of view in linear discriminant analysis and regression; both quantities assess the weight of a decorrelated and standardized variable on the prediction rule. 
Furthermore, we discuss properties of both scores and relations to established quantities. Above all, the CAT score decomposes Hotelling’s T² and the CAR score the proportion of variance explained. Notably, the decomposition of total variance into explained and unexplained variance in the linear model can be rewritten in terms of CAR scores. To render our approach applicable to high-dimensional omics data we devise an efficient algorithm for shrinkage estimates of the CAT and CAR scores. Subsequently, we conduct extensive simulation studies to investigate the performance of our novel approaches in ranking and prediction under correlation. Here, CAT and CAR scores consistently improve over marginal approaches in terms of more true positives selected and a lower model error. Finally, we illustrate the application of the CAT and CAR scores on real omics data. In particular, we analyze genomics, transcriptomics, and metabolomics data. We ascertain that the CAT and CAR scores are competitive with, or outperform, state-of-the-art techniques in terms of true positives detected and prediction error.
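
The two definitions and the two decompositions described in this abstract can be written compactly as follows. The notation is an assumption, since the abstract gives no symbols: P denotes the correlation matrix among the variables, tau the vector of t-scores, and P_{XY} the vector of marginal correlations between the predictors and the outcome.

```latex
% CAT and CAR scores as Mahalanobis-decorrelated quantities (notation assumed)
\tau^{\mathrm{adj}} = P^{-1/2}\,\tau ,
\qquad
\omega = P^{-1/2}\,P_{XY}

% Decompositions stated in the abstract:
% squared CAT scores sum to Hotelling's T^2,
% squared CAR scores sum to the proportion of variance explained
T^2 = \sum_{i} \bigl(\tau^{\mathrm{adj}}_i\bigr)^2 ,
\qquad
R^2 = \sum_{i} \omega_i^2
```

When the variables are uncorrelated, P is the identity matrix and both scores reduce to their marginal counterparts, the ordinary t-scores and marginal correlations.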

    Hierarchical gene selection and genetic fuzzy system for cancer microarray data classification

    This paper introduces a novel approach to gene selection based on a substantial modification of the analytic hierarchy process (AHP). The modified AHP systematically integrates the outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods, namely the t-test, entropy, the receiver operating characteristic (ROC) curve, Wilcoxon, and signal-to-noise ratio, are employed to rank genes. These ranked genes are then used as inputs for the modified AHP. Additionally, a method that uses the fuzzy standard additive model (FSAM) for cancer classification based on genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. A genetic algorithm (GA) is incorporated between the unsupervised and supervised training stages to optimize the number of fuzzy rules. The integration of the GA enables FSAM to deal with the high-dimensional, low-sample nature of microarray data and thus enhances the efficiency of the classification. Experiments are carried out on numerous microarray datasets. Results show that AHP-based gene selection outperforms the single ranking methods. Furthermore, the AHP-FSAM combination achieves high accuracy in microarray data classification compared to various competing classifiers. The proposed approach is therefore useful for medical practitioners and clinicians as a decision support system that can be implemented in real medical practice.
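
The idea of combining several filter rankings into one gene ranking can be sketched as below. This is not the modified AHP described in the abstract: a plain average of ranks is swapped in as a stand-in, only two of the five criteria are shown (a Welch-style t-statistic and the signal-to-noise ratio), and the gene names and expression values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(a, b):
    """Absolute difference of class means divided by the sum of class standard deviations."""
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b))


def welch_t(a, b):
    """Absolute Welch t-statistic for two groups with unequal variances."""
    standard_error = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / standard_error


def ranks(scores):
    """Map each gene to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: position + 1 for position, gene in enumerate(ordered)}


def aggregate_ranking(class_a, class_b, criteria):
    """Order genes by their average rank over all criteria (stand-in for AHP)."""
    genes = list(class_a)
    per_criterion = [
        ranks({g: criterion(class_a[g], class_b[g]) for g in genes})
        for criterion in criteria
    ]
    average_rank = {g: mean(r[g] for r in per_criterion) for g in genes}
    return sorted(genes, key=average_rank.get)


# Hypothetical expression values for three genes in two classes
tumor = {
    "g1": [5.0, 5.2, 4.9, 5.1],
    "g2": [2.0, 2.1, 1.9, 2.2],
    "g3": [3.0, 3.5, 2.5, 3.2],
}
normal = {
    "g1": [1.0, 1.1, 0.9, 1.2],
    "g2": [2.0, 2.2, 1.8, 2.1],
    "g3": [2.0, 2.6, 1.9, 2.4],
}

ranking = aggregate_ranking(tumor, normal, [welch_t, signal_to_noise])
print(ranking)
```

Averaging ranks treats every criterion as equally important; the point of an AHP-style scheme is to weight the criteria instead, which this sketch does not attempt.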

    Bayesian classification and survival analysis with curve predictors

    We propose classification models for binary and multicategory data where the predictor is a random function. The functional predictor could be irregularly and sparsely sampled, or characterized by high dimension and sharp localized changes. In the former case, we employ Bayesian modeling with a flexible spline basis, which is widely used for functional regression. In the latter case, we use Bayesian modeling with wavelet basis functions, which have good approximation properties over a large class of functional spaces and can accommodate a variety of functional forms observed in real-life applications. We develop a unified hierarchical model which accommodates both the adaptive spline- or wavelet-based function estimation model and the logistic classification model. These two models are coupled together to borrow strength from each other in this unified hierarchical framework. The use of Gibbs sampling with conjugate priors for posterior inference makes the method computationally feasible. We compare the performance of the proposed models with naive models as well as existing alternatives by analyzing simulated and real data. We also propose a Bayesian unified hierarchical model based on a proportional hazards model and a generalized linear model for survival analysis with irregular longitudinal covariates. This relatively simple joint model has two advantages. One is that using a spline basis simplifies the parameterization while still capturing a flexible non-linear pattern of the function. The other is that the joint modeling framework allows information to be shared between the regression of functional predictors and the proportional hazards modeling of survival data, improving the efficiency of estimation. The novel method can be used not only with a single functional predictor but also with multiple functional predictors. Our methods are applied to real data sets and compared with a parameterized regression method.

    Diseño de sistemas neurocomputacionales en el ámbito de la Biomedicina [Design of neurocomputational systems in the field of biomedicine]

    Biomedicine is a broad field in which public bodies in every country have invested, and continue to invest, large amounts of research funding through national, European and international projects. The scientific and technological advances of the last fifteen years have deepened our understanding of the genetic and molecular basis of diseases such as cancer, and have made it possible to analyze how individual patients respond differently to different oncological treatments, laying the foundations of what is now known as personalized medicine. This can be defined as the design and application of prevention, diagnosis and treatment strategies adapted to a scenario that integrates the genetic, clinical, histopathological and immunohistochemical profile of each patient and pathology. Given the incidence of cancer in society, and although research has traditionally focused on diagnosis, researchers' interest in studying the prognosis of the disease is relatively recent, and forms part of the growing trend of national public health systems towards a model of personalized and predictive medicine. Prognosis can be defined as prior knowledge of an event before its possible occurrence, and can address susceptibility, survival and recurrence of the disease. The literature includes works that use neurocomputational models to predict very specific cases, such as recurrence in operable breast cancer, based on clinical and histopathological prognostic factors. These works show that such models outperform the statistical tools traditionally used by expert clinical staff in survival analysis. 
However, these models lose effectiveness when they process information from atypical tumors or morphologically indistinguishable subtypes, for which clinical and histopathological factors do not provide enough discriminatory information. The reason is the heterogeneity of cancer as a disease, for which no single characterized cause exists, and whose evolution has been shown to be determined by genetic as well as clinical factors. Integrating clinical-histopathological and proteomic-genomic data therefore gives more accurate predictions than models that use only one type of data, making it possible to bring personalized medicine into daily clinical practice. In this respect, expression profile data from DNA microarray platforms, miRNA microarray data or, more recently, next-generation sequencing technologies such as RNA-Seq provide the level of detail and complexity needed to classify atypical tumors, establishing different prognoses for patients within the same protocol group. Analyzing data of this kind is a real challenge for clinicians, biologists and the wider scientific community, given the large volume of information these platforms produce. In general, the samples resulting from experiments on these platforms are represented by a very large number of genes, on the order of thousands. Identifying the most significant genes, those that carry enough discriminatory information to allow predictive models to be designed, would be practically impossible without the help of computing. This is where bioinformatics arises, a term that refers to how information science is applied in the field of biomedicine. 
The overall objective of this thesis is therefore to bring personalized medicine into daily clinical practice. To this end, expression profile data from some of the most relevant microarray platforms will be used to develop predictive models that improve the generalization capability of current prognostic systems in the clinical setting. Three partial objectives follow from the overall objective: (i) to pre-process any dataset in general, and biomedical data in particular, for subsequent analysis; (ii) to analyze the main shortcomings of the current information systems of an oncology service in order to develop an oncology information system that covers all its needs; and (iii) to develop new predictive models based on expression profiles obtained from a sequencing platform, with emphasis on the predictive capability of these models, their robustness, and the biological relevance of the gene signatures found. Finally, it can be concluded that the results obtained in this doctoral thesis would make it possible to offer, in the near future, personalized medicine in daily clinical practice. The predictive models based on expression profile data developed in the thesis could be integrated into the oncology information system deployed at the Hospital Universitario Virgen de la Victoria (HUVV) in Málaga, which is itself a result of part of the work carried out in this thesis. In addition, the proteomic-genomic information of each patient could be incorporated to take full advantage of the added benefits mentioned throughout this thesis. 
Furthermore, thanks to all the work carried out in this thesis, the doctoral candidate has been able to gain extensive research training in an area as broad as bioinformatics.

    StressGenePred: a twin prediction model architecture for classifying the stress types of samples and discovering stress-related genes in arabidopsis

    Background: Recently, a number of studies have investigated how plants respond to stress at the cellular and molecular level by measuring gene expression profiles over time. As a result, sets of time-series gene expression data for the stress response are available in databases. With these data, an integrated analysis of multiple stresses is possible, which identifies stress-responsive genes with higher specificity because considering multiple stresses can capture the effect of interference between stresses. To analyze such data, a machine learning model needs to be built. Results: In this study, we developed StressGenePred, a neural network-based machine learning method, to integrate time-series transcriptome data of multiple stress types. StressGenePred is designed to detect single stress-specific biomarker genes by using a simple feature embedding method, a twin neural network model, and Confident Multiple Choice Learning (CMCL) loss. The twin neural network model consists of a biomarker gene discovery model and a stress type prediction model that share the same logical layer to reduce training complexity. The CMCL loss is used to make the twin model select biomarker genes that respond specifically to a single stress. In experiments using Arabidopsis gene expression data for four major environmental stresses (heat, cold, salt, and drought), StressGenePred classified the types of stress more accurately than the limma feature embedding method and the support vector machine and random forest classification methods. In addition, StressGenePred discovered known stress-related genes with higher specificity than the Fisher method. Conclusions: StressGenePred is a machine learning method for identifying stress-related genes and predicting stress types for an integrated analysis of multiple stress time-series transcriptome data. 
This method can be applied to other phenotype-gene association studies. This work and publication costs were supported by National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT (No. NRF2017M3C4A7065887), and the Collaborative Genome Program for Fostering New Post-Genome Industry of the National Research Foundation (NRF) funded by the Ministry of Science and ICT (MSIT) (No. NRF-2014M3C9A3063541). This work was supported for W.J. by the Agenda program (No. PJ014307), Rural Development Administration of the Republic of Korea.

    Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment

    The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., the compositional, heterogeneous and sparse nature of these datasets. Predicting host phenotypes from taxonomy-informed feature selection, in order to establish associations with the microbiome and predict disease states, is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs (for example, classification and prediction in microbiology), infer host phenotypes to predict diseases, and use microbial communities to stratify patients by characterizing state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) an automated publication search through the digital libraries of the three major publishers using a natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers using a learning-to-rank approach. This study was supported by COST Action CA18131 “Statistical and machine learning techniques in human microbiome studies”. Estonian Research Council grant PRG548 (JT). Spanish State Research Agency Juan de la Cierva Grant IJC2019-042188-I (LM-Z). EO was funded and OA was supported by Estonian Research Council grant PUT 1371 and EMBO Installation grant 3573. AG was supported by Statutory Research project of the Department of Computer Networks and Systems.
