807 research outputs found

    Gene set based ensemble methods for cancer classification

    Cancer diagnosis very often depends on conclusions drawn from both clinical and microscopic examination of tissue, which together are used to place tumors in known categories. One factor that determines the category is the tissue from which the tumor originates. Information gathered from clinical exams may be partial or not fully predictive of a specific category of cancer. Categorization is further complicated when the histological classification of the cancer tissue and the description of its course of development are atypical. Gene expression data from microarray analysis holds considerable promise for more accurate cancer diagnosis. One hurdle in classifying tumors from gene expression data is that the data space is ultra-high-dimensional with relatively few points; that is, there are a small number of examples with a large number of genes. A second hurdle is expression bias caused by the correlation of genes. Analysis of subsets of genes, known as gene set analysis, provides a mechanism for identifying groups of differentially expressed genes. We propose an ensemble of classifiers whose base classifiers are ℓ1-regularized logistic regression models with the feature space restricted to biologically relevant genes. Some researchers have already explored ensemble classifiers for cancer classification, but the effect of the underlying base classifiers in conjunction with biologically derived gene sets has not been explored.

    An agent-based hybrid system for microarray data analysis

    This article reports our experience in agent-based hybrid construction for microarray data analysis. The contributions are twofold: we demonstrate that agent-based approaches are suitable for building hybrid systems in general, and that a genetic ensemble system is appropriate for microarray data analysis in particular. Created using an agent-based framework, this genetic ensemble system for microarray data analysis excels in both sample classification accuracy and gene selection reproducibility.

    Identifying the molecular components that matter: a statistical modelling approach to linking functional genomics data to cell physiology

    Functional genomics technologies, in which thousands of mRNAs, proteins, or metabolites can be measured in a single experiment, have helped reshape biological investigation. One of the most important issues in analyzing the large datasets they generate is the selection of relatively small subsets of variables that are predictive of the physiological state of a cell or tissue. In this thesis, a truly multivariate variable selection framework using diverse functional genomics data has been developed, characterized, and tested. This framework has also been used to show that it is possible to predict the physiological state of the tumour from the molecular state of adjacent normal cells. This allows us to identify novel genes involved in cell-to-cell communication. Then, using a network inference technique, networks representing cell-cell communication in prostate cancer have been inferred. The analysis of these networks has revealed interesting properties that suggest a crucial role of directional signals in controlling the interplay between normal and tumour cell-to-cell communication. Experimental verification performed in our laboratory has provided evidence that one of the identified genes could be a novel tumour suppressor gene. In conclusion, the findings and methods reported in this thesis have contributed to further understanding of cell-to-cell interaction and multivariate variable selection, not only by applying and extending previous work but also by proposing novel approaches that can be applied to any functional genomics data.

    AdaSampling for positive-unlabeled and label noise learning with bioinformatics applications

    © 2018 IEEE. Class labels are required for supervised learning but may be corrupted or missing in various applications. In binary classification, for example, when only a subset of positive instances is labeled and the rest are unlabeled, positive-unlabeled (PU) learning is required to model from both positive and unlabeled data. Similarly, when class labels are corrupted by mislabeled instances, methods are needed for learning in the presence of class label noise (LN). Here we propose adaptive sampling (AdaSampling), a framework for both PU learning and learning with class LN. By iteratively estimating the class mislabeling probability with an adaptive sampling procedure, the proposed method progressively reduces the risk of selecting mislabeled instances for model training and subsequently constructs highly generalizable models even when a large proportion of mislabeled instances is present in the data. We demonstrate the utility of the proposed method using simulation and benchmark data, and compare it to alternative approaches that are commonly used for PU learning and/or learning with LN. We then introduce two novel bioinformatics applications where AdaSampling is used to: 1) identify kinase-substrates from mass spectrometry-based phosphoproteomics data and 2) predict transcription factor target genes by integrating various next-generation sequencing data.
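
The iterative procedure described in this abstract can be illustrated with a minimal sketch for the positive-unlabeled case. This is not the authors' implementation: the nearest-centroid base classifier on a single feature, the function names, and the default of five iterations are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def fit_centroids(xs, ys):
    """Toy base classifier (an assumption): per-class mean of a 1-D feature."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return sum(pos) / len(pos), sum(neg) / len(neg)


def predict_proba(model, x):
    """Probability of the positive class from distances to the two centroids."""
    mu_pos, mu_neg = model
    score = (x - mu_neg) ** 2 - (x - mu_pos) ** 2
    score = max(-50.0, min(50.0, score))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-score))


def ada_sampling_pu(xs, labeled_pos, n_iter=5, seed=0):
    """Simplified adaptive sampling for positive-unlabeled data.

    xs          : list of 1-D feature values
    labeled_pos : list of bools, True where an instance is a labeled positive
    Returns the last fitted model and, for each unlabeled index, the
    estimated probability that it is a true negative.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, p in enumerate(labeled_pos) if p]
    unl_idx = [i for i, p in enumerate(labeled_pos) if not p]
    # Start by treating every unlabeled instance as an equally likely negative.
    neg_prob = {i: 1.0 for i in unl_idx}
    model = None
    for _ in range(n_iter):
        # Sample "negatives" in proportion to their current negative probability,
        # so likely hidden positives are drawn less and less often.
        weights = [max(neg_prob[i], 1e-9) for i in unl_idx]
        sampled = rng.choices(unl_idx, weights=weights, k=len(unl_idx))
        train_x = [xs[i] for i in pos_idx] + [xs[i] for i in sampled]
        train_y = [1] * len(pos_idx) + [0] * len(sampled)
        model = fit_centroids(train_x, train_y)
        # Re-estimate the probability that each unlabeled instance is negative.
        for i in unl_idx:
            neg_prob[i] = 1.0 - predict_proba(model, xs[i])
    return model, neg_prob
```

In this sketch the labeled positives are always kept, and only the unlabeled pool is resampled; any probabilistic classifier could stand in for the toy centroid model.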

    Classification of microarray gene expression cancer data by using artificial intelligence methods

    Advances in computer technology have influenced work in many fields. Developments in molecular biology and computer technology have given rise to the science of bioinformatics. Rapid progress in bioinformatics has contributed greatly to solving many open problems in the field. The classification of DNA microarray gene expression data is one of these problems. DNA microarrays are a technology used in bioinformatics. DNA microarray data analysis plays a very effective role in diagnosing gene-related diseases such as cancer. By identifying the gene expression patterns associated with a disease type, it is possible to determine with a high success rate whether an individual carries a diseased gene. To determine whether an individual is healthy, it is very important to apply high-performance classification techniques to microarray gene expression data. Many methods exist for classifying DNA microarrays. Statistical methods such as Support Vector Machines, Naive Bayes, k-Nearest Neighbors, and Decision Trees are widely used. However, when used on their own, these methods do not always achieve high success rates in classifying microarray data. For this reason, studies also use artificial-intelligence-based methods to achieve high success rates in classifying microarray data. This study aims to achieve higher success rates by using an artificial-intelligence-based method, ANFIS, in addition to these statistical methods. k-Nearest Neighbors, Naive Bayes, and Support Vector Machines were used as the statistical classification methods. Experiments were carried out on two different cancer datasets: breast cancer and central nervous system cancer. The results show that, in general, the artificial-intelligence-based ANFIS technique was more successful than the statistical methods.

    An Eight-Gene Blood Expression Profile Predicts the Response to Infliximab in Rheumatoid Arthritis

    BACKGROUND: TNF alpha blockade agents like infliximab are currently the treatment of choice for rheumatoid arthritis (RA) patients who fail standard therapy. However, a considerable percentage of anti-TNF alpha treated patients do not show a significant clinical response. Given that new therapies for treatment of RA have been recently approved, there is a pressing need to find a system that reliably predicts treatment response. We hypothesized that the analysis of whole blood gene expression profiles of RA patients could be used to build a robust predictor of response to infliximab therapy. METHODS AND FINDINGS: We performed microarray gene expression analysis on whole blood RNA samples from RA patients starting infliximab therapy (n = 44). The clinical response to infliximab was determined at week 14 using the EULAR criteria. Blood cell populations were determined using flow cytometry at baseline, week 2, and week 14 of treatment. Using complete cross-validation and repeated random sampling, we identified a robust 8-gene predictor model (96.6% leave-one-out prediction accuracy, P = 0.0001). Applying this model to an independent validation set of RA patients, we estimated an 85.7% prediction accuracy (95% CI: 75-100%). In parallel, we also observed a significantly higher number of CD4+CD25+ cells (i.e. regulatory T cells) in the responder group compared to the non-responder group at baseline (P = 0.0009). CONCLUSIONS: The present 8-gene model obtained from whole blood expression efficiently predicts response to infliximab in RA patients. The application of the present system in the clinical setting could assist the clinician in the selection of the optimal treatment strategy in RA.

    Classification of clinical outcomes using high-throughput and clinical informatics.

    It is widely recognized that many cancer therapies are effective only for a subset of patients. However, clinical studies are most often powered to detect an overall treatment effect. To address this issue, classification methods are increasingly being used to predict a subset of patients that responds differently to treatment. This study begins with a brief history of classification methods with an emphasis on applications involving melanoma. Nonparametric methods suitable for predicting subsets of patients responding differently to treatment are then reviewed. Each method has different ways of incorporating continuous, categorical, clinical, and high-throughput covariates. For nonparametric and parametric methods, distance measures specific to the method are used to make classification decisions. Approaches are outlined that employ these distances to measure treatment interactions and predict patients more sensitive to treatment. Simulations are also carried out to examine the empirical power of some of these classification methods in an adaptive signature design. Results were compared with logistic regression models. It was found that parametric and nonparametric methods performed reasonably well. The relative performance of the methods depends on the simulation scenario. Finally, a method was developed to evaluate the power and sample size needed for an adaptive signature design in order to predict the subset of patients sensitive to treatment. It is hoped that this study will stimulate more development of nonparametric and parametric methods to predict subsets of patients responding differently to treatment.

    Intelligent Computing for Big Data

    Recent advances in artificial intelligence have the potential to further develop current big data research. The Special Issue on ‘Intelligent Computing for Big Data’ highlighted a number of recent studies on the use of intelligent computing techniques in the processing of big data for text mining, autism diagnosis, behaviour recognition, and blockchain-based storage.

    Statistical Challenges and Methods for Missing and Imbalanced Data

    Missing data remains a prevalent issue in every area of research. The impact of missing data, if not carefully handled, can be detrimental to any statistical analysis. Statistical challenges associated with missing data include loss of information, reduced statistical power, and non-generalizability of a study's findings. It is therefore crucial that researchers pay close attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above-mentioned challenges through simulation studies and application to real datasets. The first paper of this dissertation addresses the dropout phenomenon in single-cell RNA (scRNA) sequencing through a comparative analysis of some existing scRNA sequencing techniques. The second paper uses simulation studies to assess whether it is appropriate to address non-detects in data using a traditional substitution approach, imputation, or a non-imputation-based approach. The final paper presents an efficient strategy for addressing imbalance in data of any degree (whether moderate or high) by combining random undersampling with different weighting strategies. We conclude generally, based on the findings of this dissertation, that missingness is not always a lack of information but interestingness that needs to be investigated.
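
The idea of combining random undersampling with class weighting, mentioned for the final paper, can be sketched as follows. The abstract does not say which weighting strategies the dissertation uses, so the inverse-frequency weights here are an assumption, as are the function names and the `ratio` parameter.

```python
import random
from collections import Counter


def random_undersample(labels, ratio=1.0, seed=0):
    """Return sorted indices to keep after undersampling.

    Every instance of the smallest class is kept; each other class is cut
    down at random to at most `ratio` times the smallest class size.
    `ratio` is a hypothetical knob and is expected to be >= 1.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    n_min = min(counts.values())
    cap = int(ratio * n_min)
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        if len(idx) > cap:
            idx = rng.sample(idx, cap)  # random subset without replacement
        keep.extend(idx)
    return sorted(keep)


def inverse_frequency_weights(labels):
    """Class weights n / (k * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


# Usage: reduce a 90:10 imbalance to 30:10, then weight what remains.
labels = [1] * 10 + [0] * 90
kept = random_undersample(labels, ratio=3.0, seed=0)
kept_labels = [labels[i] for i in kept]
weights = inverse_frequency_weights(kept_labels)
```

Undersampling only part of the way and letting the weights absorb the remaining imbalance discards fewer majority-class examples than undersampling to a 1:1 ratio.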

    Design of neurocomputational systems in the field of biomedicine

    Biomedicine is a broad field in whose research the public bodies of each country have invested, and continue to invest, large amounts of funding through national, European, and international projects. Scientific and technological advances over the last fifteen years have made it possible to study in depth the genetic and molecular bases of diseases such as cancer and to analyze the variability in individual patients' responses to different oncological treatments, laying the foundations of what is now known as personalized medicine. This can be defined as the design and application of prevention, diagnosis, and treatment strategies adapted to a scenario that integrates the genetic, clinical, histopathological, and immunohistochemical profile of each patient and pathology. Given the incidence of cancer in society, and although research has traditionally focused on diagnosis, researchers' interest in studying the prognosis of the disease is relatively recent, and it forms part of the growing trend of national public health systems toward a model of personalized and predictive medicine. Prognosis can be defined as prior knowledge of an event before its possible occurrence, and it can address the susceptibility, survival, and recurrence of the disease. The literature includes studies that use neurocomputational models to predict very specific cases, such as recurrence in operable breast cancer, based on clinical-histopathological prognostic factors. These studies show that such models outperform the statistical tools traditionally used by expert clinical staff in survival analysis.
However, these models lose effectiveness when processing information from atypical tumors or morphologically indistinguishable subtypes, for which clinical and histopathological factors do not provide enough discriminatory information. The reason is the heterogeneity of cancer as a disease, for which no single characterized cause exists and whose progression has been shown to be determined by genetic as well as clinical factors. Integrating clinical-histopathological and proteomic-genomic data therefore yields more accurate predictions than models that use only one type of data, making it possible to bring personalized medicine into daily clinical practice. In this regard, expression profile data from DNA microarray platforms, miRNA microarray data, or, more recently, next-generation sequencing technologies such as RNA-Seq provide the level of detail and complexity needed to classify atypical tumors and establish different prognoses for patients within the same protocol group. Analyzing data of this kind is a real challenge for clinicians, biologists, and the wider scientific community, given the large volume of information these platforms produce. In general, the samples resulting from experiments on these platforms are represented by a very large number of genes, on the order of thousands. Identifying the most significant genes, those that carry enough discriminatory information to allow predictive models to be designed, would be practically impossible without the help of computing. This is where bioinformatics comes in, a term that refers to the application of information science to biomedicine.
The overall objective of this thesis is therefore to bring personalized medicine into daily clinical practice. To that end, expression profile data from some of the most relevant microarray platforms will be used to develop predictive models that improve the generalization ability of current prognostic systems in the clinical setting. Three partial objectives follow from this overall objective: (i) to preprocess any dataset in general, and biomedical data in particular, for subsequent analysis; (ii) to analyze the main shortcomings of the current information systems of an oncology department in order to develop an oncology information system that covers all its needs; and (iii) to develop new predictive models based on expression profiles obtained from a sequencing platform, with emphasis on the predictive ability of these models and on the robustness and biological relevance of the gene signatures found. Finally, it can be concluded that the results obtained in this doctoral thesis would make it possible to offer, in the near future, personalized medicine in daily clinical practice. The predictive models based on expression profile data developed in the thesis could be integrated into the oncology information system deployed at the Hospital Universitario Virgen de la Victoria (HUVV) in Málaga, itself a product of part of the work carried out in this thesis. In addition, each patient's proteomic-genomic information could be incorporated to take full advantage of the added benefits mentioned throughout this thesis.
Furthermore, thanks to all the work carried out in this thesis, the doctoral candidate has been able to gain extensive, in-depth research training in a field as broad as bioinformatics.