91 research outputs found

    Automated segmentation of tissue images for computerized IHC analysis

    This paper presents two automated methods for the segmentation of immunohistochemical tissue images that overcome the limitations of the manual approach as well as of existing computerized techniques. The first method, based on unsupervised color clustering, automatically recognizes the target cancerous areas in the specimen and disregards the stroma; the second, independent method, based on color separation and morphological processing, segments the nuclear membranes of the cancerous cells. Extensive experimental results on real tissue images demonstrate the accuracy of our techniques compared to manual segmentations; additional experiments show that our techniques are more effective on immunohistochemical images than popular approaches based on supervised learning or active contours. The proposed procedure can be exploited in any application that requires tissue and cell exploration and in performing reliable and standardized measures of the activity of specific proteins involved in multi-factorial genetic pathologies.
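
    As a purely illustrative sketch of the first method's idea (not the authors' implementation), the snippet below segments candidate stained regions of an IHC image by unsupervised color clustering with k-means. The image path, the number of clusters and the heuristic for picking the stained cluster are all assumptions made for the example.

        import numpy as np
        from skimage import io
        from sklearn.cluster import KMeans

        # Load an RGB IHC tissue image (the path is a placeholder).
        image = io.imread("ihc_sample.png")[:, :, :3]
        h, w, _ = image.shape

        # Cluster pixels by color; three clusters is an assumption
        # (roughly: stained tissue, stroma, background).
        pixels = image.reshape(-1, 3).astype(float)
        kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(pixels)
        labels = kmeans.labels_.reshape(h, w)

        # Heuristic: treat the cluster with the lowest mean green channel as the
        # stained target area and the rest as stroma/background.
        stained = int(np.argmin(kmeans.cluster_centers_[:, 1]))
        mask = labels == stained  # binary mask of candidate target regions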

    Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data

    Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them to prior analyses of the same data. The findings of the present study have two important practical implications. First, by avoiding under-powered data analysis protocols, high-throughput studies can achieve substantial economies in the sample size required to demonstrate the statistical significance of a predictive signal. The factors that affect power are identified and studied; much smaller samples than previously thought may be sufficient for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.
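
    As a toy illustration of how a single protocol component choice (here, the error metric) can change the apparent strength of the predictive signal, the sketch below scores the same classifier with accuracy and with AUC on a small imbalanced synthetic dataset. The dataset, classifier and cross-validation setup are assumptions for the example, not the study's protocol.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        # Small, imbalanced synthetic "outcome prediction" dataset (an assumption).
        X, y = make_classification(n_samples=80, n_features=500, n_informative=10,
                                   weights=[0.8, 0.2], random_state=0)

        clf = LogisticRegression(max_iter=1000)

        # The same analysis scored with two different error metrics.
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
        auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")

        # With imbalanced classes, accuracy can look decent for a near-trivial model,
        # while AUC gives a different picture of whether signal was detected.
        print("accuracy: %.3f +/- %.3f" % (acc.mean(), acc.std()))
        print("AUC:      %.3f +/- %.3f" % (auc.mean(), auc.std()))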

    Predictive integration of gene functional similarity and co-expression defines treatment response of endothelial progenitor cells

    Background: Endothelial progenitor cells (EPCs) have been implicated in different processes crucial to vasculature repair, which may offer the basis for new therapeutic strategies in cardiovascular disease. Despite advances facilitated by functional genomics, there is a lack of systems-level understanding of the treatment response mechanisms of EPCs. In this research we aimed to characterize the EPC response to adenosine (Ado), a cardioprotective factor, based on the systems-level integration of gene expression data and prior functional knowledge. Specifically, we set out to identify novel biosignatures of Ado-treatment response in EPCs. Results: The predictive integration of gene expression data and standardized functional similarity information enabled us to identify new treatment response biosignatures. Gene expression data originated from Ado-treated and untreated EPC samples, and functional similarity was estimated with Gene Ontology (GO)-based similarity information. These information sources enabled us to implement and evaluate an integrated prediction approach based on the concept of k-nearest neighbours learning (kNN). The method can be executed with expert- and data-driven input queries to guide the search for biologically meaningful biosignatures. The resulting integrated kNN system identified new candidate EPC biosignatures that offer high classification performance (areas under the receiver operating characteristic curve > 0.8). We also showed that the proposed models can outperform those discovered by standard gene expression analysis. Furthermore, we report an initial independent in vitro experimental follow-up, which provides additional evidence of the potential validity of the top biosignature. Conclusion: Response to Ado treatment in EPCs can be accurately characterized with a new method based on the combination of gene co-expression data and GO-based similarity information. It also exploits the incorporation of human expert-driven queries as a strategy to guide the automated search for candidate biosignatures. The proposed biosignature improves the systems-level characterization of EPCs. The new integrative predictive modeling approach can also be applied to other phenotype characterization or biomarker discovery problems.
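
    The sketch below shows one way a kNN classifier can be driven by a distance that mixes co-expression with GO-based functional similarity, which is the general idea the abstract describes; it is not the authors' integrated kNN system. The toy expression matrix, the random stand-in for GO similarity, the mixing weight and the labels are all assumptions.

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(0)

        # Toy stand-ins: expression profiles for 50 genes across 12 samples, plus a
        # random symmetric matrix playing the role of GO-based gene-gene similarity.
        expr = rng.normal(size=(50, 12))
        go_sim = rng.uniform(size=(50, 50))
        go_sim = (go_sim + go_sim.T) / 2
        np.fill_diagonal(go_sim, 1.0)

        # Gene-gene co-expression similarity (absolute Pearson correlation).
        co_expr = np.abs(np.corrcoef(expr))

        # Integrated distance: weighted mix of (1 - co-expression) and (1 - GO similarity).
        alpha = 0.5  # mixing weight is an assumption
        dist = alpha * (1 - co_expr) + (1 - alpha) * (1 - go_sim)
        np.fill_diagonal(dist, 0.0)

        # kNN over genes with the precomputed integrated distance; labels are arbitrary
        # stand-ins for membership in a candidate biosignature.
        labels = rng.integers(0, 2, size=50)
        knn = KNeighborsClassifier(n_neighbors=3, metric="precomputed")
        knn.fit(dist, labels)
        print(knn.predict(dist[:5]))  # predictions for the first five genes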

    Multiclass classification of microarray data samples with a reduced number of genes

    Background: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions: A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
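
    To make the setting concrete, the sketch below builds a binary-mediated (one-vs-rest) multiclass classifier in which every binary classifier is restricted to a small number of selected genes. The cap of 20 genes, the synthetic data and the linear SVM are illustrative assumptions and do not reflect the bound derived in the paper.

        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.multiclass import OneVsRestClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        # Synthetic multiclass "microarray" data: 120 samples, 2000 genes, 4 classes.
        X, y = make_classification(n_samples=120, n_features=2000, n_informative=30,
                                   n_classes=4, n_clusters_per_class=1, random_state=0)

        # Each binary (one-vs-rest) problem selects at most 20 genes before fitting
        # a linear classifier; the cap is an arbitrary illustration.
        per_binary = make_pipeline(SelectKBest(f_classif, k=20),
                                   LinearSVC(max_iter=10000))
        multi = OneVsRestClassifier(per_binary)

        scores = cross_val_score(multi, X, y, cv=5)
        print("mean accuracy: %.3f" % scores.mean())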

    Analysis and Computational Dissection of Molecular Signature Multiplicity

    Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data analysis instability in which different analysis methods used on the same data, or different samples from the same population, lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and the development of next-generation patient diagnostics and personalized treatments. Currently, the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures, independently of the data distribution. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically and introduce a framework for in silico bioequivalence of this important new class of diagnostic and personalized medicine modalities.
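
    The snippet below is a naive demonstration of signature multiplicity itself, not the algorithm proposed in the paper: it repeatedly selects a small signature, removes its genes from the candidate pool, and checks whether the next signature is still nearly as predictive. The synthetic data, the signature size of 5 and the 0.05 AUC tolerance are assumptions.

        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        # Synthetic data with redundant informative features, so several distinct
        # gene subsets can be nearly equally predictive.
        X, y = make_classification(n_samples=200, n_features=300, n_informative=5,
                                   n_redundant=10, random_state=0)

        def best_signature(pool):
            """Pick 5 genes from the pool and report their cross-validated AUC."""
            selector = SelectKBest(f_classif, k=5).fit(X[:, pool], y)
            chosen = [pool[i] for i in selector.get_support(indices=True)]
            auc = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, chosen], y, cv=5, scoring="roc_auc").mean()
            return chosen, auc

        pool = list(range(X.shape[1]))
        signatures, reference_auc = [], None
        for _ in range(4):                        # look for up to 4 signatures
            sig, auc = best_signature(pool)
            if reference_auc is None:
                reference_auc = auc
            if auc < reference_auc - 0.05:        # stop once predictivity degrades
                break
            signatures.append((sig, round(auc, 3)))
            pool = [g for g in pool if g not in sig]  # exclude used genes and retry

        for sig, auc in signatures:
            print(sig, auc)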

    Multiplicity: an organizing principle for cancers and somatic mutations

    Background: With the advent of whole-genome analysis for profiling tumor tissue, a pressing need has emerged for principled methods of organizing the large amounts of resulting genomic information. We propose the concept of multiplicity measures on cancer and gene networks to organize the information in a clinically meaningful manner. Multiplicity applied in this context extends Fearon and Vogelstein's multi-hit genetic model of colorectal carcinoma across multiple cancers. Methods: Using the Catalogue of Somatic Mutations in Cancer (COSMIC), we construct networks of interacting cancers and genes. Multiplicity is calculated by evaluating the number of cancers and genes linked by the measurement of a somatic mutation. The Kamada-Kawai algorithm is used to find a two-dimensional minimum energy solution with multiplicity as an input similarity measure. Cancers and genes are positioned in two dimensions according to this similarity. A third dimension is added to the network by assigning a maximal multiplicity to each cancer or gene. Hierarchical clustering within this three-dimensional network is used to identify similar clusters in somatic mutation patterns across cancer types. Results: The clustering of genes in a three-dimensional network reveals a similarity in acquired mutations across different cancer types. Surprisingly, the clusters separate known causal mutations. The multiplicity clustering technique identifies a set of causal genes with an area under the ROC curve of 0.84 versus 0.57 when clustering on gene mutation rate alone. The cluster multiplicity value and number of causal genes are positively correlated via Spearman's rank order correlation (r_s(8) = 0.894, Spearman's t = 17.48, p < 0.05). A clustering analysis of cancer types segregates different types of cancer. All blood tumors cluster together, and the cluster multiplicity values differ significantly (Kruskal-Wallis, H = 16.98, df = 2, p < 0.05). Conclusion: We demonstrate the principle of multiplicity for organizing somatic mutations and cancers in clinically relevant clusters. These clusters of cancers and mutations provide representations that identify segregations of cancer and genes driving cancer progression.
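
    As a toy version of the kind of computation described, the sketch below derives a simple multiplicity-like count per gene from a binary cancer-by-gene mutation table and hierarchically clusters genes by their mutation patterns across cancer types. The random table, the distance metric and the number of clusters are assumptions; the paper's actual pipeline uses COSMIC data and a Kamada-Kawai layout.

        import numpy as np
        from scipy.cluster.hierarchy import fcluster, linkage
        from scipy.spatial.distance import pdist

        rng = np.random.default_rng(1)

        # Toy binary matrix: rows = 8 cancer types, columns = 20 genes,
        # entry 1 if a somatic mutation in that gene was recorded for that cancer.
        mutations = (rng.uniform(size=(8, 20)) < 0.4).astype(int)

        # A simple multiplicity-like measure per gene: the number of distinct
        # cancer types in which the gene is mutated.
        gene_multiplicity = mutations.sum(axis=0)

        # Distance between genes: fraction of cancer types where their mutation
        # status differs (Hamming distance over the binary profiles).
        dist = pdist(mutations.T, metric="hamming")

        # Hierarchical clustering of genes by mutation pattern across cancers.
        tree = linkage(dist, method="average")
        clusters = fcluster(tree, t=3, criterion="maxclust")
        print("gene multiplicities:", gene_multiplicity)
        print("cluster assignments:", clusters)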

    A feature selection method for classification within functional genomics experiments based on the proportional overlapping score

    Background: Microarray technology, as well as other functional genomics experiments, allows simultaneous measurement of thousands of genes within each sample. Both the prediction accuracy and the interpretability of a classifier can be enhanced by performing the classification based only on selected discriminative genes. We propose a statistical method for selecting genes based on overlapping analysis of expression data across classes. This method results in a novel measure, called the proportional overlapping score (POS), of a feature's relevance to a classification task. Results: We apply POS, along with four widely used gene selection methods, to several benchmark gene expression datasets. The classification error rates computed using the Random Forest, k Nearest Neighbor and Support Vector Machine classifiers show that POS achieves better performance. Conclusions: A novel gene selection method, POS, is proposed. POS analyzes the expression overlap across classes, taking into account the proportions of overlapping samples. It robustly defines a mask for each gene that allows it to minimize the effect of expression outliers. The constructed masks, along with a novel gene score, are exploited to produce the selected subset of genes.
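
    A minimal sketch of the overlap idea behind POS is shown below: for a single gene, it measures the proportion of samples whose expression falls inside the interval where the two class ranges overlap. This is a simplified stand-in, not the published POS definition, and the synthetic two-class data are an assumption.

        import numpy as np

        def overlap_proportion(expr, labels):
            """Proportion of samples inside the interval where the two class
            expression ranges overlap (simplified stand-in for POS)."""
            a, b = expr[labels == 0], expr[labels == 1]
            lo = max(a.min(), b.min())       # start of the overlapping interval
            hi = min(a.max(), b.max())       # end of the overlapping interval
            if hi <= lo:
                return 0.0                   # class ranges do not overlap at all
            inside = np.sum((expr >= lo) & (expr <= hi))
            return inside / expr.size

        rng = np.random.default_rng(0)
        labels = np.array([0] * 15 + [1] * 15)

        # A well-separated gene versus a heavily overlapping one.
        gene_separated = np.concatenate([rng.normal(0, 1, 15), rng.normal(5, 1, 15)])
        gene_overlapping = np.concatenate([rng.normal(0, 1, 15), rng.normal(0.5, 1, 15)])

        print(overlap_proportion(gene_separated, labels))    # near 0: good candidate
        print(overlap_proportion(gene_overlapping, labels))  # near 1: poor candidate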

    Gene selection for classification of microarray data based on the Bayes error

    Background: With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes in, e.g., disease diagnosis. Several widely used gene selection methods select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these methods is that they may produce gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy. Results: In this study, we propose a new method, the Based Bayes error Filter (BBF), to select relevant genes and remove redundant genes in classification analyses of microarray data. The effectiveness and accuracy of this method are demonstrated through analyses of five publicly available microarray datasets. The results show that our gene selection method is capable of achieving better accuracies than previous studies, while being able to effectively select relevant genes, remove redundant genes and obtain efficient and small gene sets for sample classification purposes. Conclusion: The proposed method can effectively identify a compact set of genes with high classification accuracy. This study also indicates that application of the Bayes error is a feasible and effective way of removing redundant genes in gene selection.
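
    The sketch below illustrates the general flavour of a Bayes-error-based filter rather than the paper's BBF method: each gene receives a Gaussian-assumption Bhattacharyya upper bound on its Bayes error, genes are ranked by that bound, and genes highly correlated with already-selected ones are skipped. The synthetic data, the 0.8 correlation threshold and the cap of 10 genes are assumptions.

        import numpy as np

        def bayes_error_bound(x, y):
            """Bhattacharyya upper bound on the per-gene Bayes error, assuming
            Gaussian class-conditional densities and equal priors."""
            a, b = x[y == 0], x[y == 1]
            m0, m1 = a.mean(), b.mean()
            v0, v1 = a.var() + 1e-12, b.var() + 1e-12
            db = (0.25 * (m0 - m1) ** 2 / (v0 + v1)
                  + 0.5 * np.log((v0 + v1) / (2 * np.sqrt(v0 * v1))))
            return 0.5 * np.exp(-db)   # 0.5 * Bhattacharyya coefficient

        rng = np.random.default_rng(0)
        y = np.array([0] * 20 + [1] * 20)
        X = rng.normal(size=(40, 200))
        X[y == 1, :5] += 2.0           # make the first five genes informative

        # Rank genes by estimated Bayes error (lower = more discriminative) and
        # greedily keep genes not strongly correlated with those already kept.
        errors = np.array([bayes_error_bound(X[:, j], y) for j in range(X.shape[1])])
        selected = []
        for j in np.argsort(errors):
            if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < 0.8 for k in selected):
                selected.append(int(j))
            if len(selected) == 10:    # keep a compact set of 10 genes
                break
        print("selected genes:", selected)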

    A comparative analysis of predictive models of morbidity in intensive care unit after cardiac surgery – Part I: model planning

    Background: Different methods have recently been proposed for predicting morbidity in intensive care units (ICU). The aim of the present study was to critically review a number of approaches for developing models capable of estimating the probability of morbidity in ICU after heart surgery. The study is divided into two parts. In this first part, popular models used to estimate the probability of class membership are grouped into distinct categories according to their underlying mathematical principles. Modelling techniques and intrinsic strengths and weaknesses of each model are analysed and discussed from a theoretical point of view, in consideration of clinical applications. Methods: Models based on Bayes rule, the k-nearest neighbour algorithm, logistic regression, scoring systems and artificial neural networks are investigated. Key issues for model design are described. The mathematical treatment of some aspects of model structure is also included for readers interested in developing models, though a full understanding of mathematical relationships is not necessary if the reader is only interested in perceiving the practical meaning of model assumptions, weaknesses and strengths from a user point of view. Results: Scoring systems are very attractive due to their simplicity of use, although this may undermine their predictive capacity. Logistic regression models are trustworthy tools, although they suffer from the principal limitations of most regression procedures. Bayesian models seem to be a good compromise between complexity and predictive performance, but model recalibration is generally necessary. k-nearest neighbour may be a valid non-parametric technique, though computational cost and the need for large data storage are major weaknesses of this approach. Artificial neural networks have intrinsic advantages with respect to common statistical models, though the training process may be problematic. Conclusion: Knowledge of model assumptions and of the theoretical strengths and weaknesses of different approaches is fundamental for designing models for estimating the probability of morbidity after heart surgery. However, a rational choice also requires evaluation and comparison of the actual performance of locally developed competitive models in the clinical scenario, to obtain satisfactory agreement between local needs and model response. In the second part of this study the above predictive models will therefore be tested on real data acquired in a specialized ICU.
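
    Since the abstract discusses these model families only qualitatively, the sketch below shows how their out-of-sample performance could be compared on a single binary morbidity outcome. The synthetic "patient" data, the chosen hyperparameters and the use of cross-validated AUC are all assumptions made for the illustration.

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.neural_network import MLPClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        # Synthetic stand-in for a perioperative dataset: 500 patients, 15 predictors,
        # imbalanced binary morbidity outcome.
        X, y = make_classification(n_samples=500, n_features=15, n_informative=8,
                                   weights=[0.85, 0.15], random_state=0)

        models = {
            "logistic regression": LogisticRegression(max_iter=1000),
            "naive Bayes (Bayes rule)": GaussianNB(),
            "k-nearest neighbour": KNeighborsClassifier(n_neighbors=15),
            "neural network": MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000,
                                            random_state=0),
        }

        for name, model in models.items():
            pipe = make_pipeline(StandardScaler(), model)
            auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
            print("%-25s AUC = %.3f +/- %.3f" % (name, auc.mean(), auc.std()))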

    A Genome-Wide Gene Function Prediction Resource for Drosophila melanogaster

    Predicting gene functions by integrating large-scale biological data remains a challenge for systems biology. Here we present a resource for Drosophila melanogaster gene function predictions. We trained function-specific classifiers to optimize the influence of different biological datasets for each functional category. Our model predicted GO terms and KEGG pathway memberships for Drosophila melanogaster genes with high accuracy, as affirmed by cross-validation, supporting literature evidence, and large-scale RNAi screens. The resulting resource of prioritized associations between Drosophila genes and their potential functions offers a guide for experimental investigations.
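
    A minimal sketch of the "function-specific classifiers" idea is given below: one binary classifier per functional category is trained on an integrated feature matrix, and its cross-validated membership probabilities rank candidate genes for that category. The synthetic features, the random toy annotations and logistic regression as the classifier are assumptions, not the resource's actual models.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(0)
        n_genes = 300

        # Toy integrated feature matrix; in practice the columns would come from
        # different biological data sources (expression, interactions, sequence).
        features = rng.normal(size=(n_genes, 40))

        # Toy gene-to-term annotation matrix for three hypothetical functional terms.
        annotations = (rng.uniform(size=(n_genes, 3)) < 0.15).astype(int)

        # One classifier per functional category ("function-specific"), so each term
        # can weight the integrated features differently.
        for term in range(annotations.shape[1]):
            y = annotations[:, term]
            clf = LogisticRegression(max_iter=1000, class_weight="balanced")
            probs = cross_val_predict(clf, features, y, cv=5,
                                      method="predict_proba")[:, 1]
            top = np.argsort(probs)[::-1][:5]   # highest-ranked candidate genes
            print("term %d: top candidate genes %s" % (term, top.tolist()))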