    Segmentation and intensity estimation for microarray images with saturated pixels

    Background: Microarray image analysis processes scanned digital images of hybridized arrays to produce the spot-level data that feed downstream analyses, so it can have a large impact on those analyses. Signal saturation is an optical effect that occurs when some pixel values for highly expressed genes or peptides exceed the upper detection threshold of the scanner software (2^16 - 1 = 65,535 for 16-bit images). In practice, spots with a sizable number of saturated pixels are often flagged and discarded; alternatively, the saturated values are used without adjustment when estimating spot intensities. The resulting expression data tend to be biased downwards and can distort high-level analyses that rely on them, so it is crucial to correct effectively for signal saturation.

    Results: We developed a flexible mixture-model-based segmentation and spot intensity estimation procedure that accounts for saturated pixels by incorporating a censored component in the mixture model. As demonstrated with biological data and simulation, our method extends the dynamic range of expression data beyond the saturation threshold and is effective in correcting saturation-induced bias when the loss of information is not severe. We further illustrate the impact of image processing on downstream classification, showing that the proposed method can increase diagnostic accuracy using data from a lymphoma diagnosis study.

    Conclusions: The presented method adjusts for signal saturation at the segmentation stage, which identifies each pixel as part of the foreground, the background, or other. The cluster membership of a pixel can therefore change relative to treating saturated values as truly observed, so the resulting spot intensity estimates may be more accurate than those obtained from existing methods that correct for saturation using already-segmented data. As a model-based segmentation method, our procedure can identify the inner holes, fuzzy edges, and blank spots that are common in microarray images. The approach is independent of microarray platform and applicable to both single- and dual-channel microarrays.
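
    To make the censoring idea concrete, here is a minimal sketch (not the authors' implementation) of a two-component Gaussian mixture log-likelihood in which pixels at the 16-bit ceiling contribute a survival-function term rather than a density term; all parameter names are illustrative:

        # Minimal sketch: two-component Gaussian mixture (background vs.
        # foreground) in which values at the saturation threshold are
        # right-censored. Illustrative only; not the paper's code.
        import numpy as np
        from scipy.stats import norm

        SATURATION = 2**16 - 1  # 65,535 for 16-bit scanner images

        def censored_mixture_loglik(pixels, w, mu_bg, sd_bg, mu_fg, sd_fg):
            """Log-likelihood of pixel intensities when pixels at
            SATURATION only tell us the true value was >= SATURATION."""
            pixels = np.asarray(pixels, dtype=float)
            censored = pixels >= SATURATION
            obs = pixels[~censored]

            # Uncensored pixels: ordinary mixture density.
            dens = (w * norm.pdf(obs, mu_bg, sd_bg)
                    + (1 - w) * norm.pdf(obs, mu_fg, sd_fg))
            loglik = np.sum(np.log(dens + 1e-300))

            # Censored pixels: probability mass above the threshold,
            # via the survival function instead of the density.
            p_cens = (w * norm.sf(SATURATION, mu_bg, sd_bg)
                      + (1 - w) * norm.sf(SATURATION, mu_fg, sd_fg))
            loglik += censored.sum() * np.log(p_cens + 1e-300)
            return loglik

    Maximizing this likelihood, for example with an EM algorithm whose E-step uses truncated-normal moments for the censored pixels, allows the fitted foreground mean to exceed 65,535, which is the sense in which the dynamic range is extended.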

    Estimating Gene Signals From Noisy Microarray Images

    In oligonucleotide microarray experiments, noise is a challenging problem, as biologists now study their organisms not in isolation but in the context of a natural environment. In low photomultiplier tube (PMT) voltage images, weak gene signals and their interactions with the background fluorescence noise are most problematic. In addition, nonspecific sequences bind to array spots intermittently, causing inaccurate measurements. Conventional techniques cannot precisely separate the foreground and background signals. In this paper, we propose an analytically based estimation technique. We assume a priori spot-shape information in the form of a circular outer periphery with an elliptical center hole, and Gaussian statistics for modeling both the foreground and background signals. The mean of the foreground signal quantifies the weak gene signal corresponding to the spot, and its variance measures the undesired binding that causes fluctuation in the measurement. We propose a foreground-signal and shape-estimation algorithm based on Gibbs sampling. We compare our algorithm with the existing Mann–Whitney (MW)- and expectation maximization (EM)/iterated conditional modes (ICM)-based methods. Our method outperforms the existing methods with considerably smaller mean-square error (MSE) at all signal-to-noise ratios (SNRs) in computer-generated images and gives better qualitative results in low-SNR real-data images. Because of its inherent sampling operation, our method is computationally slow and hence best reserved for very noisy spot images. In a realistic example, we show that gene-signal fluctuations on the estimated foreground are better observed for input noisy images with relatively high undesired binding.
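
    As a rough illustration of the sampling machinery (the paper also samples the spot shape, which is omitted here), a Gibbs sampler for the foreground and background Gaussian parameters given a fixed spot mask might look like the following; the vague conjugate priors are an assumption:

        # Sketch: Gibbs sampler for spot foreground/background means and
        # variances, assuming a fixed boolean spot mask and flat/Jeffreys
        # priors. Names are illustrative, not from the paper.
        import numpy as np

        rng = np.random.default_rng(0)

        def gibbs_spot(pixels, mask, n_iter=2000, burn=500):
            """pixels: 2-D intensity array; mask: boolean foreground mask.
            Alternates conjugate draws of (mu, var) for each region and
            returns posterior-mean estimates (mu_fg, var_fg, mu_bg, var_bg)."""
            fg, bg = pixels[mask], pixels[~mask]
            var_fg, var_bg = fg.var(), bg.var()
            draws = []
            for t in range(n_iter):
                # mu | var ~ Normal(sample mean, var / n)
                mu_fg = rng.normal(fg.mean(), np.sqrt(var_fg / fg.size))
                mu_bg = rng.normal(bg.mean(), np.sqrt(var_bg / bg.size))
                # var | mu ~ scaled inverse chi-square (Jeffreys prior)
                var_fg = np.sum((fg - mu_fg) ** 2) / rng.chisquare(fg.size)
                var_bg = np.sum((bg - mu_bg) ** 2) / rng.chisquare(bg.size)
                if t >= burn:
                    draws.append((mu_fg, var_fg, mu_bg, var_bg))
            return np.mean(draws, axis=0)

    The posterior mean of mu_fg plays the role of the spot's gene-signal estimate, and var_fg quantifies the fluctuation attributed to undesired binding.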

    Crossword: A Fully Automated Algorithm for the Segmentation and Quality Control of Protein Microarray Images

    Biological assays formatted as microarrays have become a critical tool for generating the comprehensive data sets required for systems-level understanding of biological processes. Manual annotation of data extracted from images of microarrays, however, remains a significant bottleneck, particularly for protein microarrays, because this technology is sensitive to weak artifact signal. To automate the extraction and curation of data from protein microarrays, we describe an algorithm called Crossword that logically combines information from multiple approaches to fully automate microarray segmentation. Automated artifact removal is accomplished by segregating structured pixels from the background noise using iterative clustering and pixel connectivity. Correlation of the locations of structured pixels across image channels is then used to identify and remove artifact pixels from the image prior to data extraction. This component improves the accuracy of data sets while reducing the need for time-consuming visual inspection of the data. Crossword enables a fully automated protocol that is robust to significant spatial and intensity aberrations. Overall, the average amount of user intervention is reduced by an order of magnitude, and data quality is increased through artifact removal and reduced user variability. The increase in throughput should aid the further adoption of microarray technologies in clinical studies.

    Funding: Camille and Henry Dreyfus Foundation (Camille Dreyfus Teacher-Scholar Award).
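
    A hedged sketch of the two ideas named in the abstract, intensity clustering plus pixel connectivity, together with one plausible reading of the cross-channel step; the size threshold and two-channel interface are assumptions, not Crossword's actual code:

        # Sketch: separate "structured" pixels from background by intensity
        # clustering, keep only connected structures, then flag pixels that
        # are structured at the same location in both channels as candidate
        # artifacts. Illustrative reading of the abstract, not Crossword.
        import numpy as np
        from scipy import ndimage
        from sklearn.cluster import KMeans

        def structured_pixels(img, min_size=20):
            """Two-way k-means on intensity, then drop small connected
            components (4-connectivity by default in ndimage.label)."""
            labels = KMeans(n_clusters=2, n_init=10).fit_predict(
                img.reshape(-1, 1)).reshape(img.shape)
            hi = int(img[labels == 1].mean() > img[labels == 0].mean())
            mask = labels == hi
            comp, n = ndimage.label(mask)
            sizes = ndimage.sum(mask, comp, index=np.arange(1, n + 1))
            return np.isin(comp, 1 + np.flatnonzero(sizes >= min_size))

        def artifact_mask(ch1, ch2, min_size=20):
            """Structure co-located in both channels is treated as a
            candidate artifact (cf. the cross-channel correlation step)."""
            return (structured_pixels(ch1, min_size)
                    & structured_pixels(ch2, min_size))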

    Statistical Analysis of Gene Expression Microarrays

    This manuscript is composed of two major sections. In the first section we introduce some of the biological principles that form the basis of cDNA microarrays and explain how the different analytical steps introduce variability and potential biases in gene expression measurements that can sometimes be difficult to address properly. We address statistical issues associated with the measurement of gene expression (e.g., image segmentation, spot identification), with the correction for background fluorescence, and with the normalization and re-scaling of data to remove dye, print-tip, and other effects on expression. In this section we also describe the standard statistical approaches for estimating treatment effects on gene expression, and briefly address the multiple comparisons problem, often referred to as the big p, small n paradox. In the second major section, we discuss the use of multiple scans as a means of reducing the variability of gene expression estimates. While the use of multiple scans under the same laser and sensor settings has already been proposed (Romualdi et al. 2003), we describe a general hierarchical modeling approach proposed by Love and Carriquiry (2005) that enables the use of all the readings obtained under varied laser and sensor settings for each slide, even if the number of readings per slide varies across slides. This technique also uses the varied settings to correct for some of the censoring discussed in the first section. When scans are combined and censoring is corrected, the estimate of gene expression is expected to have smaller variance than an estimate based on a single spot measurement. In turn, expression estimates with smaller variance are expected to increase the power of statistical tests performed on them.
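
    The multi-scan idea can be illustrated with a toy model in which each scan applies an unknown gain to each spot's true intensity and readings at the 16-bit ceiling are dropped as censored; the alternating update below is a simplification of the cited hierarchical approach, and all names are illustrative:

        # Sketch: combine scans taken at different laser/sensor settings
        # under the toy model y[i, j] ~ gain[j] * x[i], fitting on the log
        # scale by alternating averages and ignoring saturated readings.
        # The Love-Carriquiry model is richer (formal censoring, shrinkage).
        import numpy as np

        SAT = 2**16 - 1

        def combine_scans(Y, n_iter=50):
            """Y: (n_spots, n_scans) raw readings. Returns log-scale
            per-spot expression x and per-scan gains g (scan 0 baseline)."""
            ok = Y < SAT                                  # mask saturated
            L = np.where(ok, np.log(np.maximum(Y, 1.0)), np.nan)
            g = np.zeros(Y.shape[1])                      # log-gains
            for _ in range(n_iter):
                x = np.nanmean(L - g, axis=1)             # update spots
                g = np.nanmean(L - x[:, None], axis=0)    # update gains
                g -= g[0]                                 # identifiability
            return x, g

    Because a spot saturated in a high-gain scan is often unsaturated in a low-gain scan, pooling readings this way recovers information that a single scan censors.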

    Bayesian methods for non-Gaussian data modeling and applications

    Finite mixture models are among the most useful machine learning techniques and are receiving considerable attention in various applications. Their use in image and signal processing has proved valuable both for theoretical development and for practical applications. In most applications, the Gaussian density is used in the mixture modeling of data. Although a Gaussian mixture may provide a reasonable approximation to many real-world distributions, it is not always the best approximation, especially in image and signal processing applications, where we often deal with non-Gaussian data. In this thesis, we propose two novel approaches for modeling non-Gaussian data. These approaches use two highly flexible distributions, the generalized Gaussian distribution (GGD) and the general Beta distribution. We are motivated by the fact that these distributions can fit many distributional shapes and thus form a useful class of flexible models for problems involving measurements and features that deviate markedly from the Gaussian shape. For the mixture estimation and selection problem, researchers have demonstrated that Bayesian approaches are fully optimal: Bayesian learning allows the incorporation of prior knowledge in a formal, coherent way that avoids overfitting. For this reason, we adopt different Bayesian approaches to learn our models' parameters. First, we present a fully Bayesian approach to analyzing finite generalized Gaussian mixture models, which subsume several standard mixtures, such as the Laplace and the Gaussian. This approach evaluates the posterior distribution and Bayes estimators using a Gibbs sampling algorithm, and selects the number of components in the mixture using the integrated likelihood. We also propose a fully Bayesian approach for learning finite Beta mixtures using a Reversible Jump Markov Chain Monte Carlo (RJMCMC) technique that simultaneously performs cluster assignment, parameter estimation, and selection of the optimal number of clusters. We validate the proposed methods by applying them to different image processing applications.
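
    For a concrete, if simplified, view of the generalized Gaussian mixture itself, the sketch below fits a two-component GGD mixture by EM with fixed shape parameters; the thesis instead learns all parameters, including the number of components, with Gibbs sampling and RJMCMC:

        # Sketch: EM for a 2-component generalized Gaussian mixture with
        # fixed shapes beta (2 = Gaussian, 1 = Laplace), as a compact
        # stand-in for the model class. Not the thesis's Bayesian learner.
        import numpy as np
        from scipy.stats import gennorm

        def em_ggd_mixture(x, beta=(2.0, 1.0), n_iter=100):
            """Fit weights, locations, and scales of the two components."""
            w = np.array([0.5, 0.5])
            mu = np.percentile(x, [25, 75]).astype(float)
            sc = np.array([x.std(), x.std()])
            for _ in range(n_iter):
                # E-step: responsibilities under each component density
                dens = np.stack([w[k] * gennorm.pdf(x, beta[k], mu[k], sc[k])
                                 for k in range(2)])
                r = dens / (dens.sum(axis=0, keepdims=True) + 1e-300)
                # M-step (approximate location update; exact for beta = 2)
                w = r.mean(axis=1)
                mu = (r * x).sum(axis=1) / r.sum(axis=1)
                for k in range(2):
                    # Weighted MLE of the GGD scale given mu and beta:
                    # alpha = (beta * mean(|x - mu|^beta))^(1/beta)
                    sc[k] = (beta[k] * np.average(
                        np.abs(x - mu[k]) ** beta[k],
                        weights=r[k])) ** (1 / beta[k])
            return w, mu, sc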

    Data harmonisation for information fusion in digital healthcare: A state-of-the-art systematic review, meta-analysis and future research directions

    Removing the bias and variance of multicentre data has always been a challenge in large-scale digital healthcare studies, which require the ability to integrate clinical features extracted from data acquired with different scanners and protocols in order to improve stability and robustness. Previous studies have described various computational approaches to fusing single-modality multicentre datasets. However, these surveys rarely focused on evaluation metrics and lacked a checklist for computational data harmonisation studies. In this systematic review, we summarise the computational data harmonisation approaches for multi-modality data in the digital healthcare field, including harmonisation strategies and evaluation metrics grounded in different theories. In addition, we propose a comprehensive checklist that summarises common practices for data harmonisation studies, to guide researchers in reporting their findings more effectively. Finally, flowcharts presenting possible ways to select methodologies and metrics are proposed, and the limitations of different methods are surveyed to inform future research.
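
    As one baseline among the harmonisation strategies such reviews cover, per-centre location-scale alignment standardises each feature within each centre and maps it onto a reference centre's moments; this is a generic illustration, not a method proposed in the review:

        # Sketch: per-centre location-scale harmonisation of tabular
        # features onto a reference centre. A common baseline strategy,
        # shown here only to make "harmonisation" concrete.
        import numpy as np

        def location_scale_harmonise(X, site, ref_site):
            """X: (n_samples, n_features); site: centre label per sample.
            Rescale each centre's features to the reference centre's
            per-feature mean and standard deviation."""
            Xh = X.astype(float).copy()
            ref = site == ref_site
            mu_ref, sd_ref = X[ref].mean(axis=0), X[ref].std(axis=0)
            for s in np.unique(site):
                rows = site == s
                mu, sd = X[rows].mean(axis=0), X[rows].std(axis=0)
                Xh[rows] = ((X[rows] - mu) / np.where(sd > 0, sd, 1)
                            * sd_ref + mu_ref)
            return Xh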

    Detection of copy number variation from array intensity and sequencing read depth using a stepwise Bayesian model

    Background: Copy number variants (CNVs) have been demonstrated to occur at high frequency and are now widely believed to make a significant contribution to phenotypic variation in human populations. Array-based comparative genomic hybridization (array-CGH) and the newly developed read-depth approach based on ultrahigh-throughput genomic sequencing both provide rapid, robust, and comprehensive methods to identify CNVs on a whole-genome scale.

    Results: We developed a Bayesian statistical algorithm for the detection of CNVs from both types of genomic data. The algorithm can analyze data obtained from PCR-based bacterial artificial chromosome arrays, high-density oligonucleotide arrays, and more recently developed high-throughput DNA sequencing. Treating the parameters that define the underlying data-generating process (e.g., the number of CNVs, the position of each CNV, and the data noise level) as random variables, our approach derives the posterior distribution of the genomic CNV structure given the observed data. By sampling from the posterior distribution using a Markov chain Monte Carlo method, we obtain not only point estimates of these unknown parameters but also Bayesian credible intervals for them. We illustrate the characteristics of our algorithm by applying it to both synthetic and experimental data sets and comparing it to other segmentation algorithms.

    Conclusions: In particular, the synthetic-data comparison shows that our method is more sensitive than other approaches at low false-positive rates. Furthermore, given its Bayesian origin, our method can also serve to refine CNVs identified by fast point-estimate methods, and as a framework for integrating array-CGH and sequencing data with other CNV-related biological knowledge through informative priors.
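
    The posterior-over-breakpoints idea can be shown in miniature: for a single change point in a Gaussian-noise intensity or read-depth track, the posterior over its location can be computed by enumeration. The published algorithm instead uses MCMC over an unknown number of CNVs; sigma here is an assumed plug-in noise level:

        # Toy sketch: posterior over the location of one breakpoint in a
        # 1-D track (log2 ratios or normalised read depths), under a
        # Gaussian likelihood, a uniform prior over positions, and
        # plug-in segment-mean MLEs. Not the paper's full MCMC sampler.
        import numpy as np

        def breakpoint_posterior(y, sigma=0.3):
            """Returns P(breakpoint after index k | y) for k = 1..n-1."""
            n = len(y)
            loglik = np.full(n - 1, -np.inf)
            for k in range(1, n):
                left, right = y[:k], y[k:]
                resid = np.concatenate([left - left.mean(),
                                        right - right.mean()])
                loglik[k - 1] = -0.5 * np.sum(resid ** 2) / sigma**2
            post = np.exp(loglik - loglik.max())   # uniform prior over k
            return post / post.sum()

        # Usage: a credible interval for the breakpoint is read directly
        # off this posterior, which point-estimate segmenters do not give.
        # y = np.r_[np.random.normal(0, .3, 50), np.random.normal(1, .3, 50)]
        # print(breakpoint_posterior(y).argmax())   # near index 49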