21 research outputs found

    Denoising Individual Bias for Fairer Binary Submatrix Detection

    Get PDF
    Low rank representation of binary matrix is powerful in disentangling sparse individual-attribute associations, and has received wide applications. Existing binary matrix factorization (BMF) or co-clustering (CC) methods often assume i.i.d background noise. However, this assumption could be easily violated in real data, where heterogeneous row- or column-wise probability of binary entries results in disparate element-wise background distribution, and paralyzes the rationality of existing methods. We propose a binary data denoising framework, namely BIND, which optimizes the detection of true patterns by estimating the row- or column-wise mixture distribution of patterns and disparate background, and eliminating the binary attributes that are more likely from the background. BIND is supported by thoroughly derived mathematical property of the row- and column-wise mixture distributions. Our experiment on synthetic and real-world data demonstrated BIND effectively removes background noise and drastically increases the fairness and accuracy of state-of-the arts BMF and CC methods

    Geometric All-way Boolean Tensor Decomposition

    Get PDF
    Boolean tensor has been broadly utilized in representing high dimensional logical data collected on spatial, temporal and/or other relational domains. Boolean Tensor Decomposition (BTD) factorizes a binary tensor into the Boolean sum of multiple rank-1 tensors, which is an NP-hard problem. Existing BTD methods have been limited by their high computational cost, in applications to large scale or higher order tensors. In this work, we presented a computationally efficient BTD algorithm, namely Geometric Expansion for all-order Tensor Factorization (GETF), that sequentially identifies the rank-1 basis components for a tensor from a geometric perspective. We conducted rigorous theoretical analysis on the validity as well as algorithemic efficiency of GETF in decomposing all-order tensor. Experiments on both synthetic and real-world data demonstrated that GETF has significantly improved performance in reconstruction accuracy, extraction of latent structures and it is an order of magnitude faster than other state-of-the-art methods

    M3S: a comprehensive model selection for multi-modal single-cell RNA sequencing data

    Get PDF
    Background Various statistical models have been developed to model the single cell RNA-seq expression profiles, capture its multimodality, and conduct differential gene expression test. However, for expression data generated by different experimental design and platforms, there is currently lack of capability to determine the most proper statistical model. Results We developed an R package, namely Multi-Modal Model Selection (M3S), for gene-wise selection of the most proper multi-modality statistical model and downstream analysis, useful in a single-cell or large scale bulk tissue transcriptomic data. M3S is featured with (1) gene-wise selection of the most parsimonious model among 11 most commonly utilized ones, that can best fit the expression distribution of the gene, (2) parameter estimation of a selected model, and (3) differential gene expression test based on the selected model. Conclusion A comprehensive evaluation suggested that M3S can accurately capture the multimodality on simulated and real single cell data. An open source package and is available through GitHub at https://github.com/zy26/M3S

    Fast and Efficient Boolean Matrix Factorization by Geometric Segmentation

    Get PDF
    Boolean matrix has been used to represent digital information in many fields, including bank transaction, crime records, natural language processing, protein-protein interaction, etc. Boolean matrix factorization (BMF) aims to find an approximation of a binary matrix as the Boolean product of two low rank Boolean matrices, which could generate vast amount of information for the patterns of relationships between the features and samples. Inspired by binary matrix permutation theories and geometric segmentation, we developed a fast and efficient BMF approach, called MEBF (Median Expansion for Boolean Factorization). Overall, MEBF adopted a heuristic approach to locate binary patterns presented as submatrices that are dense in 1's. At each iteration, MEBF permutates the rows and columns such that the permutated matrix is approximately Upper Triangular-Like (UTL) with so-called Simultaneous Consecutive-ones Property (SC1P). The largest submatrix dense in 1 would lie on the upper triangular area of the permutated matrix, and its location was determined based on a geometric segmentation of a triangular. We compared MEBF with other state of the art approaches on data scenarios with different density and noise levels. MEBF demonstrated superior performances in lower reconstruction error, and higher computational efficiency, as well as more accurate density patterns than popular methods such as ASSO, PANDA and Message Passing. We demonstrated the application of MEBF on both binary and non-binary data sets, and revealed its further potential in knowledge retrieving and data denoising

    LTMG: a novel statistical modeling of transcriptional expression states in single-cell RNA-Seq data

    Get PDF
    A key challenge in modeling single-cell RNA-seq data is to capture the diversity of gene expression states regulated by different transcriptional regulatory inputs across individual cells, which is further complicated by largely observed zero and low expressions. We developed a left truncated mixture Gaussian (LTMG) model, from the kinetic relationships of the transcriptional regulatory inputs, mRNA metabolism and abundance in single cells. LTMG infers the expression multi-modalities across single cells, meanwhile, the dropouts and low expressions are treated as left truncated. We demonstrated that LTMG has significantly better goodness of fitting on an extensive number of scRNA-seq data, comparing to three other state-of-the-art models. Our biological assumption of the low non-zero expressions, rationality of the multimodality setting, and the capability of LTMG in extracting expression states specific to cell types or functions, are validated on independent experimental data sets. A differential gene expression test and a co-regulation module identification method are further developed. We experimentally validated that our differential expression test has higher sensitivity and specificity, compared with other five popular methods. The co-regulation analysis is capable of retrieving gene co-regulation modules corresponding to perturbed transcriptional regulations. A user-friendly R package with all the analysis power is available at https://github.com/zy26/LTMGSCA

    A data denoising approach to optimize functional clustering of single cell RNA-sequencing data

    Get PDF
    Single cell RNA-sequencing (scRNA-seq) technology enables comprehensive transcriptomic profiling of thousands of cells with distinct phenotypic and physiological states in a complex tissue. Substantial efforts have been made to characterize single cells of distinct identities from scRNA-seq data, including various cell clustering techniques. While existing approaches can handle single cells in terms of different cell (sub)types at a high resolution, identification of the functional variability within the same cell type remains unsolved. In addition, there is a lack of robust method to handle the inter-subject variation that often brings severe confounding effects for the functional clustering of single cells. In this study, we developed a novel data denoising and cell clustering approach, namely CIBS, to provide biologically explainable functional classification for scRNA-seq data. CIBS is based on a systems biology model of transcriptional regulation that assumes a multi-modality distribution of the cells' activation status, and it utilizes a Boolean matrix factorization approach on the discretized expression status to robustly derive functional modules. CIBS is empowered by a novel fast Boolean Matrix Factorization method, namely PFAST, to increase the computational feasibility on large scale scRNA-seq data. Application of CIBS on two scRNA-seq datasets collected from cancer tumor micro-environment successfully identified subgroups of cancer cells with distinct expression patterns of epithelial-mesenchymal transition and extracellular matrix marker genes, which was not revealed by the existing cell clustering analysis tools. The identified cell groups were significantly associated with the clinically confirmed lymph-node invasion and metastasis events across different patients

    Modulation of Immune Infiltration of Ovarian Cancer Tumor Microenvironment by Specific Subpopulations of Fibroblasts

    Get PDF
    Simple Summary The ovarian cancer tumor microenvironment is made up of ovarian cancer cells along with a milieu of proteins and normal cells, including fibroblasts, immune cells, endothelial cells, pericytes and adipocytes. The noncancer components also play an important role in determining the fate of the tumor and exhibit a lot of heterogeneity. In this study, we have used a deconvolution algorithm to identify four different fibroblast subpopulations and multiple immune cell types, from bulk RNA-seq data of ovarian cancer primary tumors, metastases and normal omentum. We report the prevalence of specific fibroblast subtypes that determine the tumor-immune microenvironment. Our study can potentially help provide a template for identification of potential combination therapies to enhance the efficacy of ovarian cancer immunotherapies. Abstract Tumor immune infiltration plays a key role in the progression of solid tumors, including ovarian cancer, and immunotherapies are rapidly emerging as effective treatment modalities. However, the role of cancer-associated fibroblasts (CAFs), a predominant stromal constituent, in determining the tumor-immune microenvironment and modulating efficacy of immunotherapies remains poorly understood. We have conducted an extensive bioinformatic analysis of our and other publicly available ovarian cancer datasets (GSE137237, GSE132289 and GSE71340), to determine the correlation of fibroblast subtypes within the tumor microenvironment (TME) with the characteristics of tumor-immune infiltration. We identified (1) four functional modules of CAFs in ovarian cancer that are associated with the TME and metastasis of ovarian cancer, (2) immune-suppressive function of the collagen 1,3,5-expressing CAFs in primary ovarian cancer and omental metastases, and (3) consistent positive correlations between the functional modules of CAFs with anti-immune response genes and negative correlation with pro-immune response genes. Our study identifies a specific fibroblast subtype, fibroblast functional module (FFM)2, in the ovarian cancer tumor microenvironment that can potentially modulate a tumor-promoting immune microenvironment, which may be detrimental toward the effectiveness of ovarian cancer immunotherapies

    Spatially and Robustly Hybrid Mixture Regression Model for Inference of Spatial Dependence

    Get PDF
    In this paper, we propose a Spatial Robust Mixture Regression model to investigate the relationship between a response variable and a set of explanatory variables over the spatial domain, assuming that the relationships may exhibit complex spatially dynamic patterns that cannot be captured by constant regression coefficients. Our method integrates the robust finite mixture Gaussian regression model with spatial constraints, to simultaneously handle the spatial non-stationarity, local homogeneity, and outlier contaminations. Compared with existing spatial regression models, our proposed model assumes the existence a few distinct regression models that are estimated based on observations that exhibit similar response-predictor relationships. As such, the proposed model not only accounts for non-stationarity in the spatial trend, but also clusters observations into a few distinct and homogenous groups. This provides an advantage on interpretation with a few stationary sub-processes identified that capture the predominant relationships between response and predictor variables. Moreover, the proposed method incorporates robust procedures to handle contaminations from both regression outliers and spatial outliers. By doing so, we robustly segment the spatial domain into distinct local regions with similar regression coefficients, and sporadic locations that are purely outliers. Rigorous statistical hypothesis testing procedure has been designed to test the significance of such segmentation. Experimental results on many synthetic and real-world datasets demonstrate the robustness, accuracy, and effectiveness of our proposed method, compared with other robust finite mixture regression, spatial regression and spatial segmentation methods
    corecore