Additional file 1 of Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans
Additional file 1:
Figure S1 Filtering of ChIP-seq samples. A: Schematic overview of ChIP-seq sample filtering. B: Violin plot showing the AUROC of the prediction of the top 10% PWM-supported k-mers based on the MOCCS2score. The red violin plot represents all CTCF ChIP-seq samples, the green plot represents soft-filtered CTCF ChIP-seq samples, and the blue plot represents hard-filtered CTCF ChIP-seq samples. High-quality ChIP-seq samples with high AUROC scores were retained after hard filtering. C: Distribution of each quality control metric of ChIP-seq sample filtering for samples that passed the hard filter (pink) and others (blue). D: Bar plots showing the number of ChIP-seq samples that passed the soft and hard filters. Bars are colored according to cell type classes or TFs.
Figure S2 Simulation of significant k-mer detection. A: The procedure for generating simulated datasets. Simulated data were generated by embedding a “true significant k-mer” within random sequences, MOCCS2 was applied to these data, and the q-values of the MOCCS2score were calculated for each k-mer. B: Parameters for each simulation condition from #1 to #5. α is the percentage of input sequences containing embedded “true significant k-mers”, N is the number of peaks in a ChIP-seq sample, and σ is the standard deviation of the position of the embedded “true significant k-mers” from the center of the peak. C: Simulation results for significant k-mer detection. The sensitivity, specificity, and FDR for detecting “true significant k-mers” are shown for different parameter settings.
Figure S3 Number of peaks and significant k-mers in MOCCS profiles. A: Number of peaks in MOCCS profiles. The x-axis represents the base-10 logarithm of the number of peaks and the y-axis represents the number of ChIP-seq samples. B: Relationship between the number of peaks and significant k-mers in MOCCS profiles (left, q < 0.05; right, q < 0.01).
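The simulation described for Figure S2 can be illustrated with a minimal sketch. This is not the authors' code: the function names, the clamping of positions to the sequence bounds, and the sequence length of 2W + 1 with W = 350 (borrowed from the Figure S8 description) are assumptions made for illustration. It embeds a chosen k-mer in α% of N random sequences at positions drawn from a normal distribution around the peak center, then scores a set of called k-mers against the known truth.

```python
import random


def simulate_peaks(kmer, n_peaks, alpha_percent, sigma, w=350, seed=0):
    """Return n_peaks random DNA sequences of length 2*w + 1.

    The first alpha_percent % of the sequences get `kmer` embedded so that
    its midpoint is drawn from Normal(center, sigma). Positions falling
    outside the sequence are clamped (an assumption of this sketch).
    """
    rng = random.Random(seed)
    length = 2 * w + 1
    n_embedded = round(n_peaks * alpha_percent / 100)
    seqs = []
    for i in range(n_peaks):
        seq = [rng.choice("ACGT") for _ in range(length)]
        if i < n_embedded:
            start = round(rng.gauss(w, sigma)) - len(kmer) // 2
            start = max(0, min(length - len(kmer), start))
            seq[start:start + len(kmer)] = kmer  # overwrite in place
        seqs.append("".join(seq))
    return seqs


def detection_metrics(called, truth, universe):
    """Sensitivity, specificity, and FDR for a set of called k-mers."""
    tp = len(called & truth)
    fp = len(called - truth)
    fn = len(truth - called)
    tn = len(universe - called - truth)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, fdr
```

In use, `called` would be the k-mers whose MOCCS2score q-value falls below a chosen threshold, and `universe` would be all 4^k possible k-mers.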
Figure S4 Similarities in MOCCS profiles and peak locations for sample pairs of same or different TFs. A: Comparison of k-sim Jaccard, Pearson, and peak overlap indices (a-c: groups of the same cell types). B: Two-dimensional density plot of k-sim Jaccard or Pearson with the peak overlap index (a-c: groups of the same cell types). C: Correlation coefficient of k-sim Jaccard or Pearson with the peak overlap index in each group. The y-axis indicates Spearman's correlation coefficient. Red and blue indicate k-sim Pearson and Jaccard values, respectively (a-c: groups of the same cell types).
Figure S5 Similarities in MOCCS profiles and peak locations for sample pairs of same or different cell types. A: Comparison of the k-sim Jaccard, Pearson, and peak overlap indices (a, d, and e: groups of the same TFs). B: Two-dimensional density plot of k-sim Jaccard or Pearson with the peak overlap index (a, d, and e: groups of the same TFs). C: Correlation coefficient of k-sim Jaccard or Pearson with the peak overlap index in each group. The y-axis indicates Spearman's correlation coefficient. Red and blue indicate k-sim Pearson and Jaccard values, respectively (a, d, and e: groups of the same TFs).
Figure S6 Heat maps of cell type-dependent TFs. The heat map color indicates the k-sim Jaccard value for the 33 cell type-dependent TFs. The color labels of the heat maps indicate the cell type classes. Cell type classes with only a single ChIP-seq sample were excluded from the visualization. Asterisks indicate a statistically significant difference between ChIP-seq sample pairs of the same and different cell type classes (Mann–Whitney U test, p < 0.05).
Figure S7 Violin plots of all cell type-dependent TFs. The y-axis indicates the k-sim Jaccard value. The same and different groups are arranged along the x-axis. Asterisks indicate a statistically significant difference between ChIP-seq sample pairs of the same and different cell type classes (Mann–Whitney U test, p < 0.05).
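The captions above do not define the k-sim indices. The sketch below assumes that k-sim Jaccard is the Jaccard index between the sets of significant k-mers of two ChIP-seq samples, and that k-sim Pearson is the Pearson correlation between their MOCCS2scores over shared k-mers. Both the definitions and the function names are assumptions for illustration and may differ from the authors' implementation.

```python
from math import sqrt


def ksim_jaccard(sig_a, sig_b):
    """Jaccard index between two collections of significant k-mers."""
    a, b = set(sig_a), set(sig_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ksim_pearson(scores_a, scores_b):
    """Pearson correlation of scores over k-mers present in both dicts."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    if len(shared) < 2:
        raise ValueError("need at least two shared k-mers")
    x = [scores_a[k] for k in shared]
    y = [scores_b[k] for k in shared]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    vx = sum((xi - mx) ** 2 for xi in x)
    vy = sum((yi - my) ** 2 for yi in y)
    if vx == 0 or vy == 0:
        raise ValueError("scores are constant; correlation is undefined")
    return cov / sqrt(vx * vy)
```

Under these assumptions, the Jaccard variant depends only on which k-mers pass the significance threshold, while the Pearson variant uses the full score profile.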
Figure S8 Simulation of differential k-mer detection. A: Simulated data processing. Simulated data containing an embedded “true differential k-mer” and “true significant k-mer” were prepared by embedding a “true” k-mer within α% of a randomly generated sample of DNA sequences, each 2W + 1 bp long (W = 350), and MOCCS2 was applied to these data. “True significant k-mers” were embedded following a normal distribution whose mean was W + 1 and whose standard deviation was σ. “True differential k-mers” were embedded in S1 (or S2) in the same way as “true significant k-mers,” and were embedded in S2 (or S1) following a uniform distribution whose mean was 1 and whose standard deviation was (2 × W + 1) − (k − 1). Note that k was set to 6. B: Parameters for each simulation condition from #1 to #5. L is the number of differential k-mers and m is the number of significant k-mers.
Figure S9 ΔMOCCS2score profiles were consistent with the in vitro SNP-SELEX and PWM motif fold change. A: Spearman's correlation coefficient between PBS (SNP-SELEX) and ΔMOCCS2score in each TF for the original and permuted data. Red points indicate the original Spearman's correlation coefficient, and blue points indicate the permuted data. B: Difference in ΔMOCCS2score profile consistency among the positions of SNPs in k-mers. The kth SNP position indicates the kth allele from the left of the k-mer. C: The ΔMOCCS2score is consistent with the PWM motif fold change.
Figure S10 Number of peak-overlapping GWAS-SNPs with significant ΔMOCCS2scores. Number of peak-overlapping GWAS-SNPs in each ChIP-seq sample. Each bar represents a ChIP-seq sample, and the y-axis represents the number of peak-overlapping GWAS-SNPs. The red fraction represents the number of peak-overlapping GWAS-SNPs with significant ΔMOCCS2scores (q < 0.05), and the gray fraction represents the number of GWAS-SNPs with non-significant ΔMOCCS2scores.
Figure S11 Prediction of SNP-affected TFs and cell type classes using ΔMOCCS2score profiles. Top ChIP-seq samples with high ΔMOCCS2scores in each phenotype (IBD, inflammatory bowel disease; CD, Crohn's disease; MS, multiple sclerosis; SLE, systemic lupus erythematosus). The ΔMOCCS2score was calculated for each SNP and ChIP-seq sample. Bar graph colors represent TFs or cell type classes.
Figure S12 Association between the allele frequency and ΔMOCCS2score. Association between the allele frequency and (A) the absolute values of the ΔMOCCS2score or (B) the ratio of SNPs with significant ΔMOCCS2scores in each phenotype (IBD, inflammatory bowel disease; CD, Crohn's disease; MS, multiple sclerosis; SLE, systemic lupus erythematosus).
Figure S13 Accuracy of detecting canonical motifs using MOCCS2score for different k. AUROC for detecting canonical PWM motifs using the MOCCS2score for different values of k. The x-axis represents the ratio of PWM-supported k-mers among all k-mers and the y-axis represents the AUROC. The colors of the violin plots represent the different k values.
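Figures S1 and S13 report the AUROC for predicting PWM-supported k-mers from the MOCCS2score. As a rough illustration of what that number measures, the sketch below computes the AUROC as the probability that a randomly chosen positive k-mer scores higher than a randomly chosen negative one, with ties counted as one half. The function name and inputs are hypothetical, and this is not the authors' evaluation code.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison of positive and negative scores.

    scores: one score per k-mer (e.g. a MOCCS2score).
    labels: truthy for PWM-supported k-mers, falsy otherwise.
    """
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count pairs where the positive outranks the negative; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of k-mers, which is acceptable for small k; a rank-based formulation would scale better for larger k.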