Construction of the Somatic Mutation (SOM) model for liver cancer.
<p><b>A</b>. Relative density of somatic mutations from whole-genome sequences of 88 liver tumors [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004583#pcbi.1004583.ref011" target="_blank">11</a>], associated with different genome features (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004583#sec007" target="_blank">Methods</a> for feature details). Mutation density is normalized so that the whole-genome average is 1. PC gene: protein coding gene; CDS: coding sequence; Exon.P, Intron.P, Exon.L, Intron.L: exons and introns of protein coding genes and lncRNAs, respectively; CR: conserved region; DNase: DNase I hypersensitive site; ECS: evolutionarily conserved structure; ncExon: non-coding exon; PC gene.HE, LncRNA.HE, PC gene.LE, LncRNA.LE: highly and lowly expressed protein coding genes and lncRNAs; PC gene.early, LncRNA.early, PC gene.late, LncRNA.late: early- and late-replicated protein coding genes and lncRNAs; cTFBS: conserved transcription factor binding site; RR H, RR L, GC H, GC L, DNA.met H, DNA.met L: 1-Kb windows with high recombination rate (> 4.0), low recombination rate (< 0.5), high GC content (GC % > 50%), low GC content (GC % < 30%), high DNA methylation (average value > 0.7245) and low DNA methylation (average value < 0.4062), respectively. Blue and red dotted lines: baselines showing average values for CDS and intergenic regions, respectively. <b>B</b>. Feature importance as measured by IncNodePurity. Only features that passed feature selection are shown. <b>C</b>. Distribution of SOM scores for neutral SNPs and for clinical variants from two databases of disease-causing variants, Clivariant and HGMD. Neutral SNPs here are SNPs from the 1000 Genomes Project with allele frequency higher than 0.01. SOM scores predicted by the random forest model were divided by the number of patients. <b>D</b>. Correlation of SOM scores with densities of disease-causing variants. Genome positions were sorted by SOM score and split into 100 Mb intervals. The plots show the average SOM score and the density of disease-causing variants for each interval. The purple dotted line shows the cutoff used for defining low SOM score thereafter.</p>
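The sort-and-bin procedure described for panel D can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `bin_by_score`, the list-based inputs, and the toy values are assumptions, and bins here hold a fixed number of positions rather than a fixed number of megabases.

```python
from statistics import mean


def bin_by_score(scores, is_disease, bin_size):
    """Sort positions by score, cut them into consecutive bins of
    `bin_size` positions, and return (mean score, disease-variant
    density) for each bin.

    scores     -- one score per genome position (hypothetical input)
    is_disease -- 1 if the position carries a disease-causing variant, else 0
    bin_size   -- number of positions per bin (stand-in for the Mb interval)
    """
    # Indices of positions ordered from lowest to highest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = []
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        avg_score = mean(scores[i] for i in idx)
        density = sum(is_disease[i] for i in idx) / len(idx)
        bins.append((avg_score, density))
    return bins


# Toy data: six positions, three of them disease-causing.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
is_disease = [0, 1, 0, 1, 0, 1]
result = bin_by_score(scores, is_disease, bin_size=3)
```

In this toy input the low-score bin contains all three disease-causing positions, which is the kind of trend panel D is meant to display.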
Relationship between SNP and SOM scores in liver cancer.
<p>Grey dots: 1 million random genome positions; cyan contour: HGMD disease-causing variant positions; red contour: Clivariant positions. The top and right curves show the marginal distributions of SNP scores (top) and SOM scores (right) for random genome positions and for HGMD and Clivariant disease-causing variant positions. Dotted lines mark the cutoff values that define hypomutated/hypermutated regions: SNP score cutoff = 0.63 (98.16 Mb above the cutoff) and SOM score cutoff = 3.10 variants/Mb (55.67 Mb below the cutoff) in liver cancer. Hypomutated regions defined by both cutoffs correspond to ~56 Mb in liver cancer.</p>
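A minimal sketch of how the two cutoffs could be combined, assuming that "defined by both cutoffs" means a region must have an SNP score above 0.63 and a SOM score below 3.10 variants/Mb. The cutoff values come from the caption; the window representation, the function names, the strict inequalities, and the toy data are assumptions made for illustration.

```python
SNP_CUTOFF = 0.63  # SNP score cutoff, from the caption
SOM_CUTOFF = 3.10  # SOM score cutoff in variants/Mb, from the caption


def is_hypomutated(snp_score, som_score):
    """Assumed rule: high SNP score AND low SOM score."""
    return snp_score > SNP_CUTOFF and som_score < SOM_CUTOFF


def hypomutated_span_mb(windows):
    """Total length, in Mb, of windows passing both cutoffs.

    windows -- iterable of (length_bp, snp_score, som_score) tuples
               (a hypothetical layout, not the authors' data format)
    """
    total_bp = sum(
        length for length, snp, som in windows if is_hypomutated(snp, som)
    )
    return total_bp / 1e6


# Toy windows: only the first and last pass both cutoffs.
windows = [
    (1_000_000, 0.70, 2.0),
    (2_000_000, 0.50, 2.0),  # SNP score too low
    (500_000, 0.80, 4.0),    # SOM score too high
    (1_500_000, 0.65, 3.0),
]
span = hypomutated_span_mb(windows)
```

Summing window lengths this way is how a figure such as "~56 Mb" would be obtained from per-window scores under the stated assumption.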
Construction of the rare SNP model.
<p><b>A.</b> Fraction of rare SNPs (allele frequency < 0.01) according to different genome features (see <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004583#pcbi.1004583.s008" target="_blank">S1 Table</a> and <a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004583#sec007" target="_blank">Methods</a> for feature details). Each box shows the rare SNP fraction across all human chromosomes except chromosome Y. CDS: coding sequence; cTFBS: conserved transcription factor binding site; CR: evolutionarily conserved region; UTR: untranslated region; Sensitive: region with a high rate of rare SNPs, as defined in [<a href="http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1004583#pcbi.1004583.ref010" target="_blank">10</a>]; ER/LR: early- and late-replicated region; DNase: DNase I hypersensitive site; HE/LE: highly and lowly expressed region; Intron L/Intron P: intron of lncRNA/of protein coding gene; ncExon: non-coding exon; ECS: evolutionarily conserved structure; RR H/RR L/GC H/GC L: high recombination rate, low recombination rate, high GC content and low GC content regions. The red dotted line represents the average fraction of rare SNPs across the genome. <b>B.</b> Feature importance as measured by IncNodePurity. Only features that passed feature selection are shown. <b>C.</b> Distribution of SNP scores for random SNPs and for clinical variants from the Clivariant and HGMD databases. Random SNPs here are a set of 1M random intergenic SNPs from the 1000 Genomes Project. <b>D.</b> Correlation of SNP scores with densities of disease-causing variants. Genome positions were sorted by SNP score and split into 20 Mb intervals. The plots show the average SNP score and the density of disease-causing variants for each interval. The purple dotted line shows the cutoff used for defining high SNP score thereafter.</p>
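The quantity plotted in panel A, the fraction of SNPs with allele frequency below 0.01 within each genome feature, can be sketched as below. The 0.01 threshold is from the caption; the function name, the (feature, allele frequency) input layout, and the toy records are assumptions, and the per-chromosome grouping used for the box plots is omitted.

```python
from collections import defaultdict

RARE_AF = 0.01  # rare-SNP allele-frequency threshold, from the caption


def rare_fraction_by_feature(snps):
    """Return {feature: fraction of SNPs with allele frequency < RARE_AF}.

    snps -- iterable of (feature_name, allele_frequency) pairs
            (a hypothetical layout, not the authors' data format)
    """
    total = defaultdict(int)
    rare = defaultdict(int)
    for feature, allele_freq in snps:
        total[feature] += 1
        if allele_freq < RARE_AF:
            rare[feature] += 1
    return {feature: rare[feature] / total[feature] for feature in total}


# Toy records: three of four CDS SNPs are rare, one of two intronic SNPs.
snps = [
    ("CDS", 0.001),
    ("CDS", 0.005),
    ("CDS", 0.2),
    ("CDS", 0.0001),
    ("Intron P", 0.3),
    ("Intron P", 0.002),
]
fractions = rare_fraction_by_feature(snps)
```

Running the same computation once per chromosome would give the distribution of values summarized by each box.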