Comparing protein-coding and noncoding genic intolerance scores.
<p>To enable a matched comparison, the estimates in this table are based on a set of 14,567 CCDS genes with assessable scores across the RVIS-CHGV, ncRVIS and ncGERP formulations. Both RVIS-CHGV and ncRVIS are based on the same population of 690 whole-genome sequenced samples from the CHGV.</p><p><sup>a</sup>HI = Haploinsufficiency. To obtain the presented levels of significance, we used a logistic regression model to regress the presence or absence of a gene within the corresponding gene list on each of the genic scores.</p><p>Joint Model: The AUC of a combined logistic regression model that uses all three features. Correlation plots for the pairs of scores are available in <a href="http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1005492#pgen.1005492.s001" target="_blank">S1 Fig</a>.</p>
Receiver operating characteristic (ROC) curves measuring the ability of RVIS-CHGV, ncRVIS, pcGERP, ncGERP, ncCADD, ncGWAVA scores and two joint models to discriminate genes reported in ClinGen’s dosage sensitivity map from the rest of the human genome.
<p>Here, for a given score, all assessable genes were used. To obtain the presented levels of significance, we used a logistic regression model to regress the presence or absence of a gene in the ClinGen dosage sensitivity map list on each of the genic scores.</p>
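<p>As an illustration of what the area under a ROC curve summarizes, the sketch below computes the AUC of a single score directly, as the probability that a randomly chosen listed gene outscores a randomly chosen unlisted gene (ties count half). It is not the authors' logistic regression pipeline. The function name <code>auc</code> and the toy values are assumptions made for this example, and the direction of the score (higher = more likely listed) is also assumed.</p>

```python
def auc(scores_listed, scores_unlisted):
    """Rank-based AUC: P(listed score > unlisted score), ties counted as 0.5."""
    wins = 0.0
    for p in scores_listed:
        for n in scores_unlisted:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_listed) * len(scores_unlisted))


# Hypothetical toy scores, not data from the study.
listed = [0.9, 0.8, 0.4]
unlisted = [0.7, 0.3, 0.2, 0.1]
toy_auc = auc(listed, unlisted)  # 11 of the 12 pairs are ordered correctly
print(round(toy_auc, 4))
```

<p>For a score where lower values indicate membership, negate the scores before calling the function.</p>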
(A) Distribution of ncRVIS scores for the 1,235 loss-of-function deficient genes (left) compared to the 1,762 loss-of-function control genes (right). Median 37.95% vs. 58.09%; Mann-Whitney U test, p = 6.6 × 10<sup>-34</sup>. (B) Receiver operating characteristic (ROC) curves measuring the ability of RVIS, ncRVIS, pcGERP and ncGERP to discriminate between loss-of-function deficient and loss-of-function control genes.
<p>To obtain the presented levels of significance in <b>(B)</b>, we used a logistic regression model to regress loss-of-function deficient or control gene status for the combined 2,997 genes on each of the four genic scores.</p>
Overlaid histograms of ncGERP (blue) and pcGERP (red).
<p>These data show that the two scores form very different genome-wide distributions (medians: ncGERP -0.02 versus pcGERP 2.64). Moreover, pcGERP has a slightly platykurtic, left-skewed distribution (γ<sub>2</sub> = -0.10, γ<sub>1</sub> = -0.66), whereas ncGERP has a more leptokurtic, right-skewed distribution (γ<sub>2</sub> = 0.97, γ<sub>1</sub> = 0.96).</p>
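<p>The shape statistics quoted above can be reproduced in spirit with the following minimal sketch, which assumes that the two subscripted values are the skewness and the excess kurtosis (kurtosis minus 3), and that plain population moments are used. The authors may have used bias-corrected estimators; the function name <code>shape_statistics</code> and the toy data are hypothetical.</p>

```python
def shape_statistics(values):
    """Return (skewness, excess kurtosis) computed from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5          # negative: left-skewed, positive: right-skewed
    excess_kurtosis = m4 / m2 ** 2 - 3  # negative: platykurtic, positive: leptokurtic
    return skewness, excess_kurtosis


# Hypothetical toy data: a symmetric sample and a right-skewed sample.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 10]
print(shape_statistics(symmetric))
print(shape_statistics(right_skewed))
```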
(A) Scatterplot of RVIS-sum (RVIS-CHGV + ncRVIS) and RVIS-diff (RVIS-CHGV − ncRVIS) scores. Each dot represents a gene. The grey dots represent the background genome-wide distribution. The red dots highlight the 82 OMIM haploinsufficiency genes with reported causal de novo mutations. A higher (positive Y-axis value) RVIS-diff score indicates genes where we might have a greater expectation of gene dosage aberrations being important compared with protein structure aberrations. A lower RVIS-sum (X-axis value) highlights genes that are increasingly intolerant in both their noncoding and protein-coding sequence. (B) A cumulative percentage plot for the RVIS-sum percentile accommodating the 82 OMIM haploinsufficiency genes.
<p>At any given point on the X-axis (RVIS-sum percentile) we can determine what percentage of the 82 OMIM haploinsufficiency genes are accounted for.</p>
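<p>A cumulative percentage curve of this kind can be sketched as follows: for each percentile cutoff, count the share of genes in the list whose percentile falls at or below it. The function name and the five toy percentiles are assumptions for illustration and are not values from the study.</p>

```python
def cumulative_percentage(gene_percentiles, cutoff):
    """Percentage of listed genes whose percentile is at or below the cutoff."""
    covered = sum(1 for p in gene_percentiles if p <= cutoff)
    return 100.0 * covered / len(gene_percentiles)


# Hypothetical percentiles for five genes in a gene list.
toy_percentiles = [2.0, 5.5, 12.0, 30.0, 75.0]
curve = [(x, cumulative_percentage(toy_percentiles, x)) for x in (10, 25, 50, 100)]
print(curve)
```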
A regression plot that shows the regression of noncoding polymorphisms (Y) on an estimate of the noncoding sequence mutability (X) (S1 Data).
<p>Each dot represents the position of a gene in the regression plot and the corresponding regression line is provided. Annotations are made for the 5% extremes: red = 5% most intolerant, blue = 5% most tolerant.</p>
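<p>The idea of regressing observed polymorphism counts on mutability and annotating the extremes can be sketched as below. This is a simplified illustration, not the published scoring procedure: it uses ordinary least squares with raw residuals, and it assumes that the most negative residuals (fewer polymorphisms than the line predicts) correspond to the most intolerant genes. All names and the toy data are hypothetical.</p>

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def flag_extremes(xs, ys, fraction=0.05):
    """Return residuals plus the index sets of the lowest and highest residuals."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    order = sorted(range(len(xs)), key=lambda i: residuals[i])
    k = max(1, int(len(xs) * fraction))
    intolerant = set(order[:k])   # assumed: fewer polymorphisms than expected
    tolerant = set(order[-k:])    # assumed: more polymorphisms than expected
    return residuals, intolerant, tolerant


# Hypothetical toy data: 20 genes on a line, with two deliberate outliers.
xs = list(range(20))
ys = [2 * x + 1 for x in xs]
ys[3] -= 10
ys[15] += 10
residuals, intolerant, tolerant = flag_extremes(xs, ys)
print(intolerant, tolerant)
```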
Recovery of Unknown TE
<p>Although not found in Repbase, we believe that this ReAS TE is a valid reconstruction, because it has a BlastX match with 98% identity over 869 amino acids to a TE-related protein (gi|34896386|ref|NP_909537.1| Putative mutator like transposase) that is annotated in a GenBank clone.</p>
TEs within Segmental Duplications
<p>If the duplication is of sufficiently high copy number, it will be assembled as a “ReAS TE,” and what remains is to find the boundaries of the TEs within this assembled duplication. On the assumption that TEs have much higher copy numbers, TE boundaries can be identified by sudden changes in depth, accompanied by many partially aligned reads.</p>
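<p>One way to picture boundary detection from sudden changes in depth is the sketch below, which compares the mean depth in a window to the left and right of each position and reports one position per jump. It uses depth only and ignores the partially aligned reads mentioned above. The function name, window size, and ratio threshold are assumptions chosen for this toy example and are not ReAS parameters.</p>

```python
def depth_boundaries(depth, window=3, min_ratio=3.0):
    """Return one position per run of positions where adjacent-window depth jumps."""
    best = []   # (position, ratio) with the largest ratio in each consecutive run
    prev = None
    for i in range(window, len(depth) - window + 1):
        left = sum(depth[i - window:i]) / window
        right = sum(depth[i:i + window]) / window
        ratio = max(left, right) / max(min(left, right), 1e-9)
        if ratio < min_ratio:
            continue
        if prev is not None and i == prev + 1:
            if ratio > best[-1][1]:
                best[-1] = (i, ratio)
        else:
            best.append((i, ratio))
        prev = i
    return [pos for pos, _ in best]


# Hypothetical depth profile: a high-copy element embedded in a lower-copy duplication.
toy_depth = [5] * 6 + [50] * 6 + [5] * 6
found = depth_boundaries(toy_depth)
print(found)
```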
The ReAS Algorithm
<p>We start by computing <i>K-</i>mer depth, which is the number of times that a <i>K-</i>mer appears in the shotgun data. Copy number refers to how often a <i>K-</i>mer appears in the assembled genome. Depth divided by copy number is the coverage. We seed the process using a randomly chosen high-depth <i>K-</i>mer. All shotgun reads containing this <i>K-</i>mer are retrieved and trimmed into 100-bp segments centered at that <i>K-</i>mer. When the sequence identity between them exceeds a preset threshold, they are assembled into an ICS using ClustalW. We perform an iterative extension by selecting high-depth <i>K-</i>mers at both ends of the ICS and repeating the above procedure. After all such extensions are done, clone-end pairing information is used to resolve ambiguous joins and to break misassemblies, but not to join fragmented assemblies. The final consensus is our ReAS TE.</p>
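<p>The first steps described above (counting <i>K-</i>mer depth, picking a high-depth seed, and trimming the reads that contain it) can be sketched as follows. This is a toy illustration under stated assumptions, not the ReAS implementation: it ignores reverse complements, picks the single highest-depth <i>K-</i>mer rather than a random high-depth one, uses 9-bp instead of 100-bp segments, and stops before the identity filtering and ClustalW assembly. All function names and reads are hypothetical.</p>

```python
from collections import Counter


def kmer_depth(reads, k):
    """Count how many times each K-mer occurs across all shotgun reads."""
    depth = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            depth[read[i:i + k]] += 1
    return depth


def coverage(depth, copy_number):
    """Coverage as defined in the text: depth divided by copy number."""
    return depth / copy_number


def segments_around(reads, kmer, segment_len):
    """Trim each read containing kmer to a segment centered on its first occurrence."""
    flank = (segment_len - len(kmer)) // 2
    segments = []
    for read in reads:
        pos = read.find(kmer)
        if pos == -1:
            continue
        start = max(0, pos - flank)
        end = min(len(read), pos + len(kmer) + flank)
        segments.append(read[start:end])
    return segments


# Hypothetical toy reads; real reads are far longer.
reads = ["ACGTACGTGG", "TTACGTACCA", "GGGACGTAAA", "TGCATGCATT"]
depth = kmer_depth(reads, k=5)
seed, seed_depth = depth.most_common(1)[0]
segments = segments_around(reads, seed, segment_len=9)
print(seed, seed_depth, segments)
```

<p>Segments are shorter than <code>segment_len</code> when the seed lies near the end of a read, as in the first toy read.</p>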
Fragmentation due to Low <i>K-</i>mer Depth
<p>SZ-43LTR is the LTR region from a TE that is found as one piece in Repbase, but is recovered by ReAS as two nonoverlapping pieces, with 98% and 97% nucleotide identity to the Repbase entry.</p>