Precision, recall and f-measure for CNVs when combining the following three features: length, DGV and gene.
<p>Length is the CNV length. DGV is a measure of the CNV’s frequency in the Database of Genomic Variants. Gene is the feature derived from the previous machine learning step in this method.</p>
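For reference, the three metrics named in this caption can be computed from confusion-matrix counts as in the minimal sketch below. The function name and the counts are hypothetical and are not taken from the paper; the f-measure is assumed to be the usual harmonic mean of precision and recall.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f-measure) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # Assumption: f-measure is the harmonic mean of precision and recall (F1).
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical counts for "harmful" CNV predictions, for illustration only.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```

A classifier that labels few CNVs as harmful tends to score high precision and low recall, which is why the caption reports all three values together.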
Importance of Model Features.
<p>(a) Histogram of CNV lengths (on a log scale) for harmful and benign CNVs within our dataset shows that harmful CNVs are more likely to be longer, and hence are likely to affect more genes and gene functions. (b-d) Precision (b), recall (c) and f-measure (d) for predicting harmful versus benign CNVs relative to the number of closest neighbors considered within the gene interaction network. Both precision (b) and f-measure (d) improve as we expand the number of neighbors considered, but stabilize or decline slightly after 10 neighbors. We also see an improvement from utilizing the patient phenotypes uniform model in precision and accuracy as we add the ranking as a source for weighting our features.</p>
Prioritizing Clinically Relevant Copy Number Variation from Genetic Interactions and Gene Function Data
<div><p>It is becoming increasingly necessary to develop computerized methods for identifying the few disease-causing variants among the hundreds discovered in each individual patient. This problem is especially relevant for Copy Number Variants (CNVs), which can be cheaply interrogated via low-cost hybridization arrays commonly used in clinical practice. We present a method to predict the disease relevance of CNVs that combines functional context and clinical phenotype to discover clinically harmful CNVs (and likely causative genes) in patients with a variety of phenotypes. We compare several feature and gene weighting systems for classifying both genes and CNVs. We combined the best-performing methodologies and parameters on over 2,500 Agilent CGH 180k Microarray CNVs derived from 140 patients. Our method achieved an F-score of 91.59%, with 87.08% precision and 97.00% recall. Our methods are freely available at <a href="https://github.com/compbio-UofT/cnv-prioritization" target="_blank">https://github.com/compbio-UofT/cnv-prioritization</a>. Our dataset is included with the supplementary information.</p></div>
Databases, ontologies and known associations used to identify CNV-phenotype correlations.
<p>Our approach integrates 3 types of information: 1) CNVs and their non-exhaustive frequency in healthy individuals, 2) genes and gene interactions, with their respective functions (each gene within a CNV is weighted by its likelihood of contributing to the phenotypes, via semantic similarity within the GO ontology), and 3) phenotypic descriptions and relationships between them as specified by HPO, with their non-exhaustive associations to disease genes (via OMIM). For an individual's variants and known HPO phenotypes, genes affected by these variants are highlighted within the gene interaction network, while the phenotypes are emphasized in the phenotype ontology layer.</p>
The overall structure of the two-layer classifier, with the output of the Gene Classifier being one of the inputs to the CNV classifier.
<p>The overall structure of the two-layer classifier, with the output of the Gene Classifier being one of the inputs to the CNV classifier.</p>
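The data flow described in this caption can be illustrated with a minimal sketch in which a first layer reduces per-gene scores to a single gene feature, and a second layer combines that feature with CNV length and DGV frequency. Everything specific here is an assumption made for illustration: the function names, the use of `max` to aggregate gene scores, the logistic form of the second layer, and the weights. None of it is taken from the authors' implementation.

```python
import math


def gene_layer(gene_scores):
    """Layer 1 stand-in: collapse per-gene relevance scores into one feature.

    Assumption: the highest-scoring gene represents the CNV.
    """
    return max(gene_scores, default=0.0)


def cnv_layer(length_bp, dgv_frequency, gene_feature, weights, bias):
    """Layer 2 stand-in: logistic model over (log10 length, DGV, gene)."""
    features = (math.log10(length_bp), dgv_frequency, gene_feature)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights: longer CNVs and higher gene scores raise the output,
# a higher frequency in healthy individuals lowers it.
WEIGHTS = (0.8, -3.0, 2.5)
BIAS = -5.0

rare_long = cnv_layer(2_000_000, 0.0, gene_layer([0.1, 0.9]), WEIGHTS, BIAS)
common_short = cnv_layer(20_000, 0.6, gene_layer([0.05]), WEIGHTS, BIAS)
print(f"rare, long CNV: {rare_long:.2f}; common, short CNV: {common_short:.2f}")
```

The point of the sketch is only the wiring: the second layer never sees individual genes, only the summary produced by the first.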
Dotplots of sequence similarity in an allelic bin before and after ordering into hypercontigs by DDA
<p><b>Copyright information:</b></p><p>Taken from "A haplome alignment and reference sequence of the highly polymorphic <i>Ciona savignyi</i> genome"</p><p>http://genomebiology.com/2007/8/3/R41</p><p>Genome Biology 2007;8(3):R41.</p><p>Published online 20 Mar 2007</p><p>PMCID:PMC1868934.</p><p>The x-axis and y-axis in both plots represent sequence from sub-bins A and B, respectively, and cover approximately 550 kilobases (kb). In both plots, green dots record a region of sequence similarity on the positive strand and red dots record sequence similarity on the negative strand. Before the Double Draft Aligner (DDA) is run on this bin, supercontigs from each sub-bin are unordered and not oriented with respect to one another; their locations are denoted by alternating light and dark blue lines along the appropriate axis. After the DDA is run, contigs from both sub-bins have been ordered and oriented to produce a pair of linearly consistent hypercontigs.</p>
Various mutation and error events, and their effects on the color-code readouts.
<p>The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts.</p>
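The events in panels B to D can be reproduced with a minimal sketch, assuming the standard two-base color encoding in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3). The function names and the example sequence are illustrative and do not come from the figure.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"


def encode(seq):
    """Letter space to color space: first base plus one color per base pair."""
    return seq[0], [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Color space back to letter space, one base at a time."""
    bases = [first_base]
    for color in colors:
        bases.append(BASE[CODE[bases[-1]] ^ color])
    return "".join(bases)


def valid_snp_pairs(c1, c2):
    """Two-color replacements for (c1, c2) that are consistent with one SNP."""
    return [(d1, d2) for d1 in range(4) for d2 in range(4)
            if d1 != c1 and d2 != c2 and d1 ^ d2 == c1 ^ c2]


genome = "ACGTACGT"
first, colors = encode(genome)

# Panel B: one miscalled color corrupts every later base after decoding.
miscalled = list(colors)
miscalled[2] = 0
error_read = decode(first, miscalled)

# Panel C: a SNP (T to A at index 3) changes exactly two adjacent colors.
_, snp_colors = encode("ACGAACGT")
changed = [i for i, (x, y) in enumerate(zip(colors, snp_colors)) if x != y]

print(colors, error_read, changed, valid_snp_pairs(colors[2], colors[3]))
```

Because a SNP leaves the XOR of the two affected colors unchanged, only 3 of the 9 ways to change both colors are valid, which matches panel D.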
Running time of SHRiMP for mapping 500,000 35 bp SOLiD <i>C. savignyi</i> reads to the 180 Mb reference genome on a single Core2 2.66 GHz processor.
<p>In all cases, two k-mer hits were required within a 41 bp window to invoke the vectorized Smith-Waterman filter.</p>
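The hit-count rule in this caption can be sketched as follows. This is a simplified illustration, not SHRiMP's implementation: it uses contiguous k-mers, a single read, and a hypothetical function name. The defaults of two hits and a 41 bp window follow the caption, while the value of `k` is an assumption.

```python
from collections import deque


def filter_triggers(genome, read, k=4, window=41, min_hits=2):
    """Return genome positions at which the alignment stage would be invoked.

    A position triggers when the k-mer starting there occurs in the read and
    at least `min_hits` such k-mer hits fit inside a span of `window` bp.
    """
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    recent = deque()  # start positions of hits still inside the window
    triggers = []
    for pos in range(len(genome) - k + 1):
        if genome[pos:pos + k] not in read_kmers:
            continue
        recent.append(pos)
        # Drop hits whose span up to the end of the current k-mer is too long.
        while pos - recent[0] + k > window:
            recent.popleft()
        if len(recent) >= min_hits:
            triggers.append(pos)
    return triggers


genome = "TTTTTTTT" + "ACGTGGCA" + "TTTTTTTT"
print(filter_triggers(genome, "ACGTGGCA", window=12))
print(filter_triggers(genome, "CCCCCCCC", window=12))
```

Requiring more hits or a narrower window sends fewer candidate locations to the more expensive alignment stage, at the cost of possibly missing divergent reads.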
SHRiMP Hashing technique & Vectorized Alignment algorithm.
<p>A: Overview of the k-mer filtering stage within SHRiMP: A window is moved along the genome. If a particular read has a preset number of k-mers within the window, the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last step (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right.</p>
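The dependency pattern in panel B, where each diagonal is computed from the previous two, can be shown with a scalar sketch. This is plain Python rather than SIMD code, it uses local (Smith-Waterman) scoring with a linear gap penalty, and the function name and scoring values are assumptions. The cells of one anti-diagonal do not depend on each other, which is what makes a vectorized version possible.

```python
def sw_antidiagonal(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, filling the matrix one anti-diagonal at a time."""
    n, m = len(a), len(b)
    prev2, prev = {}, {}  # scores on diagonals d-2 and d-1, keyed by row i
    best = 0
    for d in range(2, n + m + 1):  # d = i + j with 1-based i, j
        cur = {}
        # Every cell in this loop is independent of the others on diagonal d.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            score = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev2.get(i - 1, 0) + score  # cell (i-1, j-1)
            up = prev.get(i - 1, 0) + gap       # cell (i-1, j)
            left = prev.get(i, 0) + gap         # cell (i, j-1)
            cur[i] = max(0, diag, up, left)
            best = max(best, cur[i])
        prev2, prev = prev, cur
    return best


print(sw_antidiagonal("ACGT", "ACGT"))
print(sw_antidiagonal("AAAA", "TTTT"))
```

Missing dictionary keys correspond to the zero-valued border of the matrix, so only two diagonals need to be kept in memory at any time.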
Size distribution of indels.
<p>Size distribution of indels (A) and distance between adjacent SNPs (B) detected by SHRiMP. The distance between adjacent SNPs shows a clear 3-periodicity, due to the fact that a significant fraction of the non-repetitive <i>C. savignyi</i> genome is coding.</p>