23 research outputs found

Precision, recall and f-measure for CNVs when combining the following three features: length, DGV and gene.

Author: James Stavropoulos (803555)
Justin Foong (803554)
Marta Girdea (524232)
Michael Brudno (29299)

Length is the CNV length. DGV is a measure of the CNV’s frequency in the Database of Genomic Variants. Gene is the feature derived from the previous machine learning step in this method.

FigShare

Importance of Model Features.

Author: James Stavropoulos (803555)
Justin Foong (803554)
Marta Girdea (524232)
Michael Brudno (29299)

(a) Histogram of CNV lengths (on log scale) for harmful and benign CNVs within our dataset shows that harmful CNVs are more likely to be longer, and hence are likely to affect more genes and gene functions. (b-d) Precision (b), recall (c) and f-measure (d) for predicting harmful versus benign CNVs relative to the number of closest neighbors considered within the gene interaction network. Both precision (b) and f-measure (d) improve as we expand the number of neighbors considered, but stabilize or slightly decline after 10 neighbors. We also see an improvement from utilizing the patient phenotypes uniform model in precision and accuracy as we add the ranking as a source for weighting our features.
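For readers unfamiliar with the three metrics plotted in panels (b-d), the following is a minimal sketch of how they are conventionally computed from classification counts. The function name and the example counts are hypothetical and are not taken from the dataset described here.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall and F-measure from confusion counts.

    tp: harmful CNVs predicted harmful
    fp: benign CNVs predicted harmful
    fn: harmful CNVs predicted benign
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The F-measure is the harmonic mean of precision and recall.
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(f"precision={p:.3f} recall={r:.3f} f-measure={f:.3f}")
```

Because the F-measure is a harmonic mean, it is pulled toward the lower of the two values, so it rewards classifiers that keep precision and recall in balance.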

FigShare

Prioritizing Clinically Relevant Copy Number Variation from Genetic Interactions and Gene Function Data

Author: James Stavropoulos (803555)
Justin Foong (803554)
Marta Girdea (524232)
Michael Brudno (29299)
Publication date: 05/10/2015

It is becoming increasingly necessary to develop computerized methods for identifying the few disease-causing variants from hundreds discovered in each individual patient. This problem is especially relevant for Copy Number Variants (CNVs), which can be cheaply interrogated via low-cost hybridization arrays commonly used in clinical practice. We present a method to predict the disease relevance of CNVs that combines functional context and clinical phenotype to discover clinically harmful CNVs (and likely causative genes) in patients with a variety of phenotypes. We compare several feature and gene weighting systems for classifying both genes and CNVs. We combined the best performing methodologies and parameters on over 2,500 Agilent CGH 180k Microarray CNVs derived from 140 patients. Our method achieved an F-score of 91.59%, with 87.08% precision and 97.00% recall. Our methods are freely available at https://github.com/compbio-UofT/cnv-prioritization. Our dataset is included with the supplementary information.

Directory of Open Access Journals

FigShare

Databases, ontologies and known associations used to identify CNV-phenotype correlations.

Author: James Stavropoulos (803555)
Justin Foong (803554)
Marta Girdea (524232)
Michael Brudno (29299)

Our approach integrates 3 types of information: 1) CNVs and their non-exhaustive frequency in healthy individuals, 2) genes and gene interactions, with their respective functions (each gene within a CNV is weighted by its likelihood of contributing to the phenotypes, via semantic similarity within the GO ontology), and 3) phenotypic descriptions and relationships between them as specified by HPO, with their non-exhaustive associations to disease genes (via OMIM). For an individual's variants and known HPO phenotypes, genes affected by these variants are highlighted within the gene interaction network, while the phenotypes are emphasized in the phenotype ontology layer.

FigShare

The overall structure of the two-layer classifier, with the output of the Gene Classifier being one of the inputs to the CNV classifier.

Author: James Stavropoulos (803555)
Justin Foong (803554)
Marta Girdea (524232)
Michael Brudno (29299)

The overall structure of the two-layer classifier, with the output of the Gene Classifier being one of the inputs to the CNV classifier.

FigShare

Dotplots of sequence similarity in an allelic bin before and after ordering into hypercontigs by DDA

Author: Arend Sidow (5485)
Kerrin S Small (29298)
Matthew M Hill (29300)
Michael Brudno (29299)

Copyright information: Taken from "A haplome alignment and reference sequence of the highly polymorphic genome", http://genomebiology.com/2007/8/3/R41, Genome Biology 2007;8(3):R41-R41. Published online 20 Mar 2007. PMCID: PMC1868934. The x-axis and y-axis in both plots represent sequence from sub-bins A and B, respectively, and cover approximately 550 kilobases (kb). In both plots, green dots record a region of sequence similarity on the positive strand and red dots sequence similarity on the negative strand. Before the Double Draft Aligner (DDA) is run on this bin, supercontigs from each sub-bin are unordered and not oriented with respect to one another; their locations are denoted by alternating light and dark blue lines along the appropriate axis. After the DDA is run, contigs from both sub-bins have been ordered and oriented to produce a pair of linearly consistent hypercontigs.

FigShare

Various mutation and error events, and their effects on the color-code readouts.

Author: Adrian V. Dalca (372652)
Arend Sidow (5485)
Marc Fiume (372653)
Michael Brudno (29299)
Phil Lacroute (254608)
Stephen M. Rumble (372651)

The reference genome is labeled G and the read R. A: A perfect alignment; B: In case of a sequencing error (the 2 should have been read as a 0) the rest of the read no longer matches the genome in letter-space; C: In case of a SNP two adjacent colors do not match the genome, but all subsequent letters do match. However, D: only 3 of the 9 possible color changes represent valid SNPs; E: the rules for deciding which insertion and deletion events are valid are even more complex, as indels can also change adjacent color readouts.
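The behaviour described in panels B to D can be reproduced with a small sketch of two-base color encoding. It assumes the common 2-bit mapping A=0, C=1, G=2, T=3, under which the color of two adjacent bases is the XOR of their codes. The sequences and helper names below are hypothetical and are not taken from SHRiMP.

```python
BASES = "ACGT"
CODE = {base: i for i, base in enumerate(BASES)}  # assumed mapping: A=0, C=1, G=2, T=3


def encode(seq):
    """Letter-space to color-space: one color per pair of adjacent bases."""
    return [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]


def decode(first_base, colors):
    """Color-space to letter-space, starting from a known first base."""
    letters = [first_base]
    for color in colors:
        letters.append(BASES[CODE[letters[-1]] ^ color])
    return "".join(letters)


def is_valid_snp(old_pair, new_pair):
    """Two adjacent color changes are consistent with a SNP only if both
    colors change and the XOR of the pair is preserved."""
    both_changed = old_pair[0] != new_pair[0] and old_pair[1] != new_pair[1]
    return both_changed and (old_pair[0] ^ old_pair[1]) == (new_pair[0] ^ new_pair[1])


genome = "ACGTTGCA"  # hypothetical reference
colors = encode(genome)

# Sequencing error: one misread color shifts every later letter when decoded.
bad = list(colors)
bad[3] ^= 2
misread = decode(genome[0], bad)

# SNP: one changed base (position 4, T to A) alters exactly two adjacent colors.
snp_colors = encode("ACGTAGCA")
changed = [i for i, (x, y) in enumerate(zip(colors, snp_colors)) if x != y]

# Of the 9 ways to change both colors of a pair, count those consistent with a SNP.
valid = sum(is_valid_snp((1, 0), (a, b)) for a in range(4) for b in range(4))

print(colors, misread, changed, valid)
```

Under this encoding a single color error propagates through the rest of the decoded read, while a true SNP leaves the XOR of the two affected colors unchanged, which is why only a subset of adjacent color changes can be explained by a SNP.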

FigShare

Running time of SHRiMP for mapping 500,000 35 bp SOLiD C. savignyi reads to the 180 Mb reference genome on a single Core2 2.66 GHz processor.

Author: Adrian V. Dalca (372652)
Arend Sidow (5485)
Marc Fiume (372653)
Michael Brudno (29299)
Phil Lacroute (254608)
Stephen M. Rumble (372651)

In all cases, two k-mer hits were required within a 41 bp window to invoke the vectorized Smith-Waterman filter.

FigShare

SHRiMP Hashing technique & Vectorized Alignment algorithm.

Author: Adrian V. Dalca (372652)
Arend Sidow (5485)
Marc Fiume (372653)
Michael Brudno (29299)
Phil Lacroute (254608)
Stephen M. Rumble (372651)

A: Overview of the k-mer filtering stage within SHRiMP: A window is moved along the genome. If a particular read has a preset number of k-mers within the window, the vectorized Smith-Waterman stage is run to align the read to the genome. B: Schematic of the vectorized implementation of the Needleman-Wunsch algorithm. The red cells are the vector being computed, on the basis of the vectors computed in the last step (yellow) and the next-to-last (blue). The match/mismatch vector for the diagonal is determined by comparing one sequence with the other one reversed (indicated by the red arrow below). To obtain the set of match/mismatch positions for the next diagonal, the lower sequence needs to be shifted to the right.
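The windowed k-mer filter in panel A can be illustrated with a simplified sketch. It uses contiguous k-mers and a single read, and all names, default parameters and sequences are hypothetical. It is not SHRiMP's implementation, only an illustration of the idea of requiring several k-mer hits inside a window before running the more expensive alignment.

```python
from collections import deque


def candidate_windows(genome, read, k=4, window=41, min_hits=2):
    """Return genome positions where at least `min_hits` k-mers of `read`
    fall inside a window of `window` bp, scanning left to right."""
    read_kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    recent = deque()  # start positions of recent k-mer hits on the genome
    triggers = []
    for pos in range(len(genome) - k + 1):
        if genome[pos:pos + k] not in read_kmers:
            continue
        recent.append(pos)
        # Drop hits that no longer fit in a window ending at the current hit.
        while recent and pos + k - recent[0] > window:
            recent.popleft()
        if len(recent) >= min_hits:
            # A real mapper would now run the alignment stage on this region.
            triggers.append(recent[0])
    return triggers


# Hypothetical example: the read occurs once, flanked by poly-T sequence.
genome = "T" * 50 + "ACGTACGG" + "T" * 50
print(candidate_windows(genome, "ACGTACGG"))
print(candidate_windows(genome, "GGGGGGGG"))
```

Requiring more than one hit per window trades a small loss in sensitivity for far fewer alignment invocations, since isolated chance k-mer matches no longer trigger the alignment stage.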

FigShare

Size distribution of indels.

Author: Adrian V. Dalca (372652)
Arend Sidow (5485)
Marc Fiume (372653)
Michael Brudno (29299)
Phil Lacroute (254608)
Stephen M. Rumble (372651)

Size distribution of indels (A) and distance between adjacent SNPs (B) detected by SHRiMP. The distance between adjacent SNPs shows a clear 3-periodicity, due to the fact that a significant fraction of the non-repetitive C. savignyi genome is coding.

FigShare