Workflow.
<p>Quick identification of DNA, as performed with Tapir, is one step in an analysis workflow for unknown samples. The current web browser-based client implements the part of the process shown in the grey area; downloading of reference sequences is currently in testing and will be made available on our production server very soon. At the beginning, all reads are unmapped and a sample of them is submitted for identification. The resulting list points to the reference DNA most represented in the sample, and the sequences for the top hits can then be fetched, indexed and used for mapping all unmapped reads, for example with a short-read aligner. If unmapped reads remain after this step, they constitute a new set of unmapped reads to iterate on. The procedure thus iteratively decreases the number of unmapped reads; if the sample contains a mixture of DNA, such as plasmids or different species, the reads from the other sources remain unmapped and are handled in the next iteration.</p>
Client-server alignment without pre-specified reference genome.
<p>(A) A small random sample of the unmapped reads (initially all reads) is taken by the client and sent to the server. (B) In return, the server sends a list of hits, that is, candidate reference sequences for the sample. (C) The client then iterates through the top hits; for each one it requests the full genomic sequence from the server after checking that it does not already have a local copy, then calls bowtie2 to build an index for that reference and align all currently unmapped reads to it. The reference to which the most reads map is kept, and the remaining unmapped reads are moved back to step (A). The outcome is a list of reference genomes, along with the percentage of reads iteratively aligning to each of these references (screenshot in (D)).</p>
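The loop in steps (A) to (C) can be sketched as follows. This is a minimal illustration and not the Tapir client itself: `identify` stands in for the server query and `maps_to` for the bowtie2 index-and-align step, and both, together with the toy references and the parameters `sample_size`, `top_n` and `max_rounds`, are hypothetical names introduced here.

```python
import random

def iterative_identify(reads, identify, maps_to,
                       sample_size=100, top_n=3, max_rounds=10, seed=0):
    """Return [(reference, percent of all reads mapped)], one entry per round.

    identify(sample) -> ranked list of reference names (stand-in for the server).
    maps_to(ref, read) -> bool (stand-in for indexing and aligning with an aligner).
    """
    rng = random.Random(seed)
    unmapped = list(range(len(reads)))      # indices of reads not yet mapped
    results = []
    for _ in range(max_rounds):
        if not unmapped:
            break
        # (A) sample the currently unmapped reads
        sample_idx = rng.sample(unmapped, min(sample_size, len(unmapped)))
        # (B) ask for candidate references, keep the top hits
        hits = identify([reads[i] for i in sample_idx])[:top_n]
        # (C) align all unmapped reads to each hit, keep the best reference
        best_ref, best_mapped = None, set()
        for ref in hits:
            mapped = {i for i in unmapped if maps_to(ref, reads[i])}
            if len(mapped) > len(best_mapped):
                best_ref, best_mapped = ref, mapped
        if best_ref is None:                # nothing maps any more: stop
            break
        results.append((best_ref, 100.0 * len(best_mapped) / len(reads)))
        unmapped = [i for i in unmapped if i not in best_mapped]
    return results

# Toy stand-ins: exact substring matching instead of a real aligner.
REFS = {"refA": "ACGTACGTTTGACCA", "refB": "GGGCCCTTTAAAGGG"}

def toy_identify(sample):
    counts = {name: sum(r in seq for r in sample) for name, seq in REFS.items()}
    return sorted((n for n in counts if counts[n] > 0), key=lambda n: -counts[n])

def toy_maps_to(ref, read):
    return read in REFS[ref]

reads = ["ACGTAC", "GTTTGA", "TGACCA", "GGGCCC"]
result = iterative_identify(reads, toy_identify, toy_maps_to)
print(result)
```

With this toy mixture, the first round keeps `refA` and the single read left over is resolved against `refB` in the second round, which mirrors how a mixed sample is handled one reference per iteration.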
Bacterial reads.
<p>For each bacterial genome in a set of 747 genomes, we simulated several read lengths (50 nt, 75 nt, 100 nt, 150 nt, 200 nt, 250 nt) and several substitution error rates (0%, 1%, 5%, 10%). 100 random reads were used in each query, and the distribution of the rank of the correct reference in the list of hits was recorded; a rank of 1 means that the correct reference was at the very top of the list. The list of hits has a maximum length of 25, and we count the reference as ‘not found’ if it is not in the list at all. The percentage of test bacterial genomes whose correct reference is present in the list is shown as a bar nested on the right side of each panel. The figure shows that, as expected, performance degrades as the substitution rate increases, but also that reads of length 50 appear to be of little practical use for identification purposes. Increasing the read length beyond 100 nt brings only small improvements and compensates only to a limited extent for a higher substitution rate. The figure suggests that current leading sequencing technologies provide sufficient read length for accurate identification, and that further development should focus on sequence quality rather than increased read length.</p>
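A read simulation of this kind could look like the sketch below. The caption does not describe the simulator, so the details are assumptions: read start positions are drawn uniformly, each base is substituted independently with probability `sub_rate`, and a substituted base is replaced by one of the three other nucleotides. The function name `simulate_reads` is hypothetical.

```python
import random

def simulate_reads(genome, read_length, n_reads, sub_rate, rng):
    """Draw n_reads substrings of genome and apply random substitutions."""
    reads = []
    for _ in range(n_reads):
        # Assumed: uniform start position over all valid windows.
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            # Assumed: independent substitution with probability sub_rate,
            # always to a different nucleotide.
            if rng.random() < sub_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads

rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(1000))  # toy genome
clean_reads = simulate_reads(genome, 50, 100, 0.0, rng)
noisy_reads = simulate_reads(genome, 50, 100, 0.10, rng)
```

Looping this over the read lengths and substitution rates listed in the caption would give one query set per panel of the figure.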
Genomic references
<p>Snapshot of genomic references (source and number of references). The references are a mixture of full genomes or plasmids, and of genomic fragments such as contigs or genes.</p>
Bacterial reads, same species.
<p>Percentage of matches giving the correct species, that is, a reference in our collection that belongs to a bacterium of the same species, rather than the exact same reference as shown in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0083784#pone-0083784-g003" target="_blank">Figure 3</a>, and the percentage of cases for which the correct species was not in the top 25 matches. Independent samples of 5, 10, 25, 50, 100, 200, or 300 random reads were used in each query, and the distribution of the rank of the correct reference in the list of hits was recorded; a rank of 1 means that the correct reference was at the very top of the list. The list of hits has a maximum length of 25, and we count the reference as ‘not found’ if it is not present in the list. The percentages of correct test bacterial genomes found in that list are shown in a bar plot nested on the right side of each panel. The performance remains poor for the shorter reads (50 nt), with noise decreasing it further (bar plot on the first row), but becomes extremely good from 100 nt and stays robust against noise.</p>
Bacterial reads (number of reads).
<p>For each bacterial genome in a set of 747 genomes, we simulated several read lengths (50 nt, 75 nt, 100 nt, 150 nt, 200 nt, 250 nt) and several substitution error rates (0%, 1%, 5%, 10%). Independent samples of 5, 10, 25, 50, 100, 200, or 300 random reads were used in each query, and the distribution of the rank of the correct reference in the list of hits was recorded; a rank of 1 means that the correct reference was at the very top of the list. The list of hits has a maximum length of 25, and we count the reference as ‘not found’ if it is not present in the list. The percentages of correct test bacterial genomes found in that list are shown in a bar plot nested on the right side of each panel. Increasing the number of reads in the random sample beyond 100 improves the observed performance only very slightly, mostly for shorter read lengths and higher substitution rates. The substitution rate and the read length have much stronger effects on performance.</p>
Overview of the indexing and scoring procedures.
<p>(A) During the indexing of a collection of reference sequences, non-overlapping <i>k</i>-mers are indexed into two distinct key-value stores, one associating <i>k</i>-mers with the references they were found in (‘presence’) and one associating <i>k</i>-mers with the position in the reference at which the <i>k</i>-mer was found (‘position’). (B) When processing a sequencing read in a query set, overlapping <i>k</i>-mers are looked up in the ‘presence’ store. Using overlapping <i>k</i>-mers makes it possible to resolve relatively quickly any offset between the beginning of the read and the beginning of the reference sequence (dotted lines). In the figure, only the <i>k</i>-mers in red are in phase with the indexing step, so only those can be found in ‘presence’. (C) For a given read, a threshold is applied to retain only the references that potentially match enough of the read. Cases where a very large reference contains disjoint, scattered <i>k</i>-mers, such as a bacterial read against a mammalian genome, are resolved in the last step, where the ‘position’ store is queried.</p>
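The two stores and the three query steps can be illustrated with in-memory dictionaries. This is a sketch under stated assumptions, not the scoring used by the server: the caption does not say how the ‘position’ store resolves scattered <i>k</i>-mers, so the code assumes a simple consistency check that counts <i>k</i>-mers agreeing on a single offset between read and reference. The names `build_index`, `score_read` and `min_hits` are hypothetical.

```python
from collections import defaultdict

def build_index(references, k):
    """(A) Index non-overlapping k-mers into 'presence' and 'position' stores."""
    presence = defaultdict(set)    # k-mer -> names of references containing it
    position = defaultdict(list)   # (k-mer, reference) -> start positions
    for name, seq in references.items():
        for start in range(0, len(seq) - k + 1, k):   # step k: no overlap
            kmer = seq[start:start + k]
            presence[kmer].add(name)
            position[(kmer, name)].append(start)
    return presence, position

def score_read(read, presence, position, k, min_hits=2):
    """(B) Look up overlapping k-mers, (C) threshold, then check positions."""
    hits = defaultdict(list)       # reference -> [(offset in read, k-mer)]
    for offset in range(len(read) - k + 1):           # step 1: overlapping
        kmer = read[offset:offset + k]
        for name in presence.get(kmer, ()):
            hits[name].append((offset, kmer))
    scores = {}
    for name, found in hits.items():
        if len(found) < min_hits:                     # threshold on presence
            continue
        # Assumed check: k-mers from one true match share the same
        # (reference position - read offset); scattered k-mers do not.
        shifts = defaultdict(int)
        for offset, kmer in found:
            for ref_pos in position[(kmer, name)]:
                shifts[ref_pos - offset] += 1
        scores[name] = max(shifts.values())
    return scores

references = {
    "bact": "ACGTTGCAAGGCTTAACCGG",
    "big":  "TGCATTTTTTTTTTTTAGGC",   # same two k-mers, but far apart
}
presence, position = build_index(references, k=4)
scores = score_read("GTTGCAAGGCTT", presence, position, k=4)
print(scores)
```

Both toy references pass the ‘presence’ threshold with two <i>k</i>-mers each, but only `bact` has them at mutually consistent positions, so it receives the higher score.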
Bacterial reads, variability of accuracy when identifying a genome.
<p>Average rank (x-axis) and standard deviation of the rank (y-axis) of the correct reference when performing the identification procedure for 747 test bacterial genomes, using 100 random reads and repeating the procedure 3 times for each genome. The closer the average rank is to 1, the closer the performance is to perfect, and the smaller the standard deviation of the ranks, the less sensitive the result is to sampling effects. To increase clarity when many of the bacterial genomes tested produce equal or close coordinates on the scatter plot, we use hexagonal binning and color the areas accordingly. The vertical bar on the right side of each panel indicates the percentage of times the correct reference was within the top 25 matches. Various read sizes (rows) and error rates (random substitution, columns) were tried, producing a matrix of scatter plots.</p>
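The per-genome coordinates of such a scatter plot can be computed as in the sketch below. Two choices are assumptions not stated in the caption: repetitions in which the correct reference is not in the top 25 are left out of the mean and standard deviation and only counted in the percentage found, and the population standard deviation is used. The helper names `rank_of` and `summarize` are hypothetical.

```python
from statistics import mean, pstdev

def rank_of(correct, hits, max_len=25):
    """1-based rank of the correct reference, or None if not in the top hits."""
    top = hits[:max_len]
    return top.index(correct) + 1 if correct in top else None

def summarize(ranks):
    """Return (average rank, standard deviation, percent of runs found)."""
    found = [r for r in ranks if r is not None]
    pct_found = 100.0 * len(found) / len(ranks)
    if not found:
        return None, None, pct_found
    # Assumed: 'not found' runs are excluded here and reported via pct_found.
    return mean(found), pstdev(found), pct_found

# Three repetitions for one genome: ranked 1st, 1st and 3rd.
runs = [["g1", "g2"], ["g1", "g3"], ["g2", "g3", "g1"]]
ranks = [rank_of("g1", hits) for hits in runs]
avg, sd, pct = summarize(ranks)
print(ranks, avg, sd, pct)
```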
Length distribution of amino terminal PFRs for MHC-II binding and non-binding peptides
<p><b>Copyright information:</b></p><p>Taken from "Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method"</p><p>http://www.biomedcentral.com/1471-2105/8/238</p><p>BMC Bioinformatics 2007;8():238-238.</p><p>Published online 4 Jul 2007</p><p>PMCID:PMC1939856.</p> All peptide data for the three alleles in the AntiJen and IEDB data sets are included in the figure. Binding peptides have an affinity stronger than 500 nM. The PFR is defined as the residues flanking the peptide-binding core as determined by the SMM-align method.
Simulation of an immunization experiment.
<p>B cell (panel a) and CD4 T cell (panel b) populations during a typical immunization experiment. An immunogenic molecule is injected at time zero and again after six months. In both plots, the total number of lymphocytes is shown along with the immune memory compartment. Panel (c) shows that the secondary response eliminates the antigen on a shorter timescale, owing to the presence of memory cells ready to react.</p>
- …