20 research outputs found

    A Bioinformatics Approach for Determining Sample Identity from Different Lanes of High-Throughput Sequencing Data

    The ability to generate whole genome data is rapidly becoming commoditized. For example, a mammalian sized genome (∼3 Gb) can now be sequenced using approximately ten lanes on an Illumina HiSeq 2000. Since lanes from different runs are often combined, verifying that each lane in a genome's build is from the same sample is an important quality control. We sought to address this issue in a post hoc bioinformatic manner, instead of using upstream sample or “barcode” modifications. We rely on the inherent small differences between any two individuals to show that genotype concordance rates can be effectively used to test if any two lanes of HiSeq 2000 data are from the same sample. As proof of principle, we use recent data from three different human samples generated on this platform. We show that the distributions of concordance rates are non-overlapping when comparing lanes from the same sample versus lanes from different samples. Our method proves to be robust even when different numbers of reads are analyzed. Finally, we provide a straightforward method for determining the gender of any given sample. Our results suggest that examining the concordance of detected genotypes from lanes purported to be from the same sample is a relatively simple approach for confirming that combined lanes of data are of the same identity and quality.
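
The concordance test described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name, the dictionary layout (position mapped to a genotype string), the `"0/0"` reference encoding, and the rule that a position counts as variant when either lane calls a nonreference genotype are all assumptions made for illustration.

```python
def concordance_rate(calls_a, calls_b, variant_only=False, ref_genotype="0/0"):
    """Fraction of positions genotyped in both lanes whose calls agree.

    calls_a, calls_b: dicts mapping (chromosome, position) -> genotype string.
    variant_only: if True, keep only positions where at least one lane
    has a nonreference call (an assumed definition, see lead-in).
    """
    # Only positions callable in both lanes can be compared.
    shared = set(calls_a) & set(calls_b)
    if variant_only:
        shared = {
            pos for pos in shared
            if calls_a[pos] != ref_genotype or calls_b[pos] != ref_genotype
        }
    if not shared:
        return float("nan")
    matches = sum(1 for pos in shared if calls_a[pos] == calls_b[pos])
    return matches / len(shared)


# Toy genotype calls for three hypothetical lanes.
lane_1 = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0", ("chr2", 50): "0/1"}
lane_2 = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0/0", ("chr3", 10): "0/0"}
lane_3 = {("chr1", 100): "0/0", ("chr1", 200): "1/1", ("chr1", 300): "0/0"}

print(concordance_rate(lane_1, lane_2))                     # same-sample style comparison
print(concordance_rate(lane_1, lane_3))                     # different-sample style comparison
print(concordance_rate(lane_1, lane_3, variant_only=True))  # variant positions only
```

In practice the rate would be computed for every pair of lanes, and the same-sample and different-sample distributions compared, as the abstract describes.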

    Summary of data used in these analyses.

    <p>Number of reads reflects the number of aligning reads after removing duplicate read pairs and filtering for low quality alignments (see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023683#s4" target="_blank">Methods</a>). Gender was determined by looking at coverage of reads in specific representative regions of the X and Y chromosomes (see <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023683#s4" target="_blank">Methods</a>). Number of genotypes called is from the autosomes only, which is what was used for downstream comparisons.</p>
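
A coverage-based gender call of the kind mentioned in this caption might look like the sketch below. The caption does not give the regions or the decision rule, so the function name, the use of a Y-to-X mean depth ratio, and the `0.25` threshold are hypothetical choices for illustration only.

```python
def infer_sex(x_depth, y_depth, y_to_x_threshold=0.25):
    """Classify a sample from mean read depth in chosen X and Y regions.

    x_depth, y_depth: mean depth of coverage in representative regions
    of the X and Y chromosomes. The threshold is an illustrative guess,
    not a value taken from the paper.
    """
    if x_depth <= 0:
        raise ValueError("X-chromosome depth must be positive")
    # Samples with a Y chromosome show Y depth comparable to X depth;
    # samples without one show Y depth close to zero.
    ratio = y_depth / x_depth
    return "male" if ratio >= y_to_x_threshold else "female"


print(infer_sex(x_depth=15.0, y_depth=14.0))
print(infer_sex(x_depth=30.0, y_depth=0.4))
```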

    Modeling <em>Drosophila</em> Positional Preferences in Open Field Arenas with Directional Persistence and Wall Attraction

    <div><p>In open field arenas, <em>Drosophila</em> adults exhibit a preference for arena boundaries over internal walls and open regions. Herein, we investigate the nature of this preference using phenomenological modeling of locomotion to determine whether local arena features and constraints on movement alone are sufficient to drive positional preferences within open field arenas of different shapes and with different internal features. Our model has two components: directional persistence and local wall force. In regions far away from walls, the trajectory is entirely characterized by a directional persistence probability for each movement, defined by the step size and the turn angle. In close proximity to walls, motion is computed from this persistence probability and a local attractive force that depends on the distance between the fly and points on the walls. The directional persistence probability was obtained experimentally from trajectories of wild type <em>Drosophila</em> in a circular open field arena, and the wall force was computed to minimize the difference between the radial distributions from the model and from <em>Drosophila</em> in the same circular arena. The two-component model for fly movement was challenged by comparing its positional preferences to those of wild type <em>Drosophila</em> in a variety of open field arenas. In most arenas there was a strong concordance between the two-component model and <em>Drosophila</em>. In more complex arenas, the model exhibits similar trends, but some significant differences were found. These differences suggest that there are emergent features within these complex arenas that have significance for the fly, such as potential shelter. Hence, the two-component model is an important step in defining how <em>Drosophila</em> interact with their environment.</p> </div>
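
The two-component idea can be sketched as a simple random walk in a circular arena. This is a heavily simplified illustration, not the published model: the persistence rule (keep the heading with a fixed probability, otherwise pick a new one at random), the linear ramp of the wall pull, and every parameter value except the 4.2 cm arena radius are assumptions.

```python
import math
import random


def simulate_walk(n_steps, arena_radius=4.2, step=0.1, p_persist=0.8,
                  wall_range=0.6, wall_strength=0.5, seed=0):
    """Return the radial position after each step of a simulated walk."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    heading = rng.uniform(0.0, 2.0 * math.pi)
    radii = []
    for _ in range(n_steps):
        # Component 1: directional persistence. With probability
        # p_persist the walker keeps its heading, otherwise it turns.
        if rng.random() > p_persist:
            heading = rng.uniform(0.0, 2.0 * math.pi)
        dx = step * math.cos(heading)
        dy = step * math.sin(heading)

        # Component 2: local wall attraction, active only near the wall.
        r = math.hypot(x, y)
        dist_to_wall = arena_radius - r
        if r > 0.0 and dist_to_wall < wall_range:
            # Pull grows linearly as the walker approaches the wall
            # (an assumed form) and points radially outward.
            pull = wall_strength * step * (1.0 - dist_to_wall / wall_range)
            dx += pull * x / r
            dy += pull * y / r

        # Keep the walker inside the arena by projecting onto the wall.
        new_x, new_y = x + dx, y + dy
        new_r = math.hypot(new_x, new_y)
        if new_r > arena_radius:
            new_x *= arena_radius / new_r
            new_y *= arena_radius / new_r
        x, y = new_x, new_y
        radii.append(math.hypot(x, y))
    return radii


radii = simulate_walk(5000)
near_wall = sum(1 for r in radii if r >= 4.2 - 0.6) / len(radii)
print(f"fraction of steps within 0.6 cm of the wall: {near_wall:.2f}")
```

A histogram of `radii` plays the role of the radial distribution that the abstract says was used to fit the wall force.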

    Overview of approach.

    <p>Several lanes of HiSeq 2000 data are typically combined for a comprehensive genome analysis, giving a high depth of coverage (A) and the ability to accurately call genotypes in the majority of the genome. In (B), two individual lanes of HiSeq 2000 data are depicted, with a lower average depth of coverage. By chance, some regions of the genome have enough data to be genotyped in both lanes (shaded gray).</p>

    Effect of data quantity on concordance rates.

    <p>The total number of reads used in the analysis affects different-sample comparisons, but not same-sample comparisons. In (A), lane 7 of sample ID B was kept constant at 140 million reads (B7), and the amount of data for the other lane [either lane 8 of B (B8) or lane 7 of C (C7)] was varied between 40 million and 140 million reads (x-axis) in 20 million read increments. The y-axis represents the concordance rate of variant (nonreference) genotypes called in the two datasets. Note that for the same-sample comparison (red line), varying the number of reads used in the analysis does not substantially alter the concordance rate. However, this is not the case for the different-sample comparison (blue line), where the concordance rate diverges further from the same-sample rate as more reads are used. In (B), a similar trend is observed when the reads in both samples are incremented simultaneously. Solid lines represent a LOESS smoothed fit to the data points.</p>

    Concordance between lanes.

    <p>Distributions of genotype concordance rates from same- and different-sample comparisons are non-overlapping. The box plot in (A) shows the distributions of concordance rates when using all callable positions for all combinations of pairs of the three samples being analyzed. The x-axis denotes each pair being compared (A, B, and C refer to the sample IDs in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0023683#pone-0023683-t001" target="_blank">Table 1</a>), and the y-axis represents the distribution of concordance rates for all pair-wise combinations of lanes representing the specific pair of samples on the x-axis. It is likely that the detected differences from same-sample comparisons (B–B, C–C, and A–A) arise solely from sequencing and genotyping error. The box plot in (B) is similar to (A), except that here only variant (nonreference) positions are considered. The symmetrical heat map in (C) summarizes the data from panel (A); the blue boxes represent low concordance rates and correspond to different-sample comparisons, while the yellow boxes along the diagonal represent high concordance rates and correspond to same-sample comparisons. Note that samples B and C (gray boxes) are slightly more similar to each other than the samples in the other different-sample comparisons, but are still sufficiently distinct from same-sample comparisons. This is expected given the known partial consanguinity between these individuals.</p>

    Positional preferences from experiments and two-component model with nonlinear wall forces in two arenas.

    <p><b>A</b>. Radial distribution of a wild type fly and of simulations using a power law form of wall attraction in the circular arena of radius 4.2 cm, with the parameter values that most closely matched the positional preference of wild type Canton-S. <b>B</b>. Radial distribution of a wild type fly and of simulations using an exponential form of wall attraction in the same arena, again with the parameter values that most closely matched the positional preference of wild type Canton-S. <b>C</b>. We examined the preference for two spatial zones (cross and edge zones as described in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046570#pone-0046570-g003" target="_blank">Figure 3</a>) inside the 4.2 cm circular arena with internal corners. There were no statistical differences between the zone occupancies from simulations (with either exponential or power law wall attraction) and those from experiments. <b>D</b>. We examined the occupancies in seven spatial zones in the Texas arena as described in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0046570#pone-0046570-g008" target="_blank">Figure 8</a>. Simulations using both the exponential and the power law forms of the force captured several of the trends seen in the experiments, but underemphasized the responses to the preferred areas and overemphasized the responses to the less preferred areas, especially zone 7. Symbols such as * and + indicate the level of significant difference between simulations and experiments.</p>
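
For orientation, the two families of wall attraction named in this caption could be parameterized as in the sketch below. The caption's fitted parameter values are not available here, so the function names, the argument names (`amplitude`, `exponent`, `decay_length`), and the small-distance cutoff `d_min` are assumptions rather than the paper's notation.

```python
import math


def power_law_force(distance, amplitude, exponent, d_min=1e-3):
    """Attraction that decays as a power of the distance to the wall.

    d_min is an assumed cutoff that keeps the force finite at the wall.
    """
    return amplitude * max(distance, d_min) ** (-exponent)


def exponential_force(distance, amplitude, decay_length):
    """Attraction that decays exponentially with distance to the wall."""
    return amplitude * math.exp(-distance / decay_length)


# Compare how quickly each form falls off with distance (arbitrary units).
for d in (0.1, 0.5, 1.0, 2.0):
    print(d, power_law_force(d, 1.0, 2.0), exponential_force(d, 1.0, 0.5))
```

Both forms decrease monotonically with distance; they differ mainly in how long a tail the attraction has far from the wall.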

    Mean occupancy in different zones of 4.2 cm circular arena with internal corners.

    <p><b>A.</b> The internal corner arena was constructed with a cross placed in the center of a 4.2 cm circular arena. We considered two zones in this arena: the cross zone and the edge zone. The cross zone was a square sector positioned at the center, while the edge zone was an annular region of width 0.6 cm along the boundary. <b>B.</b> The mean occupancies of the flies, both from simulations and experiments, in each zone are shown. The times spent in these zones by <i>Drosophila</i> were statistically similar to those from the two-component model. <i>Drosophila</i> showed a preference for the boundary over internal corners.</p>
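
Zone occupancy of the kind reported in this caption can be computed from a list of positions, as in this minimal sketch. The 4.2 cm radius and 0.6 cm edge width come from the caption; the size of the central square (`cross_half_side`) is not stated there and is a hypothetical value, as is the function name.

```python
import math


def zone_occupancy(positions, arena_radius=4.2, edge_width=0.6,
                   cross_half_side=1.0):
    """Fraction of recorded positions falling in the edge and cross zones.

    positions: iterable of (x, y) in cm, with the arena centered at (0, 0).
    cross_half_side: half the side of the central square (assumed value).
    """
    positions = list(positions)
    if not positions:
        raise ValueError("no positions given")
    edge = cross = 0
    for x, y in positions:
        if math.hypot(x, y) >= arena_radius - edge_width:
            edge += 1          # annulus along the arena boundary
        elif abs(x) <= cross_half_side and abs(y) <= cross_half_side:
            cross += 1         # square region at the arena center
    total = len(positions)
    return {"edge": edge / total, "cross": cross / total}


track = [(0.0, 0.0), (4.0, 0.0), (0.0, -3.9), (2.0, 2.0)]
print(zone_occupancy(track))
```

Applying the same function to experimental tracks and to simulated tracks gives the two sets of occupancies that the caption compares.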

    Positional preferences from experiments and simulations in Texas arena.

    <p>We examined the preference for seven spatial zones inside the arena: four corners (zones 1, 2, 3, and 4) and three internal regions (zones 5, 6, and 7). Zone 5 is a triangle connecting three cities: Dallas, San Antonio and Houston. Zone 6 connects Dallas, San Antonio and Abilene, while zone 7 connects San Antonio, Abilene and Fort Stockton. The simulations captured several of the trends seen in the experiments, but underemphasized the responses to the preferred areas and overemphasized the responses to the less preferred areas, especially zone 7. Asterisks (*, **, and ***) indicate the level of significant difference between simulations and experiments.</p>