21 research outputs found

    An Improved Protocol for Sequencing of Repetitive Genomic Regions and Structural Variations Using Mutagenesis and Next Generation Sequencing

    <div><p>The rise of Next Generation Sequencing (NGS) technologies has transformed <em>de novo</em> genome sequencing into an accessible research tool, but obtaining high quality eukaryotic genome assemblies remains a challenge, mostly due to the abundance of repetitive elements. These also make it difficult to study nucleotide polymorphism in repetitive regions, including certain types of structural variations. One solution proposed for resolving such regions is Sequence Assembly aided by Mutagenesis (SAM), which relies on the fact that introducing enough random mutations breaks the repetitive structure, making assembly possible. Sequencing many different mutated copies permits the sequence of the repetitive region to be inferred by consensus methods. However, this approach relies on molecular cloning in order to isolate and amplify individual mutant copies, making it hard to scale up the approach for use in conjunction with high-throughput sequencing technologies. To address this problem, we propose NG-SAM, a modified version of the SAM protocol that relies on PCR and dilution steps only, coupled to an NGS workflow. NG-SAM therefore has the potential to be scaled up, e.g. using emerging microfluidics technologies. We built a realistic simulation pipeline to study the feasibility of NG-SAM, and our results suggest that under appropriate experimental conditions the approach might be successfully put into practice. Moreover, our simulations suggest that NG-SAM is capable of robustly reconstructing a wide range of potential target sequences of varying lengths and repetitive structures.</p></div>
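The consensus idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes the mutated copies have already been assembled and aligned to the same length, and it uses a plain column-wise majority vote; the function name `consensus` and the toy sequences are hypothetical and are not taken from the NG-SAM pipeline.

```python
from collections import Counter


def consensus(copies):
    """Column-wise majority vote over equal-length, pre-aligned mutated copies.

    Assumption: random mutations hit different positions in different copies,
    so the original base is the most common one in every column.
    """
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("copies must be aligned to the same length")
    # zip(*copies) iterates over alignment columns.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*copies))


# Toy example: each mutant carries at most one substitution.
original = "ACGTACGTAC"
mutants = [
    "TCGTACGTAC",
    "ACGAACGTAC",
    "ACGTACGTAG",
    "ACGTACCTAC",
    "ACGTACGTAC",
]
recovered = consensus(mutants)
```

With enough independently mutated copies, each position is mutated in only a minority of them, which is why a simple vote recovers the unmutated sequence in this toy case.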

    Performance of NG-SAM in simulated experiments.

    <p>The hexagons are colored according to the mean of the metrics from all covered simulated experiments. White areas represent unexplored parameter space. <b>A</b>. The percentage of successful simulated experiments in the first simulation setting, as a function of length and number of repetitive units. The black circle [at the point (3813, 3)] marks the repetitive structure of the target region used in the second simulation setting. The dashed line corresponds to target regions with a total size of 10 kb. <b>B</b>. Percentage of correctly reconstructed bases in the successful experiments from the first simulation setting, as a function of length and number of repetitive units in the target sequence (black circle and dashed line as in <b>A</b>). <b>C</b>. The percentage of successful simulated experiments in the second simulation setting, as a function of the dilution factors ( and in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0043359#pone-0043359-g002" target="_blank">Figure 2</a>). The black circle corresponds to the dilution factors used in the first simulation setting. <b>D</b>. Percentage of correctly reconstructed bases in the second simulation setting as a function of the dilution factors. Black circle as in <b>C</b>; see text for further details.</p>

    Assembly problems caused by the presence of repeats.

    <p><b>A</b>. The structure of the target region. Red units are identical or near-identical; other colours are unique. <b>B</b>. Fragments ordered by their origin. <b>C</b>. Pool of reads obtained by short read sequencing. Note that in this example the full length of the fragments is sequenced. <b>D</b>. A graph structure summarizing assembly uncertainty. The thickness of the arrows representing the units is indicative of the depth of coverage. <b>E</b>. The two possible resolutions of the assembly graph, given that the copy numbers of all of the units are estimated correctly.</p>

    Overview of the simulated NG-SAM protocol.

    <p>The numbering corresponds to the steps enumerated above in the main text. The trapezoids shaded in light blue represent PCR amplifications (with – being the number of cycles), while the rectangles shaded in yellow represent sampling of molecules by dilution. – are the number of molecules present in the various stages of the simulated experiment, with unique variants symbolised by different coloured dots. and are the dilution factors corresponding to the first and second dilution steps. The black lines represent the “lineages” of the molecules sampled by the second dilution, traced back to the initial molecule pool of size . The steps <b>A</b>–<b>C</b> correspond to the mutagenic PCR, dilution and cleanup PCR steps of the mutagenic protocol. simNGS <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0043359#pone.0043359-simNGS1" target="_blank">[35]</a> is a software tool for simulating Illumina sequencing and Velvet <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0043359#pone.0043359-Zerbino1" target="_blank">[8]</a> is a short read assembler.</p>
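The alternation of PCR amplification and sampling by dilution described in this caption can be sketched as a toy simulation. Everything below is an illustrative assumption rather than the authors' pipeline: the function names `mutagenic_pcr` and `dilute`, perfect doubling in every PCR cycle, a uniform per-base substitution rate, and independent retention of each molecule with probability 1/factor.

```python
import random

BASES = "ACGT"


def mutagenic_pcr(pool, cycles, mut_rate, rng):
    """Toy PCR: each cycle copies every molecule once (assumed perfect
    efficiency); each base of a new copy is substituted with probability
    mut_rate."""
    for _ in range(cycles):
        copies = []
        for mol in pool:
            copy = "".join(
                rng.choice([b for b in BASES if b != base])
                if rng.random() < mut_rate
                else base
                for base in mol
            )
            copies.append(copy)
        pool = pool + copies
    return pool


def dilute(pool, factor, rng):
    """Toy dilution: keep each molecule independently with probability 1/factor."""
    keep_prob = 1.0 / factor
    return [mol for mol in pool if rng.random() < keep_prob]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
template = "ACGTACGTACGTACGTACGT"
amplified = mutagenic_pcr([template], cycles=6, mut_rate=0.02, rng=rng)
sampled = dilute(amplified, factor=8, rng=rng)
reamplified = mutagenic_pcr(sampled, cycles=3, mut_rate=0.0, rng=rng)
```

Setting `mut_rate=0.0` in the last call stands in for a non-mutagenic cleanup amplification; a realistic simulation would also model PCR efficiency, sequencing errors and assembly, which this sketch leaves out.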

    Efficient distance calculation is enabled via a transducer architecture.

    <p>A) Overlapping genomic rearrangements modify the associated copy-number profiles in different ways. Amplifications are indicated in green, deletions in red. The blue rectangles indicate the previous event. B) The one-step minimum event transducer describes all possible edit operations achievable in one event. This FST is composed times with itself to create the full minimum event FST . Edge labels consist of an input symbol, a colon and the corresponding output symbol, followed by a slash and the weight associated with taking that transition. C) The minimum event FST is asymmetric and describes the evolution of a genomic profile from its ancestor. Composed with its inverse this yields the symmetric minimum event distance .</p>

    MEDICC improves reconstruction accuracy over competing methods.

    <p>A) Simulation results show the improvement of reconstruction accuracy for MEDICC over naive methods (BioNJ clustering on Euclidean distances between copy-number profiles, red) and competing algorithms (TuMult, green). B) Allele phasing accuracy across the simulated trees. On average 92.9% of all genomic loci were correctly assigned to the individual parental alleles. C) Density estimates of clonal expansion indices for neutrally evolving trees (red) and trees with induced long branches as created by clonal expansion processes (blue) show the ability of MEDICC to detect clonal expansion.</p>
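For orientation, the naive baseline named in panel A starts from Euclidean distances between copy-number profiles. The sketch below computes only that pairwise distance matrix. It does not implement BioNJ or MEDICC's minimum event distance, and the function name, sample names and profile values are made up for illustration.

```python
import math


def euclidean_distance_matrix(profiles):
    """Pairwise Euclidean distances between equal-length copy-number profiles.

    profiles: dict mapping a sample name to a list of per-segment copy numbers.
    Returns the sorted sample names and a symmetric matrix in that order.
    """
    names = sorted(profiles)
    matrix = [
        [math.dist(profiles[a], profiles[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical profiles over four genomic segments.
profiles = {
    "s1": [2, 2, 2, 2],
    "s2": [2, 3, 3, 2],
    "s3": [2, 2, 1, 0],
}
names, dist = euclidean_distance_matrix(profiles)
```

A matrix like `dist` is the kind of input a distance-based tree builder such as BioNJ would take. This distance treats every segment independently, so a single event spanning several adjacent segments is counted once per segment.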

    Application to a case of endometrioid cancer.

    <p>A) Evolutionary tree of the OV03-04 case reconstructed from whole genome copy-number profiles. Approximate support values indicate how often each split was observed in trees reconstructed after resampling of the distance matrix with added truncated Gaussian noise. MEDICC performs reconstruction of ancestral copy-number profiles. Here, the (compressed) ancestral profiles for chromosome 17 are given as an example and MEDICC depicts unresolved ambiguities in the form of sequence logos. A star indicates no change compared to its ancestor. B) Ordination of the samples using kPCA shows four clear clonal expansions, comprising three separate Omentum groups and the Bl/VV group. C) Circos plot of selected genomic profiles (marked in bold in the tree) shows the extent of chromosomal aberrations across the genome. The two phased parental alleles are indicated in red and blue.</p>

    Evolutionary copy-number trees are reconstructed in three steps.

    <p>1) After segmentation and compression, major and minor alleles are phased using the minimum event criterion. 2) The tree topology is reconstructed from the pairwise distances between genomes. 3) Reconstruction of ancestral genomes yields the final branch lengths of the tree, which correspond to the number of events between genomes.</p>

    Parental alleles are phased using context-free grammars.

    <p>A) Allelic phasing is achieved by choosing consecutive segments from either the major or minor allele which minimise the pairwise distance between profiles. B) The set of all possible phasing choices is modelled by a context-free grammar. In this representation, the order of the regions' copy-number values on the second allele is reversed, in order to match the inside-out parsing scheme of CFGs. That way every possible parse tree of the grammar describes one possible phasing.</p>

    Amino Acid Changes in Disease-Associated Variants Differ Radically from Variants Observed in the 1000 Genomes Project Dataset

    <div><p>The 1000 Genomes Project data provides a natural background dataset for amino acid germline mutations in humans. Since the direction of mutation is known, the amino acid exchange matrix generated from the observed nucleotide variants is asymmetric and the mutabilities of the different amino acids are very different. These differences predominantly reflect preferences for nucleotide mutations in the DNA (especially the high mutation rate of the CpG dinucleotide, which makes arginine mutability very much higher than that of other amino acids) rather than selection imposed by protein structure constraints, although there is evidence for the latter as well. The variants occur predominantly on the surface of proteins (82%), with a slight preference for sites which are more exposed and less well conserved than random. Mutations to functional residues occur about half as often as expected by chance. The disease-associated amino acid variant distributions in OMIM are radically different from those expected on the basis of the 1000 Genomes dataset. The disease-associated variants preferentially occur in more conserved sites, compared to 1000 Genomes mutations. Many of the amino acid exchange profiles appear to exhibit an anti-correlation, with common exchanges in one dataset being rare in the other. Disease-associated variants exhibit more extreme differences in amino acid size and hydrophobicity. More modelling of the mutational processes at the nucleotide level is needed, but these observations should contribute to an improved prediction of the effects of specific variants in humans.</p></div>