
    DUDE-Seq: Fast, Flexible, and Robust Denoising for Targeted Amplicon Sequencing

    We consider the correction of errors in nucleotide sequences produced by next-generation targeted amplicon sequencing. Next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates tend to be high. Denoising in high-throughput sequencing has thus become a crucial step for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel. It effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq
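The "discrete memoryless channel" setting mentioned in this abstract can be illustrated with the generic discrete universal denoiser (DUDE) rule. The sketch below is not the DUDE-Seq code: it handles substitution errors only, and it assumes a symmetric channel, a Hamming loss, and a context length `k` chosen for illustration. The function names `symmetric_channel` and `dude_denoise` are hypothetical.

```python
from collections import defaultdict

import numpy as np

ALPHABET = "ACGT"
INDEX = {base: i for i, base in enumerate(ALPHABET)}


def symmetric_channel(error_rate):
    """Assumed noise model: each base is replaced by any other base
    with total probability `error_rate`, split evenly."""
    size = len(ALPHABET)
    off = error_rate / (size - 1)
    return np.full((size, size), off) + np.eye(size) * (1.0 - error_rate - off)


def dude_denoise(noisy, channel, k=2):
    """Two-pass denoiser: count symbols per two-sided context, then pick
    the reconstruction with the smallest estimated loss at each position."""
    size = len(ALPHABET)
    loss = 1.0 - np.eye(size)            # Hamming loss (assumption)
    channel_inv = np.linalg.inv(channel)
    positions = range(k, len(noisy) - k)

    # Pass 1: for every (left, right) context, count the observed centre symbols.
    counts = defaultdict(lambda: np.zeros(size))
    for i in positions:
        context = (noisy[i - k:i], noisy[i + 1:i + k + 1])
        counts[context][INDEX[noisy[i]]] += 1

    # Pass 2: apply the denoising rule; the first and last k symbols are kept.
    denoised = list(noisy)
    for i in positions:
        context = (noisy[i - k:i], noisy[i + 1:i + k + 1])
        z = INDEX[noisy[i]]
        clean_estimate = counts[context] @ channel_inv
        expected_loss = (clean_estimate * channel[:, z]) @ loss
        denoised[i] = ALPHABET[int(np.argmin(expected_loss))]
    return "".join(denoised)


if __name__ == "__main__":
    clean = "ACGT" * 200
    noisy = clean[:101] + "T" + clean[102:]   # one substitution error
    restored = dude_denoise(noisy, symmetric_channel(0.03), k=2)
    print(restored == clean)
```

The rule only overrides an observed base when its context has been seen often enough for the channel model to make an error the more likely explanation, which is why the example uses a long, repetitive sequence.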

    De novo human genome assemblies reveal spectrum of alternative haplotypes in diverse populations.

    The human reference genome is used extensively in modern biological research. However, a single consensus representation is inadequate as a universal reference structure because it is only one haplotype among many in the human population. Using 10× Genomics (10×G) "Linked-Read" technology, we perform whole genome sequencing (WGS) and de novo assembly on 17 individuals across five populations. We identify 1842 breakpoint-resolved non-reference unique insertions (NUIs) that, in aggregate, add up to 2.1 Mb of previously undescribed genomic content. Among these, 64% are considered ancestral to humans since they are found in non-human primate genomes. Furthermore, 37% of the NUIs can be found in the human transcriptome and 14% likely arose from Alu-recombination-mediated deletion. Our results underline the need for a set of human reference genomes that includes a comprehensive list of alternative haplotypes to depict the complete spectrum of genetic diversity across populations.

    Quality control and preprocessing of metagenomic datasets

    Summary: Here, we present PRINSEQ for easy and rapid quality control and data preprocessing of genomic and metagenomic datasets. Summary statistics of FASTA (and QUAL) or FASTQ files are generated in tabular and graphical form, and sequences can be filtered, reformatted and trimmed by a variety of options to improve downstream analysis.
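The kind of filtering and trimming this abstract describes can be sketched in a few lines. This is not PRINSEQ's implementation or its option set: the helper names, the Phred+33 offset, and the thresholds (`min_q`, `min_len`, `min_mean_q`) are assumptions chosen for illustration.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()                      # the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score, assuming Phred+33 encoding."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def trim_right(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_reads(records, min_len=4, min_mean_q=20):
    """Trim each read, then keep it only if it is long and good enough."""
    for name, seq, qual in records:
        seq, qual = trim_right(seq, qual)
        if len(seq) >= min_len and mean_quality(qual) >= min_mean_q:
            yield name, seq, qual


if __name__ == "__main__":
    data = "@r1\nACGTACGT\n+\nIIIIII##\n@r2\nACGT\n+\n####\n"
    for record in filter_reads(read_fastq(io.StringIO(data))):
        print(record)
```

In the example input, the first read loses its two low-quality trailing bases and is kept, while the second read is trimmed to nothing and dropped.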

    High-quality, high-throughput measurement of protein-DNA binding using HiTS-FLIP

    To understand the behavior of transcription factors (TFs) in more depth and on a genome-wide scale, novel quantitative high-throughput experiments are needed. Recently, HiTS-FLIP (High-Throughput Sequencing-Fluorescent Ligand Interaction Profiling) was invented by the Burge lab at MIT (Nutiu et al. (2011)). Based on an Illumina GA-IIx machine for next-generation sequencing, HiTS-FLIP measures the affinity of fluorescently labeled proteins to millions of DNA clusters at equilibrium in an unbiased and untargeted way, examining the entire sequence space by determining dissociation constants (Kds) for all 12-mer DNA motifs. During my PhD I helped to improve the experimental design of this method so that protein-DNA binding events can be measured at equilibrium without any washing step, by utilizing the TIRF (Total Internal Reflection Fluorescence) based optics of the GA-IIx. In addition, I developed the first versions of XML-based control software that automates the measurement procedure. To process the vast amount of data produced by each run, I developed a sophisticated, high-performance software pipeline that locates DNA clusters and normalizes and extracts the fluorescent signals. Moreover, the k-mer motifs contained in the clusters are ranked and their DNA binding affinities are quantified with high accuracy. My approach of applying phase correlation to estimate the relative translational offset between the observed tile images and the template images avoids resequencing and thus allows the flow cell to be reused for several HiTS-FLIP experiments, which greatly reduces cost and time. Nutiu et al. (2011) normalize the cluster intensities using information from the sequencing images, which introduces a nucleotide-specific bias. Instead, I estimate the cluster-related normalization factors directly from the protein images, which captures the uneven illumination bias more accurately and leads to an improved correction for each tile image. My analysis of the ranking algorithm by Nutiu et al. (2011) has revealed that it is unable to rank all measured k-mers. Discarding all the clusters related to previously ranked k-mers has the side effect of eliminating any clusters on which k-mers could be ranked that share submotifs with previously ranked k-mers. This shortcoming affects even strongly binding k-mers that are only one mutation away from the top-ranked k-mer. My findings show that omitting the cluster deletion step in the ranking process overcomes this limitation and allows the full spectrum of all possible k-mers to be ranked. In addition, my insight reduces the run time of the ranking algorithm from quadratic to linear. The experimental improvements combined with the sophisticated processing of the data have led to a very high accuracy of the HiTS-FLIP dissociation constants (Kds), comparable to the Kds measured by the very sensitive HiP-FA assay (Jung et al. (2015)). However, experimentally HiTS-FLIP is a very challenging assay. In total, eight HiTS-FLIP experiments were performed but only one showed saturation; the others exhibited protein aggregation at the amplified DNA clusters. This biochemical issue could not be remedied. As the example TF for studying the details of HiTS-FLIP, GCN4 was chosen, a dimeric, basic leucine zipper TF that acts as the master regulator of the amino acid starvation response in Saccharomyces cerevisiae (Natarajan et al. (2001)). The fluorescent dye was mOrange. The HiTS-FLIP Kds for the TF GCN4 were validated by the HiP-FA assay, achieving a Pearson correlation coefficient of R=0.99 and a relative error of delta=30.91%. Thus, a unique and comprehensive data set of utmost quantitative precision was obtained that made it possible to study the complex binding behavior of GCN4 in a new way. My downstream analyses reveal that the known 7-mer consensus motif of GCN4, which is TGACTCA, is modulated by its 2-mer neighboring flanking regions, spanning an affinity range of over two orders of magnitude from Kd=1.56 nM to Kd=552.51 nM. These results suggest that the common 9-mer PWM (Position Weight Matrix) for GCN4 is insufficient to describe the binding behavior of GCN4. Rather, an additional left and right flanking nucleotide is required to extend the 9-mer to an 11-mer. My analyses regarding mutations and related delta delta G values suggest long-range interdependencies between nucleotides of the two dimeric half-sites of GCN4. Consequently, models assuming positional independence, such as a PWM, are insufficient to explain these interdependencies. Instead, the full spectrum of affinity values for all k-mers of appropriate size should be measured and applied in further analyses, as proposed by Nutiu et al. (2011). A further discovery was a set of new binding motifs of GCN4, which can only be detected with a method like HiTS-FLIP that examines the entire sequence space and allows for unbiased, de novo motif discovery. All these new motifs contain GTGT as a submotif, and the data collected suggest that GCN4 binds as a monomer to these new motifs. Therefore, it might even be possible to detect different binding modes with HiTS-FLIP. My results emphasize the binding complexity of GCN4 and demonstrate the advantage of HiTS-FLIP for investigating the complexity of regulatory processes.
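The phase-correlation step mentioned in this abstract, estimating the translational offset between a tile image and a template image, can be sketched with NumPy as follows. This is a textbook version under simplifying assumptions (same-sized images, integer circular shifts, no windowing or sub-pixel refinement), not the thesis pipeline; the function name `phase_correlation_offset` is hypothetical.

```python
import numpy as np


def phase_correlation_offset(template, observed):
    """Return (dy, dx) such that `observed` is approximately `template`
    shifted by dy rows and dx columns."""
    f_template = np.fft.fft2(template)
    f_observed = np.fft.fft2(observed)

    # Normalised cross-power spectrum: keep the phase, discard the magnitude.
    cross_power = f_observed * np.conj(f_template)
    cross_power /= np.maximum(np.abs(cross_power), 1e-12)

    # The inverse transform peaks at the translation between the two images.
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)

    # Peaks past the midpoint correspond to negative shifts (wrap-around).
    return tuple(
        int(p) if p <= size // 2 else int(p) - size
        for p, size in zip(peak, correlation.shape)
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    template = rng.random((64, 64))
    observed = np.roll(template, shift=(3, -5), axis=(0, 1))
    print(phase_correlation_offset(template, observed))
```

Because only the phase of the cross-power spectrum is used, the estimate is insensitive to overall intensity differences between the two images, which is one reason the technique suits images acquired under different illumination.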

    genomepy: genes and genomes at your fingertips

    Analyzing a functional genomics experiment, such as ATAC-, ChIP- or RNA-sequencing, requires reference data including a genome assembly and gene annotation. These resources can generally be retrieved from different organizations and in different versions. Most bioinformatic workflows require the user to supply this genomic data manually, which can be a tedious and error-prone process. Here we present genomepy, which can search, download, and preprocess the right genomic data for your analysis. Genomepy can search genomic data on NCBI, Ensembl, UCSC and GENCODE, and compare available gene annotations to enable an informed decision. The selected genome and gene annotation can be downloaded and preprocessed with sensible, yet controllable, defaults. Additional supporting data can be automatically generated or downloaded, such as aligner indexes, genome metadata and blacklists. Genomepy is freely available at https://github.com/vanheeringen-lab/genomepy under the MIT license and can be installed through pip or bioconda

    Sequencing viral genomes from a single isolated plaque


    Database indexing for production MegaBLAST searches

    Motivation: The BLAST software package for sequence comparison speeds up homology search by preprocessing a query sequence into a lookup table. Numerous research studies have suggested that preprocessing the database instead would give better performance. However, production usage of sequence comparison methods that preprocess the database has been limited to programs such as BLAT and SSAHA that are designed to find matches when query and database subsequences are highly similar
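The contrast drawn in this abstract, preprocessing the database rather than the query, can be illustrated with a toy k-mer index. This is a minimal sketch of the general idea, not the MegaBLAST index format or its seeding strategy; the function names and the word size `k` are assumptions.

```python
from collections import defaultdict


def build_db_index(database, k):
    """Preprocess the database once: map every k-mer to its (sequence, position) hits."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def find_seeds(query, index, k):
    """Look up each query k-mer in the prebuilt index and report exact seed matches
    as (query position, database sequence, database position)."""
    seeds = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], ()):
            seeds.append((qpos, name, dpos))
    return seeds


if __name__ == "__main__":
    database = {"s1": "ACGTACGT", "s2": "TTTTACGA"}
    index = build_db_index(database, k=4)
    print(find_seeds("TACG", index, k=4))
```

The index is built once and reused across queries, so each search costs time proportional to the query length plus the number of seed hits instead of requiring a scan of the whole database.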

    Relative impact of indels versus SNPs on complex disease

    It is unclear whether insertions and deletions (indels) are more likely to influence complex traits than abundant single‐nucleotide polymorphisms (SNPs). We sought to understand which category of variation is more likely to impact health. Using the SardiNIA study as an exemplar, we characterized 478,876 common indels and 8,246,244 common SNPs in up to 5,949 well‐phenotyped individuals from an isolated valley in Sardinia. We assessed association between 120 traits, resulting in 89 nonoverlapping‐associated loci. We evaluated whether indels were enriched among credible sets of potential causal variants. These credible sets included 1,319 SNPs and 88 indels. We did not find indels to be significantly enriched. Indels were the most likely causal variant in seven loci, including one locus associated with monocyte count where an indel with causality and mechanism previously demonstrated (rs200748895:TGCTG/T) had a 0.999 posterior probability. Overall, our results show a very modest and nonsignificant enrichment for common indels in associated loci. Peer Reviewed
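The counts quoted in this abstract allow a quick back-of-the-envelope comparison of the indel share among all characterized variants with the indel share in the credible sets. The sketch below is only that arithmetic, not the authors' enrichment test, and it says nothing about statistical significance; the function name `indel_fraction` is hypothetical.

```python
def indel_fraction(n_indels, n_snps):
    """Share of indels among all variants in a set."""
    return n_indels / (n_indels + n_snps)


# Counts taken from the abstract.
background = indel_fraction(478_876, 8_246_244)   # all common variants
credible = indel_fraction(88, 1_319)              # variants in credible sets
fold = credible / background

if __name__ == "__main__":
    print(f"indel share, all common variants: {background:.4f}")
    print(f"indel share, credible sets:       {credible:.4f}")
    print(f"fold difference:                  {fold:.2f}")
```

The two shares are close to each other, which is in line with the abstract's description of the enrichment as very modest.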