370 research outputs found

    Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets

    Current genome-wide association studies (GWAS) use commercial genotyping microarrays that can assay over a million single nucleotide polymorphisms (SNPs). The number of SNPs is further boosted by advanced statistical genotype-imputation algorithms and large SNP databases for reference human populations. Testing such a huge number of SNPs must be taken into account when interpreting statistical significance in genome-wide studies, but this is complicated by the non-independence of SNPs due to linkage disequilibrium (LD). Several previous groups have proposed using the effective number of independent markers (Me) to adjust for multiple testing, but current methods of calculating Me are limited in accuracy or computational speed. Here, we report a faster and more robust method to calculate Me. Applying this efficient method [implemented in a free software tool named Genetic type 1 error calculator (GEC)], we systematically examined Me, and the corresponding p-value thresholds required to control the genome-wide type 1 error rate at 0.05, for 13 Illumina or Affymetrix genotyping arrays, as well as for the HapMap Project and 1000 Genomes Project datasets, which are widely used as reference panels in genotype imputation. Our results suggest a p-value threshold of ~10⁻⁷ as the criterion for genome-wide significance for early commercial genotyping arrays, but slightly more stringent thresholds of ~5 × 10⁻⁸ for current or merged commercial genotyping arrays, ~10⁻⁸ for all common SNPs in the 1000 Genomes Project dataset, and ~5 × 10⁻⁸ for the common SNPs within genes only.
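The idea of an effective number of tests can be illustrated with a small sketch. The eigenvalue estimator below follows the spirit of earlier Me proposals (each eigenvalue of the SNP correlation matrix contributes one full test plus a fractional remainder); it is not the GEC algorithm itself, and `effective_tests` and the toy genotype matrix are illustrative, not part of the tool:

```python
import numpy as np

def effective_tests(genotypes):
    """Estimate the effective number of independent tests (Me) from a
    genotype matrix (individuals x SNPs) via the eigenvalues of the SNP
    LD correlation matrix. A sketch of the general Me idea, not GEC."""
    r = np.corrcoef(genotypes, rowvar=False)      # SNP x SNP correlation
    eigvals = np.clip(np.linalg.eigvalsh(r), 0, None)
    eigvals = np.round(eigvals, 8)                # guard against FP jitter
    # each eigenvalue >= 1 counts as one full test; the fractional
    # remainder contributes partially
    return sum(int(lam >= 1) + (lam - np.floor(lam)) for lam in eigvals)

# perfectly correlated SNPs collapse toward a single effective test
rng = np.random.default_rng(0)
snp = rng.integers(0, 3, size=200).astype(float)
g = np.column_stack([snp, snp, snp])              # three identical SNPs
me = effective_tests(g)
alpha = 0.05 / me   # Bonferroni-style genome-wide threshold using Me
```

With ~10⁶ effective tests the same division yields the familiar ~5 × 10⁻⁸ threshold.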

    ConDeTri - A Content Dependent Read Trimmer for Illumina Data

    During the last few years, DNA and RNA sequencing have come to play an increasingly important role in biological and medical applications, owing to the greater volume of data yielded by new sequencing machines and the enormous decrease in sequencing costs. Illumina/Solexa sequencing in particular has had a growing impact on data gathering from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present ConDeTri, a method for content-dependent read trimming of next-generation sequencing data that uses the quality score of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized. Another aspect is to incorporate read trimming into next-generation sequencing data processing and analysis pipelines. It can process single-end and paired-end sequence data of arbitrary length, is independent of sequencing coverage, and requires no user interaction. ConDeTri is able to trim and remove reads with low quality scores, saving computational time and memory during de novo assemblies. Low-coverage and large-genome sequencing projects will especially benefit from read trimming. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data.
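The basic shape of per-base quality trimming can be sketched as follows. This is a minimal 3'-end trimmer, not ConDeTri's actual content-dependent criterion (which also weighs runs of high-quality bases against interspersed low-quality ones); the function name and cutoffs are illustrative:

```python
def trim_read(seq, quals, q_cut=25, min_len=30):
    """Trim low-quality bases from the 3' end of a read using per-base
    Phred scores; return None if the read becomes too short to keep."""
    end = len(seq)
    while end > 0 and quals[end - 1] < q_cut:
        end -= 1
    if end < min_len:
        return None                # discard the read entirely
    return seq[:end], quals[:end]

# the last three bases fall below the cutoff and are trimmed away
seq   = "ACGTACGTACGT" * 3         # 36 bp read
quals = [38] * 33 + [10, 8, 5]
trimmed = trim_read(seq, quals)
```

A paired-end wrapper would additionally drop the mate (or route it to a singleton file) when one read of a pair is discarded.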

    Investigation into the annotation of protocol sequencing steps in the sequence read archive

    BACKGROUND: The workflow for producing high-throughput sequencing data from nucleic acid samples is complex. A series of protocol steps must be followed in preparing samples for next-generation sequencing. The bias introduced by a number of these steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be quantified. RESULTS: We examined the experimental metadata of the public Sequence Read Archive (SRA) repository to ascertain how well important sequencing steps are annotated in submissions to the database. Using SQL relational database queries (against the SRAdb SQLite database generated by the Bioconductor consortium) to search for keywords commonly occurring in key preparatory protocol steps, partitioned over studies, we found that 7.10%, 5.84% and 7.57% of all records had at least one keyword corresponding to fragmentation, ligation and enrichment, respectively. Only 4.06% of all records, partitioned over studies, had keywords for all three protocol steps (5.58% of all SRA records). CONCLUSIONS: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream, meta-analyses and comparative studies based on these data will carry a source of bias that cannot at present be quantified.
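The kind of keyword query described above can be sketched against a toy database. The table and column names (`experiment`, `library_construction_protocol`) are assumed from our reading of the SRAdb schema and should be verified against an actual SRAmetadb.sqlite copy; the inserted rows are invented for illustration:

```python
import sqlite3

# toy stand-in for the SRAdb SQLite database
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE experiment (
                    study_accession TEXT,
                    library_construction_protocol TEXT)""")
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?)",
    [("SRP001", "DNA sheared by sonication, adapter ligation, enrichment"),
     ("SRP002", "standard protocol"),          # no step keywords at all
     ("SRP003", "Covaris fragmentation then ligation")])

# count distinct studies whose protocol text mentions at least one
# fragmentation-related keyword (LIKE is case-insensitive for ASCII)
keywords = ["%fragment%", "%shear%", "%sonicat%"]
where = " OR ".join("library_construction_protocol LIKE ?" for _ in keywords)
n_frag = conn.execute(
    f"SELECT COUNT(DISTINCT study_accession) FROM experiment WHERE {where}",
    keywords).fetchone()[0]
```

Repeating the query with ligation and enrichment keyword lists, then intersecting the study sets, reproduces the all-three-steps statistic.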

    Targeted Genome-Wide Enrichment of Functional Regions

    Only a small fraction of a large genome such as the human genome contains functional regions such as exons, promoters, and polyA sites. A platform technique for selective enrichment of functional genomic regions would enable several next-generation sequencing applications, including the discovery of causal mutations for disease and drug response. Here, we describe a powerful platform technique, termed “functional genomic fingerprinting” (FGF), for the multiplexed genome-wide isolation and analysis of targeted regions such as the exome, promoterome, or exon splice enhancers. The technique employs a uniquely designed Fixed-Randomized primer: the fixed part binds with full sequence complementarity at the multiple sites where the fixed sequence (such as a splice signal) occurs within the genome, while the randomized part contains all possible sequence permutations, so that the primers multiplex-amplify the many regions bounded by the fixed sequences (e.g., exons). Notably, validation of this technique using the cardiac myosin binding protein-C (MYBPC3) gene as an example strongly supports the applicability and efficacy of the method. Further, assisted by genome-wide computational analyses of such sequences, the FGF technique may provide a unique platform for high-throughput sample production and analysis of targeted genomic regions by next-generation sequencing, with powerful applications in discovering disease and drug-response genes.
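The "regions bounded by fixed sequences" idea can be explored in silico with a simple motif scan. This sketch only enumerates candidate fragments on one strand of a toy sequence; real FGF primers also carry the randomized 3' part and bind both strands, and the motif, function name, and size limit here are illustrative assumptions:

```python
import re

def bounded_regions(genome, fixed, max_len=2000):
    """List (start, end) spans bounded by two occurrences of a fixed
    motif -- the fragments a fixed-sequence primer pair could in
    principle co-amplify, up to a practical amplicon length."""
    starts = [m.start() for m in re.finditer(fixed, genome)]
    regions = []
    for i, s in enumerate(starts):
        for e in starts[i + 1:]:
            if e - s > max_len:
                break                       # later sites are even farther
            regions.append((s, e + len(fixed)))  # span includes both motifs
    return regions

# toy sequence with two copies of the splice-acceptor dinucleotide "AG"
genome = "TTTAGCCCCCAGTTT"
regions = bounded_regions(genome, "AG", max_len=10)
```

A genome-wide version of such a scan is what the computational analyses mentioned in the abstract would estimate amplicon counts from.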

    The PhyloPythiaS Web Server for Taxonomic Assignment of Metagenome Sequences

    Metagenome sequencing is becoming common, and there is an increasing need for easily accessible data-analysis tools. An essential step is the taxonomic classification of sequence fragments. We describe a web server for the taxonomic assignment of metagenome sequences with PhyloPythiaS, a fast and accurate sequence composition-based classifier that utilizes the hierarchical relationships between clades. Taxonomic assignments with the web server can be made with a generic model or with sample-specific models that users can specify and create. Several interactive visualization modes and multiple download formats allow quick and convenient analysis and downstream processing of taxonomic assignments. Here, we demonstrate usage of the web server by taxonomic assignment of metagenome samples from an acidophilic biofilm community of an acid mine and from a microbial community of cow rumen.
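"Sequence composition-based" classification typically starts from normalized k-mer frequency vectors. The sketch below shows that feature representation only; PhyloPythiaS's actual feature set, k values, and hierarchical model are described in the paper, and the function name here is illustrative:

```python
from collections import Counter
from itertools import product

def kmer_composition(seq, k=4):
    """Normalized k-mer frequency vector over the fixed ACGT alphabet --
    the kind of composition feature a classifier can be trained on."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    alphabet = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[km] / total for km in alphabet]

# a 12-bp toy fragment yields a 16-dimensional vector for k=2
vec = kmer_composition("ACGTACGTACGT", k=2)
```

Fragments from the same clade tend to have similar composition vectors, which is what makes taxonomic assignment from composition feasible.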

    Meraculous: De Novo Genome Assembly with Short Paired-End Reads

    We describe a new algorithm, Meraculous, for whole-genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4-megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph restricted to oligonucleotides with unique high-quality extensions in the dataset, avoiding the explicit error-correction step used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ∼280 bp or ∼3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
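The conservative unique-extension traversal can be sketched in a few lines. This toy version extends a seed k-mer to the right only while exactly one extension base is observed, stopping at branches or cycles; it omits the quality filtering, reverse-complement handling, and memory-efficient hashing that the real assembler depends on, and the names are illustrative:

```python
from collections import defaultdict

def unique_extension_contig(reads, k, seed):
    """Grow a contig rightward from a seed k-mer, following only
    k-mers that have exactly one observed extension in the reads."""
    ext = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            ext[read[i:i + k]].add(read[i + k])
    contig, kmer, seen = seed, seed, {seed}
    while len(ext[kmer]) == 1:        # stop at branches and dead ends
        base = next(iter(ext[kmer]))
        kmer = kmer[1:] + base
        if kmer in seen:              # stop on cycles
            break
        seen.add(kmer)
        contig += base
    return contig

reads = ["ACGTAC", "CGTACG", "GTACGT"]
contig = unique_extension_contig(reads, k=3, seed="ACG")
```

Because only unambiguous extensions are followed, sequencing errors (which create rare branch k-mers) terminate rather than corrupt contigs, which is why an explicit error-correction pass can be avoided.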

    Meta-analytic approach to the accurate prediction of secreted virulence effectors in gram-negative bacteria

    BACKGROUND: Many pathogens use a type III secretion system to translocate virulence proteins (called effectors) in order to adapt to the host environment. To date, many prediction tools for effector identification have been developed. However, these tools are insufficiently accurate to produce a list of putative effectors that can be applied directly to labor-intensive experimental verification. This also suggests that important features of effectors have yet to be fully characterized. RESULTS: In this study, we constructed an accurate approach to predicting secreted virulence effectors from Gram-negative bacteria, consisting of a support vector machine (SVM)-based discriminant analysis followed by simple criteria-based filtering. Accuracy was assessed by estimating the average number of true positives in the top-20 ranking of a genome-wide screen. For validation, 10 sets of 20 training and 20 testing examples were randomly selected from 40 known effectors of Salmonella enterica serovar Typhimurium LT2. On average, the SVM portion of our system predicted 9.7 true positives from 20 testing examples in the top-20 of the prediction. Removing the N-terminal instability, codon adaptation index and ProtParam indices decreased the score to 7.6, 8.9 and 7.9, respectively. These discriminative features suggest the following characteristics of effectors: an unstable N-terminus, non-optimal codon usage, hydrophilicity, and a lower aliphatic index. The secondary filtering step, comprising coexpression analysis and domain distribution analysis, further refined the average true-positive count to 12.3. We further confirmed that our system correctly predicts known effectors of P. syringae DC3000, strongly indicating its feasibility. CONCLUSIONS: We have developed an accurate prediction system for screening effectors on a genome-wide scale. We confirmed its accuracy by external validation using known effectors of Salmonella and obtained a list of putative effectors for the organism. The level of accuracy was sufficient to yield candidates for gene-directed experimental verification. Furthermore, new features of effectors were revealed: non-optimal codon usage and instability of the N-terminal region. From these findings, a new working hypothesis is proposed regarding the mechanisms controlling the translocation of virulence effectors and determining the substrate specificity encoded in the secretion system.
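The two-stage design (rank candidates with a learned discriminant, then filter the shortlist by simple criteria) can be sketched as below. A trivial linear scorer stands in for the trained SVM, and the feature names, weights, and cutoff are invented for illustration, not the paper's:

```python
def rank_and_filter(candidates, weights, top_n=3, max_cai=0.7):
    """candidates: {name: feature dict}. Rank by a linear score, keep
    the top_n, then drop any whose codon adaptation index exceeds
    max_cai (effectors tended to show non-optimal codon usage)."""
    score = lambda f: sum(weights[k] * f[k] for k in weights)
    ranked = sorted(candidates, key=lambda n: score(candidates[n]),
                    reverse=True)
    shortlist = ranked[:top_n]                     # SVM-style ranking stage
    return [n for n in shortlist                   # criteria-based filter
            if candidates[n]["cai"] <= max_cai]

weights = {"n_term_instability": 1.0, "hydrophilicity": 0.5, "cai": -1.0}
candidates = {
    "geneA": {"n_term_instability": 0.9, "hydrophilicity": 0.8, "cai": 0.4},
    "geneB": {"n_term_instability": 0.2, "hydrophilicity": 0.1, "cai": 0.9},
    "geneC": {"n_term_instability": 0.8, "hydrophilicity": 0.7, "cai": 0.8},
    "geneD": {"n_term_instability": 0.1, "hydrophilicity": 0.2, "cai": 0.3},
}
hits = rank_and_filter(candidates, weights)
```

geneC ranks highly in the first stage but is removed by the codon-usage filter, mirroring how the paper's secondary filtering raised the true-positive count beyond the SVM alone.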