15 research outputs found

    Spaced seeds improve k-mer-based metagenomic classification

    Metagenomics is a powerful approach to studying the genetic content of environmental samples, and it has been strongly promoted by NGS technologies. To cope with the massive data involved in modern metagenomic projects, recent tools [4, 39] rely on the analysis of k-mers shared between the read to be classified and sampled reference genomes. Within this general framework, we show in this work that spaced seeds provide a significant improvement in classification accuracy over traditional contiguous k-mers. We support this thesis through a series of different computational experiments, including simulations of large-scale metagenomic projects. Scripts and programs used in this study, as well as supplementary material, are available from http://github.com/gregorykucherov/spaced-seeds-for-metagenomics. Comment: 23 pages
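The difference between contiguous k-mers and spaced seeds described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `spaced_kmers` and `shared_fraction`, the mask `1101`, and the toy sequences are all assumptions made for illustration. A mask position marked `1` is compared and a position marked `0` is ignored, so a mismatch that falls on a `0` does not destroy the hit.

```python
def spaced_kmers(seq, mask):
    """Yield the spaced k-mers of seq under mask ('1' = keep, '0' = ignore).

    A mask made only of '1' characters gives ordinary contiguous k-mers.
    """
    keep = [i for i, c in enumerate(mask) if c == "1"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        yield "".join(seq[start + i] for i in keep)


def shared_fraction(read, reference, mask):
    """Fraction of the read's (spaced) k-mers that also occur in the reference.

    Hypothetical scoring helper: real classifiers index many genomes and
    combine hits in more elaborate ways.
    """
    ref_index = set(spaced_kmers(reference, mask))
    read_kmers = list(spaced_kmers(read, mask))
    if not read_kmers:
        return 0.0
    hits = sum(1 for kmer in read_kmers if kmer in ref_index)
    return hits / len(read_kmers)
```

With the toy read `ACTT` and reference `ACGT`, which differ at the third base, `shared_fraction` returns 0.0 under the contiguous mask `1111` and 1.0 under the spaced mask `1101`, because the mismatch falls on the ignored position.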

    Subset Seed Extension to Protein BLAST

    The seeding technique became central in the theory of sequence alignment and there are several efficient tools applying seeds to DNA homology search. Recently, a concept of subset seeds has been proposed for similarity search in protein sequences. We experimentally evaluate the applicability of subset seeds to protein homology search. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method computes, by an evolutionary algorithm, seeds that are specifically designed for a given protein family. The representation of seeds by deterministic finite automata (DFAs) is developed and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original NCBI-BLAST and PSI-BLAST on several protein families. Our results demonstrate a superiority of SeedBLAST in terms of efficiency, especially in the case of twilight zone hits. SeedBLAST is an open source software freely availabl

    Efficient alternatives to PSI-BLAST

    In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI-BLAST tool: SeedBLAST and CTX-PSI BLAST. Both may benefit from knowledge about the amino acid composition specific to a given protein family: SeedBLAST uses an advisedly designed seed, while CTX-PSI BLAST extends PSI-BLAST with a context-specific substitution model. The seeding technique became central in the theory of sequence alignment. There are several efficient tools applying seeds to DNA homology search, but not to protein homology search. In this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method computes, by an evolutionary algorithm, seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate a superiority of SeedBLAST in terms of efficiency, especially in the case of twilight zone hits. The contextual substitution model has been proven to increase the sensitivity of protein alignment. In this paper we take the next step in the contextual alignment program. We announce a contextual version of the PSI-BLAST algorithm, an iterative version of the NCBI-BLAST tool. An experimental evaluation demonstrates a significantly higher sensitivity than the ordinary PSI-BLAST algorithm.

    The association between 38 previously reported polymorphisms and psoriasis in a Polish population: High predicative accuracy of a genetic risk score combining 16 loci.

    To confirm the association of previously discovered psoriasis (Ps) risk loci with the disease in a Polish population and to create predictive models based on the combination of these single nucleotide polymorphisms (SNPs). Thirty-eight SNPs were genotyped in 480 Ps patients and 490 controls. Allele distributions were compared between patients and controls, as well as between different Ps sub-phenotypes. The genetic risk score (GRS) was calculated to assess the cumulative risk conferred by multiple loci. We confirmed associations of several loci with Ps: HLA-C, REL, IL12B, TRIM39/RPP21, POU5F1, MICA. The analysis of ROC curves showed that GRS combining 16 SNPs at least nominally (uncorrected P0.05). In order to assess the total risk conferred by GRS-N, we calculated ORs according to GRS-N quartile: the Ps OR for top vs. bottom GRS-N quartiles was 12.29 (P < 1 × 10^-6). The analysis of different Ps sub-phenotypes showed an association of GRS-N with age of onset and family history of Ps. We confirmed the association of Ps with several previously identified genetic risk factors in a Polish population. We found that a GRS combining 16 SNPs at least nominally associated with Ps had a significantly better discriminatory ability than HLA-C or a GRS combining SNPs associated with Ps after the Bonferroni correction. In contrast, adding additional SNPs to the GRS did not significantly increase the discriminative power.
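The two quantities named in the abstract above, a genetic risk score and an odds ratio between the top and bottom score quartiles, can be sketched as follows. This is an illustration under stated assumptions, not the study's method: the abstract does not say whether the score is weighted, so the sketch assumes a plain unweighted count of risk alleles, and the helper names and any data passed to them are hypothetical.

```python
from statistics import quantiles


def genetic_risk_score(risk_allele_counts):
    """Unweighted GRS (an assumption): total number of risk alleles.

    Each entry is 0, 1 or 2, the number of risk alleles carried at one SNP.
    """
    return sum(risk_allele_counts)


def odds_ratio(cases_top, controls_top, cases_bottom, controls_bottom):
    """Odds ratio of being a case in the top group versus the bottom group."""
    return (cases_top * controls_bottom) / (controls_top * cases_bottom)


def top_vs_bottom_quartile_or(scores, is_case):
    """Odds ratio comparing the top and bottom quartiles of the score.

    Quartile cut points come from statistics.quantiles with its default
    method; other quartile conventions would shift the cut points slightly.
    Raises ZeroDivisionError if a needed cell of the 2x2 table is empty.
    """
    q1, _, q3 = quantiles(scores, n=4)
    top = [case for s, case in zip(scores, is_case) if s >= q3]
    bottom = [case for s, case in zip(scores, is_case) if s <= q1]
    return odds_ratio(
        cases_top=sum(top),
        controls_top=len(top) - sum(top),
        cases_bottom=sum(bottom),
        controls_bottom=len(bottom) - sum(bottom),
    )
```

An odds ratio well above 1, such as the 12.29 reported in the abstract, indicates that cases are concentrated among individuals with the highest scores.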