sistr_cmd v1.0.2 serotyping databases

Abstract

<p><strong><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147101">Salmonella In Silico Typing Resource (SISTR)</a> <a href="https://github.com/peterk87/sistr_cmd">sistr_cmd</a> version <a href="https://github.com/peterk87/sistr_cmd/releases/tag/v1.0.2">1.0.2</a> serotyping databases</strong></p>
<p>File structure tree for the <code>sistr_cmd</code> <code>data</code> folder:</p>
<pre><code>.
|-- [4.0K]  antigens
|   |-- [1.0M]  fliC.fasta
|   |-- [210K]  fljB.fasta
|   |-- [126K]  wzx.fasta
|   `-- [ 60K]  wzy.fasta
|-- [4.0K]  cgmlst
|   |-- [7.4M]  cgmlst-centroid.fasta
|   |-- [ 96M]  cgmlst-full.fasta
|   |-- [134M]  cgmlst-profiles.hdf
|   `-- [ 803]  README.md
|-- [1.1M]  genomes-to-serovar.txt
|-- [1.0M]  genomes-to-subspecies.txt
|-- [118K]  Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv
`-- [ 92M]  sistr.msh

2 directories, 12 files</code></pre>
<p><strong>Description of files:</strong></p>
<ul>
<li><code>genomes-to-serovar.txt</code>: Tab-delimited mapping of each genome ID to its serovar designation for the 52,790 Salmonella genomes.</li>
<li><code>genomes-to-subspecies.txt</code>: Tab-delimited mapping of each genome ID to its subspecies designation for the 52,790 Salmonella genomes.</li>
<li><code>Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv</code>: Serovar and antigenic formula table used by <code>sistr_cmd</code> to look up serovar designations from antigen results.</li>
<li><code>sistr.msh</code>: <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">Mash</a> sketch file of 11,840 Salmonella genomes for Mash-based serotyping.</li>
<li><code>antigens</code>: for antigen gene search-based serotyping
<ul>
<li><code>fliC.fasta</code>: fliC gene alleles for H1-antigen typing</li>
<li><code>fljB.fasta</code>: fljB gene alleles for H2-antigen typing</li>
<li><code>wzx.fasta</code>: wzx gene alleles for O-antigen typing</li>
<li><code>wzy.fasta</code>: wzy gene alleles for O-antigen typing</li>
</ul>
</li>
<li><code>cgmlst</code>: for core-genome multilocus sequence typing (cgMLST) and cgMLST-based serotyping
<ul>
<li><code>cgmlst-profiles.hdf</code>: HDF5 file with cgMLST allelic profiles of the 52,790 Salmonella genomes
<ul>
<li>read in with Pandas, e.g. <pre><code>import pandas as pd
pd.read_hdf(CGMLST_PROFILES_PATH, key='cgmlst')</code></pre> </li>
</ul>
</li>
<li><code>cgmlst-centroid.fasta</code>: "Centroid" (representative) alleles of the 52,790 Salmonella genomes for rapid NCBI BLAST+ blastn searching. Centroid alleles were selected from the full set of alleles for the 52,790 Salmonella genomes by applying the following steps to the alleles of each locus:
<ul>
<li>group alleles by length</li>
<li>group the length-grouped alleles by their ends (28 bp at the allele start and end; 28 is the word size of blastn megablast)</li>
<li>perform hierarchical clustering of the length- and end-grouped alleles</li>
<li>form flat clusters at 2.5% distance</li>
<li>within each cluster, pick the allele with the least distance to the other alleles in the cluster</li>
</ul>
</li>
<li><code>cgmlst-full.fasta</code>: the full set of alleles for the 52,790 Salmonella genomes</li>
</ul>
</li>
</ul>
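<p>The centroid selection steps listed for <code>cgmlst-centroid.fasta</code> can be sketched roughly as follows. This is a minimal illustration, not the code used to build the database: the function names are hypothetical, and both the distance measure (fraction of mismatched positions between equal-length alleles) and the complete-linkage clustering are assumptions, since the description states neither.</p>

```python
from collections import defaultdict
from itertools import combinations

END_LEN = 28       # bp compared at each allele end (blastn megablast word size)
MAX_DIST = 0.025   # flat-cluster cutoff: 2.5% distance


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length sequences.

    Assumption: the source does not say which distance measure was used.
    """
    return sum(x != y for x, y in zip(a, b)) / len(a)


def complete_linkage_clusters(seqs, max_dist):
    """Naive agglomerative clustering (assumed complete linkage).

    Repeatedly merges the two closest clusters, where cluster distance is the
    largest pairwise distance between their members, and stops once the
    closest pair is farther apart than max_dist.
    """
    clusters = [[s] for s in seqs]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = max(hamming_fraction(a, b)
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


def medoid(cluster):
    """Allele with the smallest summed distance to the others in its cluster."""
    return min(cluster,
               key=lambda s: sum(hamming_fraction(s, o) for o in cluster))


def centroid_alleles(alleles):
    """Pick representative alleles for one locus from its allele sequences."""
    # Steps 1 and 2: group by length, then by the first and last END_LEN bases.
    groups = defaultdict(list)
    for seq in alleles:
        groups[(len(seq), seq[:END_LEN], seq[-END_LEN:])].append(seq)
    # Steps 3 to 5: cluster each group, cut at MAX_DIST, keep each medoid.
    centroids = []
    for members in groups.values():
        for cluster in complete_linkage_clusters(members, MAX_DIST):
            centroids.append(medoid(cluster))
    return centroids
```

<p>Calling <code>centroid_alleles</code> once per locus with that locus's allele sequences returns one representative per flat cluster. The clustering loop here is deliberately naive; for loci with many alleles, a library routine such as SciPy's <code>scipy.cluster.hierarchy.linkage</code> with <code>fcluster</code> would be a more practical choice.</p>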
