<p><strong><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0147101">Salmonella In Silico Typing Resource (SISTR)</a> <a href="https://github.com/peterk87/sistr_cmd">sistr_cmd</a> version <a href="https://github.com/peterk87/sistr_cmd/releases/tag/v1.0.2">1.0.2</a> serotyping databases</strong></p>
<p>File structure tree for <code>sistr_cmd</code> <code>data</code> folder:</p>
<pre><code>.
|-- [4.0K] antigens
| |-- [1.0M] fliC.fasta
| |-- [210K] fljB.fasta
| |-- [126K] wzx.fasta
| `-- [ 60K] wzy.fasta
|-- [4.0K] cgmlst
| |-- [7.4M] cgmlst-centroid.fasta
| |-- [ 96M] cgmlst-full.fasta
| |-- [134M] cgmlst-profiles.hdf
| `-- [ 803] README.md
|-- [1.1M] genomes-to-serovar.txt
|-- [1.0M] genomes-to-subspecies.txt
|-- [118K] Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv
`-- [ 92M] sistr.msh
2 directories, 12 files</code></pre>
<p><strong>Description of files:</strong></p>
<ul>
<li><code>genomes-to-serovar.txt</code>: Tab-delimited mapping of each genome ID to its serovar designation for the 52,790 Salmonella genomes.</li>
<li><code>genomes-to-subspecies.txt</code>: Tab-delimited mapping of each genome ID to its subspecies designation for the 52,790 Salmonella genomes.</li>
<li><code>Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv</code>: Serovar and antigenic formula table used by <code>sistr_cmd</code> to look up serovar designations from antigen results.</li>
<li><code>sistr.msh</code>: <a href="https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0997-x">Mash</a> sketch file of 11,840 Salmonella genomes for Mash-based serotyping.</li>
<li><code>antigens</code>: FASTA files of antigen gene alleles for antigen gene search-based serotyping
<ul>
<li><code>fliC.fasta</code>: fliC gene alleles for H1-antigen typing</li>
<li><code>fljB.fasta</code>: fljB gene alleles for H2-antigen typing</li>
<li><code>wzx.fasta</code>: wzx gene alleles for O-antigen typing</li>
<li><code>wzy.fasta</code>: wzy gene alleles for O-antigen typing</li>
</ul>
</li>
<li><code>cgmlst</code>: files for core-genome multilocus sequence typing (cgMLST) and cgMLST-based serotyping
<ul>
<li><code>cgmlst-profiles.hdf</code>: HDF5 file with cgMLST allelic profiles of 52,790 Salmonella genomes
<ul>
<li>read in with pandas (HDF5 support requires PyTables), e.g.
<pre><code>import pandas as pd
pd.read_hdf(CGMLST_PROFILES_PATH, key='cgmlst')</code></pre>
</li>
</ul>
</li>
<li><code>cgmlst-centroid.fasta</code>: "Centroid" (representative) alleles of the 52,790 Salmonella genomes for rapid NCBI BLAST+ blastn searching. Centroid alleles were selected from the full set of alleles for the 52,790 Salmonella genomes, separately for each locus, as follows:
<ul>
<li>group alleles by length</li>
<li>within each length group, group alleles by their ends (the first and last 28 bp of the allele; 28 is the word size of blastn megablast)</li>
<li>hierarchically cluster the alleles within each length+end group</li>
<li>form flat clusters at 2.5% distance</li>
<li>within each cluster, pick the allele with the least distance to the others in the cluster</li>
</ul>
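<p>The steps above can be sketched in Python for a single locus. This is only an illustration under stated assumptions, not the code used to build <code>cgmlst-centroid.fasta</code>: the description does not name a distance metric or linkage method, so the sketch assumes Hamming distance (valid because alleles in a group share a length) and single-linkage clustering, whose flat clusters at a cutoff are the connected components of the "distance at or below cutoff" graph. The function names are hypothetical.</p>

```python
from collections import defaultdict
from itertools import combinations

END_LEN = 28      # blastn megablast word size, as stated in the description
MAX_DIST = 0.025  # flat-cluster cutoff (2.5% distance)


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pick_centroids(alleles):
    """Return representative alleles for one locus (assumed method, see lead-in)."""
    # Steps 1-2: group by length, then by the first and last END_LEN bases.
    groups = defaultdict(list)
    for seq in alleles:
        groups[(len(seq), seq[:END_LEN], seq[-END_LEN:])].append(seq)

    centroids = []
    for seqs in groups.values():
        # Steps 3-4: single-linkage clusters cut at MAX_DIST, computed as
        # connected components with a small union-find (an assumption).
        parent = list(range(len(seqs)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in combinations(range(len(seqs)), 2):
            if hamming_fraction(seqs[i], seqs[j]) <= MAX_DIST:
                parent[find(i)] = find(j)

        clusters = defaultdict(list)
        for i, seq in enumerate(seqs):
            clusters[find(i)].append(seq)

        # Step 5: keep the member with the smallest total distance to the rest.
        for members in clusters.values():
            centroids.append(
                min(members, key=lambda s: sum(hamming_fraction(s, t) for t in members))
            )
    return centroids
```

<p>The pairwise comparison is quadratic in the size of each group, which is why grouping by length and ends first keeps the clustering step small.</p>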
</li>
<li><code>cgmlst-full.fasta</code>: full set of alleles for the 52,790 Salmonella genomes</li>
</ul>
</li>
</ul>
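<p>The two tab-delimited mapping files (<code>genomes-to-serovar.txt</code> and <code>genomes-to-subspecies.txt</code>) can be loaded into a dictionary with the standard library. This is a minimal sketch that assumes two columns (genome ID, then designation) and no header row; neither detail is stated above, so check the file before relying on it. The function name <code>load_genome_mapping</code> is hypothetical.</p>

```python
import csv


def load_genome_mapping(lines):
    """Map genome ID -> designation from tab-delimited lines.

    `lines` may be an open file handle or any iterable of strings.
    """
    mapping = {}
    for row in csv.reader(lines, delimiter="\t"):
        # Assumption: column 1 is the genome ID, column 2 the designation,
        # and there is no header row. Blank or short lines are skipped.
        if len(row) >= 2:
            mapping[row[0]] = row[1]
    return mapping


# Example usage (path relative to the data folder):
# with open("genomes-to-serovar.txt", newline="") as handle:
#     genome_to_serovar = load_genome_mapping(handle)
```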