This data set contains sequences, sequence alignments and phylogenetic trees used in the bioinformatic analyses presented in:<br><br>Shakya M, Soucy SM, and Zhaxybayeva O. "Insights into Origin and Evolution of α-proteobacterial Gene Transfer Agents", submitted. <br><br><div><b>File Contents:</b><br></div><div><br></div><div><b>Supplementary_Figures_final.pdf: </b>Supplementary Figures S1-S9 referred to in the manuscript.</div><div><br></div><div><b>SupplementaryTables.pdf</b> and <b>SupplementaryTables.xlsx</b>: Supplementary Tables S1-S5 referred to in the manuscript.<br></div><br><b>GTA_Rhodobacterales_queries.zip: </b>FASTA-formatted files of RcGTA homologs from <i>Rhodobacterales</i> that were used in BLAST
searches of the <i>RefSeq</i> database and 255 α-proteobacterial
genomes. <br><br><b>RefSeq_bacterial_hits.zip:</b> FASTA-formatted files of detected bacterial homologs
of RcGTA genes in the RefSeq database, release 76.
The filenames correspond to gene names listed in Supplementary Table S4.<br><br>
<p><b>RefSeq_viral_hits.zip:</b> FASTA-formatted files of detected
viral homologs of RcGTA genes in the RefSeq database, release 76.
The filenames correspond to gene names listed in Supplementary Table S4. <br></p><p><br></p><p>
<b>StructuralClusterHomologs.xlsx: </b>An Excel spreadsheet with
information about RcGTA homologs found in small clusters (SC) and large clusters (LC)
across α-proteobacterial genomes. The table lists the GI and
accession numbers of each homolog, as well as the accession number and taxonomic
information of the source genome. <br></p><p>
</p><p><br></p><p><b>SC_and_LC_homologs_per_genome.zip: </b>FASTA-formatted
files of RcGTA structural cluster homologs identified during the screen of 255
fully sequenced α-proteobacterial genomes. Each file represents an individual cluster
found within a genome, and the file name contains the source genome name, genome
accession number and type of cluster (LC or SC). Within each file, the definition line
of each FASTA record is augmented with the type of cluster (SC or LC) and the RcGTA
gene name of the homolog (see the first column of Supplementary Table S4 for
notations).</p><p>
<br></p><p>
<b>individual_proteins_fa.zip</b>: FASTA-formatted sets of individual RcGTA
structural cluster genes and their large cluster (LC) homologs used to create
the LC-locus alignment. The filenames correspond to gene names listed in
Supplementary Table S4. <br></p><p><br></p><p><b>individual_proteins_aln.zip</b>: FASTA-formatted alignments of individual RcGTA
structural cluster genes and their large cluster (LC) homologs used to create
the LC-locus alignment. The filenames correspond to gene names listed in
Supplementary Table S4. <br></p><p><br></p><p><b>individual_trees.zip</b>: NEWICK-formatted
phylogenetic trees reconstructed from the alignments in the individual_proteins_aln.zip
file. These trees were used in the analyses shown in Supplementary Table S3. <br></p><p><br></p><p><b>LC_locus.zip</b>: FASTA-formatted LC-locus alignment and NEWICK-formatted
phylogenetic tree of the LC-locus (the right panel of Figure 6). </p><p><br></p><p><b>PPD.zip: </b>
Tab-delimited text files with pairwise phylogenetic distances (PPDs) of RcGTA
homologs found in large clusters (LC), small clusters (SC), and viruses, and
FASTA-formatted alignments of the RcGTA homologs used to calculate the PPDs.
The data are shown in Supplementary Figure S4. </p><p><br></p>
<p><b>flanking_genes.zip</b>: FASTA-formatted alignments and
NEWICK-formatted phylogenetic trees of three genes that were found flanking
large clusters detected in non-α-proteobacterial genomes. The trees are
shown in Supplementary Figure S8.</p><p><br></p>
<p><b>reference_tree.zip: </b>PHYLIP-formatted concatenation of 99 alignments
of genes conserved across α-proteobacteria (see Supplementary Table S2), and
NEWICK-formatted phylogenetic trees reconstructed using this alignment (see Figure 6 and
Supplementary Figure S3).</p><p><br></p><br><p>
</p>
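<p>For readers who want to inspect the FASTA-formatted archives listed above without unpacking them, the following is a minimal Python sketch; it is not part of the deposited data set. The function name <code>read_fasta_from_zip</code> is hypothetical, and the sketch assumes that every archive member is a plain-text FASTA file. The internal layout of the archives is not described here, so members in other formats (for example NEWICK trees or tab-delimited tables) would need to be filtered out by the caller.</p>

```python
import io
import zipfile


def read_fasta_from_zip(zip_source):
    """Yield (member_name, definition_line, sequence) for each FASTA record.

    zip_source may be a file path or a binary file-like object.
    Assumption: every non-directory member of the archive is FASTA text.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            with archive.open(member) as handle:
                text = io.TextIOWrapper(handle, encoding="utf-8", errors="replace")
                header, chunks = None, []
                for line in text:
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines
                    if line.startswith(">"):
                        # A new definition line closes the previous record.
                        if header is not None:
                            yield member, header, "".join(chunks)
                        header, chunks = line[1:], []
                    elif header is not None:
                        # Sequence (or alignment) line; gap characters are kept.
                        chunks.append(line)
                if header is not None:
                    yield member, header, "".join(chunks)
```

<p>Under these assumptions, iterating over <code>read_fasta_from_zip("RefSeq_viral_hits.zip")</code> would give one tuple per sequence, with the definition line returned unparsed because the exact layout of the added cluster and gene annotations is not specified in this description.</p>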
<p><b>
</b></p