    Rapid protein evolution, organellar reductions, and invasive intronic elements in the marine aerobic parasite dinoflagellate Amoebophrya spp

    BACKGROUND: Dinoflagellates are aquatic protists particularly widespread in the oceans worldwide. Some are responsible for toxic blooms while others live in symbiotic relationships, either as mutualistic symbionts in corals or as parasites infecting other protists and animals. Dinoflagellates harbor atypically large genomes (~3 to 250 Gb), with gene organization and gene expression patterns very different from those of closely related apicomplexan parasites. Here we sequenced and analyzed the genomes of two early-diverging and co-occurring parasitic dinoflagellate Amoebophrya strains, to shed light on the emergence of such atypical genomic features, dinoflagellate evolution, and host specialization. RESULTS: We sequenced, assembled, and annotated high-quality genomes for two Amoebophrya strains (A25 and A120), using a combination of Illumina paired-end short-read and Oxford Nanopore Technology (ONT) MinION long-read sequencing approaches. We found that a small number of transposable elements, along with short introns and intergenic regions and a limited number of gene families, together contribute to the compactness of the Amoebophrya genomes, a feature potentially linked with parasitism. While the majority of Amoebophrya proteins (63.7% of A25 and 59.3% of A120) had no functional assignment, we found many orthologs shared with Dinophyceae. Our analyses revealed a strong tendency for genes to be encoded in unidirectional clusters and high levels of synteny conservation between the two genomes despite low interspecific protein sequence similarity, suggesting rapid protein evolution. Most strikingly, we identified a large portion of non-canonical introns, including repeated introns, displaying a broad variability of associated splicing motifs never observed among eukaryotes. Those introner elements appear to have the capacity to spread over their respective genomes in a manner similar to transposable elements.
Finally, we confirmed the reduction of organelles observed in Amoebophrya spp., i.e., loss of the plastid and potential loss of a mitochondrial genome and functions. CONCLUSION: These results expand the range of atypical genome features found in basal dinoflagellates and raise questions regarding speciation and the evolutionary mechanisms at play while parasitism was selected for in this particular unicellular lineage.

    Filtered alignments were concatenated using seqCat.pl and a phylogenetic tree was produced under a Maximum Likelihood framework using RAxML v8.2.9 with the PROTGAMMALGF model of sequence evolution and 101 bootstraps. Asterisks represent support values of 95 and above. A detailed method can be found in Kayal et al. 2018 BMC Evol. Biol. (https://doi.org/10.1186/s12862-018-1142-0). The full tree can be found at http://mmo.sb-roscoff.fr/jbrowseAmoebophrya/.

    FIGURE S2. SSU rDNA sequence identity (in percentage, relative to A25 and A120 compared to other species).
    FIGURE S3. Distribution of k-mers in A25 and A120 genomes.
    FIGURE S4. Classification of repeated elements in 3 Amoebophrya genomes (AT5, A25, and A120) using REPET. The x-axis represents the cumulated number of bases of repeated elements in the genome.
    FIGURE S5. Conserved motif of the putative spliced leader (SL) in A25 and A120.
    FIGURE S6. Alignments of the gene encoding the putative spliced leader (SL) in A25 and A120.
    FIGURE S7. Gene orientation change rate in 3 Amoebophrya genomes.
    FIGURE S8. Number of orthologous genes shared by selected taxa.
    FIGURE S9. Boxplot of the dN/dS ratios of orthologous genes between A25 and A120, calculated using the model average method (MA).
    FIGURE S10. Synteny dot-plot obtained by comparison between Amoebophrya A25 and AT5 genomes.
    FIGURE S11. Synteny dot-plot obtained by comparison between Amoebophrya A120 and AT5 genomes.
    FIGURE S12. Intron length distribution.
    FIGURE S13. GC content distribution.
    FIGURE S14. Multiple alignments of U2 snRNAs.
    FIGURE S15. Multiple alignments of U4 snRNAs.
    FIGURE S16. Multiple alignments of U5 snRNAs.
    FIGURE S17. Multiple alignments of U6 snRNAs.
    FIGURE S18. Secondary structure of Amoebophrya snRNA.
    FIGURE S19. Example of introner elements (IEs) in Amoebophrya.
    FIGURE S20. Distribution of the direct repeats with size ranging between 3 and 8 nucleotides in A25.
    FIGURE S21. Distribution of the direct repeats with size ranging between 3 and 8 nucleotides in A120.
    FIGURE S22. Composition of direct repeats in introner elements. The diversity in composition of the three (a, b, c) most abundant direct repeats in introner elements in A25 (top) and A120 (bottom).
    FIGURE S23. Terminal inverted repeat locations around the splicing sites in A25 and A120. The position of inverted repeats is shown relative to the location of the splice sites. Left, the inverted repeats of A120 are located 1–5 nucleotides upstream and downstream of the splice sites. Right, the inverted repeats of A25 are located 1–6 nucleotides upstream and downstream of the splice sites.
    FIGURE S24. The flowchart for the in silico search of introner elements.
    FIGURE S25. Hierarchical clustering analysis (pairwise similarity and OrthoMCL) of all intron families and of the inverted repeats in A25 and A120.
    FIGURE S26. Percentage of genes with assigned functions in relation to intron composition.
    FIGURE S27. Difference in the proportion of IE-containing genes compared to their KEGG assignment in A25 and A120.
    FIGURE S28. Distribution of conserved introns.

    TABLE S1. RCC number, date and site of isolation of strains considered in this study.
    TABLE S2. Metrics of Nanopore runs for the two Amoebophrya strains.
    TABLE S3. Search for pathways involved in plastidial functions that are entirely independent of plastid-encoded gene content.
    TABLE S4. Number of the different types of introns identified in A25 and A120 genomes.
    TABLE S5. Search for RNA editing in A25 and A120 introns.
    TABLE S6. Putative Amoebophrya A25 and A120 snRNP homologs.
    TABLE S7. Classification into families of non-canonical introns in A25 and A120.
    TABLE S8. RNAseq read assembly statistics of Amoebophrya A25 and A120 for samples from the different times of infection and from the free-living stage (dinospore only).
    TABLE S9.

    This research was funded by the ANR (Agence Nationale de la Recherche) Grant ANR-14-CE02-0007 HAPAR, the CEA and the Région Bretagne (RC doctoral grant ARED PARASITE 9450 and EK postdoctoral grant SAD HAPAR 9229), and the CNRS (X-life SEAgOInG).

    EuGene: an eucaryotic gene finder that combines several sources of evidence

    In this paper, we describe the basis of EuGène, a gene finder for eucaryotic organisms applied to Arabidopsis thaliana. The specificity of EuGène, compared to existing gene finding software, is that EuGène combines the output of several information sources, including output of other software, in a weighted directed acyclic graph (DAG) designed in such a way that a shortest path in this graph represents the most likely gene structure of the underlying DNA sequence. The usual simple linear-time Bellman shortest-path algorithm for DAGs has been replaced by a constrained shortest-path algorithm. The constraints express the minimum length of introns or intergenic regions. The specificity of the constraints leads to an algorithm which is still linear both in time and space. EuGène effectiveness has been assessed on Araset, a recent dataset of Arabidopsis thaliana sequences used to evaluate several existing gene finding software. It appears that, despite its simplicity, EuGène gives results..
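
    The unconstrained starting point mentioned above, a linear-time shortest path in a weighted DAG, can be sketched as follows. This is a minimal illustration only: the toy graph, the node numbering, and the function name `dag_shortest_path` are assumptions made for the example, not the gene finder's actual graph or code, and the minimum-length constraints on introns and intergenic regions are not modeled.

```python
from math import inf


def dag_shortest_path(num_nodes, edges, source, target):
    """Shortest path in a weighted DAG by one pass in topological order.

    Assumption for this sketch: nodes are numbered 0..num_nodes-1 so that
    every edge (u, v, weight) satisfies u < v, i.e. the numbering is already
    a topological order (as it would be for positions along a sequence).
    """
    outgoing = [[] for _ in range(num_nodes)]
    for u, v, weight in edges:
        if not u < v:
            raise ValueError("edges must go from a lower to a higher node index")
        outgoing[u].append((v, weight))

    dist = [inf] * num_nodes   # best known cost from source to each node
    pred = [None] * num_nodes  # predecessor on the best known path
    dist[source] = 0.0

    # Each node and each edge is visited once: time is O(nodes + edges).
    for u in range(num_nodes):
        if dist[u] == inf:
            continue  # node not reachable from the source
        for v, weight in outgoing[u]:
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
                pred[v] = u

    if dist[target] == inf:
        return inf, []

    # Walk the predecessor links back from the target to recover the path.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = pred[node]
    path.reverse()
    return dist[target], path


# Hypothetical toy graph: weights play the role of negative log-likelihoods.
TOY_EDGES = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.5), (1, 3, 5.0), (2, 3, 1.0)]
cost, path = dag_shortest_path(4, TOY_EDGES, source=0, target=3)
```

    With costs read as negative log-likelihoods, the cheapest path corresponds to the most likely labeling; enforcing minimum region lengths would require extra bookkeeping on top of this relaxation step.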

    Cleaning the GenBank Arabidopsis thaliana data set

    Data-driven computational biology relies on the large quantities of genomic data stored in international sequence data banks. However, the possibilities are drastically impaired if the stored data is unreliable. During a project aiming to predict splice sites in the dicot Arabidopsis thaliana, we extracted a data set from the A. thaliana entries in GenBank. A number of simple 'sanity' checks, based on the nature of the data, revealed an alarmingly high error rate. More than 15% of the most important entries extracted did contain erroneous information. In addition, a number of entries had directly conflicting assignments of exons and introns, not stemming from alternative splicing. In a few cases the errors are due to mere typographical misprints, which may be corrected by comparison to the original papers, but errors caused by wrong assignments of splice sites from experimental data are the most common. It is proposed that the level of error correction should be increased and that gene structure sanity checks should be incorporated, also at the submitter level, to avoid or reduce the problem in the future. A non-redundant and error-corrected subset of the data for A. thaliana is made available through anonymous FTP.
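
    The abstract does not list the individual checks, so the following is only a sketch of what gene-structure sanity checks of this kind could look like. The specific rules (canonical GT...AG intron boundaries, an ATG start, a terminal stop codon, a coding length divisible by three, no in-frame internal stop) and the function name `check_gene_structure` are assumptions chosen for illustration, not the authors' procedure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_gene_structure(sequence, exons):
    """Return a list of problems found in an annotated gene structure.

    sequence: DNA string for the forward strand.
    exons: ordered list of (start, end) pairs, 0-based and half-open,
           covering the coding region only (hypothetical input format).
    """
    seq = sequence.upper()
    problems = []

    # Coordinates must fall inside the sequence before anything else is checked.
    for start, end in exons:
        if not 0 <= start < end <= len(seq):
            problems.append(f"exon {start}-{end} lies outside the sequence")
    if problems:
        return problems

    # Each intron is the gap between two consecutive exons.
    for (s1, e1), (s2, e2) in zip(exons, exons[1:]):
        if s2 <= e1:
            problems.append(f"exons {s1}-{e1} and {s2}-{e2} overlap or touch")
            continue
        intron = seq[e1:s2]
        if not intron.startswith("GT"):
            problems.append(f"intron {e1}-{s2} does not start with GT")
        if not intron.endswith("AG"):
            problems.append(f"intron {e1}-{s2} does not end with AG")

    # The spliced coding sequence must look like an open reading frame.
    cds = "".join(seq[start:end] for start, end in exons)
    if len(cds) % 3 != 0:
        problems.append("coding length is not a multiple of three")
    if not cds.startswith("ATG"):
        problems.append("coding sequence does not start with ATG")
    if cds[-3:] not in STOP_CODONS:
        problems.append("coding sequence does not end with a stop codon")
    for i in range(0, len(cds) - 3, 3):
        if cds[i:i + 3] in STOP_CODONS:
            problems.append(f"in-frame stop codon at coding position {i}")

    return problems


# Hypothetical toy entry: two exons separated by a 12-base intron.
TOY_SEQUENCE = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
good = check_gene_structure(TOY_SEQUENCE, [(0, 6), (18, 24)])
bad = check_gene_structure(TOY_SEQUENCE, [(0, 6), (17, 24)])  # acceptor off by one
```

    Shifting the second exon by a single base, as in the `bad` entry, breaks both the acceptor dinucleotide and the reading frame, which is why such cheap checks can expose wrongly assigned splice sites.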

    INCLUSive: INtegrated Clustering, Upstream sequence retrieval and motif Sampling

    Summary: INCLUSive allows automatic multistep analysis of microarray data (clustering and motif finding). The input consists of a data matrix containing the identification tags and the expression levels of the genes in the different profiling experiments. The clustering algorithm (adaptive quality-based clustering) groups together genes with highly similar expression profiles. The upstream sequences of the genes belonging to a cluster are automatically retrieved from GenBank and can be fed directly into Motif Sampler, a Gibbs sampling algorithm that retrieves statistically over-represented motifs in sets of sequences, in this case upstream regions of co-expressed genes.
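
    To illustrate the general idea behind Gibbs sampling for motif finding, here is a minimal site sampler that assumes exactly one motif occurrence per sequence and a uniform background. It is a textbook-style sketch, not the Motif Sampler implementation; the function name `gibbs_motif_search`, its parameters, and the toy sequences are hypothetical.

```python
import random


def gibbs_motif_search(sequences, width, iterations=200, seed=0, pseudocount=1.0):
    """Sample one motif start position per sequence with a basic Gibbs sampler.

    Assumptions for this sketch: sequences contain only A, C, G, T, each is
    at least `width` long, and each holds exactly one motif occurrence.
    """
    alphabet = "ACGT"
    sequences = [s.upper() for s in sequences]
    if any(len(s) < width for s in sequences):
        raise ValueError("every sequence must be at least `width` long")

    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]

    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            # Build position-specific counts from the sites of all OTHER sequences.
            counts = [dict.fromkeys(alphabet, pseudocount) for _ in range(width)]
            for j, other in enumerate(sequences):
                if j == i:
                    continue
                site = other[starts[j]:starts[j] + width]
                for k, base in enumerate(site):
                    counts[k][base] += 1
            total = (len(sequences) - 1) + len(alphabet) * pseudocount

            # Weight every window of the held-out sequence by its probability
            # under the current profile, then sample a new start position.
            weights = []
            for pos in range(len(seq) - width + 1):
                weight = 1.0
                for k in range(width):
                    weight *= counts[k][seq[pos + k]] / total
                weights.append(weight)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]

    return starts


# Hypothetical toy input standing in for upstream regions of co-expressed genes.
TOY_UPSTREAM = [
    "ACGTACGTTTGACAAC",
    "GGTTGACAGGCATCGA",
    "CCATTTGACATTAGCA",
    "TGCATGCATTGACATG",
]
motif_starts = gibbs_motif_search(TOY_UPSTREAM, width=6)
```

    Because the sampler is stochastic, different seeds can settle on different sites; practical tools typically repeat the run several times and score candidate motifs against a background model.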

    Impact of coherent beam combining by a frequency-tagging technique on a telecom signal for space communications

    In this contribution, we present a preliminary study of coherent beam combining of beams carrying high-information-rate data for very high power space communications. We focus on the impact of the amplitude of the frequency tagging used to phase-lock the beams on the quality of NRZ and DPSK telecom signals.

    Higher intron loss rate in Arabidopsis thaliana than A. lyrata is consistent with stronger selection for a smaller genome

    The number of introns varies considerably among different organisms. This can be explained by the differences in the rates of intron gain and loss. Two factors that are likely to influence these rates are selection for or against introns and the mutation rate that generates the novel intron or the intronless copy. Although it has been speculated that stronger selection for a compact genome might result in a higher rate of intron loss and a lower rate of intron gain, clear evidence is lacking, and the role of selection in determining these rates has not been established. Here, we studied the gain and loss of introns in the two closely related species Arabidopsis thaliana and A. lyrata, as it was recently shown that A. thaliana has been undergoing a faster genome reduction driven by selection. We found that A. thaliana has lost six times more introns than A. lyrata since the divergence of the two species but gained very few introns. We suggest that stronger selection for genome reduction probably resulted in the much higher intron loss rate in A. thaliana, although further analysis is required as we could not find evidence that the loss rate increased in A. thaliana as opposed to having decreased in A. lyrata compared with the rate in the common ancestor. We also examined the pattern of the intron gains and losses to better understand the mechanisms by which they occur. Microsimilarity was detected between the splice sites of several gained and lost introns, suggesting that nonhomologous end joining repair of double-strand breaks might be a common pathway not only for intron gain but also for intron loss.

    Unique regulation of the Calvin cycle in the ultrasmall green alga Ostreococcus

    Glyceraldehyde-3-phosphate dehydrogenase (GapAB) and CP12 are two major players in controlling the inactivation of the Calvin cycle in land plants at night. GapB originated from a GapA gene duplication and differs from GapA by the presence of a specific C-terminal extension that was recruited from CP12. While GapA and CP12 are assumed to be generally present in the Plantae (glaucophytes, red and green algae, and plants), up to now GapB was exclusively found in Streptophyta, including the enigmatic green alga Mesostigma viride. However, here we show that two closely related prasinophycean green algae, Ostreococcus tauri and Ostreococcus lucimarinus, also possess a GapB gene, while CP12 is missing. This remarkable finding either antedates the GapA/B gene duplication or indicates a lateral recruitment. Moreover, Ostreococcus is the first case where the crucial CP12 function may be completely replaced by GapB-mediated GapA/B aggregation.