20 research outputs found

    Matching curated genome databases: a non trivial task

    Background: Curated databases of completely sequenced genomes have been designed independently at the NCBI (RefSeq) and EBI (Genome Reviews) to cope with the non-standard annotation found in the versions of sequenced genomes published by the GenBank/EMBL/DDBJ databanks. These curation efforts were expected to review the annotations and to improve their pertinence when they are used to annotate newly released genome sequences by homology to previously annotated genomes. However, we observed that such an uncoordinated effort has two unwanted consequences. First, it is not trivial to map the protein identifiers of the same sequence in both databases. Second, the two reannotated versions of the same genome differ at the level of their structural annotation. Results: Here, we propose CorBank, a program devised to cross-reference protein identifiers whatever the level of identity between their matching sequences. Approximately 98% of the 1,983,258 amino acid sequences match, allowing instantaneous retrieval of their respective cross-references. CorBank also detects any differences between the independently curated versions of the same genome. We found that the RefSeq and Genome Reviews versions match perfectly for only 50 of the 641 complete genomes we analyzed. In all other cases, differences occur at the level of the coding sequence (CDS) and/or in the total number of CDS in the respective versions of the same genome. CorBank is freely accessible at http://www.corbank.u-psud.fr. The CorBank site also publishes updated, exhaustive results obtained by comparing the RefSeq and Genome Reviews versions of each genome. Accordingly, this web site allows easy search of cross-references between RefSeq, Genome Reviews, and UniProt, for either a single CDS or a whole replicon. Conclusion: CorBank is very efficient at rapidly detecting the numerous differences between the RefSeq and Genome Reviews versions of the same curated genome. Although such differences are acceptable as reflecting different views, we suggest that curators of both genome databases could help reduce further divergence by agreeing on a minimal dialogue and attempting to publish the point of view of the other database whenever it is technically possible.

    Megasatellite formation and evolution in vertebrate genes

    Since formation of the first proto-eukaryotes, gene repertoire and genome complexity have significantly increased. Among genetic elements responsible for this increase are tandem repeats. Here we describe a genome-wide analysis of large tandem repeats, called megasatellites, in 58 vertebrate genomes. Two bursts occurred, one after the radiation between Agnatha and Gnathostomata fishes and the second one in therian mammals. Megasatellites are enriched in subtelomeric regions and frequently encoded in genes involved in transcription regulation, intracellular trafficking, and cell membrane metabolism, reminiscent of what is observed in fungus genomes. The presence of many introns within young megasatellites suggests that an exon-intron DNA segment is first duplicated and amplified before accumulation of mutations in intronic parts partially erases the megasatellite in such a way that it becomes detectable only in exons. Our results suggest that megasatellite formation and evolution is a dynamic and still ongoing process in vertebrate genomes.

    Differential efficacies of Cas nucleases on microsatellites involved in human disorders and associated off-target mutations

    Microsatellite expansions are the cause of >20 neurological or developmental human disorders. Shortening expanded repeats using specific DNA endonucleases may be envisioned as a gene editing approach. Here, we measured the efficacy of several CRISPR-Cas nucleases to induce recombination within disease-related microsatellites, in Saccharomyces cerevisiae. Broad variations in nuclease performances were detected on all repeat tracts. Wild-type Streptococcus pyogenes Cas9 (SpCas9) was more efficient than Staphylococcus aureus Cas9 on all repeats tested, except (CAG)33. Cas12a (Cpf1) was the most efficient on GAA trinucleotide repeats, whereas GC-rich repeats were more efficiently cut by SpCas9. The main genetic factor underlying Cas efficacy was the propensity of the recognition part of the sgRNA to form a stable secondary structure, independently of its structural part. This suggests that such structures form in vivo and interfere with sgRNA metabolism. The yeast genome contains 221 natural CAG/CTG and GAA/CTT trinucleotide repeats. Deep sequencing after nuclease induction identified three of them as carrying statistically significant low frequency mutations, corresponding to SpCas9 off-target double-strand breaks.

    Functional variability in adhesion and flocculation of yeast megasatellite genes

    Megasatellites are large tandem repeats found in all fungal genomes but especially abundant in the opportunistic pathogen Candida glabrata. They are encoded in genes involved in cell-cell interactions, either between yeasts or between yeast and human cells. In the present work, we used an iterative genetic system to delete several Candida glabrata megasatellite-containing genes and found that 2 of them were positively involved in adhesion to epithelial cells, whereas 3 genes negatively controlled adhesion. Two of the latter, CAGL0B05061g and CAGL0A04851g, were also negative regulators of yeast-to-yeast adhesion, making them central players in controlling Candida glabrata adherence properties. Using a series of synthetic Saccharomyces cerevisiae strains in which the FLO1 megasatellite was replaced by other tandem repeats of similar length but different sequences, we showed that the capacity of a strain to flocculate in liquid culture was unrelated to its capacity to adhere to epithelial cells or to invade agar. Finally, to understand how megasatellites were initially created and subsequently expanded, an experimental evolution system was set up, in which modified yeast strains containing different megasatellite seeds were grown in bioreactors for more than 200 generations and selected for their ability to sediment at the bottom of the culture tube. Several flocculation-positive mutants were isolated. Functionally relevant mutations included general transcription factors as well as a 230-kbp segmental duplication.

    Diverse single-stranded DNA viruses from viral metagenomics on a cynopterus bat in China

    Bats serve as reservoirs for many emerging viruses. Cressdnaviruses can infect a wide range of animals, including agricultural species, such as pigs, in which porcine circoviruses cause severe gastroenteritis. New cressdnaviruses have also attracted considerable attention recently, due to their involvement with infectious diseases. However, little is known about their host range and many cressdnaviruses remain poorly characterized. We identified and characterized 11 contigs consisting of previously unknown cressdnaviruses from a rectal swab sample of a Cynopterus bat collected in Yunnan Province, China, in 2011. Full genomes of two cressdnaviruses (OQ267680, 2069 nt; OQ351951, 2382 nt), and a nearly complete genome for a third (OQ267683, 2361 nt) were obtained. Phylogenetic analyses and the characteristics of these viral genomes suggest a high degree of ssDNA virus diversity. These results shed light on cressdnavirus diversity and the probable role of Cynopterus bats as their hosts.

    Resection and repair of a Cas9 double-strand break at CTG trinucleotide repeats induces local and extensive chromosomal deletions

    Microsatellites are short tandem repeats that are ubiquitous in all eukaryotes and represent ~2% of the human genome. Among them, trinucleotide repeats are responsible for more than two dozen neurological and developmental disorders. Targeting microsatellites with dedicated DNA endonucleases could become a viable option for patients affected with dramatic neurodegenerative disorders. Here, we used the Streptococcus pyogenes Cas9 to induce a double-strand break within the expanded CTG repeat involved in myotonic dystrophy type 1, integrated in a yeast chromosome. Repair of this double-strand break generated unexpected large chromosomal deletions around the repeat tract. These deletions depended on RAD50, RAD52, DNL4 and SAE2, and both non-homologous end-joining and single-strand annealing pathways were involved. Resection and repair of the double-strand break (DSB) were totally abolished in a rad50Δ strain, whereas they were impaired in a sae2Δ mutant, only on the DSB end containing most of the repeat tract. This observation demonstrates that Sae2 plays significantly different roles in resecting a DSB end containing a repeated and structured sequence as compared to a non-repeated DSB end. In addition, we also discovered that gene conversion was less efficient when the DSB could be repaired using a homologous template, suggesting that the trinucleotide repeat may interfere with gene conversion too. Altogether, these data show that SpCas9 may not be the best choice when inducing a double-strand break at or near a microsatellite, especially in mammalian genomes that contain many more dispersed repeated elements than the yeast genome.

    Improved assembly procedure of viral RNA genomes amplified with Phi29 polymerase from new generation sequencing data

    BACKGROUND: New sequencing technologies have opened the way to the discovery and characterization of pathogenic viruses in clinical samples. However, these methods can require amplification of viral RNA prior to sequencing. Among the available methods, the procedure based on Phi29 polymerase produces a huge amount of amplified DNA. Its major disadvantage is that it generates a large number of chimeric sequences, which can affect the assembly step. The pre-processing method proposed in this study strongly limits the negative impact of chimeric reads in order to obtain full-length viral genomes. FINDINGS: Three different assembly programs (ABySS, Ray and SPAdes) were tested for their ability to correctly assemble full-length viral genomes. Although our pre-processing method improved genome assembly in all cases, only its combination with SPAdes allowed us to obtain the full length of the viral genomes tested in one contig. CONCLUSIONS: The proposed pipeline is able to overcome the drawbacks due to the generation of chimeric reads during the amplification of viral RNA, which considerably improves the assembly of full-length viral genomes.