2,641 research outputs found

    An approach to improved microbial eukaryotic genome annotation

    New sequencing technologies have considerably accelerated the rate at which genomic data is being generated. One ongoing challenge is the accurate structural annotation of those novel genomes once sequenced and assembled, in particular if the organism does not have close relatives with well-annotated genomes. Whole-transcriptome sequencing (RNA-Seq) and assembly, which resemble whole-genome sequencing and assembly, respectively, have been shown to dramatically increase the accuracy of gene annotation. Read coverage, inferred splice junctions and assembled transcripts can provide valuable information about gene structure. Several annotation pipelines have been developed to automate structural annotation by incorporating information from RNA-Seq, as well as protein sequence similarity data, with the goal of reaching the accuracy of an expert curator. Annotation pipelines follow a similar workflow. The first step is to identify repetitive regions to prevent misinformed sequence alignments and gene predictions. The next step is to construct a database of evidence from experimental data such as RNA-Seq mapping and assembly, and protein sequence alignments, which are used to inform the generalised Hidden Markov Models of gene prediction software. The final step is to consolidate sequence alignments and gene predictions into a high-confidence consensus set.
Automated pipelines are thus complex, and therefore susceptible to incomplete and erroneous use of information, which can poison gene predictions and consensus model building. Here, we present an improved approach to microbial eukaryotic genome annotation. Its conception was based on identifying and mitigating potential sources of error and bias that are present in available pipelines. Our approach has two main aspects. The first is to create a more complete and diverse set of extrinsic evidence to better inform gene predictions. The second is to use extrinsic evidence in tandem with predictions such that the influence of their respective biases in the consensus gene models is reduced. We benchmarked our new tool against three known pipelines, showing significant gains in gene, transcript, exon and intron sensitivity and specificity in the genome annotation of microbial eukaryotes.
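The sensitivity and specificity figures mentioned in this abstract can be illustrated with a minimal sketch. It assumes the convention often used in gene-prediction benchmarking, where sensitivity is the fraction of reference features that were predicted exactly and specificity is the fraction of predicted features that are correct. The function name `exon_accuracy` and the toy coordinates are hypothetical and are not taken from the thesis.

```python
def exon_accuracy(predicted, reference):
    """Return (sensitivity, specificity) for exact-match exon comparison.

    Each exon is a (sequence_id, start, end, strand) tuple. An exon counts
    as correct only if all four fields match a reference exon exactly.
    """
    pred = set(predicted)
    ref = set(reference)
    true_pos = len(pred & ref)
    # Sensitivity: share of reference exons that were recovered.
    sensitivity = true_pos / len(ref) if ref else 0.0
    # Specificity (gene-prediction convention): share of predictions that are correct.
    specificity = true_pos / len(pred) if pred else 0.0
    return sensitivity, specificity


# Toy data (hypothetical): four reference exons, three predictions, two exact matches.
reference_exons = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 650, "+"),
    ("chr1", 800, 900, "+"),
]
predicted_exons = [
    ("chr1", 100, 200, "+"),
    ("chr1", 300, 400, "+"),
    ("chr1", 500, 640, "+"),  # end coordinate is off by 10, so not an exact match
]

sn, sp = exon_accuracy(predicted_exons, reference_exons)
print(f"exon sensitivity={sn:.2f} specificity={sp:.2f}")
```

The same comparison can be applied to introns, transcripts or whole genes by changing what the tuples represent.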

    rKOMICS: An R package for processing mitochondrial minicircle assemblies in population-scale genome projects

    Background: The advent of population-scale genome projects has revolutionized our biological understanding of parasitic protozoa. However, while hundreds to thousands of nuclear genomes of parasitic protozoa have been generated and analyzed, information about the diversity, structure and evolution of their mitochondrial genomes remains fragmentary, mainly because of their extraordinary complexity. Indeed, unicellular flagellates of the order Kinetoplastida contain structurally the most complex mitochondrial genome of all eukaryotes, organized as a giant network of homogeneous maxicircles and heterogeneous minicircles. We recently developed KOMICS, an analysis toolkit that automates the assembly and circularization of the mitochondrial genomes of Kinetoplastid parasites. While this tool overcomes the limitation of extracting mitochondrial assemblies from Next-Generation Sequencing datasets, interpreting and visualizing the genetic (dis)similarity within and between samples remains a time-consuming process. Results: Here, we present a new analysis toolkit, rKOMICS, to streamline the analyses of minicircle sequence diversity in population-scale genome projects. rKOMICS is a user-friendly R package that has simple installation requirements and that is applicable to all 27 trypanosomatid genera. Once minicircle sequence alignments are generated, rKOMICS allows users to examine, summarize and visualize minicircle sequence diversity within and between samples through the analyses of minicircle sequence clusters. We showcase the functionalities of the (r)KOMICS tool suite using a whole-genome sequencing dataset from a recently published study on the history of diversification of the Leishmania braziliensis species complex in Peru. Analyses of population diversity and structure highlighted differences in minicircle sequence richness and composition between Leishmania subspecies, and between subpopulations within subspecies.
Conclusion: The rKOMICS package establishes a critical framework to manipulate, explore and extract biologically relevant information from mitochondrial minicircle assemblies in tens to hundreds of samples simultaneously and efficiently. This should facilitate research that aims to develop new molecular markers for identifying species-specific minicircles, or to study the ancestry of parasites for complementary insights into their evolutionary history.

    Swine blood transcriptomics: Application and advancement

    Improving swine feed efficiency (FE) by selection for low residual feed intake (RFI) is of practical interest. However, whether selection for low RFI compromises a pig’s immune response is not clear. In addition, current RFI-based selection for improving feed efficiency is expensive and time-consuming. Seeking alternative tools to facilitate selection, such as predictive biomarkers for RFI, is of great interest. The objectives of this thesis are as follows: (1) to investigate whether selection for low RFI compromises a pig’s immune response; (2) to develop candidate biomarkers applicable at an early growth stage for predicting RFI at a late growth stage; (3) to improve the annotation of the porcine blood transcriptome. In Chapter 2, pigs of two lines divergently selected for RFI were injected with lipopolysaccharide (LPS). Transcriptomes of peripheral blood at baseline and at multiple time points post injection were profiled by RNA-seq. LPS injection induced a systemic inflammatory response in both RFI lines. However, no significant differences were detected in the dynamics of body temperature, blood cell count and cytokine levels during the time course. Only a very small number of differentially expressed genes (DEGs) were detected between the lines over all time points, though ~50% of blood genes were differentially expressed post LPS injection compared to baseline for each line. The two lines were largely similar in most biological pathways and processes studied. Minor differences included a slightly lower level of inflammatory response in the low- versus high-RFI animals. Cross-species comparison showed that humans and pigs responded to LPS stimulation similarly at both the gene and pathway levels, though pigs are more tolerant to LPS than humans. In Chapter 3, post-weaning blood transcriptomic differences between the two lines were studied by RNA-seq.
DEGs between the lines significantly overlapped gene sets associated with human diseases, such as eating disorders, hyperphagia and mitochondrial disease. Genes functioning in the mitochondrion and proteasome had lower expression, and genes functioning in signaling had higher expression, in the low-RFI group relative to the high-RFI group. Expression levels of five genes differentially expressed between the two groups were significantly associated with individual animals’ RFI values. These five genes were candidate biomarkers for predicting RFI. Given the limitations of the current annotation of the porcine reference genome, a high-quality annotated transcriptome of porcine peripheral blood was built in the last study via a hybrid assembly strategy, using a large amount of blood RNA-seq data from the studies mentioned above and from public databases. Taken together, this work provides evidence that selection for low RFI did not significantly compromise pigs’ immune response to systemic inflammation, offers a few candidate biomarkers for predicting RFI to facilitate RFI-based selection, and significantly advances the structural and functional annotation of the porcine blood transcriptome.

    More Than 1,001 Problems with Protein Domain Databases: Transmembrane Regions, Signal Peptides and the Issue of Sequence Homology

    Large-scale genome sequencing gained general importance for life science because functional annotation of otherwise experimentally uncharacterized sequences is made possible by the theory of biomolecular sequence homology. Historically, the paradigm of similarity of protein sequences implying common structure, function and ancestry was generalized based on studies of globular domains. Having the same fold imposes strict conditions over the packing in the hydrophobic core requiring similarity of hydrophobic patterns. The implications of sequence similarity among non-globular protein segments have not been studied to the same extent; nevertheless, homology considerations are silently extended for them. This appears especially detrimental in the case of transmembrane helices (TMs) and signal peptides (SPs) where sequence similarity is necessarily a consequence of physical requirements rather than common ancestry. Thus, matching of SPs/TMs creates the illusion of matching hydrophobic cores. Therefore, inclusion of SPs/TMs into domain models can give rise to wrong annotations. More than 1001 domains among the 10,340 models of Pfam release 23 and 18 domains of SMART version 6 (out of 809) contain SP/TM regions. As expected, fragment-mode HMM searches generate promiscuous hits limited to solely the SP/TM part among clearly unrelated proteins. More worryingly, we show explicit examples that the scores of clearly false-positive hits, even in global-mode searches, can be elevated into the significance range just by matching the hydrophobic runs. In the PIR iProClass database v3.74 using conservative criteria, we find that at least between 2.1% and 13.6% of its annotated Pfam hits appear unjustified for a set of validated domain models. Thus, false-positive domain hits enforced by SP/TM regions can lead to dramatic annotation errors where the hit has nothing in common with the problematic domain model except the SP/TM region itself. 
We suggest a workflow for flagging problematic hits arising from SP/TM-containing models for critical reconsideration by annotation users.
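One way such flagging could work is sketched below. This is an illustration only, not the authors' workflow: it assumes each domain hit is described by the model positions it aligns to, and that the SP/TM segments of the model are known as non-overlapping intervals. The function names and the 0.8 threshold are hypothetical choices.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two inclusive intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def is_sptm_dominated(hit_start, hit_end, sptm_segments, threshold=0.8):
    """Flag a hit whose aligned region lies mostly inside SP/TM segments.

    hit_start, hit_end: inclusive model coordinates covered by the hit.
    sptm_segments: non-overlapping inclusive (start, end) SP/TM intervals.
    threshold: assumed cut-off for the covered fraction (hypothetical value).
    """
    hit_length = hit_end - hit_start + 1
    covered = sum(overlap(hit_start, hit_end, s, e) for s, e in sptm_segments)
    return covered / hit_length >= threshold


# Hypothetical model with a signal peptide at positions 1-25.
segments = [(1, 25)]
short_hit_flagged = is_sptm_dominated(5, 30, segments)   # 21 of 26 positions in the SP
long_hit_flagged = is_sptm_dominated(5, 200, segments)   # 21 of 196 positions in the SP
print(short_hit_flagged, long_hit_flagged)
```

A flagged hit is a candidate for manual review, since the match may rest on the hydrophobic segment alone.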

    Moniliophthora perniciosa genome: assembly and annotation of mitochondrion and development of a semi-automatic system of genes annotation

    Advisor: Gonçalo Amarante Guimarães Pereira. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Biologia. Abstract: The mitochondrial genome (mtDNA) of the fungus Moniliophthora perniciosa was completely sequenced. It contains 109,103 bp with a GC content of 31%, lower than that found in the nuclear genome sequences (47%). It is the largest fungal mitochondrial genome described to date, and its size results from large intergenic regions containing several ORFs that may yet be confirmed as new genes. Computational analyses indicate variation in the number of mtDNA copies per cell among the different libraries, with a significant tendency toward fewer mtDNA copies per cell in the group of libraries derived from cultures subjected to repeated subculturing. Most of the typical genes (atp6, atp9, nad1-6, nad4L, cox1-3 and cob, with atp8 as the exception), all rRNAs, the tRNAs (at least one was found for each amino acid) and the genes of the intronic ORFs are oriented clockwise. An rps3 gene and a group of ORFs with characteristics similar to those of the typical genes were also identified. Surprisingly, the mtDNA contains a region occupied by an invertron structure characteristic of kalilo-like plasmids, stably integrated into the genome in all varieties of the C biotype and present in the other biotypes tested. This sequence is available in GenBank under accession number AY376688. The other line of work was carried out together with other bioinformaticians of the Laboratório de Genômica e Expressão (Genomics and Expression Laboratory).
Gene mining and annotation tools were developed for genome projects, most notably Gene Projects, which allows genes to be mined and annotated during the sequencing process, and the new annotation interface, developed to optimize the quality and efficiency of gene annotation. Degree level: Doctorate. Area: Biochemistry. Title: Doctor in Functional and Molecular Biology.
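The GC percentages quoted in this abstract (31% for the mtDNA, 47% for the nuclear genome) are simple base-composition ratios. The sketch below shows how such a value could be computed for any sequence string. The function name and the toy sequence are hypothetical, and ambiguous bases such as N are assumed to be excluded from the denominator.

```python
def gc_fraction(sequence):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    # Count only unambiguous bases so that N and gap characters do not skew the ratio.
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / total


# Toy sequence (hypothetical): 4 of the 10 unambiguous bases are G or C.
toy_gc = gc_fraction("GGCCAATTATNN")
print(f"GC content: {toy_gc:.0%}")
```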

    The Marvelous World of tRNAs: From Accurate Mapping to Chemical Modifications

    Since the discovery of transfer RNAs (tRNAs) as decoders of the genetic code, life science has been transformed. In particular, once the importance of tRNAs in protein synthesis had been established, researchers recognized that the role of tRNAs in cellular regulation extends beyond this paradigm. A strong impetus for these discoveries came from advances in large-scale RNA sequencing (RNA-seq) and increasingly sophisticated algorithms. Sequencing tRNAs is challenging both experimentally and in terms of the subsequent computational analysis. In RNA-seq data analysis, mapping tRNA reads to a reference genome is an error-prone task. This is particularly true because chemical modifications introduce systematic reverse transcription errors while, at the same time, the genomic loci are only approximately identical due to the post-transcriptional maturation of tRNAs. Additionally, their multi-copy nature complicates the precise assignment of reads to their true genomic origin. In the course of the thesis, a computational workflow was established to enable accurate mapping of tRNA reads. The developed method removes most of the mapping artifacts introduced by simpler mapping schemes, as demonstrated using both simulated and human RNA-seq data. Subsequently, the resulting mapping profiles can be used for reliable identification of specific chemical tRNA modifications with a false discovery rate of only 2%. For that purpose, computational analysis methods were developed that facilitate the sensitive detection and even classification of most tRNA modifications based on their mapping profiles. This comprised untreated RNA-seq data of various species as well as treated data of Bacillus subtilis designed to display modifications as a specific read-out in the mapping profile. The discussion focuses on sources of artifacts that complicate the profiling of tRNA modifications and strategies to overcome them.
Exemplary studies on the modification patterns of different human tissues and the developmental stages of Dictyostelium discoideum were carried out. These suggested regulatory functions of tRNA modifications in development and during cell differentiation. The main experimental difficulties of tRNA sequencing are caused by extensive, stable secondary structures and the presence of chemical modifications. Current RNA-seq methods do not sample the entire tRNA pool, lose short tRNA fragments, or lack specificity for tRNAs. Within this thesis, the benchmarking and improvement of LOTTE-seq, a method for specific selection of tRNAs for high-throughput sequencing, showed that the method solves the experimental challenges and avoids the disadvantages of previous tRNA-seq protocols. Applying the accurate tRNA mapping strategy to LOTTE-seq and other tRNA-specific RNA-seq methods demonstrated that the content of mature tRNAs is highest in LOTTE-seq data, ranging from 90% in Spinacia oleracea to 100% in D. discoideum. Additionally, the thesis addressed the fact that tRNAs are multi-copy genes that undergo concerted evolution, which keeps sequences of paralogous genes effectively identical. Therefore, it is impossible to distinguish orthologs from paralogs by sequence similarity alone. Synteny, the maintenance of relative genomic positions, helps to disambiguate evolutionary relationships in this situation. During this thesis, a workflow was developed for synteny-based orthology identification of tRNA genes. The workflow is based on the use of pre-computed genome-wide multiple sequence alignment blocks as anchors to establish syntenic conservation of sequence intervals. Syntenic clusters of concertedly evolving genes of different tRNA families are then subdivided and processed by cograph editing to recover their duplication histories.
A useful outcome of this study is that it highlights the technical problems and difficulties associated with an accurate analysis of the evolution of multi-copy genes. To showcase the method, the evolution of tRNAs in primates and fruit flies was reconstructed. In the last decade, a number of reports have described novel aspects of tRNAs in terms of the diversity of their genes. For example, nuclear-encoded mitochondrial-derived tRNAs (nm-tRNAs) have been reported, whose presence provokes intriguing questions about their functionality. Within this thesis, an annotation strategy was developed that led to the identification of 335 and 43 novel nm-tRNAs in human and mouse, respectively. Interestingly, downstream analyses showed that the localization of several nm-tRNAs in introns and the over-representation of conserved RNA-binding sites of proteins involved in splicing suggest a potential regulatory function of intronic nm-tRNAs in splicing.

    The haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar reveal novel pan-genome and allele-specific transcriptome features

    Background: Cassava (Manihot esculenta) is an important clonally propagated food crop in tropical and subtropical regions worldwide. Genetic gain by molecular breeding has been limited, partially because cassava is a highly heterozygous crop with a repetitive and difficult-to-assemble genome. Findings: Here we demonstrate that Pacific Biosciences high-fidelity (HiFi) sequencing reads, in combination with the assembler hifiasm, produced genome assemblies at near complete haplotype resolution with higher continuity and accuracy compared to conventional long sequencing reads. We present 2 chromosome-scale haploid genomes phased with Hi-C technology for the diploid African cassava variety TME204. With consensus accuracy >QV46, contig N50 >18 Mb, BUSCO completeness of 99%, and 35k phased gene loci, it is the most accurate, continuous, complete, and haplotype-resolved cassava genome assembly so far. Ab initio gene prediction with RNA-seq data and Iso-Seq transcripts identified abundant novel gene loci, with enriched functionality related to chromatin organization, meristem development, and cell responses. During tissue development, differentially expressed transcripts of different haplotype origins were enriched for different functionality. In each tissue, 20-30% of transcripts showed allele-specific expression (ASE) differences. ASE bias was often tissue specific and inconsistent across different tissues. Direction-shifting was observed in <2% of the ASE transcripts. Despite high gene synteny, the HiFi genome assembly revealed extensive chromosome rearrangements and abundant intra-genomic and inter-genomic divergent sequences, with large structural variations mostly related to LTR retrotransposons. We use the reference-quality assemblies to build a cassava pan-genome and demonstrate its importance in representing the genetic diversity of cassava for downstream reference-guided omics analysis and breeding. 
Conclusions: The phased and annotated chromosome pairs allow a systematic view of the heterozygous diploid genome organization in cassava with improved accuracy, completeness, and haplotype resolution. They will be a valuable resource for cassava breeding and research. Our study may also provide insights into developing cost-effective and efficient strategies for resolving complex genomes with high resolution, accuracy, and continuity.
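The contig N50 reported in this abstract is a standard continuity statistic: the length of the contig at which the cumulative sum of contig lengths, taken from longest to shortest, first reaches half of the total assembly size. A minimal sketch follows. The function name and the toy lengths are hypothetical and unrelated to the TME204 assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running sum to half the total defines N50.
        if running >= half_total:
            return length


# Toy assembly (hypothetical): total 290, half 145; 80 + 70 = 150 reaches it.
toy_n50 = n50([50, 80, 20, 70, 30, 40])
print(f"N50 = {toy_n50}")
```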

    History and future perspectives of barley genomics

    Barley (Hordeum vulgare), one of the most widely cultivated cereal crops, possesses a large genome of 5.1 Gbp. Through various international collaborations, the genome has recently been sequenced and assembled at the chromosome scale by exploiting available genetic and genomic resources. Many wild and cultivated barley accessions have been collected and preserved around the world. These accessions are crucial to obtain diverse natural and induced barley variants. The barley bioresource project aims to investigate the diversity of this crop based on purified seed and DNA samples of a large number of collected accessions. The long-term goal of this project is to analyse the genome sequences of major barley accessions worldwide. In view of technical limitations, a strategy has been employed to establish the exome structure of a selected number of accessions and to perform high-quality chromosome-scale assembly of the genomes of several major representative accessions. For the future project, an efficient annotation pipeline is essential for establishing the function of genomes and genes as well as for using this information for sequence-based digital barley breeding. In this article, the author reviews the existing barley resources along with their applications and discusses possible future directions of research in barley genomics.