72 research outputs found
Heuristic pairwise alignment of de Bruijn graphs to facilitate simultaneous transcript discovery in related organisms from RNA-Seq data
BACKGROUND: The advance of high-throughput sequencing has made it possible to obtain new transcriptomes and study splicing mechanisms in non-model organisms. In these studies, there is often a need to investigate the transcriptomes of two related organisms at the same time in order to find the similarities and differences between them. The traditional approach to address this problem is to perform de novo transcriptome assemblies to obtain predicted transcripts for these organisms independently and then employ similarity comparison algorithms to study them. RESULTS: Instead of obtaining predicted transcripts for these organisms separately from the intermediate de Bruijn graph structures employed by de novo transcriptome assembly algorithms, we develop an algorithm to allow direct comparisons between paths in two de Bruijn graphs by first enumerating short paths in both graphs, and iteratively extending paths in one graph that have high similarity to paths in the other graph to obtain longer corresponding paths between the two graphs. These paths represent predicted transcripts that are present in both organisms. We show that our algorithm recovers significantly more shared transcripts than traditional approaches by applying it to simultaneously recover transcripts in mouse against rat and in mouse against human from publicly available RNA-Seq libraries. Our strategy utilizes sequence similarity information within the paths that is often more reliable than coverage information. CONCLUSIONS: Our approach generalizes the pairwise sequence alignment problem to allow the input to be non-linear structures, and provides a heuristic to reliably recover similar paths from the two structures. Our algorithm allows detailed investigation of the similarities and differences in alternative splicing between the two organisms at both the sequence and structure levels, even in the absence of reference transcriptomes or a closely related model organism
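The first step described above, enumerating short paths in two de Bruijn graphs and pairing those with high similarity, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the use of (k-1)-mers as nodes, and `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, and the iterative path-extension step is not shown.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def short_paths(graph, n_nodes):
    """Yield the sequence spelled by every simple path with n_nodes nodes."""
    def extend(path):
        if len(path) == n_nodes:
            # First node in full, then the last base of each later node.
            yield path[0] + "".join(node[-1] for node in path[1:])
            return
        for nxt in sorted(graph.get(path[-1], ())):
            if nxt not in path:  # keep paths simple (no repeated node)
                yield from extend(path + [nxt])

    for start in sorted(graph):
        yield from extend([start])


def similar_path_pairs(graph_a, graph_b, n_nodes, min_ratio):
    """Pair short paths from the two graphs whose similarity is >= min_ratio.

    SequenceMatcher is an assumed stand-in for a real alignment score.
    """
    paths_b = list(short_paths(graph_b, n_nodes))
    pairs = []
    for seq_a in short_paths(graph_a, n_nodes):
        for seq_b in paths_b:
            if SequenceMatcher(None, seq_a, seq_b).ratio() >= min_ratio:
                pairs.append((seq_a, seq_b))
    return pairs
```

In a fuller version, each pair returned by `similar_path_pairs` would serve as a seed that is extended node by node while the similarity stays high.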
Genomic and Transcriptomic Studies on Non-Model Organisms
As advances in high-throughput sequencing enable the generation of large volumes of genomic information, they give researchers the opportunity to study non-model organisms even in the absence of a fully sequenced genome. This progress calls for powerful sequence assembly algorithms, because these technologies also raise challenging assembly problems: (1) some RNA products are highly expressed while others have much lower expression levels; (2) the data cannot easily be represented as a linear structure because of post-transcriptional modifications such as alternative splicing; (3) conserved sequences in domains shared across gene families can result in assembly errors; and (4) sequencing errors arise from technical limitations. Assembly algorithms must overcome these difficulties. In these studies, there is often a need to identify transcripts in non-model organisms that are similar to transcripts found in related organisms. The traditional approach to address this problem is to perform de novo transcriptome assemblies to obtain predicted transcripts for these organisms and then employ similarity comparison algorithms to identify them. I observe that it is possible to obtain a more complete set of similar transcripts from transcriptome assembly by making use of evolutionary information. I apply new algorithms to study non-model organisms that play an important role in applied biology.
Moreover, improvements in sequencing technologies and the application of current algorithms also help to study interkingdom signals between blow flies and bacterial communities. With current computational tools, I annotate the genomes of Proteus mirabilis and Providencia stuartii, which play an important role in bacteria-insect interactions. The study shows significant features of the isolated strains, which provides useful information for developing and testing hypotheses about related interactions between insects and bacteria
An approach to improved microbial eukaryotic genome annotation
New sequencing technologies have considerably accelerated the rate at which genomic data is
being generated. One ongoing challenge is the accurate structural annotation of those novel
genomes once sequenced and assembled, in particular if the organism does not have close
relatives with well-annotated genomes. Whole-transcriptome sequencing (RNA-Seq) and
assembly (both of which share similarities to whole-genome sequencing and assembly,
respectively) have been shown to dramatically increase the accuracy of gene annotation. Read
coverage, inferred splice junctions and assembled transcripts can provide valuable information
about gene structure. Several annotation pipelines have been developed to automate structural
annotation by incorporating information from RNA-Seq, as well as protein sequence similarity
data, with the goal of reaching the accuracy of an expert curator. Annotation pipelines follow a
similar workflow. The first step is to identify repetitive regions to prevent misinformed sequence
alignments and gene predictions. The next step is to construct a database of evidence from
experimental data such as RNA-Seq mapping and assembly, and protein sequence alignments,
which are used to inform the generalised Hidden Markov Models of gene prediction software.
The final step is to consolidate sequence alignments and gene predictions into a high-confidence
consensus set. Thus, automated pipelines are complex, and therefore susceptible to incomplete
and erroneous use of information, which can poison gene predictions and consensus model
building. Here, we present an improved approach to microbial eukaryotic genome annotation.
Its conception was based on identifying and mitigating potential sources of error and bias that
are present in available pipelines. Our approach has two main aspects. The first is to create a
more complete and diverse set of extrinsic evidence to better inform gene predictions. The
second is to use extrinsic evidence in tandem with predictions such that the influence of their
respective biases in the consensus gene models is reduced. We benchmarked our new tool
against three known pipelines, showing significant gains in gene, transcript, exon and intron
sensitivity and specificity in the genome annotation of microbial eukaryotes
New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly
Great efforts have been devoted to deciphering the sequence composition of
the genomes and transcriptomes of diverse organisms. Continuing advances in
high-throughput sequencing technologies have led to a decline in associated
costs, facilitating a rapid increase in the amount of available genetic data. In
particular, genome studies have undergone a fundamental paradigm shift where
genome projects are no longer limited by sequencing costs, but rather by
computational problems associated with assembly. There is an urgent demand
for more efficient and more accurate methods. Most recently, "hybrid"
methods that integrate short- and long-read data have been devised to address
this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a
bipartite overlap graph between long reads and restrictively filtered short-read
unitigs. This graph is translated into a long-read overlap graph. By design,
unitigs are both unique and almost free of assembly errors. As a consequence,
only a few spurious overlaps are introduced into the graph. Instead of the more
conventional approach of removing tips, bubbles, and other local features,
LazyB extracts subgraphs whose global properties approach a disjoint union of
paths in multiple steps, utilizing properties of proper interval graphs. A
prototype implementation of LazyB, entirely written in Python, not only yields
significantly more accurate assemblies of the yeast, fruit fly, and human
genomes compared to state-of-the-art pipelines, but also requires much less
computational effort. An optimized C++ implementation dubbed MuCHSALSA
further significantly reduces resource demands.
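The translation of a bipartite graph (long reads on one side, short-read unitigs on the other) into a long-read overlap graph can be illustrated with a small projection. This is a simplified sketch, not LazyB's actual code: the input format and the name `long_read_overlap_graph` are assumptions, and the positions, orientations, and consistency checks that a real assembler needs are ignored.

```python
from collections import defaultdict
from itertools import combinations


def long_read_overlap_graph(unitig_hits):
    """Project a bipartite read/unitig graph onto the long reads.

    unitig_hits: dict mapping a long-read id to the set of unitig ids
    anchored on that read (hypothetical input format).
    Returns a dict mapping a sorted read pair (a, b) to the number of
    unitigs the two reads share; a shared unitig suggests an overlap.
    """
    # Invert the mapping: for every unitig, which long reads contain it?
    reads_by_unitig = defaultdict(set)
    for read, unitigs in unitig_hits.items():
        for unitig in unitigs:
            reads_by_unitig[unitig].add(read)

    # Two long reads get an edge for every unitig they have in common.
    edges = defaultdict(int)
    for reads in reads_by_unitig.values():
        for a, b in combinations(sorted(reads), 2):
            edges[(a, b)] += 1
    return dict(edges)
```

Because unitigs are described as unique and almost error-free, an edge supported by a shared unitig is unlikely to be spurious, which is the property the abstract relies on.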
Advances in RNA-seq have facilitated tremendous insights into the role of
both coding and non-coding transcripts. Yet, the complete and accurate
annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance
between related splice events and presents high noise levels and other biases.
The computational reconstruction remains a critical bottleneck.
Ryūtō implements an extension of common splice graphs facilitating the integration
of reads spanning multiple splice sites and paired-end reads bridging distant
transcript parts. The decomposition of read coverage patterns is modeled as a
minimum-cost flow problem. Using phasing information from multi-splice and
paired-end reads, nodes with uncertain connections are decomposed step-wise
via Linear Programming.
Ryūtō's performance compares favorably with
state-of-the-art methods on both simulated and real-life datasets. Despite
ongoing research and our own contributions, progress on traditional single
sample assembly has brought no major breakthrough. Multi-sample RNA-Seq
experiments provide more information which, however, is challenging to utilize
due to the large amount of accumulating errors. An extension to Ryūtō
enables the reconstruction of consensus transcriptomes from multiple RNA-seq
data sets, incorporating consensus calling at low level features. Benchmarks
show stable improvements already at 3 replicates.
Ryūtō outperforms competing approaches, providing a better and user-adjustable
sensitivity-precision trade-off. Ryūtō consistently improves assembly on
replicates, demonstrable also when mixing conditions or time series and for
differential expression analysis. Ryūtō's approach towards guided assembly is
equally unique. It allows users to adjust results based on the quality of the
guide, even for multi-sample assembly.
1 Preface
1.1 Assembly: A vast and fast evolving field
1.2 Structure of this Work
1.3 Available
2 Introduction
2.1 Mathematical Background
2.2 High-Throughput Sequencing
2.3 Assembly
2.4 Transcriptome Expression
3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly
3.1 Background
3.2 Strategy
3.3 Data preprocessing
3.4 Processing of the overlap graph
3.5 Post Processing of the Path Decomposition
3.6 Benchmarking
3.7 MuCHSALSA - Moving towards the future
4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly
4.1 Background
4.2 Strategy
4.3 The Ryūtō core algorithm
4.4 Improved Multi-sample transcript assembly with Ryūtō
5 Conclusion & Future Work
5.1 Discussion and Outlook
5.2 Summary and Conclusion
From RNA-seq reads to differential expression results
Many methods and tools are available for preprocessing high-throughput RNA sequencing data and detecting differential expression
A computational framework for transcriptome assembly and annotation in non-model organisms: the case of Venturia inaequalis
Philosophiae Doctor - PhD
In this dissertation three computational approaches are presented that enable optimization of reference-free transcriptome reconstruction. The first addresses the selection of bona fide reconstructed transcribed fragments (transfrags) from de novo transcriptome assemblies and annotation with a multiple domain co-occurrence framework. We showed that selected transfrags are functionally relevant and represented over 94% of the information derived from annotation by transference. The second approach relates to quality score based RNA-seq sub-sampling and the description of a novel sequence similarity-derived metric for quality assessment of de novo transcriptome assemblies. A detailed systematic analysis of the side effects induced by quality score based trimming and/or filtering on artefact removal and transcriptome quality is described. Aggressive trimming produced incompletely reconstructed and missing transfrags. This approach was applied in generating an optimal transcriptome assembly for a South African isolate of V. inaequalis. The third approach deals with the computational partitioning of transfrags assembled from RNA-Seq of mixed host and pathogen reads. We used this strategy to correct a publicly available transcriptome assembly for V. inaequalis (Indian isolate). We binned 50% of the latter to apple transfrags and identified putative immunity transcript models. Comparative transcriptomic analysis between fungal transfrags from the Indian and South African isolates reveals effectors or transcripts that may be expressed in planta upon morphogenic differentiation.
These studies have successfully identified V. inaequalis specific transfrags that can facilitate gene discovery. The unique access to an in-house draft genome assembly allowed us to provide a preliminary description of genes that are implicated in pathogenesis. Gene prediction with bona fide transfrags produced 11,692 protein-coding genes. We identified two hydrophobin-like genes and six accessory genes of the melanin biosynthetic pathway that are implicated in the invasive action of the appressorium. The cazyome reveals an impressive repertoire of carbohydrate degrading enzymes and carbohydrate-binding modules, amongst which are six polysaccharide lyases and the largest number of carbohydrate esterases (twenty-eight) known in any fungus sequenced to date
Detecting and comparing non-coding RNAs in the high-throughput era.
In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and more generally any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino-acids. Moreover for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies and especially dealing with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data
Technology dictates algorithms: Recent developments in read alignment
Massively parallel sequencing techniques have revolutionized biological and
medical sciences by providing unprecedented insight into the genomes of humans,
animals, and microbes. Modern sequencing platforms generate enormous amounts of
genomic data in the form of nucleotide sequences or reads. Aligning reads onto
reference genomes enables the identification of individual-specific genetic
variants and is an essential step of the majority of genomic analysis
pipelines. Aligned reads are essential for answering important biological
questions, such as detecting mutations driving various human diseases and
complex traits as well as identifying species present in metagenomic samples.
The read alignment problem is extremely challenging due to the large size of
analyzed datasets and numerous technological limitations of sequencing
platforms, and researchers have developed novel bioinformatics algorithms to
tackle these difficulties. Importantly, computational algorithms have evolved
and diversified in accordance with technological advances, leading to today's
diverse array of bioinformatics tools. Our review provides a survey of
algorithmic foundations and methodologies across 107 alignment methods
published between 1988 and 2020, for both short and long reads. We provide
rigorous experimental evaluation of 11 read aligners to demonstrate the effect
of these underlying algorithms on speed and efficiency of read aligners. We
separately discuss how longer read lengths produce unique advantages and
limitations to read alignment techniques. We also discuss how general alignment
algorithms have been tailored to the specific needs of various domains in
biology, including whole transcriptome, adaptive immune repertoire, and human
microbiome studies
Comparison of Multiple Organisms Using de novo Transcriptome Assembly
Technical report by Kevin Legarreta on using de Bruijn graphs to compare transcriptomes from different organisms. We are trying to extend work described in Fu S, Tarone AM, Sze SH. (2015) Heuristic pairwise alignment
of de Bruijn graphs to facilitate simultaneous transcript discovery in
related organisms from RNA-Seq data. BMC Genomics 16:S5
Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data
Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads that represent a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at a high coverage of the original sequence such that many reads overlap. Overlap relationships allow for the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence called contigs that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in Bioinformatics.
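The idea that overlap relationships let reads be ordered and merged into contigs can be shown with a toy greedy merger. This is a minimal sketch assuming error-free reads from a single strand; the names `overlap_length` and `greedy_assemble` and the `min_overlap` parameter are illustrative, and this is not how Focus or any production assembler works.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
```

For example, `greedy_assemble(["ACGTAC", "TACGGA", "GGATTT"])` merges the three reads into one contig. The merge step also shows the information loss the next paragraph mentions: once reads are merged, which read supported which base is no longer recorded.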
The assembly of short reads is an aggregative process where critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) the implementation of a robust assembly and analysis tool built on the hybrid graph platform; 2) the development and application of graph mining to extract biologically relevant features in NGS data sets; 3) the integration of domain-specific knowledge to improve the assembly and analysis process; and 4) the construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance.
In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph