72 research outputs found

Heuristic pairwise alignment of de Bruijn graphs to facilitate simultaneous transcript discovery in related organisms from RNA-Seq data

Authors: Fu Shuhua; Sze Sing-Hoi; Tarone Aaron
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2015

BACKGROUND: The advance of high-throughput sequencing has made it possible to obtain new transcriptomes and study splicing mechanisms in non-model organisms. In these studies, there is often a need to investigate the transcriptomes of two related organisms at the same time in order to find the similarities and differences between them. The traditional approach to this problem is to perform de novo transcriptome assemblies to obtain predicted transcripts for these organisms independently and then employ similarity comparison algorithms to study them. RESULTS: Instead of obtaining predicted transcripts for these organisms separately from the intermediate de Bruijn graph structures employed by de novo transcriptome assembly algorithms, we develop an algorithm that allows direct comparisons between paths in two de Bruijn graphs by first enumerating short paths in both graphs, and then iteratively extending paths in one graph that have high similarity to paths in the other graph to obtain longer corresponding paths between the two graphs. These paths represent predicted transcripts that are present in both organisms. We show that our algorithm recovers significantly more shared transcripts than traditional approaches by applying it to simultaneously recover transcripts in mouse against rat and in mouse against human from publicly available RNA-Seq libraries. Our strategy utilizes sequence similarity information within the paths, which is often more reliable than coverage information. CONCLUSIONS: Our approach generalizes the pairwise sequence alignment problem to allow the input to be non-linear structures, and provides a heuristic to reliably recover similar paths from the two structures. Our algorithm allows detailed investigation of the similarities and differences in alternative splicing between the two organisms at both the sequence and structure levels, even in the absence of reference transcriptomes or a closely related model organism.
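
The first step described in this abstract (enumerating short paths in two de Bruijn graphs and pairing those with high similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_de_bruijn`, `enumerate_paths`, and `best_matches` are hypothetical, the similarity measure is the standard-library `difflib.SequenceMatcher` ratio used as a stand-in, and the iterative path extension step is left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def enumerate_paths(graph, max_nodes):
    """Spell the sequence of every simple path with exactly max_nodes nodes."""
    sequences = set()

    def walk(path):
        if len(path) == max_nodes:
            # First node in full, then the last character of each later node.
            sequences.add(path[0] + "".join(node[-1] for node in path[1:]))
            return
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:  # keep paths simple (no repeated node)
                walk(path + [nxt])

    for start in list(graph):
        walk([start])
    return sequences


def best_matches(paths_a, paths_b, min_ratio):
    """Pair each path of graph A with its most similar path of graph B.

    Similarity is difflib's ratio, an assumption made for this sketch only.
    """
    matches = []
    for seq_a in sorted(paths_a):
        scored = [(SequenceMatcher(None, seq_a, seq_b).ratio(), seq_b)
                  for seq_b in sorted(paths_b)]
        if not scored:
            continue
        ratio, seq_b = max(scored)
        if ratio >= min_ratio:
            matches.append((seq_a, seq_b, ratio))
    return matches


graph_a = build_de_bruijn(["ACGTT"], k=3)
graph_b = build_de_bruijn(["CCGTT"], k=3)
shared = best_matches(enumerate_paths(graph_a, 3), enumerate_paths(graph_b, 3), 0.9)
```

In a fuller treatment, each matched pair would serve as a seed that is extended node by node while the similarity stays high, which is the part of the method this sketch omits.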

Crossref

Springer - Publisher Connector

Texas A&M Repository

PubMed Central

Genomic and Transcriptomic Studies on Non-Model Organisms

Author: Fu Shuhua
Publication date: 21/09/2015

As advances in high-throughput sequencing enable the generation of large volumes of genomic information, researchers gain the opportunity to study non-model organisms even in the absence of a fully sequenced genome. This progress calls for powerful sequence assembly algorithms, because these technologies also raise challenging assembly problems: (1) some RNA products are highly expressed while others have much lower expression levels; (2) the data cannot easily be represented as a linear structure, owing to post-transcriptional modifications such as alternative splicing; (3) conserved sequences in domains shared across gene families can cause assembly errors; and (4) technical limitations introduce sequencing errors. Useful assembly algorithms must overcome these difficulties. In these studies, there is often a need to identify transcripts in non-model organisms that are similar to transcripts found in related organisms. The traditional approach is to perform de novo transcriptome assemblies to obtain predicted transcripts for these organisms and then employ similarity comparison algorithms to identify them. I observe that it is possible to obtain a more complete set of similar transcripts from transcriptome assembly by making use of evolutionary information. I apply new algorithms to study non-model organisms that play an important role in applied biology. Moreover, improvements in sequencing technologies and the application of current algorithms also help to study interkingdom signals between blow flies and bacterial communities. With current computational tools, I annotate the genomes of Proteus mirabilis and Providencia stuartii, which play an important role in bacteria-insect interactions. The study reveals significant features of the isolated strains, which provides useful information for developing and testing hypotheses about interactions between insects and bacteria.

Texas A&M Repository

An approach to improved microbial eukaryotic genome annotation

Author: Sarrasin Matthew
Publication date: 01/12/2017

New sequencing technologies have considerably accelerated the rate at which genomic data is being generated. One ongoing challenge is the accurate structural annotation of those novel genomes once sequenced and assembled, in particular if the organism does not have close relatives with well-annotated genomes. Whole-transcriptome sequencing (RNA-Seq) and assembly, which resemble whole-genome sequencing and assembly respectively, have been shown to dramatically increase the accuracy of gene annotation. Read coverage, inferred splice junctions and assembled transcripts can provide valuable information about gene structure. Several annotation pipelines have been developed to automate structural annotation by incorporating information from RNA-Seq, as well as protein sequence similarity data, with the goal of reaching the accuracy of an expert curator. Annotation pipelines follow a similar workflow. The first step is to identify repetitive regions to prevent misinformed sequence alignments and gene predictions. The next step is to construct a database of evidence from experimental data such as RNA-Seq mapping and assembly, and protein sequence alignments, which are used to inform the generalised Hidden Markov Models of gene prediction software. The final step is to consolidate sequence alignments and gene predictions into a high-confidence consensus set. Automated pipelines are thus complex, and therefore susceptible to incomplete and erroneous use of information, which can poison gene predictions and consensus model building. Here, we present an improved approach to microbial eukaryotic genome annotation. Its conception was based on identifying and mitigating potential sources of error and bias that are present in available pipelines. Our approach has two main aspects. The first is to create a more complete and diverse set of extrinsic evidence to better inform gene predictions. The second is to use extrinsic evidence in tandem with predictions such that the influence of their respective biases in the consensus gene models is reduced. We benchmarked our new tool against three known pipelines, showing significant gains in gene, transcript, exon and intron sensitivity and specificity in the genome annotation of microbial eukaryotes.

Dépôt Institutionnel Numérique

New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

Author: Gatter Thomas
Publication date: 18/02/2022

Great efforts have been devoted to deciphering the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only a few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive. RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. The computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information which, however, is challenging to utilize due to the large amount of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low-level features. Benchmarks show stable improvements already at 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, demonstrable also when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique. It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.

Contents: 1 Preface 1.1 Assembly: A vast and fast evolving field 1.2 Structure of this Work 1.3 Available 2 Introduction 2.1 Mathematical Background 2.2 High-Throughput Sequencing 2.3 Assembly 2.4 Transcriptome Expression 3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly 3.1 Background 3.2 Strategy 3.3 Data preprocessing 3.4 Processing of the overlap graph 3.5 Post Processing of the Path Decomposition 3.6 Benchmarking 3.7 MuCHSALSA – Moving towards the future 4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly 4.1 Background 4.2 Strategy 4.3 The Ryūtō core algorithm 4.4 Improved Multi-sample transcript assembly with Ryūtō 5 Conclusion & Future Work 5.1 Discussion and Outlook 5.2 Summary and Conclusion

Qucosa

HSSS - Hochschulschriftenserver der SLUB

Qucosa - Publikationsserver der Universität Leipzig

From RNA-seq reads to differential expression results

Authors: Oshlack Alicia; Robinson Mark D; Young Matthew D
Publication venue: BioMed Central
Publication date: 01/01/2010

Many methods and tools are available for preprocessing high-throughput RNA sequencing data and detecting differential expression.

Crossref

PubMed Central

University of Melbourne Institutional Repository

A computational framework for transcriptome assembly and annotation in non-model organisms: the case of Venturia inaequalis

Author: Kimbung Stanley Mbandi
Publication venue: University of the Western Cape Library Service
Publication date: 01/01/2014

Philosophiae Doctor - PhD

In this dissertation, three computational approaches are presented that enable optimization of reference-free transcriptome reconstruction. The first addresses the selection of bona fide reconstructed transcribed fragments (transfrags) from de novo transcriptome assemblies and their annotation with a multiple domain co-occurrence framework. We showed that selected transfrags are functionally relevant and represented over 94% of the information derived from annotation by transference. The second approach relates to quality score based RNA-seq sub-sampling and the description of a novel sequence similarity-derived metric for quality assessment of de novo transcriptome assemblies. A detailed systematic analysis of the side effects induced by quality score based trimming and/or filtering on artefact removal and transcriptome quality is described. Aggressive trimming produced incompletely reconstructed and missing transfrags. This approach was applied in generating an optimal transcriptome assembly for a South African isolate of V. inaequalis. The third approach deals with the computational partitioning of transfrags assembled from RNA-Seq of mixed host and pathogen reads. We used this strategy to correct a publicly available transcriptome assembly for V. inaequalis (Indian isolate). We binned 50% of the latter to apple transfrags and identified putative immunity transcript models. Comparative transcriptomic analysis between fungal transfrags from the Indian and South African isolates reveals effectors or transcripts that may be expressed in planta upon morphogenic differentiation. These studies have successfully identified V. inaequalis-specific transfrags that can facilitate gene discovery. The unique access to an in-house draft genome assembly allowed us to provide a preliminary description of genes that are implicated in pathogenesis. Gene prediction with bona fide transfrags produced 11,692 protein-coding genes. We identified two hydrophobin-like genes and six accessory genes of the melanin biosynthetic pathway that are implicated in the invasive action of the appressorium. The cazyome reveals an impressive repertoire of carbohydrate-degrading enzymes and carbohydrate-binding modules, amongst which are six polysaccharide lyases and the largest number of carbohydrate esterases (twenty-eight) known in any fungus sequenced to date.
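
The quality score based trimming mentioned in this abstract can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the function name `trim_read` is hypothetical, qualities are assumed to be Phred+33 encoded as in common FASTQ files, and bases are removed only from the 3' end. The dissertation's actual trimming and filtering procedures are not specified in the abstract.

```python
def trim_read(sequence, quality, min_phred=20, offset=33):
    """Drop 3'-end bases whose Phred score is below min_phred.

    Assumes Phred+33 encoded quality strings (an assumption of this sketch).
    Returns the trimmed sequence and its matching quality string.
    """
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under the +33 offset.
trimmed = trim_read("ACGT", "III#")
```

Raising `min_phred` trims more aggressively, which is the setting the abstract associates with incomplete and missing transfrags.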

UWC Theses and Dissertations

Detecting and comparing non-coding RNAs in the high-throughput era.

Authors: Bussotti Giovanni; Enright Anton J; Notredame Cedric
Publication venue: Int J Mol Sci
Publication date: 01/01/2013

In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and, more generally, any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino acids. Moreover, for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation, impeding comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies, dealing especially with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data.
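
As background for the sequence-level alignment this abstract refers to, here is a minimal Python sketch of a global alignment score in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -2) and the function name `global_alignment_score` are illustrative assumptions, and the sketch ignores RNA secondary structure, which the abstract identifies as the main constraint for many non-coding RNAs.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] against b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


score = global_alignment_score("ACGU", "ACU")
```

With a four-letter alphabet, unrelated sequences match by chance fairly often, which is one reason sequence-only scores like this one are less informative for RNA than for proteins.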

CiteSeerX

Directory of Open Access Journals

PubMed Central

Apollo (Cambridge)

Technology dictates algorithms: Recent developments in read alignment

Authors: Alkan Can; Alser Mohammed; Balliu Brunilda; Deshpande Dhrithi; Icer Baykal Pelin; Knyazev Sergey; Koslicki David; Mangul Serghei; Mutlu Onur; Rotman Jeremy; Shi Huwenbo; Singer Benjamin D.; Skums Pavel; Taraszka Kodi; Xue Victor; Yang Harry T.; Zelikovsky Alex
Publication date: 09/07/2020

Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data in the form of nucleotide sequences or reads. Aligning reads onto reference genomes enables the identification of individual-specific genetic variants and is an essential step of the majority of genomic analysis pipelines. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Importantly, computational algorithms have evolved and diversified in accordance with technological advances, leading to today's diverse array of bioinformatics tools. Our review provides a survey of algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2020, for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.
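
To make the read alignment problem concrete, the following Python sketch shows one common pattern, seed-and-extend with a k-mer hash index. It is a toy illustration, not a method taken from the review: the names `build_index` and `align_read` are hypothetical, the extension step is a plain mismatch count with no insertions or deletions, and production aligners use far more compact indexes and richer scoring.

```python
from collections import defaultdict


def build_index(reference, k):
    """Record every start position of each k-mer in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches):
    """Return sorted (start, mismatches) pairs for placements of the read.

    Seeds are exact k-mer hits; each seed implies a candidate start, which is
    verified by counting mismatches over the whole read (no indels).
    """
    hits = set()
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the read would hang off the reference
            window = reference[start:start + len(read)]
            mismatches = sum(x != y for x, y in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add((start, mismatches))
    return sorted(hits)


reference = "TTACGTACGGA"
index = build_index(reference, k=4)
exact = align_read("ACGTACGG", reference, index, k=4, max_mismatches=1)
noisy = align_read("ACGTACCG", reference, index, k=4, max_mismatches=1)
```

The index is built once per reference and reused for every read, so the per-read cost depends mostly on how many seed hits need verification.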

arXiv.org e-Print Archive

Repository for Publications and Research Data

Directory of Open Access Journals

Comparison of Multiple Organisms Using de novo Transcriptome Assembly

Authors: Humberto Ortiz-Zuazaga; Kevin Legarreta

Technical report by Kevin Legarreta on using de Bruijn graphs to compare transcriptomes from different organisms. We are trying to extend the work described in Fu S, Tarone AM, Sze SH. (2015) Heuristic pairwise alignment of de Bruijn graphs to facilitate simultaneous transcript discovery in related organisms from RNA-Seq data. BMC Genomics 16:S5.

FigShare

Focus: A Graph Approach for Data-Mining and Domain-Specific Assembly of Next Generation Sequencing Data

Author: Sommer Julia
Publication venue: DigitalCommons@UNMC
Publication date: 15/12/2017

Next Generation Sequencing (NGS) has emerged as a key technology leading to revolutionary breakthroughs in numerous biomedical research areas. These technologies produce millions to billions of short DNA reads, each of which represents a small fraction of the original target DNA sequence. These short reads contain little information individually but are produced at a high coverage of the original sequence such that many reads overlap. Overlap relationships allow the reads to be linearly ordered and merged by computational programs called assemblers into long stretches of contiguous sequence called contigs that can be used for research applications. Although the assembly of the reads produced by NGS remains a difficult task, it is the process of extracting useful knowledge from these relatively short sequences that has become one of the most exciting and challenging problems in bioinformatics. The assembly of short reads is an aggregative process where critical information is lost as reads are merged into contigs. In addition, the assembly process is treated as a black box, with generic assembler tools that do not adapt to input data set characteristics. Finally, as NGS data throughput continues to increase, there is an increasing need for smart parallel assembler implementations. In this dissertation, a new assembly approach called Focus is proposed. Unlike previous assemblers, Focus relies on a novel hybrid graph constructed from multiple graphs at different levels of granularity to represent the assembly problem, facilitating information capture and dynamic adjustment to input data set characteristics. This work is composed of four specific aims: 1) the implementation of a robust assembly and analysis tool built on the hybrid graph platform; 2) the development and application of graph mining to extract biologically relevant features in NGS data sets; 3) the integration of domain-specific knowledge to improve the assembly and analysis process; and 4) the construction of smart parallel computing approaches, including the application of energy-aware computing for NGS assembly and knowledge integration to improve algorithm performance. In conclusion, this dissertation presents a complete parallel assembler called Focus that is capable of extracting biologically relevant features directly from its hybrid assembly graph.
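
The overlap relationships this abstract describes, which let reads be ordered and merged into contigs, can be shown with a small Python sketch. It illustrates exact suffix-prefix overlap only and says nothing about how Focus builds its hybrid graph; the names `overlap_length` and `merge_reads` and the default minimum overlap of 3 are assumptions made for the example.

```python
def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of left that equals a prefix of right.

    Returns 0 when no exact overlap of at least min_length exists.
    """
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_length=3):
    """Merge two reads along their overlap, or return None if they do not overlap."""
    length = overlap_length(left, right, min_length)
    return left + right[length:] if length else None


contig = merge_reads("ACGTAC", "TACGG")
```

Once two reads are merged, the contig no longer records where one read ended and the other began, which is a simple instance of the information loss the abstract attributes to assembly.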

University of Nebraska Medical Center Research: DigitalCommons@UNMC