21 research outputs found

    Anti-bias training for (sc)RNA-seq : experimental and computational approaches to improve precision

    RNA-seq, including single-cell RNA-seq (scRNA-seq), is plagued by insufficient sensitivity and a lack of precision. As a result, the full potential of (sc)RNA-seq remains unrealized. A major factor in this respect is the presence of global bias in most datasets, which affects detection and quantitation of RNA in a length-dependent fashion. In particular, scRNA-seq is affected by technical noise and a high rate of dropouts, where the vast majority of original transcripts are not converted into sequencing reads. We discuss the origins and implications of these biases, bioinformatics approaches to correct for them, and how biases can be exploited to infer characteristics of the sample preparation process, which in turn can be used to improve library preparation.

    Genomic and transcriptomic landscape of the Indian Water Buffalo (Bubalus bubalis)

    The water buffalo (Bubalus bubalis) is one of the most important domesticated species in India, providing milk, meat, hide and draft power. At over 100 million animals, India has the highest number of water buffalo in the world; however, the species is found across the globe, including Europe, where the Mediterranean subspecies is farmed. Despite the importance of this domesticated bovid, there are limited high-resolution genomic and transcriptomic analyses across these animals. The aim of this thesis was to use whole genome and RNA sequencing data to characterise regulatory variation and genome evolution in the water buffalo. Specifically, I explored the presence of regulatory variation in macrophages of water buffalo in the form of allele-specific expression (ASE) and investigated signatures of selection and breed divergence across water buffalo breeds. Water buffalo are exposed to a range of important pathogens, many of which are zoonotic in nature. Differences in regulatory variation between animals have been shown to underlie some of the diversity in response to these pathogens. Macrophages are among the first cells of the innate immune system to act against a pathogen through its recognition, phagocytosis and destruction, playing an important role in host disease susceptibility. Regulatory variants acting in macrophages are thus important candidates for explaining differences in susceptibility to infectious diseases among water buffalo. To detect the presence of regulatory variation, I used whole genome sequencing and RNA-seq data from 4 Mediterranean water buffalo to identify ASE in macrophage-expressed genes. The analysis revealed that regulatory variation does exist in macrophage-expressed genes and could be reliably detected as ASE signatures. To understand the impact of domestication and how water buffalo have evolved, I used whole genome sequencing data from 81 animals spanning seven distinct breeds.
I identified the population structure of these breeds and explored how gene flow has shaped their genomes. I also characterised the signatures of putative selection between breeds. Sites identified included genes linked to milk production, coat colour and body size; interestingly, a number of these overlapped those found to be under selection in other domesticated species, suggesting some extent of convergent domestication. In this thesis, I consequently undertook one of the first high-resolution evolutionary and regulatory variation analyses of an important domesticated species, Bubalus bubalis. The results from this study are likely to be invaluable to inform future studies of how regulatory variants may confer tolerance to water buffalo pathogens, as well as the impact of domestication on its genome.
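To make the idea of an ASE signature concrete, here is a minimal sketch of one common way to test a single heterozygous site for allelic imbalance: an exact two-sided binomial test on reference and alternate read counts. The abstract does not state which test the thesis used, so the test itself, the function name `binomial_two_sided_p`, and the default expected reference fraction of 0.5 are all assumptions made for illustration.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int, p_ref: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic imbalance at one site.

    Hypothetical helper for illustration only; p_ref = 0.5 assumes no
    reference-mapping bias, which real pipelines often need to correct for.
    Intended for modest read depths (roughly up to 1000 reads per site).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance
    # Probability of every possible reference-read count under the null
    pmf = [comb(n, k) * p_ref**k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the outcomes that are no more likely than the observed one
    p_value = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)


# Balanced counts give no evidence of ASE; fully skewed counts give strong evidence
balanced = binomial_two_sided_p(5, 5)
skewed = binomial_two_sided_p(10, 0)
```

In practice such per-site p-values would still need multiple-testing correction and aggregation across sites within a gene, neither of which is shown here.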

    New Algorithms for Fast and Economic Assembly: Advances in Transcriptome and Genome Assembly

    Great efforts have been devoted to deciphering the sequence composition of the genomes and transcriptomes of diverse organisms. Continuing advances in high-throughput sequencing technologies have led to a decline in associated costs, facilitating a rapid increase in the amount of available genetic data. In particular, genome studies have undergone a fundamental paradigm shift where genome projects are no longer limited by sequencing costs, but rather by computational problems associated with assembly. There is an urgent demand for more efficient and more accurate methods. Most recently, “hybrid” methods that integrate short- and long-read data have been devised to address this need. LazyB is a new, low-cost hybrid genome assembler. It starts from a bipartite overlap graph between long reads and restrictively filtered short-read unitigs. This graph is translated into a long-read overlap graph. By design, unitigs are both unique and almost free of assembly errors. As a consequence, only a few spurious overlaps are introduced into the graph. Instead of the more conventional approach of removing tips, bubbles, and other local features, LazyB extracts subgraphs whose global properties approach a disjoint union of paths in multiple steps, utilizing properties of proper interval graphs. A prototype implementation of LazyB, entirely written in Python, not only yields significantly more accurate assemblies of the yeast, fruit fly, and human genomes compared to state-of-the-art pipelines, but also requires much less computational effort. An optimized C++ implementation dubbed MuCHSALSA further significantly reduces resource demands. Advances in RNA-seq have facilitated tremendous insights into the role of both coding and non-coding transcripts. Yet, the complete and accurate annotation of the transcriptomes of even model organisms has remained elusive.
RNA-seq produces reads significantly shorter than the average distance between related splice events and presents high noise levels and other biases. Computational reconstruction remains a critical bottleneck. Ryūtō implements an extension of common splice graphs, facilitating the integration of reads spanning multiple splice sites and paired-end reads bridging distant transcript parts. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem. Using phasing information from multi-splice and paired-end reads, nodes with uncertain connections are decomposed step-wise via Linear Programming. Ryūtō's performance compares favorably with state-of-the-art methods on both simulated and real-life datasets. Despite ongoing research and our own contributions, progress on traditional single-sample assembly has brought no major breakthrough. Multi-sample RNA-Seq experiments provide more information which, however, is challenging to utilize due to the large number of accumulating errors. An extension to Ryūtō enables the reconstruction of consensus transcriptomes from multiple RNA-seq data sets, incorporating consensus calling at low-level features. Benchmarks show stable improvements with as few as 3 replicates. Ryūtō outperforms competing approaches, providing a better and user-adjustable sensitivity-precision trade-off. Ryūtō consistently improves assembly on replicates, which is also demonstrable when mixing conditions or time series and for differential expression analysis. Ryūtō's approach towards guided assembly is equally unique.
It allows users to adjust results based on the quality of the guide, even for multi-sample assembly.
1 Preface 1.1 Assembly: A vast and fast evolving field 1.2 Structure of this Work 1.3 Available 2 Introduction 2.1 Mathematical Background 2.2 High-Throughput Sequencing 2.3 Assembly 2.4 Transcriptome Expression 3 From LazyB to MuCHSALSA - Fast and Cheap Genome Assembly 3.1 Background 3.2 Strategy 3.3 Data preprocessing 3.4 Processing of the overlap graph 3.5 Post Processing of the Path Decomposition 3.6 Benchmarking 3.7 MuCHSALSA – Moving towards the future 4 Ryūtō - Versatile, Fast, and Effective Transcript Assembly 4.1 Background 4.2 Strategy 4.3 The Ryūtō core algorithm 4.4 Improved Multi-sample transcript assembly with Ryūtō 5 Conclusion & Future Work 5.1 Discussion and Outlook 5.2 Summary and Conclusio

    Correction to: RNA Bioinformatics.


    Analyzing Acute Myeloid Leukemia by RNA-sequencing

    Bulk and single-cell RNA sequencing have revolutionized biomedical research and empower researchers to quantify the global gene expression of populations and single cells to further understand the development, manifestation and treatment of diseases like cancer. Acute myeloid leukemia (AML), a cancer of the myeloid line of blood cells, could benefit from these technologies, as relapse and mortality rates remain high despite the extensive research conducted over several decades. This is partly because AML is a heterogeneous disease, differing substantially between patients and hence requiring more fine-grained classifications and specialised treatment strategies, for example by incorporating expression profiles. In addition, single-cell RNA sequencing (scRNA-seq) can resolve genetic and epigenetic subclonal structures within a patient to improve understanding and treatment of AML. However, improving and adapting RNA-seq technologies is still often necessary to efficiently and reliably obtain expression profiles, especially from small or suboptimally processed samples. To this end, we developed a bulk RNA-seq protocol that copes with the major challenges of limited sample quantities, different sample types, throughput and costs, and subsequently applied this method to further understand the subclonal structures in AML. We were able to characterize a plastic cell state of AML cells that is defined by increased stemness and dormancy and could influence treatment outcome and relapse. For this, we isolated non-dividing AML cells based on a proliferation-sensitive dye from patient-derived xenograft (PDX) models of two AML patients. We found that these cells have low levels of cell cycle genes, confirming dormancy, and additionally had expression patterns similar to those of previously described dormant minimal residual disease (MRD) cells in acute lymphoblastic leukemia (ALL).
This included high expression levels of cell adhesion molecules, potentially reflecting the persistence of dormant AML and ALL cells in the hematopoietic niche. Lastly, we could show that resting and cycling AML cells can transition between these two states, indicating that dormancy might be a general property of AML cells and not depend on particular genetic subclones. In a second project, we optimized a single-cell RNA-seq technology. We used a systematic approach to evaluate experimental conditions of SCRB-seq, a powerful and efficient scRNA-seq method. Focussing on reverse transcription, arguably the most important and inefficient reaction, we used a standardized human RNA (UHRR) and systematically tested nine different RT enzymes, several reaction enhancers and primer compositions to increase sensitivity. We found that Maxima H- showed the highest sensitivity and that molecular crowding using polyethylene glycol (PEG) could increase the efficiency of the reaction significantly. Together with several smaller changes in the workflow, primer design and PCR conditions, we developed mcSCRB-seq (molecular crowding SCRB-seq). We verified the 2.5x increase in sensitivity using mES cells in a side-by-side test between SCRB-seq and mcSCRB-seq, and further found mcSCRB-seq to be amongst the most sensitive methods using artificial RNA spike-in molecules (ERCCs). Lastly, since method comparisons between studies suffer from limited accuracy due to batch effects and external factors, we participated in a complex scRNA-seq benchmark study aiming to provide a fair comparison between methods concerning sensitivity, accuracy and applicability for building expression atlases. In contrast to our earlier results, we found that in this particular setting mcSCRB-seq did not perform well, and we identified fields for further improvement.
In conclusion, the work described in this thesis contributes not only towards a deeper understanding of the emergence and progression of AML but also towards the development of experimental bulk and single-cell RNA sequencing methods, improving their widespread application to biomedical problems such as leukemia.

    Measuring primate gene expression evolution using high throughput transcriptomics and massively parallel reporter assays

    A key question in biology is how one genome sequence can lead to the great cellular diversity present in multicellular organisms. Enabled by the sequencing revolution, RNA sequencing (RNA-seq) has emerged as a central tool to measure transcriptome-wide gene expression levels. More recently, single-cell RNA-seq was introduced and is becoming a feasible alternative to the more established bulk sequencing. While many different methods have been proposed, a thorough optimisation of established protocols can lead to improvements in robustness, sensitivity, scalability and cost effectiveness. Towards this goal, I have contributed to optimizing the single-cell RNA-seq method "Single Cell RNA Barcoding and sequencing" (SCRB-seq) and publishing an improved version that uses optimized reaction conditions and molecular crowding (mcSCRB-seq). mcSCRB-seq achieves higher sensitivity at lower cost per cell and shows the highest RNA capture rate when compared with other published methods. We next sought a direct comparison to other scRNA-seq protocols within the Human Cell Atlas (HCA) benchmarking effort. Here we used mcSCRB-seq to profile a common reference sample that included heterogeneous cell populations from different sources. Transfer of the acquired knowledge on single-cell RNA sequencing methods to bulk RNA-seq led to the development of the prime-seq protocol, a sensitive, robust and cost-efficient bulk RNA-seq protocol that can be performed in any molecular biology laboratory. Using power simulations, we compared the data generated with the prime-seq protocol to the gold standard method TruSeq and found that the statistical power to detect differentially expressed genes is comparable, at 40-fold lower cost. While gene expression is an informative phenotype, the regulation that leads to the different phenotypes is still poorly understood.
Massively Parallel Reporter Assays (MPRAs) are a state-of-the-art method for measuring the activity of thousands of cis-regulatory elements (CREs) in parallel. A good way to decode the genotype-to-phenotype conundrum is to use evolutionary information. Cross-species comparisons of closely related species can help understand how particular diverging phenotypes emerged and how conserved gene regulatory programs are encoded in the genome. Cell lines, particularly induced Pluripotent Stem Cells (iPSCs), are a very useful tool for comparative studies. iPSCs can be reprogrammed from different primary somatic cells and are by definition pluripotent, meaning they can be differentiated into cells of all three germ layers. A main challenge for primate research is to obtain primary cells. To this end, I contributed to establishing a protocol to generate iPSCs from a non-invasive source of primary cells, namely urine. Using prime-seq, we characterized the primary Urine Derived Stem Cells (UDSCs) and the reprogrammed iPSCs. Finally, I used an MPRA to measure the activity of putative regulatory elements of the gene TRNP1 across the mammalian phylogeny. We found co-evolution of one particular CRE with brain folding in Old World monkeys. To validate the finding, we looked for transcription factor binding sites within the identified CRE and intersected the list with transcription factors confirmed by prime-seq to be expressed in the cellular system. In addition, we found that changes in the protein coding sequence of TRNP1 and neural stem cell proliferation induced by TRNP1 orthologs correlate with brain size. In summary, within my doctorate I developed methods that enable measuring gene expression and gene regulation in a comparative genomics setting.
I further applied these methods in a cross-mammalian study of the regulatory sequences of the gene TRNP1 and its association with brain phenotypes.

    Bioinformatics for personal genomics: development and application of bioinformatic procedures for the analysis of genomic data

    In the last decade, the sharp decrease in sequencing costs brought about by high-throughput technologies has completely changed the way genetic problems are approached. In particular, whole exome and whole genome sequencing are contributing to extraordinary progress in the study of human variants, opening up new perspectives in personalized medicine. Because this is a relatively new and fast-developing field, appropriate tools and specialized knowledge are required for efficient data production and analysis. In line with the times, in 2014 the University of Padua funded the BioInfoGen Strategic Project with the goal of developing technology and expertise in bioinformatics and molecular biology applied to personal genomics. The aim of my PhD was to contribute to this challenge by implementing a series of innovative tools and by applying them to investigate and possibly solve the case studies included in the project. I first developed an automated pipeline for Illumina data that sequentially performs each step necessary to go from raw reads to somatic or germline variant detection. The system's performance was tested by means of internal controls and by its application to a cohort of patients affected by gastric cancer, obtaining interesting results. Once variants are called, they have to be annotated in order to define their properties, such as the position at transcript and protein level, the impact on protein sequence, the pathogenicity and more. As most of the publicly available annotators were affected by systematic errors causing low consistency in the final annotation, I implemented VarPred, a new tool for variant annotation, which guarantees the best accuracy (>99%) compared to the state-of-the-art programs while also showing good processing times. To make VarPred easy to use, I equipped it with an intuitive web interface that allows not only a graphical evaluation of results but also a simple filtration strategy.
Furthermore, for a valuable user-driven prioritization of human genetic variations, I developed QueryOR, a web platform suitable for searching among known candidate genes as well as for finding novel gene-disease associations. QueryOR combines several innovative features that make it comprehensive, flexible and easy to use. The prioritization is achieved by a global positive selection process that promotes the emergence of the most reliable variants, rather than filtering out those not satisfying the applied criteria. QueryOR has been used to analyze the two case studies framed within the BioInfoGen project. In particular, it allowed the detection of causative variants in patients affected by lysosomal storage diseases, also highlighting the efficacy of the designed sequencing panel. In addition, QueryOR simplified the recognition of the LRP2 gene as a possible candidate to explain subjects with a Dent disease-like phenotype but with no mutation in the previously identified disease-associated genes, CLCN5 and OCRL. As a final corollary, an extensive analysis of recurrent exome variants was performed, showing that their origin can be mainly explained by inaccuracies in the reference genome, including misassembled regions and uncorrected bases, rather than by platform-specific errors.

    Joint estimation of isoform expression and isoform-specific read distribution using multisample RNA-Seq data.

    MOTIVATION: RNA-sequencing technologies provide a powerful tool for expression analysis at gene and isoform level, but accurate estimation of isoform abundance is still a challenge. The standard assumption of uniform read intensity yields biased estimates when the read intensity is in fact non-uniform. The problem is that, without strong assumptions, the read intensity pattern is not identifiable from data observed in a single sample. RESULTS: We develop a joint statistical model that accounts for non-uniform isoform-specific read distribution and gene isoform expression estimation. The main challenge is in dealing with the large number of isoform-specific read distributions, which potentially are as many as the number of splice variants in the genome. A statistical regularization via a smoothing penalty is imposed to control the estimation. Also, for identifiability reasons, the method uses information across samples from the same region. We develop a fast and robust computational procedure based on the iterated-weighted least-squares algorithm, and apply it to simulated data and two real RNA-Seq datasets with reverse transcription-polymerase chain reaction validation. Empirical tests show that our model performs better than existing methods in terms of increasing precision in isoform-level estimation. AVAILABILITY AND IMPLEMENTATION: We have implemented our method in an R package called Sequgio as a pipeline for fast processing of RNA-Seq data.
    Suo, C; Calza, Stefano; Salim, A; Pawitan, Y
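For readers unfamiliar with the iterated-weighted least-squares idea mentioned in the abstract above, here is a minimal generic sketch. It is not the Sequgio model: it fits a straight line with Huber weights, which is an assumption chosen only because it is a small, standard instance of the technique. The function names, the `delta` threshold, and the toy data are all hypothetical.

```python
def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def irls_huber(xs, ys, delta=1.0, max_iter=200, tol=1e-10):
    """Iteratively reweighted least squares with Huber weights (illustrative)."""
    weights = [1.0] * len(xs)  # the first pass is ordinary least squares
    intercept, slope = weighted_line_fit(xs, ys, weights)
    for _ in range(max_iter):
        residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
        # Points with large residuals are down-weighted; small ones keep weight 1
        weights = [1.0 if abs(r) <= delta else delta / abs(r) for r in residuals]
        new_intercept, new_slope = weighted_line_fit(xs, ys, weights)
        change = abs(new_intercept - intercept) + abs(new_slope - slope)
        intercept, slope = new_intercept, new_slope
        if change < tol:
            break
    return intercept, slope


# Toy data: y = 1 + 2x with one gross outlier at the last point
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[-1] = 100.0
ols_intercept, ols_slope = weighted_line_fit(xs, ys, [1.0] * len(xs))
robust_intercept, robust_slope = irls_huber(xs, ys)
```

Each pass solves an ordinary weighted least-squares problem and then recomputes the weights from the new residuals, which is the core loop shared by most iteratively reweighted schemes.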