
    Statistical Algorithms and Bioinformatics Tools Development for Computational Analysis of High-throughput Transcriptomic Data

    Next-Generation Sequencing technologies allow for a substantial increase in the amount of data available for various biological studies. To analyze this data effectively and efficiently, computational approaches combining mathematics, statistics, computer science, and biology are implemented. Even with the substantial efforts devoted to developing these approaches, numerous issues and pitfalls remain. One of these issues is mapping uncertainty, in which read alignment results are biased due to the inherent difficulties associated with accurately aligning RNA-Sequencing reads. GeneQC is an alignment quality control tool that provides insight into the severity of mapping uncertainty in each annotated gene from alignment results. GeneQC uses feature extraction to identify three levels of information for each gene and implements elastic net regularization and mixture model fitting to provide insight into the severity of mapping uncertainty and the quality of read alignment. In combination with GeneQC, the Ambiguous Reads Mapping (ARM) algorithm re-aligns ambiguous reads by integrating motif prediction from metabolic pathways to establish coregulatory gene modules for re-alignment, using a negative binomial distribution-based probabilistic approach. These two tools work in tandem to address the issue of mapping uncertainty and provide more accurate read alignments, and thus more accurate expression estimates. Also presented in this dissertation are two approaches to interpreting the expression estimates. The first is IRIS-EDA, an integrated Shiny web server that combines numerous analyses to investigate gene expression data generated from RNA-Sequencing data. The second is ViDGER, an R/Bioconductor package that quickly generates high-quality visualizations of differential gene expression results to help users interpret those results comprehensively, which is a non-trivial task. These four tools cover a variety of aspects of modern RNA-Seq analyses and aim to address bottlenecks related to algorithmic and computational issues, as well as more efficient and effective implementation methods.

    GenoMetric Query Language: A novel approach to large-scale genomic data management

    Motivation: Improvements in sequencing technologies and data processing pipelines are rapidly providing sequencing data, with associated high-level features, for many individual genomes in multiple biological and clinical conditions. These data allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction levels beyond available tool capabilities. Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic ‘big data’ analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata, and interoperability between many data formats. Based on the Hadoop framework and the Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

    Next Generation Sequencing Methods for Coastal Zone Water Quality Monitoring

    When analyzing the water quality of the coastal zone, culture-based techniques have been used most often to identify Fecal Indicator Bacteria in samples. Since the advent of the Sanger method for DNA sequencing, other techniques have arisen that provide significantly more information on the microorganisms in a sample, but they are still not mainstream for water quality analysis. This capstone reviews and compares culture-based techniques, DNA sequencing, RNA sequencing, qPCR for biomarkers, and 16S rDNA sequencing to highlight their merits and shortcomings for analyzing environmental water samples. Of the techniques presented, the one that provides the broadest range of information (including the identification of bacteria, viruses, fungi, pathogens, virulence factors, and antibiotic resistance genes) is whole genome shotgun sequencing paired with k-mer based microbial identification. This technique allows researchers and managers not only to identify all microorganisms present in a given sample, but also to identify the sources of these microorganisms and their infection potential to humans. This has major implications for the future of water quality management and provides information that recreational water managers can use to determine risk to human health. As modern methods drop in price, they are becoming more accessible to user groups. This capstone is designed to help users determine the best method for their individual needs.

    Measuring primate gene expression evolution using high throughput transcriptomics and massively parallel reporter assays

    A key question in biology is how one genome sequence can lead to the great cellular diversity present in multicellular organisms. Enabled by the sequencing revolution, RNA sequencing (RNA-seq) has emerged as a central tool to measure transcriptome-wide gene expression levels. More recently, single cell RNA-seq was introduced and is becoming a feasible alternative to the more established bulk sequencing. While many different methods have been proposed, a thorough optimisation of established protocols can lead to improvements in robustness, sensitivity, scalability and cost effectiveness. Towards this goal, I have contributed to optimizing the single cell RNA-seq method "Single Cell RNA Barcoding and sequencing" (SCRB-seq) and publishing an improved version that uses optimized reaction conditions and molecular crowding (mcSCRB-seq). mcSCRB-seq achieves higher sensitivity at lower cost per cell and shows the highest RNA capture rate when compared with other published methods. We next sought a direct comparison to other scRNA-seq protocols within the Human Cell Atlas (HCA) benchmarking effort. Here we used mcSCRB-seq to profile a common reference sample that included heterogeneous cell populations from different sources. Transferring the acquired knowledge on single cell RNA sequencing methods to bulk RNA-seq led to the development of prime-seq, a sensitive, robust and cost-efficient bulk RNA-seq protocol that can be performed in any molecular biology laboratory. Using power simulations, we compared the data generated with the prime-seq protocol to those from the gold standard method TruSeq and found that the statistical power to detect differentially expressed genes is comparable, at 40-fold lower cost. While gene expression is an informative phenotype, the regulation that leads to the different phenotypes is still poorly understood. A state-of-the-art method to measure the activity of cis-regulatory elements (CREs) in a high-throughput fashion is the Massively Parallel Reporter Assay (MPRA), which can measure the activity of thousands of CREs in parallel. A good way to decode the genotype-to-phenotype conundrum is to use evolutionary information. Cross-species comparisons of closely related species can help understand how particular diverging phenotypes emerged and how conserved gene regulatory programs are encoded in the genome. Cell lines, particularly induced Pluripotent Stem Cells (iPSCs), are very useful tools for comparative studies. iPSCs can be reprogrammed from different primary somatic cells and are by definition pluripotent, meaning they can be differentiated into cells of all three germ layers. A main challenge for primate research is to obtain primary cells. To this end I contributed to establishing a protocol to generate iPSCs from a non-invasive source of primary cells, namely urine. Using prime-seq, we characterized the primary Urine Derived Stem Cells (UDSCs) and the reprogrammed iPSCs. Finally, I used an MPRA to measure the activity of putative regulatory elements of the gene TRNP1 across the mammalian phylogeny. We found co-evolution of one particular CRE with brain folding in Old World monkeys. To validate the finding, we looked for transcription factor binding sites within the identified CRE and intersected the list with transcription factors confirmed by prime-seq to be expressed in the cellular system. In addition, we found that changes in the protein coding sequence of TRNP1 and neural stem cell proliferation induced by TRNP1 orthologs correlate with brain size. In summary, within my doctorate I developed methods that enable measuring gene expression and gene regulation in a comparative genomics setting. I further applied these methods in a cross-mammalian study of the regulatory sequences of the gene TRNP1 and its association with brain phenotypes.

    Analysis of genomic data to derive biological conclusions on (1) transcriptional regulation in the human genome and (2) antibody resistance in hepatitis C virus

    High-throughput sequencing has become pervasive in all facets of genomic analysis. I developed computational methods to analyze high-throughput sequencing data and derive biological conclusions in two research areas: transcriptional regulation in mammals and the evolution of a virus under immune pressure. To investigate transcriptional regulation, I integrated data from multiple experiments performed by the ENCODE consortium. First, my analysis revealed that Transcription Factors (TFs) prefer to bind GC-rich, histone-depleted regions. By comparing in vivo and in vitro nucleosome dynamics, I observed that while histones have an innate preference for binding GC-rich DNA, TF binding overrides this preference and produces a negative correlation between GC content and histone enrichment. In the next project, I found that the binding events of multiple TFs co-occur at genomic regions enriched in activating histone marks that are typically associated with gene enhancers and promoters, suggesting that these regions may be enhancers or have TSS-distal transcription. Lastly, I trained supervised machine-learning models on histone enrichment signals and sequence features to predict transcriptional enhancers, to be validated in mouse-transgenic assays. In a post-clinical trial exploratory analysis of Hepatitis C Virus (HCV), I traced the evolutionary path of the envelope proteins E1 and E2 in HCV-infected liver transplant patients, in response to a novel antibody. I developed a systematic amino acid-level analysis pipeline that quantifies differences in amino acid frequencies at each position between two time points. Upon applying this method across all positions in the E1/E2 region and comparing pre-liver-transplant and post-viral-rebound time points, mutations in two positions emerged as being key to antibody evasion. Both these mutations, N415K/D and N417S, were in the epitope targeted by the antibody but, surprisingly, did not co-occur. In post-rebound viral genomes that contain the N417S mutation but retain the wild-type variant at 415, N-linked glycosylation of 415 is another possible escape mechanism. Using the same analysis pipeline, I also identified additional candidate escape mutations outside the epitope, which could be potential therapeutic targets.
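
    The per-position comparison of amino acid frequencies between two time points described in this abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say which statistic the pipeline uses, so total variation distance is a stand-in, the function names are hypothetical, and the short peptide strings are toy data rather than real E1/E2 sequences.

```python
from collections import Counter


def position_frequencies(seqs):
    """Return one {amino_acid: frequency} dict per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({aa: n / len(seqs) for aa, n in counts.items()})
    return freqs


def frequency_shift(before, after):
    """Per-position total variation distance between two time points.

    Assumption: the source does not name its statistic; this metric is
    chosen only for illustration (0 = identical, 1 = fully replaced).
    """
    shifts = []
    for fb, fa in zip(position_frequencies(before), position_frequencies(after)):
        residues = set(fb) | set(fa)
        shifts.append(0.5 * sum(abs(fb.get(r, 0.0) - fa.get(r, 0.0)) for r in residues))
    return shifts


# Toy aligned peptides (hypothetical): two columns change after rebound,
# and the two changes never appear in the same sequence.
pre = ["QLINTNGSW", "QLINTNGSW", "QLINTNGSW", "QLINTNGSW"]
post = ["QLIKTNGSW", "QLINTSGSW", "QLIKTNGSW", "QLINTSGSW"]
shifts = frequency_shift(pre, post)
changed = [i for i, s in enumerate(shifts) if s > 0]
```

    Ranking positions by such a shift score is one simple way to flag candidate escape sites for closer inspection; a real analysis would also need to account for sampling depth and alignment gaps.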

    The native cistrome and sequence motif families of the maize ear

    Elucidating the transcriptional regulatory networks that underlie growth and development requires robust ways to define the complete set of transcription factor (TF) binding sites. Although TF-binding sites are known to be generally located within accessible chromatin regions (ACRs), pinpointing these DNA regulatory elements globally remains challenging. Current approaches primarily identify binding sites for a single TF (e.g. ChIP-seq), or globally detect ACRs but lack the resolution to consistently define TF-binding sites (e.g. DNase-seq, ATAC-seq). To address this challenge, we developed MNase-defined cistrome-Occupancy Analysis (MOA-seq), a high-resolution (< 30 bp), high-throughput, and genome-wide strategy to globally identify putative TF-binding sites within ACRs. As a proof of concept, we used MOA-seq on developing maize ears and defined a cistrome of 145,000 MOA footprints (MFs). While a substantial majority (76%) of the known ATAC-seq ACRs intersected with the MFs, only a minority of MFs overlapped with the ATAC peaks, indicating that the majority of MFs were novel and not detected by ATAC-seq. MFs were associated with promoters and significantly enriched for TF-binding and long-range chromatin interaction sites, including for the well-characterized FASCIATED EAR4, KNOTTED1, and TEOSINTE BRANCHED1. Importantly, the MOA-seq strategy improved the spatial resolution of TF-binding prediction and allowed us to identify 215 motif families collectively distributed over more than 100,000 non-overlapping, putatively-occupied binding sites across the genome. Our study presents a simple, efficient, and high-resolution approach to identify putative TF footprints and binding motifs genome-wide, to ultimately define a native cistrome atlas.
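
    The asymmetry reported in this abstract (most ACRs intersect a footprint, yet most footprints fall outside ACRs) follows from overlap fractions not being symmetric between two interval sets. A minimal sketch, assuming half-open `(start, end)` intervals on a single chromosome; the function name and all coordinates are hypothetical toy values, not data from the study:

```python
def fraction_overlapping(query, reference):
    """Fraction of query intervals that overlap at least one reference interval.

    Intervals are half-open (start, end) pairs on one chromosome.
    """
    if not query:
        return 0.0
    hits = 0
    for q_start, q_end in query:
        if any(q_start < r_end and r_start < q_end for r_start, r_end in reference):
            hits += 1
    return hits / len(query)


# Toy data (hypothetical): a few broad accessible regions, many narrow footprints.
acrs = [(100, 600), (1000, 1500), (3000, 3400), (5000, 5300)]
footprints = [
    (120, 150), (400, 430), (1100, 1130), (2000, 2030), (2500, 2530),
    (3100, 3130), (4000, 4030), (4500, 4530), (6000, 6030), (7000, 7030),
]

acrs_with_footprint = fraction_overlapping(acrs, footprints)
footprints_in_acr = fraction_overlapping(footprints, acrs)
```

    In this toy set three of four ACRs contain a footprint while only four of ten footprints lie inside an ACR, which mirrors the direction of the reported result. Genome-scale data would normally go through an interval-tree or sorted-sweep implementation rather than this quadratic scan.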

    Next-generation sequencing (NGS) in the microbiological world : how to make the most of your money

    The Sanger sequencing method produces relatively long DNA sequences of unmatched quality and has long been considered the gold standard for sequencing DNA. Many improvements of the Sanger method, culminating in fluorescent dyes coupled with automated capillary electrophoresis, enabled the sequencing of the first genomes. Nevertheless, using this technology to sequence whole genomes was costly, laborious and time consuming even for genomes that are relatively small in size. A major technological advance was the introduction of next-generation sequencing (NGS), pioneered by 454 Life Sciences in the early part of the 21st century. NGS allowed scientists to sequence thousands to millions of DNA molecules in a single machine run. Since then, new NGS technologies have emerged and existing NGS platforms have been improved, enabling the production of genome sequences at an unprecedented rate as well as broadening the spectrum of NGS applications. The current affordability of generating genomic information, especially with microbial samples, has resulted in a false sense of simplicity that belies the fact that many researchers still consider these technologies a black box. In this review, our objective is to identify and discuss four steps that we consider crucial to the success of any NGS-related project. These steps are: (1) the definition of the research objectives beyond sequencing and appropriate experimental planning, (2) library preparation, (3) sequencing and (4) data analysis. The goal of this review is to give an overview of the process, from sample to analysis, and discuss how to optimize your resources to achieve the most from your NGS-based research. Regardless of the evolution and improvement of the sequencing technologies, these four steps will remain relevant.

    The Extent and Nature Of Chromosomal Rearrangements in B Lymphocytes

    Chromosomal rearrangements, including translocations, require formation and joining of DNA double-strand breaks (DSBs). These events disrupt the integrity of the genome and are frequently involved in producing leukemias, lymphomas and sarcomas. Mature B cell lymphomas are unique among tumors in that they frequently carry clonal recurrent translocations. This may be the result of Activation Induced Cytidine Deaminase (AID) expression, which introduces a heretofore uncharacterized array of rearrangements. Despite the importance of these events, current understanding of their genesis is limited. To examine the origins of chromosomal rearrangements, we developed Translocation Capture Sequencing (TC-Seq), a method to document chromosomal rearrangements to a fixed DSB genome-wide, in primary cells. We examined over 180,000 rearrangements obtained from 400 million activated B lymphocytes to two loci, IgH and c-myc. Our data and analysis reveal that proximity between DSBs, transcriptional activity and chromosome territories are key determinants of genome rearrangement. Specifically, rearrangements tend to occur in cis and to transcribed genes. We also find that AID induces rearrangement in specific hotspots. Hotspots are predominantly genic, transcribed, and are found as translocation partners in mature B cell lymphoma.