541 research outputs found

    Detecting and comparing non-coding RNAs in the high-throughput era.

    In recent years there has been a growing interest in the field of non-coding RNA. This surge is a direct consequence of the discovery of a huge number of new non-coding genes and of the finding that many of these transcripts are involved in key cellular functions. In this context, accurately detecting and comparing RNA sequences has become important. Aligning nucleotide sequences is a key requisite when searching for homologous genes. Accurate alignments reveal evolutionary relationships, conserved regions and, more generally, any biologically relevant pattern. Comparing RNA molecules is, however, a challenging task. The nucleotide alphabet is simpler and therefore less informative than that of amino acids. Moreover, for many non-coding RNAs, evolution is likely to be mostly constrained at the structural level and not at the sequence level. This results in very poor sequence conservation, which impedes comparison of these molecules. These difficulties define a context where new methods are urgently needed in order to exploit experimental results to their full potential. This review focuses on the comparative genomics of non-coding RNAs in the context of new sequencing technologies, dealing especially with two extremely important and timely research aspects: the development of new methods to align RNAs and the analysis of high-throughput data.

    COMPUTATIONAL APPROACHES IN THE ESTIMATION AND ANALYSIS OF TRANSCRIPTS DIFFERENTIAL EXPRESSION AND SPLICING: APPLICATION TO SPINAL MUSCULAR ATROPHY

    Spinal Muscular Atrophy (SMA) is among the most common genetic neurological diseases that cause infant mortality. SMA is caused by deletion or mutations in the survival motor neuron 1 gene (SMN1), which are expected to generate alterations in RNA transcription or splicing and, most importantly, reductions in mRNA transport within the axons of motor neurons (MNs). SMA ultimately results in the selective degeneration of MNs in the spinal cord, but the underlying reason is still not entirely clear. The aim of this study is to investigate splicing abnormalities in SMA, and to identify genes presenting differential splicing possibly involved in the pathogenesis of SMA at genome-wide level. We performed RNA-Sequencing data analysis on 2 SMA patients and 2 controls, with 2 biological replicates per sample, using MNs differentiated from their induced pluripotent stem cells. Three types of analyses were executed. Firstly, differential expression analysis was performed using Cufflinks to identify possibly mis-regulated genes. Secondly, alternative splicing analysis was conducted to find differentially used exons (DUEs; using DEXSeq), as splicing patterns are known to be altered in MNs by suboptimal levels of SMN protein. Thirdly, we performed RNA-binding protein (RBP) motif discovery for the set of identified alternative cassette-DUEs, to pinpoint possible mechanisms of such alterations specific to MNs. The gene ontology enrichment analysis of significant DEGs and alternative cassette-DUEs revealed various interesting terms, including axon guidance, muscle contraction, microtubule-based transport, axon-cargo transport and synapse, which suggests their involvement in SMA. Further, promising results were obtained from motif analysis, which identified 22 RBPs, of which 7, namely PABPC1, PABPC3, PABPC4, PABPC5, PABPN1, SART3 and KHDRBS1, are known for mRNA stabilization and mRNA transport across the MN axon.
Five RBPs from the PABP family are known to interact directly with SMN protein, enhancing mRNA transport in MNs. To validate our results, specific wet-lab experiments are required, involving precise recognition of RNA-binding sites corresponding to our findings. Our work has provided a promising set of putative targets which might have therapeutic potential for treating SMA. During the course of our study, we observed that current methods are insufficient for understanding differential splicing events within the transcriptomic landscape at high resolution. To address this problem, we developed a computational model which has the potential to precisely estimate the “transcript expression levels” within a given gene locus by disentangling mature and nascent transcription contributions for each transcript at per-base resolution. We modeled exonic and intronic read coverages by applying a non-linear computational model and estimated the expression for each transcript which best approximated the observed expression in total RNA-Seq data. The performance of our model was good in terms of computational processing time and memory usage. The application of our model is in the detection of differential splicing events. At exon level, differences in the ratio of the sum of mature to the sum of nascent transcripts over all the transcripts in a gene locus give an indication of differential splicing. We have implemented our model in the R statistical language.
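
The exon-level comparison described in this abstract can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R implementation: the function names, the input layout (one `(mature, nascent)` pair per transcript) and the small pseudocount are all assumptions, and the per-transcript estimates are taken as given rather than fitted by the non-linear model.

```python
import math

def mature_nascent_ratio(transcripts, pseudocount=1e-9):
    """Ratio of summed mature to summed nascent expression.

    `transcripts` is a hypothetical list of (mature, nascent) expression
    estimates, one pair per transcript overlapping the exon of interest.
    The pseudocount (an assumption) only guards against division by zero.
    """
    mature = sum(m for m, _ in transcripts)
    nascent = sum(n for _, n in transcripts)
    return (mature + pseudocount) / (nascent + pseudocount)

def log2_ratio_change(control, patient):
    """log2 fold change of the mature/nascent ratio, patient vs. control."""
    return math.log2(mature_nascent_ratio(patient) / mature_nascent_ratio(control))

# Made-up numbers: two transcripts per condition.
control = [(8.0, 2.0), (4.0, 2.0)]   # ratio 12 / 4 = 3.0
patient = [(3.0, 2.0), (3.0, 2.0)]   # ratio 6 / 4 = 1.5
change = log2_ratio_change(control, patient)
```

A value of `change` far from zero would flag the exon as a candidate for differential splicing under these assumptions; any threshold or statistical test is outside the scope of this sketch.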

    Developing a workflow for the multi-omics analysis of Daphnia

    In the era of multi-omics, making reasonable statistical inferences through data integration is challenged by data heterogeneity, dimensionality constraints, and data harmonization. The biological system is presumed to function as a network where the physical relationships between genes (nodes) are represented by links (edges) connecting genes that interact. This thesis aims to develop a new and efficient workflow, built from readily available software tools, with which researchers engaged with biological questions can analyse multi-omics data from non-model organisms. The proposed approach was applied to the transcriptome and metabolome data of Daphnia magna under various dose rates of gamma radiation. The first part of this workflow compares and contrasts the transcriptional regulation of short- and long-term gamma radiation exposure. Groups of genes which share similar expression across different samples under the same conditions are known as modules and are likely to be functionally related. Modules were identified using WGCNA, but biologically meaningful modules (significant modules) were selected through a novel approach that associates genes with significantly altered expression levels as a result of radiation (i.e. differentially expressed genes) with these candidate modules. Dynamic transcriptional regulation was modelled using transcription factor (TF) DNA binding patterns to associate TFs with expression responses captured by the modules. The biological functions of significant modules and their TF regulators were verified with functional annotations and mapped onto the proposed Adverse Outcome Pathways (AOP) of D. magna, which describe the key events that contribute to fecundity reduction. The findings demonstrate that short-term radiation impacts are entirely different from long-term impacts and cannot be used for long-term prediction.
The second part investigates the coordination of gene expression and metabolites with differential abundances induced by different gamma dose rates and the underlying mechanisms contributing to the varying extent of the reduction in fecundity. Significant modules which belong to the same design model of dose rates were combined and annotated with new functionality. The abundance of metabolites was also modelled with the same design model. Integrated pathway enrichment analysis was performed to discover and create pathway diagrams for visualising the multi-omics output. Finally, the performance of this workflow on explaining the reduction of fecundity of D. magna, which has not been described in previous studies, has been evaluated. Combining the information from the metabolome and transcriptome data, new insights suggest that the alteration to the cell cycle is the underlying mechanism contributing to the varying reduction of fecundity under the effect of different dose rates of radiation.

    Genetic characterization of Rhodococcus rhodochrous ATCC BAA-870 with emphasis on nitrile hydrolysing enzymes

    Rhodococcus rhodochrous ATCC BAA-870 (BAA-870) had previously been isolated on selective media for enrichment of nitrile hydrolysing bacteria. The organism was found to have a wide substrate range, with activity against aliphatics, aromatics, and aryl aliphatics, and enantioselectivity towards beta substituted nitriles and beta amino nitriles, compounds that have potential applications in the pharmaceutical industry. This makes R. rhodochrous ATCC BAA-870 potentially a versatile biocatalyst for the synthesis of a broad range of compounds with amide and carboxylic acid groups that can be derived from structurally related nitrile precursors. The selectivity of biocatalysts allows for high product yields and better atom economy than nonselective chemical methods of performing this reaction, such as acid or base hydrolysis. In order to apply BAA-870 as a nitrile biocatalyst and to mine the organism for biotechnological uses, the genome was sequenced using Solexa technology and an Illumina Genome Analyzer. The Solexa sequencing output data was analysed using the Solexa Data Analysis Pipeline and a total of 5,643,967 reads, 36 bp in length, were obtained, providing 4,273,289 unique sequences. The genome sequence data was assembled using the software Edena, Velvet, and Staden. The best assembly data set was then annotated automatically using dCAS and BASys. Further mate-paired sequencing, contracted to the company BaseClear® BV in Leiden, the Netherlands, was performed in order to improve the completeness of the data. The scaffolded Illumina and mate-paired sequences were further assembled and annotated using BASys. BAA-870 has a GC content of 65% and contains 6997 predicted protein-coding sequences (CDS). Of these, 54% encode previously identified proteins of unknown function.
The completed 5.83 Mb genome (with a sequencing coverage of 135 X) was submitted to the NCBI Genome data bank with accession number PRJNA78009. The genome sequence of R. rhodochrous ATCC BAA-870 is the seventh rhodococcal genome to be submitted to the NCBI and the first R. rhodochrous subtype to be sequenced. An analysis of the genome for nitril

    NBPMF: Novel Network-Based Inference Methods for Peptide Mass Fingerprinting

    Proteins are large, complex molecules that perform a vast array of functions in every living cell. A proteome is the set of proteins produced in an organism, and proteomics is the large-scale study of proteomes. Several high-throughput technologies have been developed in proteomics, of which the most commonly applied are mass spectrometry (MS) based approaches. MS is an analytical technique for determining the composition of a sample. Recently it has become a primary tool for protein identification, quantification, and post-translational modification (PTM) characterization in proteomics research. There are usually two different ways to identify proteins: top-down and bottom-up. Top-down approaches are based on subjecting intact protein ions and large fragment ions to tandem MS directly, while bottom-up methods are based on mass spectrometric analysis of peptides derived from proteolytic digestion, usually with trypsin. In bottom-up techniques, peptide mass fingerprinting (PMF) is widely used to identify proteins from MS datasets. Conventional PMF methods, such as the probabilistic MOWSE algorithm, are based on the mass distribution of tryptic peptides. In this thesis, we developed a novel network-based inference software termed NBPMF. By analyzing the peptide-protein bipartite network, we designed new peptide-protein matching score functions. We present two methods: the static one, ProbS, is based on an independent probability framework; and the dynamic one, HeatS, models the input dataset as dependent peptides. Moreover, we use linear regression to adjust the matching score according to the masses of proteins. In addition, we consider the order of retention time to further correct the score function. In the post-processing, we design two algorithms: assignment of peaks, and protein filtration. The former requires that a peak be assigned to only one peptide in order to reduce random matches; and the latter assumes each peak can only be assigned to one protein.
In the result validation, we propose two new target-decoy search strategies to estimate the false discovery rate (FDR). The experiments on simulated, authentic, and simulated authentic datasets demonstrate that our NBPMF approaches lead to significantly improved performance compared to several state-of-the-art methods.
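
For readers unfamiliar with target-decoy validation, the following minimal Python sketch shows the conventional FDR estimate (decoy hits divided by target hits above a score threshold). It is not one of the two new strategies proposed in the thesis, which the abstract does not describe; the function name and the example scores are assumptions made for illustration.

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold):
    """Conventional target-decoy FDR estimate at a given score threshold.

    Assumes higher scores are better and that the target and decoy
    databases are of comparable size, so the number of decoy hits
    approximates the number of false target hits.
    """
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0  # nothing accepted, so there are no false discoveries
    return min(1.0, decoys / targets)

# Made-up matching scores for illustration.
target_scores = [10.0, 9.0, 8.0, 7.0, 3.0]
decoy_scores = [8.5, 2.0, 1.0]
fdr = target_decoy_fdr(target_scores, decoy_scores, threshold=7.0)
```

With these made-up scores, four target matches and one decoy match pass the threshold, so the estimate is 1/4.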

    Characterization of adiposity and inflammation genetic pleiotropy underlying cardiovascular risk factors in Hispanics.

    The observed overlap between genetic variants associated with both adiposity and inflammatory markers suggests that changes in both adiposity and inflammation could be partially mediated by common pathways. The pervasive but sparsely characterized “pleiotropic” genetic variants associated with both adiposity and inflammation have been hypothesized to provide insight into the shared biology. This study explored and characterized the genetic pleiotropy underpinning adiposity and inflammation using genetic and phenotypic observations from the Cameron County Hispanic Cohort (CCHC). A total of 3,313 samples and >9 million single nucleotide polymorphisms (SNPs) were examined in this study. Mixed model genome-wide association studies (GWAS) were performed for 9 phenotypes including C-reactive protein (CRP), Interleukin (IL)-6, IL-8, fibrinogen, body mass index (BMI), waist circumference (WC) in males and females, and waist to hip ratio (WHR) in males and females (separately). GWAS for WHR and WC were meta-analyzed to obtain sex-combined results. Pleiotropy assessment was completed using the adaptive Sum of Powered Score (aSPU) test. Three genetic loci with evidence of pleiotropy on chromosomes 3, 12 and 18 were fine-mapped to distinguish the set of likely causal variants. Causal mediation analysis was used to assess whether likely causal variants were independently associated with both inflammation and adiposity. At least 3 signals, on chromosomes 3, 12, and 12, were identified that suggested the presence of SNPs with strong pleiotropic p-values (< 5 × 10⁻⁶). The fine-mapping of these three suspected pleiotropic regions distinguished 22 variants with posterior causality probabilities greater than 50%. The mediation analysis indicated that rs60505812, on chromosome 3, was independently associated with both an inflammatory marker (IL-6) and an adiposity measure (BMI).
For the variant rs73093474, on chromosome 12, results indicated both a direct association with CRP and an indirect association (via WHR). The identification of likely pleiotropic variants indicated that 1) a considerable degree of overlapping genetic pleiotropy exists between adiposity and inflammation, and 2) evidence exists to support both direct and indirect pleiotropy. The results showed the potential of these genetic variants to provide biological insight intended to improve the cardiovascular health of Hispanics and, by extension, all populations.

    Optimizing transcriptomics to study the evolutionary effect of FOXP2

    The field of genomics was established with the sequencing of the human genome, a pivotal achievement that has allowed us to address various questions in biology from a unique perspective. One question in particular, that of the evolution of human speech, has gripped philosophers, evolutionary biologists, and now genomicists. However, little is known of the genetic basis that allowed humans to evolve the ability to speak. Of the few genes implicated in human speech, one of the most studied is FOXP2, which encodes the transcription factor Forkhead box protein P2 (FOXP2). FOXP2 is essential for proper speech development, and two mutations in the human lineage are believed to have contributed to the evolution of human speech. To address the effect of FOXP2 and investigate its evolutionary contribution to human speech, one can utilize the power of genomics, more specifically gene expression analysis via ribonucleic acid sequencing (RNA-seq). To this end, I first contributed to developing mcSCRB-seq, a highly sensitive, powerful, and efficient single cell RNA-seq (scRNA-seq) protocol. Having previously emerged as a central method for studying cellular heterogeneity and identifying cellular processes, scRNA-seq was a powerful genomic tool but lacked the sensitivity and cost-efficiency of more established protocols. By systematically evaluating each step of the process, I helped find that the addition of polyethylene glycol increased sensitivity by enhancing the cDNA synthesis reaction. This, along with other optimizations, resulted in a sensitive and flexible protocol that is cost-efficient and ideal in many research settings. A primary motivation driving the extensive optimizations surrounding single cell transcriptomics has been the generation of cellular atlases, which aim to identify and characterize all of the cells in an organism.
As such efforts are carried out in a variety of research groups using a number of different RNA-seq protocols, I contributed to an effort to benchmark and standardize scRNA-seq methods. This not only identified methods which may be ideal for the purpose of cell atlas creation, but also highlighted optimizations that could be integrated into existing protocols. Using mcSCRB-seq as a foundation as well as the findings from the scRNA-seq benchmarking, I helped develop prime-seq, a sensitive, robust, and most importantly, affordable bulk RNA-seq protocol. Bulk RNA-seq was frequently overlooked during the efforts to optimize and establish single-cell techniques, even though the method is still extensively used in analyzing gene expression. Introducing early barcoding and reducing library generation costs kept prime-seq cost-efficient, while basing it on single-cell methods ensured that it would be a sensitive and powerful technique. I helped verify this by benchmarking it against TruSeq-generated data and then helped test its robustness by generating prime-seq libraries from over seventeen species. These optimizations resulted in a final protocol that is well suited for investigating gene expression in comprehensive and high-throughput studies. Finally, I utilized prime-seq to develop a comprehensive gene expression atlas to study the function of FOXP2 and its role in speech evolution. I used previously generated mouse models: a knockout model containing one non-functional Foxp2 allele and a humanized model, which has a variant Foxp2 allele with two human-specific mutations. To study the effect globally across the mouse, I helped harvest eighteen tissues which were previously identified to express FOXP2. By then comparing the mouse models to wild-type mice, I helped highlight the importance of FOXP2 within lung development and the importance of the human variant allele in the brain.
Both mcSCRB-seq and prime-seq have already been used and published in numerous studies to address a variety of biological and biomedical questions. Additionally, my work on FOXP2 not only provides a thorough expression atlas, but also provides a detailed and cost-efficient plan for undertaking a similar study on other genes of interest. Lastly, the studies on FOXP2 done within this work lay the foundation for future studies investigating the role of FOXP2 in modulating learning behavior, and thereby affecting human speech.