52 research outputs found

    CRAC: an integrated approach to the analysis of RNA-seq reads

    No full text
    A large number of RNA-sequencing studies set out to predict mutations, splice junctions or fusion RNAs. We propose a method, CRAC, that integrates genomic locations and local coverage to enable such predictions to be made directly from RNA-seq read analysis. A k-mer profiling approach detects candidate mutations, indels and splice or chimeric junctions in each single read. CRAC increases precision compared with existing tools, reaching 99.5% for splice junctions, without losing sensitivity. Importantly, CRAC predictions improve with read length. In cancer libraries, CRAC recovered 74% of validated fusion RNAs and predicted novel recurrent chimeric junctions. CRAC is available at http://crac.gforge.inria.fr

    Querying large read collections in main memory: a versatile data structure

    Background: High-throughput sequencing (HTS) is now heavily exploited for genome (re-)sequencing, metagenomics, epigenomics, and transcriptomics, and requires different but computationally intensive bioinformatic analyses. When a reference genome is available, mapping reads to it is the first step of the analysis. Read mapping programs owe their efficiency to sophisticated genome indexing data structures, such as the Burrows-Wheeler transform. Recent solutions index both the genome and the k-mers of the reads, using hash tables, to further increase efficiency and accuracy. In various contexts (e.g. assembly or transcriptome analysis), read processing requires determining the sub-collection of reads related to a given sequence, which is done by searching for some k-mers in the reads. Many developments have focused on genome indexing structures for read mapping, but the question of read indexing remains largely unexplored. However, the increase in sequencing throughput calls for new algorithmic solutions to query large read collections efficiently.
    Results: Here, we present a solution, named Gk arrays, to index large collections of reads, an algorithm to build the structure, and procedures to query it. Once constructed, the index structure is kept in main memory and is repeatedly accessed to answer queries like "given a k-mer, get the reads containing this k-mer (once/at least once)". We compared our structure to other solutions that adapt uncompressed indexing structures designed for long texts and show that it processes queries fast while requiring much less memory. Our structure can thus handle larger read collections. We provide examples where such queries are adapted to different types of read analysis (SNP detection, assembly, RNA-Seq).
    Conclusions: Gk arrays constitute a versatile data structure that enables fast and more accurate read analysis in various contexts. They provide a flexible building block for designing innovative programs that efficiently mine genomics, epigenomics, metagenomics, or transcriptomics reads. The Gk arrays library is available under the CeCILL (GPL-compliant) license from http://www.atgc-montpellier.fr/ngs/.
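The query type quoted in this abstract ("given a k-mer, get the reads containing this k-mer (once/at least once)") can be illustrated with a minimal Python sketch. This is not the Gk arrays structure itself: it uses a plain in-memory dictionary and makes no attempt at the memory efficiency the abstract reports. The function names `build_kmer_index` and `reads_containing` and the toy reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map every k-mer to {read_id: number of occurrences in that read}.

    Naive dictionary-based index for illustration only; it is NOT the
    Gk arrays data structure described in the abstract.
    """
    index = defaultdict(lambda: defaultdict(int))
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]][read_id] += 1
    return index


def reads_containing(index, kmer, exactly_once=False):
    """Return sorted ids of reads containing `kmer`.

    With exactly_once=True, keep only reads where it occurs a single time;
    otherwise keep reads where it occurs at least once.
    """
    hits = index.get(kmer, {})  # .get avoids creating an empty entry
    if exactly_once:
        return sorted(r for r, count in hits.items() if count == 1)
    return sorted(hits)


# Hypothetical toy read collection.
reads = ["ACGTACGT", "TTACGA", "GGGG"]
index = build_kmer_index(reads, k=3)
print(reads_containing(index, "ACG"))
print(reads_containing(index, "ACG", exactly_once=True))
```

Storing a count per read is what lets one index answer both the "once" and the "at least once" variants of the query.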

    Differential expression of the RTP/Drg1/Ndr1 gene product in proliferating and growth arrested cells

    Using a differential display method to identify differentiation-related genes in human myelomonocytic U937 cells, we cloned the cDNA of a gene identical to Drg1 and homologous to other recently discovered genes, namely the human RTP and Cap43 genes and the mouse Ndr1 and TDD5 genes. Their open reading frames encode proteins that are highly conserved between mouse and man but share no homology with other known proteins. The conditions in which the mRNAs are up-regulated suggest a role for the protein in cell growth arrest and terminal differentiation. We raised antibodies against a synthetic peptide reproducing a characteristic sequence of the putative polypeptide chain. These antibodies revealed a protein with the expected 43 kDa molecular mass, up-regulated by phorbol ester, retinoids and 1,25-(OH)2 vitamin D3 in U937 cells. It was increased in mammary carcinoma MCF-7 cells treated with retinoids and with the anti-estrogen ICI 182,780, but not with 4-hydroxytamoxifen. The mouse Drg1 homologous protein was up-regulated by retinoic acid in C2 myogenic cells. The diversity of situations in which expression of RTP/Drg1/Ndr1 has now been observed shows that it is widely distributed and up-regulated by various agents. Here we show that ligands of nuclear transcription factors involved in cell differentiation are among the inducers of this novel protein.

    Using reads to annotate the genome: influence of length, background distribution, and sequence errors on prediction capacity

    Ultra high-throughput sequencing is used to analyse the transcriptome or interactome at unprecedented depth on a genome-wide scale. These techniques yield short sequence reads that are then mapped on a genome sequence to predict putatively transcribed or protein-interacting regions. We argue that factors such as background distribution, sequence errors, and read length affect the prediction capacity of sequence census experiments. Here we suggest a computational approach to measure these factors and analyse their influence on both transcriptomic and epigenomic assays. This investigation provides new clues on both methodological and biological issues. For instance, by analysing chromatin immunoprecipitation read sets, we estimate that 4.6% of reads are affected by SNPs. We show that, although the nucleotide error probability is low, it significantly increases with the position in the sequence. Choosing a read length above 19 bp practically eliminates the risk of finding irrelevant positions, while above 20 bp the number of uniquely mapped reads decreases. With our procedure, we obtain 0.6% false positives among genomic locations. Hence, even rare signatures should identify biologically relevant regions, if they are mapped on the genome. This indicates that digital transcriptomics may help to characterize the wealth of yet undiscovered, low-abundance transcripts.
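To give a rough feel for why read length matters for spurious matches, the following Python sketch computes the expected number of chance occurrences of a fixed sequence in a random genome. It is a back-of-the-envelope model, not the procedure used in the paper: it assumes independent, uniformly distributed bases (the abstract itself stresses that the real background distribution matters), and the 3 Gb genome size is an assumed round figure. The function name `expected_random_hits` is hypothetical.

```python
def expected_random_hits(read_length, genome_size, both_strands=True):
    """Expected chance occurrences of one fixed sequence of `read_length`
    in a random genome of `genome_size` bases.

    Assumes independent, uniformly distributed bases (probability 1/4 each),
    which real genomes do not satisfy; treat the result as an order of
    magnitude only.
    """
    positions = genome_size - read_length + 1
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** read_length


GENOME_SIZE = 3_000_000_000  # assumed round figure for a human-sized genome

for length in (14, 16, 19, 21):
    hits = expected_random_hits(length, GENOME_SIZE)
    print(f"length {length}: about {hits:.3g} expected chance hits")
```

Under these assumptions the expectation falls by a factor of four with each added base and is well below one at 19 bp, which is qualitatively consistent with the threshold reported in the abstract.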

    Simultaneous gene expression profiling in human macrophages infected with Leishmania major parasites using SAGE

    Background: Leishmania (L) are intracellular protozoan parasites that are able to survive and replicate within the harsh and potentially hostile phagolysosomal environment of mammalian mononuclear phagocytes. A complex interplay then takes place between the macrophage (MΦ) striving to eliminate the pathogen and the parasite struggling for its own survival. To investigate this host-parasite conflict at the transcriptional level, in the context of infection of monocyte-derived human MΦs (MDM) by L. major metacyclic promastigotes, the quantitative technique of serial analysis of gene expression (SAGE) was used.
    Results: After extracting mRNA from resting human MΦs, Leishmania-infected human MΦs and L. major parasites, three SAGE libraries were constructed and sequenced, generating up to 28,173; 57,514 and 33,906 tags respectively (corresponding to 12,946; 23,442 and 9,530 unique tags). Using computational data analysis and direct comparison to 357,888 publicly available experimental human tags, the parasite and host cell transcriptomes were then simultaneously characterized from the mixed cellular extract, confidently discriminating host from parasite transcripts. This procedure led us to reliably assign 3,814 tags to MΦ transcripts and 3,666 tags to L. major parasite transcripts. We focused on these, showing significant changes in their expression that are likely to be relevant to the pathogenesis of parasite infection: (i) human MΦ genes belonging to key immune response proteins (e.g., IFNγ pathway, S100 and chemokine families) and (ii) a group of Leishmania genes showing preferential expression at the parasite's intracellular developing stage.
    Conclusion: Dual SAGE transcriptome analysis provided a useful, powerful and accurate approach to discriminating genes of human or parasitic origin in Leishmania-infected human MΦs. The findings presented in this work suggest that the Leishmania parasite modulates key transcripts in human MΦs that may be beneficial for its establishment and survival. Furthermore, these results provide an overview of gene expression at two developmental stages of the parasite, namely metacyclic promastigotes and intracellular amastigotes, and indicate a broad difference between their transcriptomic profiles. Finally, our reported set of expressed genes will be useful in future rounds of data mining and gene annotation.

    Transcriptome annotation using tandem SAGE tags

    Analysis of several million expressed gene signatures (tags) revealed an increasing number of different sequences, largely exceeding that of annotated genes in mammalian genomes. Serial analysis of gene expression (SAGE) can reveal new poly(A) RNAs transcribed from previously unrecognized chromosomal regions. However, conventional SAGE tags are too short to unambiguously identify unique sites in large genomes. Here, we design a novel strategy with tags anchored on two different restriction sites of cDNAs. New transcripts are then tentatively defined by the two SAGE tags in tandem and by the spanning sequence read on the genome between these tagged sites. Having developed a new algorithm to locate these tag-delimited genomic sequences (TDGS), we first validated its capacity to recognize known genes and its ability to reveal new transcripts with two SAGE libraries built in parallel from a single RNA sample. Our algorithm proves fast enough to apply this strategy at a large scale. We then collected and processed the complete sets of human SAGE tags to predict yet unknown transcripts. A cross-validation with tiling array data shows that 47% of these TDGS overlap transcriptionally active regions. Our method provides a new and complementary approach for complex transcriptome annotation.
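The idea of a sequence delimited by two tags can be sketched in a few lines of Python. This is only a naive illustration under stated assumptions, not the authors' TDGS algorithm: it scans a single strand of an in-memory string with exact matching, and the function names, the `max_span` parameter, the toy genome and the tag sequences are all hypothetical.

```python
def find_all(text, pattern):
    """Return the start positions of every (possibly overlapping) exact match."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def tag_delimited_spans(genome, tag_a, tag_b, max_span):
    """Return (start, end) intervals that begin with tag_a, end with tag_b,
    and are at most max_span bases long (end is exclusive).

    Naive quadratic pairing on the forward strand only, for illustration.
    """
    spans = []
    b_positions = find_all(genome, tag_b)
    for a in find_all(genome, tag_a):
        for b in b_positions:
            end = b + len(tag_b)
            if a < b and end - a <= max_span:
                spans.append((a, end))
    return spans


# Hypothetical toy genome and tags.
genome = "GGCATGAAAATTTTCCCCGATCGG"
spans = tag_delimited_spans(genome, "CATGAAAA", "CCCCGATC", max_span=50)
print(spans, [genome[s:e] for s, e in spans])
```

Requiring both tags within a bounded distance is what makes the pair more specific than either short tag alone, which is the point the abstract makes about conventional SAGE tags.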

    Development of differentiation therapy for acute myeloid leukemias through the exploitation of transcriptome data

    No full text
    Vitamin D3 (VD) is a regulator of myeloid differentiation. Combining non-calcemic agonists of the vitamin D receptor (VDR) with histone deacetylase inhibitors yields complete maturation of acute myeloid leukemia (AML) cells from cell lines or patient samples. Gene expression profiles, established for 96 genes, define a group of markers predictive of the response of AML to differentiation agents. The involvement of the nuclear receptor coregulator NCOA4 in the transcriptional control of AML differentiation was established. NCOA4 is the only coregulator modulated during differentiation of an AML4 model by a combination of VD and retinoids, and its expression is characteristic of monocytic differentiation in response to VD. In addition, I showed that NCOA4 is involved in modulating VDR activity.

    Search for and characterization of prognostic biomarkers in chronic myelomonocytic and acute myeloid leukemias through transcriptome exploration

    No full text
    A challenge of transcriptomics is to explore the full repertoire of normal and abnormal transcripts. Gene expression profiling (GEP) analyses based on microarray technology have been widely used in cancer research for many years. Meanwhile, new methods based on high-throughput sequencing now make possible the accurate and sensitive analyses needed to study normal and cancer cells. Whatever the method, characterizing all coding and non-coding transcripts is a real biological challenge for identifying novel diagnostic and prognostic markers and for proper patient care. In this work, I addressed two different aspects of biology that converge on the identification of aberrantly expressed transcripts in myeloid leukemias. The first was to propose a selection of molecular biomarkers for the characterization of chronic myelomonocytic leukemia (CMML). From the expression of these genes, we developed a prognostic score that defined two clinically distinct groups of patients. We then completed the study with a phenotypic characterization, by flow cytometry, of the aberrant cell subpopulations that make up the monocytic and granulocytic lineages. Part of this work was extended to acute myeloid leukemia (AML) with normal karyotype. The second aspect was to take part in setting up an integrated computational approach to characterize novel unannotated RNAs that are most probably non-coding. By exploring Digital Gene Expression (DGE) data, we quantified and characterized the fraction of these transcripts located in intergenic regions. We verified the expression of these new transcripts in normal and cancer tissues by cross-referencing other expression data, such as RNA-Seq, and examined their conservation and expression in other species.