672 research outputs found

    EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data

    Background: ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping. Methods: EasyCluster uses the well-known GMAP program to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts by building genomic and EST local databases and running GMAP. Subsequently, it parses the results, creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and grouping together ESTs that share at least one splice site. Results: The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets, including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family, have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant, for which no Unigene clusters are yet available, as well as an evaluation of alternative splicing in this plant species.
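
The two clustering steps described in the EasyCluster abstract (overlap-based pseudo-clusters, then refinement by shared splice sites) can be sketched roughly as below. This is an illustrative sketch only, not EasyCluster's actual code: the `Alignment` record, both function names, and the coordinate convention are hypothetical, the GMAP mapping runs are assumed to have already produced exon coordinates, and unspliced ESTs are simply left as singletons because the abstract does not say how they are handled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one EST already mapped to the genome."""
    est_id: str
    strand: str   # "+" or "-"
    exons: tuple  # sorted (start, end) pairs, inclusive coordinates

    @property
    def span(self):
        # Genomic interval covered by the whole alignment.
        return self.exons[0][0], self.exons[-1][1]

    @property
    def splice_sites(self):
        # Inner exon boundaries: end of one exon, start of the next.
        sites = set()
        for (_, end), (start, _) in zip(self.exons, self.exons[1:]):
            sites.add(end)
            sites.add(start)
        return sites


def pseudo_clusters(alignments):
    """Step 1: group alignments whose spans overlap on the same strand."""
    by_strand = defaultdict(list)
    for aln in alignments:
        by_strand[aln.strand].append(aln)

    clusters = []
    for strand in sorted(by_strand):
        current, current_end = [], None
        for aln in sorted(by_strand[strand], key=lambda a: a.span):
            start, end = aln.span
            if current and start > current_end:
                # No overlap with the running interval: close the cluster.
                clusters.append(current)
                current, current_end = [], None
            current.append(aln)
            current_end = end if current_end is None else max(current_end, end)
        if current:
            clusters.append(current)
    return clusters


def refine_by_splice_sites(cluster):
    """Step 2: within a pseudo-cluster, join ESTs sharing >= 1 splice site."""
    parent = {aln.est_id: aln.est_id for aln in cluster}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # splice site -> first EST seen with that site
    for aln in cluster:
        for site in aln.splice_sites:
            if site in first_owner:
                parent[find(aln.est_id)] = find(first_owner[site])
            else:
                first_owner[site] = aln.est_id

    groups = defaultdict(list)
    for aln in cluster:
        groups[find(aln.est_id)].append(aln.est_id)
    return sorted(sorted(ids) for ids in groups.values())
```

Under these assumptions, two ESTs that overlap on the same strand but share no splice site land in the same pseudo-cluster after step 1 and are separated again by step 2.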

    An efficient and high-throughput approach for experimental validation of novel human gene predictions

    A highly automated RT-PCR-based approach has been established to validate novel human gene predictions with no prior experimental evidence of mRNA splicing (ab initio predictions). Ab initio gene predictions were selected for high-throughput validation using predicted protein classification, sequence similarity to other genomes, colocalization with an MPSS tag, or microarray expression. Initial microarray prioritization followed by RT-PCR validation was the most efficient combination, resulting in approximately 35% of the ab initio predictions being validated by RT-PCR. Of the 7252 novel genes that were prioritized and processed, 796 constituted real transcripts. In addition, high-throughput RACE successfully extended the 5′ and/or 3′ ends of >60% of RT-PCR-validated genes. Reevaluation of these transcripts produced 574 novel transcripts using RefSeq as a reference. RT-PCR sequencing in combination with RACE on ab initio gene predictions could be used to define the transcriptome across all species.

    CAFTAN: a tool for fast mapping, and quality assessment of cDNAs

    Background: The German cDNA Consortium has been cloning full-length cDNAs and has continued with their exploitation in protein localization experiments and cellular assays. However, the efficient use of large cDNA resources requires strategies capable of speedily selecting truly useful cDNAs from biological and experimental noise. To this end we have developed a new high-throughput analysis tool, CAFTAN, which simplifies these efforts and thus fills the gap between large-scale cDNA collections and their systematic annotation and application in functional genomics. Results: CAFTAN is built around the mapping of cDNAs to the genome assembly and the subsequent analysis of their genomic context. It uses sequence features such as the presence and type of polyA signals, inner and flanking repeats, the GC content, splice site types, etc. All these features are evaluated in individual tests and are used to classify cDNAs according to their sequence quality and their likelihood of having been generated from fully processed mRNAs. Additionally, CAFTAN compares the coordinates of mapped cDNAs with the genomic coordinates of reference sets from publicly available resources (e.g., VEGA, ENSEMBL). This provides detailed information about overlapping exons and the structural classification of cDNAs with respect to the reference set of splice variants. The evaluation of CAFTAN showed that it is able to correctly classify more than 85% of 5950 selected "known protein-coding" VEGA cDNAs as high-quality multi- or single-exon cDNAs. It identified as good 80.6% of the single-exon cDNAs and 85% of the multiple-exon cDNAs. The program is written in Perl in a modular way, allowing the strategy to be adapted to other tasks such as EST annotation, or extended by adding new classification rules and new organism databases as they become available. We think that it is a very useful program for the annotation and research of unfinished genomes. Conclusion: CAFTAN is a high-throughput sequence analysis tool that performs a fast and reliable quality prediction of cDNAs. Several thousand cDNAs can be analyzed in a short time, giving the curator/scientist a first quick overview of the quality and the existing annotation of a set of cDNAs. It supports the rejection of low-quality cDNAs and helps in the selection of likely novel splice variants and/or completely novel transcripts for new experiments. German Federal Ministry of Education and Research 01GR0101 and 01GR0420 and 01GR045

    EGassembler: online bioinformatics service for large-scale processing, clustering and assembling ESTs and genomic DNA fragments

    Expressed sequence tag (EST) sequencing has proven to be an economically feasible alternative for gene discovery in species lacking a draft genome sequence. Ongoing large-scale EST sequencing projects need bioinformatics tools that facilitate uniform EST handling. This brings renewed importance to a universal tool for processing and functional annotation of large sets of ESTs. EGassembler is a web server that provides an automated as well as a user-customized analysis tool for cleaning, repeat masking, vector trimming, organelle masking, clustering and assembling of ESTs and genomic fragments. The web server is publicly available and provides the community with an all-in-one online web service for large-scale clustering and assembling of ESTs and genomic DNA. Running on a Sun Fire 15K supercomputer, it can process a large volume of data in a short period of time. The results can be used to functionally annotate genes, to facilitate splice alignment analysis, to link the transcripts to genetic and physical maps, to design microarray chips, to perform transcriptome analysis and to map to KEGG metabolic pathways. The service provides an excellent bioinformatics tool to wet-lab research groups as well as an all-in-one tool for sequence handling to bioinformatics researchers.

    Differential splicing and allelic imbalance as pathomechanisms of recurring mutations in Acute Myeloid Leukemia

    Acute myeloid leukemia is an aggressive malignancy that proves fatal if left untreated. Most patients respond to intensive chemotherapy; however, refractory or relapsing disease is still a major contributor to poor patient outcome. Next-generation sequencing methods enabled the identification of genes harboring recurrent mutations in this disease, and these mutations are being used to inform clinical decisions. The aim of the studies presented in this thesis was to improve our understanding of these mutations in order to further refine clinical decision making. The first study provided an overview of splicing factor mutations, which affect around 20% of all acute myeloid leukemia patients. It highlighted the association of splicing factor mutations with clinical and molecular parameters and further showed that splicing factor mutations are not independent prognostic markers in acute myeloid leukemia. A novel differential splice junction usage pipeline was used to quantify aberrant splicing patterns in mutated patients in two large sequencing datasets. The usage of two splice junctions was shown to identify patients with poor prognosis, thereby providing an example of how our findings can be translated to clinical practice. The purpose of the second study was to examine allelic imbalance of recurrent mutations, a currently underappreciated phenomenon in acute myeloid leukemia. Using a large patient sample pool with matched DNA- and RNA-sequencing data, we were able to compare variant calling pipelines between both sequencing methods to determine whether recurrent mutations are over- or underrepresented in RNA. We defined weighted allelic imbalance as a parameter for statistically comparing variant allele frequencies between DNA and RNA and identified allelic imbalance in nine out of eleven recurrently mutated genes examined in this study. Furthermore, recurrent mutations in GATA2 were also shown to exhibit preferential transcription of the mutant allele in a pooled validation cohort of three independent datasets. In summary, our studies show how customized bioinformatics pipelines can lead to an improved pathomechanistic understanding of recurrent mutations in acute myeloid leukemia and provide a foothold for further study of these mutations in high-throughput sequencing experiments.

    Swine blood transcriptomics: Application and advancement

    Improving swine feed efficiency (FE) by selection for low residual feed intake (RFI) is of practical interest. However, whether selection for low RFI compromises a pig's immune response is not clear. In addition, current RFI-based selection for improving feed efficiency is expensive and time-consuming. Seeking alternative tools to facilitate selection, such as predictive biomarkers for RFI, is of great interest. The objectives of this thesis are as follows: (1) to investigate whether selection for low RFI compromises a pig's immune response; (2) to develop candidate biomarkers applicable at an early growth stage for predicting RFI at a late growth stage; (3) to improve the annotation of the porcine blood transcriptome. In Chapter 2, pigs of two lines divergently selected for RFI were injected with lipopolysaccharide (LPS). Transcriptomes of peripheral blood at baseline and at multiple time points post injection were profiled by RNA-seq. LPS injection induced a systemic inflammatory response in both RFI lines. However, no significant differences were detected in the dynamics of body temperature, blood cell count and cytokine levels during the time course. Only a very small number of differentially expressed genes (DEGs) were detected between the lines over all time points, though ~50% of blood genes were differentially expressed post LPS injection compared to baseline for each line. The two lines were largely similar in most biological pathways and processes studied. Minor differences included a slightly lower level of inflammatory response in the low- versus high-RFI animals. Cross-species comparison showed that humans and pigs responded to LPS stimulation similarly at both the gene and pathway levels, though pigs are more tolerant to LPS than humans. In Chapter 3, post-weaning blood transcriptomic differences between the two lines were studied by RNA-seq. DEGs between the lines significantly overlapped gene sets associated with human diseases, such as eating disorders, hyperphagia and mitochondrial disease. Genes functioning in the mitochondrion and proteasome had lower expression, and genes functioning in signaling had higher expression, in the low-RFI group relative to the high-RFI group. Expression levels of five genes differentially expressed between the two groups were significantly associated with individual animals' RFI values. These five genes were candidate biomarkers for predicting RFI. Given the limitations of the current annotation of the porcine reference genome, a high-quality annotated transcriptome of porcine peripheral blood was built in the last study via a hybrid assembly strategy, using a large amount of blood RNA-seq data from the studies mentioned above and from public databases. Taken together, this work provides evidence that selection for low RFI did not significantly compromise pigs' immune response to systemic inflammation, offers a few candidate biomarkers for predicting RFI to facilitate RFI-based selection, and significantly advances the structural and functional annotation of the porcine blood transcriptome.