
    POPLAR GENE EXPRESSION DATA ANALYSIS PIPELINES

    Analyzing large-scale gene expression data is a labor-intensive and time-consuming process. To make data analysis easier, we developed a set of pipelines for rapid processing and analysis of poplar gene expression data for knowledge discovery. Of the pipelines developed, the differentially expressed genes (DEGs) pipeline is designed to identify biologically important genes that are differentially expressed at one of multiple time points or conditions. The pathway analysis pipeline was designed to identify differentially expressed metabolic pathways. The protein domain enrichment pipeline identifies the enriched protein domains present in the DEGs. Finally, the Gene Ontology (GO) enrichment analysis pipeline was developed to identify the enriched GO terms in the DEGs. Our pipeline tools can analyze both microarray gene data and high-throughput gene data. These two types of data are obtained by two different technologies. Microarray technology measures gene expression levels via microarray chips, each a collection of microscopic DNA spots attached to a solid (glass) surface, whereas high-throughput sequencing, also called next-generation sequencing, is a newer technology that measures gene expression levels by directly sequencing mRNAs and obtaining each mRNA's copy number in cells or tissues. We also developed a web portal (http://sys.bio.mtu.edu/) that makes all pipelines available to the public so that users can analyze their own gene expression data. In addition to the analyses mentioned above, the portal can also perform GO hierarchy analysis, i.e. construct GO trees using a list of GO terms as input.
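
The abstract does not say which statistic the GO enrichment pipeline uses. A common choice for this kind of analysis is a one-sided hypergeometric test, and the following is a minimal sketch under that assumption. The function names, the input layout (a DEG list, a background gene list, and a term-to-genes mapping) and the gene identifiers are all hypothetical and are not taken from the poplar pipelines.

```python
from math import comb


def hypergeom_pvalue(hits, sample_size, annotated, population):
    """Upper-tail probability P(X >= hits) for a hypergeometric draw.

    population  -- number of genes in the background
    annotated   -- background genes carrying the GO term
    sample_size -- number of DEGs drawn from the background
    hits        -- DEGs carrying the GO term
    """
    upper = min(sample_size, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample_size - i)
        for i in range(hits, upper + 1)
    )
    return favourable / comb(population, sample_size)


def go_enrichment(degs, background, term_to_genes):
    """Return (term, hit_count, p_value) tuples sorted by ascending p-value."""
    background = set(background)
    degs = set(degs) & background  # ignore DEGs outside the background
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        hits = degs & annotated
        if not hits:
            continue  # a term with no DEG hits cannot be enriched
        p = hypergeom_pvalue(len(hits), len(degs), len(annotated), len(background))
        results.append((term, len(hits), p))
    return sorted(results, key=lambda row: row[2])
```

The sketch returns raw p-values only. A real analysis over many GO terms would normally add a multiple-testing correction, which is left out here.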

    A Robust scRNA-seq Data Analysis Pipeline for Measuring Gene Expression Noise

    The past decade has seen a drastic increase in collaboration between Computer Science (CS) and Molecular Biology (MB). Current foci in CS such as deep learning require very large amounts of data, and MB research can often be rapidly advanced by analysis and models from CS. CS could aid MB in, for example, the analysis of sequences to find binding sites and the prediction of protein folding patterns. Stem-like cells can be maintained and replicated over long periods and can differentiate into various tissue types. These behaviors are made possible by controlling the expression of specific genes. These genes then cascade into a network effect by either promoting or repressing downstream gene expression. The expression level of all gene transcripts within a single cell can be analyzed using single cell RNA sequencing (scRNA-seq). A significant portion of the noise in scRNA-seq data is the result of extrinsic factors and can only be removed by a customized scRNA-seq analysis pipeline. scRNA-seq experiments utilize next-gen sequencing to measure genome-scale gene expression levels with single cell resolution. Almost every step during analysis and quantification requires the use of an often empirically determined threshold, which makes quantification of noise less accurate. In addition, each research group often develops its own data analysis pipeline, making it impossible to compare data from different groups. To remedy this problem, a streamlined and standardized scRNA-seq data analysis and normalization protocol was designed and developed. After analyzing multiple experiments, we identified the possible pipeline stages and the tools needed. Our pipeline is capable of handling data with adapters and barcodes, which was not the case with pipelines from some experiments. Our pipeline can be used to analyze single-experiment scRNA-seq data and also to compare scRNA-seq data across experiments. 
Various processes like data gathering, file conversion, and data merging were automated in the pipeline. The main focus was to standardize and normalize single-cell RNA-seq data to minimize technical noise introduced by disparate platforms. Dissertation/Thesis, Masters Thesis, Bioengineering 201
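
The abstract does not state which normalization the thesis pipeline applies. As an illustration only, the sketch below assumes library-size scaling to counts per million (CPM) followed by a per-gene squared coefficient of variation as a simple noise measure. The function names and the per-cell dictionary layout are hypothetical.

```python
from statistics import mean, pvariance


def normalize_cpm(counts_per_cell, scale=1_000_000):
    """Scale each cell's raw counts so that they sum to `scale`.

    counts_per_cell -- list of {gene: raw_count} dictionaries, one per cell
    """
    normalized = []
    for cell in counts_per_cell:
        total = sum(cell.values())
        if total == 0:
            raise ValueError("cell has no reads; filter it before normalizing")
        normalized.append({gene: count * scale / total for gene, count in cell.items()})
    return normalized


def cv_squared(normalized_cells, gene):
    """Squared coefficient of variation of one gene across cells."""
    values = [cell.get(gene, 0.0) for cell in normalized_cells]
    m = mean(values)
    if m == 0:
        return float("nan")  # noise is undefined for an unexpressed gene
    return pvariance(values) / (m * m)
```

Two cells with the same composition but different sequencing depth normalize to identical values, so the depth difference no longer shows up as noise. That is the kind of technical variation the abstract aims to remove.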

    Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data

    The technological advancements of the modern era have enabled the collection of huge amounts of data in science and beyond. Extracting useful information from such massive datasets is an ongoing challenge as traditional data visualization tools typically do not scale well in high-dimensional settings. An existing visualization technique that is particularly well suited to visualizing large datasets is the heatmap. Although heatmaps are extremely popular in fields such as bioinformatics for visualizing large gene expression datasets, they remain a severely underutilized visualization tool in modern data analysis. In this paper we introduce superheat, a new R package that provides an extremely flexible and customizable platform for visualizing large datasets using extendable heatmaps. Superheat enhances the traditional heatmap by providing a platform to visualize a wide range of data types simultaneously, adding to the heatmap a response variable as a scatterplot, model results as boxplots, correlation information as barplots, text information, and more. Superheat allows the user to explore their data to greater depths and to take advantage of the heterogeneity present in the data to inform analysis decisions. The goal of this paper is two-fold: (1) to demonstrate the potential of the heatmap as a default visualization method for a wide range of data types using reproducible examples, and (2) to highlight the customizability and ease of implementation of the superheat package in R for creating beautiful and extendable heatmaps. The capabilities and fundamental applicability of the superheat package will be explored via three case studies, each based on publicly available data sources and accompanied by a file outlining the step-by-step analytic pipeline (with code). Comment: 26 pages, 10 figures

    How to describe a cell: a path to automated versatile characterization of cells in imaging data

    A cell is the basic functional unit of life. Most multicellular organisms, including animals, are composed of a variety of different cell types that fulfil distinct roles. Within an organism, all cells share the same genome; however, their diverse genetic programs lead them to acquire different molecular and anatomical characteristics. Describing these characteristics is essential for understanding how cellular diversity emerged and how it contributes to organism function. Probing cellular appearance by microscopy methods is the original way of describing cell types and the main approach to characterise cellular morphology and position in the organism. Present cutting-edge microscopy techniques generate immense amounts of data, requiring efficient automated unbiased methods of analysis. Not only can such methods accelerate the process of scientific discovery, they should also facilitate large-scale systematic reproducible analysis. The necessity of processing big datasets has led to the development of intricate image analysis pipelines; however, they are mostly tailored to a particular dataset and a specific research question. In this thesis I aimed to address the problem of creating more general fully-automated ways of describing cells in different imaging modalities, with a specific focus on deep neural networks as a promising solution for extracting rich general-purpose features from the analysed data. I further target the problem of integrating multiple data modalities to generate a detailed description of cells on the whole-organism level. First, on two examples of cell analysis projects, I show how using automated image analysis pipelines, and neural networks in particular, can assist in characterising cells in microscopy data. In the first project I analyse a movie of Drosophila embryo development to elucidate the difference in myosin patterns between two populations of cells with different shape fate. 
In the second project I develop a pipeline for automatic cell classification in a new imaging modality to show that the quality of the data is sufficient to tell apart cell types in a volume of mouse brain cortex. Next, I present an extensive collaborative effort aimed at generating a whole-body multimodal cell atlas of a three-segmented Platynereis dumerilii worm, combining high-resolution morphology and gene expression. To generate a multi-sided description of cells in the atlas, I create a pipeline for assigning coherent denoised gene expression profiles, obtained from spatial gene expression maps, to cells segmented in the EM volume. Finally, as the main project of this thesis, I focus on extracting comprehensive unbiased cell morphology features from an EM volume of Platynereis dumerilii. I design a fully unsupervised neural network pipeline for extracting rich morphological representations that enable grouping cells into morphological cell classes with characteristic gene expression. I further show how such descriptors could be used to explore the morphological diversity of cells, tissues and organs in the dataset.

    Medium-throughput processing of whole mount in situ hybridisation experiments into gene expression domains

    This is the final version of the article. Available from the publisher via the DOI in this record. Understanding the function and evolution of developmental regulatory networks requires the characterisation and quantification of spatio-temporal gene expression patterns across a range of systems and species. However, most high-throughput methods to measure the dynamics of gene expression do not preserve the detailed spatial information needed in this context. For this reason, quantification methods based on image bioinformatics have become increasingly important over the past few years. Most available approaches in this field either focus on the detailed and accurate quantification of a small set of gene expression patterns, or attempt high-throughput analysis of spatial expression through binary pattern extraction and large-scale analysis of the resulting datasets. Here we present a robust, "medium-throughput" pipeline to process in situ hybridisation patterns from embryos of different species of flies. It bridges the gap between high-resolution and high-throughput image processing methods, enabling us to quantify graded expression patterns along the antero-posterior axis of the embryo in an efficient and straightforward manner. Our method is based on a robust enzymatic (colorimetric) in situ hybridisation protocol and rapid data acquisition through wide-field microscopy. Data processing consists of image segmentation, profile extraction, and determination of expression domain boundary positions using a spline approximation. It results in sets of measured boundaries sorted by gene and developmental time point, which are analysed in terms of expression variability or spatio-temporal dynamics. Our method yields integrated time series of spatial gene expression, which can be used to reverse-engineer developmental gene regulatory networks across species. 
It is easily adaptable to other processes and species, enabling the in silico reconstitution of gene regulatory networks in a wide range of developmental contexts. The laboratory of Johannes Jaeger and this study in particular was funded by the MEC-EMBL agreement for the EMBL/CRG Research Unit in Systems Biology, by grant 153 (MOPDEV) of the ERANet: ComplexityNET program, by SGR grant 406 from the Catalan funding agency AGAUR, by grant BFU2009-10184 from the Spanish Ministry of Science, and by European Commission grant FP7-KBBE-2011-5/289434 (BioPreDyn).
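
The boundary-determination step can be illustrated with a simplified stand-in. The abstract says the authors use a spline approximation. The sketch below instead uses plain linear interpolation between neighbouring samples and a half-maximum threshold, both of which are assumptions made here for brevity. The function name and the profile layout (positions along the antero-posterior axis paired with intensities) are hypothetical.

```python
def boundary_positions(positions, intensities, fraction=0.5):
    """Locate where an expression profile crosses a relative threshold.

    positions   -- sample positions along the axis (e.g. % egg length)
    intensities -- expression intensity at each position
    fraction    -- threshold as a fraction of the profile's dynamic range

    Returns a list of (position, direction) pairs, where direction is
    "rising" or "falling". Samples lying exactly on the threshold are
    skipped by the strict sign test.
    """
    if len(positions) != len(intensities):
        raise ValueError("positions and intensities must have equal length")
    low, high = min(intensities), max(intensities)
    threshold = low + fraction * (high - low)
    boundaries = []
    for i in range(len(positions) - 1):
        x0, x1 = positions[i], positions[i + 1]
        y0, y1 = intensities[i], intensities[i + 1]
        if (y0 - threshold) * (y1 - threshold) < 0:
            # Linear interpolation between the two bracketing samples.
            x = x0 + (threshold - y0) * (x1 - x0) / (y1 - y0)
            boundaries.append((x, "rising" if y1 > y0 else "falling"))
    return boundaries
```

For a single expression stripe the function returns one rising and one falling boundary, which matches the "measured boundaries sorted by gene and developmental time point" described in the abstract.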

    TOBFAC: the database of tobacco transcription factors

    Background: Regulation of gene expression at the level of transcription is a major control point in many biological processes. Transcription factors (TFs) can activate and/or repress the transcriptional rate of target genes, and vascular plant genomes devote approximately 7% of their coding capacity to TFs. Global analysis of TFs has only been performed for three complete higher plant genomes – Arabidopsis (Arabidopsis thaliana), poplar (Populus trichocarpa) and rice (Oryza sativa). Presently, no large-scale analysis of TFs has been made from a member of the Solanaceae, one of the most important families of vascular plants. To fill this void, we have analysed tobacco (Nicotiana tabacum) TFs using a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by methylation filtering of the tobacco genome. An analytical pipeline was developed to isolate TF sequences from the GSR data set. This involved multiple (typically 10–15) independent searches with different versions of the TF family-defining domain(s) (normally the DNA-binding domain) followed by assembly into contigs and verification. Our analysis revealed that tobacco contains a minimum of 2,513 TFs representing all of the 64 well-characterised plant TF families. The number of TFs in tobacco is higher than previously reported for Arabidopsis and rice. Results: TOBFAC: the database of tobacco transcription factors, is an integrative database that provides a portal to sequence and phylogeny data for the identified TFs, together with a large quantity of other data concerning TFs in tobacco. The database contains an individual page dedicated to each of the 64 TF families. These contain background information, domain architecture via Pfam links, a list of all sequences and an assessment of the minimum number of TFs in this family in tobacco. 
Downloadable phylogenetic trees of the major families are provided along with detailed information on the bioinformatic pipeline that was used to find all family members. TOBFAC also contains EST data, a list of published tobacco TFs and a list of papers concerning tobacco TFs. The sequences and annotation data are stored in relational tables using a PostgreSQL relational database management system. The data processing and analysis pipelines used the Perl programming language. The web interface was implemented in JavaScript and Perl CGI running on an Apache web server. The computationally intensive data processing and analysis pipelines were run on an Apple XServe cluster with more than 20 nodes. Conclusion: TOBFAC is an expandable knowledgebase of tobacco TFs with data currently available for over 2,513 TFs from 64 gene families. TOBFAC integrates available sequence information, phylogenetic analysis, and EST data with published reports on tobacco TF function. The database provides a major resource for the study of gene expression in tobacco and the Solanaceae and helps to fill a current gap in studies of TF families across the plant kingdom. TOBFAC is publicly accessible at http://compsysbio.achs.virginia.edu/tobfac/.

    Transformation of metabolism with age and lifestyle in Antarctic seals: a case study of systems biology approach to cross-species microarray experiment

    Background: The metabolic transformation that changes Weddell seal pups born on land into aquatic animals is not only interesting for the study of general biology, but it also provides a model for the acquired and congenital muscle disorders which are associated with oxygen metabolism in skeletal muscle. However, the analysis of gene expression in seals is hampered by the lack of specific microarrays and the very limited annotation of known Weddell seal (Leptonychotes weddellii) genes. Results: Muscle samples from newborn, juvenile, and adult Weddell seals were collected during an Antarctic expedition. Extracted RNA was hybridized on Affymetrix Human Expression chips. Preliminary studies showed a detectable signal from at least 7000 probe sets present in all samples and replicates. Relative expression levels for these genes were used for further analysis of the biological pathways implicated in the metabolic transformation which occurs in the transition from newborn, to juvenile, to adult seals. Cytoskeletal remodeling, WNT signaling, FAK signaling, hypoxia-induced HIF1 activation, and insulin regulation were identified as being among the most important biological pathways involved in transformation. Conclusion: In spite of certain losses in specificity and sensitivity, the cross-species application of gene expression microarrays is capable of solving challenging puzzles in biology. A Systems Biology approach based on gene interaction patterns can compensate adequately for the lack of species-specific genomics information.

    Creation of a Computational Pipeline to Extract Genes from Quantitative Trait Loci for Diabetes and Obesity

    Type 2 Diabetes is a disease of relative insulin deficiency resulting from a combination of insulin resistance and decreased beta-cell function. Over the past several years, over 60 genes have been identified for Type 2 Diabetes in human genome-wide association studies (GWAS). It is important to understand the genetics involved with Type 2 Diabetes in order to improve treatment and understand underlying molecular mechanisms. Heterogeneous stock (HS) rats are derived from 8 inbred founder strains and are powerful tools for genetic studies because they provide a basis for high-resolution mapping of quantitative trait loci (QTL) in a relatively short time period. By measuring diabetic traits in 1090 HS male rats and genotyping 10K single nucleotide polymorphisms (SNPs) within these rats, Dr. Solberg Woods' lab conducted genetic analysis to identify 85 QTL for diabetes and adiposity traits. To identify candidate genes within these QTL, we propose creation of a bioinformatics pipeline that combines general gene information, information from the rat genome database including disease portals and Variant Visualizer, as well as the Attie Diabetes Expression Database. My project has involved writing code to pull data from these databases to determine which genes within each QTL are potential candidate genes. I have scripted the code to analyze genes within a single QTL or multiple QTL simultaneously. The resulting output is a single Excel file for each QTL, listing all genes that are found in the disease portals, all genes that have a highly conserved non-synonymous variant change and all genes that are differentially expressed in the Attie database. The program also highlights genes that are found in all three categories. After creating the pipeline, I ran the program for 85 QTL identified in my laboratory. The program identified 63 high priority candidate genes for future follow-up. 
This work has helped my laboratory rapidly identify candidate genes for Type 2 Diabetes and obesity. In the future, the code can be modified to identify candidate genes within QTL for any complex trait.
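
The three-way overlap described in this abstract can be sketched as a set intersection. The sketch assumes each evidence source has already been reduced to a set of gene symbols. The function name, the column names, and the placeholder gene names are hypothetical, and the database queries and Excel export of the original program are not reproduced.

```python
def prioritize_qtl_genes(qtl_genes, disease_portal, conserved_variant, diff_expressed):
    """Flag each gene in a QTL by the evidence categories that support it.

    qtl_genes         -- gene symbols located inside the QTL interval
    disease_portal    -- genes listed in a disease portal
    conserved_variant -- genes with a highly conserved non-synonymous variant
    diff_expressed    -- genes differentially expressed in the expression database

    Returns one row per gene with at least one line of evidence; genes
    present in all three categories are marked as high priority.
    """
    rows = []
    for gene in sorted(qtl_genes):
        flags = (
            gene in disease_portal,
            gene in conserved_variant,
            gene in diff_expressed,
        )
        if not any(flags):
            continue
        rows.append(
            {
                "gene": gene,
                "disease_portal": flags[0],
                "conserved_variant": flags[1],
                "diff_expressed": flags[2],
                "high_priority": all(flags),
            }
        )
    return rows
```

Running the function once per QTL and writing each result to its own file would mirror the one-file-per-QTL output the abstract describes.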

    Large-scale event extraction from literature with multi-level gene normalization

    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique genes and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). 
Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons Attribution-ShareAlike (CC BY-SA) license.
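
The idea of mapping a mention to identifiers at several levels of granularity can be illustrated with a toy lookup. This is only a sketch and not the normalization system used in the study: the canonicalization rule (lowercase, strip non-alphanumerics), the lookup tables, and every identifier shown are hypothetical placeholders.

```python
import re

# Hypothetical lookup tables; the identifiers are placeholders, not real records.
SYMBOL_TO_GENE_IDS = {"abc1": ["GENE:0001", "GENE:0002"]}
GENE_ID_TO_FAMILY = {"GENE:0001": "FAMILY:abc", "GENE:0002": "FAMILY:abc"}


def canonicalize(mention):
    """Coarsest level: lowercase the mention and drop punctuation and spaces."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


def normalize_mention(mention):
    """Map a text mention to symbol, gene, and family level identifiers.

    An ambiguous symbol keeps all of its candidate gene identifiers, so a
    caller can fall back to the family level when the gene is unresolved.
    """
    symbol = canonicalize(mention)
    gene_ids = SYMBOL_TO_GENE_IDS.get(symbol, [])
    families = sorted(
        {GENE_ID_TO_FAMILY[g] for g in gene_ids if g in GENE_ID_TO_FAMILY}
    )
    return {"symbol": symbol, "gene_ids": gene_ids, "families": families}
```

Spelling variants such as "Abc-1" and "ABC 1" collapse to the same canonical symbol, which is what lets mentions from different articles be linked to the same records.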