5 research outputs found

    VIPER: Visualization Pipeline for RNA-seq, a Snakemake workflow for efficient and complete RNA-seq analysis

    BACKGROUND: RNA sequencing has become a ubiquitous technology used throughout life sciences as an effective method of measuring RNA abundance quantitatively in tissues and cells. The increase in use of RNA-seq technology has led to the continuous development of new tools for every step of analysis from alignment to downstream pathway analysis. However, effectively using these analysis tools in a scalable and reproducible way can be challenging, especially for non-experts. RESULTS: Using the workflow management system Snakemake we have developed a user-friendly, fast, efficient, and comprehensive pipeline for RNA-seq analysis. VIPER (Visualization Pipeline for RNA-seq analysis) is an analysis workflow that combines some of the most popular tools to take RNA-seq analysis from raw sequencing data, through alignment and quality control, into downstream differential expression and pathway analysis. VIPER has been created in a modular fashion to allow for the rapid incorporation of new tools to expand its capabilities. This capacity has already been exploited to include very recently developed tools that explore immune infiltrate and T-cell CDR (Complementarity-Determining Regions) reconstruction abilities. The pipeline has been conveniently packaged such that minimal computational skills are required to download and install the dozens of software packages that VIPER uses. CONCLUSIONS: VIPER is a comprehensive solution that performs most standard RNA-seq analyses quickly and effectively with a built-in capacity for customization and expansion.

    RASflow: an RNA-Seq analysis workflow with Snakemake

    Background: With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated, giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq data has to be processed through a number of steps resulting in a quantification of expression of each gene/transcript in each of the analyzed samples. A number of workflows are available to help researchers perform these steps on their own data, or on public data to take advantage of novel software or reference data in data re-analysis. However, many of the existing workflows are limited to specific types of studies. We therefore aimed to develop a maximally general workflow, applicable to a wide range of data and analysis approaches and at the same time supporting research on both model and non-model organisms. Furthermore, we aimed to make the workflow usable by users with limited programming skills. Results: Utilizing the workflow management system Snakemake and the package management system Conda, we have developed a modular, flexible and user-friendly RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow). Utilizing Snakemake and Conda alleviates challenges with library dependencies and version conflicts and also supports reproducibility. To be applicable for a wide variety of applications, RASflow supports the mapping of reads to both genomic and transcriptomic assemblies. RASflow has a broad range of potential users: it can be applied by researchers interested in any organism, and since it requires no programming skills, it can be used by researchers with different backgrounds. The source code of RASflow is available on GitHub: https://github.com/zhxiaokang/RASflow. Conclusions: RASflow is a simple and reliable RNA-Seq analysis workflow covering many use cases.

    Biomarker Discovery Using Statistical and Machine Learning Approaches on Gene Expression Data

    My PhD is affiliated with the dCod 1.0 project (https://www.uib.no/en/dcod): decoding the systems toxicology of Atlantic cod (Gadus morhua), which aims to better understand how cod adapt and react to stressors in the environment. One of the research topics is to discover biomarkers that discriminate fish in a normal biological state from fish that have been exposed to toxicants. A biomarker, or biological marker, is an indicator of a biological state in response to an intervention, which can be, for example, toxic exposure (in toxicology), disease (for example cancer), or drug response (in precision medicine). Biomarker discovery is an important research topic in toxicology, cancer research, and related fields. A good set of biomarkers can give insight into the mechanisms of the disease or toxicant response and can help determine whether a person has the disease or a fish has been exposed to the toxicant. On the molecular level, a biomarker could be a genotype, for instance a single nucleotide variant linked with a particular disease or susceptibility; another biomarker could be the level of expression of a gene or a set of genes. In this thesis we focus on the latter, aiming to identify the informative genes that help to distinguish samples from different groups based on gene expression profiles. Several transcriptomics technologies can be used to generate the necessary data, and among them, DNA microarrays and RNA sequencing (RNA-Seq) have become the most useful methods for whole-transcriptome gene expression profiling. RNA-Seq in particular has become an attractive alternative to microarrays since it was introduced. Prior to analysis of gene expression, the RNA-Seq data needs to go through a series of processing steps, so a workflow that automates the process is highly desirable.
Even though many workflows have been proposed to facilitate this process, their application is usually limited to, for example, model organisms, high-performance computers, or computationally fluent users. To fill these gaps, we developed a maximally general RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow), which is applicable to a wide range of applications and requires little programming skill. It takes the sequencing data as input and maps the reads to either a transcriptome or a genome for quantification. The resulting gene expression profile then goes through normalization and statistical tests to identify the differentially expressed genes. This work was presented in Paper I and Paper II. Differential expression analysis as used in RASflow, together with other univariate methods, is widely used in biomarker discovery for its simplicity and interpretability. However, these methods assume that variables are independent, so they can only identify variables that carry significant information by themselves, whereas biological processes usually involve many variables that have complex interactions. Multivariate methods, which take the interactions between variables into consideration, are therefore also popular for biomarker discovery. To study whether one category has a significant advantage over the other, we conducted a comparative study of various methods from these two categories and evaluated them on two aspects: stability and prediction accuracy. We found that a method's performance is quite data-dependent. This work was presented in Paper III. Since the biomarker discovery methods perform quite differently on different datasets, how should one choose the most appropriate method for a particular dataset? One solution is to use the function perturbation strategy to combine the outputs from multiple methods.
Function perturbation is capable of maintaining prediction accuracy compared with the original individual methods, but its stability is not satisfactory. Data perturbation, on the other hand, uses a similar ensemble learning logic: it first generates multiple datasets by resampling the original dataset and then combines the results from those datasets. Data perturbation has been shown to improve the stability of biomarker discovery methods. We therefore proposed a framework that combines function perturbation with data perturbation: Ensemble Feature Selection Integrating Stability (EFSIS), which achieves both high prediction accuracy and stability. This work was presented in Paper IV.
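
The idea of combining data perturbation (resampling) with function perturbation (several selection methods) can be illustrated with a minimal sketch. This is a generic vote-counting ensemble written for illustration only, not the EFSIS algorithm itself, whose aggregation details are not given in the abstract. The two scoring functions, the toy dataset, and all names (`mean_difference_scores`, `auc_scores`, `ensemble_select`) are assumptions introduced here.

```python
import random
from statistics import mean


def split_by_class(X, y, j):
    """Return the values of feature j for class 1 and class 0 samples."""
    pos = [row[j] for row, label in zip(X, y) if label == 1]
    neg = [row[j] for row, label in zip(X, y) if label == 0]
    return pos, neg


def mean_difference_scores(X, y):
    """Score each feature by the absolute difference between class means."""
    scores = []
    for j in range(len(X[0])):
        pos, neg = split_by_class(X, y, j)
        scores.append(abs(mean(pos) - mean(neg)))
    return scores


def auc_scores(X, y):
    """Score each feature by how far its two-class AUC lies from 0.5."""
    scores = []
    for j in range(len(X[0])):
        pos, neg = split_by_class(X, y, j)
        wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
        scores.append(abs(wins / (len(pos) * len(neg)) - 0.5))
    return scores


def top_k(scores, k):
    """Indices of the k highest scores; ties go to the lower index."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return order[:k]


def ensemble_select(X, y, rankers, k, n_resamples=30, seed=0):
    """Count top-k votes over bootstrap resamples and over all rankers."""
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * len(X[0])
    for _ in range(n_resamples):
        # Data perturbation: draw a bootstrap sample containing both classes.
        while True:
            idx = [rng.randrange(n) for _ in range(n)]
            if len({y[i] for i in idx}) == 2:
                break
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        # Function perturbation: let every ranker vote on this resample.
        for ranker in rankers:
            for j in top_k(ranker(Xb, yb), k):
                votes[j] += 1
    return top_k(votes, k), votes


# Toy data (hypothetical): feature 0 separates the classes, the rest is noise.
data_rng = random.Random(1)
y = [0] * 10 + [1] * 10
X = [[10 * label + data_rng.uniform(-1, 1)]
     + [data_rng.uniform(-1, 1) for _ in range(4)] for label in y]

selected, votes = ensemble_select(X, y, [mean_difference_scores, auc_scores], k=2)
print("selected features:", selected)
print("votes per feature:", votes)
```

In this sketch every resample and every ranker contributes equally. A framework that integrates stability would presumably weight the individual methods, for example by how consistent their selections are across resamples; that weighting is deliberately left out here.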