40 research outputs found

    Advantages of distributed and parallel algorithms that leverage Cloud Computing platforms for large-scale genome assembly

    Background: The transition to next-generation sequencing (NGS) technologies has had numerous applications in plant, microbial and human genomics during the past decade. However, NGS trades high read throughput for shorter read length, which makes genome assembly more difficult. This research presents a comparison of traditional versus Cloud computing-based genome assembly software, using as examples the Velvet and Contrail assemblers and reads from the genome sequence of the zebrafish (Danio rerio) model organism. Results: The first phase of the analysis involved a subset of the zebrafish data set (2x coverage); the best results were obtained using a k-mer size of 65, and Velvet took less time than Contrail to complete the assembly. In the next phase, genome assembly was attempted using the full dataset at 192x read coverage: Velvet failed to complete on a compute server with 256 GB of memory, while Contrail completed but required 240 hours of computation. Conclusion: This research concludes that the size of the dataset and the available computing hardware should be taken into consideration when deciding which assembler to use. For a relatively small sequencing dataset, such as a microbial or small eukaryotic genome, the Velvet assembler is a good option. However, for larger datasets Velvet requires large-memory compute servers on the order of 1000 GB or more. On the other hand, Contrail is implemented using Hadoop, which performs the assembly in parallel across nodes of a compute cluster. Furthermore, Hadoop clusters can be rented on demand from Cloud computing providers, and therefore Contrail can provide a simple and cost-effective way to assemble genomes from data generated at laboratories that lack the infrastructure or funds to build their own clusters.
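To make the role of the k-mer size concrete, the following is a minimal sketch of how reads can be split into overlapping k-mers and turned into de Bruijn graph edges, the general approach that Velvet and Contrail are usually described as using. The function names and the toy reads are assumptions made for illustration; this is not code from either assembler, and the abstract's k-mer size of 65 is replaced by k=4 so the toy reads stay short.

```python
from collections import Counter


def kmers(read, k):
    """Yield every overlapping substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def kmer_counts(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def de_bruijn_edges(counts):
    """Turn each distinct k-mer into an edge from its (k-1)-prefix to its (k-1)-suffix."""
    return sorted((kmer[:-1], kmer[1:]) for kmer in counts)


# Toy reads (hypothetical); real NGS reads are far longer and far more numerous.
reads = ["ACGTAC", "CGTACG"]
counts = kmer_counts(reads, k=4)
edges = de_bruijn_edges(counts)
```

The table of k-mer counts grows with the size and coverage of the dataset, which is one plausible reason a single-server assembler can run out of memory on a large genome while a cluster-based one can spread the same table across nodes.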

    Bio-Docklets: virtualization containers for single-step execution of NGS pipelines

    Processing of next-generation sequencing (NGS) data requires significant technical skills, involving installation, configuration, and execution of bioinformatics data pipelines, in addition to specialized postanalysis visualization and data mining software. In order to address some of these challenges, developers have leveraged virtualization containers toward seamless deployment of preconfigured bioinformatics software and pipelines on any computational platform. We present an approach for abstracting the complex data operations of multistep bioinformatics pipelines for NGS data analysis. As examples, we have deployed 2 pipelines for RNA sequencing and chromatin immunoprecipitation sequencing, preconfigured within Docker virtualization containers we call Bio-Docklets. Each Bio-Docklet exposes a single data input and output endpoint, so that from a user perspective running the pipelines is as simple as running a single bioinformatics tool. This is achieved using a “meta-script” that automatically starts the Bio-Docklets and controls the pipeline execution through the BioBlend software library and the Galaxy Application Programming Interface. The pipeline output is postprocessed by integration with the Visual Omics Explorer framework, providing interactive data visualizations that users can access through a web browser. Our goal is to enable easy access to NGS data analysis pipelines for nonbioinformatics experts on any computing environment, whether a laboratory workstation, university computer cluster, or a cloud service provider. Beyond end users, Bio-Docklets also enable developers to programmatically deploy and run a large number of pipeline instances for concurrent analysis of multiple datasets.

    In Vitro Mutational and Bioinformatics Analysis of the M71 Odorant Receptor and Its Superfamily

    We performed an extensive mutational analysis of the canonical mouse odorant receptor (OR) M71 to determine the properties of ORs that inhibit plasma membrane trafficking in heterologous expression systems. We used the M71::GFP fusion protein to directly assess plasma membrane localization and functionality of M71 in heterologous cells in vitro or in olfactory sensory neurons (OSNs) in vivo. OSN expression of M71::GFP shows only small differences in activity compared to untagged M71. However, M71::GFP could not traffic to the plasma membrane even in the presence of the proposed accessory proteins RTP1S or mβ2AR. To ask whether ORs contain an internal “kill sequence”, we mutated ~15 of the most highly conserved OR-specific amino acids not found amongst the trafficking non-OR GPCR superfamily; none of these mutants rescued trafficking. Addition of various amino terminal signal sequences or different glycosylation motifs all failed to produce trafficking. Neither the addition of the amino and carboxy terminal domains of mβ2AR nor the mutation Y289A in the highly conserved GPCR motif NPxxY rescued plasma membrane trafficking. The failure of targeted mutagenesis to rescue plasma membrane localization in heterologous cells suggests that OR trafficking deficits may not be attributable to conserved collinear motifs, but rather to the overall amino acid composition of the OR family. Thus, we performed an in silico analysis comparing the OR and other amine receptor superfamilies. We find that ORs contain fewer charged residues and more hydrophobic residues distributed throughout the protein, and a conserved overall amino acid composition. From our analysis, we surmise that it may be difficult to traffic ORs at high levels to the cell surface in vitro without making significant amino acid modifications. Finally, we observed specific increases in methionine and histidine residues as well as a marked decrease in tryptophan residues, suggesting that these changes provide ORs with special characteristics needed for them to function in olfactory neurons.
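The composition comparison described in this abstract can be illustrated with a small sketch that computes per-residue frequencies and the fraction of a sequence falling in a residue class. The residue groupings (D, E, K, R as charged; A, V, L, I, M, F, W as hydrophobic) and the function names are assumptions chosen for illustration; the abstract does not say which groupings or tools the authors used.

```python
from collections import Counter

# Assumed residue classes; other conventions exist (for example, counting H as charged).
CHARGED = set("DEKR")
HYDROPHOBIC = set("AVLIMFW")


def composition(seq):
    """Return the fraction of the sequence made up by each amino acid present."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.items()}


def class_fraction(seq, residues):
    """Return the fraction of positions whose residue belongs to the given class."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(1 for aa in seq if aa in residues) / len(seq)


# Hypothetical toy peptides, not real receptor sequences.
charged_share = class_fraction("DEKRAAAA", CHARGED)
hydrophobic_share = class_fraction("AVLIDDDD", HYDROPHOBIC)
```

Averaging these fractions over every member of two receptor families would give the kind of family-level comparison the abstract reports.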

    Fibronectin and androgen receptor expression data in prostate cancer obtained from a RNA-sequencing bioinformatics analysis

    Prostate cancer is the second most commonly diagnosed male cancer in the world. The molecular mechanisms underlying its development and progression are still unclear. Here we show analysis of a prostate cancer RNA-sequencing dataset that was originally generated by Ren et al. [3] from the prostate tumor and adjacent normal tissues of 14 patients. The data presented here were analyzed using our RNA-sequencing bioinformatics analysis pipeline implemented on the bioinformatics web platform, Galaxy. The relative expression of fibronectin (FN1) and the androgen receptor (AR) was calculated and is reported in fragments per kilobase of transcript per million mapped reads (FPKM). A subanalysis is also shown for data from three patients, including the relative expression of FN1 and AR and their fold change. For interpretation and discussion, please refer to the article, “miR-1207-3p regulates the androgen receptor in prostate cancer via FNDC1/fibronectin” [1] by Das et al.
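For readers unfamiliar with the unit, the sketch below applies the standard FPKM definition (fragments mapped to a gene, divided by the gene length in kilobases and by the total mapped fragments in millions) and a simple tumor-to-normal fold change. The function names and the example numbers are hypothetical and are not taken from the dataset; the pipeline's own implementation is not described in the abstract.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)


def fold_change(tumor_fpkm, normal_fpkm):
    """Ratio of tumor to normal expression; undefined when normal expression is zero."""
    if normal_fpkm == 0:
        raise ValueError("normal expression is zero; fold change is undefined")
    return tumor_fpkm / normal_fpkm


# Hypothetical numbers for illustration only.
normal = fpkm(fragments=500, gene_length_bp=2000, total_mapped_fragments=10_000_000)
tumor = fpkm(fragments=1000, gene_length_bp=2000, total_mapped_fragments=10_000_000)
change = fold_change(tumor, normal)
```

Because FPKM normalizes by both gene length and library size, values for FN1 and AR can be compared across samples with different sequencing depths.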

    RSEQREP: RNA-Seq Reports, an open-source cloud-enabled framework for reproducible RNA-Seq data processing, analysis, and result reporting

    RNA-Seq is increasingly being used to measure human RNA expression on a genome-wide scale. Expression profiles can be interrogated to identify and functionally characterize treatment-responsive genes. Ultimately, such controlled studies promise to reveal insights into molecular mechanisms of treatment effects, identify biomarkers, and realize personalized medicine. RNA-Seq Reports (RSEQREP) is a new open-source cloud-enabled framework that allows users to execute start-to-end gene-level RNA-Seq analysis on a preconfigured RSEQREP Amazon Virtual Machine Image (AMI) hosted by AWS or on their own Ubuntu Linux machine. The framework works with unstranded, stranded, and paired-end sequence FASTQ files stored locally, on Amazon Simple Storage Service (S3), or at the Sequence Read Archive (SRA). RSEQREP automatically executes a series of customizable steps including reference alignment, CRAM compression, reference alignment QC, data normalization, multivariate data visualization, identification of differentially expressed genes, heatmaps, co-expressed gene clusters, enriched pathways, and a series of custom visualizations. The framework outputs a file collection that includes a dynamically generated PDF report using R, knitr, and LaTeX, as well as publication-ready table and figure files. A user-friendly configuration file handles sample metadata entry, processing, analysis, and reporting options. The configuration supports time series RNA-Seq experimental designs with at least one pre- and one post-treatment sample for each subject, as well as multiple treatment groups and specimen types. All RSEQREP analysis components are built using open-source R code and R/Bioconductor packages, allowing for further customization. As a use case, we provide RSEQREP results for a trivalent influenza vaccine (TIV) RNA-Seq study that collected 1 pre-TIV and 10 post-TIV vaccination samples (days 1-10) for 5 subjects and two specimen types (peripheral blood mononuclear cells and B-cells).

    Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data

    Background: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard disks that are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. Results: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference, identifying non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP, and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (hive.biochemistry.gwu.edu/tools/csr/SRARecords_Curated.php). Conclusions: Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
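As an illustration of what "non-synonymous" means in this abstract, the sketch below translates the reference and altered codons with the standard genetic code and reports whether a single-nucleotide change alters the encoded amino acid. It is a minimal sketch under stated assumptions: the function name, the 0-based coordinate convention, and the toy coding sequence are hypothetical, and this is not the method used by HIVE or SNVDis.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(cds, pos, alt):
    """Classify a substitution at 0-based position pos of an in-frame coding sequence.

    Returns (kind, reference amino acid, altered amino acid).
    """
    offset = pos % 3
    start = pos - offset
    ref_codon = cds[start:start + 3].upper()
    if len(ref_codon) != 3:
        raise ValueError("position falls in an incomplete codon")
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


# Toy coding sequence ATG GCT (Met-Ala), hypothetical.
silent = classify_snv("ATGGCT", 5, "C")    # GCT -> GCC
missense = classify_snv("ATGGCT", 4, "A")  # GCT -> GAT
```

A production workflow would also need transcript annotations, strand handling, and splice-aware coordinates, none of which this sketch attempts.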

    Comparing Microbiome Sampling Methods in a Wild Mammal: Fecal and Intestinal Samples Record Different Signals of Host Ecology, Evolution

    Processing of multimodal information is essential for an organism to respond to environmental events. However, how multimodal integration in neurons translates into behavior is far from clear. Here, we investigate integration of biologically relevant visual and auditory information in the goldfish startle escape system, in which paired Mauthner-cells (M-cells) initiate the behavior. Sound pips and visual looms, as well as multimodal combinations of these stimuli, were tested for their effectiveness in evoking the startle response. Results showed that adding a low-intensity sound early during a visual loom (low visual effectiveness) produced a supralinear increase in startle responsiveness compared to the increase expected from a linear summation of the two unimodal stimuli. In contrast, adding a sound pip late during the loom (high visual effectiveness) increased responsiveness consistent with a linear multimodal integration of the two stimuli. Together the results confirm the Inverse Effectiveness Principle (IEP) of multimodal integration proposed in other species. Given the well-established role of the M-cell as a multimodal integrator, these results suggest that IEP is computed in individual neurons that initiate vital behavioral decisions.

    RSEQREP: RNA-Seq Reports, an open-source cloud-enabled framework for reproducible RNA-Seq data processing, analysis, and result reporting [version 2; referees: 2 approved]

    RNA-Seq is increasingly being used to measure human RNA expression on a genome-wide scale. Expression profiles can be interrogated to identify and functionally characterize treatment-responsive genes. Ultimately, such controlled studies promise to reveal insights into molecular mechanisms of treatment effects, identify biomarkers, and realize personalized medicine. RNA-Seq Reports (RSEQREP) is a new open-source cloud-enabled framework that allows users to execute start-to-end gene-level RNA-Seq analysis on a preconfigured RSEQREP Amazon Virtual Machine Image (AMI) hosted by AWS or on their own Ubuntu Linux machine via a Docker container or installation script. The framework works with unstranded, stranded, and paired-end sequence FASTQ files stored locally, on Amazon Simple Storage Service (S3), or at the Sequence Read Archive (SRA). RSEQREP automatically executes a series of customizable steps including reference alignment, CRAM compression, reference alignment QC, data normalization, multivariate data visualization, identification of differentially expressed genes, heatmaps, co-expressed gene clusters, enriched pathways, and a series of custom visualizations. The framework outputs a file collection that includes a dynamically generated PDF report using R, knitr, and LaTeX, as well as publication-ready table and figure files. A user-friendly configuration file handles sample metadata entry, processing, analysis, and reporting options. The configuration supports time series RNA-Seq experimental designs with at least one pre- and one post-treatment sample for each subject, as well as multiple treatment groups and specimen types. All RSEQREP analysis components are built using open-source R code and R/Bioconductor packages, allowing for further customization. As a use case, we provide RSEQREP results for a trivalent influenza vaccine (TIV) RNA-Seq study that collected 1 pre-TIV and 10 post-TIV vaccination samples (days 1-10) for 5 subjects and two specimen types (peripheral blood mononuclear cells and B-cells).