Repeats and EST analysis for new organisms

Authors: Jonassen Inge; Malde Ketil
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Repeat masking is an important step in the EST analysis pipeline. For new species, genomic knowledge is scarce and good repeat libraries are typically unavailable. In these cases it is common practice to mask against known repeats from other species (i.e., model organisms). Few studies investigate the effectiveness of this approach or attempt to evaluate the different methods for identifying and masking repeats. Results: Using zebrafish and medaka as example organisms, we show that accurate repeat masking is an important factor for obtaining a high-quality clustering. Furthermore, we show that masking with standard repeat libraries based on curated genomic information from other species has little or no positive effect on the quality of the resulting EST clustering. Library-based repeat masking, which often constitutes a computational bottleneck in the EST analysis pipeline, can therefore be reduced to species-specific repeat libraries, or perhaps eliminated entirely. In contrast, substantially improved results can be achieved by applying a repeat library derived from a partial reference clustering (e.g., from mapping sequences against a partially sequenced genome). Conclusion: Of the methods explored, we find that the best EST clustering is achieved after masking with repeat libraries that are species-specific. In the absence of such libraries, library-less masking gives results superior to the current practice of using cross-species, genome-based libraries.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

RASflow: an RNA-Seq analysis workflow with Snakemake

Authors: Jonassen Inge; Zhang Xiaokang
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2020

Background: With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated, giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq data has to be processed through a number of steps resulting in a quantification of the expression of each gene/transcript in each of the analyzed samples. A number of workflows are available to help researchers perform these steps on their own data, or on public data to take advantage of novel software or reference data in data re-analysis. However, many of the existing workflows are limited to specific types of studies. We therefore aimed to develop a maximally general workflow, applicable to a wide range of data and analysis approaches, that at the same time supports research on both model and non-model organisms. Furthermore, we aimed to make the workflow usable by users with limited programming skills. Results: Utilizing the workflow management system Snakemake and the package management system Conda, we have developed a modular, flexible and user-friendly RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow). Utilizing Snakemake and Conda alleviates challenges with library dependencies and version conflicts and also supports reproducibility. To be applicable to a wide variety of applications, RASflow supports the mapping of reads to both genomic and transcriptomic assemblies. RASflow has a broad range of potential users: it can be applied by researchers interested in any organism and, since it requires no programming skills, it can be used by researchers with different backgrounds. The source code of RASflow is available on GitHub: https://github.com/zhxiaokang/RASflow. Conclusions: RASflow is a simple and reliable RNA-Seq analysis workflow covering many use cases.

TMM@: a web application for the analysis of transmembrane helix mobility

Authors: Jonassen Inge; Reuter Nathalie; Skjaerven Lars
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: To understand the mechanism by which a protein transmits a signal through the cell membrane, an understanding of the flexibility of its transmembrane (TM) region is essential. Normal Mode Analysis (NMA) has become the method of choice to investigate the slowest motions in macromolecular systems. It has been widely used to study transmembrane channels and pumps. It relies on the hypothesis that the vibrational normal modes having the lowest frequencies (also named soft modes) describe the largest movements in a protein and are the ones that are functionally relevant. In particular, NMA can be used to study the dynamics of TM regions, but no tool making this approach available to non-experts has existed so far. Results: We developed the web application TMM@ (TransMembrane α-helical Mobility analyzer). It uses NMA to characterize the propensity of transmembrane α-helices to be displaced. Starting from a structure file in PDB format, the server computes the normal modes of the protein and identifies which helices in the bundle are the most mobile. Each analysis is performed independently of the others, and results can be visualized using only a web browser. No additional plug-in or software is required. For users who would like to further analyze the output data with their favourite software, raw results can also be downloaded. Conclusion: We built a novel and unique tool, TMM@, to study the mobility of transmembrane α-helices. The tool can be applied to, for example, membrane transporters, and provides biologists studying transmembrane proteins with an approach to investigate which α-helices are likely to undergo the largest displacements, and hence which helices are most likely to be involved in the transportation of molecules in and out of the cell.

Springer - Publisher Connector

PubMed Central

Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data

Authors: Goksøyr Anders; Jonassen Inge; Zhang Xiaokang
Publication venue: Exon Publications
Publication date: 01/01/2021

Biomarkers are of great importance in many fields, such as cancer research, toxicology, and the diagnosis and treatment of diseases, and for better understanding biological response mechanisms to internal or external intervention. High-throughput gene expression profiling technologies, such as DNA microarrays and RNA sequencing, provide large gene expression data sets which enable data-driven biomarker discovery. Traditional statistical tests have been the mainstream approach for identifying differentially expressed genes as biomarkers. In recent years, machine learning techniques such as feature selection have gained popularity. Given the many options, picking the most appropriate method for a particular dataset becomes essential, and different evaluation metrics have therefore been proposed. Because a method's performance, evaluated on different aspects, varies across datasets, the idea of integrating multiple methods has emerged. Many integration strategies have been proposed and have shown great potential. This chapter gives an overview of current research advances and existing issues in biomarker discovery using machine learning approaches on gene expression data.
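
As a rough illustration of what integrating multiple methods can mean, the sketch below combines several gene rankings by mean rank, one simple integration strategy. The abstract does not name a specific strategy, so this is an assumption for illustration only; the function name `aggregate_ranks`, the gene labels, and the requirement that every method ranks the same genes are all hypothetical.

```python
def aggregate_ranks(rankings):
    """Combine several gene rankings (best gene first) by mean rank.

    Assumption: every ranking contains the same set of genes.
    Ties in mean rank are broken alphabetically by gene name.
    """
    totals = {}
    for ranking in rankings:
        # Position 1 is the top-ranked gene for this method.
        for position, gene in enumerate(ranking, start=1):
            totals[gene] = totals.get(gene, 0) + position
    n_methods = len(rankings)
    return sorted(totals, key=lambda gene: (totals[gene] / n_methods, gene))


# Hypothetical rankings from three feature-selection methods.
rankings = [
    ["g1", "g2", "g3"],
    ["g2", "g1", "g3"],
    ["g2", "g3", "g1"],
]
consensus = aggregate_ranks(rankings)
print(consensus)
```

Mean rank is only one option; other aggregation rules weight methods differently, and nothing here reflects which strategies the chapter covers.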

Reconstructing ribosomal genes from large scale total RNA meta-transcriptomic data

Authors: Jonassen Inge; Lanzén Anders; Xue Yaxin
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2020

Motivation: Technological advances in meta-transcriptomics have enabled a deeper understanding of the structure and function of microbial communities. ‘Total RNA’ meta-transcriptomics, the sequencing of total reverse-transcribed RNA, provides a unique opportunity to investigate both the structure and function of active microbial communities from all three domains of life simultaneously. A major step of this approach is the reconstruction of full-length taxonomic marker genes such as the small subunit ribosomal RNA. However, current tools for this purpose are mainly targeted towards the analysis of amplicon and metagenomic data and thus lack the ability to handle the massive and complex datasets typically resulting from total RNA experiments. Results: In this work, we introduce MetaRib, a new tool for reconstructing ribosomal gene sequences from total RNA meta-transcriptomic data. MetaRib is based on the popular rRNA assembly program EMIRGE, together with several improvements. We address the challenge posed by large complex datasets by integrating sub-assembly, dereplication and mapping in an iterative approach, with additional post-processing steps. We applied the method to both simulated and real-world datasets. Our results show that MetaRib can deal with larger datasets and recover more rRNA genes, achieving a speedup of around 60 times and a higher F1 score than EMIRGE on simulated datasets. On the real-world dataset, it shows similar trends but recovers more contigs than a previous analysis based on random sub-sampling, while enabling the comparison of individual contig abundances across samples for the first time.
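
For readers unfamiliar with the F1 score mentioned above, the following is a minimal sketch of how it could be computed for a set of recovered genes against a known reference set. The abstract does not describe how reconstructed sequences were matched to reference sequences; matching by exact identifier, the function name `f1_score`, and the gene labels are assumptions made for illustration.

```python
def f1_score(recovered, reference):
    """F1 score: harmonic mean of precision and recall.

    Assumption: a recovered gene counts as correct when its
    identifier also appears in the reference set.
    """
    recovered, reference = set(recovered), set(reference)
    true_positives = len(recovered & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(recovered)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical identifiers: three of four recovered genes are in the reference.
score = f1_score(["a", "b", "c", "d"], ["a", "b", "c", "e", "f"])
print(round(score, 3))
```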

Masking repeats while clustering ESTs

Authors: Coward Eivind; Jonassen Inge; Malde Ketil; Schneeberger Korbinian
Publication venue: Oxford University Press
Publication date: 01/01/2005

A problem in EST clustering is the presence of repeat sequences. To avoid false matches, repeats have to be masked. This can be a time-consuming process, and it depends on available repeat libraries. We present a fast and effective method that aims to eliminate the problems repeats cause in the process of clustering. Unlike traditional methods, repeats are inferred directly from the EST data; we do not rely on any external library of known repeats. This makes the method especially suitable for analysing ESTs from organisms without good repeat libraries. We demonstrate that the result is very similar to performing standard repeat masking before clustering.
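
To make the idea of inferring repeats directly from the data concrete, here is a minimal sketch of one possible library-less approach: treat any k-mer that occurs in unusually many sequences as a repeat and mask it. The abstract does not describe the authors' actual algorithm, so this should not be read as their method; the function name `mask_frequent_kmers`, the parameters `k` and `max_seqs`, and the toy sequences are all hypothetical.

```python
from collections import Counter


def mask_frequent_kmers(seqs, k=4, max_seqs=2):
    """Replace with 'N' every position covered by a k-mer that
    occurs in more than max_seqs of the input sequences."""
    # Count how many distinct sequences contain each k-mer.
    seq_count = Counter()
    for seq in seqs:
        seq_count.update({seq[i:i + k] for i in range(len(seq) - k + 1)})

    masked = []
    for seq in seqs:
        chars = list(seq)
        for i in range(len(seq) - k + 1):
            # Look up the k-mer in the original sequence, not the masked copy.
            if seq_count[seq[i:i + k]] > max_seqs:
                chars[i:i + k] = "N" * k
        masked.append("".join(chars))
    return masked


# Toy data: "AAAA" occurs in three of the four sequences and is masked.
ests = ["AAAACGT", "TTAAAAG", "GAAAACC", "CGTACGA"]
print(mask_frequent_kmers(ests))
```

Real EST data would need a much larger `k` and a threshold tuned to sequencing depth, since highly expressed genes also produce frequent k-mers.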

CiteSeerX

PubMed Central

Metagenome-assembled genome distribution and key functionality highlight importance of aerobic metabolism in Svalbard permafrost

Authors: Jonassen Inge; Tas Neslihan; Xue Yaxin; Øvreås Lise
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2020

Permafrost underlies a large portion of the land in the Northern Hemisphere. It is proposed to be an extreme habitat and home for cold-adaptive microbial communities. Upon thaw, permafrost is predicted to exacerbate the trend of increasing global temperatures, as awakening microbes decompose millennia-old carbon stocks. Yet our knowledge of the composition, functional potential and variance of the permafrost microbiome remains limited. In this study, we conducted a deep comparative metagenomic analysis through a 2 m permafrost core from Svalbard, Norway, to determine the key permafrost microbiome in this climate-sensitive island ecosystem. To do so, we developed comparative metagenomics methods on metagenome-assembled genomes (MAGs). We found that community composition in Svalbard soil horizons shifted markedly with depth: the dominant phyla switched from Acidobacteria and Proteobacteria in top soils (active layer) to Actinobacteria, Bacteroidetes, Chloroflexi and Proteobacteria in permafrost layers. Key metabolic potential propagated through permafrost depths revealed aerobic respiration and soil organic matter decomposition as key metabolic traits. We also found that Svalbard MAGs were enriched in genes involved in the regulation of ammonium, sulfur and phosphate. Here, we provide a new perspective on how the permafrost microbiome is shaped to acquire resources in the competitive and resource-limited conditions of deep Svalbard soils.

eScholarship - University of California