262 research outputs found

    EFSIS: Ensemble Feature Selection Integrating Stability

    Ensemble learning, which combines the predictions of multiple learners, has been widely applied in pattern recognition and has been reported to be more robust and accurate than individual learners. This ensemble logic has recently also been applied increasingly to feature selection. There are two basic strategies for ensemble feature selection: data perturbation and function perturbation. Data perturbation performs feature selection on data subsets sampled from the original dataset and then selects the features consistently ranked highly across those subsets. This has been found to improve both the stability of the selector and the prediction accuracy of a classifier. Function perturbation frees the user from having to decide on the most appropriate selector for any given situation and works by aggregating multiple selectors. This has been found to maintain or improve classification performance. Here we propose a framework, EFSIS, combining these two strategies. Empirical results indicate that EFSIS gives both high prediction accuracy and stability. Comment: 20 pages, 3 figures
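
The two strategies named in this abstract can be illustrated with a small sketch. It is not the EFSIS algorithm itself, whose details the abstract does not give. It assumes bootstrap resampling for data perturbation, two simple hypothetical scorers (`mean_diff_score`, `correlation_score`) for function perturbation, and mean-rank aggregation to combine them. All function names, parameters, and the toy data are illustrative assumptions.

```python
import random
from statistics import mean, pstdev


def mean_diff_score(column, labels):
    """Hypothetical selector 1: absolute difference between class means."""
    pos = [x for x, y in zip(column, labels) if y == 1]
    neg = [x for x, y in zip(column, labels) if y == 0]
    if not pos or not neg:
        return 0.0  # degenerate resample containing a single class
    return abs(mean(pos) - mean(neg))


def correlation_score(column, labels):
    """Hypothetical selector 2: absolute Pearson correlation with the label."""
    sx, sy = pstdev(column), pstdev(labels)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(column), mean(labels)
    cov = mean((x - mx) * (y - my) for x, y in zip(column, labels))
    return abs(cov / (sx * sy))


def rank_features(rows, labels, scorer):
    """Return a rank per feature (1 = best) under one scorer."""
    n_features = len(rows[0])
    scores = [scorer([r[j] for r in rows], labels) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: -scores[j])
    ranks = [0] * n_features
    for position, j in enumerate(order):
        ranks[j] = position + 1
    return ranks


def ensemble_select(rows, labels, scorers, n_subsets=20, k=2, seed=0):
    """Combine data perturbation (bootstrap) and function perturbation (scorers).

    Aggregation by mean rank is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    rank_sums = [0] * n_features
    runs = 0
    for _ in range(n_subsets):
        # Data perturbation: resample the rows with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        # Function perturbation: apply every selector to the same subset.
        for scorer in scorers:
            ranks = rank_features(sub_rows, sub_labels, scorer)
            for j in range(n_features):
                rank_sums[j] += ranks[j]
            runs += 1
    mean_ranks = [s / runs for s in rank_sums]
    return sorted(range(n_features), key=lambda j: mean_ranks[j])[:k]


def make_toy_data(n=40, seed=0):
    """Toy data: feature 0 tracks the label, features 1-4 are pure noise."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    rows = [[3.0 * y + rng.gauss(0, 0.5)] + [rng.gauss(0, 1) for _ in range(4)]
            for y in labels]
    return rows, labels
```

Calling `ensemble_select(rows, labels, [mean_diff_score, correlation_score])` on the output of `make_toy_data()` returns the indices of the `k` features with the lowest mean rank across all subset and selector combinations. Features that rank highly only under one selector or one resample are pushed down, which is the intuition behind the stability claim.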

    Repeats and EST analysis for new organisms

    Background: Repeat masking is an important step in the EST analysis pipeline. For new species, genomic knowledge is scarce and good repeat libraries are typically unavailable. In these cases it is common practice to mask against known repeats from other species (i.e., model organisms). Few studies investigate the effectiveness of this approach or attempt to evaluate the different methods for identifying and masking repeats. Results: Using zebrafish and medaka as example organisms, we show that accurate repeat masking is an important factor for obtaining a high-quality clustering. Furthermore, we show that masking with standard repeat libraries based on curated genomic information from other species has little or no positive effect on the quality of the resulting EST clustering. Library-based repeat masking, which often constitutes a computational bottleneck in the EST analysis pipeline, can therefore be reduced to species-specific repeat libraries, or perhaps eliminated entirely. In contrast, substantially improved results can be achieved by applying a repeat library derived from a partial reference clustering (e.g., from mapping sequences against a partially sequenced genome). Conclusion: Of the methods explored, we find that the best EST clustering is achieved after masking with repeat libraries that are species specific. In the absence of such libraries, library-less masking gives results superior to the current practice of using cross-species, genome-based libraries.

    RASflow: an RNA-Seq analysis workflow with Snakemake

    Background: With the cost of DNA sequencing decreasing, increasing amounts of RNA-Seq data are being generated, giving novel insight into gene expression and regulation. Prior to analysis of gene expression, the RNA-Seq data has to be processed through a number of steps resulting in a quantification of expression of each gene/transcript in each of the analyzed samples. A number of workflows are available to help researchers perform these steps on their own data, or on public data to take advantage of novel software or reference data in data re-analysis. However, many of the existing workflows are limited to specific types of studies. We therefore aimed to develop a maximally general workflow, applicable to a wide range of data and analysis approaches and at the same time supporting research on both model and non-model organisms. Furthermore, we aimed to make the workflow usable also for users with limited programming skills. Results: Utilizing the workflow management system Snakemake and the package management system Conda, we have developed a modular, flexible and user-friendly RNA-Seq analysis workflow: RNA-Seq Analysis Snakemake Workflow (RASflow). Utilizing Snakemake and Conda alleviates challenges with library dependencies and version conflicts and also supports reproducibility. To be applicable for a wide variety of applications, RASflow supports the mapping of reads to both genomic and transcriptomic assemblies. RASflow has a broad range of potential users: it can be applied by researchers interested in any organism and, since it requires no programming skills, it can be used by researchers with different backgrounds. The source code of RASflow is available on GitHub: https://github.com/zhxiaokang/RASflow. Conclusions: RASflow is a simple and reliable RNA-Seq analysis workflow covering many use cases.

    TMM@: a web application for the analysis of transmembrane helix mobility

    Background: To understand the mechanism by which a protein transmits a signal through the cell membrane, an understanding of the flexibility of its transmembrane (TM) region is essential. Normal Mode Analysis (NMA) has become the method of choice to investigate the slowest motions in macromolecular systems. It has been widely used to study transmembrane channels and pumps. It relies on the hypothesis that the vibrational normal modes having the lowest frequencies (also named soft modes) describe the largest movements in a protein and are the ones that are functionally relevant. In particular, NMA can be used to study the dynamics of TM regions, but no tool making this approach available to non-experts has existed so far. Results: We developed the web application TMM@ (TransMembrane α-helical Mobility analyzer). It uses NMA to characterize the propensity of transmembrane α-helices to be displaced. Starting from a structure file in PDB format, the server computes the normal modes of the protein and identifies which helices in the bundle are the most mobile. Each analysis is performed independently of the others, and results can be visualized using only a web browser. No additional plug-in or software is required. For users who would like to further analyze the output data with their favourite software, raw results can also be downloaded. Conclusion: We built a novel and unique tool, TMM@, to study the mobility of transmembrane α-helices. The tool can be applied to, for example, membrane transporters, and provides biologists studying transmembrane proteins with an approach to investigate which α-helices are likely to undergo the largest displacements, and hence which helices are most likely to be involved in the transportation of molecules in and out of the cell.

    Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data

    Biomarkers are of great importance in many fields, such as cancer research, toxicology, and the diagnosis and treatment of diseases, and for better understanding biological response mechanisms to internal or external intervention. High-throughput gene expression profiling technologies, such as DNA microarrays and RNA sequencing, provide large gene expression data sets which enable data-driven biomarker discovery. Traditional statistical tests have been the mainstream approach for identifying differentially expressed genes as biomarkers. In recent years, machine learning techniques such as feature selection have gained more popularity. Given the many options, picking the most appropriate method for a particular dataset becomes essential. Different evaluation metrics have therefore been proposed. Because a method is evaluated on different aspects and its performance varies across datasets, the idea of integrating multiple methods has emerged. Many integration strategies have been proposed and have shown great potential. This chapter gives an overview of the current research advances and existing issues in biomarker discovery using machine learning approaches on gene expression data.

    Reconstructing ribosomal genes from large scale total RNA meta-transcriptomic data

    Motivation: Technological advances in meta-transcriptomics have enabled a deeper understanding of the structure and function of microbial communities. ‘Total RNA’ meta-transcriptomics, sequencing of total reverse transcribed RNA, provides a unique opportunity to investigate both the structure and function of active microbial communities from all three domains of life simultaneously. A major step of this approach is the reconstruction of full-length taxonomic marker genes such as the small subunit ribosomal RNA. However, current tools for this purpose are mainly targeted towards analysis of amplicon and metagenomic data and thus lack the ability to handle the massive and complex datasets typically resulting from total RNA experiments. Results: In this work, we introduce MetaRib, a new tool for reconstructing ribosomal gene sequences from total RNA meta-transcriptomic data. MetaRib is based on the popular rRNA assembly program EMIRGE, together with several improvements. We address the challenge posed by large complex datasets by integrating sub-assembly, dereplication and mapping in an iterative approach, with additional post-processing steps. We applied the method to both simulated and real-world datasets. Our results show that MetaRib can deal with larger datasets and recover more rRNA genes, achieving a speedup of around 60 times and a higher F1 score than EMIRGE on simulated datasets. On the real-world dataset, it shows similar trends but recovers more contigs than a previous analysis based on random sub-sampling, while enabling the comparison of individual contig abundances across samples for the first time.

    Masking repeats while clustering ESTs

    A problem in EST clustering is the presence of repeat sequences. To avoid false matches, repeats have to be masked. This can be a time-consuming process, and it depends on available repeat libraries. We present a fast and effective method that aims to eliminate the problems repeats cause in the process of clustering. Unlike traditional methods, repeats are inferred directly from the EST data; we do not rely on any external library of known repeats. This makes the method especially suitable for analysing ESTs from organisms without good repeat libraries. We demonstrate that the result is very similar to performing standard repeat masking before clustering.

    Metagenome-assembled genome distribution and key functionality highlight importance of aerobic metabolism in Svalbard permafrost

    Permafrost underlies a large portion of the land in the Northern Hemisphere. It is proposed to be an extreme habitat and home to cold-adapted microbial communities. Upon thaw, permafrost is predicted to exacerbate the increasing global temperature trend, as awakening microbes decompose millennia-old carbon stocks. Yet our knowledge of the composition, functional potential and variance of the permafrost microbiome remains limited. In this study, we conducted a deep comparative metagenomic analysis through a 2 m permafrost core from Svalbard, Norway, to determine the key permafrost microbiome in this climate-sensitive island ecosystem. To do so, we developed comparative metagenomics methods on metagenome-assembled genomes (MAGs). We found that community composition in Svalbard soil horizons shifted markedly with depth: the dominant phyla switched from Acidobacteria and Proteobacteria in top soils (active layer) to Actinobacteria, Bacteroidetes, Chloroflexi and Proteobacteria in permafrost layers. Key metabolic potential propagated through permafrost depths revealed aerobic respiration and soil organic matter decomposition as key metabolic traits. We also found that Svalbard MAGs were enriched in genes involved in regulation of ammonium, sulfur and phosphate. Here, we provide a new perspective on how the permafrost microbiome is shaped to acquire resources in the competitive and resource-limited conditions of deep Svalbard soils.