
    Contamination detection and microbiome exploration with GRIMER

    Background: Contamination detection is an important step that should be carefully considered in the early stages of designing and performing microbiome studies to avoid biased outcomes. Detecting and removing true contaminants is challenging, especially in low-biomass samples or in studies lacking proper controls. Interactive visualization and analysis platforms are crucial to guide this step and to help identify noisy patterns that could indicate contamination. Additionally, external evidence, such as the aggregation of several contamination detection methods and the use of common contaminants reported in the literature, can help discover and mitigate contamination. Results: We propose GRIMER, a tool that performs automated analyses and generates a portable, interactive dashboard integrating annotation, taxonomy, and metadata. It unifies several sources of evidence to help detect contamination. GRIMER is independent of quantification methods and directly analyzes contingency tables to create an interactive, offline report. Reports can be created in seconds and are accessible to non-specialists, providing an intuitive set of charts to explore data distribution among observations and samples and their connections with external sources. Further, we compiled and used an extensive list of possible external contaminant taxa and common contaminants comprising 210 genera and 627 species reported in 22 published articles. Conclusion: GRIMER enables visual data exploration and analysis, supporting contamination detection in microbiome studies. The tool and data presented are open source and available at https://gitlab.com/dacs-hpi/grimer
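
    As a hedged illustration of one evidence source that such a dashboard can aggregate, the following Python sketch flags taxa observed in negative-control samples of a contingency table as candidate contaminants. The file layout and the "ctrl_" column naming are assumptions for illustration, not GRIMER's actual interface or method.

```python
# Illustrative sketch (not GRIMER's implementation): flag taxa that appear
# in negative-control samples of a contingency table as candidate contaminants.
# The input layout and the "ctrl_" column naming are assumptions.
import pandas as pd

# rows = taxa, columns = samples; counts from any quantification method
table = pd.read_csv("counts.tsv", sep="\t", index_col=0)

control_cols = [c for c in table.columns if c.startswith("ctrl_")]
sample_cols = [c for c in table.columns if c not in control_cols]

# A taxon seen in controls at a similar or higher relative abundance
# than in true samples is a candidate contaminant.
rel = table / table.sum(axis=0)          # relative abundance per sample
in_controls = (rel[control_cols] > 0).any(axis=1)
ctrl_mean = rel[control_cols].mean(axis=1)
sample_mean = rel[sample_cols].mean(axis=1)

candidates = table.index[in_controls & (ctrl_mean >= sample_mean)]
print(candidates.tolist())
```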

    Interpretable detection of novel human viruses from genome sequencing data

    Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages, not only enabling the analysis of NGS datasets without requiring any deep learning skills but also allowing advanced users to easily train and explain new models for genomics.
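
    To make the general idea concrete, here is a minimal sketch of a read-level convolutional classifier in Python/Keras. The architecture, read length, and training data below are illustrative assumptions, not the models or hyperparameters from the paper.

```python
# Minimal sketch of predicting a host phenotype directly from sequencing
# reads with a 1D CNN. One-hot encoded 250 bp reads are an assumption.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

READ_LEN = 250  # assumed read length

model = keras.Sequential([
    keras.Input(shape=(READ_LEN, 4)),          # A/C/G/T one-hot channels
    layers.Conv1D(128, 15, activation="relu"), # motif-like filters
    layers.GlobalMaxPooling1D(),               # motif presence anywhere in read
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # P(human-infecting)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])

# Random stand-in data: x should be one-hot reads, y binary host labels.
x = np.random.rand(32, READ_LEN, 4)
y = np.random.randint(0, 2, 32)
model.fit(x, y, epochs=1, batch_size=16)
```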

    LazyFox: Fast and parallelized overlapping community detection in large graphs

    The detection of communities in graph datasets provides insight into a graph's underlying structure and is an important tool for various domains such as the social sciences, marketing, traffic forecasting, and drug discovery. While most existing algorithms provide fast approaches to community detection, their results usually contain strictly separated communities. However, most datasets semantically allow for, or even require, overlapping communities, which can only be determined at much higher computational cost. We build on an efficient algorithm, Fox, that detects such overlapping communities. Fox measures the closeness of a node to a community by approximating the number of triangles that the node forms with that community. We propose LazyFox, a multi-threaded version of the Fox algorithm, which provides even faster detection without an impact on community quality. This allows for the analysis of significantly larger and more complex datasets. LazyFox enables overlapping community detection on complex graph datasets with millions of nodes and billions of edges in days instead of weeks. As part of this work, LazyFox's implementation was published and is available as a tool under an MIT licence at https://github.com/TimGarrels/LazyFox
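
    The triangle-based closeness measure can be sketched in a few lines of Python. Fox and LazyFox approximate this count for speed; the toy version below computes it exactly on an adjacency-set representation.

```python
# Count the triangles {node, u, v} that a node forms with a community,
# the closeness measure described above (computed exactly here).
from itertools import combinations

def triangle_closeness(adj, node, community):
    """Number of triangles formed by `node` with two community members."""
    neigh = adj[node] & community          # community members adjacent to node
    return sum(1 for u, v in combinations(neigh, 2) if v in adj[u])

# adj: node -> set of neighbours (undirected graph)
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(triangle_closeness(adj, 0, {1, 2, 3}))  # -> 2 (triangles 0-1-2, 0-2-3)
```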

    Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides

    Mass spectrometry-based proteomics provides a holistic snapshot of the entire protein set of living cells at the molecular level. Currently, only a few deep learning approaches exist that involve peptide fragmentation spectra, which represent partial sequence information of proteins. Commonly, these approaches lack the ability to characterize less studied or even unknown patterns in spectra because of their use of explicit domain knowledge. Here, to elevate unrestricted learning from spectra, we introduce ‘ad hoc learning of fragmentation’ (AHLF), a deep learning model that is end-to-end trained on 19.2 million spectra from several phosphoproteomic datasets. AHLF is interpretable, and we show that peak-level feature importance values and pairwise interactions between peaks are in line with corresponding peptide fragments. We demonstrate our approach by detecting post-translational modifications, specifically protein phosphorylation, based on only the fragmentation spectrum without a database search. AHLF increases the area under the receiver operating characteristic curve (AUC) by an average of 9.4% on recent phosphoproteomic data compared with the current state of the art on this task. Furthermore, use of AHLF in rescoring search results increases the number of phosphopeptide identifications by a margin of up to 15.1% at a constant false discovery rate. To show the broad applicability of AHLF, we use transfer learning to also detect cross-linked peptides, as used in protein structure analysis, with an AUC of up to 94%.
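
    As a hedged sketch of the kind of end-to-end input such a model consumes, the snippet below bins a fragmentation spectrum into a fixed-length intensity vector. The m/z range and bin width are assumptions for illustration, not AHLF's actual preprocessing.

```python
# Illustrative sketch (not AHLF itself): represent a fragmentation spectrum
# as a fixed-length intensity vector, so a neural network can consume it
# end-to-end without a database search. m/z range and bin width are assumed.
import numpy as np

def vectorize_spectrum(mz, intensity, mz_max=2000.0, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins, normalised to [0, 1]."""
    n_bins = int(mz_max / bin_width)
    vec = np.zeros(n_bins)
    idx = (np.asarray(mz) / bin_width).astype(int)
    keep = idx < n_bins                       # drop peaks beyond the m/z range
    np.add.at(vec, idx[keep], np.asarray(intensity)[keep])
    return vec / max(vec.max(), 1e-12)

spec = vectorize_spectrum([175.12, 504.26, 504.27], [1.0, 3.0, 2.0])
print(spec.shape, spec[504])  # the two peaks near 504 m/z share one bin
```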

    NITPICK: peak identification for mass spectrometry data

    Background: The reliable extraction of features from mass spectra is a fundamental step in the automated analysis of proteomic mass spectrometry (MS) experiments. Results: This contribution proposes a sparse template regression approach to peak picking called NITPICK. NITPICK is a Non-greedy, Iterative Template-based peak PICKer that deconvolves complex overlapping isotope distributions in multicomponent mass spectra. NITPICK is based on fractional averagine, a novel extension to Senko's well-known averagine model, and on a modified version of sparse, non-negative least angle regression, for which a suitable, statistically motivated early stopping criterion has been derived. The strength of NITPICK is the deconvolution of overlapping mixture mass spectra. Conclusion: Extensive comparative evaluation has been carried out, and results are provided for simulated and real-world data sets. NITPICK outperforms pepex, to date the only alternative publicly available non-greedy feature extraction routine. NITPICK is available as a software package for the R programming language and can be downloaded from http://hci.iwr.uni-heidelberg.de/mip/proteomics/
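
    The core idea of template regression can be illustrated with plain non-negative least squares: explain an observed spectrum as a non-negative combination of isotope-pattern templates, thereby deconvolving the overlap. The toy templates below stand in for NITPICK's fractional-averagine model, and NNLS stands in for its sparse least angle regression.

```python
# Sketch of template-based deconvolution: fit a non-negative combination of
# isotope-pattern templates to an observed spectrum. Templates are toy values,
# not NITPICK's averagine model.
import numpy as np
from scipy.optimize import nnls

# Each column is one template (an isotope pattern at a candidate mass).
templates = np.array([
    [1.0, 0.6, 0.2, 0.0, 0.0],   # component starting at bin 0
    [0.0, 1.0, 0.7, 0.3, 0.1],   # overlapping component starting at bin 1
]).T

observed = 2.0 * templates[:, 0] + 1.0 * templates[:, 1]  # mixed spectrum
coeffs, residual = nnls(templates, observed)
print(coeffs)  # ~[2.0, 1.0]: per-component abundances despite the overlap
```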

    SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented Data

    Training sophisticated machine learning (ML) models requires large datasets that are difficult or expensive to collect for many applications. If prior knowledge about system dynamics is available, mechanistic representations can be used to supplement real-world data. We present SimbaML (Simulation-Based ML), an open-source tool that unifies realistic synthetic dataset generation from ordinary differential equation-based models with their direct analysis and inclusion in ML pipelines. SimbaML conveniently enables investigating transfer learning from synthetic to real-world data, data augmentation, identifying needs for data collection, and benchmarking physics-informed ML approaches. SimbaML is available from https://pypi.org/project/simba-ml/
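
    A minimal sketch of the underlying workflow, not of SimbaML's actual API: simulate an ODE model, add observation noise, and feed the synthetic series to an ML model as training data. The logistic dynamics and noise level are arbitrary assumptions.

```python
# Sketch of ODE-based synthetic data generation for ML (the SimbaML idea,
# not its API): simulate, add noise, train.
import numpy as np
from scipy.integrate import solve_ivp
from sklearn.linear_model import Ridge

def logistic(t, y, r=0.5, k=100.0):   # assumed toy dynamics
    return r * y * (1 - y / k)

t = np.linspace(0, 20, 200)
sol = solve_ivp(logistic, (0, 20), [1.0], t_eval=t)
y = sol.y[0] + np.random.normal(0, 2.0, t.size)   # noisy synthetic observations

model = Ridge().fit(t.reshape(-1, 1), y)          # train on synthetic data
print(model.predict([[10.0]]))
```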

    rapmad: Robust analysis of peptide microarray data

    Background: Peptide microarrays offer enormous potential as a screening tool for peptidomics experiments and have recently seen an increased field of application, ranging from immunological studies to systems biology. By allowing the parallel analysis of thousands of peptides in a single run, they are suitable for high-throughput settings. Since the data characteristics of peptide microarrays differ from those of DNA oligonucleotide microarrays, computational methods need to be tailored to these specifications to allow a robust and automated data analysis. While follow-up experiments can ensure the specificity of results, sensitivity cannot be recovered in later steps. Providing sensitivity is thus a primary goal of data analysis procedures. To this end, we created rapmad (Robust Alignment of Peptide MicroArray Data), a novel computational tool implemented in R. Results: We evaluated rapmad in antibody reactivity experiments for several thousand peptide spots and compared it to two existing algorithms for the analysis of peptide microarrays. rapmad displays competitive to superior behavior compared with existing software solutions. In particular, it shows substantially improved sensitivity in low-intensity settings without sacrificing specificity, thereby contributing to the effectiveness of high-throughput screening experiments. Conclusions: rapmad allows the robust, sensitive, and automated analysis of high-throughput peptide array data. The rapmad R package as well as the data sets are available from http://www.tron-mz.de/compmed
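
    As an illustration of the sensitivity/specificity trade-off in spot calling (not rapmad's algorithm), the sketch below calls peptide spots reactive when their intensity exceeds a z-score threshold estimated from control spots; all intensities and the 3-sigma cutoff are assumptions.

```python
# Illustrative spot calling against a control-spot distribution
# (a generic baseline, not rapmad's method).
import numpy as np

rng = np.random.default_rng(0)
controls = rng.normal(100, 10, 500)                    # assumed control spots
spots = rng.normal(100, 10, 1000); spots[:50] += 60    # 50 truly reactive spots

# z-score each spot against the control distribution; 3 sigma is assumed
z = (spots - controls.mean()) / controls.std()
reactive = np.flatnonzero(z > 3)
print(len(reactive), "spots called reactive")
```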

    Parasitic Nematodes Exert Antimicrobial Activity and Benefit From Microbiota-Driven Support for Host Immune Regulation

    Intestinal parasitic nematodes live in intimate contact with the host microbiota. Changes in microbiome composition during nematode infection affect immune control of the parasites, and shifts in the abundance of bacterial groups have been linked to the immunoregulatory potential of nematodes. Here we asked whether the small intestinal parasite Heligmosomoides polygyrus produces factors with antimicrobial activity, senses its microbial environment, and whether the anti-nematode immune and regulatory responses are altered in mice devoid of gut microbes. We found that H. polygyrus excretory/secretory products exhibited antimicrobial activity against both Gram-positive and Gram-negative bacteria. Parasites from germ-free mice displayed alterations in gene expression, comprising factors with putative antimicrobial functions such as chitinase and lysozyme. Infected germ-free mice developed increased small intestinal Th2 responses, coinciding with a reduction in local Foxp3+RORγt+ regulatory T cells and decreased parasite fecundity. Our data suggest that nematodes sense their microbial surroundings and have evolved factors that limit the outgrowth of certain microbes. Moreover, the parasites benefit from microbiota-driven immune regulatory circuits, as an increased ratio of intestinal Th2 effector to regulatory T cells coincides with reduced parasite fitness in germ-free mice.

    gNOMO: a multi-omics pipeline for integrated host and microbiome analysis of non-model organisms

    The study of bacterial symbioses has grown exponentially in the recent past. However, existing bioinformatic workflows for microbiome data analysis commonly do not integrate multiple meta-omics levels and are mainly geared toward human microbiomes. Microbiota are better understood when analyzed in their biological context, that is, together with their host or environment. Nevertheless, this remains a limitation when studying non-model organisms, mainly due to the lack of well-annotated sequence references. Here, we present gNOMO, a bioinformatic pipeline specifically designed to process and analyze non-model organism samples at up to three meta-omics levels, metagenomics, metatranscriptomics, and metaproteomics, in an integrative manner. The pipeline has been developed using the workflow management framework Snakemake in order to obtain an automated and reproducible pipeline. Using experimental datasets of the German cockroach Blattella germanica, a non-model organism with a very complex gut microbiome, we show the capabilities of gNOMO with regard to meta-omics data integration, expression ratio comparison, taxonomic and functional analysis, and intuitive output visualization. In conclusion, gNOMO is a bioinformatic pipeline that can easily be configured for integrating and analyzing multiple meta-omics data types and for producing output visualizations, specifically designed for integrating paired-end sequencing data with mass spectrometry data from non-model organisms.
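
    For readers unfamiliar with Snakemake, the hypothetical Snakefile below sketches how such a pipeline chains meta-omics steps into a reproducible workflow. The rule names, file names, and shell commands are placeholders, not gNOMO's actual rules.

```python
# Hypothetical Snakefile sketch of a multi-omics workflow; all rule names,
# paths, and shell commands are illustrative placeholders.
rule all:
    input:
        "results/integrated_report.html"

rule assemble_metagenome:
    input: "raw/metagenome.fastq"
    output: "results/assembly.fasta"
    shell: "run_assembler {input} > {output}"    # placeholder command

rule build_protein_db:
    input: "results/assembly.fasta"
    output: "results/proteins.fasta"
    shell: "predict_orfs {input} > {output}"     # placeholder command

rule integrate:
    input:
        proteins="results/proteins.fasta",
        spectra="raw/metaproteome.mzML"
    output: "results/integrated_report.html"
    shell: "integrate_omics {input.proteins} {input.spectra} > {output}"
```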

    Improving tuberculosis surveillance by detecting international transmission using publicly available whole genome sequencing data

    Improving the surveillance of tuberculosis (TB) is one of the eight core activities identified by the World Health Organization (WHO) and the European Respiratory Society to achieve TB elimination, defined as less than one incident case per million [1]. Monitoring transmission is especially important for multidrug-resistant (MDR) Mycobacterium tuberculosis isolates, defined as being resistant to rifampicin and isoniazid, and for extensively drug-resistant (XDR) M. tuberculosis isolates, defined as MDR isolates with additional resistance to at least one of the fluoroquinolones and at least one of the second-line injectable drugs. In 2017, the WHO estimated that worldwide more than 450,000 people fell ill with MDR-TB and, among these, more than 38,000 fell ill with XDR-TB [2].

    The rapid advance in molecular typing technology, especially the availability of whole genome sequencing (WGS) to identify and characterise pathogens, gives us the chance to integrate this information into disease surveillance. For TB surveillance, it is possible to combine the results of molecular typing of isolates from the M. tuberculosis complex with traditional epidemiological information to infer or to exclude TB transmission [3,4]. This is of particular relevance when transmission occurs across multiple countries, where epidemiological data such as social contacts are more difficult to obtain and where data exchange is more difficult to organise. The European Centre for Disease Prevention and Control (ECDC) reported 44 events of international transmission (international clusters) of MDR-TB in different European countries between 2012 and 2015 [5]. In that report, the authors inferred TB transmission using the mycobacterial interspersed repetitive units variable number of tandem repeats (MIRU-VNTR) typing method. However, this method has limitations such as low correlation with epidemiological information in outbreak settings and low discriminatory power [3,6]. In comparison, WGS analysis offers a much higher discriminatory power and allows inferring (or excluding) TB transmission at a higher resolution [4]. In a recent systematic review, van der Werf et al. identified three studies that used WGS to investigate the international transmission of TB [7].

    In recent years, the amount of available WGS data has been increasing, especially because sequencing has become cheaper [8]. In addition, more and more authors deposit the raw data of their projects in open-access public repositories such as the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) [9]. These publicly available raw WGS data for thousands of isolates enable re-use and additional analyses at a large, global scale [10]. For example, it is possible to compare genomic data among different studies or countries since the data are available in a single place. Moreover, new software tools can be tested using the same raw WGS data [11]. However, standards in the bioinformatic analysis and interpretation of these WGS data for surveillance purposes are not yet fully established [12].

    We aimed to assess the usefulness of raw WGS data of global MDR/XDR M. tuberculosis isolates available in public repositories to improve TB surveillance. Specifically, we wanted to identify potential international events of TB transmission and to compare the international isolates with a collection of M. tuberculosis isolates collected in Germany in 2012 and 2013.
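
    The clustering step implied here can be sketched as follows: group isolates into putative transmission clusters when their pairwise SNP distance falls below a threshold. The 5-SNP cutoff and toy distances below are assumptions for illustration (a commonly cited convention in TB genomic epidemiology), not values taken from this study.

```python
# Sketch of SNP-distance clustering to flag putative transmission clusters.
# The distance matrix and 5-SNP threshold are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy pairwise SNP distance matrix for four isolates
dist = np.array([
    [0,  3, 40, 41],
    [3,  0, 39, 42],
    [40, 39, 0,  4],
    [41, 42, 4,  0],
])

clusters = fcluster(linkage(squareform(dist), method="single"),
                    t=5, criterion="distance")
print(clusters)  # -> isolates 1+2 and 3+4 form two putative clusters
```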