Contamination detection and microbiome exploration with GRIMER
Background:
Contamination detection is an important step that should be carefully considered in the early stages of designing and performing microbiome studies to avoid biased outcomes. Detecting and removing true contaminants is challenging, especially in low-biomass samples or in studies lacking proper controls. Interactive visualizations and analysis platforms are crucial to guide this step and to help identify noisy patterns that could potentially be contamination. Additionally, external evidence, like aggregation of several contamination detection methods and the use of common contaminants reported in the literature, could help to discover and mitigate contamination.
Results:
We propose GRIMER, a tool that performs automated analyses and generates a portable and interactive dashboard integrating annotation, taxonomy, and metadata. It unifies several sources of evidence to help detect contamination. GRIMER is independent of quantification methods and directly analyzes contingency tables to create an interactive and offline report. Reports can be created in seconds and are accessible for nonspecialists, providing an intuitive set of charts to explore data distribution among observations and samples and its connections with external sources. Further, we compiled and used an extensive list of possible external contaminant taxa and common contaminants with 210 genera and 627 species reported in 22 published articles.
Conclusion:
GRIMER enables visual data exploration and analysis, supporting contamination detection in microbiome studies. The tool and data presented are open source and available at https://gitlab.com/dacs-hpi/grimer
Interpretable detection of novel human viruses from genome sequencing data
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics. Peer Reviewed
LazyFox: Fast and parallelized overlapping community detection in large graphs
The detection of communities in graph datasets provides insight about a
graph's underlying structure and is an important tool for various domains such
as social sciences, marketing, traffic forecast, and drug discovery. While most
existing algorithms provide fast approaches for community detection, their
results usually contain strictly separated communities. However, most datasets
would semantically allow for or even require overlapping communities that can
only be determined at much higher computational cost. We build on an efficient
algorithm, Fox, that detects such overlapping communities. Fox measures the
closeness of a node to a community by approximating the count of triangles
which that node forms with that community. We propose LazyFox, a multi-threaded
version of the Fox algorithm, which provides even faster detection without an
impact on community quality. This allows for the analyses of significantly
larger and more complex datasets. LazyFox enables overlapping community
detection on complex graph datasets with millions of nodes and billions of
edges in days instead of weeks. As part of this work, LazyFox's implementation
was published and is available as a tool under an MIT licence at
https://github.com/TimGarrels/LazyFox. Comment: 17 pages, 5 figures
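
The closeness measure described in the LazyFox abstract above, the number of triangles a node forms with a community, can be illustrated with a minimal sketch. This is not the Fox or LazyFox implementation: it counts triangles exactly rather than approximating them, it is single-threaded, and the function name, the adjacency-dictionary layout, and the toy graph are all assumptions made for illustration.

```python
from itertools import combinations


def triangles_with_community(adj, node, community):
    """Count triangles formed by `node` and two members of `community`.

    `adj` is an undirected graph given as a dict mapping each node to the
    set of its neighbours (a hypothetical layout chosen for this sketch).
    """
    # Neighbours of the node that are community members (the node itself excluded).
    inside = [v for v in adj[node] if v in community and v != node]
    # Every pair of such neighbours that is itself connected closes one triangle.
    return sum(1 for u, v in combinations(inside, 2) if v in adj[u])


# Toy undirected graph: edges a-b, a-c, a-d, b-c, c-d, d-e.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}
community = {"b", "c", "d"}

count_a = triangles_with_community(adj, "a", community)
count_e = triangles_with_community(adj, "e", community)
```

Here node `a` closes the triangles a-b-c and a-c-d with the community, while node `e` has only one neighbour inside it and so closes none. A higher count suggests the node sits closer to that community.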
Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides
Mass spectrometry-based proteomics provides a holistic snapshot of the entire protein set of living cells on a molecular level. Currently, only a few deep learning approaches exist that involve peptide fragmentation spectra, which represent partial sequence information of proteins. Commonly, these approaches lack the ability to characterize less studied or even unknown patterns in spectra because of their use of explicit domain knowledge. Here, to elevate unrestricted learning from spectra, we introduce ‘ad hoc learning of fragmentation’ (AHLF), a deep learning model that is end-to-end trained on 19.2 million spectra from several phosphoproteomic datasets. AHLF is interpretable, and we show that peak-level feature importance values and pairwise interactions between peaks are in line with corresponding peptide fragments. We demonstrate our approach by detecting post-translational modifications, specifically protein phosphorylation based on only the fragmentation spectrum without a database search. AHLF increases the area under the receiver operating characteristic curve (AUC) by an average of 9.4% on recent phosphoproteomic data compared with the current state of the art on this task. Furthermore, use of AHLF in rescoring search results increases the number of phosphopeptide identifications by a margin of up to 15.1% at a constant false discovery rate. To show the broad applicability of AHLF, we use transfer learning to also detect cross-linked peptides, as used in protein structure analysis, with an AUC of up to 94%
NITPICK: peak identification for mass spectrometry data
Background: The reliable extraction of features from mass spectra is a fundamental step in the automated analysis of proteomic mass spectrometry (MS) experiments. Results: This contribution proposes a sparse template regression approach to peak picking called NITPICK. NITPICK is a Non-greedy, Iterative Template-based peak PICKer that deconvolves complex overlapping isotope distributions in multicomponent mass spectra. NITPICK is based on fractional averagine, a novel extension to Senko's well-known averagine model, and on a modified version of sparse, non-negative least angle regression, for which a suitable, statistically motivated early stopping criterion has been derived. The strength of NITPICK is the deconvolution of overlapping mixture mass spectra. Conclusion: Extensive comparative evaluation has been carried out and results are provided for simulated and real-world data sets. NITPICK outperforms pepex, to date the only alternate, publicly available, non-greedy feature extraction routine. NITPICK is available as a software package for the R programming language and can be downloaded from http://hci.iwr.uni-heidelberg.de/mip/proteomics/.
SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented Data
Training sophisticated machine learning (ML) models requires large datasets
that are difficult or expensive to collect for many applications. If prior
knowledge about system dynamics is available, mechanistic representations can
be used to supplement real-world data. We present SimbaML (Simulation-Based
ML), an open-source tool that unifies realistic synthetic dataset generation
from ordinary differential equation-based models and the direct analysis and
inclusion in ML pipelines. SimbaML conveniently enables investigating transfer
learning from synthetic to real-world data, data augmentation, identifying
needs for data collection, and benchmarking physics-informed ML approaches.
SimbaML is available from https://pypi.org/project/simba-ml/. Comment: 6 pages, 1 figure
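
The idea in the SimbaML abstract above, generating synthetic training data from an ordinary differential equation model, can be sketched with the standard library alone. This sketch does not use or reproduce the SimbaML API. The SIR model, the explicit Euler integrator, the multiplicative Gaussian noise, and all function and parameter names are assumptions chosen for illustration.

```python
import random


def simulate_sir(beta, gamma, s0, i0, r0, steps, dt=0.1):
    """Integrate an SIR model with explicit Euler steps.

    Returns a list of (S, I, R) tuples of length steps + 1.
    """
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0  # total population, conserved by the model
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        recoveries = gamma * i * dt
        s, i, r = s - new_infections, i + new_infections - recoveries, r + recoveries
        trajectory.append((s, i, r))
    return trajectory


def add_noise(trajectory, rel_sd, seed=0):
    """Apply multiplicative Gaussian noise to mimic measurement error."""
    rng = random.Random(seed)  # fixed seed keeps the synthetic data reproducible
    return [
        tuple(max(0.0, x * (1 + rng.gauss(0, rel_sd))) for x in row)
        for row in trajectory
    ]


# Noise-free mechanistic trajectory and a noisy synthetic data set derived from it.
clean = simulate_sir(beta=0.4, gamma=0.1, s0=990.0, i0=10.0, r0=0.0, steps=500)
synthetic = add_noise(clean, rel_sd=0.05)
```

Rows of `synthetic` could then serve as additional training examples alongside real measurements, for instance to pretrain a model before fine-tuning on real-world data.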
rapmad: Robust analysis of peptide microarray data
Background: Peptide microarrays offer an enormous potential as a screening tool for peptidomics experiments and have recently seen an increased field of application ranging from immunological studies to systems biology. By allowing the parallel analysis of thousands of peptides in a single run they are suitable for high-throughput settings. Since data characteristics of peptide microarrays differ from DNA oligonucleotide microarrays, computational methods need to be tailored to these specifications to allow a robust and automated data analysis. While follow-up experiments can ensure the specificity of results, sensitivity cannot be recovered in later steps. Providing sensitivity is thus a primary goal of data analysis procedures. To this end we created rapmad (Robust Alignment of Peptide MicroArray Data), a novel computational tool implemented in R. Results: We evaluated rapmad in antibody reactivity experiments for several thousand peptide spots and compared it to two existing algorithms for the analysis of peptide microarrays. rapmad displays competitive and superior behavior to existing software solutions. Particularly, it shows substantially improved sensitivity for low intensity settings without sacrificing specificity. It thereby contributes to increasing the effectiveness of high throughput screening experiments. Conclusions: rapmad allows the robust and sensitive, automated analysis of high-throughput peptide array data. The rapmad R-package as well as the data sets are available from http://www.tron-mz.de/compmed
Parasitic Nematodes Exert Antimicrobial Activity and Benefit From Microbiota-Driven Support for Host Immune Regulation
Intestinal parasitic nematodes live in intimate contact with the host microbiota. Changes in the microbiome composition during nematode infection affect immune control of the parasites and shifts in the abundance of bacterial groups have been linked to the immunoregulatory potential of nematodes. Here we asked if the small intestinal parasite Heligmosomoides polygyrus produces factors with antimicrobial activity, senses its microbial environment and if the anti-nematode immune and regulatory responses are altered in mice devoid of gut microbes. We found that H. polygyrus excretory/secretory products exhibited antimicrobial activity against gram+/− bacteria. Parasites from germ-free mice displayed alterations in gene expression, comprising factors with putative antimicrobial functions such as chitinase and lysozyme. Infected germ-free mice developed increased small intestinal Th2 responses coinciding with a reduction in local Foxp3+RORγt+ regulatory T cells and decreased parasite fecundity. Our data suggest that nematodes sense their microbial surrounding and have evolved factors that limit the outgrowth of certain microbes. Moreover, the parasites benefit from microbiota-driven immune regulatory circuits, as an increased ratio of intestinal Th2 effector to regulatory T cells coincides with reduced parasite fitness in germ-free mice. Peer Reviewed
gNOMO: a multi-omics pipeline for integrated host and microbiome analysis of non-model organisms
The study of bacterial symbioses has grown exponentially in the recent past. However, existing bioinformatic workflows for microbiome data analysis commonly do not integrate multiple meta-omics levels and are mainly geared toward human microbiomes. Microbiota are better understood when analyzed in their biological context, that is, together with their host or environment. Nevertheless, this is a limitation when studying non-model organisms, mainly due to the lack of well-annotated sequence references. Here, we present gNOMO, a bioinformatic pipeline that is specifically designed to process and analyze non-model organism samples of up to three meta-omics levels (metagenomics, metatranscriptomics and metaproteomics) in an integrative manner. The pipeline has been developed using the workflow management framework Snakemake in order to obtain an automated and reproducible pipeline. Using experimental datasets of the German cockroach Blattella germanica, a non-model organism with a very complex gut microbiome, we show the capabilities of gNOMO with regard to meta-omics data integration, expression ratio comparison, taxonomic and functional analysis as well as intuitive output visualization. In conclusion, gNOMO is a bioinformatic pipeline that can easily be configured for integrating and analyzing multiple meta-omics data types and for producing output visualizations, specifically designed for integrating paired-end sequencing data with mass spectrometry from non-model organisms.
Improving tuberculosis surveillance by detecting international transmission using publicly available whole genome sequencing data
Improving the surveillance of tuberculosis (TB) is one of the eight core activities identified by the World Health Organization (WHO) and the European Respiratory Society to achieve TB elimination, defined as less than one incident case per million [1]. Monitoring transmission is especially important for multidrug-resistant (MDR) Mycobacterium tuberculosis isolates – defined as being resistant to rifampicin and isoniazid – and for extensively drug-resistant (XDR) M. tuberculosis isolates – defined as MDR isolates with additional resistance to at least one of the fluoroquinolones and at least one of the second-line injectable drugs. In 2017, the WHO estimated that worldwide more than 450,000 people fell ill with MDR-TB and among these, more than 38,000 fell ill with XDR-TB [2].
The rapid advance in molecular typing technology – especially the availability of whole genome sequencing (WGS) to identify and characterise pathogens – gives us the chance to integrate this information into disease surveillance. For TB surveillance, it is possible to combine the results of molecular typing of isolates from the M. tuberculosis complex with traditional epidemiological information to infer or to exclude TB transmission [3,4]. This is of particular relevance if transmission occurs among multiple countries, where epidemiological data such as social contacts are more difficult to get and where data exchange is more difficult to organise. The European Centre for Disease Prevention and Control (ECDC) reported 44 events of international transmission (international clusters) of MDR-TB in different European countries between 2012 and 2015 [5]. In that report, the authors inferred TB transmission using the mycobacterial interspersed repetitive units variable number of tandem repeats (MIRU-VNTR) typing method. However, this method has limitations such as low correlation with epidemiological information in outbreak settings and low discriminatory power [3,6]. In comparison, WGS analysis offers a much higher discriminatory power and allows inferring (or excluding) TB transmission at a higher resolution [4]. In a recent systematic review, van der Werf et al. identified three studies that used WGS to investigate the international transmission of TB [7].
In recent years, the amount of available WGS data has been increasing, especially because sequencing has become cheaper [8]. In addition, more and more authors deposit the raw data of their projects in open access public repositories such as the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) [9]. These publicly available raw WGS data for thousands of isolates enable re-use and additional analyses at a large, global scale [10]. For example, it is possible to compare genomic data among different studies or countries since the data are available in a single place. Moreover, new software tools can be tested using the same raw WGS data [11]. However, standards in bioinformatics analysis and interpretation of these WGS data for surveillance purposes are not yet fully established [12].
We aimed to assess the usefulness of raw WGS data of global MDR/XDR M. tuberculosis isolates available in public repositories to improve TB surveillance. Specifically, we wanted to identify potential international events of TB transmission and to compare the international isolates with a collection of M. tuberculosis isolates collected in Germany in 2012 and 2013. Peer Reviewed
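
The MDR and XDR definitions stated in the tuberculosis abstract above can be written as a small decision rule. This is a minimal sketch, not the authors' pipeline. The function name and the label strings are hypothetical, and the two drug sets are illustrative examples of each class rather than exhaustive lists.

```python
# Illustrative, non-exhaustive examples of each drug class (assumption of this sketch).
FLUOROQUINOLONES = {"levofloxacin", "moxifloxacin", "ofloxacin"}
SECOND_LINE_INJECTABLES = {"amikacin", "kanamycin", "capreomycin"}


def classify_resistance(resistant_to):
    """Label an isolate from the collection of drugs it is resistant to.

    MDR: resistant to both rifampicin and isoniazid.
    XDR: MDR plus resistance to at least one fluoroquinolone and
         at least one second-line injectable drug.
    """
    drugs = {d.lower() for d in resistant_to}
    if not {"rifampicin", "isoniazid"} <= drugs:
        return "non-MDR"
    if drugs & FLUOROQUINOLONES and drugs & SECOND_LINE_INJECTABLES:
        return "XDR"
    return "MDR"


label_mdr = classify_resistance(["Rifampicin", "Isoniazid", "moxifloxacin"])
label_xdr = classify_resistance(["rifampicin", "isoniazid", "moxifloxacin", "amikacin"])
label_other = classify_resistance(["isoniazid", "amikacin"])
```

The first isolate stays MDR because it lacks resistance to a second-line injectable. The third is not MDR at all, since rifampicin resistance is missing.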