Contamination detection and microbiome exploration with GRIMER
Background:
Contamination detection is an important step that should be carefully considered in the early stages of designing and performing microbiome studies to avoid biased outcomes. Detecting and removing true contaminants is challenging, especially in low-biomass samples or in studies lacking proper controls. Interactive visualizations and analysis platforms are crucial to better guide this step and to help identify noisy patterns that could potentially be contamination. Additionally, external evidence, like aggregation of several contamination detection methods and the use of common contaminants reported in the literature, could help to discover and mitigate contamination.
Results:
We propose GRIMER, a tool that performs automated analyses and generates a portable and interactive dashboard integrating annotation, taxonomy, and metadata. It unifies several sources of evidence to help detect contamination. GRIMER is independent of quantification methods and directly analyzes contingency tables to create an interactive and offline report. Reports can be created in seconds and are accessible for nonspecialists, providing an intuitive set of charts to explore data distribution among observations and samples and its connections with external sources. Further, we compiled and used an extensive list of possible external contaminant taxa and common contaminants with 210 genera and 627 species reported in 22 published articles.
Conclusion:
GRIMER enables visual data exploration and analysis, supporting contamination detection in microbiome studies. The tool and data presented are open source and available at https://gitlab.com/dacs-hpi/grimer
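As a rough illustration of how a list of commonly reported contaminants can serve as external evidence against a contingency table, the following minimal sketch flags listed taxa and reports their share of counts per sample. It is not GRIMER's implementation or API: the function names, the table layout, the sample names, and the two entries in `listed` are all assumptions made for this example.

```python
def flag_known_contaminants(table, contaminants):
    """Return the observations (taxa) in the table that appear on the contaminant list."""
    return sorted(taxon for taxon in table if taxon in contaminants)


def contaminant_fraction(table, samples, contaminants):
    """Compute, per sample, the share of counts assigned to listed contaminant taxa."""
    fractions = {}
    for idx, sample in enumerate(samples):
        total = sum(counts[idx] for counts in table.values())
        flagged = sum(
            counts[idx] for taxon, counts in table.items() if taxon in contaminants
        )
        # Guard against empty samples to avoid division by zero.
        fractions[sample] = flagged / total if total else 0.0
    return fractions


# Hypothetical contingency table: one row of counts per taxon, one column per sample.
samples = ["sample_1", "sample_2", "negative_control"]
table = {
    "Bacteroides": [120, 80, 0],
    "Ralstonia": [5, 10, 40],
    "Escherichia": [75, 110, 10],
}
# Placeholder list for illustration only; not taken from the compiled GRIMER list.
listed = {"Ralstonia", "Bradyrhizobium"}

flagged = flag_known_contaminants(table, listed)
fractions = contaminant_fraction(table, samples, listed)
```

A listed taxon that dominates a negative control but is rare elsewhere is only a candidate contaminant; in practice such a signal would be weighed together with the other sources of evidence.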
Interpretable detection of novel human viruses from genome sequencing data
Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can be applied also to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused a COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages not only enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.
Peer Reviewed
LazyFox: Fast and parallelized overlapping community detection in large graphs
The detection of communities in graph datasets provides insight about a
graph's underlying structure and is an important tool for various domains such
as social sciences, marketing, traffic forecast, and drug discovery. While most
existing algorithms provide fast approaches for community detection, their
results usually contain strictly separated communities. However, most datasets
would semantically allow for or even require overlapping communities that can
only be determined at much higher computational cost. We build on an efficient
algorithm, Fox, that detects such overlapping communities. Fox measures the
closeness of a node to a community by approximating the count of triangles
which that node forms with that community. We propose LazyFox, a multi-threaded
version of the Fox algorithm, which provides even faster detection without an
impact on community quality. This allows for the analyses of significantly
larger and more complex datasets. LazyFox enables overlapping community
detection on complex graph datasets with millions of nodes and billions of
edges in days instead of weeks. As part of this work, LazyFox's implementation
was published and is available as a tool under an MIT licence at
https://github.com/TimGarrels/LazyFox.
Comment: 17 pages, 5 figures
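The triangle-based closeness measure described above can be illustrated with a minimal sketch. Note the assumptions: this version counts triangles exactly, whereas the abstract states that Fox approximates the count; it is single-threaded; and the function name, the adjacency-set graph representation, and the toy graph are hypothetical and not taken from the LazyFox code base.

```python
from itertools import combinations


def triangles_with_community(graph, node, community):
    """Count triangles formed by `node` and two of its neighbours inside `community`."""
    # Neighbours of the node that are members of the community.
    inside = [v for v in graph[node] if v in community and v != node]
    # Every pair of such neighbours that is itself connected closes one triangle.
    return sum(1 for u, v in combinations(inside, 2) if v in graph[u])


# Hypothetical undirected graph stored as symmetric adjacency sets.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c", "e"},
    "e": {"d"},
}
community = {"b", "c", "d"}

count_a = triangles_with_community(graph, "a", community)
count_e = triangles_with_community(graph, "e", community)
```

Here node `a` closes the triangles a-b-c and a-c-d with the community, while node `e` has only one neighbour in it and therefore closes none, so `a` would be considered much closer to the community under this measure.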
Ad hoc learning of peptide fragmentation from mass spectra enables an interpretable detection of phosphorylated and cross-linked peptides
Mass spectrometry-based proteomics provides a holistic snapshot of the entire protein set of living cells on a molecular level. Currently, only a few deep learning approaches exist that involve peptide fragmentation spectra, which represent partial sequence information of proteins. Commonly, these approaches lack the ability to characterize less studied or even unknown patterns in spectra because of their use of explicit domain knowledge. Here, to elevate unrestricted learning from spectra, we introduce ‘ad hoc learning of fragmentation’ (AHLF), a deep learning model that is end-to-end trained on 19.2 million spectra from several phosphoproteomic datasets. AHLF is interpretable, and we show that peak-level feature importance values and pairwise interactions between peaks are in line with corresponding peptide fragments. We demonstrate our approach by detecting post-translational modifications, specifically protein phosphorylation based on only the fragmentation spectrum without a database search. AHLF increases the area under the receiver operating characteristic curve (AUC) by an average of 9.4% on recent phosphoproteomic data compared with the current state of the art on this task. Furthermore, use of AHLF in rescoring search results increases the number of phosphopeptide identifications by a margin of up to 15.1% at a constant false discovery rate. To show the broad applicability of AHLF, we use transfer learning to also detect cross-linked peptides, as used in protein structure analysis, with an AUC of up to 94%
Comprehensive evaluation of peptide de novo sequencing tools for monoclonal antibody assembly
Monoclonal antibodies are biotechnologically produced proteins with various applications in research, therapeutics and diagnostics. Their ability to recognize and bind to specific molecule structures makes them essential research tools and therapeutic agents. Sequence information of antibodies is helpful for understanding antibody–antigen interactions and ensuring their affinity and specificity. De novo protein sequencing based on mass spectrometry is a valuable method to obtain the amino acid sequence of peptides and proteins without a priori knowledge. In this study, we evaluated six recently developed de novo peptide sequencing algorithms (Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo), which were not specifically designed for antibody data. We validated their ability to identify and assemble antibody sequences on three multi-enzymatic data sets. The deep learning-based tools Casanovo and PointNovo showed an increased peptide recall across different enzymes and data sets compared with spectrum-graph-based approaches. We evaluated different error types of de novo peptide sequencing tools and their performance for different numbers of missing cleavage sites, noisy spectra and peptides of various lengths. We achieved a sequence coverage of 97.69–99.53% on the light chains of three different antibody data sets using the de Bruijn assembler ALPS and the predictions from Casanovo. However, low sequence coverage and accuracy on the heavy chains demonstrate that complete de novo protein sequencing remains a challenging issue in proteomics that requires improved de novo error correction, alternative digestion strategies and hybrid approaches such as homology search to achieve high accuracy on long protein sequences.
Peer Reviewed
NITPICK: peak identification for mass spectrometry data
Background:
The reliable extraction of features from mass spectra is a fundamental step in the automated analysis of proteomic mass spectrometry (MS) experiments.
Results:
This contribution proposes a sparse template regression approach to peak picking called NITPICK. NITPICK is a Non-greedy, Iterative Template-based peak PICKer that deconvolves complex overlapping isotope distributions in multicomponent mass spectra. NITPICK is based on fractional averagine, a novel extension to Senko's well-known averagine model, and on a modified version of sparse, non-negative least angle regression, for which a suitable, statistically motivated early stopping criterion has been derived. The strength of NITPICK is the deconvolution of overlapping mixture mass spectra.
Conclusion:
Extensive comparative evaluation has been carried out and results are provided for simulated and real-world data sets. NITPICK outperforms pepex, to date the only alternate, publicly available, non-greedy feature extraction routine. NITPICK is available as a software package for the R programming language and can be downloaded from http://hci.iwr.uni-heidelberg.de/mip/proteomics/.
HiTaC: a hierarchical taxonomic classifier for fungal ITS sequences compatible with QIIME2
Background
Fungi play a key role in several important ecological functions, ranging from organic matter decomposition to symbiotic associations with plants. Moreover, fungi naturally inhabit the human body and can be beneficial when administered as probiotics. In mycology, the internal transcribed spacer (ITS) region was adopted as the universal marker for classifying fungi. Hence, an accurate and robust method for ITS classification is not only desired for the purpose of better diversity estimation, but it can also help us gain a deeper insight into the dynamics of environmental communities and ultimately comprehend whether the abundance of certain species correlates with health and disease. Although many methods have been proposed for taxonomic classification, to the best of our knowledge, none of them fully explore the taxonomic tree hierarchy when building their models. This, in turn, leads to lower generalization power and a higher risk of committing classification errors.
Results
Here we introduce HiTaC, a robust hierarchical machine learning model for accurate ITS classification, which requires a small amount of data for training and can handle imbalanced datasets. HiTaC was thoroughly evaluated with the established TAXXI benchmark and could correctly classify fungal ITS sequences of varying lengths and a range of identity differences between the training and test data. HiTaC outperforms state-of-the-art methods when trained over noisy data, consistently achieving higher F1-score and sensitivity across different taxonomic ranks, improving sensitivity by 6.9 percentage points over top methods in the most noisy dataset available on TAXXI.
Conclusions
HiTaC is publicly available at the Python package index, BIOCONDA and Docker Hub. It is released under the new BSD license, allowing free use in academia and industry. Source code and documentation, which includes installation and usage instructions, are available at https://gitlab.com/dacs-hpi/hitac
PEPerMINT: peptide abundance imputation in mass spectrometry-based proteomics using graph neural networks
Motivation
Accurate quantitative information about protein abundance is crucial for understanding a biological system and its dynamics. Protein abundance is commonly estimated using label-free, bottom-up mass spectrometry (MS) protocols. Here, proteins are digested into peptides before quantification via MS. However, missing peptide abundance values, which can make up more than 50% of all abundance values, are a common issue. They result in missing protein abundance values, which then hinder accurate and reliable downstream analyses.
Results
To impute missing abundance values, we propose PEPerMINT, a graph neural network model working directly on the peptide level that flexibly takes both peptide-to-protein relationships in a graph format as well as amino acid sequence information into account. We benchmark our method against 11 common imputation methods on 6 diverse datasets, including cell lines, tissue, and plasma samples. We observe that PEPerMINT consistently outperforms other imputation methods. Its prediction performance remains high for varying degrees of missingness, different evaluation approaches, and differential expression prediction. As an additional novel feature, PEPerMINT provides meaningful uncertainty estimates and allows for tailoring imputation to the user’s needs based on the reliability of imputed values
SimbaML: Connecting Mechanistic Models and Machine Learning with Augmented Data
Training sophisticated machine learning (ML) models requires large datasets
that are difficult or expensive to collect for many applications. If prior
knowledge about system dynamics is available, mechanistic representations can
be used to supplement real-world data. We present SimbaML (Simulation-Based
ML), an open-source tool that unifies realistic synthetic dataset generation
from ordinary differential equation-based models and the direct analysis and
inclusion in ML pipelines. SimbaML conveniently enables investigating transfer
learning from synthetic to real-world data, data augmentation, identifying
needs for data collection, and benchmarking physics-informed ML approaches.
SimbaML is available from https://pypi.org/project/simba-ml/.
Comment: 6 pages, 1 figure
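The idea of generating synthetic training data from an ordinary differential equation-based model can be sketched as follows. This is a minimal illustration and does not use or reproduce the SimbaML API: the choice of an SIR model, the explicit Euler integrator, the Gaussian noise model, and all function and parameter names are assumptions made for this example.

```python
import random


def simulate_sir(beta, gamma, s0, i0, r0, steps, dt=0.1):
    """Integrate the SIR ordinary differential equations with explicit Euler steps."""
    s, i, r = s0, i0, r0
    n = s0 + i0 + r0  # total population, conserved by the model
    trajectory = [(s, i, r)]
    for _ in range(steps):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s = s - new_infections
        i = i + new_infections - new_recoveries
        r = r + new_recoveries
        trajectory.append((s, i, r))
    return trajectory


def add_noise(trajectory, sigma, seed=0):
    """Add Gaussian measurement noise to every compartment, clipping at zero."""
    rng = random.Random(seed)  # seeded so the synthetic dataset is reproducible
    return [
        tuple(max(0.0, value + rng.gauss(0.0, sigma)) for value in point)
        for point in trajectory
    ]


# Clean mechanistic trajectory and a noisy synthetic "measurement" of it.
clean = simulate_sir(beta=0.3, gamma=0.1, s0=990.0, i0=10.0, r0=0.0, steps=500)
synthetic = add_noise(clean, sigma=2.0)
```

Varying the kinetic parameters, initial conditions, and noise level would yield a family of synthetic time series that could supplement scarce real-world measurements, for example when pre-training a model before fine-tuning on real data.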
rapmad: Robust analysis of peptide microarray data
Background: Peptide microarrays offer an enormous potential as a screening tool for peptidomics experiments and have recently seen an increased field of application ranging from immunological studies to systems biology. By allowing the parallel analysis of thousands of peptides in a single run they are suitable for high-throughput settings. Since data characteristics of peptide microarrays differ from DNA oligonucleotide microarrays, computational methods need to be tailored to these specifications to allow a robust and automated data analysis. While follow-up experiments can ensure the specificity of results, sensitivity cannot be recovered in later steps. Providing sensitivity is thus a primary goal of data analysis procedures. To this end we created rapmad (Robust Alignment of Peptide MicroArray Data), a novel computational tool implemented in R. Results: We evaluated rapmad in antibody reactivity experiments for several thousand peptide spots and compared it to two existing algorithms for the analysis of peptide microarrays. rapmad displays competitive and superior behavior to existing software solutions. Particularly, it shows substantially improved sensitivity for low intensity settings without sacrificing specificity. It thereby contributes to increasing the effectiveness of high throughput screening experiments. Conclusions: rapmad allows the robust and sensitive, automated analysis of high-throughput peptide array data. The rapmad R-package as well as the data sets are available from http://www.tron-mz.de/compmed