NIERT: Accurate Numerical Interpolation through Unifying Scattered Data Representations using Transformer Encoder
Numerical interpolation for scattered data, i.e., estimating values for
target points based on those of some observed points, is widely used in
computational science and engineering. The existing approaches either require
explicitly pre-defined basis functions, which makes them inflexible and limits
their performance in practical scenarios, or train neural networks as
interpolators, which still have limited interpolation accuracy as they treat
observed and target points separately and cannot effectively exploit the
correlations among data points. Here, we present a learning-based approach to
numerical interpolation for scattered data using encoder representation of
Transformers (called NIERT). Unlike the recent learning-based approaches, NIERT
treats observed and target points in a unified fashion through embedding them
into the same representation space, thus gaining the advantage of effectively
exploiting the correlations among them. The specially designed partial
self-attention mechanism used by NIERT prevents target points from
interfering with the representations of observed points. We further show that the
partial self-attention is essentially a learnable interpolation module
combining multiple neural basis functions, which provides interpretability of
NIERT. Through pre-training on large-scale synthetic datasets, NIERT achieves
considerable improvement in interpolation accuracy for practical tasks. On both
synthetic and real-world datasets, NIERT outperforms the existing approaches,
e.g., on the TFRD-ADlet dataset for temperature field reconstruction, NIERT
achieves a substantially lower MAE than the state-of-the-art approach. The
source code of NIERT is available at https://anonymous.4open.science/r/NIERT-2BCF.
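One way to picture the partial self-attention described above is as a standard attention layer whose mask lets every point attend to observed points while blocking attention to target points, so that target points cannot disturb the representations of observed points. Below is a minimal PyTorch-style sketch under that reading, assuming a boolean flag per point marking it as observed; the tensor names and the self-attention detail for target points are illustrative, not taken from the NIERT code.

```python
import torch

def partial_self_attention_mask(is_observed: torch.Tensor) -> torch.Tensor:
    """Build an attention mask so that every point (observed or target)
    may attend to observed points, but no point attends to target points.

    is_observed: (n,) boolean tensor, True for observed points.
    Returns an (n, n) boolean mask where True marks allowed attention.
    """
    n = is_observed.shape[0]
    # Column j is usable as a key/value only if point j is observed,
    # so target points cannot interfere with the representations of
    # observed points (or of each other).
    allowed = is_observed.unsqueeze(0).expand(n, n).clone()
    # Let every point also attend to itself, so target points keep
    # their own query information.
    allowed |= torch.eye(n, dtype=torch.bool)
    return allowed
```

If this mask were fed to torch.nn.MultiheadAttention, the boolean convention would need to be inverted (PyTorch masks out positions marked True), e.g. by passing ~allowed as attn_mask.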
PI: An open-source software package for validation of the SEQUEST result and visualization of mass spectrum
Background: Tandem mass spectrometry (MS/MS) has emerged as the leading method for high-throughput protein identification in proteomics. Recent technological breakthroughs have dramatically increased the efficiency of MS/MS data generation. Meanwhile, sophisticated algorithms have been developed for identifying proteins from peptide MS/MS data by searching available protein sequence databases for the peptide most likely to have produced the observed spectrum. The popular SEQUEST algorithm relies on the cross-correlation between the experimental mass spectrum and the theoretical spectrum of a peptide. It uses a simplified fragmentation model that assigns a fixed, identical intensity to all major ions and a fixed, lower intensity to their neutral losses, thereby circumventing the common issues involved in predicting theoretical spectra. In practice, however, an experimental spectrum is usually not similar to its SEQUEST-predicted theoretical one, and as a result, incorrect identifications are often generated.

Results: A better understanding of peptide fragmentation is required to produce more accurate and sensitive peptide sequencing algorithms. Here, we designed the software PI, whose algorithms make good use of the intensity properties of a spectrum to validate and refine SEQUEST identifications.

Conclusions: Experiments have shown that PI is able to validate and improve the results of SEQUEST to a satisfactory degree.
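To make the simplified fragmentation model and cross-correlation scoring concrete, the sketch below builds a SEQUEST-style theoretical spectrum (one fixed intensity for all major b/y ions, a lower fixed intensity for neutral losses) and scores it against a binned experimental spectrum by subtracting a background correlation averaged over mass offsets. The constants, bin width, and offset range are illustrative; this is a schematic of the XCorr idea, not PI's or SEQUEST's actual implementation.

```python
import numpy as np

MAJOR_INTENSITY = 50.0  # fixed intensity for every major (b/y) ion -- illustrative
LOSS_INTENSITY = 10.0   # lower fixed intensity for neutral-loss ions -- illustrative

def theoretical_spectrum(fragment_mz, loss_mz, n_bins, bin_width=1.0):
    """SEQUEST-style simplified theoretical spectrum on a binned m/z axis."""
    spec = np.zeros(n_bins)
    for mz in fragment_mz:
        idx = int(mz / bin_width)
        if idx < n_bins:
            spec[idx] = MAJOR_INTENSITY
    for mz in loss_mz:
        idx = int(mz / bin_width)
        if idx < n_bins:
            spec[idx] = max(spec[idx], LOSS_INTENSITY)
    return spec

def xcorr_like(experimental, theoretical, offsets=range(-75, 76)):
    """Cross-correlation-style score: the dot product at zero offset minus
    the mean dot product over shifted copies of the experimental spectrum."""
    def shifted_dot(tau):
        return float(np.dot(np.roll(experimental, tau), theoretical))
    background = np.mean([shifted_dot(tau) for tau in offsets if tau != 0])
    return shifted_dot(0) - background
```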
Improving consensus contact prediction via server correlation reduction
Background: Protein inter-residue contacts play a crucial role in the determination and prediction of protein structures. Previous studies on contact prediction indicate that although template-based consensus methods outperform sequence-based methods on targets with typical templates, such consensus methods perform poorly on new-fold targets. However, we find that even for new-fold targets, the models generated by threading programs can contain many true contacts. The challenge is how to identify them.

Results: In this paper, we develop an integer linear programming model for consensus contact prediction. In contrast to the simple majority-voting method, which assumes that all the individual servers are equally important and independent, the newly developed method evaluates their correlation by maximum likelihood estimation and extracts independent latent servers from them by principal component analysis. An integer linear programming method is then applied to assign a weight to each latent server so as to maximize the difference between true contacts and false ones. The proposed method is tested on the CASP7 data set. If the top L/5 predicted contacts are evaluated, where L is the protein size, the average accuracy is 73%, which is much higher than that of any previously reported study. Moreover, if only the 15 new-fold CASP7 targets are considered, our method achieves an average accuracy of 37%, much better than the majority-voting method, SVM-LOMETS, SVM-SEQ, and SAM-T06, which achieve average accuracies of 13.0%, 10.8%, 25.8%, and 21.2%, respectively.

Conclusion: Reducing server correlation and optimally combining independent latent servers show a significant improvement over traditional consensus methods. This approach can hopefully provide a powerful tool for protein structure refinement and prediction.
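The decorrelation step can be pictured as follows: stack each server's binary contact predictions into a matrix, use PCA to extract uncorrelated latent servers, and combine their scores with per-server weights (which the paper obtains via integer linear programming; the weight optimization itself is omitted here). The sketch below is illustrative, with assumed names and shapes.

```python
import numpy as np

def latent_servers(predictions: np.ndarray, n_components: int) -> np.ndarray:
    """predictions: (n_servers, n_pairs) matrix, entry 1 if the server predicts
    that residue pair to be in contact, else 0.
    Returns (n_components, n_pairs) scores of decorrelated latent servers."""
    centered = predictions - predictions.mean(axis=1, keepdims=True)
    # PCA via SVD: projecting the servers onto the principal directions
    # yields mutually uncorrelated latent servers over residue pairs.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return s[:n_components, None] * vt[:n_components]

def combine(latent: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Weighted consensus score per residue pair; higher means more likely
    in contact. The weights stand in for the ILP-optimized ones."""
    return weights @ latent
```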
ProbPS: A new model for peak selection based on quantifying the dependence of the existence of derivative peaks on primary ion intensity
Background: The analysis of mass spectra suggests that the existence of derivative peaks is strongly dependent on the intensity of the primary peaks. Peak selection from a tandem mass spectrum is used to filter out noise and contaminant peaks. It is widely accepted that a valid primary peak tends to have high intensity and to be accompanied by derivative peaks, including isotopic peaks, neutral-loss peaks, and complementary peaks. Existing models for peak selection ignore the dependence between the existence of the derivative peaks and the intensity of the primary peaks: simple models assume that these two attributes are independent, but this assumption is contrary to real data and prone to error.

Results: In this paper, we propose a statistical model, named ProbPS, that quantitatively captures the dependence of a derivative peak's existence on the primary peak's intensity and uses it for peak selection. Our results show that this quantitative understanding can successfully guide the peak selection process. By comparing ProbPS with AuDeNS, we demonstrate the advantages of our method both in filtering out noise peaks and in improving de novo identification. In addition, we present a tag identification approach based on our peak selection method. On a test data set, our tag identification method (876 correct tags in 1000 spectra) outperforms PepNovoTag (790 correct tags in 1000 spectra).

Conclusions: ProbPS improves the accuracy of peak selection, which further enhances the performance of de novo sequencing and tag identification. Thus, our model saves valuable computation time and improves the accuracy of the results.
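A toy illustration of the kind of dependence ProbPS quantifies: bin primary-peak intensities and, for each bin, estimate how often a derivative peak (isotopic, neutral-loss, or complementary) is actually observed, then turn that conditional probability into a peak-selection score. This sketch is an assumption-laden simplification for intuition, not the published model.

```python
import numpy as np

def fit_dependence(intensities, has_derivative, n_bins=20):
    """Estimate P(derivative peak exists | primary intensity bin) from
    annotated spectra.
    intensities:    1-D array of primary-peak intensities (training data).
    has_derivative: 1-D boolean array, True if that primary peak is
                    accompanied by a derivative peak."""
    edges = np.quantile(intensities, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, intensities, side="right") - 1, 0, n_bins - 1)
    prob = np.array([
        has_derivative[bins == b].mean() if np.any(bins == b) else 0.5
        for b in range(n_bins)
    ])
    return edges, prob

def score_peak(intensity, derivative_present, edges, prob):
    """Log-odds-style score: a high-intensity peak without the expected
    derivative peaks is penalised; one with them is rewarded."""
    b = int(np.clip(np.searchsorted(edges, intensity, side="right") - 1, 0, len(prob) - 1))
    p = prob[b]
    return np.log(p / (1 - p + 1e-9)) if derivative_present else np.log((1 - p) / (p + 1e-9))
```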
Predicting mutational effects on protein-protein binding via a side-chain diffusion probabilistic model
Many crucial biological processes rely on networks of protein-protein
interactions. Predicting the effect of amino acid mutations on protein-protein
binding is vital in protein engineering and therapeutic discovery. However, the
scarcity of annotated experimental data on binding energy poses a significant
challenge for developing computational approaches, particularly deep
learning-based methods. In this work, we propose SidechainDiff, a
representation learning-based approach that leverages unlabelled experimental
protein structures. SidechainDiff utilizes a Riemannian diffusion model to
learn the generative process of side-chain conformations and can also give the
structural context representations of mutations on the protein-protein
interface. Leveraging the learned representations, we achieve state-of-the-art
performance in predicting the mutational effects on protein-protein binding.
Furthermore, SidechainDiff is the first diffusion-based generative model for
side-chains, distinguishing it from prior efforts that have predominantly
focused on generating protein backbone structures.
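Side-chain conformations are commonly parameterised by up to four chi torsion angles, so a Riemannian diffusion over them lives on a torus. The sketch below shows the wrapped-Gaussian view of the forward (noising) process under that assumption; the noise schedule, shapes, and wrapping convention are illustrative rather than SidechainDiff's exact formulation.

```python
import torch

def wrap(angles: torch.Tensor) -> torch.Tensor:
    """Wrap angles into [-pi, pi) so the state stays on the torus."""
    return (angles + torch.pi) % (2 * torch.pi) - torch.pi

def forward_diffuse(chi: torch.Tensor, t: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    """Forward noising of side-chain torsion angles on the torus.

    chi:    (batch, 4) chi torsion angles in radians.
    t:      (batch,) integer timesteps indexing the schedule.
    sigmas: (T,) monotonically increasing noise scales.
    Adds Gaussian noise and wraps, i.e. the wrapped-Gaussian (heat-kernel)
    approximation of Brownian motion on the torus.
    """
    noise = torch.randn_like(chi)
    return wrap(chi + sigmas[t].unsqueeze(-1) * noise)
```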