NASA Thesaurus supplement: A four part cumulative supplement to the 1988 edition of the NASA Thesaurus (supplement 3)
The four-part cumulative supplement to the 1988 edition of the NASA Thesaurus includes the Hierarchical Listing (Part 1), Access Vocabulary (Part 2), Definitions (Part 3), and Changes (Part 4). The semiannual supplement gives complete hierarchies and accepted upper/lowercase forms for new terms.
pLMFPPred: a novel approach for accurate prediction of functional peptides integrating embedding from pre-trained protein language model and imbalanced learning
Functional peptides have the potential to treat a variety of diseases. Their
good therapeutic efficacy and low toxicity make them ideal therapeutic agents.
Artificial intelligence-based computational strategies can help quickly
identify new functional peptides from collections of protein sequences and
discover their different functions. Using protein language model-based
embeddings (ESM-2), we developed a tool called pLMFPPred (Protein Language
Model-based Functional Peptide Predictor) for predicting functional peptides
and identifying toxic peptides. We also introduced SMOTE-TOMEK data synthesis
sampling and Shapley value-based feature selection techniques to relieve data
imbalance issues and reduce computational costs. On a validated independent
test set, pLMFPPred achieved an accuracy of 0.974, an area under the receiver
operating characteristic curve (AUC-ROC) of 0.99, and an F1-score of 0.974.
Comparative experiments show that pLMFPPred outperforms existing methods for
predicting functional peptides on all three metrics, and it represents a new
computational method for this task.
Comment: 20 pages, 5 figures, under review
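The SMOTE-TOMEK resampling mentioned in the abstract above can be illustrated with a minimal sketch. This is not the pLMFPPred code: the function names (smote, tomek_links, smote_tomek), the Euclidean distance, the neighbour count k, and the choice to drop both members of each Tomek link are all assumptions made for illustration. SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours; a Tomek link is a pair of mutual nearest neighbours with different labels, which is removed afterwards to clean the class boundary.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=2, seed=0):
    """Return n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours.
    Needs at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: euclidean(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )

    nn = [nearest(i) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def smote_tomek(points, labels, minority_label=1, k=2, seed=0):
    """Oversample the minority class up to the majority size with SMOTE,
    then drop both members of every Tomek link (an assumed convention)."""
    minority = [p for p, y in zip(points, labels) if y == minority_label]
    n_majority = len(points) - len(minority)
    new = smote(minority, max(0, n_majority - len(minority)), k, seed)
    all_points = list(points) + new
    all_labels = list(labels) + [minority_label] * len(new)
    drop = {i for pair in tomek_links(all_points, all_labels) for i in pair}
    kept = [
        (p, y)
        for i, (p, y) in enumerate(zip(all_points, all_labels))
        if i not in drop
    ]
    return [p for p, _ in kept], [y for _, y in kept]
```

In practice the feature tuples would be per-peptide embedding vectors rather than toy coordinates, and a maintained library implementation would normally be preferred over this brute-force version, which compares every pair of points.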
PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications
Large protein language models are adept at capturing the underlying
evolutionary information in primary structures, offering significant practical
value for protein engineering. Compared to natural language models, protein
amino acid sequences have a smaller data volume and a limited combinatorial
space. Choosing an appropriate vocabulary size to optimize the pre-trained
model is a pivotal issue. Moreover, despite the wealth of benchmarks and
studies in the natural language community, there remains a lack of a
comprehensive benchmark for systematically evaluating protein language model
quality. Given these challenges, PETA trained language models with 14 different
vocabulary sizes under three tokenization methods. It conducted thousands of
tests on 33 diverse downstream datasets to assess the models' transfer learning
capabilities, incorporating two classification heads and three random seeds to
mitigate potential biases. Extensive experiments indicate that vocabulary sizes
between 50 and 200 optimize the model, whereas sizes exceeding 800
detrimentally affect the model's representational performance. Our code, model
weights and datasets are available at
https://github.com/ginnm/ProteinPretraining.
Comment: 46 pages, 4 figures, 9 tables
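To make the role of vocabulary size in sub-word tokenization concrete, here is a minimal sketch of byte-pair encoding (BPE) applied to amino acid sequences. BPE is assumed here purely as an example of a sub-word tokenizer; the abstract above does not name the three tokenization methods it compares, and the function names (train_bpe, merge_pair, tokenize) are hypothetical and not taken from the PETA repository. Training starts from the single-residue alphabet and repeatedly merges the most frequent adjacent token pair until the requested vocabulary size is reached.

```python
from collections import Counter


def merge_pair(tokens, a, b):
    """Replace every adjacent occurrence of (a, b) in tokens with a + b."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(sequences, vocab_size):
    """Learn BPE merges until the vocabulary has vocab_size tokens, or until
    no adjacent pair occurs at least twice in the corpus."""
    corpus = [list(seq) for seq in sequences]
    vocab = sorted({aa for toks in corpus for aa in toks})
    merges = []
    while len(vocab) < vocab_size:
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so that
        # training is deterministic.
        (a, b), count = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break
        merges.append((a, b))
        if a + b not in vocab:
            vocab.append(a + b)
        corpus = [merge_pair(toks, a, b) for toks in corpus]
    return vocab, merges


def tokenize(seq, merges):
    """Tokenize one sequence by applying the learned merges in order."""
    tokens = list(seq)
    for a, b in merges:
        tokens = merge_pair(tokens, a, b)
    return tokens


if __name__ == "__main__":
    # Toy sequences, not real training data.
    toy = ["MKVLAAGMKVLA", "MKVLGGAA"]
    for size in (6, 8, 12):
        vocab, merges = train_bpe(toy, size)
        print(size, len(vocab), tokenize(toy[0], merges))
```

A larger target vocabulary allows more merges, so each sequence is split into fewer, longer tokens. This is the trade-off the benchmark varies across its 14 vocabulary sizes.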
What is a meaningful representation of protein sequences?
How we choose to represent our data has a fundamental impact on our ability
to subsequently extract information from them. Machine learning promises to
automatically determine efficient representations from large unstructured
datasets, such as those arising in biology. However, empirical evidence
suggests that seemingly minor changes to these machine learning models yield
drastically different data representations that result in different biological
interpretations of data. This raises the question of what even constitutes the
most meaningful representation. Here, we approach this question for
representations of protein sequences, which have received considerable
attention in the recent literature. We explore two key contexts in which
representations naturally arise: transfer learning and interpretable learning.
In the first context, we demonstrate that several contemporary practices yield
suboptimal performance, and in the latter we demonstrate that taking
representation geometry into account significantly improves interpretability
and lets the models reveal biological information that is otherwise obscured.
Comment: 17 pages, 8 figures, 2 tables