GlycoNMR: Dataset and benchmarks for NMR chemical shift prediction of carbohydrates with graph neural networks
Molecular representation learning (MRL) is a powerful tool for bridging the
gap between machine learning and chemical sciences, as it converts molecules
into numerical representations while preserving their chemical features. These
encoded representations serve as a foundation for various downstream
biochemical studies, including property prediction and drug design. MRL has had
great success with proteins and general biomolecule datasets. Yet, in the
growing sub-field of glycoscience (the study of carbohydrates, where longer
carbohydrates are also called glycans), MRL methods have barely been explored.
This under-exploration can be primarily attributed to the limited availability
of comprehensive and well-curated carbohydrate-specific datasets and a lack of
machine learning (ML) pipelines specifically tailored to meet the unique
problems presented by carbohydrate data. Since interpreting and annotating
carbohydrate-specific data is generally more complicated than doing so for
protein data, domain experts usually need to be involved. Existing MRL methods,
predominantly optimized for proteins and small biomolecules, also cannot be
directly used in carbohydrate applications without special modifications. To
address this challenge, accelerate progress in glycoscience, and enrich the
data resources of the MRL community, we introduce GlycoNMR. GlycoNMR contains
two laboriously curated datasets with 2,609 carbohydrate structures and 211,543
annotated nuclear magnetic resonance (NMR) chemical shifts for precise
atomic-level prediction. We tailored carbohydrate-specific features and adapted
existing MRL models to tackle this problem effectively. For illustration, we
benchmark four modified MRL models on our new datasets