Data and code to accompany the publication.
Data S1 through S3 are described in the supplementary materials.
The virtual library is contained in virtual_library.tar, a tar archive of bzip2-compressed CSV files. Each file holds a chunk of 10,000 records, for a total of 17,482,092 records. Each record has a unique identifier, "mol_number".
For each chunk, two files are provided. The first, VL_chunk_xxxx_smiles.csv, contains only the identifier and the corresponding SMILES string.
The second, VL_chunk_xxxx.csv, additionally contains the predictions made for the library members.
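The layout described above (a tar archive whose members are bzip2-compressed CSV files) can be read without unpacking everything to disk. The following is a minimal sketch using only the Python standard library. The helper names and the member-name suffix passed by the caller (for example "_smiles.csv.bz2") are assumptions for illustration, as are the presence of a header row, the comma delimiter, and UTF-8 encoding; check the actual member names in virtual_library.tar before relying on it.

```python
import bz2
import csv
import tarfile


def open_archive(archive):
    """Open a tar archive given either a path or a binary file object."""
    if hasattr(archive, "read"):
        return tarfile.open(fileobj=archive)
    return tarfile.open(archive)


def iter_chunk_records(archive, name_suffix):
    """Yield every CSV row (as a dict) from members whose name ends with name_suffix.

    Assumptions (not confirmed by the data description): each member has a
    header row, is comma-delimited, and is UTF-8 encoded.
    """
    with open_archive(archive) as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(name_suffix):
                continue
            raw = tar.extractfile(member)
            # Decompress the bzip2 stream on the fly and parse it as CSV text.
            with bz2.open(raw, mode="rt", encoding="utf-8", newline="") as text:
                yield from csv.DictReader(text)
```

A call such as `for record in iter_chunk_records("virtual_library.tar", "_smiles.csv.bz2"): ...` would then stream the identifier/SMILES records one at a time, which keeps memory use low across the full library.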
In addition to the identifier and SMILES string, the columns of VL_chunk_xxxx.csv are:
- MoKa calculations: [number_of_ionizable_centers, center1_acidorbase, center1_pKa, center1_atom_number, center1_prediction_quality, center2_acidorbase, center2_pKa, center2_atom_number, center2_prediction_quality, center3_acidorbase, center3_pKa, center3_atom_number, center3_prediction_quality, center4_acidorbase, center4_pKa, center4_atom_number, center4_prediction_quality, center5_acidorbase, center5_pKa, center5_atom_number, center5_prediction_quality, center6_acidorbase, center6_pKa, center6_atom_number, center6_prediction_quality, center7_acidorbase, center7_pKa, center7_atom_number, center7_prediction_quality, center8_acidorbase, center8_pKa, center8_atom_number]
- Property predictions using Novartis' model: [predicted_logD_pH7.4, predicted_logSolubility_pH6.8_(mM), predicted_ionization_constant]
- Property predictions using Schrödinger: [QPlogPo/w, QPlogS]. These are calculated for the all-cis diastereomer.
- Reaction outcome predictions for up to two possible reactions leading to the product: [rxn1_smiles, rxn1_predictions, rxn1_confidence, rxn2_smiles, rxn2_predictions, rxn2_confidence]
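
Because the MoKa results are stored in wide form (one group of centerN_* columns per ionizable center), it can be convenient to collect them into a list per record. The sketch below does this for a record given as a dict keyed by the column names listed above. The function name is hypothetical, and the code assumes that unused center slots are left as empty cells and that pKa values are plain decimal numbers; neither is confirmed by this description, so adjust the check if the files encode missing values differently.

```python
def ionizable_centers(record, max_centers=8):
    """Collect the per-center MoKa columns of one record into a list of dicts.

    Assumption: a center slot is unused when its centerN_pKa cell is missing
    or empty. Columns that are absent (such as a prediction quality for
    center 8, which is not listed) are returned as None.
    """
    centers = []
    for i in range(1, max_centers + 1):
        pka = record.get(f"center{i}_pKa")
        if pka is None or pka.strip() == "":
            continue  # no data for this center slot
        centers.append({
            "index": i,
            "acid_or_base": record.get(f"center{i}_acidorbase"),
            "pKa": float(pka),
            "atom_number": record.get(f"center{i}_atom_number"),
            "prediction_quality": record.get(f"center{i}_prediction_quality"),
        })
    return centers
```

The result can be cross-checked against the number_of_ionizable_centers column of the same record.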