Shahid Beheshti University of Medical Sciences & Iranian Probiotic and Functional Foods Society
Doi
Abstract
Background and Objective: Identifying a milk-clotting enzyme (MCE) with high κ-casein specificity and heat sensitivity remains a challenge in cheese production. Current microbial, plant, and recombinant MCEs often exhibit low clotting activity, poor κ-casein specificity, and high thermostability, compromising cheese quality and increasing production costs. In this study, to address this, we developed a computational pipeline combining structural analysis, machine learning, and molecular dynamics simulation to design approximately 160,000 peptides from the Rhizomucor miehei protease–Pepstatin A complex (PDB ID: 2RMP).
Material and Methods: Single-site mutagenesis, ML-driven affinity re-prediction, and physicochemical filtering yielded 84 peptides. Their specificity as aspartic proteases were validated via predicted Pepstatin A binding and further screened for cross-reactivity with αs1-, αs2-, and β-caseins.
Results and Conclusion: Two candidates, Pep1 and Pep2, demonstrated superior κ-casein binding affinities (ΔG = -50.20 and -39.07 kcal/mol at 40 °C, respectively), lower melting indices (-4.05 and -3.09), and significantly enhanced specificity scores (-10.85) compared to Rhizomucor miehei protease (ΔG = -33.9 kcal/mol at 45 °C; melting index = 0.17; specificity score = 0.85). These peptides represent promising vegan- and halal-friendly alternatives to chymosins, pending experimental validation.
Keywords: Milk-clotting enzymes, Computational peptide design, κ-casein specificity, Thermolabile peptides, Sustainable cheese production, Molecular dynamics simulations, Machine learning, Vegan cheese production, Halal cheese production, Aspartic protease.
. Introduction
Identifying chymosin substitutes with comparable enzymatic characteristics particularly high specificity for κ-casein (reflected in a high milk-clotting activity to proteolytic activity ratio, MCA/PA) and optimal heat sensitivity remains a significant challenge in cheesemaking [1-3]. Chymosin, traditionally derived from the stomachs of calves, has long been the gold standard for milk clotting due to its high specificity for κ-casein and its ability to produce high-quality cheese [4]. However, economic, ethical, dietary, and religious concerns, coupled with the rising demand for cheese production and consumption, have spurred the search for viable alternatives [1, 5]. Various sources of milk-clotting enzymes (MCEs) have been explored, including other animal-derived rennet (from lamb, buffalo, and camel), plant-based coagulants, and microbial and recombinant enzymes. Lamb and buffalo chymosin have lower MCA/PA ratios than calf chymosin, compromising cheese quality, though buffalo rennet’s stability suits mozzarella production [6, 7]. In contrast, camel rennet, while showing desirable clotting properties, is difficult to obtain in large quantities, limiting its widespread use [8]. Rennet from other animals, such as pigs and rabbits, often shows poor specificity and thermal stability, restricting their utility in cheesemaking [1]. Recently, marine organisms such as the shrimp Pleoticus muelleri have also been investigated as novel sources of coagulants [9-11]. Nevertheless, animal-based MCEs continue to face ethical and religious constraints. Plant-based coagulants, such as chymopapain from papaya, offer promising vegan- and halal-friendly alternatives [12]. However, their lower specificity toward κ-casein and tendency to generate bitterness limit their use, especially in aged cheeses [1, 13]. Microbial coagulants, particularly from bacterial and fungal sources, are increasingly favoured due to their cost-effectiveness and suitability for vegetarian and organic cheese production [14]. Although bacterial coagulants offer functional diversity [5, 15-18], they often exhibit high proteolytic activity, necessitating the development of improved strains through mutation [1]. Fungal coagulants are generally more compatible with cheesemaking conditions, particularly in terms of pH and temperature [1, 19]. Still, their high thermostability can cause undesired proteolysis during cheese ripening, adversely affecting texture and flavour [20]. Among fungal coagulants, mucorpepsin (EC 3.4.23.23) from Rhizomucor miehei is widely used due to its relative specificity. However, it still acts on α- and β-caseins and possesses high thermal stability, which can result in residual enzymatic activity after curd formation. This residual activity often leads to bitterness, altered texture, and reduced yield during cheese production [21-25]. Several studies have attempted to address these limitations using physical, chemical, enzymatic, and genetic modifications to enhance specificity, reduce thermostability, and minimise off-target proteolysis while maintaining milk-clotting efficiency. For example, some efforts have successfully reduced thermostability by removing N-linked carbohydrates, one of the primary factors contributing to R. miehei protease’s thermal resistance without significantly diminishing clotting activity [1, 25-29]. Other studies have enhanced milk coagulation without altering thermostability [30-35]. Genetic engineering, mutagenesis [36-39], substrate modification[40-43], and enzyme immobilisation [44, 45], have all been explored to develop efficient coagulants suitable for industrial cheese production. Nevertheless, regulatory restrictions on genetically modified organisms and technical and economic challenges have limited the commercial adoption of many of these approaches [46]. A novel strategy for identifying milk-clotting enzymes involves targeting conserved motifs—namely FDTSSD or FDTGSSE, found in all currently known commercial coagulants. Using this approach, several in silico BLAST searches have identified new candidate proteases for cheese manufacturing. Moreover, degenerate primers based on these motifs can identify relevant genes even when full gene sequences are unknown [26]. Computational enzyme design and molecular modelling have shown great potential in optimizing enzyme activities [47-49]. Advances in bioinformatics, molecular dynamics, and machine learning now enable the prediction and engineering of enzymes with enhanced specificity, catalytic efficiency, and thermostability. Structural insights into active sites support rational design through targeted mutations[50-52]. These in silico methods reduce experimental cost and time by predicting enzyme performance before validation. Notably, ancestral sequence reconstruction has been used to design a novel aspartic protease with improved κ-casein affinity and thermolability for cheesemaking [53]. Although milk-clotting enzymes (MCEs) have been widely studied, the application of advanced computational tools for their rational design remains underexplored [54]. Moreover, current MCEs often exhibit limitations such as low substrate specificity and excessive thermostability, which can negatively affect cheese quality and increase commercial costs. This study introduces a computational pipeline to design thermolabile, κ-casein-specific milk-clotting peptides as sustainable, vegan-, and halal-friendly alternatives for cheese production. An integrated computational pipeline was developed, incorporating machine learning, rational design, and structure-based screening. A combinatorial library of 160,000 peptide variants was generated based on conserved catalytic motifs from the Rhizomucor miehei protease–Pepstatin A complex, linked through a rationally designed sequence. The library was screened using predictive models trained on binding affinity data, followed by structure-based refinement, physicochemical profiling, and validation through molecular docking and molecular dynamics simulations. This in silico framework provides a foundation for the experimental development of next-generation coagulants for dairy applications.
Materials and Methods
2.1. Aspartic peptide design strategy
We present a computational pipeline for the de novo design of short peptide-based milk-clotting enzymes (MCEs) with enhanced affinity for κ-casein and improved chemosensitivity. The design strategy involves identifying interface residues containing catalytic motifs from Rhizomucor miehei protease (RMP), linking them via an optimised linker, and screening the resulting peptide candidates for specific binding and functional activity. To benchmark the design process, RMP was used as a positive control and also served as the structural template, while a peptide with low affinity for Pepstatin A was employed as a negative control. The results for both controls are provided in the supplementary file on GitHub.
2.2. Interface residue identification
Interface residues from the crystal structure of 2RMP (RMP–Pepstatin A complex) were identified using the PDBePISA tool (https://www.ebi.ac.uk/pdbe/pisa/) [55] and served as the starting point for aspartic peptidase design. Two continuous interface segments containing the catalytic aspartic acid residues were selected: LFDTGSS (residues 36–42, harbouring the DTGS motif) and TIDTGTNFFI (residues 235–244, containing the DTGT motif). Although the ITYGT segment was identified among the interface residues, it was excluded from the design due to its lack of direct involvement in the conserved DTGS/DTGT catalytic motifs. Furthermore, incorporating non-catalytic segments would unnecessarily increase peptide length without functional contribution, potentially reducing the structural stability and target specificity of the designed peptides.
2.3. Continuous interface residues segment
Catalytic motif of 2RMP (38-41, 237-240)
LFDTGSS = 36-42 (7 residues)
TIDTGTNFFI = 235-244 (10 residues)
ITYGT = 80-84
2.4. Linker design and variant library construction
The designed peptide sequence comprises two continuous interface segments, LFDTGSS (residues 36–42) and TIDTGTNFFI (residues 235–244), totalling 17 residues. The spatial distance between the terminal residues of these segments (Ser42 and Thr235) was measured to be 13.8 Å using chimaera v1.17.3. The average distance between consecutive Cα atoms in a polypeptide chain is approximately 3.8Å [56-58].This distance corresponds to approximately four amino acids, indicating the optimal linker length required to bridge the two segments. Incorporating a four-residue linker (XXXX), the final peptide construct contains 21 amino acids. A saturated mutagenesis approach was applied to the linker, yielding 160,000 variants (20⁴ combinations) of the parent sequence, LFDTGSS-XXXX-TIDTGTNFFI.
2.5.Machine learning models for protein -protein interaction prediction: PDBbind+ and SKEMPI 2.0 Datasets
In this study, datasets were sourced from three major repositories: (1) 160,000 peptide variants generated through combinatorial substitution, (2) 7,086 mutants curated from the SKEMPI 2.0 database, which focuses on mutations in protein–protein interactions[59], and (3) 3,176 protein–protein complexes obtained from the PDBbind+ dataset[60]. For SKEMPI 2.0, duplicates, ambiguous entries (e.g., conflicting binding affinities or incomplete mutation annotations), and incomplete records (e.g., missing sequences or ΔΔG values) were removed. Similarly, the PDBbind+ dataset was filtered by excluding duplicate complexes, sequences with non-canonical residues or extreme lengths, entries lacking binding affinity or structural data, and those with physiologically irrelevant binding values. After filtering, the cleaned data were divided into training (70%), testing (20%), and validation (10%) sets. Two independent machine learning models XGBoost and a Convolutional Neural Network (CNN) were then employed to predict protein–protein interactions.
2.6. SKEMPI 2.0 dataset and CNN-based machine learning model
The SKEMPI 2.0 dataset, comprising 7,086 mutations from 295 studies, was used to train a deep learning model for predicting changes in protein–protein binding affinity[61]. After removing duplicates, ambiguous entries, and incomplete data, 343 high-quality entries were selected, each containing PDB IDs, chain identifiers, affinity values, and protein sequences. Sequences were normalised by peptide length for consistency. Model training was performed using the DeepPurpose framework (v0.1.5)[62], which employs convolutional neural networks (CNNs) for protein sequence encoding. Protein sequences were processed using DeepPurpose’s data_process() function, which internally encodes sequences and applies padding to handle variable lengths without manual preprocessing. The dataset was divided into training (70%), testing (20%), and validation (10%) sets. The model was trained over 100 epochs with a learning rate of 0.001 and a batch size of 32. Performance was evaluated using mean squared error (MSE), R², and Pearson correlation coefficient. The trained model was then used to predict the binding affinity between κ-casein and 160,000 designed peptide variants. Based on predicted scores, the top 21 peptides were selected for further analysis.
2.7. PDBbind+ dataset and XGBoost-based machine learning model
The PDBbind+ dataset, consisting of 3,176 protein–protein complexes [60], was curated by eliminating duplicates, invalid sequences, and missing data, resulting in 1,049 remaining complexes. Affinity values were standardised, log-transformed, and negative values were excluded. A sequence-based predictive model was developed using Conjoint Triad encoding (via DeepPurpose) and logarithmic normalisation to estimate protein–protein and protein–peptide binding affinities. Several machine learning models were tested using Scikit-learn, including Random Forest (R² = 0.67), Ridge (R² = –1.482), Gradient Boosting (R² = 0.647), and XGBoost (R² = 0.696). XGBoost was selected for its superior performance[63]. The model was optimised using the Optuna framework with 100 trials and 5-fold cross-validation, tuning hyperparameters such as estimators (100–1000), learning rate (0.01–0.3), max depth (3–10), subsample (0.6–1.0), and colsample_bytree (0.6–1.0). Outlier detection was performed using an Isolation Forest with a contamination rate of 0.15. The final model, evaluated using mean squared error (MSE), R², and Pearson correlation, was used to predict binding affinities for 160,000 κ-casein peptide variants, identifying the top 48 candidates for further investigation.
2.8. Single-site mutation and machine learning prediction
The top 69 sequences (48 from the model trained on PDBbind+ and 21 from the model trained on SKEMPI 2.0) were further refined through single-site mutations at positions Phe2 (PHE), Ile13 (ILE), and Phe19 (PHE). To preserve peptide structural integrity and avoid disrupting potential hydrogen bonding networks, mutations were excluded from adjacent serine (S) and threonine (T) residues surrounding the linker. Instead, three hydrophobic residues Phe2 (F2), Ile13 (I13), and Phe19 (F19) were carefully chosen as mutation sites due to their roles in local packing and potential linker interactions. Each position was systematically substituted with all 20 canonical amino acids, generating 4,140 unique sequences (20×3×69). The two trained models were used again to predict interactions between these sequences and κ-casein, resulting in 101 selected peptide sequences (96 from the model trained on PDBbind+ and 5 from the model trained on SKEMPI 2.0) with optimal binding affinity.
2.9. Physicochemical property assessment
The top 101 peptide sequences underwent comprehensive screening based on thermal stability, pH, solubility, and toxicity. DeepSTABp (https://csb-deepstabp.bio.rptu.de/) was used to predict melting temperature [64].
ToxinPred (https://webs.iiitd.edu.in/raghava/toxinpred/multi_submit.php) evaluated pH and solubility based on isoelectric point, hydrophobicity, hydrophilicity, amphipathicity, steric hindrance, charge, and molecular weight[65]. ToxinPred2 (https://webs.iiitd.edu.in/raghava/toxinpred2/) [66] and ToxinPred3 (https://webs.iiitd.edu.in/raghava/toxinpred3/index.html) [67] assessed toxicity profiles. This screening identified 84 peptide sequences with favourable physicochemical properties.
2.10. Pepstatin A binding affinity as a proxy for aspartic protease activity
Pepstatin A binding was used as a proxy to assess aspartic protease activity. A pre-trained model, Morgan_CNN_BindingDB_IC50, from the DeepPurpose framework[68], was employed to predict interactions between Pepstatin A and the 84 selected peptides. The model utilised a multi-layer perceptron for drug representation (hidden layers: [1024, 256, 64]) and a convolutional neural network for peptide representation (convolutional layers: filters [32, 64, 96]; kernel sizes [4, 8, 12]). Based on predicted binding affinity, 17 peptides with strong interaction to Pepstatin A were identified for further analysis.
2.11. Cross-reactivity profiling with non-target casein subunits
The 17 selected peptide sequences were further evaluated for binding specificity by assessing their predicted interactions with αs1-casein, αs2-casein, and β-casein, using the same trained models and datasets. Two peptides exhibited low binding affinities to these casein subunits, suggesting a high degree of specificity toward κ-casein. These sequences hold particular promise for targeted applications in unique cheese production processes.
2.12. Predicting and evaluating the 3D structures of the peptidase
The two peptides with the highest binding affinities to Pepstatin A were predicted using PEP-FOLD4 (https://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD4/) [69], a free tool for predicting linear, cyclic, and chemically modified peptide structures. It uses the improved sOPEP force field and Monte Carlo simulation with Debye-Hückel-free parameters for refinement. The Best Model is based on the lowest SOPEP score. PEP-FOLD4 also provides an SA probability map showing residue conformations: red (helices), green (β-strands), and blue (coils/loops), aiding in stability and secondary structure prediction. MD simulations were run for each peptide, and final stable structures were extracted. Structural quality was then assessed using ERRAT[70], Procheck[71], and verified for further processing.
2.13. Molecular docking of peptidase with κ-casein
Since the X-ray structure of κ-casein remains undetermined, our analysis focused on the local structure surrounding the Phe105–Met106 bond, which is susceptible to aspartic enzyme cleavage in bovine κ-casein. We aimed to explore how this local fragment interacts with RMP and the designed peptides to investigate the underlying mechanism, as highlighted in earlier studies [53, 72, 73]. The same methodology was followed for ligand preparation, with some modifications. The RMP binding site in κ-casein spans residues 102 to 108 [74], while the chymosin binding site covers residues 97 to 112 [75, 76]. The shorter segment (residues 102–108, HLSFMAI) was used as the ligand for docking with RMP, as it corresponds to the P4–P4′ binding region of Pepstatin A in the Co-Crystallised complex with RMP [74], ensuring structural relevance. In contrast, the longer fragment (residues 97–112, RHPHPHLSFMAIPPKK) was used for docking with the designed peptides (Pep 1 and Pep 2) and was also utilised throughout the machine learning pipeline to capture a broader binding context, consistent with its known role in chymosin recognition. This region was predicted using AlphaFold2[77]. Molecular docking was performed using the HADDOCK v2.4-2022.08 web server[78, 79], with the peptides as receptors and κ-casein as the ligand. Asp3 and Asp14 were defined as active residues for the peptides (first molecule), and residues 102 to 108 were set as active for κ-casein (second molecule). The top docking poses were selected based on a combination of criteria, including the size of the largest cluster, the most negative Z-score, the number of hydrogen bonds, and the lowest binding energy. Interaction analyses were carried out using LigPlot+ v1.4.5 software[80].
2.14. Molecular dynamics simulation (MD)
The well-recognised and freely licensed software used in this study was GROMACS version 2024[81]. Topology parameters for the anticipated models were produced using the AMBER force field (ff99SB) in GROMACS [82] .The constructed peptides (pep 1and Pep 2) and RMP models were placed in a cubic box with 10 Å spacing, solvated with the TIP3P water model[83]. Energy minimisation was performed using the steepest descent technique[84], and long-range electrostatics were calculated using the Particle-Mesh Ewald (PME) method[85], with a 1.0 nm cut-off for van der Waals and electrostatic interactions. The maximum number of minimisation steps was 50,000. Equilibration was carried out using 5000 ps of NVT and NPT runs. The V-rescale thermostat (0.1 ps coupling) maintained temperature at 318.15 K, which is the optimal temperature for RMP, during NVT. The Berendsen barostat was employed during the NPT equilibration phase with a coupling time of 2.0 ps to allow fast and stable pressure convergence. Although the Berendsen method does not sample a true NPT ensemble, it is widely used for initial stabilisation [86]. For the subsequent 100 ns production run, the Parrinello–Rahman barostat was used to ensure thermodynamically rigorous sampling of pressure fluctuations and system behaviour [87]. Simulations were also conducted at different temperatures for comparative analysis. Analytical tools including gmx rms, gmx rmsf, gmx gyrate, gmx hbond, and gmx sasa were used to calculate RMSD, RMSF, radius of gyration, hydrogen bonds, and SASA, respectively[88]. All data were extracted from trajectories and visualised using Origin.
2.15. MM-PBSA binding energy, principal component, and free energy landscape analyses of protease complexes
MM-PBSA binding free energy and PCA ana