Large-scale
top-down proteomics characterizes proteoforms in cells
globally with high confidence and high throughput using reversed-phase
liquid chromatography (RPLC)–tandem mass spectrometry (MS/MS)
or capillary zone electrophoresis (CZE)–MS/MS. The false discovery
rate (FDR) from the target–decoy database search is typically
deployed to filter identified proteoforms to ensure high-confidence
identifications (IDs). It has been demonstrated that the FDRs in top-down
proteomics can be drastically underestimated. An alternative approach
to the FDR can be useful for further evaluating the confidence of
proteoform IDs after the database search. We argue that predicting
the retention/migration time of proteoforms in RPLC/CZE separations
accurately and comparing the predicted and experimental separation
times could be a useful and practical approach. To our knowledge,
no report in the literature has yet described predicting the separation
time of proteoforms using large top-down proteomics data sets. In
this pilot study, for the first time, we evaluated various semiempirical
models for predicting proteoforms’ electrophoretic mobility
(μef) using large-scale top-down proteomics data
sets from CZE–MS/MS. We achieved a linear correlation between
experimental and predicted μef of E. coli proteoforms (R² = 0.98) with a simple
semiempirical model, which utilizes the number of charges and molecular
mass of each proteoform as the parameters. Our modeling data suggest
that the complete unfolding of proteoforms during CZE separation benefits
the prediction of their μef. Our results also indicate
that N-terminal acetylation and phosphorylation both decrease the
proteoforms’ charge by roughly one charge unit.
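
The idea of a semiempirical mobility model built from charge and molecular mass can be illustrated with a minimal Python sketch. It is not the model evaluated in this study: it assumes the classical Offord-type descriptor Q/M^(2/3) as a stand-in, fits a straight line between that descriptor and the measured μef, and reports R². The function names, the `tolerance` parameter, and all numerical values are hypothetical; the data are synthetic and serve only to show the workflow of comparing predicted and experimental mobility.

```python
def offord_descriptor(charge, mass_da):
    """Offord-type descriptor Q / M^(2/3) (assumed stand-in model)."""
    return charge / mass_da ** (2.0 / 3.0)


def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(observed, predicted):
    """Coefficient of determination between observed and predicted values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def flag_outliers(observed, predicted, tolerance):
    """Indices of IDs whose observed mobility deviates from the prediction
    by more than `tolerance` (hypothetical cutoff, same units as mobility)."""
    return [i for i, (o, p) in enumerate(zip(observed, predicted))
            if abs(o - p) > tolerance]


# Synthetic proteoforms: (number of charges, molecular mass in Da).
proteoforms = [(5, 6000.0), (8, 8000.0), (10, 12000.0), (14, 15000.0), (20, 27000.0)]
descriptors = [offord_descriptor(q, m) for q, m in proteoforms]

# Synthetic "experimental" mobilities generated from an arbitrary linear relation.
experimental = [2.0 * d + 0.1 for d in descriptors]

slope, intercept = fit_line(descriptors, experimental)
predicted = [slope * d + intercept for d in descriptors]
fit_quality = r_squared(experimental, predicted)
suspicious = flag_outliers(experimental, predicted, tolerance=0.01)
```

With real data, proteoform IDs returned by `flag_outliers` would be candidates for closer inspection, since a large gap between predicted and experimental mobility may point to an incorrect identification or an unexpected modification.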