Antibodies, an essential part of our immune system, develop through an
intricate process to bind a wide array of pathogens. This process involves
randomly mutating DNA sequences encoding these antibodies to find variants with
improved binding, though mutations are not distributed uniformly across
sequence sites. Immunologists observe this nonuniformity to be consistent with
"mutation motifs", which are short DNA subsequences that affect how likely a
given site is to experience a mutation. Quantifying the effect of motifs on
mutation rates is challenging: a large number of possible motifs makes this
statistical problem high dimensional, while the unobserved history of the
mutation process leads to a nontrivial missing data problem. We introduce an
ℓ1-penalized proportional hazards model to infer mutation motifs and
their effects. To estimate model parameters, our method uses a Monte
Carlo EM algorithm to marginalize over the unknown ordering of mutations. We
show that our method outperforms current methods on simulated data and yields
more parsimonious models. The application of proportional
hazards to mutation processes is, to our knowledge, novel and formalizes the
current methods in a statistical framework that can be easily extended to
analyze the effect of other biological features on mutation rates.
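
To make the role of the unknown mutation order concrete, the following is a minimal sketch, not the method's actual implementation. It assumes that each site's hazard is proportional to exp(theta[motif]), where the motif is the 3-mer centered on the site, and that every site not yet mutated is at risk at each step. The function names, the 'N' padding at the sequence ends, and the example weights are hypothetical and chosen only for illustration.

```python
import math


def motif_at(seq, i, k=3):
    """Return the k-mer centered at position i, padding the ends with 'N'."""
    half = k // 2
    padded = "N" * half + seq + "N" * half
    return padded[i:i + k]


def order_log_likelihood(start, end, order, theta, k=3):
    """Log-probability of one mutation order under a proportional hazards sketch.

    start, end : sequences before and after mutation (equal length)
    order      : positions that mutated, in the assumed order of occurrence
    theta      : dict mapping motif -> log hazard ratio (absent motifs count as 0)
    """
    seq = list(start)
    at_risk = set(range(len(start)))  # assumption: every unmutated site is at risk
    total = 0.0
    for pos in order:
        current = "".join(seq)
        # Motifs are recomputed at every step because an earlier mutation
        # changes the motifs of neighboring sites.
        scores = {j: theta.get(motif_at(current, j, k), 0.0) for j in at_risk}
        log_denominator = math.log(sum(math.exp(s) for s in scores.values()))
        total += scores[pos] - log_denominator
        seq[pos] = end[pos]   # apply the mutation
        at_risk.remove(pos)   # a mutated site leaves the risk set
    return total


# Hypothetical weights: the motif "ACG" triples the hazard of its center site.
theta = {"ACG": math.log(3)}
print(math.exp(order_log_likelihood("ACGT", "AGGA", [1, 3], theta)))
print(math.exp(order_log_likelihood("ACGT", "AGGA", [3, 1], theta)))
```

Under these made-up weights the order (1, 3) has probability 1/2 · 1/3 = 1/6, while the order (3, 1) has probability 1/6 · 3/5 = 1/10. The same observed pair of sequences is therefore assigned different likelihoods depending on the order, which is why the order has to be marginalized over, for example by averaging over sampled orders in a Monte Carlo E-step.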