Hidden Markov Models (HMMs) are a commonly used tool for inferring
transcription factor (TF) binding sites from DNA sequence data. We exploit the
mathematical equivalence between HMMs for TF binding and the "inverse"
statistical mechanics of hard rods in a one-dimensional disordered potential to
investigate learning in HMMs. We derive analytic expressions for the Fisher
information, a commonly employed measure of confidence in learned parameters,
in the biologically relevant limit where the density of binding sites is low.
We then use techniques from statistical mechanics to derive a scaling principle
relating the specificity (binding energy) of a TF to the minimum amount of
training data necessary to learn it.

Comment: 25 pages, 2 figures, 1 table. V2 - typos fixed and new references
added