Features such as punctuation, capitalization, and formatting of entities are
important for readability, understanding, and natural language processing
tasks. However, Automatic Speech Recognition (ASR) systems produce spoken-form
text devoid of formatting, and tagging approaches to formatting address just
one or two features at a time. In this paper, we unify spoken-to-written text
conversion via a two-stage process: First, we use a single transformer tagging
model to jointly produce token-level tags for inverse text normalization (ITN),
punctuation, capitalization, and disfluencies. Then, we apply the tags to
generate written-form text and use weighted finite-state transducer (WFST)
grammars to format tagged ITN entity spans. Despite joining four models into
one, our unified tagging approach matches or outperforms task-specific models
on all four tasks across benchmark test sets from several domains.
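
The second stage of the process described above (applying token-level tags to produce written-form text) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TokenTags` fields, the `B-`/`I-` span labels, and the `format_cardinal` lookup, which is only a toy stand-in for a WFST grammar. It does not reproduce the tag inventory or grammars used in this work, and the tags are written by hand rather than predicted by a transformer.

```python
from dataclasses import dataclass


@dataclass
class TokenTags:
    """Hypothetical per-token output of a joint tagger (illustrative only)."""
    token: str
    itn: str = "O"           # "B-CARDINAL", "I-CARDINAL", or "O" (outside any span)
    punct: str = ""          # punctuation to append after the token
    cap: bool = False        # capitalize the first letter
    disfluent: bool = False  # drop the token from the output


# Toy stand-in for a WFST grammar: sums a few number words.
NUMBER_WORDS = {"twenty": 20, "five": 5, "three": 3}


def format_cardinal(words):
    """Convert a spoken cardinal span to digits (assumed, simplified)."""
    return str(sum(NUMBER_WORDS[w] for w in words))


def apply_tags(tagged):
    """Apply disfluency, ITN, capitalization, and punctuation tags in one pass."""
    out = []
    i = 0
    while i < len(tagged):
        t = tagged[i]
        if t.disfluent:
            i += 1
            continue
        if t.itn.startswith("B-"):
            # Collect the whole entity span, then format it as a unit.
            j = i + 1
            while j < len(tagged) and tagged[j].itn.startswith("I-"):
                j += 1
            span = tagged[i:j]
            text = format_cardinal([s.token for s in span])
            punct, cap = span[-1].punct, span[0].cap
            i = j
        else:
            text, punct, cap = t.token, t.punct, t.cap
            i += 1
        if cap:
            text = text[:1].upper() + text[1:]
        out.append(text + punct)
    return " ".join(out)


# Spoken form: "um we met twenty five people today"
example = [
    TokenTags("um", disfluent=True),
    TokenTags("we", cap=True),
    TokenTags("met"),
    TokenTags("twenty", itn="B-CARDINAL"),
    TokenTags("five", itn="I-CARDINAL"),
    TokenTags("people"),
    TokenTags("today", punct="."),
]
print(apply_tags(example))  # We met 25 people today.
```

Keeping all four tag types on the same token record is what lets a single pass over the sequence produce the written form. In this sketch, only the tagged ITN span is handed to the grammar-like formatter.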