Humans extract and integrate the emotional content conveyed by the faces and voices of others. It is, however, poorly understood how perceptual decisions unfold over time when people discriminate emotion expressions transmitted through dynamic facial and vocal signals, as in natural social contexts. In this study, we relied on a gating paradigm to track how the recognition of emotion expressions across the senses unfolds over exposure time. We first demonstrate that, across all emotions tested, a discrimination decision is reached earlier with faces than with voices. Importantly, multisensory stimulation consistently reduced the amount of perceptual evidence that had to accumulate before correct discrimination was reached (the Isolation Point). We also observed that expressions with different emotional content provide cumulative evidence at different speeds, with Fear showing the fastest isolation point across the senses. Finally, the lack of correlation between the confusion patterns in response to facial and vocal signals across time suggests that distinct discriminative features are extracted from the two signals. Altogether, these results provide a comprehensive view of how auditory, visual, and audiovisual information related to different emotion expressions accumulates over time, highlighting how a multisensory context can speed up the discrimination process when minimal information is available.