News entities must select and filter the coverage they broadcast through
their respective channels since the set of world events is too large to be
treated exhaustively. The subjective nature of this filtering induces biases
due to, among other things, resource constraints, editorial guidelines,
ideological affinities, or even the fragmented nature of the information at a
journalist's disposal. The magnitude and direction of these biases are,
however, largely unknown. The absence of ground truth, the sheer size of the
event space, and the lack of an exhaustive set of absolute features to measure
make it difficult to observe the bias directly, to characterize the nature of
the leaning, and to factor it out to ensure neutral coverage of the news. In this
work, we introduce a methodology to capture the latent structure of media's
decision process on a large scale. Our contribution is threefold. First, we
show media coverage to be predictable using personalization techniques, and
evaluate our approach on a large set of events collected from the GDELT
database. We then show that a personalized and parametrized approach not only
exhibits higher accuracy in coverage prediction, but also provides an
interpretable representation of the selection bias. Finally, we propose a
method that selects a set of sources by leveraging the latent representation. These
selected sources provide a more diverse and egalitarian coverage, all while
retaining the most actively covered events.