The design of modern recommender systems relies on understanding which parts
of the feature space are relevant for solving a given recommendation task.
However, real-world data sets in this domain are often characterized by their
large size, sparsity, and noise, making it challenging to identify meaningful
signals. Feature ranking is an efficient family of algorithms that can
help address these challenges by identifying the most informative features and
facilitating the automated search for more compact and better-performing models
(AutoML). We introduce OutRank, a system for versatile feature ranking and data
quality-related anomaly detection. OutRank was built with categorical data in
mind, utilizing a variant of mutual information that is normalized with regard
to the noise produced by features of the same cardinality. We further extend
the similarity measure by incorporating information on feature similarity and
combined relevance. We demonstrate the approach's feasibility by speeding up a
state-of-the-art AutoML system on a synthetic data set with no performance
loss. Furthermore, on a real-life click-through-rate prediction data set,
OutRank outperformed strong baselines such as random forest-based approaches.
The proposed approach enables exploration of up to 300% larger feature spaces
compared to AutoML-only approaches, which in turn allows a faster
search for better models on off-the-shelf hardware.

Comment: accepted to RecSys202
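
The idea of normalizing mutual information against the noise produced by features of the same cardinality can be illustrated with a minimal sketch. This is one plausible reading of the abstract and not OutRank's actual implementation: the function names, the subtraction of an averaged noise baseline, and the toy data are all assumptions made for illustration. The sketch shows why such a correction matters, since a high-cardinality random feature gets a noticeably positive raw mutual information score on finite data.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def noise_adjusted_mi(feature, target, n_noise=10, seed=0):
    """Hypothetical score (an assumption, not OutRank's formula): MI of the
    feature minus the mean MI of random features of the same cardinality."""
    rng = random.Random(seed)
    cardinality = len(set(feature))
    noise_scores = []
    for _ in range(n_noise):
        # Uniform random categorical feature with as many values as the input.
        noise = [rng.randrange(cardinality) for _ in feature]
        noise_scores.append(mutual_information(noise, target))
    return mutual_information(feature, target) - sum(noise_scores) / n_noise


# Toy data (assumed): a binary target, one informative low-cardinality
# feature, and one uninformative high-cardinality feature.
rng = random.Random(1)
target = [rng.randrange(2) for _ in range(2000)]
informative = [t if rng.random() < 0.9 else 1 - t for t in target]
random_ids = [rng.randrange(500) for _ in target]

for name, feature in [("informative", informative), ("random_ids", random_ids)]:
    raw = mutual_information(feature, target)
    adjusted = noise_adjusted_mi(feature, target)
    print(f"{name}: raw={raw:.3f} adjusted={adjusted:.3f}")
```

In this sketch the raw score of the random high-cardinality feature is inflated purely by finite-sample effects, while the adjusted score moves it close to zero and leaves the informative feature clearly ahead.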