OutRank: Speeding up AutoML-based Model Search for Large Sparse Data sets with Cardinality-aware Feature Ranking
The design of modern recommender systems relies on understanding which parts
of the feature space are relevant for solving a given recommendation task.
However, real-world data sets in this domain are often characterized by their
large size, sparsity, and noise, making it challenging to identify meaningful
signals. Feature ranking represents an efficient branch of algorithms that can
help address these challenges by identifying the most informative features and
facilitating the automated search for more compact and better-performing models
(AutoML). We introduce OutRank, a system for versatile feature ranking and data
quality-related anomaly detection. OutRank was built with categorical data in
mind, utilizing a variant of mutual information that is normalized with regard
to the noise produced by features of the same cardinality. We further extend
the similarity measure by incorporating information on feature similarity and
combined relevance. The proposed approach's feasibility is demonstrated by
speeding up the state-of-the-art AutoML system on a synthetic data set with no
performance loss. Furthermore, on a real-life click-through-rate
prediction data set, OutRank outperformed strong baselines such as random
forest-based approaches. The proposed approach enables exploration of up to
300% larger feature spaces compared to AutoML-only approaches, enabling faster
search for better models on off-the-shelf hardware.
Comment: accepted to RecSys202
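The idea of normalizing mutual information against the noise produced by features of the same cardinality can be illustrated with a small sketch. This is not the OutRank implementation: the function names `mutual_information` and `noise_adjusted_mi`, the use of uniformly random features as the noise baseline, and the subtraction of the mean noise score are all assumptions made for illustration.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete 1-D arrays."""
    x_codes = np.unique(x, return_inverse=True)[1]
    y_codes = np.unique(y, return_inverse=True)[1]
    # Joint contingency table of the two encoded variables.
    joint = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(joint, (x_codes, y_codes), 1.0)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def noise_adjusted_mi(feature, target, n_noise=20, seed=0):
    """Raw MI minus the mean MI of random features with the same cardinality.

    Assumption (hypothetical baseline): noise features are drawn uniformly
    at random over as many categories as the real feature has.
    """
    rng = np.random.default_rng(seed)
    cardinality = len(np.unique(feature))
    noise_scores = [
        mutual_information(rng.integers(0, cardinality, size=len(feature)), target)
        for _ in range(n_noise)
    ]
    return mutual_information(feature, target) - float(np.mean(noise_scores))
```

Under these assumptions, a high-cardinality identifier-like feature that looks informative by raw mutual information alone should score close to zero after adjustment, while a genuinely predictive low-cardinality feature keeps most of its score.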
Latent Graphs for Semi-Supervised Learning on Biomedical Tabular Data
In the domain of semi-supervised learning, the current approaches
insufficiently exploit the potential of considering inter-instance
relationships among (un)labeled data. In this work, we address this limitation
by providing an approach for inferring latent graphs that capture the intrinsic
data relationships. By leveraging graph-based representations, our approach
facilitates the seamless propagation of information throughout the graph,
effectively incorporating global and local knowledge. Through evaluations on
biomedical tabular datasets, we compare the capabilities of our approach to
other contemporary methods. Our work demonstrates the significance of
inter-instance relationship discovery as a practical means for constructing
robust latent graphs to enhance semi-supervised learning techniques. The
experiments show that the proposed methodology outperforms contemporary
state-of-the-art methods for (semi-)supervised learning on three biomedical
datasets.
Comment: Accepted at IJCLR 202
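The abstract does not say how the latent graphs are inferred, so the following sketch substitutes a generic technique to show what propagating label information over an instance graph can look like: a k-nearest-neighbour graph combined with iterative label propagation. The function names `knn_graph` and `propagate_labels`, the Euclidean distance, and the clamping scheme are assumptions for illustration and should not be read as the authors' method.

```python
import numpy as np


def knn_graph(X, k=5):
    """Symmetric k-nearest-neighbour adjacency matrix (Euclidean distance)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # an instance is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    A = np.zeros((len(X), len(X)))
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, neighbours.ravel()] = 1.0
    return np.maximum(A, A.T)  # make the graph undirected


def propagate_labels(A, labels, n_classes, n_iter=50):
    """Spread class scores over the graph; labels == -1 marks unlabeled rows."""
    labeled = labels >= 0
    Y = np.zeros((len(labels), n_classes))
    Y[labeled, labels[labeled]] = 1.0
    P = A / A.sum(axis=1, keepdims=True)  # row-normalised transition matrix
    F = Y.copy()
    for _ in range(n_iter):
        F = P @ F                # each node averages its neighbours' scores
        F[labeled] = Y[labeled]  # clamp the known labels after every step
    return F.argmax(axis=1)
```

In this sketch the graph is fixed by feature-space distance; an approach that infers a latent graph would replace `knn_graph` with a learned structure while the propagation step stays conceptually similar.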