We present techniques to characterize which data is important to a
recommender system and which is not. Important data is data that contributes
most to the accuracy of the recommendation algorithm, while less important data
contributes less to the accuracy or even decreases it. Characterizing the
importance of data has two potential direct benefits: (1) increased privacy and
(2) reduced data management costs, including storage. For privacy, we enable
increased recommendation accuracy for comparable privacy levels using existing
data obfuscation techniques. For storage, our results indicate that we can
achieve large reductions in recommendation data and yet maintain recommendation
accuracy.
Our main technique is called differential data analysis. The name is inspired
by other sorts of differential analysis, such as differential power analysis
and differential cryptanalysis, where insight comes through analysis of
slightly differing inputs. In differential data analysis we chunk the data and
compare results in the presence or absence of each chunk. We present results
applying differential data analysis to two datasets and three different kinds
of attributes. The first attribute is called user hardship. This is a novel
attribute, particularly relevant to location datasets, that indicates how
burdensome a data point was to achieve. The second and third attributes are
more standard: timestamp and user rating. For user rating, we confirm previous
work concerning the increased importance to the recommender of data
corresponding to high and low user ratings.Comment: Extended version of RecSys 2013 pape