In real-world scenarios, recommenders face non-functional requirements of a technical nature and must handle dynamic data in the form
of sequential streams. Evaluation of recommender systems must
take these issues into account in order to be maximally informative.
In this paper, we present Idomaar, a framework that enables the
efficient multi-dimensional benchmarking of recommender algorithms. Idomaar goes beyond current academic research practices
by creating a realistic evaluation environment and computing both
effectiveness and technical metrics for stream-based as well as set-based evaluation. A scenario focusing on the “research to prototyping
to productization” cycle at a company illustrates Idomaar’s potential.
We show that Idomaar simplifies testing with varying configurations
and supports flexible integration of different data.

Hopfgartner, F.

Kille, B.

Larson, M.

Lommatzsch, A.

Malagoli, D.

Plumbaum, T.

Scriminaci, M.

Serény, A.

White Rose Research Online

This is a repository copy of Idomaar: a framework for multi-dimensional benchmarking of recommender algorithms.

White Rose Research Online URL for this paper: https://eprints.whiterose.ac.uk/175097/

Version: Accepted Version

Proceedings Paper: Scriminaci, M., Lommatzsch, A., Kille, B. et al. (5 more authors) (2016) Idomaar: a framework for multi-dimensional benchmarking of recommender algorithms. In: Guy, I. and Sharma, A., (eds.) Proceedings of the Poster Track of the 10th ACM Conference on Recommender Systems (RecSys 2016). 10th ACM Conference on Recommender Systems (RecSys 2016), 17 Sep 2016, Boston, USA. CEUR Workshop Proceedings.

© 2016 The Authors. This is an author-produced version of a paper subsequently published in Proceedings of the Poster Track of the 10th ACM Conference on Recommender Systems (RecSys 2016). Uploaded in accordance with the publisher's self-archiving policy.


Enlighten: Publications

Scriminaci, M., Lommatzsch, A., Kille, B., Hopfgartner, F., Larson, M., Malagoli,
D., and Sereny, A. (2016) Idomaar: A Framework for Multi-dimensional
Benchmarking of Recommender Algorithms. 10th ACM Conference on
Recommender Systems, Boston, MA, USA, 15-19 Sept 2016.

There may be differences between this version and the published version. You are
advised to consult the publisher’s version if you wish to cite from it.

http://eprints.gla.ac.uk/121693/

Deposited on: 16 August 2016

Enlighten – Research publications by members of the University of Glasgow
http://eprints.gla.ac.uk

Idomaar: A Framework for Multi-dimensional
Benchmarking of Recommender Algorithms
Mario Scriminaci¹, Andreas Lommatzsch², Benjamin Kille², Frank Hopfgartner³,
Martha Larson⁴, Davide Malagoli¹, Andras Sereny⁵, Till Plumbaum²
¹ ContentWise R&D – Moviri, Milan, Italy, {firstname.lastname}@moviri.com
² TU Berlin – DAI-Lab, Berlin, Germany, {firstname.lastname}@dai-labor.de
³ University of Glasgow, Glasgow, UK, frank.hopfgartner@glasgow.ac.uk
⁴ TU Delft, Delft, The Netherlands, m.a.larson@tudelft.nl
⁵ Gravity R&D, Budapest, Hungary, sereny.andras@gravityrd.com
ABSTRACT
In real-world scenarios, recommenders face non-functional requirements
of a technical nature and must handle dynamic data in the form of
sequential streams. Evaluation of recommender systems must take these
issues into account in order to be maximally informative. In this
paper, we present Idomaar, a framework that enables the efficient
multi-dimensional benchmarking of recommender algorithms. Idomaar goes
beyond current academic research practices by creating a realistic
evaluation environment and computing both effectiveness and technical
metrics for stream-based as well as set-based evaluation. A scenario
focusing on the “research to prototyping to productization” cycle at a
company illustrates Idomaar’s potential. We show that Idomaar
simplifies testing with varying configurations and supports flexible
integration of different data.
1. INTRODUCTION AND MOTIVATION
Increasingly, we witness a shift of recommender system research
toward large-scale systems developed for industry settings. This trend
was already well described in Amatriain’s tutorial on building
large-scale real-world recommender systems at ACM RecSys 2012 [1].
Given commercial systems’ complexity and the demand for high
performance, evaluation is subject to additional requirements:
contribution of complementary information, reliability in handling
large-scale problems, and use of different methods and metrics.
Evaluation must support both offline parameter tuning and online
monitoring of systems.
Benchmarking the performance of recommender systems with respect to
these aspects is challenging. Mark Levy pointed this out during his
keynote at the ACM RecSys 2013 workshop on Reproducibility and
Replication in Recommender Systems [6]. Said and Bellogín [7] concur
in their analysis of existing frameworks’ abilities. Many commonly
used software suites either do not provide the functionality required
to benchmark different aspects, or are too complex to set up.
We introduce Idomaar to address this challenge.¹ It enables
researchers to evaluate different algorithms with respect to multiple
criteria. The framework uses large-scale static data sets to simulate
live data streams, bringing offline evaluation closer to online A/B
testing. By comparing the performance of recommender algorithms
operating in a live system (e.g., as studied in the living lab
NewsREEL [2]) with their performance on these simulated data streams,
the framework can be used to study the transferability of offline
evaluation to an online setting. Finally, Idomaar enables
multi-dimensional evaluation, which simultaneously measures the
performance of algorithms with respect to precision-related and
technical aspects. These cover click-through rate (CTR) and
scalability-related measures (throughput and response time).
¹ Idomaar is available at https://github.com/crowdrec/idomaar;
see also http://rf.crowdrec.eu
Copyright is held by the authors.
RecSys 2016 Poster Proceedings, September 15-19, 2016, Boston, USA.
2. APPROACH AND FRAMEWORK
The reference framework Idomaar is a tool to evaluate
recommendation services in real-world settings. As opposed to typical
recommender system evaluation, which assumes static information,
real-world applications process data in the form of a stream of
information. In fact, users, items, and the interactions between them
continuously generate events that are fed to the recommender system.
For instance, new users register or existing users cancel their
subscription; new items emerge; users consume items. Such information
must be ingested and processed as soon as possible (e.g., by updating
the recommendation models) in order to be available. All these
messages are handled asynchronously. However, the system also has to
synchronously serve incoming recommendation requests within strict
time constraints. In practice, the whole flow of incoming messages
is managed by means of queues.
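The split between asynchronous ingestion and synchronous serving can be illustrated with a minimal sketch. It uses only the Python standard library and an in-process queue as a stand-in for a message broker; the class name, the event fields (type, user, item), and the popularity model are assumptions made for illustration and are not part of Idomaar.

```python
import queue
import threading
from collections import Counter


class PopularityRecommender:
    """Toy recommender (hypothetical): suggests the most consumed items."""

    def __init__(self):
        self._events = queue.Queue()   # buffer for asynchronous messages
        self._counts = Counter()       # the "model": item popularity
        self._lock = threading.Lock()
        worker = threading.Thread(target=self._consume, daemon=True)
        worker.start()

    def publish(self, event):
        """Asynchronous path: enqueue the message and return immediately."""
        self._events.put(event)

    def _consume(self):
        """Background worker: update the model as messages arrive."""
        while True:
            event = self._events.get()
            if event["type"] == "interaction":
                with self._lock:
                    self._counts[event["item"]] += 1
            self._events.task_done()

    def recommend(self, user, n=2):
        """Synchronous path: answer from the current model state."""
        with self._lock:
            return [item for item, _ in self._counts.most_common(n)]

    def flush(self):
        """Block until all queued messages have been processed."""
        self._events.join()


recommender = PopularityRecommender()
for item in ["a", "b", "a", "c", "a", "b"]:
    recommender.publish({"type": "interaction", "user": "u1", "item": item})
recommender.flush()
top_items = recommender.recommend("u2")
print(top_items)
```

The request path never waits for the queue to drain, so a recommendation may reflect a slightly stale model; this is the trade-off that makes strict response time limits achievable.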
Idomaar mimics the workflow of such a real-world scenario by
using state-of-the-art technologies (e.g., Apache Flume and Apache
Kafka) to manage data streaming. The architecture has been split
into four main modules, as depicted in Fig. 1: the data container, the
evaluator, the orchestrator, and the computing environment.
Figure 1: The Idomaar reference framework architecture.
The data container stores the data (entities and relations) in
accordance with the format defined in [9]. Part of the data bootstraps
the recommender system for training algorithms, while most of the
remaining data feeds the recommender system in real time while it
has to serve incoming recommendation requests for test purposes.
Finally, the remaining subset of the data, the ground truth, is hidden
from the recommender system and used to evaluate the quality of
the service in terms of the user metrics. The data is read by a custom
Apache Flume source and sent into an Apache Kafka queue.
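As a rough illustration of this three-way partition, the sketch below splits timestamped events into a training part, a streamed test part, and a hidden ground truth. Cutting chronologically at two time points is an assumption made for the example; the function name, the parameters, and the `ts` field are hypothetical, and the actual splitting strategy in Idomaar may differ.

```python
def split_by_time(events, train_until, test_until):
    """Partition events chronologically at two cut points (assumed scheme).

    training     : events before train_until (bootstrap the recommender)
    test_stream  : events in [train_until, test_until) (fed as a stream)
    ground_truth : events from test_until on (hidden, used for evaluation)
    """
    ordered = sorted(events, key=lambda e: e["ts"])
    training = [e for e in ordered if e["ts"] < train_until]
    test_stream = [e for e in ordered if train_until <= e["ts"] < test_until]
    ground_truth = [e for e in ordered if e["ts"] >= test_until]
    return training, test_stream, ground_truth


# Ten synthetic interaction events with timestamps 0..9.
events = [{"ts": t, "user": "u1", "item": f"i{t}"} for t in range(10)]
training, test_stream, ground_truth = split_by_time(events, 3, 8)
print(len(training), len(test_stream), len(ground_truth))
```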
The recommender system runs within a virtual machine, re-
ferred to as computing environment, whose environment is cre-
ated with Vagrant (https://www.vagrantup.com/) and where
all required libraries are automatically provisioned with Puppet
(https://puppetlabs.com/). The recommender system sub-
scribes to the required Apache Kafka channel and receives the
asynchronous messages (i.e., users, items, interactions, etc.).
Recommendation requests are sent synchronously via an HTTP interface
(or, alternatively, a 0MQ interface).
The evaluator compares recommendations generated by the com-
puting environment with the ground truth. In addition to standard
user metrics (RMSE, recall, precision), Idomaar evaluates business
metrics (e.g., scalability, response time, throughput), so as to
provide a 360-degree evaluation of the recommendation infrastructure.
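To make the two metric families concrete, the following sketch computes precision, recall, and RMSE alongside mean response time and throughput for a sequence of requests. It is an illustrative example under simple assumptions (set-based relevance, sequential requests, wall-clock timing); all function names are hypothetical and do not reflect the evaluator's actual interface.

```python
import math
import time


def precision_recall(recommended, relevant):
    """Set-based precision and recall of one recommendation list."""
    hits = len(set(recommended) & set(relevant))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def timed_requests(recommend, users):
    """Issue one request per user and record technical metrics."""
    results, latencies = {}, []
    start = time.perf_counter()
    for user in users:
        t0 = time.perf_counter()
        results[user] = recommend(user)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    technical = {
        "mean_response_time_s": sum(latencies) / len(latencies) if latencies else 0.0,
        "throughput_rps": len(users) / elapsed if elapsed > 0 else float("inf"),
    }
    return results, technical


# Hypothetical recommender that returns the same list for every user.
results, technical = timed_requests(lambda user: ["a", "b", "c", "d"], ["u1", "u2"])
precision, recall = precision_recall(results["u1"], ["a", "c", "e"])
print(precision, recall, technical)
```

Reporting both dictionaries side by side for the same run is what allows an algorithm with slightly lower precision but much lower response time to be identified as the better candidate under a given time constraint.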
Finally, the orchestrator coordinates all processes, including
launching and provisioning the computing environment, instruct-
ing the evaluator to split data into training, test, and ground truth,
feeding the recommender system with the incoming messages in
accordance with their timestamps, collecting the generated
recommendations, and computing the quality metrics.
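Feeding messages in accordance with their timestamps can be sketched as a replay loop that preserves the original gaps between events, optionally compressed by a speed-up factor. This is a minimal sketch; the function `replay`, its `speedup` and `sleep` parameters, and the `ts` field are assumptions for illustration rather than the orchestrator's actual implementation.

```python
import time


def replay(events, deliver, speedup=1.0, sleep=time.sleep):
    """Deliver events in timestamp order, waiting out the gaps between them.

    speedup > 1 compresses time; sleep is injectable so that the loop
    can be exercised without actually waiting.
    """
    previous_ts = None
    for event in sorted(events, key=lambda e: e["ts"]):
        if previous_ts is not None:
            sleep((event["ts"] - previous_ts) / speedup)
        deliver(event)
        previous_ts = event["ts"]


# Dry run: record the waits instead of sleeping.
delivered, waits = [], []
replay(
    [{"ts": 10, "item": "c"}, {"ts": 0, "item": "a"}, {"ts": 4, "item": "b"}],
    deliver=delivered.append,
    speedup=2.0,
    sleep=waits.append,
)
print([e["item"] for e in delivered], waits)
```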
Moving from an offline toward an online scenario (where the data
stream is the real flow of data rather than a simulation from
historical information) means either replacing the Apache Flume source
with another one (e.g., one that reads from log data) or ingesting the
data directly into the Apache Kafka queue.
3. RELATED WORK
Various frameworks have been proposed to facilitate evaluating
recommender systems. Ekstrand et al. [3] introduce LensKit to
increase comparability of recommender system evaluation. Mahout
is a scalable machine learning toolkit implemented in Java. Both
frameworks ship with a selection of recommendation algorithms
and some evaluators. Gantner et al. [4] created MyMediaLite as a
lightweight recommender system framework. It comprises some
recommendation algorithms along with predefined evaluation proto-
cols. Said and Bellogín [8] proposed RiVal to facilitate comparing
various recommendation algorithms. The framework’s architecture
supports cross-framework comparisons. The variety in frameworks
emphasizes the demand for tools to evaluate recommender sys-
tems. Although the presented tools support evaluation, all presented
frameworks measure quality only in terms of predictive performance.
Operating recommender systems face additional challenges. For
instance, they might be subject to response time restrictions or
experience heavy load. Finally, running the above-mentioned frameworks
on different hardware still yields inconsistent results. For these
reasons, we propose Idomaar, a language-agnostic framework with
cloud support and the ability to measure time and space complexity.
4. PROTOTYPE TO PRODUCTIZATION
Idomaar was used in the “research to prototyping to operating”
cycle of a recommender system service provider to validate its
usefulness. The focus of the validation was on multidimensional
evaluation that simultaneously takes effectiveness and technical
constraints into account. The only limitation identified was the
monitoring of performance metrics like CPU and memory usage
(cf. [5, 10]).
With respect to the cycle itself, we found that having a standard
in terms of data formats and APIs increases the reusability of code
in all phases and helps data scientists to produce code that can be
transformed into effective prototypes. Our “research to prototyping
to operating” cycle shows:
• Idomaar allowed easy testing using different datasets with dif-
ferent algorithms that share the same input types and subjects
(e.g., implicit or explicit events, sessions or users).
• The Idomaar format is flexible enough to change subjects and
event types, or to integrate contextual information, both on
events and on recommendation requests.
• Idomaar can be considered a suitable tool for recommender
system research: reuse of code speeds up prototyping and stan-
dardization of datasets helps merging different data sources.
In the future, Idomaar will go beyond classical recommender
systems domains (e.g., movies or products) and consider additional
types such as actions or navigation trees. Supporting generic objects
and additional evaluation functions promises to establish Idomaar as
a standard research tool for recommender systems. Such a standard
could provide valuable support for the current trend of researchers
participating in community-wide recommender system challenges.
Idomaar has already been applied in such a challenge.
5. CONCLUSION
In this paper, we present the Idomaar framework, which enables
the efficient, reproducible evaluation of recommender algorithms
in real-world stream-based scenarios. Idomaar simplifies the multi-
dimensional evaluation taking into account precision-related metrics
as well as technical aspects.
Acknowledgment: The research leading to these results was performed
in the CrowdRec project, which has received funding from the EU 7th
Framework Programme FP7/2007-2013 under grant agreement No. 610594.
6. REFERENCES
[1] X. Amatriain. Building industrial-scale real-world
recommender systems. In RecSys ’12, pages 7–8, 2012.
[2] T. Brodt and F. Hopfgartner. Shedding light on a living lab:
the CLEF NEWSREEL open recommendation platform. In
IIiX ’14, pages 223–226, 2014.
[3] M. D. Ekstrand, M. Ludwig, J. A. Konstan, and J. T. Riedl.
Rethinking the recommender research ecosystem:
Reproducibility, openness, and LensKit. In RecSys ’11, 2011.
[4] Z. Gantner, S. Rendle, C. Freudenthaler, and
L. Schmidt-Thieme. MyMediaLite: A free recommender
system library. In RecSys ’11, pages 305–308. ACM, 2011.
[5] Y. Koh, R. Knauerhase, P. Brett, M. Bowman, Z. Wen, and
C. Pu. An analysis of performance interference effects in
virtual environments. In ISPASS’07. IEEE, 2007.
[6] M. Levy. Offline evaluation of recommender systems: all pain
and no gain? In RecSys 2013, page 1, 2013.
[7] A. Said and A. Bellogín. Comparative recommender system
evaluation: Benchmarking recommendation frameworks. In
RecSys ’14, pages 129–136. ACM, 2014.
[8] A. Said and A. Bellogín. RiVal: A toolkit to foster
reproducibility in recommender system evaluation. In
RecSys ’14, pages 371–372, 2014.
[9] A. Said, B. Loni, R. Turrin, and A. Lommatzsch. An extended
data model format for composite recommendation. In
RecSys’14 (Posters), 2014.
[10] O. Tickoo, R. Iyer, R. Illikkal, and D. Newell. Modeling
virtual machine performance: challenges and approaches.
SIGMETRICS Perf. Evaluation Review, 37(3):55–60, 2010.

