Recommendation systems are ubiquitous in our digital world, influencing our daily lives from online shopping and music streaming to movie and TV show selection. But how can we ensure that these systems are truly effective, fair, and useful? This is where evaluation comes in.
Without thorough evaluation, we cannot know whether a recommendation system truly “works.” We would have no clear picture of how accurate the recommendations are, how fair they are, or how much novel information they offer users. Evaluation is the key to measuring a system's performance and understanding where it needs improvement.
The evaluation of recommendation systems is an interdisciplinary task that brings together different fields:
Information Retrieval: Many common metrics such as Recall and Precision originate from this field. These metrics help assess the accuracy and relevance of recommendations.
Machine Learning: Methods like training/test-set splits are essential for testing and validating model performance (a brief sketch after this list shows how such splits combine with IR metrics).
Human-Computer Interaction: To determine the actual value of recommendations for users, methods from human-computer interaction, such as user studies, are also needed.
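To make this interplay concrete, here is a minimal sketch in Python of how two of these ingredients typically combine in offline evaluation: a per-user train/test split (the machine-learning part) and Precision@k / Recall@k over held-out items (the information-retrieval part). The toy data and the simple popularity recommender are hypothetical placeholders, not a reference to any specific system or library.

```python
# Minimal offline-evaluation sketch with toy data and a popularity baseline.
import random
from collections import Counter

# Hypothetical interaction log: user -> list of consumed items.
interactions = {
    "u1": ["a", "b", "c", "d"],
    "u2": ["b", "c", "e"],
    "u3": ["a", "c", "d", "e", "f"],
}

# Machine-learning practice: hold out part of each user's history as a test set.
def split_per_user(data, holdout=0.25, seed=42):
    rng = random.Random(seed)
    train, test = {}, {}
    for user, items in data.items():
        items = items[:]
        rng.shuffle(items)
        cut = max(1, int(len(items) * holdout))
        test[user] = set(items[:cut])
        train[user] = items[cut:]
    return train, test

def popularity_recommender(train, user, k):
    # Recommend the k globally most popular items the user has not yet seen.
    counts = Counter(i for items in train.values() for i in items)
    seen = set(train[user])
    ranked = [i for i, _ in counts.most_common() if i not in seen]
    return ranked[:k]

# Information-retrieval metrics: Precision@k and Recall@k, averaged over users.
def precision_recall_at_k(train, test, k=3):
    precisions, recalls = [], []
    for user, relevant in test.items():
        recs = popularity_recommender(train, user, k)
        hits = len(set(recs) & relevant)
        precisions.append(hits / k)
        recalls.append(hits / len(relevant))
    n = len(test)
    return sum(precisions) / n, sum(recalls) / n

train, test = split_per_user(interactions)
p, r = precision_recall_at_k(train, test, k=3)
print(f"Precision@3 = {p:.2f}, Recall@3 = {r:.2f}")
```

In a real evaluation the popularity baseline would be replaced by the system under test, and additional criteria such as fairness or novelty would be measured alongside accuracy.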
By integrating these different disciplines, a comprehensive and effective evaluation can be achieved.
One of the biggest challenges in evaluating recommendation systems is involving all relevant stakeholders in the process. Recommendation systems are multi-stakeholder systems that must consider not only the user perspective but also that of the providers: the stakeholders who want to promote their products or services.
This is where industry can make a significant contribution through its domain knowledge. It can help identify the needs of all stakeholders and ensure that they are considered in the evaluation process.
Advances in the evaluation of recommendation systems have a direct impact on future research and development in this field. Only with precise and comprehensive evaluation methods can we recognize genuine progress in research. This, in turn, enables the development of novel recommendation systems, such as those based on generative artificial intelligence (Large Language Models).
Thorough and interdisciplinary evaluation is therefore not only a means of quality control but also a driving force for innovation and progress in the world of recommendation systems. By strengthening collaboration between science and industry and including all relevant perspectives, we can develop the next generation of recommendation systems that are even more accurate, fair, and useful.
These insights, and more, were highlighted at the Dagstuhl Workshop held by Dr. Dominik Kowald, Area Head of the Fair AI Department, whose work laid the foundation for this text.
For more information on our research, please visit the Research Area Fair AI page.