Early findings from a large-scale user study of CHESTNUT: Validations and implications

Towards a serendipitous recommender system with user-centred understanding, we have built CHESTNUT, an Information Theory-based Movie Recommender System, which introduced a more comprehensive understanding of the concept. Although off-line evaluations have already demonstrated that CHESTNUT greatly improves serendipity performance, feedback on CHESTNUT from real-world users of an online service is still lacking. To evaluate how serendipitous the results delivered by CHESTNUT are, we designed, organized and conducted a large-scale user study, which involved 104 participants from 10 campuses in 3 countries. Our preliminary feedback shows that, compared with mainstream collaborative filtering techniques, although CHESTNUT limited users' feelings of unexpectedness to some extent, it showed significant improvement in their feelings that the recommendations were both beneficial and interesting, which substantially increased users' experience of serendipity. Based on these findings, we summarize three key takeaways that we believe will benefit the further design and engineering of serendipitous recommender systems. All details of our large-scale user study can be found at https://github.com/unnc-idl-ucc/Early-Lessons-From-CHESTNUT


Introduction
Towards a more comprehensive understanding of serendipity, we have built CHESTNUT, the first serendipitous movie recommender system with an Information Theory-based algorithm, which embeds this understanding in a practical recommender system [16,24]. Although experimental studies on static data sets have shown that CHESTNUT achieves significant improvements in the incidence of serendipity (i.e. around 2.5x) compared with other mainstream collaborative filtering approaches, a user study is still needed to validate CHESTNUT and to prompt further investigation into the concept of serendipity and the engineering of serendipitous recommender systems.
Therefore, we carried out a large-scale user study around CHESTNUT, along with its experimental benchmark systems. To enable a detailed study, we first designed a plan to ensure all participants were capable of experiencing serendipity, by excluding the effects of any environmental factors as much as possible. We then launched the study, and invited 104 participants to contribute, from whom we collected extensive data and qualitative records from real-world users over a ten-month period.
Our initial results indicate that, although CHESTNUT limited users' feelings of "unexpectedness" when compared with item-based and user-based collaborative filtering approaches, it did show significant improvement in users' feelings that the recommendations were both "beneficial" and "interesting", which substantially increased their experience of serendipity. Our interviews attributed the low level of "unexpectedness" to the relatively old movies recommended by CHESTNUT.
Based on these preliminary statistics and context-based investigations, we summarized three key takeaways for future work, which concern the Design Principles of User Interfaces, the Novel Integration of More Content-based Approaches and the Introspection of Serendipity Metrics. We believe they are useful for the further design and engineering of serendipitous recommender systems.
More specifically, we make three main contributions here:
(1) A Large-scale User Study among CHESTNUT and two mainstream Collaborative Filtering Systems. We have performed a large-scale user study among CHESTNUT, Item-based and User-based Collaborative Filtering approaches. Our study lasted for around 10 months and involved 104 participants across 3 countries. All details of our large-scale user study can be found at https://github.com/unnc-idl-ucc/Early-Lessons-From-CHESTNUT
(2) Validations and Implications of the Improvements from CHESTNUT in Serendipity. Through this study, we have validated the effectiveness of CHESTNUT in terms of serendipitous recommendations, compared with widely commercialized algorithms. Our initial results also indicate some limitations of our current end-to-end prototype, which have limited the performance of CHESTNUT.
(3) Takeaways for Principles of Designs, Developments and Evaluations in Engineering of Serendipitous Recommender Systems. Based on several implications from this study, we have summarized three key takeaways, as potential future work directions, to discuss future principles of design, development and evaluation for serendipitous Recommender Systems.
This paper is organized as follows. Section 2 provides the necessary background information and illustrates the motivation for this study. Section 3 introduces the details of this study, spanning from methodology to technical adjustments. Section 4 reports our initial results and the relevant analysis. Section 5 presents our discussion and introspection, to motivate and stimulate potential principles and follow-up work in the future.

Background and Motivation
For a decade, serendipity has been understood narrowly within the Recommender System field, and it has been defined in previous research as receiving an unexpected and fortuitous item recommendation [13]. Such a mindset has led to many efforts in the development and investigation of serendipitous recommender systems through modelling and algorithmic designs and optimizations, instead of rethinking the natural understanding of the concept [1-3, 5, 6, 8-12, 14, 15, 17, 18, 20-22].
CHESTNUT was built to validate a novel insight around serendipitous recommender systems: as the first user-centered serendipitous recommender system, it merges insight, unexpectedness and usefulness to provide a more comprehensive understanding of serendipity. In the context of movie recommendation, CHESTNUT makes connections between users through the directors of the movies they have rated (cInsight), filters the candidates down to movies that are neither popular nor familiar (cUnexpectedness) and then generates recommendations through rating prediction (cUsefulness). These three steps ensure relevance, unexpectedness and value respectively.
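The three-step flow described above can be illustrated with a deliberately simplified sketch. It is not CHESTNUT's actual algorithm: the Information Theory-based connection-making is replaced by a plain "shares at least one director" test, familiarity is approximated as "already rated by the target user", popularity is a raw rating count, and rating prediction is an unweighted mean. All function names, the threshold `max_popularity` and the toy data are hypothetical.

```python
from collections import Counter

# Toy data (made up): user -> {movie: rating}, and movie -> director.
ratings = {
    "target": {"m1": 5, "m2": 4},
    "u1": {"m1": 4, "m3": 5, "m4": 4},
    "u2": {"m5": 5, "m6": 4},
}
directors = {"m1": "d1", "m2": "d2", "m3": "d1",
             "m4": "d3", "m5": "d4", "m6": "d4"}


def connected_users(target, ratings, directors):
    """cInsight stand-in: users who share at least one director with the target."""
    target_dirs = {directors[m] for m in ratings[target]}
    return [
        user for user, rated in ratings.items()
        if user != target and target_dirs & {directors[m] for m in rated}
    ]


def unexpected_candidates(target, neighbours, ratings, max_popularity):
    """cUnexpectedness stand-in: keep movies that are neither popular nor familiar."""
    popularity = Counter(m for rated in ratings.values() for m in rated)
    seen = set(ratings[target])  # assumption: "familiar" means already rated
    candidates = {m for user in neighbours for m in ratings[user]}
    return {m for m in candidates
            if m not in seen and popularity[m] <= max_popularity}


def predict_rating(movie, neighbours, ratings):
    """cUsefulness stand-in: mean rating among connected users who rated the movie."""
    scores = [ratings[u][movie] for u in neighbours if movie in ratings[u]]
    return sum(scores) / len(scores)


def recommend(target, ratings, directors, max_popularity=1, top_n=5):
    """Chain the three stand-in steps and return the top-N movie ids."""
    neighbours = connected_users(target, ratings, directors)
    candidates = unexpected_candidates(target, neighbours, ratings, max_popularity)
    scored = {m: predict_rating(m, neighbours, ratings) for m in candidates}
    # Highest predicted rating first; ties broken by movie id for determinism.
    return sorted(scored, key=lambda m: (-scored[m], m))[:top_n]


top_movies = recommend("target", ratings, directors)
```

The point of the sketch is only the ordering of the stages: connections are made first, the candidate pool is narrowed second, and ratings are predicted last, so that usefulness is only estimated for items that already passed the unexpectedness filter.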
Although the theoretical support of CHESTNUT [24], its effectiveness [23] and its practical system performance [16] have been examined earlier, validation from real-world users is still missing. A large-scale user study would also help to uncover issues that cannot be found through off-line evaluations, and thus enhance the system's practicality. Therefore, we have performed a large-scale user study, since we believe such a study is essential and meaningful for both CHESTNUT-related work and the communities of Recommender Systems and Information Management.

Methodology
In this section, we introduce the research methods used in the CHESTNUT user study. Beyond environmental factors, previous user studies of serendipity have pointed out that users' willingness to participate undoubtedly affects their serendipitous experiences [23]. To allow us to collect satisfactory feedback, we scheduled face-to-face interviews as suggested by participants. However, we were unable to manage all interviews in this way, owing to geographical limitations. For those who could not attend in person, we applied a mobile diary method to record relevant details, a systematic method used in previous user studies on serendipity [19,23].

Participants
In total, 104 undergraduate students were invited to take part in this user study, each of whom had rated at least 30 movies. Although a previous study invited professional scholars to take part in serendipity interviews (i.e. because their speciality made experiencing serendipity easier), this study aimed to investigate serendipity within a more generalized group [23]. Details about all participants' geographical distribution are reported in Table 1. Details about all participants' personal information are reported in Table 2. Details about the levels of involvement of participants are reported in Table 2. All participants' names reported in this study are aliases.
Two points about Table 1 need explanation: 1) the term Countries refers to the country in which the corresponding campus is based; 2) the term Other Campuses refers to campuses that each have only one participant.

Procedure
Before the bulk of this study began, a pilot study was performed with two male participants on campus for a period of four days. Detailed experimental matters, such as time arrangement, system functionality and interaction preparation, were all decided based on this pilot study.
The bulk of this study was then conducted. For each participant, there were two parts to the whole study: a pre-interview and an empirical interview. First, for the pre-interview, each participant was invited individually to a short meeting (around 30 minutes), to introduce the purpose of the study and to collect their own movie rating records. Since the majority of users do not use IMDb as their channel for managing their favorite movies, we had to perform this collection procedure within the interview, which also meant that participants could prepare in advance.
Second, for the empirical interview, each participant made their own schedule in advance. In the time between the two interviews, we processed the collected user profiles in CHESTNUT, and placed their recommendation results online. During this stage, users were able to view their recommendation results on the website. During the interviews, participants were guided, both by the researchers and the system, to review their profiles, and to check and comment on their recommendation results step by step. We have sketched the interfaces in Figures 1 and 2.

[Figure: User Ratings and Comment Page]
This user study lasted for ten months because we followed "user-centred" arrangements, and each schedule was determined by the participants involved. In addition to CHESTNUT, we also set up two benchmark systems (i.e. an item-based and a user-based approach) and included their results together within this study. Each system produced five recommendation results for each participant, according to their submitted profile. Since not every user could attend a face-to-face interview, we had to perform interviews via online applications (e.g. Wechat), where necessary.

Data Collection
Two types of data were collected: 1) the user experiences of all recommended movies, generated from all three systems. For each movie, the participants were able to rate whether they felt the information they were presented with was "unexpected", "interesting" and "beneficial", according to the scale shown in Table 2. 2) On each page, the users also had the option to leave comments on any of the movies, if they felt it was necessary.

System Modifications
Previous studies on CHESTNUT applied the relatively small HetRec 2011 data set from Movielens (i.e. 2113 users with 800,000 ratings) [4]. Since CHESTNUT is a memory-based collaborative filtering system, its baseline data set had to be expanded to adapt the system to a real-world scenario. We chose the most recent 30 years' movies and related ratings from ml-20m [7], which is one of the latest and largest data sets from Movielens (i.e. 138471 users with 150,000,000 ratings).
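As an illustration of this kind of restriction, the sketch below keeps only movies released within a 30-year window and the ratings that refer to them. It relies on the MovieLens convention of appending the release year in parentheses to each title (for example "Toy Story (1995)"); the reference year `latest_year`, the function names and the in-memory rows are assumptions made for the example, not the preprocessing actually used for CHESTNUT.

```python
import re

# MovieLens titles end with the release year in parentheses, e.g. "Toy Story (1995)".
YEAR_PATTERN = re.compile(r"\((\d{4})\)\s*$")


def release_year(title):
    """Return the trailing four-digit year of a title, or None if absent."""
    match = YEAR_PATTERN.search(title)
    return int(match.group(1)) if match else None


def recent_movie_ids(movies, latest_year, window=30):
    """Ids of movies released within the last `window` years up to latest_year."""
    earliest = latest_year - window + 1
    keep = set()
    for movie_id, title in movies:
        year = release_year(title)
        if year is not None and earliest <= year <= latest_year:
            keep.add(movie_id)
    return keep


def filter_ratings(rating_rows, keep_ids):
    """Keep (user_id, movie_id, rating) rows whose movie survived the filter."""
    return [row for row in rating_rows if row[1] in keep_ids]


# Hypothetical rows standing in for the movie and rating tables.
movies = [(1, "Toy Story (1995)"), (2, "Casablanca (1942)"), (3, "Untitled Entry")]
rating_rows = [(10, 1, 4.0), (10, 2, 5.0), (11, 3, 3.5)]

keep = recent_movie_ids(movies, latest_year=2015)  # assumed reference year
recent_ratings = filter_ratings(rating_rows, keep)
```

Titles without a parsable year are dropped here rather than guessed, which is one possible policy; a real pipeline might instead log them for manual inspection.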

Results
In this section, we introduce the preliminary results and analysis drawn from our user study. After collecting all feedback, we performed a series of preliminary analyses to examine the effectiveness of CHESTNUT. First, we give a performance overview of CHESTNUT and its benchmark systems, which relies on data collected from the online questionnaires (i.e. to sketch the level of users' feelings under the different metrics). Next, we sketch out our perspectives as preliminary hypotheses, which will entail additional investigations and discussions around both serendipity and CHESTNUT.

Performance Overview
Our performance overview is divided into three parts, as arranged in the online questionnaires. We examine the rating levels for whether the participants thought the information provided was "unexpected", "interesting" and "beneficial" separately. This was done for CHESTNUT, and for the item-based and user-based systems respectively.
First, we examined all participants' levels of "unexpected" feelings towards their results, as shown in Figure 3. CHESTNUT performed the worst with regard to users' feelings of unexpectedness, with the level only reaching 2.465 on average. The user-based and item-based approaches achieved better ratings for this factor, with levels of 2.731 and 2.91 on average respectively.
We then explored the levels of feelings relating to whether participants found the information to be "interesting", and the results are shown in Figure 4. In terms of providing interesting recommendation results, CHESTNUT achieved 3.523 on average, whereas the user-based and item-based approaches only achieved average ratings of 3.163 and 2.881 respectively. These results suggest that CHESTNUT is able to provide more interesting recommendation results.
Finally, we checked quantitatively the levels of feelings relating to whether participants found the information to be "beneficial", as illustrated in Figure 5. In this case, CHESTNUT achieved a rating of 3.325 on average, but the user-based and item-based approaches only reached 2.951 and 2.819 on average respectively. As with the "interesting" ratings, these results suggest that CHESTNUT is able to provide more beneficial recommendation results than the two conventional approaches.
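To make the comparison concrete, the following sketch shows one way such per-system averages could be computed from questionnaire responses, and then derives the gap between CHESTNUT and each benchmark from the averages reported above. The tuple layout, the function names and the system labels are assumptions for illustration; only the nine averages in `REPORTED` come from the text.

```python
from collections import defaultdict
from statistics import mean


def mean_by_system_and_metric(responses):
    """Average scores from (system, metric, score) questionnaire tuples."""
    buckets = defaultdict(list)
    for system, metric, score in responses:
        buckets[(system, metric)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Average ratings as reported in the performance overview.
REPORTED = {
    "unexpected": {"chestnut": 2.465, "user_based": 2.731, "item_based": 2.91},
    "interesting": {"chestnut": 3.523, "user_based": 3.163, "item_based": 2.881},
    "beneficial": {"chestnut": 3.325, "user_based": 2.951, "item_based": 2.819},
}


def gaps_to_chestnut(reported):
    """CHESTNUT's average minus each benchmark's average, per metric."""
    return {
        metric: {
            system: round(scores["chestnut"] - value, 3)
            for system, value in scores.items()
            if system != "chestnut"
        }
        for metric, scores in reported.items()
    }


gaps = gaps_to_chestnut(REPORTED)
```

On the reported averages, CHESTNUT is ahead by 0.360 (user-based) and 0.642 (item-based) on "interesting", and by 0.374 and 0.506 on "beneficial", while it trails by 0.266 and 0.445 on "unexpected".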

Why Were the Results Not So "Unexpected"?
The major concern of our user study is that results from CHESTNUT were not as highly rated in terms of their unexpectedness as we predicted. Normally, CHESTNUT has a particular functional unit (i.e. "cUnexpectedness") to ensure that all results are unexpected. However, during this study, CHESTNUT seemed to fail to make users feel that the results were unexpected.
Having looked further into this phenomenon, we found that, as participants claimed, they felt the results were unexpected when they encountered relatively old movies (i.e. those made approximately 20 years ago), because most only kept track of more recent productions. We further confirmed this by drawing the year distribution of recommendation results from the different systems, as shown in Figure 6. We believe this is the main reason why the benchmark systems outperformed CHESTNUT in their "unexpected" ratings.
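A year distribution of the kind plotted in Figure 6 can be tabulated with a few lines of code. The sketch below groups recommended movies by system and release decade; the decade granularity, the function name and the sample records are assumptions for illustration and do not reproduce the study's data.

```python
from collections import Counter


def decade_distribution(recommendations):
    """Count recommended movies per release decade for each system.

    recommendations: iterable of (system, release_year) pairs.
    """
    counts = {}
    for system, year in recommendations:
        decade = (year // 10) * 10
        counts.setdefault(system, Counter())[decade] += 1
    return counts


# Hypothetical records: which system recommended a movie from which year.
sample = [
    ("chestnut", 1994), ("chestnut", 1999), ("chestnut", 2003),
    ("item_based", 2012), ("user_based", 2008), ("user_based", 2011),
]

distribution = decade_distribution(sample)
for system in sorted(distribution):
    # One row per system, decades listed in chronological order.
    print(system, sorted(distribution[system].items()))
```

Comparing the rows side by side gives a quick textual check of whether one system's recommendations skew towards older releases before any plot is drawn.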

Discussions and Takeaways
Based on the preliminary results and analysis from our study, we discuss the issues revealed and the relevant takeaways, to stimulate novel insights and follow-up investigations. In general, there are three aspects that we want to highlight:
-Design Principles of User Interfaces.
The current User Interface design of CHESTNUT directly presents high-level overview information about different items, and our results indicate that such general design choices could hurt users' capability to encounter serendipitous information. In our case in particular, the tag "year" had a considerable negative effect on "unexpectedness".
We believe future serendipity-oriented interfaces demand content-based interaction and personalized mechanisms. For instance, in the context of movies, we could use movie trailers as preview resources, and let users customize their interfaces by re-ordering the priority of displayed information as they prefer.
-Novel Integration of More Content-based Approaches.
The current study of CHESTNUT has only used "Director" as the connection-making resource. Given that directors differ in their active periods, levels of productivity and genres, it is understandable that this setting leads to old movies appearing frequently.
We believe future studies of CHESTNUT and other serendipitous recommender systems should take the category of information used as the guiding resource into account. In CHESTNUT, we have already included support for other information categories in cInsight, such as Years and Genres. This also calls for a novel integration of different content-based approaches, with the aim of providing personalized recommendation results.
-Introspection of Serendipity Metrics.
The comparison between our real-world study and off-line evaluations indicates that there is a large gap in current serendipity measures in the context of Recommender Systems.
We believe this also opens up many opportunities to develop novel schemes and frameworks for the validation of serendipitous designs and implementations. More specifically, the results from our experimental study and the real-world feedback around CHESTNUT are valuable, especially when the two are combined.
Although we have addressed three aspects in our takeaways, we believe there are still many challenges and opportunities beyond CHESTNUT. Here, we only provide reflections from our own experience, and we hope they will stimulate more interesting and novel ideas for serendipitous designs and engineering, in both Recommender Systems and other relevant communities.

Conclusion and Future Work
In this paper, we have presented our early findings from a large-scale user study of CHESTNUT, which involved 104 participants over a ten-month period. According to our initial analysis, compared with mainstream collaborative filtering techniques, although CHESTNUT limited users' feelings of unexpectedness to some extent, it showed significant improvement in their feelings that the recommendations were both beneficial and interesting, which substantially increased users' experience of serendipity. Based on these results, we have summarized three key takeaways, which should benefit the further design and engineering of serendipitous recommender systems.
Our future work will conduct further in-depth studies, based on our summarized takeaways, to investigate both the concept of serendipity and the optimization of CHESTNUT using these empirical data. Beyond CHESTNUT, we are particularly interested in bridging real-world feedback and off-line results to build more sophisticated frameworks, which could further validate other serendipitous recommender system designs and implementations.