4. Evaluation Methods

Last modified by Mark Neerincx on 2023/03/01 10:35

For the evaluation of a prototype, several frameworks can be followed, starting with DECIDE (Kurniawan, 2004). DECIDE stands for:

  • Determine the goals
  • Explore the questions
  • Choose evaluation approach and methods
  • Identify practical issues
  • Decide about ethical issues
  • Evaluate, analyze, interpret, present data

First, we determine the high-level goals of the study and the motivation behind them, since these influence how we approach it. We then explore the questions to be answered and choose the evaluation approach and methods: whether they are based on quantitative or qualitative data, and how the data will be collected, analyzed, and presented. At the same time, practical issues such as participants, budget, and schedule are identified, and a pilot study is performed if needed. It is important to adhere to any ethical procedures that are in place, so that participants know their rights and are protected. Finally, the data are evaluated: we determine whether the results are unreliable, invalid, or biased, whether they are affected by the environment, and whether they generalize well.

Another framework often used is IMPACT (Benyon et al., 2005):

  • Intention
  • Measures and metrics
  • People
  • Activities
  • Context
  • Technologies

These are the elements to consider when establishing evaluation objectives. First, we state the objectives and claims of the study. Next, we determine the specific measures and metrics to be used, followed by the participants and the activities they will perform in a specific use case. We also define the context (social, ethical, physical, or environmental), and finally we decide on the technologies to be used, both hardware and software.

Regarding evaluation methods, there are two types: formative and summative. Formative evaluation relies on open-ended questions about specific parts of the interaction, whereas summative evaluation focuses on the overall effect and summarizes whether the objective has been reached. Both qualitative and quantitative data can be examined: qualitative data serve to explore and discover patterns and themes, while quantitative data serve to describe, explain, and predict based on the outcomes. A combination of the two is often optimal.

Another factor to consider during an evaluation study is the experiment design. There are two types: the within-subjects design and the between-subjects design. The former exposes every participant to all test conditions, yielding repeated measures. It needs fewer subjects and reduces variance; however, there is a risk of carry-over effects from one condition to the next, and the setup can be more challenging since it requires more time per participant. The latter assigns each group of participants only one test condition. It is much simpler to execute, but variance may be greater due to inter-subject differences in characteristics.
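A common way to mitigate the carry-over effects of a within-subjects design is to counterbalance the order of conditions across participants. The sketch below (not taken from the report; the condition names are hypothetical) builds a simple rotational Latin square, so that each condition appears once in every position across participants:

```python
# Minimal counterbalancing sketch for a within-subjects design.
# Condition names are hypothetical placeholders, not the study's actual conditions.
conditions = ["robot_with_music", "robot_without_music", "baseline"]

def latin_square_orders(conditions):
    """Return one condition order per participant row: each condition
    appears exactly once per row and once per position across rows."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)]
            for row in range(n)]

orders = latin_square_orders(conditions)
for participant, order in enumerate(orders, start=1):
    print(f"Participant {participant}: {order}")
```

With more participants than conditions, the same orders can simply be reused in blocks, keeping positions balanced overall.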

When executing an experiment, it is also valuable to examine the interaction through multiple lenses, in order to identify issues and opportunities that are not immediately obvious: the perspective of a stakeholder other than the main user, of groups that do not directly interact with the system, or even a more technical or legal perspective.

For testing the general and food & music claims, we opted for existing, previously validated questionnaires (measuring the robot's pleasantness and the participants' mood), combined with some questions of our own about food and music. A subset of the EVEA questionnaire (Sanz, 2001) was used to assess participants' mood before and after each interaction, whereas a subset of the Godspeed questionnaire (Bartneck et al., 2009) asked participants to rate how pleasant and intelligent they found the robot. A few additional questions probed the participant's relation to the food and the music, either before or during the interaction, and some interview-style questions were asked to gather qualitative data.
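The before/after structure of the mood measurement yields paired per-item change scores. A minimal illustration (the item names and ratings below are hypothetical, not actual EVEA items or study data):

```python
# Hypothetical Likert ratings for a few mood items, before and after
# one interaction with the robot (item names are illustrative only).
before = {"happy": 2, "sad": 4, "calm": 3}
after  = {"happy": 4, "sad": 2, "calm": 4}

# Paired change scores: these are what a paired statistical test compares.
change = {item: after[item] - before[item] for item in before}
print(change)  # → {'happy': 2, 'sad': -2, 'calm': 1}
```

Collecting such paired scores per participant is what makes a paired test like the Wilcoxon signed-rank test applicable.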

For the statistical interpretation of the numerical results, the paired Wilcoxon signed-rank test was used (Wilcoxon, 1945). This is a non-parametric statistical test that can compare paired samples, such as the before-after distributions, or the distributions of the same question for two versions of the robot. The test was run with three different alternative hypotheses: that the median of the first distribution is greater than, different from, or less than the median of the second distribution.
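A sketch of how such a test can be run with `scipy.stats.wilcoxon`, using made-up ratings rather than the study's data (the `alternative` parameter selects the greater/two-sided/less hypothesis mentioned above):

```python
# Paired Wilcoxon signed-rank test on hypothetical before/after ratings.
from scipy.stats import wilcoxon

# Made-up 5-point Likert mood ratings from eight participants.
before = [2, 3, 2, 4, 3, 2, 3, 3]
after  = [4, 4, 3, 5, 4, 3, 4, 4]

# Two-sided: null hypothesis is that the paired differences are
# symmetric about zero (no systematic change).
stat, p_two_sided = wilcoxon(before, after, alternative="two-sided")

# One-sided variants: did ratings increase (or decrease) after?
_, p_greater = wilcoxon(after, before, alternative="greater")
_, p_less = wilcoxon(after, before, alternative="less")

print(f"two-sided p={p_two_sided:.4f}, "
      f"greater p={p_greater:.4f}, less p={p_less:.4f}")
```

Since every hypothetical rating increased, the "greater" alternative yields a much smaller p-value than the "less" alternative.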

  1. Bartneck, C., Kulić, D., Croft, E., & Zoghbi, S. (2009). Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International journal of social robotics, 1(1), 71-81.
  2. Benyon, D., Turner, P., & Turner, S. (2005). Designing interactive systems: People, activities, contexts, technologies. Pearson Education.
  3. Al-Anssari, H., Abdel-Qader, I., & Mickus, M. (2021). Food Intake Vision-Based Recognition System via Histogram of Oriented Gradients and Support Vector Machine for Persons With Alzheimer's Disease. International Journal of Healthcare Information Systems and Informatics, 16(4), 1-19. https://doi.org/10.4018/ijhisi.295817
  4. Kigozi, E., Egwela, C., Kamoga, L., Nalugo Mbalinda, S., & Kaddumukasa, M. (2021). Nutrition Challenges of Patients with Alzheimer's Disease and Related Dementias: A Qualitative Study from the Perspective of Caretakers in a Mental National Referral Hospital. Neuropsychiatric Disease and Treatment, 17, 2473-2480. https://doi.org/10.2147/ndt.s325463
  5. Kurniawan, S. (2004). Interaction design: Beyond human-computer interaction by Preece, Sharp, and Rogers (2001), ISBN 0471492787.
  6. Poor appetite and dementia. Alzheimer's Society. (2022). Retrieved 2 April 2022, from https://www.alzheimers.org.uk/get-support/daily-living/poor-appetite-dementia
  7. Sanz, J. (2001). Scale for Mood Assessment (EVEA).
  8. Wilcoxon, F. (1945). Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6), 80–83. https://doi.org/10.2307/3001968