4. Evaluation Methods
We used the following steps to design the evaluation and to compare the proposed prototype against the corresponding control condition:
1. Confirm the prototype: For the pilot study, the scenario to be tested and the control situation were set up at the Insyght Lab at TU Delft, and the team members carried out preliminary testing. We confirmed that both versions of the robot, with and without interactive storytelling, were working, and we verified the voice input and touch input to the robot.
2. Develop Questions: We then developed the metrics on which the robot must be evaluated. We decided to use a modified version of the Godspeed questionnaire, which each participant filled in after interacting with the robot. This questionnaire is described below.
3. Invite participants: Due to limited time and resources, patients with dementia (the actual users) could not be recruited for the study. Instead, we asked TU Delft students to test the prototype.
Research Question
"Is interactive storytelling more engaging and beneficial than storytelling in the third person for persons suffering from dementia?"
Thus, our control situation is a robot narrating a story without any involvement of the patient, and the scenario we want to evaluate is one where the robot narrates the same story while engaging the patient and taking input from them. With this comparison, we aim to find out whether interactive storytelling is beneficial and engaging for patients with dementia.
The Within-Subject Design
For the experiment, we chose a within-subject design over a between-subject design. This means that each participant interacts with the robot twice, once in each condition. We made this choice because of the limited number of participants and to avoid bias from differences in individual participants' preferences.
Summative Evaluation
We will evaluate the prototype's effectiveness at the end of the experiment, i.e., whether interactive storytelling was beneficial compared to non-interactive storytelling. Since we are comparing two robots, we follow a summative evaluation. Using a questionnaire, we will assess the usefulness and effectiveness of the robot. Because of the limited duration of the course, this will be the last evaluation. Without time constraints, however, we would also need a formative evaluation to gather feedback for the next versions of the robot.
Questionnaire
We used a modified version of the Godspeed questionnaire for our evaluation [1]. It measures the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of the robot. It uses a Likert scale on which the user rates each question with a number between 1 and 5, where 1 and 5 lie at opposite poles. To measure whether patients with dementia completed the activity they were meant to do, and to evaluate whether storytelling made a difference to their meal, we added the following questions:
1. Please rate the question according to the following attributes. - Mood of the patient after the activity. (Scale of 1 to 5)
2. Please rate the question according to the following attributes. - Patient's feedback about the story experience (Scale of 1 to 5)
3. Please rate the question according to the following attributes. - Patient's enjoyment (Scale of 1 to 5)
4. Did the patient complete the activity? (Yes/No)
5. How many minutes did the patient take to complete the activity? (<10 minutes, 10-25 minutes, 25-40 minutes, >40 minutes)
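As an illustration of how the 1-to-5 ratings could be summarised, the sketch below averages the item ratings within each Godspeed dimension for one participant. The function name `subscale_means`, the dictionary layout, and the sample ratings are hypothetical and not taken from the study. Averaging the items of a dimension is assumed here as the scoring rule.

```python
from statistics import mean


def subscale_means(ratings, low=1, high=5):
    """Average the item ratings within each questionnaire dimension.

    `ratings` maps a dimension name to the list of ratings one participant
    gave for that dimension's items (hypothetical layout, for illustration).
    """
    result = {}
    for dimension, items in ratings.items():
        if not items:
            raise ValueError(f"no ratings given for {dimension!r}")
        for rating in items:
            # Reject anything outside the 1-to-5 scale used in the questionnaire.
            if not low <= rating <= high:
                raise ValueError(f"rating {rating} is outside {low}-{high}")
        result[dimension] = mean(items)
    return result


# Made-up ratings for one participant, purely for demonstration.
example = {"likeability": [4, 5, 3], "animacy": [2, 3, 3, 4]}
scores = subscale_means(example)
```

Computing one score per dimension and per condition gives each participant a pair of values, which is the form needed for a paired comparison of the two conditions.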
Prototype
We present a low-fidelity prototype of the robot, that is, a simple demonstration of the robot's initial stages, meant for formative feedback. We use a Wizard-of-Oz approach and, for the purposes of the experiment, present only one story (in interactive and non-interactive modes). The final robot is expected to offer various story templates.
For prototyping, we will use incremental prototyping, which means adding features one by one and testing each of them. We start with the most basic feature, complete a cycle of testing, and then add new features to create new versions of the prototype. For the robot, we will first build the non-interactive storytelling robot, then add music to it, and then add gestures. At each stage, we test that the new feature works, and if it works as expected, we move on to adding the next feature.
Evaluation of Results
We decided to use the paired-samples t-test because the experiment followed a within-subject design. We used the one-tailed version because we want to find out whether one condition is better than the other. Although the one-tailed t-test is more powerful, it is debatable whether it is preferable to the two-tailed t-test in this scenario, since the one-tailed test presupposes the direction of the effect, namely that the experimental scenario will perform better than the control scenario.
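A minimal sketch of the paired-samples t statistic is given below, assuming one score per participant and per condition. The function name `paired_t_statistic` and the sample scores are hypothetical and are not the study's data. The statistic is the mean of the per-participant differences divided by its standard error, with n - 1 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(experimental, control):
    """Return (t, degrees of freedom) for a paired-samples t-test.

    Both lists must hold one score per participant, in the same order.
    """
    if len(experimental) != len(control):
        raise ValueError("both conditions need the same number of participants")
    # Per-participant difference: experimental minus control.
    diffs = [e - c for e, c in zip(experimental, control)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two participants are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Made-up scores for five participants, purely for demonstration.
interactive = [4, 5, 3, 4, 5]
non_interactive = [3, 4, 3, 2, 4]
t_value, dof = paired_t_statistic(interactive, non_interactive)
```

For the one-tailed test, the p-value is the upper-tail probability of `t_value` under a t distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` with `alternative="greater"` should give the statistic and this p-value directly.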
[1] C. Bartneck, D. Kulić, E. Croft, and S. Zoghbi, “Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots,” International Journal of Social Robotics, vol. 1, no. 1, pp. 71–81, 2008.