b. Test


1. Introduction

The Prototype section presented four different prototypes (P11, P12, P21, P22) of the robotic dog companion for the use case UC02.1. Each prototype represents a unique combination of movement and sound designed to communicate the robot's intention to seek companionship with the Person with Dementia (PwD). All prototypes were created as animations in Blender using robot model meshes sourced from the official MiRo-E Simulation Codebase, with videos recorded from a third-person perspective showing the robot approaching and requesting interaction.

In this test, the videos will be assessed by participants in an online evaluation to test claim C02: whether the robot's method of requesting companionship is perceived as non-intrusive and appropriate by PwDs. The hypotheses are that participants will correctly interpret the robot's intention, that certain combinations of movement and sound will be perceived as more appropriate than others, and that the most effective prototypes will communicate clearly without being perceived as intrusive or inappropriate. All measures are collected with an online questionnaire administered immediately after each participant views one of the four videos.

The participants will be students from other groups taking the course, as well as friends, family, and acquaintances of the team. The data will be anonymized. The study is a between-subjects design in which each participant views only one of the four prototype videos, with random assignment to conditions.

2. Method

The prototypes were evaluated through an online experiment with multiple participants.

2.1 Participants

Participants were students from other project groups within the same course, together with friends, family, and acquaintances of the team. The target sample size was 32 participants, distributed approximately evenly across the four prototype conditions (8 participants per prototype).

2.2 Experimental design

For the experiment, we used a between-subjects design. Each participant was randomly assigned to view only one of the four prototype videos (P11, P12, P21, or P22). This design was chosen to avoid learning effects and to capture their natural, unbiased interpretation of each prototype's conveyed information. Random assignment ensures that any pre-existing differences between participants are distributed evenly across conditions.

2.3 Tasks

Participants were presented with a contextual scenario describing a situation in which they are sitting on the couch having a lazy day when a robotic dog, which they recognize as such but have never seen before, enters the room. They then watched a single video showing one of the four prototypes. After viewing, participants completed a short questionnaire about their interpretation of the robot's behavior.

2.4 Measures

We measured how effectively each prototype communicated its companionship-seeking intention. Our primary quantitative measure was the percentage of participants who correctly identified "Wants companionship/to keep you company" from a multiple-choice list of possible robot intentions. Participants were instructed to select all interpretations they felt confident about, allowing us to also measure clarity through the number of options selected (fewer selections indicating a clearer message). Secondary measures included clarity ratings (1-5 scale), appropriateness ratings (1-5 scale), and patterns of incorrect interpretations to understand sources of confusion.
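
As an illustration, the following sketch (in Python, using hypothetical example responses rather than our actual data) shows how the primary accuracy measure and the selection-count clarity proxy are computed:

    # Hypothetical raw data: each participant's list of selected options.
    TARGET = "Wants companionship/to keep you company"
    responses = [
        ["Is greeting you", TARGET],
        ["Just wandering around"],
        [TARGET],
    ]

    # Primary measure: share of participants whose selections include the target.
    accuracy = sum(TARGET in r for r in responses) / len(responses)

    # Clarity proxy: mean number of options selected (fewer = clearer message).
    mean_selected = sum(len(r) for r in responses) / len(responses)

    print(f"accuracy = {accuracy:.0%}, mean selections = {mean_selected:.2f}")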

2.5 Procedure

The procedure was conducted as follows:

  1. Participants access the online form via a distributed link.
  2. Brief introduction explaining they will watch a short video and answer questions about it.
  3. Random assignment to one of four prototype videos (P11, P12, P21, or P22).
  4. Presentation of contextual scenario text.
  5. Video viewing (single prototype only).
  6. Complete questionnaire with four questions:
    • Q1: Multiple-choice interpretation (select all confident answers)
    • Q2: Clarity rating (1-5 scale)
    • Q3: Appropriateness rating (1-5 scale)
    • Q4: Attention check
  7. Automatic submission and data collection.

The entire process took approximately 2-3 minutes per participant. No technical issues were anticipated or encountered, as the evaluation relied on simple video playback and form submission through standard online survey platforms.

2.6 Material

2.6.1 Online survey form

To collect participant responses efficiently and anonymously, we used Microsoft Forms. The form included the contextual scenario text, embedded video player, and questionnaire items. Data was automatically collected and anonymized.

2.6.2 Prototype Videos

Four videos (P11, P12, P21, P22) were created in Blender, each showing the robotic dog performing a unique combination of movement and sound to initiate companionship. Videos were rendered at 1920 × 1080 with a duration of 10 seconds (see 2.6.3) and hosted on [platform] for embedding in the survey form.

2.6.3 Stimulus Control

All videos matched in duration (10 seconds), camera angle, and background. Audio cues were normalized to equal loudness across prototypes. Lighting, playback frame rate (30 fps), and resolution (1920 × 1080) were also identical across videos. In addition, participants were asked to watch in a quiet environment or use headphones to reduce perceptual differences.

2.7 Claim Mapping

Measure | Operational Definition | Linked Claim(s) | Expected Outcome
Accuracy | Percentage of participants who correctly selected “Wants companionship / to keep you company” from the list. | C03 | High accuracy indicates that the robot’s sound + movement are understood as companionship seeking.
Clarity | Participants rate how easy the robot’s intention was to interpret. | C02 | Higher clarity would suggest that guidance cues are easily recognisable.
Appropriateness | Participants rate how socially acceptable and non-intrusive the robot’s behaviour felt. | C02 | A high appropriateness score would mean that the robot's companionship request is polite and comfortable to observe (not intrusive/annoying).

3. Results

3.1 Preliminary Results

Our n = 36 participants were distributed approximately uniformly across the four videos by random assignment. A chi-square goodness-of-fit test against a uniform distribution gave a test statistic of 0.66 with p = 0.88 (p > 0.05), consistent with uniform assignment.
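
A minimal sketch of this check, assuming scipy and hypothetical per-video counts (the real counts may differ):

    from scipy.stats import chisquare

    # Hypothetical assignment counts for videos A-D (n = 36 total).
    counts = [10, 9, 8, 9]

    # Goodness-of-fit against a uniform distribution (expected 9 per video).
    stat, p = chisquare(counts)
    print(f"chi2 = {stat:.2f}, p = {p:.2f}")  # p > 0.05: assignment looks uniform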

[Figure: distribution of participants across the four videos (q0.png)]

3.2 Primary Measure: Multiple-choice interpretation

Our key measure was whether participants correctly selected the robot's intention. In addition, when computing the total score per video, we deducted points for negative interpretations. Each possible response and its assigned weight/score is listed below; a scoring sketch follows the lists:

Positive:

    'Wants to keep you company': 3
    'Is inviting you to engage with it': 2
    'Is greeting you': 2
    'Is asking permission for something': 1

Neutral:

    'Just wandering around': 0
    'Wants to play with you': 0
    'Wants to go outside for a walk': 0
    'Has completed a task and is reporting back': 0
    'Is feeling lonely and seeking comfort': 0
    'Wants you to follow it somewhere': 0

Negative:

    'Is malfunctioning or acting strangely': -3
    'Is confused or doesn't know what to do': -3
    'Is warning you about something': -3
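
The sketch below illustrates this scoring scheme, assuming responses are stored per participant as lists of selected options; the weights come from the lists above, while the example data is hypothetical:

    # Weights from the three-tier scheme; neutral options contribute 0 and are
    # handled by the .get default.
    WEIGHTS = {
        'Wants to keep you company': 3,
        'Is inviting you to engage with it': 2,
        'Is greeting you': 2,
        'Is asking permission for something': 1,
        'Is malfunctioning or acting strangely': -3,
        "Is confused or doesn't know what to do": -3,
        'Is warning you about something': -3,
    }

    def participant_score(selections):
        """Sum the weights of every option a participant selected."""
        return sum(WEIGHTS.get(s, 0) for s in selections)

    # Hypothetical responses: video -> list of each participant's selections.
    responses = {
        'A': [['Is greeting you'], ['Is malfunctioning or acting strangely']],
        'D': [['Wants to keep you company', 'Is inviting you to engage with it']],
    }
    totals = {video: sum(participant_score(p) for p in people)
              for video, people in responses.items()}
    print(totals)  # {'A': -1, 'D': 5}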

[Figure: interpretation score results per video (q1.png)]

3.3 Clarity Rating

Our secondary measure was how clear participants found the message of the video they watched, rated on a 5-point scale. We performed a Kruskal-Wallis H-test for clarity across videos. The H-statistic was 2.68 with p = 0.44 (p > 0.05), so there were no statistically significant differences in clarity across the four videos. Here are the detailed results:
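
A minimal sketch of this test, assuming scipy and hypothetical 1-5 ratings grouped per video (the same call applies to the appropriateness ratings in 3.4):

    from scipy.stats import kruskal

    # Hypothetical clarity ratings per video; our real data differs.
    clarity = {
        'A': [3, 4, 2, 3, 4, 3, 2, 4, 3],
        'B': [3, 3, 4, 4, 2, 3, 3, 4, 3],
        'C': [2, 3, 3, 4, 3, 2, 4, 3, 3],
        'D': [4, 4, 3, 5, 4, 3, 4, 4, 3],
    }
    h, p = kruskal(*clarity.values())
    print(f"H = {h:.2f}, p = {p:.2f}")  # p > 0.05: no significant difference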

[Figure: clarity rating results per video (q2.png)]

3.4 Appropriateness rating

Our last investigation concerned how appropriate participants found the behavior displayed by the dog, also rated on a 5-point scale. We performed another Kruskal-Wallis H-test for appropriateness across videos. The H-statistic was 4.56 with p = 0.20 (p > 0.05); thus, no significant differences in ratings were observed across the four videos. Here are the detailed results:

[Figure: appropriateness rating results per video (q3.png)]

3.5 Composite Scoring

To evaluate overall video performance, we computed a composite score that integrates the three key metrics. First, each metric was normalized to a 0-1 scale to enable fair comparison. The interpretation score (which ranged from negative to positive values based on our three-tier scoring system) was normalized by dividing by the maximum interpretation score observed across all videos. Clarity and appropriateness ratings, originally measured on 5-point Likert scales, were normalized using the formula (score - 1) / 4. These normalized metrics were then combined using a weighted average that reflects their relative importance: interpretation was weighted at 50% (as the primary measure of communicative success), with clarity and appropriateness each contributing 25%. The final composite score represents a balanced evaluation of each video's ability to convey the desired message while being clear and appropriate.
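
A sketch of this composite under the stated normalizations and weights (the function and argument names are ours, for illustration only):

    def composite_score(interp, clarity, appro, interp_max):
        """Weighted composite: 50% interpretation, 25% clarity, 25% appropriateness."""
        interp_norm = interp / interp_max  # divide by max observed interpretation score
        clarity_norm = (clarity - 1) / 4   # map a 1-5 Likert mean onto [0, 1]
        appro_norm = (appro - 1) / 4
        return 0.5 * interp_norm + 0.25 * clarity_norm + 0.25 * appro_norm

    # Hypothetical per-video means, with interp_max the best total across videos.
    print(composite_score(interp=5, clarity=3.4, appro=3.9, interp_max=5))  # -> 0.83125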

[Figure: composite scores per video (composite.png)]

3.6 Error Patterns and Alternative Interpretations

To understand how participants misinterpreted the robot's behavior, we analyzed the distribution of selected responses across videos.

Video D showed the highest proportion of neutral interpretations (61%), with the lowest negative responses (6%). This suggests that Video D was perceived as appropriate and non-threatening, but its message may have been somewhat ambiguous.

Videos A and C exhibited similar error patterns, with relatively high negative interpretation rates (19% and 21%, respectively). The primary negative interpretations were "Is malfunctioning or acting strangely" and "Is confused or doesn't know what to do," suggesting that their movement-sound combinations were perceived as dysfunctional. These results are particularly concerning, as such negative impressions undermine our goal of non-intrusiveness.

Video B had a balanced profile, with 47% neutral interpretations and the lowest negative rate among A, B, and C (14%). The most frequent alternative interpretations were "Just wandering around" and "Wants to play with you"; both are benign readings closely aligned with our target behavior.

Lastly, the results reveal that videos with higher positive interpretation scores (A, C) also attracted more negative interpretations, while Video D achieved high appropriateness and clarity at the cost of reduced specificity. This suggests that more expressive behaviors carry more communicative potential but also a higher risk of misinterpretation.

4. Discussion

These findings connect back to Human Factors: subtle, familiar cues support Emotional Regulation and Social Connectedness, which are key needs for people with dementia. The prototypes that used calmer, gentler cues were perceived as slightly more comfortable, reflecting the importance of designing emotionally reassuring and intuitive interactions.

Our results provide mixed support for claim C02. The statistical tests revealed no significant differences in appropriateness ratings across videos (H = 4.56, p = 0.20), suggesting that all four prototypes achieved similar levels of appropriateness from the observers' perspective. Likewise, clarity ratings showed no significant variation (H = 2.68, p = 0.44), indicating that participants found all videos reasonably interpretable.

However, the interpretation accuracy results show substantial variation in how effectively the videos communicated the dog's intent. Video D's performance shows that a prototype can successfully convey companionship-seeking intention while being perceived as clear and appropriate. At the same time, Videos A, B, and C (despite being perceived as appropriate) struggled to communicate the specific intended message with the same effectiveness and attracted considerably higher levels of negative interpretations.

The high proportion of neutral interpretations across all videos (36-61%) suggests that participants consistently recognized the robot's social intent but did not always correctly identify its specific goal. We consider this a shortcoming of our decision to rely entirely on non-verbal communication through realistic animal sounds and body language.

5. Limitations

First, our participants were primarily young adults without cognitive impairment, whereas the target population is elderly individuals with dementia. PwDs may interpret robot behaviors differently due to cognitive differences and expectations about social interaction.

Second, the online video-based evaluation lacks the fidelity of in-person robot interaction, which can substantially affect how behaviors are interpreted.

Third, our sample size of 8-10 participants per video was too small to detect statistically significant differences between conditions.

6. Conclusions

This evaluation tested four prototypes of a robotic dog companion initiating companionship. We assessed whether the robot's communication method would be perceived as non-intrusive and appropriate. Our findings indicate that prototype design affects participants' perception, with Video D showing the best performance across all metrics. All prototypes were considered acceptably appropriate by participants. However, the large variation in interpretation accuracy highlights the need for further research into specific movement-sound combinations.


6.1 Next-iteration plan

Based on our findings, Prototype D (Movement 2 + Sound 1) will likely serve as the baseline for our next physical MiRo-E prototype. We will integrate verbal and musical cues to strengthen activity recognition (testing C03 further). A future study with caregivers and PwDs will test these new cues in real-world contexts and yield updated claims and requirements.