|

UI Design Update Newsletter – November, 2004
Insights from Human Factors International

|
Observer Effects in Usability Testing... or, how to collect data without
messing it up
|
|
John Sorflaten,
Ph.D., CUA, CPE, Project Director at HFI, discusses the effect of
the observer on usability testing and the differing results between
laboratory and unmoderated remote testing. |
|
 |
| The big picture |
|
A hallmark of
modern particle physics is the "uncertainty principle".
Namely, you can know precisely either the position of a so-called
wave-particle or its momentum, but not both at once. The popular
explanation for this conundrum is that the very act of observation
changes "reality" from probability (the wave-particle might be here)
to actuality (hah, gotcha pinned down).
Could the same
hold true for usability testing?
|
 |
| Enter
reality |
|
As Bob Dylan
once asked, "What is real?"
Have you
wondered about the effect of "thinking out loud" and whether it
causes your test subject to be more attentive than they otherwise
might be? Or perhaps the think-aloud effort competes for attention
and distracts the subject from engaging fully in their
task?
As a clue to
how this works, see if you can remember any moments when your test
subject started to fade out... they squished up their brows, hunched
over the keyboard, stared intensely at the monitor, and simply faded
out... no talk, no anything.
Of course, your
standard procedure is to say "tell me what you're thinking," or
"tell me what you're looking at." (Your measure of professional
attainment is the variety of ways in which you can ask these
questions without sounding like a parrot-savant.)
But are you
actually derailing an intense thought process with these innocent
reminders?
Have you
occasionally just let your subject ruminate, get lost in airy spaces
of problem solving, hoping they would soon come back to terra firma
and report their findings to you?
But then, can
they remember the exquisite and lengthy chain of logic that led
them to the "wrong choice"? |
 |
| Giving up to win |
|
As in Aikido,
Kung Fu, and maybe even kickboxing, the art of winning often comes
down to the art of leverage... using your opponent's momentum and
making it go where you want it to go rather than letting it land on
your soft body. Definitely, this is the art of altering wave-particle
probabilities in your favor.
In usability
testing we have some choices, too. How can we reduce the
probabilities of getting hit with bad data? Can we shape the
trajectory of the testing event to draw out the subject insights we
know can improve our design?
Van Den Haak
and colleagues (2003) investigated the effect of thinking out loud
and compared it to an alternative approach called "retrospective"
reporting. They compared two approaches to the same usability
testing scenarios – university students trying to use an
online library catalog.
Subjects were
2nd and 3rd year Dutch students majoring in the same subject and
with some knowledge of online catalogs.
Twenty of these
students completed 7 tasks using the normal "concurrent" think aloud
method we all use, and twenty others did the same tasks with no
verbal comments during the task. Instead, the latter group
commented after the test while reviewing a video of their
performance. |
 |
| Test parity achieved |
|
The outcomes
fell into two categories: problems that the facilitator
observed and problems that the subject
verbalized.
In the
concurrent, think-aloud method, the facilitator observed
more problems happening than during the "retrospective" test where
the subject did not talk during the task.
So, why were
there more problems for the concurrent, think-aloud method? Did the
test subject pay more attention to their task and thus recognize
more problems? Or did the extra effort of talking during the test
"cause" more problems? More on this later.
When
considering verbalization of problems, the advantage
switched to the retrospective method. While watching their videos,
the retrospective subjects offered many more comments than the
concurrent subjects.
This implies
that the concurrent subjects may not have said what they really
felt – they failed to report all their observations. Or, as
suggested by the researchers, the retrospective subjects were able
to report additional problems that were not related to the observed
problems.
In both cases,
the researchers indicate that the combined observed and
verbalized problems came out the same. Thus, they concluded
the two methods of testing were more or less equally sensitive to
problems. Consequently, they suggest, you could use either test
method, depending on how easily subjects can verbalize during the
test.
If subjects
have difficulty speaking during the task (due to heavy mental
work-out), then the retrospective method is fine. Otherwise, use the
concurrent method because it takes only half the time (you don't
need to review the video with the subject).
|
 |
| Probability again |
|
However, the
researchers asked if there was still some evidence that the extra
cognitive work of concurrently thinking out loud during the test
influenced the overall outcome. Indeed, they calculated that the
concurrent subjects completed an average of only 2.6 of the 7 tasks
successfully, whereas the retrospective subjects averaged 3.3.
Although this
difference was not statistically significant, it was close enough to
suggest that the extra workload of thinking out loud could influence
the results.
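To see how a gap like 2.6 versus 3.3 can fall short of significance with 20 subjects per group, here is a minimal sketch of a permutation test in Python. The per-subject scores below are hypothetical, invented only so that the group means match the reported 2.6 and 3.3. The study's raw data and the statistical test the researchers actually used are not given here, so the p-value this prints says nothing about the real study.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # randomly reassign subjects to the two groups
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# HYPOTHETICAL tasks completed per subject (out of 7), NOT the study's data.
concurrent = [1, 2, 2, 3, 3, 4, 2, 3, 1, 4, 3, 2, 5, 2, 3, 1, 4, 2, 3, 2]
retrospective = [3, 4, 2, 5, 3, 4, 3, 2, 4, 5, 3, 3, 2, 4, 3, 4, 2, 5, 3, 2]

p_value = permutation_p_value(concurrent, retrospective)
print(f"means: {mean(concurrent):.1f} vs {mean(retrospective):.1f}")
print(f"estimated p-value: {p_value:.3f}")
```

Whether the p-value lands above or below a chosen threshold depends on how spread out the individual scores are, which is exactly the information a pair of averages cannot give you.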
Again, we are
left with wave-particle duality and a probabilistic interpretation
of the results. Well, if you must come to a concrete
solution, the statistics say there was no real difference.
Just a probability of a difference. You get to
decide...
|
 |
| Now, a real difference in results |
|
A different
research group investigated the influence of an unmoderated "remote
testing environment" compared to the typical "laboratory"
environment.
Schulte-Mecklenbeck and Huber (2003) asked 40 German
university students to complete certain Web-based tasks in a typical
lab setting. Meanwhile, they had 32 similar students do the same at
their homes, using an automated, unmoderated, remote testing
paradigm.
Interestingly,
the students in the lab completed the tasks using about twice as
many clicks and twice as much time compared to the students at home.
Whew!
Was the actual
task twice as hard in the lab? Or was the perceived pressure twice
as much in the lab?
|
 |
| Observer affects the observed |
|
Since the task
was identical in both cases, the authors conclude that subjects in
the lab felt more pressure to perform well, and thus made more
effort to do well. The authors suggest that these subjects perceived
the facilitator as an "authority figure". (How much authority do you
command in your lab? Any at all?)
Also, subjects
may have perceived the lab setting as "more important" than using
the Web at home. (Well, home is for relaxing…but wait, maybe that's
where your Web site is most often used?)
Either way,
we see that the observer has influenced the results. Here
the automated test, with subjects out of view of an overseer,
appears to have produced more honest results, at least if home is
the environment used in the real-life version of the tasks.
These results
differ from prior research comparing Web vs. lab. Prior research
suggests that Web and laboratory behaviors are about the same. For
example, results from a psychological test taken on the Web were
comparable to test norms obtained through pencil and paper. In other
cases, subjects responded to line drawings and photographs presented
on the Web with results similar to face-to-face responses. Even
risk-seeking in a lottery setting was similar between Web and
laboratory settings.
What was
different about this particular test? Why should this laboratory
experience generate more diligent responses?
The authors
speculate that the nature of the task lends itself to a greater
range of responsiveness among subjects. The tasks in this study were
open-ended, information-seeking tasks. That is, participants were
not seeking a "correct answer." Rather, they sought to find enough
information to make a decision.
Thus,
participants decided for themselves when to stop. They decided when
they had found enough information to make a decision.
Does this sound
like the tasks your Web site supports? Probably so, if you work on
one of the many large-scale information sites on the Internet, or
even on an intranet.
In any event,
we just saw an example where the act of observation definitely
influenced the outcome. The uncertainty principle operates in
usability testing, just as in testing the behavior of sub-atomic
particles.
|
 |
| Back to the surface again |
|
Just to give us
a sense of the normal reality found in classical physics, let's take
a look at another comparison study.
Thomas Tullis,
a well-known usability author and researcher, worked with his
colleagues (2002) to check out whether automated, unmoderated,
remote usability testing gives results similar to those of laboratory
testing. Subjects were employees of a US corporation.
In contrast to
the study reported above, Tullis used "closed-ended" tasks. That is,
the results were either right or wrong. Does that sound like some of
your testing outcomes?
If so, you can
have some assurance that the laboratory setting doesn't upset your
results. Tullis found that for the 13 tasks they used, subjects in
the lab gave results similar to those of subjects in the unmoderated,
remote testing environments. His team found similar task success
rates and similar task times. No authority effect here.
Actually,
Tullis and team were more interested in whether the unmoderated
remote testing was as effective for finding problems as the lab
environment. They found that remote testing worked well and had
benefits that complemented the lab environment. (Do you recall the
benefits of testing more subjects? See our May, 2004
newsletter.)
Aha! More
subjects can be better, they found. Whereas the lab setting found 9
issues, the remote test found 17 issues. Well, what do we expect if
the lab only has 8 subjects and the remote test has 88 subjects?
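One way to reason about this is the widely used problem-discovery model, in which n subjects are expected to uncover a share 1 - (1 - p)^n of the problems when each subject independently runs into a given problem with probability p. The sketch below is not taken from the Tullis study. The subject counts (8 and 88) come from the text above, but the values of p are assumptions chosen purely for illustration.

```python
def proportion_found(p, n):
    """Expected share of problems seen at least once by n subjects.

    Assumes each subject independently encounters a given problem
    with probability p (a simplifying assumption).
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    return 1 - (1 - p) ** n


# Subject counts are from the study; detection probabilities are ASSUMED.
for p in (0.30, 0.05):
    for n in (8, 88):
        share = proportion_found(p, n)
        print(f"p={p:.2f}, n={n:2d}: {share:.0%} of such problems expected")
```

Under these assumptions, 8 subjects already catch most of the common problems, while the rarer ones mostly surface only in the larger sample.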
|
 |
| Small is good, too |
|
Interestingly,
the law of diminishing returns did not penalize the lab environment
unduly. Tullis and crew felt that both the lab and remote
environments discovered the three major problems (overloaded home
page, general terminology problems, and unclear navigation wording).
Plus, seven out
of the nine problems found in the lab were also found in the remote
test.
However, more
subjects can be better, as we said. And that was the benefit of the
remote test. After all, we test to find problems, don't
we?
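The overlap arithmetic is worth spelling out. Assuming the "issues" and "problems" above refer to the same counts, and that the seven shared problems are included in the remote test's 17, the totals work out as in this small sketch (the function name is made up for illustration):

```python
def overlap_summary(lab_found, remote_found, shared):
    """Split two problem counts into lab-only, remote-only, and total unique."""
    if shared > min(lab_found, remote_found):
        raise ValueError("shared cannot exceed either count")
    return {
        "lab_only": lab_found - shared,
        "remote_only": remote_found - shared,
        "total_unique": lab_found + remote_found - shared,
    }


summary = overlap_summary(lab_found=9, remote_found=17, shared=7)
print(summary)
# {'lab_only': 2, 'remote_only': 10, 'total_unique': 19}
```

On these assumptions, the remote test surfaced 10 problems the lab missed, while the lab still caught 2 that the remote test did not.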
|
 |
| Benefits of both unobserved and observed testing
|
|
What other
influences of the remote testing environment appear valuable, aside
from the absence of the observer?
Tullis and
group found they got greater diversity of user types in task
experiences, computer experiences, and individual characteristics.
They also got more hardware variety, such as screen resolution. And
they were pleasantly surprised by the completeness and insights of
the typed responses.
But the lab
offered value, too. For example, the remote test revealed usage of
1024 screen resolution among nearly all subjects and revealed a
problem with small fonts. The lab setting forced usage of 800
resolution and resulted in detection of excessive scroll
requirements. The lab revealed that certain navigation options were
overlooked, although the remote results showed most subjects found
the options anyway over time.
Tullis and
group recommend a combination of both remote and lab testing to
cover the range of issues.
|
 |
| Certain conclusions –
probably |
|
So, now we know
the observer influences the observation, just like real-life
physics.
1) We saw that
possibility first in the case of concurrent "think-aloud" usability
testing. Fewer tasks were successfully completed when subjects
thought aloud during the task, but the authors said this difference
was not statistically significant. It could be a fluke of the
experiment.
But certainly,
the concurrent, think-aloud method gave more observed problems than
the retrospective reporting. And it gave fewer verbalized problems.
However, the combined number of observed and verbalized problems was
equivalent between the two methods. Thus, if we have complex
tasks that make it hard for subjects to talk aloud when doing tasks,
then feel comfortable showing them their video. They can talk plenty
during the video.
2) We saw the
influence of the observer increase dramatically in the case of
"active information search." Subjects in the lab spent twice as much
effort on their tasks compared to remote subjects at home. The lab
subjects probably felt obliged to please an "authority figure"
appearing in the form of the facilitator. But remember, they worked
on an open-ended task, unlike the closed-ended tasks found in many
transaction-oriented applications. Still, open-ended tasks are
common, too, like those found on your information sites.
3) Meanwhile,
the observer does not always get in the way. In our last study,
lab-based observation allowed sight of body language, strained
perception, and vocal hesitancy that would be missed by unmoderated
remote testing. Here, the observer added value. But the larger
number of subjects possible in the remote testing also added a lot
of new issues, otherwise missed in the lab. So both methods
complement each other. Do both.
Amidst these
suggestions, we do find some guidelines – albeit, the
findings may be qualified by the nationality of the subjects, or
their student status, or any other of many differences compared with
your target population.
But that's
life.
All we can say,
like physicists do, is take a chance. And make it work.
|

|
|