

UI Design Update Newsletter – November, 2004

Insights from Human Factors International

Observer Effects in Usability Testing... or, how to collect data without messing it up

 

John Sorflaten, Ph.D., CUA, CPE, Project Director at HFI, discusses the effect of the observer on usability testing and the differing results between laboratory and unmoderated remote testing.

[Image: person in lab and person at home doing usability test]

     

The big picture

 

The hallmark of modern particle physics centers on the "uncertainty principle". Namely, you can know either the position of a so-called wave-particle or its momentum, but not both. The reason for this conundrum is that the very act of observation changes "reality" from probability (the wave-particle might be here) to actuality (hah, gotcha pinned down).

Could the same hold true for usability testing?

Enter reality  

As Bob Dylan once asked, "what is real"?

Have you wondered about the effect of "thinking out loud" and whether it causes your test subject to be more attentive than they otherwise might be? Or perhaps the think-aloud effort competes for attention and distracts the subject from engaging fully in their task?

As a clue to how this works, see if you can remember any moments when your test subject started to fade out... they squished up their brows, hunched over the keyboard, stared intensely at the monitor, and simply faded out... no talk, no anything.

Of course, your standard procedure is to say "tell me what you're thinking," or "tell me what you're looking at." (Your measure of professional attainment is the variety of ways in which you can ask these questions without sounding like a parrot-savant.)

But are you actually derailing an intense thought process with these innocent reminders?

Have you occasionally just let your subject ruminate, get lost in airy spaces of problem solving, hoping they would soon come back to terra firma and report their findings to you?

But then, can they remember the exquisite and lengthy chain of logic that led them to the "wrong choice"?

Giving up to win  

As in Aikido, Kung Fu, maybe even kick boxing, the art of winning often comes down to the art of leverage... using your opponent's momentum and making it go where you want it to go rather than landing on your soft body. Definitely, this is the art of altering wave-particle probabilities in your favor.

In usability testing we have some choices, too. How can we reduce the probabilities of getting hit with bad data? Can we shape the trajectory of the testing event to allow subject insights we know can improve our design?

Van Den Haak and colleagues (2003) investigated the effect of thinking out loud and compared it to an alternative approach called "retrospective" reporting. They applied both approaches to the same usability testing scenario: university students trying to use an online library catalog.

Subjects were 2nd and 3rd year Dutch students majoring in the same subject and with some knowledge of online catalogs.

Twenty of these students completed 7 tasks using the normal "concurrent" think-aloud method we all use, and twenty others did the same tasks with no verbal comments during the task. However, the latter group gave post-test commentaries while reviewing video of their own performance.

Test parity achieved  

The outcomes fell into two categories: problems that the facilitator observed and problems that the subject verbalized.

In the concurrent, think-aloud method, the facilitator observed more problems happening than during the "retrospective" test where the subject did not talk during the task.

So, why were there more problems for the concurrent, think-aloud method? Did the test subject pay more attention to their task and thus recognize more problems? Or did the extra effort of talking during the test "cause" more problems? More on this later.

When considering verbalization of problems, the advantage switched to the retrospective method. While watching their videos, the retrospective subjects offered many more comments than the concurrent subjects.

This implies that the concurrent subjects may not have said what they really felt; they failed to report all their observations. Or, as suggested by the researchers, the retrospective subjects were able to report additional problems that were not related to the observed problems.

In both cases, the researchers indicate that the combined observed and verbalized problems came out the same. Thus, they concluded the two methods of testing were more or less equally sensitive to problems. Consequently, they suggest, you could use either test method, depending on how easily subjects can verbalize during the test.

If subjects have difficulty speaking during the task (due to heavy mental work-out), then the retrospective method is fine. Otherwise, use the concurrent method because it takes only half the time (you don't need to review the video with the subject).

Probability again  

However, the researchers asked if there was still some evidence that the extra cognitive work of concurrently thinking out loud during the test influenced the overall outcome. Indeed, they calculated that the concurrent subjects completed an average of only 2.6 of the 7 tasks successfully, whereas the retrospective subjects completed an average of 3.3.

Although this difference was not statistically significant, it was close enough to suggest that the extra workload of thinking out loud could influence the results.
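
For readers who want to see what "not statistically significant" means in practice, here is a minimal sketch of one way to check a difference in average task completions: a permutation test. This is not the analysis the researchers ran (the article does not say which test they used). The function name and the per-subject scores are hypothetical. The scores are invented, ten per group, and chosen only so that the group means match the 2.6 and 3.3 reported above.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Repeatedly shuffles the pooled scores into two groups of the original
    sizes and counts how often the shuffled difference in means is at
    least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_resamples + 1)


# HYPOTHETICAL tasks completed per subject (out of 7).
# Invented for illustration; NOT the data from the study.
concurrent = [2, 3, 1, 4, 2, 3, 2, 5, 1, 3]      # mean 2.6
retrospective = [3, 4, 2, 5, 3, 4, 2, 4, 3, 3]   # mean 3.3

p = permutation_p_value(concurrent, retrospective)
print(f"difference in means: {mean(retrospective) - mean(concurrent):.1f}")
print(f"estimated p-value: {p:.3f}")
```

A p-value above the conventional 0.05 cutoff would be read as "no reliable difference," which is the situation the researchers describe. With real data you would use each subject's actual score, and the answer depends heavily on how much the scores vary from person to person.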

Again, we are left with wave-particle duality and a probabilistic interpretation of the results. Well, if you must come to a concrete solution, the statistics say there was no real difference. Just a probability of a difference. You get to decide...

Now, a real difference in results  

A different research group investigated the influence of an unmoderated "remote testing environment" compared to the typical "laboratory" environment.

Schulte-Mecklenbeck and Huber (2003) asked 40 German university students to complete certain Web-based tasks in a typical lab setting. Meanwhile, they had 32 similar students do the same at their homes, using an automated, unmoderated, remote testing paradigm.

Interestingly, the students in the lab completed the tasks using about twice as many clicks and twice as much time compared to the students at home. Whew!

Was the actual task twice as hard in the lab? Or was the perceived pressure twice as much in the lab?

Observer affects the observed  

Since the task was identical in both cases, the authors conclude that subjects in the lab felt more pressure to perform well, and thus made more effort to do well. The authors suggest that they perceived the facilitator as an "authority figure". (How much authority do you command in your lab? Any at all?)

Also, subjects may have perceived the lab setting as "more important" than using the Web at home. (Well, home is for relaxing…but wait, maybe that's where your Web site is most often used?)

In both cases, we see that the observer has influenced the results. In this case the automated test, with subjects out of view of an overseer, appears to have produced more honest results if that is the environment used in the real-life version of the tasks.

These results differ from prior research comparing Web vs. lab. Prior research suggests that Web and laboratory behaviors are about the same. For example, results from a psychological test taken on the Web were comparable to test norms obtained through pencil and paper. In other cases, subjects responded to line drawings and photographs presented on the Web with results similar to face-to-face responses. Even risk-seeking in a lottery setting was similar between Web and laboratory settings.

What was different about this particular test? Why should this laboratory experience generate more diligent responses?

The authors speculate that the nature of the task lends itself to a greater range of responsiveness among subjects. The tasks in this study were open-ended, information-seeking tasks. That is, participants were not seeking a "correct answer." Rather, they sought to find enough information to make a decision.

Thus, participants decided for themselves when to stop. They decided when they had found enough information to make a decision.

Does this sound like the tasks your Web site supports? Probably so, if you work on one of the many large-scale information sites on the Internet, or even on an intranet.

In any event, we just saw an example where the act of observation definitely influenced the outcome. The uncertainty principle operates in usability testing, just as in testing the behavior of sub-atomic particles.

Back to the surface again  

Just to give us a sense of the normal reality found in classical physics, let's take a look at another comparison study.

Thomas Tullis, a well-known usability author and researcher, worked with his colleagues (2002) to check whether automated, unmoderated, remote usability testing gives results similar to laboratory testing. Subjects were employees of a US corporation.

In contrast to the study reported above, Tullis used "closed-ended" tasks. That is, the results were either right or wrong. Does that sound like some of your testing outcomes?

If so, you can have some assurance that the laboratory setting doesn't upset your results. Tullis found that for the 13 tasks they used, subjects in the lab gave results similar to those of subjects in the unmoderated, remote testing environment. His team found similar task success rates and similar task times. No authority effect, here.

Actually, Tullis and team were more interested in whether the unmoderated remote testing was as effective for finding problems as the lab environment. They found that remote testing worked well and had benefits that complemented the lab environment. (Do you recall the benefits of testing more subjects? See our May, 2004 newsletter.)

Aha! More subjects can be better, they found. Whereas the lab setting found 9 issues, the remote test found 17 issues. Well, what do we expect if the lab only has 8 subjects and the remote test has 88 subjects?
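
To put rough numbers on that expectation, here is a small sketch using a commonly cited simple model of problem discovery: if each subject independently has probability p of running into a given problem, the chance that at least one of n subjects hits it is 1 - (1 - p)^n. This model is not taken from the Tullis study. The function name and the p values are assumptions chosen for illustration, and real subjects are not truly independent.

```python
def expected_discovery_rate(p, n):
    """Chance that a problem each subject hits with probability p
    is seen at least once among n independent subjects."""
    return 1 - (1 - p) ** n


# Hypothetical per-subject probabilities: a common problem,
# an occasional one, and a rare one.
for p in (0.30, 0.10, 0.02):
    lab = expected_discovery_rate(p, 8)      # lab test had 8 subjects
    remote = expected_discovery_rate(p, 88)  # remote test had 88 subjects
    print(f"p = {p:.2f}: lab (n=8) {lab:.0%}, remote (n=88) {remote:.0%}")
```

Under these assumptions, a handful of subjects is usually enough to surface common problems, while rare problems tend to show up only with the larger sample.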

Small is good, too  

Interestingly, the law of diminishing returns did not penalize the lab environment unduly. Tullis and crew felt that both the lab and remote environments discovered the three major problems (overloaded home page, general terminology problems, and unclear navigation wording).

Plus, seven out of the nine problems found in the lab were also found in the remote test.
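
The overlap arithmetic is worth spelling out. The sketch below uses only the counts reported above (9 lab issues, 17 remote issues, 7 found by both) and assumes the two lists count issues on the same basis. The variable names are ours.

```python
lab_issues = 9       # issues found in the lab test
remote_issues = 17   # issues found in the remote test
found_by_both = 7    # lab issues that also appeared in the remote test

lab_only = lab_issues - found_by_both                       # 2
remote_only = remote_issues - found_by_both                 # 10
total_unique = lab_issues + remote_issues - found_by_both   # 19

print(f"lab only: {lab_only}, remote only: {remote_only}, total unique: {total_unique}")
print(f"lab found {lab_issues / total_unique:.0%} of all issues")
print(f"remote found {remote_issues / total_unique:.0%} of all issues")
```

On these counts, neither method alone found everything: two issues surfaced only in the lab, and ten only in the remote test.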

However, more subjects can be better, as we said. And that was the benefit of the remote test. After all, we test to find problems, don't we?

Benefits of both unobserved and observed testing  

What other influences of the remote testing environment appear valuable, aside from the absence of the observer?

Tullis and group found they got greater diversity of user types in task experiences, computer experiences, and individual characteristics. They also got more hardware variety, such as screen resolution. And they were pleasantly surprised by the completeness and insights of the typed responses.

But the lab offered value, too. For example, the remote test revealed usage of 1024 screen resolution among nearly all subjects and revealed a problem with small fonts. The lab setting forced usage of 800 resolution and resulted in detection of excessive scroll requirements. The lab revealed that certain navigation options were overlooked, although the remote results showed most subjects found the options anyway over time.

Tullis and group recommend a combination of both remote and lab testing to cover the range of issues.

Certain conclusions – probably  

So, now we know the observer influences the observation, just like real-life physics.

1) We saw that possibility first in the case of concurrent "think-aloud" usability testing: fewer tasks were successfully completed by the think-aloud subjects, but the authors said this difference was not statistically significant. It could be a fluke of the experiment.

But certainly, the concurrent, think-aloud method gave more observed problems than the retrospective reporting. And it gave fewer verbalized problems. However, the combined number of observed and verbalized problems was equivalent between the two methods. Thus, if we have complex tasks that make it hard for subjects to talk aloud when doing tasks, then feel comfortable showing them their video. They can talk plenty during the video.

2) We saw the influence of the observer increase dramatically in the case of "active information search." Subjects in the lab spent twice as much effort on their tasks compared to remote subjects at home. The lab subjects probably felt obliged to please an "authority figure" appearing in the form of the facilitator. But remember, they worked on an open-ended task, unlike the many closed-ended tasks found in transaction-oriented applications. Open-ended tasks are common, too, such as those found on your information sites.

3) Meanwhile, the observer does not always get in the way. In our last study, lab-based observation allowed sight of body language, strained perception, and vocal hesitancy that would be missed by unmoderated remote testing. Here, the observer added value. But the larger number of subjects possible in the remote testing also added a lot of new issues, otherwise missed in the lab. So both methods complement each other. Do both.

Amidst these suggestions, we do find some guidelines, albeit the findings may be qualified by the nationality of the subjects, or their student status, or any of the many other differences from your target population.

But that's life.

All we can say, like physicists do, is take a chance. And make it work.






