The Science of Great Site Navigation

Video Transcript

So in June 1914, on this bridge in Sarajevo, the Archduke Ferdinand of Austria was assassinated.

And his assassination set into motion the events that would ultimately lead to the First World War.

Just a couple of weeks before this event, in an issue of Science magazine, there was an article published by Robert Yerkes on the then burgeoning field of psychological measurement and human intelligence. And the ideas and methods that he was talking about there would go on to be instrumental in both recruiting and sorting the millions of soldiers who were drafted and ultimately served in World War I.

In that same issue of Science, a lesser-known article was also published on “cards as psychological apparatus”. It was actually an interesting article on how important a tool, and how cheap a tool, what they called library cards at the time could be: you could have participants sort them to find out all kinds of interesting characteristics of human behaviour. So this method of sorting cards to measure people's attitudes and performance has been around for almost 100 years now, in many cases longer than that. For almost 100 years it's been talked about as a valuable and cheap method. And that applies to the field of Human-Computer Interaction as well, as most people are familiar with this method of card sorting.

The traditional method of card sorting as we know it now largely took off in the 1990s, certainly in the literature and people using it, as the internet itself started to explode, and people started to now have to navigate through information architectures and taxonomies on websites.

With the increase in technology now, instead of physically being confined to a physical location, you could do this card sorting virtually now, online. I'm showing you, for example, the screenshot of UserZoom, where you've got the cards on the left, and participants who could be recruited in an unmoderated fashion, in an asynchronous fashion as well, can then drag these cards, and create the set of categories. And this is an example of an open sort, where they're also naming the categories.

And the method of card sorting, as I mentioned, is popular. You can see this data comes from the 2011 UPA salary survey. And you can see that a little over half of the participants report using the method of card sorting, which is roughly the same proportion as those who say they do lab-based usability testing.

So we certainly say we do this a lot, and it's come to define what we do. And I'm going to talk a little bit about the quantification of that and its combination of use with tree testing as well.

A couple of terms I've thrown around, for those who are less familiar with card sorting. An open sort is where you're having participants sort these cards, either virtually or in person, physically, and they actually give the names of these categories. So there's not a predefined set of names. Whereas in closed sorting, you have a predefined set of categories and names, and you have users put the cards in those. That's called a closed sort.

And then a reversed card sort, which now is basically just called a tree test, is where you basically ask users to find these items in a hierarchy, but in the absence of any type of visual cues or design. It's just the taxonomy itself.

And one important concept you see with websites, and we certainly see with our data in the websites that we've tested, is that most users do start with a browsing strategy on a website. The vast majority, in fact, we've found. And certainly across some tasks there's much more search, sometimes on densely packed ecommerce sites, and depending on the item it can be high. But in most cases, “browse” is the primary way that users are going to interact, initially certainly, with a website in trying to find either items or information. So it's vitally important to understand how they're going about finding things, how they think about it. And the reason why card sorting is so popular, as we saw, is because like many user-centered methodologies, instead of setting up a hierarchy based upon the organisational structure, or what the developers or VPs or designers think should be laid out, the idea is that the hierarchy should reflect how people generally come to websites, how they think. And so card sorting is still used to get at that mental model from the user, so that the website reflects the user's model as opposed to the company's.

Probably the best way to articulate this is just to go through a case study. So we picked target.com, we figured most people are probably familiar with that, and we'll talk about a case where we'd used both card sorting and tree testing.

So Target's a pretty well refined, well populated website, but as you can see they've got a pretty large navigation taxonomy as well. So let's imagine we wanted to get a baseline set of measures for how findable items are, how well this taxonomy matches to the users' mental model, and identify some areas for improvement.

So we get a set of baseline measures, make improvements, then measure again. That's what we've got here.

And in order to do that, we pick a variety of items that we want to test in a taxonomy. Now we're just doing a hypothetical case study here; but usually when I'm working with folks, doing card sorting or tree testing, there's a lot of hotly debated opinions inside the company about why products are or are not selling. Is it “Could they find it?” “It should be located at the higher end of the category.” Those are usually very good candidates to put into a card sort or tree test, and at least identify, is this an item that is not matching well into people's mental model.

Basically we're going to conduct an open sort with a variety of items, and pick some items, and then as you can see on the next page, we've picked for example for Target, we started with 40 items across several departments. We tried to pick some that we thought were going to be difficult, others that we knew had been difficult across other ecommerce sites, or similar type ones, and some that we thought should be a slam dunk, easy for baseline comparison. And this is sort of an art of providing the item names to the participants, but not in such a way that it's impossible for them to know what the item is unless they know the brand name, for example. Or so blatantly obvious that they know exactly where to go in the hierarchy.

So it takes a little bit of back and forth on what this is. And often if you have search data, and you're collecting search data, you can see what the nomenclature is that people use when looking for particular items. That's what we've got here—some combination of more specific, more general, more common, less common. For example, you can't just say “The Presto Granpappy” and expect people to go find it. You have to specify that it's a deep fryer.

With these 40 items, we recruited 50 participants, most of them female. And then had them sort those 40 items into 10 fixed categories. We picked 10 because that's roughly the number, plus or minus a couple, that you can get across the top of the navigation. And when you do card sorting, we did it using UserZoom, we collected these participants remotely and recruited them. And with any unmoderated recruiting session you need to look at the data, because you're not monitoring what the users are doing, to make sure it's not bad data. And we excluded two in there.

And you can go through users' cards and their sorts, to see what they sorted, and pretty quickly look and see where there are categories that don't make any sense, they're just haphazardly put together. And in this case we excluded two of those.

So it took 11 minutes, on average, that was the median time for participants to sort those 40 items. That's probably one of the biggest, first questions people ask us about—how many items, and how long is it going to take. As you can see, there's this baseline. It does depend, but you can see here, for a site like Target, and those 40 items, it was taking about 11 minutes.

And what this has given us is a dendrogram—I'll just keep calling it a dendagram, 'cause it's a lot easier to say. And then a set of open names that the users came up with. Obviously they're clusters. And then we asked about the items which were most difficult to sort for participants. This is sort of getting a mix of output, without making the study too long.

The signature output of card sorting is this dendrogram. And in any type of cluster analysis you get it. It shows the Euclidean distance that each item is from each other. Effectively when they're close together like this, it means participants put these largely in the same category. With any cluster analysis there's a sort of art to determining where there really are clusters and where there are not clusters. So this goes beyond card sorting. There's a bit of discretion that the researcher has to bring to it.
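
As a rough illustration of how a dendrogram's clusters can be derived from raw sorts, here is a minimal sketch. It swaps in a simpler distance than the Euclidean one mentioned above: one minus the share of participants who put two items in the same group. The item names, the three example sorts, the average-linkage rule, and the 0.5 cut-off are all assumptions made for the example, not data or settings from the study.

```python
from itertools import combinations

def cooccurrence_distance(sorts, items):
    """Distance = 1 - share of participants who grouped the two items together."""
    together = {pair: 0 for pair in combinations(items, 2)}
    for sort in sorts:                    # one dict per participant
        for group in sort.values():       # group name -> list of items
            members = set(group)
            for pair in together:
                if pair[0] in members and pair[1] in members:
                    together[pair] += 1
    return {pair: 1 - count / len(sorts) for pair, count in together.items()}

def average_linkage(items, dist, cut):
    """Merge the two closest clusters until the closest pair is farther apart than `cut`."""
    clusters = [frozenset([item]) for item in items]

    def gap(a, b):
        # Average distance over every cross-cluster pair of items.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist.get((x, y), dist.get((y, x))) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: gap(*ab))
        if gap(a, b) > cut:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical open sorts from three participants (invented for illustration).
items = ["belt", "hiking boots", "dog leash", "puppy gift card", "water bottle"]
sorts = [
    {"mens": ["belt", "hiking boots"], "pets": ["dog leash", "puppy gift card"],
     "misc": ["water bottle"]},
    {"clothing": ["belt", "hiking boots"], "dogs": ["dog leash", "puppy gift card"],
     "sports": ["water bottle"]},
    {"apparel": ["belt", "hiking boots"], "pet supplies": ["dog leash"],
     "gifts": ["puppy gift card", "water bottle"]},
]

distances = cooccurrence_distance(sorts, items)
clusters = average_linkage(items, distances, cut=0.5)  # 0.5 is an arbitrary cut
for cluster in clusters:
    print(sorted(cluster))
```

Where to place the cut is the discretionary step: a lower value splits the items into more, tighter clusters, and a higher value merges them.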

But you go look for where there are natural breaks. So what you can see is users have suggested some. I've identified 12 clusters here. And I went ahead and gave them preliminary names. So you could see the items in this first cluster in the dendrogram roughly reflect mens and boys items. And you get womens and girls. Then you've got accessories. Then you've got what I refer to here as a “runt”. It's just the jargon used in dendrograms to refer to items that are sorted in their own category. So participants weren't quite sure where to put the NBA Golden State Warriors blue water bottle. They put it by themselves. And that goes for the Commodore Big Intel Exec oversized chair. They didn't seem to sort too well together. Same thing with the shaver.

And then you've got some other interesting categories. There's only two. You generally like to see where there's a larger cluster of categories. So these are sort of a set of preliminary hypotheses to say, “Is this just because we had only one or two items in that category? Or is it because these reflect items that don't match our navigation and they're different to the users' mental model?”

As you can see for example, this $5 - $1,000 puppy gift card? And the Dog Zip blue leash? They're categorised together, and I'm going to use this as an example as to maybe this is a false sort, reflecting the artificiality of the items we gave.

So zooming in a little (I know dendrograms are a little hard to see without blowing them up). As you can see there's the distances between this categorisation of mens, womens, accessories, and there's that one runt in there. So this is your first set of categories: if you give participants this set of items, this is how they're going to sort them. And then the next thing I like to look at is, “What did they call those categories?” Because this is an open sort, they can call them anything. And any time you allow users to call it something in an open text field, they generally will call it anything! You'll see all sorts of varieties of spelling, different punctuation, grammar and capitalisation. I think we had something like 260 unique categories.

But in many cases if you just sort those open-ended items into a simple content analysis, you can actually generate a more manageable, closed set of categories. And that's what we did here.

So this graph shows the percentage of people that used a category that was similar to “electronics”. And also shown are the 90% confidence intervals. So if we were to increase our sample size substantially, then even though we saw 84% in our sample of 50, we wouldn't expect any more than about 92%, or fewer than 75%, to identify some category as electronics. So it's a very popular category.
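
The talk does not say which interval formula was used. As one plausible way to get a 90% confidence interval around a proportion like this, here is a minimal sketch using the adjusted Wald interval. The count of 42 out of 50 is an assumption that simply restates 84% of 50.

```python
from math import sqrt
from statistics import NormalDist

def adjusted_wald(successes, n, confidence=0.90):
    """Adjusted Wald interval: add z^2/2 successes and z^2 trials, then apply Wald."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # two-sided critical value
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

# Assumed count: 42 of 50 participants used an "electronics"-like category.
low, high = adjusted_wald(42, 50)
print(f"{low:.0%} to {high:.0%}")
```

With these assumed counts the interval lands close to, though not exactly on, the range quoted above. A different formula or a slightly different count would account for the gap.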

Bear in mind that people will call it “electronic” or “electronic department”, or it will be a capital E or a lowercase e. So all those together get lumped into something called electronics. So we do a really conservative sort, and we try and keep the categories, but we put them together so we can manage them: womens, mens, homewares, for example.
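
The lumping step described here can be sketched roughly as follows. The synonym map and the participant labels are hypothetical. In practice the map would be built by hand while reading through the open-ended names.

```python
import re
from collections import Counter

# Hypothetical synonym map, built by hand during the content analysis.
CANONICAL = {
    "electronic": "electronics",
    "electronic department": "electronics",
    "men": "mens",
    "home": "homewares",
}

def normalise(label):
    """Lowercase, drop punctuation, collapse spaces, then map known variants."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", label.lower())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return CANONICAL.get(cleaned, cleaned)

def category_usage(labels_by_participant):
    """Share of participants who used each normalised category at least once."""
    counts = Counter()
    for labels in labels_by_participant:
        counts.update({normalise(label) for label in labels})
    total = len(labels_by_participant)
    return {category: count / total for category, count in counts.items()}

# Invented category names from four participants.
participants = [
    ["Electronics", "Men's", "Home"],
    ["electronic", "mens"],
    ["ELECTRONIC  Department", "Pets"],
    ["Gadgets", "Men"],
]
usage = category_usage(participants)
for category, share in sorted(usage.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.0%}")
```

Labels that are not in the map, such as "Gadgets" here, are left as their own category, which keeps the sort conservative.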

So with confidence intervals one thing that's nice about it is, you can say, well, when there's no overlap, you can be at least 90% confident that there's a difference. Homewares, for example: the lower bound of the confidence interval exceeds pretty much everything here. We could assume that in almost every card sort, a majority are going to use homewares as a major categorisation strategy; the same thing with mens and womens. Whereas, with things like appliances and sporting goods, you start getting into the tail here ... at the upper end, about a quarter of the people are using that search strategy. Now again, is that because of the items we're using, or is that because of just the way that people think of things, and how they search from the top level down?

One thing we love to ask is, quite simply, what was the most difficult item to categorise? Again we're starting to see now, some confirmation of the things we were starting to see in that dendrogram. The item that 40% of participants listed as the most difficult to categorise, or as difficult to categorise, was this $5 - $1,000 iconic puppy gift card.

We picked this item as a gift card, and it was the actual label on Target, but this goes to show you where your good intentions can go awry. Because what this refers to is the actual Target gift card, with a little puppy face on it. Let me show you what that looks like here. This thing right here, it's a $5 - $1,000 puppy gift card. So there's the cute puppy right there. So I think participants were unsure: “Am I getting a puppy for $5 - $1,000? Or is this just a gift card?” So that could be an aberration in our testing. But nevertheless, gift cards tend to be outside the range of where people look, and that's reflected in how they're sorted as well.

We got some other ones too that, alright, they were harder to sort. Other ones that seemed to be slam dunks. They seemed to know where to sort these — for example, the sewing machine. They seemed to know pretty well where to put that together. So this is our second step of the data as we're forming our ideas. And usually this sort of thing, to get right away, will help answer some questions that are hotly debated within a company about which items should be together and which ones are difficult to find.

Oh, there's our puppy right there.

Alright, so as I talked about, this is the lower part of that dendrogram, we've got that Nano Tech Shaver, which is a runt. There's that gift card ... and as you can see, because people were seeing “puppy,” they were also associating that with this leash, which people also assumed goes with “dogs”. And you'll see this a lot of times in card sorting. It's called, basically, a “false sort”. People are just focussing on the wrong item, in this case, and part of that is just an aberration of the method. Those are things to look out for.

And then, if you were to change this to “$5 - $1,000 Gift Card”, we'd expect that to just go away, or at least form somewhere else.

So again, a lot of data. This is kind of an eye chart, but part of what comes out of a card sort is: you get to see everywhere people put things, what they called it, what the first, most popular categories were. So I had here in the dendrogram different categories, so we can look in here to see where they fell. So for example, I've got highlighted the large sunglasses. And the way to look at this is: these are the items on the left side of the 40. And this first slide shows where people identified on the primary category there's large agreement, well at least above 40% of people put it in roughly the same category.

So people mostly put large sunglasses in a category called “accessories”. 46% of people did. The second choice they put it in is “womens”, so a bit of a jump. The third choice is they put it in “clothing”, “men” and then “miscellaneous”. Next you see something like the Apple iPod Touch case: 65% - so there's strong agreement on where people would go to look for that, at least that's what the data is suggesting there. And there's a much bigger drop between the second, the third and the fourth choice. This is another good spot that says: “These second and third choices are good places to have cross-linkage within your hierarchy.” Even at best, only maybe 60 or 70% of people are going to look in electronics to find something like an Apple iPod. So good navigation will support multiple points.

And then as we go down the list, at the bottom end, we've got less agreement than those first items had on the category they fell into. For example, the Nano Tech Shaver? 38%. So it's less than 40. Some people put it in electronics. Some put it in bath, homeware, men, miscellaneous. There's that puppy gift card again, you can see some people put it in Pets, suggesting that it's a false sort, because you don't just buy puppies with it.

But then you get all the way down to the bottom of the list, there's that other runt, the Golden State Warriors blue water bottle. As you can see, boy, people just didn't know where to look. We saw that in the dendrogram. And we're also seeing a really good split here between categories. It could be miscellaneous, it could be accessories. Either one is good, and it suggests you'd want to have multiple links there as well.
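
The table walked through above (first choice, second choice, and so on for each item) can be sketched like this. The placement counts are invented, and the 15% threshold for flagging a cross-link candidate is an arbitrary assumption, not a rule from the talk.

```python
from collections import Counter

def placement_shares(placements):
    """placements: (item, normalised category) pairs, one per participant per item."""
    by_item = {}
    for item, category in placements:
        by_item.setdefault(item, Counter())[category] += 1
    shares = {}
    for item, counts in by_item.items():
        total = sum(counts.values())
        # most_common() orders categories from first choice downwards.
        shares[item] = [(cat, count / total) for cat, count in counts.most_common()]
    return shares

def cross_link_candidates(ranked, threshold=0.15):
    """Categories after the first choice that still attract a meaningful share."""
    return [category for category, share in ranked[1:] if share >= threshold]

# Invented placements for one item across ten participants.
placements = (
    [("large sunglasses", "accessories")] * 6
    + [("large sunglasses", "womens")] * 3
    + [("large sunglasses", "clothing")]
)
shares = placement_shares(placements)
ranked = shares["large sunglasses"]
cross_links = cross_link_candidates(ranked)
print(ranked)
print(cross_links)
```

Items with a low first-choice share and several categories above the threshold are the ones where multiple links in the navigation are most likely to help.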

So that's a first ... here's an idea of a mental model. This by itself will allow you, like I said, to go answer particular questions about hypotheses. It usually generates more questions than answers, and that's where the tree test really is the good complement. You go through and you say, of the items that let's say, users had a particularly hard time finding, or that we want to get more data on, or we had an additional set of questions about, “Well, we've got low sales on the Brother sewing machine. Can people find it?” Often in companies there's heated debate about who gets top billing on the top navigation or the secondary nav. Everyone wants to be on the home page and that link, but certain departments are responsible for revenue or year-over-year growth. So a tree test is going to at least allow you to measure findability, like a more traditional usability test.

So we selected 14 ... some were easy ones. Like, the mens belt we expected to be easy ... some of the womens things there was a lot of agreement. And other ones where we thought it was going to be a little bit harder to find.

And so we again recruited another set of 50 participants, mostly women. We didn't have to reject anybody this time. It took 17 minutes to have those 50 participants, on average, find 14 items. So then again you'll see, for a fraction of the items it took longer. Now this is partly because at the end of each findability task, we're asking them a couple of questions. So the tasks take a little longer for people to find, and then we're also adding a little bit more overhead here. So that's one of the reasons why tree tests, in many cases, will also have a pretty significant subset of the items that you'll have. So you'll typically want to keep the time for a study under 30 minutes, as far as possible.

So as I alluded to earlier, in UserZoom, after each participant tried to find the item they're presented with, they're asked how confident they were that they found the item correctly, and then how difficult it was. And we use these questions in most of our studies. The difficulty question is the Single Ease Question (SEQ). It's just seven points, but we've got data that shows it's a good scale, and we've got 250 other web-based tasks, so we can normalise the score and get an idea about how difficult it is.

This is what the user sees in UserZoom. So as you can see here they're presented with a task at the top: “Where would you go to find the Nano Tech Shaver?” And then here is that hierarchy, at least a selection of it, just a screenshot showing where it is, without any design elements to show the participants or give them any cues. And they will go through the navigation, it opens and closes just like a tree, and then they select it and we get: where their first clicks were, where they ultimately clicked.

And then prior to doing this we went through and identified where we figured the optimal paths were, and acceptable paths as well, so we could get an idea of success rates.

And that's the first thing you want to look at in a tree test, just like a traditional task-based usability test. It's the fundamental usability metric: “Are users able to find the item?”
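
Scoring success against predefined optimal and acceptable paths can be sketched as below. The tree paths and the ten final clicks are hypothetical and are not taken from the Target hierarchy or the study data.

```python
def score_task(final_clicks, optimal, acceptable):
    """Share of participants ending on an optimal node, an acceptable node, or either."""
    n = len(final_clicks)
    direct = sum(click in optimal for click in final_clicks)
    indirect = sum(click in acceptable for click in final_clicks)
    return {
        "optimal": direct / n,
        "acceptable": indirect / n,
        "success": (direct + indirect) / n,
    }

# Hypothetical correct locations for one item, decided before the test was run.
OPTIMAL = {"home/crafts/sewing machines"}
ACCEPTABLE = {"home/appliances/sewing machines"}

# Hypothetical final selections from ten participants.
final_clicks = (
    ["home/crafts/sewing machines"] * 3
    + ["home/appliances/sewing machines"]
    + ["electronics/small appliances"] * 4
    + ["clothing/accessories"] * 2
)
result = score_task(final_clicks, OPTIMAL, ACCEPTABLE)
print(result)
```

Keeping optimal and acceptable paths separate makes it possible to report overall success while still seeing how many participants took the intended route.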

As we expected, we had some good wins. We thought the belt would be easy, and everybody was able to find that. And most people were able to find everything up here.

Interestingly enough, we had some items down here, this backpack. Now, nobody could find the backpack. This wasn't showing up as the hardest thing on the open sorts ... that's left us kind of scratching our heads. This Mens Brown Swiss Gear Jackson Hiker, these are actually shoes, so we thought, well maybe that's because the label isn't completely indicative.

And there's that sewing machine, which seemed like most people were finding it. So this is our first set of data, again, which can answer a) Can people find it? This is our first step. And then, from that, we also again asked that question that we like to ask, which is: which items were the most difficult to locate? And they can pick more here than one.

The Number 1 thing here, surprisingly, was “sewing machine”. Something you'd think would be easy to find. Most people said that was hard to find. Then again, we saw that backpack, there's those hiking shoes, the shaver, which was kind of by itself, it was a runt in the open sort. Now, it's barely made the list. Again, 90% confidence intervals are going to tell us if we need to increase our sample size. No less than about half the participants are going to have trouble with the sewing machine. So if I'm in the sewing machine department, and I'm trying to push more sewing machines, this is data that suggests people are having a hard time finding sewing machines in our current navigation structure.

We also like to ask participants, based upon what they pick, we ask them: Why did you have a hard time with these particular ones? And so we summarise those verbatims into a couple of digestible nuggets, so we can sort of say, “oh, that makes sense.” So for example with the sewing machine, it just, in many cases, it just didn't have a separate section. It didn't fit in the categories.

And this Kaleidoscope backpack, this one we scratched our head, but then when we saw it, it's like: Oh, I think people were focussing on “kaleidoscope”. Is it a toy? Is it camping? Is it luggage? Or is it in the childrens section? Again, a great place to make sure you've got cross navigation here for findability.

Then we were concerned about, maybe it was just an aberration because maybe we should have put “hiker boots”. And they, as you can see, did not know what this was. Was it a boot? Was it a jacket? And that suggests that's probably why that was identified as something difficult.

And interesting enough, Nano Tech Shaver: is it for a man or a woman? Calvin Klein, so ...

Here's the results of the Single Ease Question. You know, we're sort of triangulating ... multiple measures are going to correlate, but each one sort of tells a different story. So this is the Single Ease Question. It's been normalised. So 50% = average findability; 100% means it's extremely easy to find; 0 means it's very difficult to find.
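
The talk does not describe how the normalisation is done. One common way to turn a raw mean rating into a percentile is to convert it to a z-score against a benchmark of other tasks and read off the normal distribution. The sketch below assumes that approach. The benchmark mean and standard deviation are placeholder values, not figures from the talk, and the ratings are invented.

```python
from statistics import NormalDist, mean

def seq_percentile(ratings, benchmark_mean, benchmark_sd):
    """Percentile of a task's mean SEQ rating relative to a benchmark of task means."""
    z = (mean(ratings) - benchmark_mean) / benchmark_sd
    return NormalDist().cdf(z)

# Placeholder benchmark parameters (assumed, for illustration only).
BENCHMARK_MEAN = 5.5
BENCHMARK_SD = 0.9

average_task = seq_percentile([5, 6, 5, 6], BENCHMARK_MEAN, BENCHMARK_SD)
hard_task = seq_percentile([3, 4, 4, 5], BENCHMARK_MEAN, BENCHMARK_SD)
print(f"average task: {average_task:.0%}, hard task: {hard_task:.0%}")
```

Under this scheme a task whose mean rating equals the benchmark mean scores 50%, matching the reading of the scale given above.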

So the ones that were most successful in general are also the ones that people tended to find easier to use. But it's not always the case - you tend to get more discrimination here.

And so, see that sewing machine! I mean, 42% of people were able to find it! But when it scores the 5th percentile for the SEQ, that means it's more difficult to find than 95% of products. That's a powerful message if you're a product manager for that product, or you're a buyer, or if you're trying to sell that particular sewing machine.

Same thing here with that backpack. It may just be an aberration. If you're trying to introduce a new product line, like wedding invitations, it suggests that the current navigation doesn't support that too well.

We asked about that confidence? Here's the raw score of confidence from 1 to 7, again telling how confident people generally are. You can see a similarity with the difficulty. What we like to do with confidence is, we like to put that in a four-block with task success. So everything that's in the upper right corner here shows where participants thought the items ... they were very confident that they found it in the right location. This is where it would be in the tree test. And in fact they did, they had a high success - above 50 and above 50.

Confidence in this case was just normalised, so it went from 0 to 100%.

So down here, in the lower left quadrant. It's what I call the Difficult quadrant. It's difficult ... below 50%. People know it. In the Deceptive quadrant, you've got: people are generally, well at least half of them are finding them. But they're not terribly confident. So for example, Party Favor, and there's that sewing machine creeping in there ... wedding cards.

The Disaster quadrant is where people are like: “Oh yeah, I'd find that in the BBQ section.” That's generally where people are looking. But in fact that success rate of 40% is pretty indicative that perhaps there's something else going on. It's probably worth investigating to work out why, looking at the verbatims, part of the redesign, these are the sort of things you want to be concerned about.

And of course you want to get as many folks up here as possible into the right hand quadrant. You want people to be finding it, and you want them to be sure about it. Now in any usability test it's a contrived scenario. You're compensating participants. It's what I call, in any study, it's “Because you asked me to do it, it must be there” bias. In real life, if they're looking for something, they don't actually know if it's actually carried by Target. So they'll look for it, and they'll give up, assuming it's not there.

And so that's why both confidence and correct is important. You want people to be both sure that they're finding it, and finding it.
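
The four-block can be expressed as a small classification rule. The 50% cut-offs follow the "above 50 and above 50" description, the label for the upper right quadrant is a placeholder because the talk does not name it, and the item scores are invented.

```python
def quadrant(success, confidence, cutoff=0.5):
    """Place an item in the success-by-confidence four-block (both scaled 0 to 1)."""
    if success >= cutoff and confidence >= cutoff:
        return "found and confident"   # placeholder label for the upper right
    if success >= cutoff:
        return "deceptive"             # found by most, but without confidence
    if confidence >= cutoff:
        return "disaster"              # confident, but mostly in the wrong place
    return "difficult"                 # not found, and participants know it

# Invented (success rate, normalised confidence) pairs.
scores = {
    "item A": (0.95, 0.90),
    "item B": (0.55, 0.40),
    "item C": (0.40, 0.70),
    "item D": (0.20, 0.30),
}
placed = {item: quadrant(*pair) for item, pair in scores.items()}
for item, label in placed.items():
    print(item, "->", label)
```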

So now what I've done is I've shown that same question that we asked in both the tree test and the open sort: “What was the hardest item to find?” And this is where it really illustrates the important complementary nature of card sorting and tree testing. Most researchers are very familiar with card sorting, and have conducted it. But card sorting in many cases is just the beginning of the conversation. So what this graph is showing us, for example: look at that sewing machine. In the open sort, it was barely registering as one of the most difficult items. It didn't really even stand out as very difficult. Yet in the tree test it's by far the most difficult item to find. So people are having not a very hard time sorting that, but they're having a very hard time finding that.

And the reason why this is important, and why tree testing is an important complement of that conversation, is because at the end of the day, users don't come to your website and sort cards. They come to try and find things: find information, find products. So tree testing is essential validation. And then you've got some down here where we have good consistency, where it wasn't hard to find in both tests.

When I went through and just did a correlation between the difficulty to find in the tree test and the difficulty to sort in the card sort, it's got a 0.4, meaning “difficulty sorting explains only about 16% of difficulty finding”. And that's pretty consistent with what we've seen across other tree tests and card sorts. It's between about 16% and 25% explanatory power. Meaning there's similarity - you're seeing similar things as you see here. But you're going to get some differences, and so both of these together are providing a better picture.
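
The step from a correlation of 0.4 to "about 16%" is just squaring the coefficient. A minimal sketch follows. The per-item difficulty shares are invented and serve only to show the calculation.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Variance explained is the square of r: 0.4 gives 16%, 0.5 gives 25%.
explained_low = 0.4 ** 2
explained_high = 0.5 ** 2

# Invented per-item shares: "hard to sort" (card sort) and "hard to find" (tree test).
hard_to_sort = [0.40, 0.10, 0.05, 0.20, 0.15]
hard_to_find = [0.15, 0.50, 0.05, 0.30, 0.10]
r = pearson_r(hard_to_sort, hard_to_find)
print(f"r = {r:.2f}, variance explained = {r ** 2:.0%}")
```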

So one additional thing to talk about that I wanted to mention, because this comes up all the time, is “Well, how did you come up with your sample size?” “What sample size do you use?” Surprisingly there's not much literature in the area of a sample size for card sorting. And part of that stems from the fact the nature of the output of a dendrogram is difficult to identify an optimal sample size. Some work that Tom Tullis had done I think now almost 20 years ago, and that he published about 15 years ago, talks about ideal sizes to get similar structures.

But another way to identify the sample size is based upon the other output measures that we saw today, which were things like success rates of items, or percent difficulty in finding them. When we have confidence intervals, we can just work backwards from those confidence intervals and generate the desired precision that we want. So you can see here we were bouncing somewhere between 45 and 54, and so most of our margins of error, those error bars, were about plus or minus 10 percent. Sometimes smaller, sometimes larger, depending on where it was relative to the proportion. If it's closer to 50% it'll be wider. And then for the continuous measures like SEQ and confidence, it'll be narrower as well.

But you could see sort of what you need to get a higher level of precision, the number of participants you would need for the sort. But certainly it's a nice sweet spot, as you can see, the bang for buck happens in here. I mean: 50? You'd have to quadruple your sample size in order to cut your margin of error in half. So even at 50 ... I know we were coming in at plus or minus 10%, which means we're able to detect the more obvious differences between success rates, different groupings, and that was good certainly for this project, but there are also usually more things for researchers to tackle, design teams to work on. So you want to identify those big ones.
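
Working backwards from a desired margin of error to a sample size can be sketched with the simple Wald formula for a proportion. This is an assumption, since the talk does not state which formula produced its table, and the 90% confidence level is carried over from the intervals shown earlier.

```python
from math import ceil, sqrt
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.90):
    """Half-width of a simple Wald interval for a proportion p at sample size n."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)

def sample_size(margin, p=0.5, confidence=0.90):
    """Smallest n whose Wald margin of error is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)

for n in (50, 200):
    print(n, f"+/- {margin_of_error(n):.1%}")   # quadrupling n halves the margin

print(sample_size(0.10), sample_size(0.05))
```

Because the margin shrinks with the square root of the sample size, going from 50 to 200 participants only halves it, which is the diminishing return described above. The default of p = 0.5 is the widest case. Proportions nearer 0 or 1 give narrower margins at the same sample size.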

10% is also a reasonable level of precision when we want to go forward next time and say, “we've made changes, now let's test against our measures to see how well we've improved or not improved at all.” And you're going to need a slightly different calculation to come up with that.

Some conclusions when we go through this. Typically we'll have a pretty good discussion with product teams, or people that are going to be changing the product hierarchy, to see whether particular hypotheses were either confirmed or denied. Which you basically want to say: should we create new categories? Should we consolidate those categories? How are people searching? Are they looking by gender, or is it more age? Should there be more sub-categories? And then methodologically watch for those false sorts we talked about: that puppy gift card, the mens hiking boots - people had no problem sorting that because it said “mens”, they just threw it in the mens category. But then when they had to find it, they suddenly had to pay more attention. And then it's like, “oh, well, is that in shoes? What exactly is that?”

And then for those key items that might be hard to find, like the sewing machine and the backpack, maybe this is an opportunity to move them, and test that move. Find out if you've made an improvement from the baseline measures you had. And then as I've illustrated in those last few slides: it's hard to do a card sort without then doing a follow-up tree test, because it just tends to open up so many more questions. And once you've got those items that are difficult to sort, or some of the ones that are easy, I love to go immediately into a tree test, to validate. And to find out more areas that the card sort is not finding. Because, like I said, users don't sort cards on your website. They look for things.

It's vital, of course it's important to get at their mental model when building that taxonomy. But you want to get some idea about a realistic search scenario.

So that's what I've got for the presentation portion of today. To find out more, a book that Jim Lewis and I have: Quantifying the User Experience. It's a couple of months out from Morgan Kaufmann. You can find out more about it, and some of the things I talked about also, on the website measuringusability.com.

Jeff Sauro

UserZoom