Design Metrics That Matter | Jared Spool, UIE | BoS USA 2018

February 4, 2019 by Paddy Heaton

Jared Spool, Founder, UIE

Design is a process where we finely tune our intuition to create great user experiences. Yet, sometimes, what we think is best rivals the metrics. What do we believe – our gut or the data? In the world of KPIs, some practices, like growth hacking Monthly Average Users (MAUs), have hurt the online experience. What can help designers have better instincts? What do you really need to help management interpret data and create analytical experiments that provide design insights? This talk will show you:

What easily-collected analytics, like bounce rate, actually tell us about our users’ experiences
Why techniques like a money-left-on-the-table analysis can show us how metrics impact design
Why asking, “Would you recommend this?”, is a bad way to measure brand engagement

See the Video, Slides, and Transcript below

Video

https://businessofsoftware.wistia.com/medias/t65oa6eke1?embedType=async&videoFoam=true&videoWidth=640

Slides

Jared Spool (Founder, UIE) – Design Metrics That Matter from Business of Software Conference

Transcript

Hi. This is my first time here. Uh, thank you. From what I can tell, putting on an event like this takes an immense amount of work – Mark’s been working on this for a very long time. Uh, and there’s a whole lot of staff involved. I think we should make some noise to thank them for that. Excellent. Okay, very good.

So as Mark mentioned, I am, uh, Internet sensation and teen heartthrob Jared Spool. And what I would like to do is go back to 2010 when a young designer, uh, in Sydney, Australia, uh, decided to, uh, sign a book contract. He was going to write a book, the first time he’d ever written a book. And he did what many first-time book authors do the moment they sign their contract: he decided it was the perfect time to redo his personal website. And in the process of redoing his personal website he thought, well, okay, this is going to be a book on performance-based design, so I’m not going to do one version of the website, I’ll do two versions of the website. I’m going to A/B test them and then I’m going to see which one performs better and that’ll be my site. And then I can use this as an example in the book. So he comes up with one example. This example talks specifically about the book. It has all sorts of information about what the book’s about and what’s going to be in it. And if you’re interested you can sign up with an email address. In the second variation he decides that he’s not going to say anything about the book, he’s just going to say, hey, are you a designer? If so there’s a book coming, you probably want it. And again, a place to put in your email address. And the goal was to see which collected more email addresses; whichever one collected more email addresses, that would be the one he was going to go with. And interestingly enough, he expected the first variation to collect more email addresses, since that’s the one that actually told you anything about the book. Okay. But to his surprise, that only collected 33 email addresses out of the 600 people he had visit that page, versus the other design, which collected 77 email addresses out of the 600 who visited that page. So fascinatingly, uh, he was once again pleased that data actually trumped the design.
The data actually told him that his instinct was wrong and this was basically the premise of his book: that data will tell you where your instinct is wrong. Now there are two things that are specific here. One is that he had this base assumption that more email addresses are in fact the better measure, and he also had the basic assumption that all email addresses are equal. The thing is, what would we expect if the metric wasn’t number of email addresses, but something more business oriented, like sales of the book? Would we expect that the people who get the subsequent email campaign, once the book is out, would be more likely to buy if they came from variant A than if they came from variant B? The ones who actually knew about the book, who signed up, would they be more likely to purchase than the ones who knew nothing about the book and, once they find out, maybe would say, actually, that’s not something I’m interested in? Unfortunately, we’ll never know because he never wrote the book.
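
The arithmetic behind that A/B test can be sketched in a few lines of Python. The signup counts and the 600 visitors per variant come from the talk; the purchase rates are purely hypothetical, since the book was never written and no sales data exists. The sketch only shows how the "winning" variant can flip once the metric changes from email addresses to sales.

```python
from fractions import Fraction

VISITORS_PER_VARIANT = 600  # from the talk

# Signups per variant, from the talk.
SIGNUPS = {"A": 33, "B": 77}  # A describes the book, B says nothing about it


def signup_rate(variant):
    """Share of visitors who left an email address."""
    return SIGNUPS[variant] / VISITORS_PER_VARIANT


def expected_buyers(variant, purchase_rate):
    """Buyers we would expect if that share of the variant's signups purchased."""
    return SIGNUPS[variant] * purchase_rate


# How much more likely to buy variant A signups must be for A to match B on sales.
BREAK_EVEN_RATIO = Fraction(SIGNUPS["B"], SIGNUPS["A"])  # 77/33 reduces to 7/3

# Hypothetical purchase rates, invented for illustration only.
buyers_a = expected_buyers("A", 0.20)
buyers_b = expected_buyers("B", 0.05)
print(f"A: {signup_rate('A'):.1%} signups, about {buyers_a:.1f} expected buyers")
print(f"B: {signup_rate('B'):.1%} signups, about {buyers_b:.1f} expected buyers")
```

Under those made-up rates, variant A collects fewer addresses but more buyers. Whether anything like that would have happened is exactly the inference the experiment never tested.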

However, there’re some interesting lessons we can learn from his experiment. We can start by taking apart what he actually did. He collected email addresses and he made a couple of assumptions. Now in the world of experimental design we have actual names for these things. We call the email addresses observations. While we didn’t actually observe people typing them in, we could observe that the database collecting them got larger as time went on. That more email addresses are better, that all email addresses are equal, those are not observations. Those are inferences, and the difference between observations and inferences is actually very important. This sort of builds on what, uh, Bob was just saying. We have to have this nice separation of our observations versus our inferences because it’s observations that lead to inferences. We make our inferences based on the observations and then when we want to change something, what we change, well, that’s the design decision that’s based on the inferences that we made. So we can sort of map this out. We can see how this all works. Right? So the observation was that the second variant had more email addresses, the inference was that more email addresses are better, therefore we should use the second variant. But that’s based on that solid inference that more email addresses are better. What we’re actually doing here is we’re saying, okay, what did we see, why do we think it happened, and how will that in essence improve the design that we have?

Many years before Luke’s experiment I got to work on a project at Wells Fargo and the team there had just launched this website. This was 2004. Now to put things in perspective, the site that existed before this one had a gray speckled background. And the site that came up was in the days, uh, when graphical design was sort of new. So this project started back in ’99 and at this point, uh, doing anything graphically that was not just plain text through the browser, uh, was a new thing. And there were lots of things about this page that were brand new for users. Being able to sign in to your account on the homepage, for example, was a brand new feature. No one had ever done that before. And so the team was really interested in figuring out, well, could we learn something from the logs? Could we learn things from the measurements? And so we started to dive into various aspects of what was going on on this page and seeing what the logs could tell us. And one of the first things that we looked at was what sort of things people were using search for. We were very fascinated by the search box primarily because it was one of the first instances where the search box itself was on the home page. Right. So this was something new. In most other systems, you had a button that said search, or a link, and it would bring you to a separate page and you’d type in your search there and it would have some sort of fancy way of doing it. This was on the homepage, so we wanted to see what people were looking for, and the team thought that the most common thing searched for would be things like where’s the nearest ATM or what are the current money market rates. But interestingly enough, when we dove in, we found that the most searched for thing was actually blank. The log file was filled with blanks. Not every record, but a lot of them had blanks in them. So we tried to figure out why blank was the most searched for thing.

So the team had a divergence in opinions about what could be causing this. There was a faction that believed that it had to do with the enter button, right, the enter button on the keyboard, that that had been mapped not to the search box but to the sign on box, I’m sorry, inverted: to the search box instead of the sign in box. So what people were doing was they were actually typing in their username, hitting tab, typing in their password, hitting enter, and at that point the search button would trigger with nothing in the search box: blank in the field. There was another group that had a different theory. They believed that because this was the first time this search box appeared on the homepage, people just didn’t know the behavior. They didn’t realize they were supposed to type something in and then hit search. That box next to it doesn’t have a label in it. It doesn’t have a prompt. It’s an unfamiliar design pattern. They thought, well, maybe they just don’t know to type something in, and because all previous sites required that you hit search first and then you search, maybe that’s how they thought it was. There was another faction that was convinced it was the lack of advanced search, that people wanted advanced search. They had been arguing for this since day one. They knew that it was advanced search, that must be it. And then there was a small group of folks who just thought that the logging system was broken, that in fact people were typing things in and they were hitting search. It just wasn’t getting recorded in the log file.

So which of these is it? That’s the question. And we can map this back into our little chart here. We can say, okay, the log file contains blanks. There are four different inferences we can have. And the interesting thing is, depending on which inference is right, we’re going to come to a different design solution. So actually knowing which inference is right is important, because if we pick the wrong one, we’re going to execute the wrong design solution. That’s absolutely critical because it means that we have to do more work. Most teams often start at the first inference and just say, well, this must be it. As soon as they have an inference, they run with it and they go with the design solution there and then they find out that nope, that didn’t fix it at all. The best teams with great designers will figure out that they should not stop at the first inference. They should make sure they understand all the inferences and then work from there. The trick then becomes to pick an inference and create an experiment to actually say, okay, let’s figure out what it is. So the team did just that. They started to create a series of experiments and, for instance, one of the experiments they created was to sit with people in the lab and watch them use search and see if anyone just hit the button without typing anything in. And if that was the case, they could talk to that user and find out that they just didn’t realize they were supposed to type something into the box. And now because of that, there’s only one inference that matters. The other ones fall to the wayside and they can go and work from there. So by doing additional research, we can turn inferences into observations. And that allows us to have more confidence as we go forward.

Now I’ll do a little experiment here. I’ve picked a measure and I’ve tracked it across four different websites. This is the proportion for Snapchat, Airbnb, Uber, and Facebook. And I want you to guess what it is I’m measuring. Breached accounts? I like that answer. No. What I am going to do is I’m going to, uh, map it against valuation for you, just to give you a clue. So this correlates strongly with valuation, so we know it’s not breached accounts. Any other guesses? Whoa, you’re allowed, there was a voice over here. Number of Russian bots. Okay. Wow. We’re a dark group. This all started because Peldi deleted 12,000 records, isn’t it? Yeah. Time spent in the app? That’s a good guess. It’s wrong, but it’s a good guess. R&D spend. That’s also a good guess. Nope. Number of ads? That’s an excellent guess. Nope. Completely wrong. Number of users leaving the site? Nope. Accounts? No, nope. You’ll never guess. It’s the number of ‘E’s on the homepage. If you had any doubt: this is Snapchat’s page and this is Airbnb’s page and this is Uber’s page and this is Facebook’s page. Whole lot of ‘E’s on Facebook’s page right now. We might agree that counting the letter E is a stupid metric. Though I have been in talks where I have been talking to a startup group and I swear there’s some VC guy in the back who’s calling someone, saying, we got to get more ‘E’s on our homepage. But I think we could pretty much agree that counting the letter E is a very stupid metric to track.

But here’s the thing, the most important word here is actually ‘metric’. And it’s a word that people use interchangeably with the word ‘measure’, but they are not the same thing. See, a measure is something that we can count. All right? The number of ‘E’s is something we can count. A metric is something that we track. We track over time, we track across different, uh, targets. It’s something where we compare one measure to the next. And we can measure and create metrics for all sorts of things. We can count the amount of time on the site, we can count the number of accounts that were breached, we can count, uh, how happy the users are. We can measure that, right? We can take a little piece of tape and put it on the table and we can put a coin on the tape in the middle and we can have little markings and we can have them move the coin when they’re happy or not happy and suddenly we know whether they’re happy or not. And we can track that over, you know, all the participants we see in a study. There’s a third term that we like to use, which is analytic, and an analytic is a measure that software tracks. It can track the number of ‘E’s on the page. It can track how many people visit the site. It can’t easily track how happy the user is. So not all metrics are analytics.
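
As a small illustration of that vocabulary, here is a sketch of the letter-counting example in Python. The function names and the page snippets are hypothetical stand-ins, not the real homepages: `count_letter` is a measure (one count), `track` turns it into a metric (the same count compared across targets), and because software is doing the counting, it is also an analytic.

```python
def count_letter(text, letter="e"):
    """A measure: one count taken from one piece of text."""
    return text.lower().count(letter.lower())


def track(measure, samples):
    """A metric: the same measure compared across several targets."""
    return {name: measure(text) for name, text in samples.items()}


# Hypothetical snippets standing in for homepage copy.
pages = {
    "site_one": "See the feed",
    "site_two": "Log in",
}

letter_e_metric = track(count_letter, pages)
print(letter_e_metric)
```

Nothing in the code says whether the number is worth tracking. That judgment sits outside the software, which is the point of the example.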

Analytics are something that you’re probably familiar with. You may have seen charts like this that come from Google Analytics and other packages. This one is measuring time on page, which technically represents the amount of time the user spends on a page. And I point out that this particular chart had this anomaly on December 17th. Now what would cause this article that had been out for more than a month to suddenly get this spike on December 17th? Holiday shopping? Holiday shopping makes people slower when they read articles. Is that, I mean, what, what is it? Right? Any other guesses? Publication date, just somehow different on the 17th than on any other day? No, it was published on November 10th, right? This is the same page; it hasn’t changed. External references? So suddenly it’s linked and more slow people come to this site? Right. See, the problem with time on page is it actually doesn’t mean anything. The variation from one day to the next is completely random, it’s noise. There’s no meaning you can associate with time on page on a site, and it’s not specific to time on page. Turns out that many of the things that the analytics software tracks right out of the box fall into this category of things where you actually can’t tell what inference you’re supposed to draw from any change in the data. Let’s take bounce rate. Bounce rate is the poster child of anybody who does search engine optimization, they love bounce rate, and this is the same article. Notice December 17th – no blip. Whatever it was, if it was making people read slower, it didn’t make them leave faster. And what’s fascinating about this is that no one can even tell you what bounce rate measures. Because the belief is it measures the number of people who come to this site and then, like, leave the site entirely right after the page. But that’s actually just a theory that people have. The software doesn’t really know if you’ve left the page or not.
So for example, if you open a page up in a browser tab and then you go to another tab, is that considered leaving the page? If you come back to it two hours later, is that considered a continuation or is that a different session? Depending on the implementation of the analytics tool, it may have timed out at 30 minutes or 20 minutes or two hours and therefore if the person comes back to that tab and starts playing with the page again, it’s considered a different session and therefore you’ve got a bounce from that user before. But no bounce now. Nobody actually knows what the real implementation of the analytics package they’re using is for any of these metrics. So the meaning of these things is not clear.
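
To make the session-timeout point concrete, here is a minimal sketch in Python. It assumes one simple definition of a bounce (a session containing exactly one page view) and a hypothetical two-visitor log with times in minutes; real analytics packages may define both differently, which is the speaker’s point.

```python
def split_sessions(timestamps, timeout_minutes):
    """Split one visitor's page-view times (in minutes) into sessions.

    A new session starts whenever the gap since the previous page view
    exceeds the timeout.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout_minutes:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def bounce_rate(visitors, timeout_minutes):
    """Share of sessions with exactly one page view (assumed definition)."""
    all_sessions = [
        session
        for times in visitors.values()
        for session in split_sessions(times, timeout_minutes)
    ]
    bounces = sum(1 for session in all_sessions if len(session) == 1)
    return bounces / len(all_sessions)


# Hypothetical log: one visitor leaves a tab open and returns two hours later.
visitors = {
    "tab_switcher": [0, 120, 125],
    "quick_reader": [10],
}

for timeout in (30, 180):
    print(f"timeout {timeout} min: bounce rate {bounce_rate(visitors, timeout):.0%}")
```

The same behavior yields a bounce rate of two in three sessions with a 30-minute timeout and one in two with a three-hour timeout. Nothing about the visitors changed, only the tool’s configuration.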

And the big problem with these types of metrics is that we don’t know what we’re supposed to do differently. I have sat in meetings where people are presenting their project plans to their stakeholders and I’ve seen the same charts appear in different presentations. In one presentation, this chart will appear and someone will say, look at how high our bounce rate is. People are leaving right afterwards. We need to fund that redesign. And then the next presentation will come in and the person will say, look at how high the bounce rate is. People are coming in, they’re finding exactly what they need and they get out quick. We don’t need to change a thing. We call this an agenda amplifier – whatever agenda you want to have amplified, this chart will help. Or as the famous economist R. H. Coase once said, uh, if you torture data long enough, it will confess to anything you’d like.

Here’s what tools like Google Analytics can’t tell you. They can’t tell you what on your site is useful. They can’t tell you what people find confusing. They can’t tell you what the difference is between a big spender and a small spender and what those people do differently. They can’t tell you how to improve the content on your site or in your product. And they can’t tell you why someone clicked on something. They can’t tell you why. This is a problem if we are using data to make decisions because the ‘why’ is critically important.

Let’s take another metric that people like – conversion rate. And I should have put this up at the beginning. There’s a warning that goes with this. Conversion rate, if you are one of the people who don’t know this, is the number of people who purchased divided by the number of people who visit. So if, for example, I have a million people visit my site, and 10,000 people actually purchase from that site, we would call that a 1% conversion rate. Now the thing about conversion rates is that they’re ratios. So for example, if I want to improve my conversion rate to 2%, I can do that by working really hard to have 20,000 people purchase out of the million visitors. But I can also get a 2% conversion rate by having only 10,000 people purchase and reducing the number of visitors to 500,000. So one of the most effective ways to increase conversion rate is to stop marketing. By no longer marketing your product, only your most loyal people will show up. They will have a very high purchase rate. Your conversion rate will likely go up, your revenues will tank, and I often have to draw this out. I was talking to a group of retail executives about this and they were so fixated on conversion rate that I had to write out the math that actually said, no, no, your money goes down at a 2% conversion rate. And they’re like, no, the conversion rate went up. It has to go up. No it doesn’t. It’s a ratio. We have denominators in our world. And you know, finally I had to just sort of draw it all out and say, look, I’ll give you a choice. Do you want the conversion rate or do you want the money? And they had to think about it. They had so gotten into their head that conversion rate was the only thing that mattered, that it was really hard for them to understand that you could actually get less money with a higher conversion rate. And that’s not the only problem with conversion rate. Again, nobody knows how it’s calculated.
So it’s technically the number of people who purchase over the number of people who visit. But how do you count a visit? If I sell insurance and it takes four visits before my customer finally purchases, does that mean that my conversion rate is 25% and that’s the best I can do? Or does it mean that my conversion rate’s actually 100% because that customer finally purchased? Is it a customer-driven conversion rate or a visit-driven conversion rate? Which one does your software count? I’m probably going to guess it’s visits more than customers, because customers are really hard. What happens if each of those visits was on a different device, in a different browser that may not have the cookie from the last visit? So it’s really hard to connect up visits to a customer. If we get them to log in it’s much easier. But that’s a really hard thing to ensure. So chances are your software is actually only giving you a small percentage of the conversion rate you actually are getting. But of course the most important thing about conversion rate is that when it changes, we don’t know why. We have no clue. We can draw the inference that whatever we changed on the system last week did it, but that’s just an inference. We don’t have any evidence to suggest that.
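
The conversion-rate arithmetic above can be checked with a short Python sketch. The visitor and purchase counts are the ones quoted in the talk; the function names and the single-customer dictionary for the insurance example are illustrative assumptions.

```python
def conversion_rate(purchases, visits):
    """Purchases divided by visits: a ratio, so either number can move it."""
    return purchases / visits


# Figures quoted in the talk.
baseline = conversion_rate(10_000, 1_000_000)        # 1%
more_buyers = conversion_rate(20_000, 1_000_000)     # 2%, by doubling purchases
fewer_visitors = conversion_rate(10_000, 500_000)    # 2%, by halving visitors

# Insurance example: one customer needs four visits to make one purchase.
visits_by_customer = {"customer_1": 4}     # hypothetical identifier
purchases_by_customer = {"customer_1": 1}

total_visits = sum(visits_by_customer.values())
total_purchases = sum(purchases_by_customer.values())
buying_customers = sum(1 for n in purchases_by_customer.values() if n > 0)

visit_based = conversion_rate(total_purchases, total_visits)
customer_based = conversion_rate(buying_customers, len(visits_by_customer))

print(f"baseline {baseline:.0%}, more buyers {more_buyers:.0%}, "
      f"fewer visitors {fewer_visitors:.0%}")
print(f"visit-based {visit_based:.0%}, customer-based {customer_based:.0%}")
```

Both routes to 2% are indistinguishable in the ratio, and the same insurance customer reads as 25% or 100% depending on the denominator the software happens to use.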

Let’s try another experiment. I’m going to put up seven words. Once I get them all up, I want you to guess which one is different from the others. So I’ve got delightful, amazing, awesome, excellent, remarkable, incredible, satisfactory. Any guesses? Incredible. Yeah. Satisfactory was the one I was going for. So satisfactory. That’s a word that we use. We have surveys that measure satisfaction, but next to all those other words, it seems sorta ‘eh’, right. Satisfactory in my head is the food equivalent of edible. Right. Gotta have the food at the reception tonight, it’s going to be extremely edible. That’s not an endorsement. Right? Extremely awesome. Extremely delicious. That’s an endorsement. Right? Yet we have satisfaction surveys. All we care is were you satisfied, not were you delighted, not were you happy. Right. We have set an extremely low bar. And I believe we can do much better. And of course we do these crazy things like we ask people to measure satisfaction for things that they probably have no clue what we’re asking them. Like, this beauty here of ease of connecting to the Gogo inflight signal, SSID? Well, I was going to put down neither satisfied nor dissatisfied, but I actually think I’m more somewhat satisfied at connecting to the Gogo inflight SSID. I mean, I’ve been pondering this the entire flight. What does this mean? If that suddenly shifts amongst your customer base, call out the troops. We got to do something. Let’s get them working on getting more satisfaction out of our SSID selection.

Now, part of this is the way social psychologists like to measure things. Social psychologists like to start with a neutral, and so we just start with neither satisfied nor dissatisfied. And then we build one side to be satisfied, the other side to be dissatisfied. And because social psychologists hate the idea that people might only have three things to choose, they give you five and say, okay, we’re going to be somewhat satisfied or extremely satisfied, as if anyone knows the exact moment you switched from somewhat to extremely. That doesn’t happen. So part of the problem is that we’re using the word satisfied. What if we made that the neutral point and then said, okay, we’re going to measure things on a scale of delight to frustration? Well, this is a little better. We could probably just get rid of the extremely and somewhat and just say, were you delighted or were you frustrated? What were you frustrated about? What delighted you? That would be a really useful survey if we’re going to do a survey. But most of the time when we do surveys, we do these crazy surveys that involve 10 point scales because nothing is good unless we can bring it to 10 points. 10 point scales are the way that we basically say we don’t really care what answer you put in, because what is the true difference between a seven and a six on a 10 point scale, right? When the average of our survey drops from 7.2 to 6.8, do we actually know what to do differently? Basically, 10 point scales are a great way to make any sort of noise feel like science is happening. We can get precision – we’re not accurate but dammit we are precise.

And I’m not even going to let you get me started on net promoter score, which takes the scale to 11. The thing about all of these metrics is again, they don’t tell you ‘why’ and because you don’t know why, you don’t know what to do differently. So as a result we need better ways to measure than what people commonly use. We need to understand what metrics are going to actually help us improve the experience of the user.

Turns out in the UX toolkit there are tools to help us. One of the ones that I particularly like is what’s called a customer journey map. A customer journey map takes the milestones that a customer will go through in the process of doing the thing. In this case, booking a hotel reservation and mapping those milestones on a scale of frustration to delight, and this doesn’t have to be accurate, right? We can use the little trick with putting a piece of tape on the table, putting a coin in the middle and as the participant in our study goes through every step, we can ask them to move the coin from one side to the other. The problem with something like this is it requires observation to understand if the person is frustrated or delighted. Sure. We can probably put some tool on the screen that mimics that little piece of tape. But no one’s actually going to use it because it’s going to be distracting at that point. In an observed usability test type scenario you can get away with asking someone, Hey, is the coin in the right place? And they’ll move it and then you’re done. But when you are trying to get this at scale, it’s really difficult. But if we can collect this data, it tells us a lot. Even if all we did was focus on the things that cause frustration because if we understand what the frustration is being caused by, we know what to do differently, right? Do we clean up the content? Do we help them find hidden features? Do we make the navigation more prominent? The different design decisions that come from our observation of frustration is huge. It’s really important.

Let’s just take the last one here. Error messages. Common error messages that are delivered. Phone numbers can’t have dashes or spaces, right? This one has baffled me forever because in many systems it takes upwards of 10 lines of code to determine if there is a dash or a space and put up an error message that says you’re not allowed to have this and then start the loop over so they can enter again. Whereas it often takes as little as one line of code just to strip out the damn spaces and dashes. So why do we even tell the user? Why does the user care about this? That credit card code that you have that’s on the back of the card, except if it’s an AMEX card, it’s on the front of the card and there is one on the back of the card, please ignore that. Right? This is a popular error message because people have a stored credit card, but it’s actually against the rules to store that card security code. So they don’t store the security code, but you have to enter it again to prove that you are who you are. It’s a part of the two factor authentication process. And as a result, they, uh, they forget to put it in, they get the error message, or they put it in and then they get the error message about having to remove the dashes and spaces. And when the page refreshes, it’s against the rules to present that code that they entered previously again, because that implies you stored it somewhere, and as a result it’s been erased. But you didn’t tell them it was erased. They go to submit after fixing the error and they get another error telling them they didn’t enter something that they actually did enter. And they’re wondering why you’re so stupid. Or probably the biggest error message, if you were to count them, is username and password does not match. Right.
And of course we don’t tell them whether it’s the username that’s wrong or the password that’s wrong, because there are people who wear tinfoil hats who will tell you that that only lets the hackers win, so we can’t possibly give them any clues to help them know whether they typed the wrong email address or that they’re using a username versus an email address. Oh, we can’t give that information out. We have to let them guess because the hackers will figure out these usernames and passwords, even though there hasn’t been a breach in the last 10 years that was caused because someone figured out a username and password. All the breaches have happened because someone got phished or someone left a port open or someone did an injection. So, but the tin foil hat people.
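
The “one line of code” for phone numbers might look like the following in Python. This is a sketch, not any particular system’s implementation: the function name is hypothetical, and it only strips whitespace and dashes, as described in the talk, without validating what remains.

```python
import re


def normalize_phone(raw):
    """Strip spaces and dashes instead of rejecting the input with an error."""
    return re.sub(r"[\s-]", "", raw)


print(normalize_phone("555-123 4567"))
```

The user types the number however they like and never sees an error message for it.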

Here’s a pro tip for you. You want something to measure, just start counting your error messages. What is the most delivered error message in your product? Do you know? Has It changed recently? It is very rare that an error message means a user is more delighted. So error messages turned out to be an awfully good proxy for frustration. So go ahead. Use error messages to measure frustration. Okay. Here’s a case study of how this works. We were brought in to a very large ecommerce company to help them reduce what they were calling ‘Checkout Drift’. They had, uh, a checkout process and they were afraid that they were losing people through this process. So we asked them how they knew that they were losing people and they brought out their analytics data and it showed sure enough that over time they were losing customers as they went through the process and they had tried all sorts of things and nothing worked. So they finally brought us in. And as a result, uh, they, they were looking at, okay, what could be causing this issue? And uh, we said, okay, we can help you, but we want to actually observe people shop. And we, and we’ve learned over the years that for ecommerce systems, it’s very important that you observe actual shoppers. If you ask people to pretend to shop, they behave very differently than if you ask people to actually buy something they need. When people pretend to shop, they quickly look at the product. They don’t really care because they’re not, they’re not gonna actually going to get the thing. But if you say, okay, here’s some money, go buy this thing you told us you needed, they will actually spend time making sure it’s the right thing. So you learn quite a bit about the shopping process. It’s much better if you have real shoppers. They said, okay, we’ll do that. We’ll have you, we’ll let you do that. Bring in a bunch of folks, we’ll watch them. So okay, before we do that, we’d love to see some baseline measures. 
So can you tell us what the analytics look like before this shipping information stage? And they said, sure. And they deliver us to the analytics data and were immediately surprised that the shopping portion of the site had a very high standard and then suddenly it drops. And we said, so I know you’re worried about this… Might we want to explore this? And their response was, no, no, we know what causes that. That’s shopping cart attrition. Now, shopping cart attrition is something that you hear in ecommerce circles all along and I’ve never understood it. It’s taken as a, granted, everybody suffers from shopping card attrition. Every major retailer says they have it. They all treat it as if it’s just part of the process. People put things in their cart and then they walk away. And I’m sure that happens occasionally, but I am shocked that it happens as much as ecommerce retailers say it does because we don’t see this happen in the real world. Right? We don’t see shopping cart attrition because if we saw it in the real world when you get up towards the registers towards Costco, you’d expect to see all these abandoned shopping carts filled with stuff. You know, as if aliens are just come in and abducted some shoppers and suddenly, you know, the carts are abandoned. Uh, but they were like, no, no, we don’t want you to worry about this. We actually know how we’re going to fix this. I’m like, oh, really? Just out of curiosity, how are you going to fix this? Oh, our marketing people are going to just keep emailing those customers until they buy the stuff in the cart. Okay, that’s a good plan. Let’s go with that. So we bring our first participant into the lab, expecting them to walk through this process. And the first thing we notice is that that’s not the process. That between shopping, and reviewing the shopping cart, there’s actually another step which is you have to log into your account. 
And then there’s another step after that for some users, which is they can’t remember their account. They get that ‘username and password don’t match’ thing. They try that a couple of times and suddenly they decide to reset their password and then they have to go to email. Hopefully they remember which email address they used for their account. They go to email and reset it and then they are asked to put in a new password, because after all they’ve just reset their password and now they have to put in a new password, and for some reason they can’t use the old password, but they’re not allowed to know what the old password was. And then they can put in their shipping information and their billing information, their payment information. And so we say, okay, any chance we could get the analytics data for those things? And they’re like, absolutely. And they leave the room and about 30 minutes later they come back in the room and they say, absolutely not. Why not? Well, the people in trust and fraud, you can pick them out in the hallway. They have tin foil hats. The people in trust and fraud do not trust third-party code. They will not put the analytics software into those pages. Right. By the way, that’s an inference, that every page has been instrumented with the analytics software. Not true. So they don’t have that. We push some buttons, get some executive air cover. Next thing you know, trust and fraud is instrumenting the pages, much to their disgruntlement, and then we have to wait to collect the data. We expect that the data is going to look like this because that maps to the world we have, but in fact the data actually ends up looking something like this. Somehow people logging into their account do it at a rate of three times that of people who review the shopping cart page, even though the only way you can log into your account is by clicking on the shopping cart page first. How do you get three times as many people doing that? Well, that was a mystery.
The next mystery was that most of those people end up dropping off before they reset their password. They just go nowhere. And, uh, a bunch of those people then end up going to email and clicking on the link in the email. A smaller number of those people drop off when correcting their password. And then we see the pattern we expected. Okay, we go back into the lab and we start watching again and we immediately understand why that number is three times as high. It’s measuring visits, it’s measuring every time the page is refreshed. It’s actually measuring page views and not unique page views, just page views. And as a result, what it’s measuring is every time they get the password error message. So basically the average person tries three things and then gives up and resets their password. That’s why we see the spike. Many of them don’t give up or don’t remember what the username and password is. That’s why the email drop happens and therefore they can’t do this. So then we went back and we asked for one more piece of data. We wanted to know how much money was lost that was in those shopping carts. People were putting stuff in their shopping carts. How much money is lost by people who ask for a password reset and never complete the process? Turns out it was a small number on an annual basis. It came to about $300 million a year. Now this was a $1.2 billion revenue ecommerce site. So $300 million turns out to be a noticeable amount. The team ended up taking this data and running with it and implementing something that they’d had in their back pocket called guest checkout. Guest checkout would allow you to not log in at all to complete your purchase. It turns out guest checkout is, uh, a fine, fine thing to have in that they can match you up based on your address and your credit card numbers and all the things. So they actually know who you are no matter what, right?
So they don’t lose any information really if you use guest checkout, unless you always order to a different address with a different credit card. So it turns out that at the end of the first month of guest checkout being up, I get a phone call from the president of the ecommerce company, who leaves this buoyant message on a Saturday morning on my answering machine and it says “Spool man”, which is what he always called me. “Spool man, you are the man”. Turns out they were on target to get all 300 million back within a year and that’s what they did by just implementing guest checkout.
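
The page-view mystery in this case study can be illustrated with a small sketch. The log below is made-up data, and `page_views` and `count_views` are hypothetical names, not anything from the retailer's analytics tool. It assumes each failed password attempt reloads the login page, so the login page collects several page views per visitor while the unique visitor count stays flat.

```python
from collections import defaultdict

# Hypothetical page-view log: (visitor_id, page).
# Assumption: every failed password attempt reloads the login page,
# so it is recorded as one more page view for the same visitor.
page_views = [
    ("a", "cart"), ("a", "login"), ("a", "login"), ("a", "login"),
    ("b", "cart"), ("b", "login"), ("b", "login"), ("b", "login"),
    ("c", "cart"), ("c", "login"), ("c", "login"), ("c", "login"),
    ("c", "shipping"),
]


def count_views(log):
    """Return (total page views per page, unique visitors per page)."""
    totals = defaultdict(int)
    visitors = defaultdict(set)
    for visitor_id, page in log:
        totals[page] += 1
        visitors[page].add(visitor_id)
    uniques = {page: len(ids) for page, ids in visitors.items()}
    return dict(totals), uniques


totals, uniques = count_views(page_views)
for page in ("cart", "login", "shipping"):
    print(f"{page:9s} page views={totals[page]:2d} unique visitors={uniques[page]}")
```

In this toy data the login page shows three times as many page views as the cart page, even though exactly the same three visitors reached both. Counting unique visitors removes the apparent spike.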

So the thing about guest checkout, the thing about this measure, this metric, is what we ended up doing was we ended up combining qualitative research, in this case usability lab research, with quantitative custom metrics, not using the metrics out of the box. The metrics out of the box didn’t help us one little bit. They just created confusion. What we were really interested in was a metric we made up. The metric that we made up was unrealised shopping cart value from password issues. And this metric is only meaningful to this one site while dealing with this one problem. Once the problem’s been resolved, it’s actually not that interesting a number. The number has never risen since we got guest checkout in there. So tracking it over time, it’s not a key metric anymore. It’s not something that we’re looking for, but the lesson that we learned from this was that we get better results when we let qualitative findings drive the quantitative research. Instead of just bringing up the analytics software, staring at those charts and saying, what the hell is this thing trying to tell me, we go into the lab or into the field, we actually see what people are doing and then we bring that back. And the smart teams have learned that if they start with qualitative research and then ask the questions, how often does this happen in the real world, and what could it possibly be costing us, that turns out to be an essential number to help us understand where change has to happen. Now I want to point out the team had a bunch of initial inferences, all of which turned out to be wrong. They were under the impression that the checkout steps were where we would find the biggest improvement. They believed that it was normal to have shopping cart attrition. They thought that the cart review went straight to checkout because they didn’t understand how their metrics worked, and they thought that all their screens were instrumented.
Those were a lot of inferences to have that we proved false, but they had been basing everything on those facts up until that point.

Here’s the other interesting thing. We knew this immediately with the first participant. The first participant told us what was going on. When we map their journey against our eventual data, we see a direct correlation between frustration and where the problem points were. So just starting there gives us a huge advantage and we can focus our quantitative research on frustration to start with. If you haven’t done good user research, starting with frustration is the best place because you’ve got a lot of it built into your system. I almost guarantee you. If you somehow manage to accidentally ship something that isn’t frustrating, I want to meet you. I want to meet you too if you ship things that are frustrating. I can help you. What this comes down to is the fact that observations trump inferences. Ask the question, how do we know that’s true? Also ask the question, when do we get that verb back? That was the one Democrat in the room.
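
The earlier pro tip, counting error messages as a proxy for frustration, can be sketched in a few lines. The list of messages is invented for illustration, and `error_events` is a hypothetical name. In practice the messages would come from whatever logging the product already has.

```python
from collections import Counter

# Hypothetical sample of error messages shown to users, one entry per
# time a message was delivered. Real data would come from product logs.
error_events = [
    "Username and password do not match",
    "Username and password do not match",
    "Invalid postal code",
    "Username and password do not match",
    "Card declined",
    "Invalid postal code",
]

# Tally how often each message was delivered.
counts = Counter(error_events)

# List messages from most to least delivered.
for message, n in counts.most_common():
    print(f"{n:3d}  {message}")

# The single most delivered message is the first place to look.
most_delivered, occurrences = counts.most_common(1)[0]
```

Tracking how this ranking changes from release to release answers the question "Has it changed recently?" without any sophisticated statistics.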

Okay. Now I find that teams are very afraid of this stuff, right? They have a lot of reasons why we can’t do this. They say things like, well, we’ve got a team that does analytics and they’re a different group. Right? And it’s true that there’s, you know, a UX team that owns the qualitative data. There’s an analytics team that deals with the, uh, quantitative data. The recommendation I make to our clients is you start to merge them and put them under the UX team. This is DJ Patil. Under the Obama administration, he was the U.S. Chief Data Scientist. I do not believe the Trump administration has a chief data scientist. They don’t believe in data. The president of the United States had his own data scientist. He was able to express data in a way that the president and his staff could understand. You can have data science. Data science has to be an essential part of every UX team. It’s now an essential skill. So if the people who work on your user experience do not understand data, you need to skill them up on this, or you need to get them those skills by hiring someone who can help them. Another excuse I hear a lot is, I don’t understand what the metrics mean. Almost always, if you don’t understand what the metrics mean, it’s because they don’t mean anything. Metrics that are complicated are usually complicated so you don’t ask what they mean, right? So just keep asking.

And finally, people don’t go into design in particular because they’re good with numbers. I get that. They chose design so that they wouldn’t have to deal with numbers. And then we make them calculate pixel ratios. But here’s the thing, you don’t have to have sophisticated numbers. Right. The formula for counting the revenue loss due to password resets involved a single mathematical operator. It was a plus sign, right? We can handle this level of stuff. You don’t need sophisticated numbers or metrics. So the trick here is to keep things very targeted, to keep them very much based on the qualitative research. That is where the power is.
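
The custom metric from the case study, unrealised shopping cart value from password issues, really is just a sum. The sketch below is one possible reading of it, not the team's actual implementation. The `Session` fields are hypothetical, and it assumes "never complete the process" means the visitor asked for a password reset and never finished checkout.

```python
from dataclasses import dataclass


@dataclass
class Session:
    cart_value: float         # value of the items in the cart, in dollars
    requested_reset: bool     # visitor asked for a password reset
    completed_checkout: bool  # visitor finished the purchase


def unrealised_cart_value(sessions):
    """Sum the cart values of sessions lost after a password reset request."""
    total = 0.0
    for s in sessions:
        if s.requested_reset and not s.completed_checkout:
            total += s.cart_value  # the only operator needed: a plus sign
    return total


# Made-up sessions for illustration only.
sessions = [
    Session(120.0, requested_reset=True, completed_checkout=False),
    Session(45.5, requested_reset=True, completed_checkout=True),
    Session(80.0, requested_reset=False, completed_checkout=False),
    Session(30.0, requested_reset=True, completed_checkout=False),
]

print(unrealised_cart_value(sessions))  # 150.0
```

Only the first and last sessions count here. The second visitor recovered and bought, and the third abandoned the cart for reasons unrelated to passwords.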

Now I’ve mentioned the word design a lot and I haven’t really defined it, so I’m going to define it now. Design is the rendering of intent. A designer has an intention in the world and how they’re rendering it is what matters. And we need metrics to help us understand if the design is rendered the way we need it to be, right? If we implemented that guest checkout right, we should see an increase in revenue. Can we track that? We should see a drop in that shopping cart metric. Can we track that? Right. So we use the metrics, but there are other things where we get a sort of crazy intent. If you follow the Twitter account of any of the NASA astronauts, you’ll notice that they, typically the space station astronauts, tend to post lots of pictures from the space station, and they’re pretty cool pictures, and if you look in Twitter you will see their pictures listed. But if you follow the Twitter account of someone like Mike Monteiro, a designer on the West Coast, you’ll notice that his pictures never show up in his tweets. The only way to see his pictures is by clicking on the link to Instagram. And I recommend you always click on his links to Instagram. The reason that his pictures don’t show up is because the day Instagram was purchased by Facebook, the pictures were shut off. The capability of displaying the pictures in Twitter was shut off. The original theory was it was shut off by Twitter, who didn’t like the idea that Instagram was now their biggest competitor, but that wasn’t true. It was actually shut off by Facebook because what Facebook wanted to do was drive monthly active users. And if you’re just looking at the pictures in Twitter, you don’t help their monthly active user metric, and monthly active users are highly correlated on the West Coast to market valuation. Just like the letter e on the home page. And as a result, this is the major way that Silicon Valley measures companies that have actually forgotten to come up with a business model.
They just say, well, if you have more monthly active users, that’s good enough for us, we will give you more cash. So that becomes the reward system. And what you see are behaviors like taking emails and eliminating the messages such that the messages no longer appear. Instead you just get something that says, go see what it is. And suddenly we are without context in our emails and our Twitters because they want us to click through to the site, because that’s how they will measure whether you are an active user: if you actually bring up the thing. Robert Fabricant, a designer at Frog, said the medium of design is behavior. What designers do is they create different behaviors. If we want to in fact create different behaviors, we have to change the design. And that’s what we’re doing here. We’re changing the design to create different behaviors, but are they the best behaviors for the user or are they the best behaviors for the business?

Now there are some folks who are trying to do things a little differently. Medium, for example, has this notion of reads. Reads is their driving metric. Reads is not just views, we don’t really care about views. Let’s focus on, do people read the article, the post on Medium? But the problem is nobody actually knows how a read is calculated. It’s some algorithm of how slowly you go from the top to the bottom of the article as you scroll. So it could measure whether the cat is on the keyboard again. We have to understand how we drive metric collection from design, not try to drive design from metric collection. And that’s what I came to talk to you about. We need our metrics to help us improve our experience. We need to avoid jumping from observations to inferences too soon. We have to understand that, uh, in order to get the right understanding of our experience, we need custom metrics, not the things that come out of the box. And finally, we have to be very comfortable with data science, in particular the data science of behavioral usage in our systems. That turns out to be key.

Uh, real quick, this is part of a workshop that we teach on strategy. So if that’s something that’s interesting to you, uh, check out playbook.uie.com. It’s a two-day workshop where we start with 130 different strategies and you whittle them down and create your own playbook of strategies that are unique to your organization. Uh, we have one workshop coming up in Manchester, UK and our next workshop in Chattanooga, Tennessee. Uh, it’d be great to see you there and if you want, you can save a couple hundred bucks off of it with that promo code. Uh, also I have written about this and many other things on our website, uie.com. If you go to uie.com and you scroll to the bottom, you’ll find a place to put in an email address, uh, where you can, uh, sign up for a newsletter and those articles will appear in your inbox. And we will score that as a conversion. Uh, also, uh, you can contact me at jspool@uie.com or if you wish to contact me on LinkedIn, I find that’s actually a better place for me to have professional conversations. My email gets a little crazy sometimes, so I clear out my LinkedIn messages once a week or so and get back to everybody. So, uh, happy to talk to you there. And finally, uh, you can follow me on the Twitters where I tweet about design, design strategy, design measurement, design education, and the amazing customer service habits of the airline industry. Ladies and gentlemen, thank you very much for encouraging my behavior.
