You're buying a product online,
and you see 3 different sellers. They're all offering that same product,
at essentially the same price. One of them has a 100% positive rating,
but with only 10 reviews. Another has a 96 positive rating with 50 total reviews. And yet another has a 93% positive rating,
but with 200 total reviews. Which one should you buy from? [music] I think we all have this instinct that
the more data we see, it gives us more confidence in a given rating. We're a little suspicious of seeing 100% ratings since, more often than not,
they come from a tiny number of reviews. Which makes it feel more plausible that things could
have gone another way, and given a lower rating.
But how do you make that intuition quantitative? What's the rational way to reason
about the trade-off here, between the confidence gained from more data,
versus the lower absolute percentage? This particular example, is a slight modification from
one than John Cook gave in his excellent blog post "A Bayesian view of Amazon Resellers". For you and me, it's a wonderful excuse to dig into
a few core topics in probability and statistics. And to really cover these topics properly, it takes time, so what I'm gonna do is break this down into three videos. Where, in this first one,
we'll set up a model for the situation and start by talking about the binomial distribution. The second is gonna bring in ideas of
Bayesian updating and how to work with probabilities
over continuous values. And in the third, we'll look at something known as a
Beta distribution and pull up a little Python to analyze the data we have, and come to various different answers
depending on what it is you want to optimize.
Now, to throw you a bone
before we dive into all the math, let me just show you
what one of the answers turns out to be, since it's delightfully simple. When you see an online rating, something like this 10/10, you pretend that there were two more reviews. One that was positive and one that's negative. In this case, that means you pretend that it's 11/12, which would give 91.7%. This number is your probability of
having a good experience with that seller. So in the case of 50 reviews,
where you have 48 positive and 2 negative, you pretend that it's really 49 positive and 3 negative, which would give you 49/52, or 94.2%. That's your probability of having a good experience with the second seller.
Playing the same game with
our third seller who had 200 reviews, you get 187/202 or 92.6%. So, according this rule, it would mean your best bet is
to go with seller number two. This is something known as Laplace's rule of succession
– it dates back to the 18th century – and to understand what assumptions
are underlying this, and how changing either those assumptions or
your underlying goal can change the choice you make, we really do need to go through all the math. So without further ado, let's dive in! [Music] Step 1: How exactly are we modeling the situation
and what exactly is it that you want to optimize? One option is to think of each seller
as producing random experiences, that are either positive or negative, and that each seller has some kind of constant
underlying probability of giving a good experience, what we're gonna call the "success rate," or "s" for short. The whole challenge is that we don't know "s." When we see that first rating of 100% with 10 reviews, that doesn't mean the underlying success rate is 100% It could very well be something like 95%. And just to illustrate, let me run a little simulation where I choose a random number between 0 and 1, and if it's less than 0.95, I'll record it as a positive review otherwise, I'll record it as a negative review.
Now do this ten times in a row… And then make ten more… And ten more… And ten more, and so on to get a sense of, what sequences of 10 reviews, for a seller with this success rate, 0.95,
would tend to look like. Quite a few of those – around 60% actually – give 10/10. So the data we see seems perfectly plausible if the seller's true success rate was 95%. Or… maybe it's really 90%, or 99%. The whole challenge is that we just don't know. As to the goal, let's say you simply want to maximize your probability of having a positive experience, despite not being sure of the success rate.
So, think about this here. We've got many different possible success rates
for each seller, any number from zero up to one, and we need to say something about how likely
each one of these success rates is, a kind of probability of probabilities. Unlike the more gamified examples
like coin flips and die tosses and the sort of things you see in an intro probability class, where you go in assuming a long-run frequency,
like 1/2 or 1/6, what we have here is uncertainty about
the long-run frequency itself, which is a much more potent kind of doubt.
I should also emphasize that this kind of setup is relevant to many, many situations in the real world, where you need to make a judgement
about a random process from limited data. For example, let's say that you're
setting up a factory producing cars, and in an initial test of 100 cars,
2 of them come out with some kind of problem. If you plan to spin things up to produce a million cars, what are you willing to *confidently* say about how many total cars
will have problems that need addressing? It's not like the test *guarantees*
that the true error rate is 2%, but what exactly *does* it say? As your first challenge, let me ask you this: if you did magically know
the true success rate for a given seller, say it was 95%, how would you compute
the probability of seeing, say 10 positive reviews and 0 negative reviews? Or 48 and 2? Or 186 and 14? In other words, what's the probability of seeing the data given an assumed success rate? A moment ago, I showed you something like this
with a simulation, generating 10 random reviews.
And, with a little programming, you could just
do that many times, building up a histogram to get some sense of
what this distribution looks like. [music] Likewise, you could simulate sets of 50 reviews
and get some sense for how probable it would be to see 48 positive and 2 negative. You see, this is the nice thing about probability; a little programming can almost always let you "cheat"
a little and see what the answer is ahead of time by simulating it For example, after a few hundred thousand samples
of 50 reviews, assuming the success rate is 95%, it looks like about 26.1% of them
would give us this 48/50 review. Luckily, in this case, an exact formula is not bad at all. The probability of seeing exactly 48/50 looks like this. This first term is pronounced "50 choose 48", and it represents the total number of ways
that you could take 50 slots and fill out 48 of them.
For example, maybe you start with 48 good reviews
and end with 2 bad reviews. Or maybe you start with 47 good reviews,
and then it goes bad-good-bad and so on. In principle, if you were to enumerate every single
possible way of filling 48/50 slots like this, the total number of these patterns is "50 choose 48", which, in this case, works out to be 1225. What do we multiply by this count?
Well it's the probability of any one of these patterns. Which is the probability of a single positive review,
raised to the 48th [power], times the probability of a single negative review squared.
Crucial is that we assume each review
is independant of the last, so we can multiply all the probabilities together like this, and with the numbers we have, when you evaluate it,
it works out to be 0.261, which matches what we just saw empirically
with the simulation. You could also replace this 48
with some other value and compute the probability of seeing
any other number of positive reviews. Again, assuming a given success rate. What you're looking at right now,
by the way, is known in the business as a binomial distribution, one of THE most fundamental distributions
in probability. It comes up whenever you have something like
a coin flip, a random event that can go one of two ways,
and you repeat it some number of times.
And what you want to know is the probability
of getting various different totals. For our purposes, this formula gives us the probability of
seeing the data, given an assumed success rate, Which, ultimately, we want to somehow use
to make judgments about the opposite: the probability of a success rate
given the fixed data that we see. These are related, but definitely distinct. To get more in that direction,
let's play around with this value of "s" and see what happens as we change it
to different numbers between 0 and 1.
The binomial distribution that it produces kind of looks like this pile
that's centered around whatever "s" times 50 is. The value we care about,
the probability of seeing 48 out of 50 reviews, is represented by this highlighted 48th bar. So let's draw a second plot on the bottom
representing how that value depends on "s". When "s" is equal to 0.96,
that value is as high as it's ever gonna get. And this should kind of makes sense. Because when you look at that review of 96%, it should be most likely
if the true underlying success rate was 96%. As "s" increases, it kind of peters out;
going to 0 as "s" approaches 1. Since someone with a perfect success rate
would never have those 2 negative reviews. Also, as you move to the left,
it approaches 0 pretty quickly. By the time you get to "s" = 0.8, getting 48 out of 50 reviews by chance
is exceedingly rare. It would happen 1 in a 1000 times.
This plot we have on the bottom is a great start
to getting a more quantitative description, for which *values* of "s" feel more or less plausible. Written down as a formula, what I want you to remember
is that, as a function of the success rate s, the curve looks like some constant times "s" to the number of positive reviews times 1 minus "s" to the number of negative reviews. In principle, if we had more data,
like 480 positive reviews and 20 negative reviews, the resulting plot would still be centered around .96,
but it would be smaller and more concentrated. A good exercise right now would be to see
if you could explain *why* that's the case. There is a lingering question, though,
of what to actually *do* with these curves.
I mean, our goal is to compute the probability
that you have a good experience with this seller. So what do you do? Naively, you might think that probability is 96%
since that's where the peak of the graph is, which, in a sense, is the most likely success rate. But think of the example with 10 out of 10 positives. In that case, the whole binomial formula simplifies to be "s" to the power 10. The probability of seeing 10 consecutive good reviews is the probability of seeing *one* of them
raised to the 10th. The closer the true success rate is to 1,
the higher the probability of seeing a 10 out of 10. Our plot on the bottom only ever increases
as "s" approaches 1. But even if "s = 1" is the value that maximizes this probability, surely you wouldn't feel comfortable saying that, you personally, have a 100% probability
of a good experience with this seller. Maybe you think that, instead, we should look for
some kind of center of mass of this graph and that would absolutely be
on the right track.
First though, we need to explain
how to take *this* expression for the probability of the data we're seeing,
given a value of "s", and get the probability *for* a value of "s"
(the thing we actually don't know) given the data (the thing we actually know). And that requires us to talk about Bayes rule
and also probability density functions. For those, I'll see you in part two. [music].