Note: This is an excerpt from my new book-in-progress called “Uncertainty”.
The most popular definition of probability, and maybe the most intuitive, is the frequentist one. According to frequentists, an event's probability is defined as the limit of the event's relative frequency in a large number of trials.
What does this mean? Let's go back to the example of tossing a fair coin. We said that the probability of getting heads on a single toss is 50%. However, how do you know this to be true? What if you get tails 10 times in a row? Would this change the probability of getting heads on the next toss? Obviously not. Intuitively, this makes sense, but why?
I ran a coin-tossing experiment (simulated in the R programming language[1]) and you can see the results below. The proportion of heads very quickly converges to 50%.
Figure 1. Coin tossing experiment
This is the definition of frequentist probability in practice. If you repeat an experiment a large number of times, the observed frequencies converge to the true probabilities.
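The exact code behind Figure 1 is not shown here, but a minimal R sketch of such a simulation might look like this (the seed and the number of tosses are arbitrary choices):

```r
# Simulate repeated tosses of a fair coin and track the running
# proportion of heads, which should converge to 0.5.
set.seed(42)                      # arbitrary seed, for reproducibility
n_tosses <- 10000
tosses <- sample(c("H", "T"), n_tosses, replace = TRUE)
running_prop <- cumsum(tosses == "H") / seq_len(n_tosses)

plot(running_prop, type = "l", log = "x",
     xlab = "Number of tosses", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)          # the true probability
```

Early on, the running proportion swings wildly; only after many tosses does it settle near 50%, which is exactly what the frequentist definition relies on.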
Frequentist statistics has been the orthodox branch of statistics for most of the field's history. Statistics is based on the idea that you can draw a sample from a population and use it to infer properties of that population. If we treat each entity in the sample as a trial, then the more samples we collect, the closer we get to the truth.
However, there are certain events for which this definition makes no sense. Many of these events are unique, yet very important to our lives. Examples include elections and sports events: we can't rerun an election 1,000 times to see what happens, nor can we replay 1,000 Champions League finals to find the true probability of Real Madrid winning one.
In those cases, the frequentist definition of probability seems to get us into trouble. This is where the Bayesian definition of probability comes to our rescue.
The term Bayesian is due to the Reverend Thomas Bayes (1701?-1761), pictured below.
Figure 2. Portrait of Thomas Bayes. We are not 100% certain that this is actually him, but it is the only portrait we have. Notice the irony: we are uncertain about the face of the man responsible for one of our best tools against uncertainty.
Bayes used conditional probability (which we will explain shortly) in his "An Essay towards solving a Problem in the Doctrine of Chances", which was presented to the Royal Society in 1763. If you noticed the year of his death in the previous paragraph, you will realize that the essay was presented posthumously. It was Bayes' friend Richard Price (1723-1791) who discovered Bayes' notes and published his work.
Bayes’ work concerned the following problem: How can you know the probability of an event, based only on how many times it occurred or didn’t occur in the past? Bayes used a thought experiment to illustrate his argument.
Bayes has his back turned towards a table, and his assistant throws a ball onto it. The ball is equally likely to land anywhere on the table. Bayes has to guess where the ball is. On that first throw, Bayes experiences the maximum degree of uncertainty, as the ball could really be anywhere on the table.
In the next step, his assistant throws another ball and reports whether it fell on the left or the right side of the first one. Let's say that this time the ball lands to the right of the first. We can now assume that the first ball is more likely to be on the left side of the table: if the original ball sits towards the left, there is more space on its right for another ball to land.
Then the assistant throws another ball, and it again lands to the right. This makes it even more likely that the original ball lies on the left. Hence, with each throw, we narrow down the position of the original ball more and more.
Figure 3. Depiction of Bayes' argument. After the first throw (black ball), the second ball (orange) has more space to land on the right of the table than on the left. On the second row (left), you can see that there are only a few positions where the orange ball could end up to the left of the original ball. On the second row (right), you can see all the positions to the right of the original ball where the orange ball could lie. Clearly there is more space to the right of the original ball, and most of it lies on the right side of the table. However, the orange ball can still land to the right of the black ball but on the left side of the table, so two throws are not enough to completely pinpoint the location of the black ball.
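To make the thought experiment concrete, here is a small R sketch (my own illustration; treating the table as the interval [0, 1], measured from the left edge, is an assumption made for the example). It uses the standard result that, starting from a uniform prior, observing r "rights" out of n throws yields a Beta(n - r + 1, r + 1) posterior for the first ball's position:

```r
# The first ball lands uniformly at random on the table [0, 1],
# where 0 is the left edge. We only ever learn whether each new
# ball landed to its left or its right.
set.seed(7)
position <- runif(1)              # the hidden position of the first ball

n_throws <- 10
# A new ball lands to the RIGHT of the first one with probability
# 1 - position (the fraction of the table to its right).
lands_right <- runif(n_throws) > position

# With a uniform prior on the position, the posterior after r
# "rights" out of n throws is Beta(n - r + 1, r + 1).
r <- cumsum(lands_right)
n <- seq_len(n_throws)
posterior_mean <- (n - r + 1) / (n + 2)   # estimate after each throw
print(round(posterior_mean, 3))           # homes in on `position`
```

Each reported left/right is a new piece of information, and the estimate narrows in on the hidden ball, just as in Figure 3.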
In the modern world, this might remind us a bit of the Battleships board game. In Battleships, each of the two players places ships on a square board. On each round, a player calls out one point on the board to attack. If the shot hits part of a ship, the opposing player says so, but gives no other information. As you can see in the figure below, ships can be placed on the board either horizontally or vertically. So a player first has to figure out where a ship is by getting an initial hit, and then guess where the rest of the ship lies. By trying out more and more points on the board, the uncertainty around the location of the ships is reduced round by round.
Figure 4. A version of the Battleships board game.
So the main concept behind Bayes’ idea was the following:
Initial belief + new information = New belief
In the modern era, this has become:
Prior + Likelihood = Posterior
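Written as an actual equation (the "+" above is informal shorthand; the rule really multiplies and renormalizes), Bayes' theorem states:

P(H | D) = P(D | H) × P(H) / P(D)

where H is a hypothesis and D is the observed data. P(H) is the prior, P(D | H) is the likelihood, P(H | D) is the posterior, and P(D) is a normalizing constant that makes the probabilities sum to one.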
We will explain more about this below. Interestingly enough, Bayes never focused on the theorem that now bears his name. The mathematicians of his time were not very happy with his approach, for two reasons. First, guessing did not sound like a rigorous process. Second, in the absence of information, Bayes assumed that all outcomes are equally likely; in modern parlance, we would say that the prior is uniform. Having to assign a prior probability of belief seemed like an additional hurdle.
The theorem was independently discovered by one of the most prominent mathematicians of all time: Pierre-Simon de Laplace. Laplace rediscovered the principle and published it in 1774. He eventually learned of Bayes' discovery in 1781, when Price visited Paris. Laplace improved his formulation and decided to test it out.
What Laplace studied was whether the observation that more boys than girls were being born was a law of nature or just a statistical anomaly. He collected records from London, Paris, St. Petersburg, rural areas in France, Egypt and Central America. Using his theorem, he concluded that this does indeed seem to be a law of nature[2].
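With modern software, the flavor of Laplace's calculation takes only a few lines of R. The birth counts below are the figures commonly quoted for Laplace's Paris records; treat them as illustrative rather than as a citation:

```r
# Posterior probability that boys are NOT more likely than girls,
# given observed birth counts and a uniform prior on p = P(boy).
boys  <- 251527   # illustrative counts, as commonly quoted for Paris
girls <- 241945

# With a uniform prior, the posterior for p is Beta(boys + 1, girls + 1).
# P(p <= 0.5 | data): the chance the boy/girl imbalance is an accident.
pbeta(0.5, boys + 1, girls + 1)
```

The result is vanishingly small (on the order of 10^-42), which is why Laplace felt safe calling the excess of boys a law of nature rather than an anomaly.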
Laplace went on to make major contributions to other scientific fields, like astronomy. There is a famous expression attributed to him. The story goes that Napoleon asked him why he did not include God in his explanations of the movements of celestial objects, to which Laplace famously answered: "Je n'avais pas besoin de cette hypothèse-là." ("I had no need of that hypothesis.")
Laplace belonged to the era, alongside Newton, that instigated a scientific revolution, with mathematics and reason replacing religious and metaphysical explanations as the main tools for explaining away uncertainty.
While Laplace did most of the work on Bayes’ theorem, his name was never attached to it.
Laplace later discovered the central limit theorem, one of the most powerful and significant findings in modern mathematics (we explain more about this later). Upon discovering it, he realized that once we possess large amounts of data, the Bayesian approach converges to the traditional frequentist approach. And so Laplace converted to frequentism, which he abided by until the end of his life.
[1] The R language (https://www.r-project.org/) is the most popular language for statistics. Python, however, seems to have overtaken it for machine learning.
[2] We now know the ratio to be approximately 105 boys to 100 girls: http://www.searo.who.int/entity/health_situation_trends/data/chi/sex-ratio/en/