Uncertainty: Nothing is More Certain
Join Sally Cripps as she sheds light on uncertainty in decision making
In this talk, Professor Sally Cripps explores the difficulties of making decisions with limited information and how AI systems aid in decision making.
Uncertainty lies at the heart of AI. Why do I say that? AI systems are designed to aid, or autonomously make, decisions.
Professor Sally Cripps
Hello everybody. My name is Sally Cripps, and it's my very great pleasure to be speaking to you today. But before we begin, I'd like to acknowledge the Gandangara people, the traditional custodians of the land on which I'm recording this lecture, and pay my respects to their elders past and present.
The title of today's talk is "Uncertainty: Nothing is More Certain". I'm going to talk to you about uncertainty in decision making, and that seemed a very apt title. It's a phrase used by a Roman statesman, Pliny the Elder, who was born 2,000 years ago this year, in 23 AD.
A slightly more recent use of it was in a paper I wrote that appeared last year, also titled "Uncertainty: Nothing is More Certain". In it we look at the difficulty of making decisions under huge amounts of ambiguity, particularly in environmental contexts. And when I think about it, I ask myself: well, what is certain?
Really, the only thing I could come up with was the sun rising and setting, hence this very lovely picture in front of you of the sun setting. That's about the only thing I can think of that's going to happen reliably over the next few billion years at least.
Anyway, uncertainty is at the heart of AI, which is why that question mark disappeared. Why do I say that? Because AI is essentially a system, or a series of systems, that either aids us in decision making or makes the decision autonomously. Examples of autonomous decisions are things like robots moving around, or an automated mine, or even, as we'll see later on,
an algorithm which automatically determines whether somebody gets a visa for entry to the UK. Those are examples of automated decisions, but of course a lot of AI is used to aid humans in making decisions. There's a human in the loop who looks at the output of an AI system and decides whether or not to make a particular decision.
I like to think of those AI systems as part of a circle, or a cycle, because they're not linear. The part of the cycle where you begin when you build an AI system is what you already know: your knowledge and beliefs. Then you may decide to collect some data. That's usually the way with AI systems: you get some data, and that data is trying to understand the question mark in the middle, the thing you don't know. Then you have a model. The purpose of a model is to connect the data to that question mark, to the issue at hand. Finally, using the data and the prior knowledge and beliefs, you get updated knowledge in order to make a decision.
And so the cycle continues, because having made that decision you've then got new knowledge and beliefs, and so on. But at the center of it is always something I don't know; it's very rare to think of any decision that you make with a hundred percent certainty. So let's talk about those unknowns and their relationship to decisions.
For some reason lost in the mists of time, mathematicians and statisticians tend to denote the things they don't know by a Greek letter, and I've chosen the Greek letter theta here to represent what it is we don't know. Now I'm going to give you some examples of a decision that we need to make and an example of theta, the thing that we don't know.
For example, the decision could be to set the price of a product at a supermarket, and what you don't know is the consumer demand for that product at that price. The decision could be to approve a visa, and what you don't know is the riskiness of the applicant. You have a system in place, but you don't actually know for certain what the riskiness of the applicant is.
Another example that we're going to look at is the direction a drone should go to locate pollution hotspots. In these situations the decision is typically automated, that is, the drone autonomously guides itself, but it still has to make a decision as to which way it goes.
What we don't know is the pollution levels in the city; that's actually what we're trying to uncover. The decision could be how to manage the horses in a national park, and indeed that was one of the examples in the paper "Uncertainty: Nothing is More Certain". By no means the only unknown, but one of the unknowns, is the impact of horses on natural habitat.
Another example could be where to drill for geothermal energy. In this day and age, as we try to cut back on fossil fuels, there's a big push into alternative forms of energy. One of them is geothermal energy, which we can extract where we find hot granite; the temperature gradient provides us with energy, and it's a very clean form of energy.
The decision facing anybody who wants to produce this type of energy is where to drill, and what we don't know is the subsurface geology, in particular where the hot granite is. In this slide I'm going to talk about that cycle in terms of the elements of a decision problem, because for me the cycle really is the elements of the decision problem.
The first element is an unknown quantity, something we are uncertain about but which will affect our decision. Then there are data, or other sources of information. For some reason we give data, the things we observe, Roman letters, so in this case we've got little n observations,
y_1 up to y_n; a bold-font y is just my notation for a vector. Then we've got a probability model that's going to connect theta, the thing I don't know, to the data. That's done in two ways. Firstly, via the model: the model says, if I actually knew the value of theta, the thing I don't know, how likely am I to observe the actual data?
Hence it's called a likelihood function. Then, to update our prior knowledge with the knowledge in the data, we use Bayes' theorem. Bayes' theorem says that the posterior distribution, the updated distribution p(theta | y), is equal to the likelihood times the prior, divided by a normalizing constant which makes sure things integrate to one.
The last two elements of a decision problem are the set of decisions you could take, which I've called an action space and given the letter capital A, so each decision d is in A; and a utility. We're going to choose the decision, which I've labelled d*, that maximizes my utility.
That's what that equation says. Now, a utility, for those who don't know, could be anything; it's very subjective, it's about what I want to get out of that decision. The important thing, in relation to uncertainty, is that I must integrate over
the unknown values of theta to get my overall utility, because I don't know what theta is, and the integration is with respect to the posterior distribution of theta. So theta doesn't appear in this equation, because I'm integrating over all possible values, where those values come from the posterior.
Hence this uncertainty around theta is critical to my making a decision, because it affects my utility. So now I'm going to go through two really simple toy examples to show you those elements of a decision problem, and then go through a real example where we use this type of thinking to monitor pollution levels across a city.
The example I'm going to start with is geothermal energy, which is clean energy. Hot granite is known to exist in the Cooper Basin in South Australia, but it lies a long way below the surface, four to five kilometers, so we really don't have very much data, and yet we need to make a decision.
The decision, of course, is to drill or not to drill; that is the question. The unknown is whether the site has hot granite. You're told from prior geological data that the probability that any particular site will have hot granite is about one in 50. So that's our prior belief and knowledge.
We're going to call that unknown theta, and we're going to give it a Bernoulli distribution. So we've moved away from talking about events; we're now talking about random variables. A Bernoulli random variable is a random variable that takes the value one if the event occurs and zero otherwise.
So theta here indicates whether the site has hot granite: it equals one with probability one in 50, which is 0.02, and it equals zero with probability 0.98. That's what we don't know. Then we're told that we have a core sample, it is analyzed, and it is known from previous data that the core sample analysis is very accurate.
It will correctly identify a site that has hot granite with probability 0.95, and correctly identify a site without hot granite with probability 0.94. I haven't explicitly said it there, but the core sample comes back positive. So we've got an unknown, and now we have a piece of data, and we're going to use the information in that dot point to construct a likelihood function, a very simple one, but a likelihood function nonetheless.
So y is also a Bernoulli random variable: it equals one if the sample is positive and zero if it's negative. We're told that, conditional on there being hot granite, it will come back positive with probability 0.95. If there's no hot granite, there's still a chance it will come back positive, and that chance is one minus the probability that, were there no hot granite, it would come back negative.
We're told it correctly identifies a site without hot granite; in other words, when theta equals zero it comes back negative, y equal to zero, with probability 0.94. So the probability that we'll still get a positive reading when the site has no granite is 0.06. Now we're in a position to calculate the posterior.
What we actually want to know is the probability the site has granite, conditional on the fact that I've got a positive core sample, and that's equal to my likelihood times my prior, divided by the probability that y is equal to one. That denominator is a normalizing constant, making sure that the probabilities on the left-hand side add up to one.
That is, the probability that theta is equal to zero given y equal to one, plus the probability that theta is equal to one given y equal to one, must sum to one. Another way to think about the denominator is as the probability of the evidence: the probability that, if I went around taking core samples, I would get a one.
That's how we're going to calculate it. The probability that I get a positive result is the probability that I get a positive result when the site actually has granite, times the probability that the site has granite (we're told the first is 0.95, and we know the probability that the site has granite is 0.02), plus the probability that when the site doesn't have granite I still get a positive result, which is 0.06, times the probability that the site doesn't have granite, which is 0.98.
This gives me a total probability of 0.078, approximately 8%. Looking at this, we should already be getting an idea that perhaps the attributes of the core sample are not quite as good as the 0.95 and 0.94 would seem to imply: the probability that a site has granite is 2%, whereas the probability that an arbitrary site returns a positive result is about 8%.
That's four times as often as the site actually has granite. Therefore, not surprisingly, the probability that we're actually going to find hot granite conditional on our positive sample is about one in four, roughly 24%. Now we've got almost all the way around that circle, and we have to figure out what sort of decision we want to make based on that.
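To make those numbers concrete, here is a minimal sketch in Python of the calculation just described. The prior, sensitivity, and specificity are the figures quoted above; the function and variable names are mine, purely for illustration.

```python
# Posterior probability that a site has hot granite, given a positive core sample.
# Prior and test accuracies are the figures quoted in the lecture.

def posterior_granite(prior=0.02, p_pos_given_granite=0.95, p_neg_given_no_granite=0.94):
    """Bayes' theorem for a Bernoulli unknown and a binary test result."""
    p_pos_given_no_granite = 1.0 - p_neg_given_no_granite          # 0.06
    # Probability of the evidence: P(y = 1)
    p_pos = p_pos_given_granite * prior + p_pos_given_no_granite * (1.0 - prior)
    # Posterior: P(theta = 1 | y = 1)
    return p_pos_given_granite * prior / p_pos

print(posterior_granite())   # ~0.244, i.e. about a one-in-four chance of hot granite
```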
Here the decision, again, is to drill or not to drill. Our action space is A, with d1 being drill and d0 being don't drill, and we want to choose whichever of d1 or d0 maximizes my utility. Now, the formula for utility that I gave you on the previous slide had an integration sign instead of a summation sign.
That's just because the previous slide covered the more general case, where theta is usually continuous rather than discrete; in this particular case theta takes only the values zero or one. So the statement in front of you says: my overall utility is the utility I get if I take a particular decision and theta comes out as zero, times the probability that theta is equal to zero, plus the utility I get for that decision
if theta is equal to one, times the probability that theta is equal to one. So it's basically a weighted average of the utilities, where the weights are the probabilities that the site has granite or not. That's essentially all it says. If we decide not to drill, that's d0, and we're going to define utility here as profit.
Obviously we could have chosen a whole range of things, and indeed when we drill for anything we should consider more than just profit in figuring out our utility, but for this example it is just profit. That will be zero if I don't drill. Now, if I do drill, the cost is going to be a hundred million.
That wasn't on the slide, but I'm telling you now: the cost is a hundred million dollars. Then the million-dollar question is whether or not there is granite, conditional on the fact that I've already got a positive sample. The probability that there's no granite is 0.76, and I'll get zero revenue in that case,
so I'll end up losing the hundred million I spent; or else, if there is granite, I'll get a revenue of 700 million, meaning a total profit of 600 million, and that happens with probability 0.24. I just take the weighted average of those profits to get an overall expected utility of 68 million.
And so, if my utility is defined as profit, then obviously I would decide to drill rather than not to drill, because I'm getting an expected 68 million from drilling and zero from not drilling. That's just a very simple problem about using uncertainty to make a decision.
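Here is the same expected-utility calculation as a short sketch, using the revenue and cost figures stated in the lecture; the variable names are mine.

```python
# Expected utility (here: expected profit, in $ millions) of drilling vs not drilling,
# using the posterior probability of hot granite given a positive core sample.

p_granite = 0.24          # posterior P(theta = 1 | positive sample), from above
cost = 100.0              # drilling cost, $m
revenue_if_granite = 700.0

profit_if_granite = revenue_if_granite - cost      # 600
profit_if_no_granite = -cost                       # -100

# Weighted average of profits over the posterior distribution of theta
eu_drill = p_granite * profit_if_granite + (1 - p_granite) * profit_if_no_granite
eu_dont_drill = 0.0

print(eu_drill)                                    # 68.0
print("drill" if eu_drill > eu_dont_drill else "don't drill")
```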
In this slide I'd like to point out a very important concept, which is the value of information. The core sample, I'm saying, cost 10 million, and it wasn't worth it. The way to think about the value of information is that, in a rational world, a piece of information has no value if I'm not going to do something different contingent upon that information in at least some circumstances.
For example, if I had decided not to drill at all even if the core sample came back positive, and I could work that out in advance, then I should not have gone to the hassle or the expense of getting the core sample. So let's work through this example.
If we didn't take the core sample, then we would clearly decide not to drill, just based on prior probabilities: we would only get that revenue of 700 million with probability 0.02, and we would lose a hundred million with probability 0.98, giving an expected profit of minus 86 million. So we would definitely decide not to drill.
Now, the core sample is going to cost $10 million. If the sample comes back positive, we've already shown that we would drill, and that will happen with probability 0.078. If the core sample comes back negative, then we won't drill.
Why won't we drill? Well, we could go through the maths again: conditional on y equal to zero, the probability that we're actually going to find any granite is very tiny, about 0.0011, so the expected profit from drilling would be close to minus 100 million. So let's just recap and think about how much that core sample was worth.
We know it is worth something. Why? Because without the core sample, we would've decided not to drill. With the core sample, we would have decided to drill under some circumstances. And those circumstances is when the core sample comes back positive. How often does the core sample come back? Positive 7.8% of the time.
So the expected value of getting that sample would have been 7.8% times 68 million, which is about 5.3 million. It's not worth the 10 million that we paid for it, but it is worth something.
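Putting those pieces together, here is a sketch of the value-of-information calculation just described, using the lecture's rounded figures (so the result may differ slightly in the last decimal place).

```python
# Expected value of the core-sample information ($ millions), using the lecture's
# rounded figures. Without the sample we would not drill (expected profit -86 < 0),
# so the baseline utility is 0. With the sample we drill only on a positive result.

p_positive = 0.078               # P(y = 1): probability the core sample comes back positive
eu_drill_given_positive = 68.0   # expected profit if we drill after a positive sample
eu_no_sample = 0.0               # best we can do without the sample is not to drill

# If the sample is negative we don't drill (utility 0), so only the positive branch matters.
expected_value_of_information = p_positive * eu_drill_given_positive - eu_no_sample
print(expected_value_of_information)   # ~5.3, less than the $10m cost of the sample
```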
This next slide is just to take a bit of a breather after working through all those calculations. I want to make the point here that making decisions under uncertainty is very difficult, and it is very hard for human beings to hold in their minds the myriad of options in front of them, thinking about the future and what the consequences might be. I'm going to put up two views of human
capabilities. One view was this lovely piece by Shakespeare: "What a piece of work is a man, how noble in reason, how infinite in faculties." This is from Hamlet, and Hamlet is feeling a bit miserable and dwelling on the reasons why, because he's saying how wonderful a human being is and yet he still feels miserable.
As a matter of fact, Hamlet has quite good reason to feel miserable: he thinks his uncle has murdered his dad and married his mom, which is enough to make anyone feel miserable. But anyway, it's a lovely piece of prose about how infinite in faculties, how wonderful, our capabilities are.
Now, here is a slightly different view of our capabilities of reasoning, from Herbert Simon, Nobel laureate. I'll just read it for you: "The capacity of the human mind for formulating and solving complex problems is very small compared to the size of the problems whose solution is required for objectively rational behavior in the real world,
or even for a reasonable approximation to such objective rationality." And I've got to say, I tend to side with Simon on this: I think we have a very limited capability for making rational decisions in such a complex world, which is the point of this slide, and why we need a framework in which to study it.
What I did back there may have seemed a complicated way of attacking a very simple problem, but the point is that it generalizes to a much larger class of problems and enables us to make rational decisions. You may argue rationality is not all it's cracked up to be, and I'd probably agree with you a lot, but there is a place for rationality.
This is another toy example, where I'm going to try and show you how we think about sequentially acquiring information, and I've chosen a legal example. In my new role I find myself, surprisingly, surrounded by lawyers, and it occurs to me that one of the things that happens in law is that decisions are made under uncertainty, or with lots of ambiguity, an awful lot of the time.
This is an example of a court case, and we're going to see the value of various pieces of information. It's a real and famous legal case in which the perpetrator attempted the assassination of a world leader and was found not guilty on the basis of insanity, specifically that the person had schizophrenia.
There is no doubt that the person did in fact attempt the assassination. The defense was that the person was insane, and the defense based their case on the results of a CAT scan which showed brain atrophy. What was put to the jury was that this person therefore had schizophrenia and was not legally responsible.
And in fact the jury agreed. There was a big outcry about this particular decision, but let's go through some of the data to see whether or not we would agree with it. The following are relevant pieces of information. Schizophrenia is prevalent in 1.5% of the population.
You're hopefully thinking to yourself now, aha, that's a prior. And you're right, it's a prior. Again, our unknown here is whether or not the person has schizophrenia, and again it's a Bernoulli random variable; I'm not going to keep writing Bernoulli, I'll just write "Ber" from now on.
So theta is equal to one with probability 1.5 percent, or equal to zero, in other words the person doesn't have schizophrenia, with 98.5% probability. That's our prior belief. Then we're told that the person has had a CAT scan and it showed brain atrophy. That again is our data, which we denote by y, equal to one or zero.
We're told that 30% of people with schizophrenia show brain atrophy, compared with 2% of the non-schizophrenic population. This is telling us about our likelihood: if the person has schizophrenia, the probability that the scan comes back showing brain atrophy is 30%.
Again, y is a Bernoulli random variable equalling one if there's brain atrophy and zero otherwise. If the person doesn't have schizophrenia, we're told they will show brain atrophy 2% of the time; that is, if you sampled the non-schizophrenic population, 2% of people would show brain atrophy.
Now, of course, we need to turn the handle. We've got a prior, we've got a likelihood, and we can calculate the probability of the evidence, that is, the probability that somebody's scan shows brain atrophy, in exactly the same way as we calculated the probability that the site contained granite. We get a posterior probability that the person has indeed got schizophrenia of 18.6%.
I'm not here to say whether that is beyond reasonable doubt, but 18.6% is certainly not 1% or 2% or even 5%; it is very much non-zero. My own personal view is that if I thought there was an 18.6% chance, then that would probably constitute reasonable doubt. But there was more information, and we'll go over that on the next slide.
Individuals with a first-degree relative who has schizophrenia have a 10% chance of developing the disorder, as opposed to the usual 1.5% of the population. You're told that the perpetrator has a sibling with schizophrenia. How would you update your belief regarding the insanity plea?
So we've now got two pieces of information: the CAT scan and the fact that the person has a sibling with schizophrenia. I'm going to subscript the first piece of information, regarding the CAT scan, with a one, to indicate that it was the first piece of information and it relates to brain atrophy via a CAT scan.
Our new prior belief is in fact our old posterior. If you think about it, we're going around that cycle again: we found out that the person actually did have brain atrophy, and we went from a prior probability of 1.5% to 18.6%. That now becomes our new prior probability as we consider this new piece of data.
The perpetrator has a sibling with schizophrenia, and I'm going to subscript that piece of information with a two, just to indicate that this is a different piece of information: the fact that the person has a sibling with schizophrenia. You're told that 10% of people with schizophrenic siblings develop schizophrenia.
So we've got our model in our heads again, and this is a likelihood function: if our perpetrator had schizophrenia, then the probability that they'd have a sibling with schizophrenia is 10%; if our perpetrator didn't have it, then the probability that they'd have a sibling with schizophrenia would just be 1.5%.
Again, we turn the handle and update our prior with the likelihood, and we now get 60%. So on the balance of probabilities it is more likely than not that the person has schizophrenia. This would definitely, I would think, leave room for reasonable doubt.
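A minimal sketch of that sequential updating, assuming the likelihood values quoted in the lecture; the same one-step Bayes function is applied twice, with the first posterior becoming the second prior.

```python
# Sequential Bayesian updating for a Bernoulli unknown theta (has schizophrenia or not).

def update(prior, p_data_if_theta1, p_data_if_theta0):
    """One turn of the Bayes handle: returns P(theta = 1 | observed data)."""
    evidence = p_data_if_theta1 * prior + p_data_if_theta0 * (1 - prior)
    return p_data_if_theta1 * prior / evidence

prior = 0.015                       # population prevalence, 1.5%

# y1: CAT scan shows brain atrophy (30% of schizophrenics vs 2% of non-schizophrenics)
post1 = update(prior, 0.30, 0.02)
print(round(post1, 3))              # ~0.186

# y2: sibling with schizophrenia (10% vs the base rate of 1.5%);
# yesterday's posterior is today's prior
post2 = update(post1, 0.10, 0.015)
print(round(post2, 3))              # ~0.60
```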
This is an example of a Bayesian learning algorithm. It's how Google learns about you on the internet: you make purchases, it gets these sequential pieces of information, and it's updating all the time what it doesn't know about you, theta. That theta might be your propensity to purchase a certain product, or your propensity to take a trip overseas, or anything of that nature.
It is constantly updating and learning each time you do a search or make a purchase; all of these things feed into something very similar to the algorithm I've just described. You've also heard Navin talking about the DARPA challenge and the robots, Bluey and the others (I can't remember the other names, but I do remember Bluey because I've met Bluey).
That's how the robot decides which way to go as well; it's part of this Bayesian learning algorithm. So you may ask yourself: if pieces of information are costly, which piece should I collect first? That's why I now want to talk about the value of information as a form of utility.
Let's think about this particular case of our perpetrator. We started out with a prior belief that they had schizophrenia of 1.5%, and when we got piece of information one we updated that prior belief from 1.5% to 18.6%, so we can compute this ratio. What is this ratio?
This ratio is just my posterior divided by my prior, and that ratio for the first piece of information is 12.4. For the second piece of information we can do the same thing: if we had not observed y1 but had observed y2 first, we could calculate the posterior probability that the perpetrator in fact had schizophrenia, and that would have been 9.2%.
The ratio of 9.2 to 1.5 is 6.1, so you can see that the first piece of information shifted our belief by about twice as much as the second piece of information. I'm now going to introduce a concept called mutual information, but before I do, I want to point out that theta can take the value one or zero,
and so can y1 and y2. What we observed is just one particular realization of theta and y1; there are three other pairs: we could have had (0, 0), (1, 0) or (0, 1) as well as (1, 1). So this is just one particular realization of these random variables, which are schizophrenia and brain atrophy in the first case, and schizophrenia and a sibling with schizophrenia in the second.
So, mutual information as a utility function. All I've done here is take that ratio of the posterior to the prior and take the log. There are a lot of ways to explain taking the log, but for now probably the easiest is to say that taking the log of a ratio is analogous to a percentage change.
You can think of the log of the ratio of the posterior to the prior as the percentage change I get from moving from the prior to the posterior. Then what I'm doing is weighting that percentage change by the probability that those outcomes occur. As I said before, here we have theta and y1 both equal to one, but in fact each of them could have been zero or one.
So I'm basically taking a weighted average of those rates of change from the prior to the posterior, where the weights are just the probabilities that each particular pair of outcomes occurs. If I do this and calculate the mutual information between theta and y1, I get 0.0038; the mutual information between theta and y2 is 0.0016.
So y1 adds more information than y2 on average, not just when they both equal one. Now, unfortunately, in the real world things are not just zeros or ones, they're continuous, but that's okay; I'm not going to test you on this, but all we need do is replace those summation signs by integral signs. The last expression says that if, in general, we have a pair of continuous random variables with values over the space Theta x Y, then the mutual information is effectively just the expected value of that rate of change.
So that's that integral there.
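Here is a sketch of that weighted-average calculation for the two binary pairs in this example. The probabilities are the lecture's; the choice of log base is my assumption, so the exact values may differ from the slide figures, but the ordering (y1 more informative than y2) does not depend on it.

```python
import math

def mutual_information(prior, p_y1_if_theta1, p_y1_if_theta0, log=math.log):
    """I(theta; y) for two Bernoulli variables: the expected log ratio of
    posterior to prior, weighted by the joint probability of each (theta, y) pair."""
    p_y_marginal = p_y1_if_theta1 * prior + p_y1_if_theta0 * (1 - prior)  # P(y = 1)
    mi = 0.0
    for theta, p_theta in [(1, prior), (0, 1 - prior)]:
        p_y1 = p_y1_if_theta1 if theta == 1 else p_y1_if_theta0
        for y, p_y_given_theta in [(1, p_y1), (0, 1 - p_y1)]:
            p_joint = p_theta * p_y_given_theta
            p_y = p_y_marginal if y == 1 else 1 - p_y_marginal
            posterior = p_joint / p_y               # P(theta | y)
            mi += p_joint * log(posterior / p_theta)
    return mi

prior = 0.015
print(mutual_information(prior, 0.30, 0.02))    # CAT scan: the larger of the two
print(mutual_information(prior, 0.10, 0.015))   # sibling:  the smaller of the two
```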
Now I'd like to extend what we've done in those two toy examples to how we actually use this in a completely autonomous AI system. In this particular example we want to dynamically acquire data to locate the maximum of a function, in this case pollution hotspots, and we have a drone flying around.
This is taken from a piece of work that Roman did (you've met Roman by now several times, I know) with drones flying about the city. The drone has to decide where to go next. We want to discover the hotspots, but we want to do it in an optimal fashion: we want to take the shortest possible path to finding out where those hotspots are.
In the example I'm about to show you there are two forms of acquisition function. One is entropy, which is analogous to, not exactly the same as, mutual information, and the goal of the entropy acquisition function is really to reduce our uncertainty the most. Then there's another acquisition function, which unfortunately can't really be formulated formally as a utility function, but it has very good properties.
What this other acquisition function does is try to minimize uncertainty but also maximize the probability that I'm going to hit the maximum. It explores, but it explores around the maximum, so it trades off exploring against hitting the maxima.
That's just putting it up formally: my unknown, theta, is now where the pollution hotspot is, and the drone's decision is where to go next. I'm going to play this for you, and what you're going to see is the path of the drone
as it goes around exploring pollution levels, where the path is dictated by one of three acquisition functions: the upper confidence bound, entropy, or just randomly going anywhere. On the screen you can see the mean of the function, mu, the variance, and the path taken by the drone.
If we look at the entropy one and follow where it's going, you can see that it's closely following the variance; it's trying to reduce the variance. If we look at the path of the upper confidence bound, you can see that the drone keeps coming back to these two regions, because they are two regions of high pollution levels.
These are the hotspots, and it's exploring, but it's exploring around those regions, so it's identifying the regions that are of concern, as opposed to the one at the bottom, which is entirely random and really not doing a very good job of uncovering the pollution levels in an optimal fashion.
We just let this run, and it's about to finish, and you can see the result. That is an AI system where all the notions of the value of information that we've discussed, together with these acquisition functions, show how the drone is guided in a sensible way to achieve intelligent data acquisition.
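To give a flavour of how such an acquisition function can work, here is a small sketch of upper-confidence-bound selection on a one-dimensional grid, using a Gaussian process surrogate from scikit-learn. The "pollution" function, the grid, and the trade-off parameter are all made up for illustration; this is not Roman's actual system.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1-D "pollution" field with two hotspots (purely illustrative).
def pollution(x):
    return np.exp(-((x - 2.0) ** 2)) + 1.5 * np.exp(-((x - 7.0) ** 2) / 0.5)

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 10.0, 200).reshape(-1, 1)   # candidate sensing locations
X = rng.uniform(0, 10, size=(3, 1))                  # a few initial measurements
y = pollution(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)

for step in range(10):
    gp.fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    # Upper confidence bound: favour points that are either predicted high (exploit)
    # or very uncertain (explore).
    ucb = mu + 2.0 * sigma
    x_next = grid[np.argmax(ucb)].reshape(1, 1)
    X = np.vstack([X, x_next])
    y = np.append(y, pollution(x_next).ravel())

print("sampled locations:", np.round(X.ravel(), 2))  # they cluster around the hotspots
```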
Now I'd like to shift gears a little and talk a bit more about the notion of uncertainty and how it plays into decision making. I've called this slide a taxonomy of uncertainty, because I'd like to break uncertainty down into bits that we can do something about and bits that we can't really do anything about.
The first is the bit we can't do much about, which is inherent or aleatoric uncertainty. In this case getting more data will not help; you can think of this as the outcome of a toss of a coin. Then there's parameter uncertainty. An example of a parameter would be the probability that the coin will come up heads.
Although I may not be able to perfectly predict the outcome of a coin as I'm flipping it, whether it comes up heads or tails, the more I flip it, the better my estimate of the probability that it comes up heads. So in that case more data will help. Then there's model uncertainty.
Model uncertainty is harder to get at than parameter uncertainty, but very important nonetheless. Here, coming up with a model for coin tossing, the model is just that the data are i.i.d., independent and identically distributed, Bernoulli random variables. As you can see, this is a likelihood function.
Every time we propose a likelihood function, we are proposing a model. Likelihood functions are not the only sorts of models we can think about, but they are a sort of model, and what this model is effectively saying is that the outcome of one coin toss does not depend on a previous coin toss, as the short sketch below illustrates.
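As a tiny sketch of what "proposing a likelihood function is proposing a model" means, assuming i.i.d. Bernoulli tosses; the data and parameter values here are made up.

```python
# Likelihood of a sequence of coin tosses under the i.i.d. Bernoulli model:
# L(p) = p^(number of heads) * (1 - p)^(number of tails).
# The independence assumption is baked in by simply multiplying the terms.

def bernoulli_likelihood(tosses, p):
    """tosses: list of 0/1 outcomes; p: assumed probability of heads."""
    likelihood = 1.0
    for t in tosses:
        likelihood *= p if t == 1 else (1 - p)
    return likelihood

data = [1, 0, 1, 1, 0, 1]                 # illustrative data, not from the lecture
print(bernoulli_likelihood(data, 0.5))    # evaluate the model at p = 0.5
print(bernoulli_likelihood(data, 0.7))    # a different parameter value
```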
Independence is a pretty good assumption for a coin, but there are lots of cases where we make model assumptions when we could have chosen many others, and there is uncertainty around that model choice that's often not appreciated. And then there's knowledge uncertainty, which I would paraphrase as "I simply don't know."
Unfortunately that also happens quite a bit of the time, and yet we still must make a decision. So think about a future event, y-star here: we want to have some idea of the likelihood of that future event occurring. For now I'm going to leave out knowledge uncertainty;
I'll come back to it. But we do have to take into account the first three types of uncertainty, which is what this equation is doing. The first part of the equation says that even if I had the correct model M and I knew the parameters, there would still be uncertainty about the outcome.
That's my inherent uncertainty; that's the outcome of a toss of a coin. Then, conditional on a model (I'm only defining parameter uncertainty conditional on a model), there's uncertainty around the parameters, for example the probability of whether the coin comes up heads or tails. We all know that's about a half for a fair coin, but there are many cases where we don't know the parameters.
Lastly, there's uncertainty about the model. The uncertainty about a future event encompasses all of those uncertainties, and they all must be considered if we are to accurately quantify uncertainty. Now, when we come to quantify uncertainty, can probability actually represent or quantify it?
Well, I would argue yes for the first three, but no for the fourth, knowledge uncertainty. Knowledge uncertainty, the way I imagine it, and this is again from the paper, is uncertainty we simply don't know enough about to make probabilistic statements. That is a different type of uncertainty. Now I'm going to try and illustrate the difference between these in another toy setup, and then move on to how this played out at a large AI scale.
I'm going to do this in a regression context, because I know you've learned about regression from the wonderful Richard Scalzo. Here our decision is how much soft drink to order in a beachside cafe, where, as the thermometer in the picture suggests, we know that people buy more soft drink on hot days than on cool days.
Let's go back to our decision cycle, the framework with which we think about making decisions under uncertainty. There are two things I'm uncertain about. The first is the sensitivity of soft drink demand to temperature: if the temperature goes up one degree centigrade, by how much do my soft drink sales go up?
That's a parameter, so that's parameter uncertainty. Then there's the uncertainty in prediction: say, for example, the day is forecast to be 26 degrees, and I want to know my predicted demand for soft drink if the temperature is going to be 26 degrees. I've called that y-hat, and y-hat will contain the
uncertainty that's due to the fact that I don't know the sensitivity of soft drink demand to temperature, but also inherent uncertainty, in the sense that many things go into making up soft drink demand. I have some data, and now I've got not just y's but x's as well: the y's are my sales,
and the x's are the temperatures. I've got a model, and it's just a really simple model, a simple linear regression that says my sales are linearly related to temperature. But many things go into making up those sales, so I've got an error term, and I'm assuming, again a model assumption, that the errors are normally distributed. That gives rise to a likelihood function and a posterior.
Then I need to make my decision about how many soft drinks to order. Let's work with that example, and I'll show you the difference between inherent, or aleatoric, uncertainty and parameter uncertainty. That's our model. Here's some data: we've got temperature, we've got soft drink sales, and we can put a line of best fit through it.
There it is. But that's just based on that data; I'm not sure whether that's the true line that maps temperature to soft drink sales. There's uncertainty around it, and that's the parameter uncertainty, the uncertainty around beta zero and beta one. But remember, I also want a prediction interval.
I want to actually predict what will happen when the temperature is 26 degrees, and as you can see there's more uncertainty than just that surrounding beta zero and beta one. There's the error term, because temperature is not a perfect predictor of soft drink sales; many other factors go into the decision to buy a soft drink.
So here I've constructed a 95% posterior prediction interval for sales at a temperature of 26 degrees, and it includes not just the parameter uncertainty but also the inherent uncertainty, due to the fact that temperature doesn't perfectly predict sales.
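Here is a minimal sketch of that decomposition, assuming simulated data, a flat prior, and a plug-in noise variance rather than the full posterior the slides would use; the numbers (true coefficients, noise level, sample size) are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated "soft drink" data: sales roughly linear in temperature plus noise.
n = 20
temp = rng.uniform(15, 35, n)
sales = 10 + 3.0 * temp + rng.normal(0, 8, n)

X = np.column_stack([np.ones(n), temp])          # design matrix [1, temperature]
beta_hat, *_ = np.linalg.lstsq(X, sales, rcond=None)
resid = sales - X @ beta_hat
sigma2 = resid @ resid / (n - 2)                 # plug-in estimate of the noise variance

# Posterior of beta under a flat prior (treating sigma2 as known):
# beta | data ~ N(beta_hat, sigma2 * (X'X)^-1)
cov_beta = sigma2 * np.linalg.inv(X.T @ X)

x_star = np.array([1.0, 26.0])                   # predict sales at 26 degrees
param_var = x_star @ cov_beta @ x_star           # parameter uncertainty only
pred_var = param_var + sigma2                    # plus inherent (aleatoric) uncertainty

mean = x_star @ beta_hat
z = 1.96
print("95% interval, parameter uncertainty only:",
      (mean - z * np.sqrt(param_var), mean + z * np.sqrt(param_var)))
print("95% posterior prediction interval:",
      (mean - z * np.sqrt(pred_var), mean + z * np.sqrt(pred_var)))
# With more data, param_var shrinks towards zero but sigma2 does not.
```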
Now, if time moves on and we've got a hundred observations, you can see that the parameter uncertainty has shrunk enormously, because we're getting much more sure that this is in fact the true regression line. But the inherent uncertainty remains approximately the same, because there are still many other things that predict soft drink sales other than temperature.
Now, I do want to make the point that the way I've defined parameter and inherent uncertainty is conditional on a model. So although we cannot reduce inherent uncertainty by having more data, conditional on a model, there are ways, in this particular case, that we can reduce inherent uncertainty,
and that is by having a different model. Maybe we can reduce this inherent uncertainty by having a different model, and that's what this slide shows. Here is my data again with my hundred data points, and what I've done in the right-hand graph in front of you is change the model.
I've included other x's. These other x's in this instance are just different transformations of temperature, so I get a non-linear function of temperature, but they could have been other types of information as well. What you will see is that I have reduced my inherent uncertainty but increased my parameter uncertainty, because now I've got p plus one parameters as opposed to just two.
In fact, if I look at the overall uncertainty, I haven't decreased it at all, and that would tell me that maybe the linear model is not such a bad model. That's essentially how we go about thinking about whether we should add extra variables to a model: by looking at how much those extra variables
reduce my inherent uncertainty, the bit that I can't really control, traded off against the fact that I've got many more parameters to estimate. So that's one way we could have changed the model. Another way to change the model to try to reduce inherent uncertainty is to use a different functional form.
The previous models were additive models, but now the x's, which could be different transformations of temperature, are going into a deep learning or neural network model with multiple layers. This is the formula for what we've got:
we've now got this flexible function f. These sorts of deep learning models are wonderful on one level because they give great predictions, but it's really difficult to get a handle on the uncertainty around them, because now we've gone from two parameters in our linear regression case to 105 weights alone, leaving aside things like bias terms and various bits and pieces.
So we've really increased our parameter uncertainty. And when you think about parameter uncertainty, in the previous case the parameter beta one had a meaning: it meant the sensitivity of soft drink sales to temperature. These new parameters don't really have that interpretation.
These little w's are the weights that connect inputs to outputs, and when we put a prior over these weights it's really difficult to know what we're actually doing in terms of uncertainty. What we are actually doing when we put a prior over those weights is something called shrinkage, or regularization: we're just stopping the fitted function from being too wiggly.
But it becomes really difficult to interpret, and difficult to quantify uncertainty. Just to give an example of how high-dimensional these problems can become, there's been a lot in the news about ChatGPT. I'll just point out that ChatGPT has many more than 105 parameters:
it has 175 billion parameters. Getting uncertainty-quantified outputs for models of this scale is a tremendously difficult problem and an active area of research. It's even worse than that, actually, because these models are often over-parameterized, which means that many combinations of parameters will give you the same output.
Here is an example with a couple of parameters, theta one and theta two, where different combinations are equally likely. If we just focus our attention on uncertainty in one particular mode, it's not reflective of the total uncertainty we are facing.
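A quick illustration of that over-parameterization, using a made-up one-hidden-layer network: permuting the hidden units gives a different weight vector but exactly the same outputs, so the posterior over weights has multiple equally good modes.

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny one-hidden-layer network: f(x) = W2 @ tanh(W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def net(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Permute the hidden units: reorder rows of W1/b1 and the matching columns of W2.
perm = [2, 0, 3, 1]
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.normal(size=3)
print(net(x, W1, b1, W2, b2))       # original parameters
print(net(x, W1p, b1p, W2p, b2))    # permuted parameters: identical output
```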
This is not to say these models shouldn't be used; they are very useful when the goal is prediction rather than inference. But getting uncertainty around those predictions that takes this parameter uncertainty into account is a very difficult, and very interesting, area. Okay, so now let's move on to a real example where we can think about the impact of inherent, parameter, and even model uncertainty, and where it was totally ignored.
I've called this slide the UK visa fiasco, because it was indeed a fiasco. Here the decision they were faced with was whether or not to issue a visa. That's our decision, and the thing we don't know is the riskiness of the applicant. We may have a prior belief about the riskiness of the applicant.
We have data on the applicant, and we have a model. This isn't necessarily the exact model; it's a model that could have been used. I don't know exactly, because the exact model they used is not publicly available, but here I've given an example of one, and it's just a simple multinomial logistic regression.
I'm sure Richard has taken you through this; it's an example of regression where our output is a category rather than continuous. Here we've got a random variable y, which records whether the person is low, medium, or high risk. So y here is a vector for each individual, with a one in the place of the risk category and zeros elsewhere.
For example, if somebody was low risk, y_i would be (1, 0, 0); if they're medium, it would be (0, 1, 0); and if they're high risk, it would be (0, 0, 1). The actual form of this is not important, and I don't want you to think you have to understand the whole formula. The important point is that they were predicting the riskiness of the individual based on previous data
and based on covariates. They had attributes of people, and these are in the x's. One of the x's, which I've called x_1i here, is past immigration breaches, and that was used, perhaps sensibly, to determine whether or not the applicant was likely to be a high-risk individual.
As I've said down the bottom here, y_i-hat is the prediction of the individual into the low, medium, or high risk category. That is what the algorithm was doing, and then, based on that prediction, if they were high risk they didn't get a visa. This was initially at time t equal to zero, as I said up the top.
Let's roll forward to time t equal to one, and this is what they actually did. Go back to our covariate x_1: they said x_1 is now not just the past immigration breaches, it also includes our previous prediction of whether or not they were a high-risk person. So they actually used a prediction
as an observation in the next round of the model. Now, as you've seen in the trivial slide with the soft drinks, this y_i-hat has a lot of uncertainty around it. There's uncertainty because we don't know the parameters of the model, all the betas here, and there's also just inherent uncertainty:
not everybody who is predicted to be high risk is high risk. Both of these types of uncertainty were ignored; they treated predictions as if they had no uncertainty around them, as if they were objectively observed. And of course they also ignored model uncertainty, because they had one model and they could have had many models.
So that was ignored too. What actually was the end result of this particular AI system? There's a wonderful visualization on the ABC that has no maths in it, and it's absolutely fantastic for understanding what actually went on in this case.
I've given the reference, and I would really encourage you all to have a look at it, because it also goes through the Robodebt example, which is great. Here is how they explained what actually happened. There were 200 people, split evenly between two made-up countries, red and blue, applying for visas in 2015.
The red group has slightly higher rates of historical breaches; again, this is from the website. You've got the historical breaches, 33% for the red versus 26% for the blue, and their model's predictions in 2015 were pretty close to those historical values:
23% for the blue, 34% for the red. Remember that they're then going to use their predictions from 2015, so if somebody was predicted to be high risk, that feeds in effectively as an immigration breach for future years. What you end up with is that by 2017, 43% of the people from the red country are being refused visas versus 20% from the blue country.
If you were to run this out to infinity, you'd find that basically everybody in the red country would be refused a visa. This is an example of what happens in AI systems that ignore uncertainty. As I've said here, treating model-generated outputs as observations, ignoring inherent and parameter as well as model uncertainty, is my idea of irresponsible AI.
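To see the mechanism rather than the real system (whose model is not public), here is a deterministic sketch in the spirit of the ABC visualization: two groups with slightly different true breach rates, a naive risk score that feeds its own refusals back in as if they were observed breaches, and the resulting drift. All numbers are illustrative.

```python
# A deterministic, illustrative sketch of the feedback loop (not the real model):
# refusals are fed back in as if they were observed breaches.

true_breach_rate = {"red": 0.33, "blue": 0.26}      # actual behaviour, never changes
recorded_rate = dict(true_breach_rate)              # what the model "sees"

for year in range(2015, 2019):
    refusal_rate = {}
    for country in ("red", "blue"):
        # Naive model: refuse applicants in proportion to the country's recorded rate.
        refusal_rate[country] = recorded_rate[country]
        # The fiasco: a refusal is recorded as a breach, on top of the true breaches
        # among those admitted, with no uncertainty attached to either.
        admitted = 1.0 - refusal_rate[country]
        recorded_rate[country] = refusal_rate[country] + admitted * true_breach_rate[country]
    print(year, {c: round(r, 2) for c, r in refusal_rate.items()})

# Both refusal rates drift steadily upwards, away from the unchanging true behaviour,
# with the initially higher-risk country hit hardest: model outputs treated as data.
```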
There was a report written recently, commissioned by the Royal Society and written by the Alan Turing Institute, on the use of synthetic data, which is getting a lot of press at the moment. By synthetic data I mean here data that is generated as a model output.
That could be from a statistical model, as in the case of this particular visa fiasco, or it could be from agent-based models, for example. Synthetic data is limited to the model's predetermined configuration and will not enable statistical inference to be reliably drawn about the real world.
That's from the report, which is called "Synthetic Data: What, Why and How?". I think this is really important. It's not to say that synthetic data doesn't have a use; sometimes it is very useful, but we must always anchor synthetic data to real data if we really want to make inference about the real world. Otherwise the synthetic data generated from a model just contains the information in the model, and the model may or may not be right.
I want to now flip back to our geothermal energy example, because this is an example, I think, where models, and these were physical models, were used in a really smart and interesting way, and they were anchored to data. They used physical models and anchored those physical models to data in order to learn about what the subsurface geology looks like in the Cooper Basin.
So we're going back to the Cooper Basin. It's not quite as straightforward as the first few slides seemed to indicate, and this is an example of how to use probabilistic machine learning models with physical models to really reduce our uncertainty about what lies beneath the surface of the earth.
This work was done by Lachlan McCalman in 2014 at NICTA, with a whole bunch of other people there at the time, and there was also work done with Richard Scalzo and a few others; I was fortunate enough to work with Richard on that particular occasion. So again, our prior: what we don't know is the subsurface geology, so they have to come up with a prior on the geology. This is their prior: they said that the geology is a layered structure, and there are capital N of these layers.
The number of layers has a Poisson distribution, because it's random; we don't know exactly how many there are, and the layers are not necessarily adjacent. The layers have edges, and the edges are not straight lines, they are curves, or wiggly lines, so they put a Gaussian process prior over the edges.
I'm not going to go into Gaussian processes; I'm sure you'll have a lecture on them at some later stage. So we've got Gaussian process priors for the edges. The layers contain rock, and we're saying that each layer contains only one particular type of rock. Of particular interest, obviously, is granite.
It could be granite, it could be sandstone, it could be basalt; those are three examples, and each particular rock has properties. So this is their prior, and over here we see one realization of it. It is not the only realization; this prior could generate billions of possible geologies below the surface.
The role of the data, and in this case the geophysics behind it, is to constrain those prior worlds, to reduce the uncertainty and get a better pinpoint on where hot granite lies. The rocks have properties, and these properties are things like density and magnetic susceptibility.
That's very important, because those rock properties will actually determine the surface measurements. Together with this prior they have data: gravity, magnetics, magnetotellurics, and a few core samples, not many. Most of the data, certainly y1 to y3, lies on the surface and is relatively cheap to collect.
But these are not direct readings of the subsurface. What they did was say: well, if this subsurface exists and has these properties, and I have forward models, models that take the density of a rock to gravity on the surface, or magnetic susceptibility to the magnetic reading on the surface, then I can say, if this prior structure were the truth, this is what I would expect to see on the surface.
They said that the data they saw on the surface had a normal distribution with a mean determined by the geology and the forward models, which propagate rock properties to readings on the surface. It was a really imaginative and clever way of using different types of information to reduce uncertainty.
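As a very rough sketch of that idea, and not the actual Cooper Basin models: a candidate layered geology is pushed through an assumed linear forward operator to predict surface gravity readings, and the likelihood of the observed readings is Gaussian around that prediction. Every quantity here is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Candidate subsurface: densities of, say, 5 layers (a draw from some geological prior).
candidate_density = np.array([2.3, 2.5, 2.7, 2.65, 2.9])

# Assumed forward model: a linear operator mapping layer densities to gravity
# readings at 8 surface stations. In reality this is a physics code, not a matrix.
G = rng.normal(size=(8, 5)) * 0.1

def log_likelihood(density, observed, noise_sd=0.05):
    """Gaussian likelihood of surface observations given a candidate geology:
    observed ~ Normal(G @ density, noise_sd^2), up to a constant."""
    resid = observed - G @ density
    return -0.5 * np.sum((resid / noise_sd) ** 2) - len(observed) * np.log(noise_sd)

# Pretend these are the measured surface gravity readings.
observed = G @ candidate_density + rng.normal(0, 0.05, size=8)

print(log_likelihood(candidate_density, observed))          # good candidate: higher
print(log_likelihood(candidate_density * 1.2, observed))    # worse candidate: lower
```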
So here it is: the prior, together with the likelihood, gives the posterior over where to drill for granite. Now I'd like to talk a bit about model uncertainty, which I've rather glossed over. I've mentioned it a little, but now I'm going to give you two real examples of model uncertainty. I'm sure you'll all remember Covid, and a concept called the effective reproduction rate.
The effective reproduction rate determines, if one person is unwell, how many other people get unwell. If it's below one, the disease will die out; if it's above one, it spreads. This was a particular piece of work done by Imperial College, by the Imperial Covid-19 task force.
I've given the reference for the paper, which was entitled "Estimating the Effects of Non-Pharmaceutical Interventions". Non-pharmaceutical interventions are, as the title suggests, interventions which are not pharmaceutical: things like, and I've listed them over on the left there, lockdown, event bans, school closures, self-isolation, and social distancing.
Those are our non-pharmaceutical interventions. And this is a model, a model that relates those non-pharmaceutical interventions to the effective reproduction rate. It's a model that says how the effective reproduction rate changes over time, and the model is that it's a step function of these interventions.
Here is my effective reproduction rate, and you can see the generated model outputs and how they change in response to each particular intervention. What this particular group did really well was to then say: okay, if that is the effective reproduction rate, how does this translate?
They had a model for going from the effective reproduction rate all the way through to the number of infections, and through to the number of deaths. The number of infections is much higher here; this is the number of infections, and these pink bits down the bottom are the actual data. So they connected their model outputs to data, and you can see that there are many more infections in the population than are reported, which is sensible.
Most importantly, they connected it to deaths, because deaths tend to get reported. So they anchored their model output to the actual number of deaths, and you can see their model output agrees quite well with the number of deaths. But this is just one model.
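For readers who want the mechanics, here is a toy version of that chain from R_t to infections to deaths, using a discrete renewal process. The generation-interval weights, infection fatality ratio, delay distribution, and seed values are all placeholders, not the numbers used by the Imperial team.

```python
# Toy propagation of an R_t series to infections and then deaths via a
# discrete renewal process. All quantities below are illustrative.
import numpy as np

T = 100
R_t = np.where(np.arange(T) < 35, 3.0, 0.8)   # toy R_t: drops after day 35

# Toy generation-interval weights (how infectious someone is s days later).
g = np.exp(-0.5 * ((np.arange(1, 15) - 5) / 2.0) ** 2)
g /= g.sum()

infections = np.zeros(T)
infections[0] = 10.0                           # seed infections
for t in range(1, T):
    past = infections[max(0, t - len(g)):t][::-1]   # most recent first
    infections[t] = R_t[t] * np.sum(past * g[:len(past)])

# Deaths: a fraction of infections (IFR), delayed by a toy lag distribution.
ifr = 0.01
lag = np.exp(-0.5 * ((np.arange(1, 30) - 18) / 5.0) ** 2)
lag /= lag.sum()
deaths = ifr * np.convolve(infections, lag)[:T]

print("peak daily infections:", int(infections.max()))
print("total deaths (model):", int(deaths.sum()))
```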
And from that one model, you can see that they said lockdown had this huge impact, and you can see the effective reproduction rate dropping. They said that, therefore, lockdown had avoided approximately three million deaths. But this was just one model; this is just one model that they had.
They had another model, and here it is. This other model models the effective reproduction rate as a function of mobility, and the mobility data was provided by Google. So this is a different model, and here is the way that it reproduces the effective reproduction rate.
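A minimal sketch of the alternative, mobility-driven view is below. The functional form, coefficient, and mobility curve are my own invented stand-ins for illustration; the real model and the Google mobility indices are considerably richer.

```python
# Toy model of R_t as a function of mobility rather than a step function of
# interventions. Everything here is illustrative.
import numpy as np

days = np.arange(100)
R0 = 3.0

# Toy mobility series: fraction of baseline movement, falling smoothly from
# 1.0 to about 0.3 around day 30 (a stand-in for a Google mobility index).
mobility = 0.3 + 0.7 / (1.0 + np.exp(0.3 * (days - 30)))

# Toy relationship: R_t scales with mobility raised to some power.
beta = 1.5
R_t_mobility = R0 * mobility ** beta

print("R on day 0, 30, 99:",
      R_t_mobility[0].round(2), R_t_mobility[30].round(2), R_t_mobility[99].round(2))
```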
You can see that on this model's basis the biggest impact was in fact social distancing; social distancing had the biggest impact on reducing the effective reproduction rate. Again, you should check models by validating them with data, and again they did this. For the UK data they looked at the root mean square error, and of course there are many more sources of uncertainty than just the root mean square error.
But the root mean square errors were approximately similar for the UK for this model and the other model. They did it across 13 countries, and in fact, across the whole 13 countries, this model was a much better fit to the data than the other model.
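The comparison itself is straightforward to sketch: compute each model's root mean square error against observed deaths, country by country, and average. The data below are random placeholders, not the study's numbers.

```python
# Toy model comparison by root mean square error across countries.
import numpy as np

rng = np.random.default_rng(2)
countries = [f"country_{i}" for i in range(13)]

def rmse(predicted, observed):
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

scores = {"model_A": [], "model_B": []}
for c in countries:
    observed = rng.poisson(100, size=60).astype(float)   # placeholder death counts
    pred_A = observed + rng.normal(0, 25, size=60)        # placeholder model A predictions
    pred_B = observed + rng.normal(0, 15, size=60)        # placeholder model B predictions
    scores["model_A"].append(rmse(pred_A, observed))
    scores["model_B"].append(rmse(pred_B, observed))

for name, s in scores.items():
    print(name, "mean RMSE across countries:", round(float(np.mean(s)), 1))
```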
But the point here is that we've got one model that says it's all about lockdown, and another model that says social distancing probably has the biggest impact. And when you go to make decisions, these are pretty big decisions you're making. Having two conflicting models is not saying that one is right and the other is wrong, but it is certainly saying that we need to be a bit more certain, or we need to somehow come up with a way of figuring out which would be the best policy.
We wrote a paper about this, and it appeared in the Journal of Clinical Epidemiology. As this slide suggests, it was a paper about model uncertainty, and we came to the following conclusions: the model proposing major benefits from lockdown in European countries had the worst fit to the data.
It did. Models with a better fit to the data showed little or no benefit from lockdown. And we came to the conclusion that observational data fed into complex epidemic models should be dissected very carefully, that substantial uncertainty may remain despite the best efforts of modellers, and that causal interpretations from such models should be avoided.
So those were our conclusions, and that's just a bit on how important model uncertainty is in the decision-making process. And finally, decisions where we really have no information. That was the case with the wild horses in Kosciuszko National Park; we really had very little information there.
The estimate of the wild horse population was about 9,000, but the error bars ran from, I think, two or three thousand all the way up to 30,000. That's another way of saying: I have no idea, but it's probably under 30,000. So there was very little knowledge about anything: about the number of horses, or about the way the horses impacted the environment.
So how do you make decisions, because a decision has to be made? We know there are too many horses in the national park. What is the best way of moving forward, particularly when there are so many different stakeholders, all with very different priorities and different beliefs? Well, the report, or the paper, argued as follows.
The paper was jointly authored with Hugh Durrant-Whyte, who is the Chief Scientist of New South Wales and was in charge of the inquiry. So, start with what is known: 350 horses can be rehomed each year; that is what they did find. Based on horse reproduction, that means we can have a maximum of about 3,000 wild horses.
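The talk doesn't give the reproduction figures behind that cap, but the logic is a simple steady-state calculation; here is a toy version, with the annual growth rate as my own assumption.

```python
# Toy steady-state calculation behind "350 rehomed per year implies a cap of
# roughly 3,000 horses". The growth rate is an assumption for illustration;
# the talk does not state the exact figure.
annual_removals = 350          # horses that can be rehomed each year (from the inquiry)
assumed_growth_rate = 0.12     # assumed net population growth per year (~12%)

# At steady state, removals balance growth: growth_rate * N = removals.
max_sustainable_population = annual_removals / assumed_growth_rate
print(f"Maximum sustainable population: about {max_sustainable_population:.0f} horses")
```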
Second, try to establish parameter and model uncertainties. The current number of horses, which is a parameter uncertainty, is critical information, and the horses are now going to be counted using drones flying around, with that information made available to the public. That is so important: the public get to see how the counts vary, because even drones, as they traverse different areas on different days, will pick up different numbers of horses.
Damage caused by horses to the environment, which is a model uncertainty, will be monitored over time from images using deep learning algorithms, which are fantastic for image analysis. And then, intelligent data acquisition principles: we are going to be using mutual information as an acquisition function to constantly learn and update what we know.
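Here is a toy illustration of mutual information as an acquisition function: survey the region where an observation is expected to reduce uncertainty the most. The regions, belief probabilities, and sensor model are all invented for illustration.

```python
# Toy mutual-information acquisition: choose which region to survey next by
# expected reduction in entropy of our belief about horse density.
import numpy as np

def entropy(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

# Current belief that each region has high horse density (invented).
regions = {"plain": 0.5, "valley": 0.9, "ridge": 0.2}

# Simple sensor model: P(drone detects many horses | true density state).
p_detect_given_high, p_detect_given_low = 0.8, 0.1

def expected_information_gain(p_high):
    prior_H = entropy([p_high, 1 - p_high])
    p_detect = p_high * p_detect_given_high + (1 - p_high) * p_detect_given_low
    gain = 0.0
    for obs_prob, like_high in [(p_detect, p_detect_given_high),
                                (1 - p_detect, 1 - p_detect_given_high)]:
        post_high = p_high * like_high / obs_prob
        gain += obs_prob * (prior_H - entropy([post_high, 1 - post_high]))
    return gain

print({r: round(expected_information_gain(p), 3) for r, p in regions.items()})
print("Survey next:", max(regions, key=lambda r: expected_information_gain(regions[r])))
```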
So although we know very little, for decisions with deep knowledge uncertainty this still provides a framework with which to attack those types of problems. Does uncertainty really matter? Well, I hope you will think yes by now, but the extent depends on the decision context. So here's one scenario.
We only care about prediction; we're not interested in the why. We're just interested in knowing whether somebody is going to buy a product and predicting that. We've got large amounts of data, decisions are reversible and repeatable, and the consequences are short-lived. And if you make a poor prediction, the costs are borne by the people who develop the algorithm, not by the people the algorithm is targeting.
Okay. Scenario two is inference. I really need to understand the why; it is important. Data is sparse and uncertain, decisions are difficult or impossible to reverse, the consequences are long term, and the future is really ambiguous. And the cost of a poor decision is not borne by the people who develop the algorithm, but by the people the algorithm targets.
So what sort of situations are typical of those scenarios? Scenario one: pricing a product in a supermarket, image analysis, online advertising. Failing to quantify uncertainty will lead to sub-optimal decision making, but probably not a catastrophe. Scenario two: understanding the impact of a government intervention, where to drill for geothermal energy, deciding whether an individual gets a visa, how many horses to cull in the national park. Failing to quantify uncertainty accurately, or ignoring it altogether, is an example of irresponsible AI. So in summary, going back to our cycle, what would I have you take away from this?
I would have you take away that accurately quantifying uncertainty is crucial for optimal decision making. It's also crucial for scientific discovery, because quantifying uncertainty tells us what we don't know. And it is imperative if we are to build trust with the public: we need to be honest about what we know and what we don't know.
And of course, without it, we cannot have responsible AI. There's a lot of talk about responsible AI at the moment, and most of it ignores the important role that quantifying uncertainty has in AI systems. But there's good news, and the good news is that we have a framework. There is one equation to rule them all, one equation to find them, one equation to bring them all, and with a little bit of mathematics bind them: Bayes' theorem.
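In the notation from earlier, with theta for what we don't know and y for the data, it reads (up to a normalising constant):

p(theta | y) ∝ p(y | theta) × p(theta)

That is, posterior ∝ likelihood × prior, the same structure that appeared in the geology example, the epidemic models, and the wild horse problem.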
So thank you very much. It's been a great pleasure to talk to you, and all the best for your future studies.