Presagen Webinar Series: AI Accuracy Supremacy is a race to the bottom for robust and reliable AI in Healthcare

In Jan 2023 Dr Don Perugini presented the webinar “AI Accuracy Supremacy: A race to the bottom for robust and reliable AI in Healthcare”. Presagen’s webinar series presents new technologies and challenges related to AI in healthcare, women’s health and fertility. Below is the transcript of the presentation, the webinar video, and the presentation slides.


Slide 1

Hi everyone, Thanks for joining me.

I’m Don Perugini from Presagen. Today I will be talking about how our obsession with AI accuracy supremacy is a race to the bottom for robust and reliable AI in healthcare.

Slide 2

Achieving new accuracy records in healthcare can be ground-breaking.

It could involve solving a new problem or producing an innovative new algorithm to better solve an existing problem.

These scientific achievements are typically newsworthy and should be celebrated.

Slide 3

However… when you’re talking about using AI for healthcare, in the real-world, in clinical practice and on patients,

the AI needs to be consistently accurate, or what we call generalizable.

The AI needs to be robust and reliable to work across all scenarios that it is intended to be used for.

It needs to be unbiased and scalable to work reliably for all patients, regardless of who they are or where they live.

Slide 4

The question is, does our obsession with accuracy actually help AI transition from the lab into the real world?

Do AI competitions like Kaggle, and beating AI accuracies on benchmarks, necessarily translate to better AI tools in the real world?

Slide 5

The problem is, AI accuracy can be misleading, for two reasons.

First, accuracy is only as good as the data used to test it.

If you are testing the AI on one or a few clinics, it is easy to “overfit” the AI so that it is highly accurate for that type of clinic or the patients that attend that clinic.

This may not be representative of the accuracy for other clinics and patients in the real world that may use the AI, and thus can lead to bias.

To solve this, the data used to test AI algorithms needs to be diverse, or better still, globally diverse, to ensure that the AI is globally applicable.

Secondly, accuracy provides little insight into how reliable and robust the AI will likely be in practice.

What does this mean and how do we solve it? Well, that’s the focus of this webinar.

So let’s do a deep dive.

Slide 6

Let’s use a simple example of training an AI algorithm to identify pictures of hotdogs.

This example could have been detecting cancer in images, but for simplicity, let’s use hotdogs.

Slide 7

To create AI for real world use, there are several steps.

First you collect the data to train the AI, which in this case are images with and without hotdogs.

The AI trains using this data, in essence looking for patterns or features in the images that identify hotdogs and distinguish them from, say, burgers.

The AI training will produce multiple AI algorithms.

So why multiple AI algorithms? Well, it is just like in class at school.

Teachers teach different students the same material, however each student learns differently, and some learn better than others.

With multiple AI options available, we need to assess and test each AI to determine which one will likely perform the best in the real world.

Finally, the best AI is then selected, and then can be productized or operationalized to be used in the real world.
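As a toy sketch of the assess-and-select steps above, here is a minimal selection loop. The candidate “models” are simple threshold rules standing in for trained networks, and each image is reduced to a single feature value; these are illustrative assumptions, not Presagen’s actual pipeline.

```python
# Toy test set: each "image" is reduced to one feature value.
test_images = [0.9, 0.7, 0.2, 0.1]
test_labels = [1, 1, 0, 0]  # 1 = hotdog, 0 = not a hotdog

def make_model(threshold):
    """Stand-in for a trained AI: predicts hotdog when the feature exceeds a threshold."""
    return lambda x: 1 if x > threshold else 0

# "Training" produced multiple candidate algorithms.
candidates = [make_model(t) for t in (0.3, 0.5, 0.8)]

def accuracy(model):
    """Fraction of test images the model classifies correctly."""
    hits = sum(model(x) == y for x, y in zip(test_images, test_labels))
    return hits / len(test_labels)

# Assess each candidate and select the best performer on the test set.
best = max(candidates, key=accuracy)
```

In this toy setup the thresholds 0.3 and 0.5 classify every test image correctly, while 0.8 misses one, so the selection keeps a perfectly accurate candidate.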

Slide 8

The key step in the AI creation process is assessing and testing the AI to determine whether it is likely to be reliable and robust enough for real-world use.

Slide 9

To explain this, we will now talk about the AI output and how to interpret it.

When the AI analyzes a new image to identify if a hotdog is present, it outputs a number, or a score, from 0 to 1.

This number is used to tell us whether the AI sees a hotdog in the image, and its confidence in seeing a hotdog.

A number of 1 means that the AI is very confident, or certain, that the image contains a hotdog.

A number of 0 means that the AI is very confident that the image does not contain a hotdog.

A number of 0.5, so halfway, tells us that the AI is unsure whether the image contains a hotdog.

Therefore we say that a score above 0.5 means the AI sees a hotdog in the image, and a score of less than 0.5 means the AI does not see a hotdog in the image.

And the closer the score is to 1 or 0, the more confident the AI is that it does, or does not, see a hotdog, respectively.
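The decision rule just described can be written down in a few lines; a minimal sketch, where the function name and the way confidence is rescaled are illustrative choices:

```python
def interpret(score):
    """Map an AI score in [0, 1] to a decision plus a rough confidence."""
    label = "hotdog" if score > 0.5 else "not a hotdog"
    # Confidence grows as the score moves away from the unsure point, 0.5:
    # 0.0 means completely unsure, 1.0 means certain.
    confidence = abs(score - 0.5) * 2
    return label, confidence
```

For example, `interpret(0.9)` returns a confident hotdog decision, while `interpret(0.5)` returns a confidence of 0.0, reflecting a completely unsure AI.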

Slide 10

In this example, there are four images, and four scores output by the AI.

The first two are clearly hotdogs, and the AI outputs 0.9 for one and 0.6 for the other.

A value above 0.5 means the AI sees a hotdog, so the AI is correct for both of these images.

Similarly, for the bottom two images of burgers, the AI score is less than 0.5, so the AI does not see a hotdog in the image, which is also correct.

Slide 11

It is also important to note that although the AI got all the images correct, it is more confident in its assessment of some images than others.

So, with the top image, the AI is very confident that it is a hotdog, with a score of 0.9 out of 1, but with the second image it is less confident, with a score of 0.6, though still correct.

Likewise with the last two images of the burgers.

The AI is more confident with the top burger image with a score of 0.1 compared with the second burger image with a score of 0.4.
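The four scores on these slides can be run through the 0.5 decision rule directly; a small sketch using the slide’s values:

```python
# The four AI scores from the slides: two hotdog images, then two burgers.
scores = [0.9, 0.6, 0.1, 0.4]
truth = [1, 1, 0, 0]  # 1 = hotdog, 0 = not a hotdog

# Decision rule: a score above 0.5 means the AI sees a hotdog.
predictions = [1 if s > 0.5 else 0 for s in scores]

# All four decisions match the truth, so accuracy is 100%,
# even though the confidence varies from image to image.
accuracy = sum(p == y for p, y in zip(predictions, truth)) / len(truth)
```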

Slide 12

This metric associated with AI confidence is key to assessing whether an AI is likely to be reliable and robust in practice.

Our view is that selecting AI which is both confident and accurate during testing, will likely lead to AI that is more reliable and robust in the real world, compared with selecting AI based purely on accuracy alone.

I will explain why.

Slide 13

Let’s consider an example of comparing two different AI algorithms, algorithm 1 and 2, by testing them on four images.

As can be seen on the slide, algorithm 1 gets all images correct, with 100% accuracy, because it was able to identify images with and without hotdogs in all four cases.

Algorithm 2 on the other hand gets one wrong, the second image, and only achieves 75% accuracy.

Just looking at accuracy, algorithm 1 looks better than algorithm 2.

Slide 14

However, if you look at the confidence of these assessments, algorithm 1 was less confident, close to unsure, in its assessment of whether a hotdog was present or not for all images.

Algorithm 2 on the other hand was very confident in its correct assessments, and less confident with the one that it got wrong.

What does that mean?

Slide 15

It means that the lower confidence of algorithm 1, being less sure of its assessments, is almost like guessing.

The high accuracy was probably more like the “luck of the draw”: it just happened to guess right for all the images, rather than being genuinely able to identify hotdogs in images.

Therefore, algorithm 1 is likely to be brittle, and the 100% accuracy is not likely to be a reliable measure of its performance in the real world.

Now with algorithm 2, it was very confident in its correct assessments, and unsure on the one it got wrong, which is acceptable because the AI is telling us that it is less sure.

Therefore, algorithm 2 is more likely to be robust and the 75% accuracy is likely to be more reliable when used in the real world.

In other words, for AI to generalize well in the real world, you want to choose an AI that has good confidence in its predictions during testing, even at the expense of accuracy. That AI will generally achieve better accuracy overall on new, unseen data when it is used in practice.
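One simple way to put this selection rule into code is to report each algorithm’s accuracy alongside an average confidence, measured here as the score’s distance from the unsure point 0.5. This particular confidence measure, and the score values below, are illustrative assumptions rather than Presagen’s actual metric:

```python
def evaluate(scores, labels):
    """Return (accuracy, mean confidence) for one AI algorithm on a test set."""
    preds = [1 if s > 0.5 else 0 for s in scores]
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
    # Distance from 0.5, rescaled so 0.0 = completely unsure, 1.0 = certain.
    confidence = sum(abs(s - 0.5) * 2 for s in scores) / len(scores)
    return accuracy, confidence

labels = [1, 1, 0, 0]
algorithm_1 = [0.55, 0.52, 0.48, 0.45]  # every answer correct, but barely sure
algorithm_2 = [0.95, 0.40, 0.05, 0.10]  # one answer wrong, but decisive

# Algorithm 1: accuracy 1.0 with very low confidence (near-guessing).
# Algorithm 2: accuracy 0.75 with high confidence.
```

On this sketch, preferring algorithm 2 trades a little test accuracy for confidence, which the argument above suggests will hold up better on unseen data.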

Slide 16

To conclude, accuracy and confidence in AI creation are key to more reliable and robust AI for use in the real world.

Focusing on accuracy alone can be misleading and does not necessarily represent the performance of the AI in practice.

For AI in healthcare, it is important that the AI reliably performs as per its stated performance metrics, to ensure that all clinics and patients that use the AI receive the improved health outcomes and clinical benefit.

Slide 17

I would like to thank you for joining this webinar.

This presentation is available on our web site at Presagen.com.

Thank you.