Sam Ireland

No References Allowed - Testing Competence in the Age of AI

Updated: Mar 3, 2023


Preface

Back-to-back opinion-based soft topic pieces? 🫢 I promise I'll write another mini-book thoroughly over-explaining a medical topic next. Stay tuned. In the meantime, news about ChatGPT is blowing up. I have been in a vortex of watching interviews with the people creating these AI systems. There seems to be a never-ending stream of news every single day about new AI capabilities, and some of them are very impressive, to say the least. In EMS, we'll want to stay aware of this topic. Otherwise, it's going to sneak up on us.


Universities are in a panic trying to figure out how to deal with these systems when testing their students. Some are strongly opposed, while others welcome this technology with open arms. Where will EMS fall on this spectrum?


I recently did a conference keynote, "The EMS Singularity," exploring these topics. I'll be giving that presentation in a live class time slot soon, and we'll also post it as a podcast. This is a fascinating technology and one I am beyond excited to watch evolve.


Check out the videos at the end of the blog as well!


What is competence?

The question of competence is relative to the situation. When determining competence, we devise a list of required, objective goals. Once that list is met, we can feel comfortable saying that an individual can adequately perform a given task.


For an emergency clinician, we could use the example of chest compressions to develop a set of objective values that would determine their competence in a given scenario. For example:


1. Determine that your patient is pulseless.

2. Start chest compressions.


This would be an easy test of competence - any delay in CPR is robbing the system of blood flow and perfusion. However, what if that provider was being aided by a reference card that had an algorithm on it?

Suppose the clinician followed the reference to arrive at the same conclusion as someone who had memorized the process, and the outcomes were the same. Would we be able to judge that individual as any less competent? This likely brings in some bias on the part of the examiner. The examiner might conclude that since the provider needed a reference and did not know the answer by memory, the provider cannot be trusted to perform the task independently. Let's take this a step further by blinding the examiner.


The examiner is behind a wall and has no idea whether the clinician is using a checklist. Using a sensor and a timer, they measure only how long it takes to start chest compressions and whether perfusion pressure is maintained (since these were the objective metrics we determined would prove competence). This scenario may seem very simple, but the idea of using resources runs much deeper.
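To make that concrete, here's a minimal sketch of what a blinded, objective scoring rig could look like. This is Python with made-up sensor field names and example thresholds (not clinical guidance); the point is that the scorer only ever sees timestamps and depth readings, never whether a checklist was used.

```python
from dataclasses import dataclass

# Hypothetical thresholds - example numbers for illustration, not clinical guidance.
MAX_SECONDS_TO_COMPRESSIONS = 10.0  # max delay from pulselessness to first compression
MIN_MEAN_DEPTH_MM = 50.0            # minimum adequate mean compression depth

@dataclass
class SensorLog:
    """Everything the blinded examiner sees: timestamps and sensor data only."""
    pulseless_determined_at: float      # seconds since scenario start
    compressions_started_at: float      # seconds since scenario start
    compression_depths_mm: list[float]  # depth sensor readings

def assess(log: SensorLog) -> bool:
    """Pass/fail on the objective metrics alone - checklist use is invisible."""
    delay = log.compressions_started_at - log.pulseless_determined_at
    mean_depth = sum(log.compression_depths_mm) / len(log.compression_depths_mm)
    return delay <= MAX_SECONDS_TO_COMPRESSIONS and mean_depth >= MIN_MEAN_DEPTH_MM

# Compressions started 6.5 seconds after determining pulselessness, adequate depth -> passes.
print(assess(SensorLog(12.0, 18.5, [52.0, 55.0, 49.0, 54.0])))  # True
```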


We've certainly made exceptions when it comes to advanced airway placement. There are so many intricate steps to this procedure that some would argue not using a resource such as an 'intubation checklist' would be irresponsible. This forces us to ask some very difficult questions about how much, and what type of, assistance a clinician can receive before they are deemed incompetent. Consider this list, which would never end if we kept going:


Does "reference" mean only a physical checklist?

Can that checklist be on your phone?

What if that checklist could use audio to assist you?

What if that audio-enabled checklist was interactive?

What if that interactive checklist was gathering information from your monitoring equipment?

And so on...


A good example of this question of using references is an open-book exam. All the answers are available to the student if they can find them in the book. So, what are we really testing? Their ability to use an index and a glossary? The exam essentially turns into a game of timed page flipping. What if that book were in PDF form? Could they Command+F (search) the document? That would get rid of needless page-turning. You can see where this is going. An AI assistant will take the student's question, search the resource, and consolidate the information for them. Aside from the AI being more efficient, the difference barely exists. It's a smart book that you can talk to like a human.
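As a toy illustration of how small that step really is, here's the "Command+F" part in Python, run over a made-up three-page "textbook." It finds the relevant passages; consolidating them into an answer is the piece the student supplies - and the piece an AI assistant now automates.

```python
def command_f(pages: list[str], query: str) -> list[tuple[int, str]]:
    """Return (page number, page text) for every page containing the query."""
    return [(i + 1, page) for i, page in enumerate(pages)
            if query.lower() in page.lower()]

# A made-up three-page "textbook" for illustration.
book = [
    "Check responsiveness, then check for a pulse for no more than 10 seconds.",
    "If the patient is pulseless, begin chest compressions immediately.",
    "Defibrillation is indicated for shockable rhythms.",
]

for page_num, text in command_f(book, "pulseless"):
    print(f"p.{page_num}: {text}")
# An AI assistant goes one step further: it reads these hits and writes the answer for you.
```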


Perhaps the most challenging question to answer is:


Are we opening up Pandora's Box of ignorance and reliance?


ChatGPT

As Jay-Z once said: "Well, I ain't passed the bar, but I know a little bit."


ChatGPT knows a little bit too. In fact, it can pass parts of the bar exam, has passed a Wharton MBA exam, and, of particular note to the medical community, has performed at or near the passing threshold on all three steps of the United States Medical Licensing Examination (USMLE).

How comfortable would you be knowing that an emergency physician used ChatGPT to pass their USMLE? Probably not very comfortable, given the way testing is currently performed. This isn't limited to the medical profession. You also wouldn't want your lawyer, or the CEO of your business, to have obtained their credentials based on an ability to copy and paste.


You might be thinking this is a non-problem. After all, these exams are monitored, and there are steps to ensure that someone challenging these exams cannot simply copy and paste their way to success. That's true, but it's also far from the point.


At least two things are certain - systems like ChatGPT (and soon GPT-4) are not going away, and they will improve over time. We cannot even fathom how much they will improve. AI systems are on an exponential growth path, and there are no signs of slowing down. So, while physicians, lawyers, and CEOs might not be copying and pasting their way to success, this calls their entire method of education and testing into question. How so?


A visual comparison of the parameters of GPT-3 (which is what ChatGPT is mostly based on) and its successor GPT-4, which will probably be released this year.


Analogous Questions

At the outset of this blog, we pondered the question of competence with memorization versus resource use. If a reference can be used to successfully complete a task, why can't ChatGPT be used for the same purpose? One might argue that one was used to figure out what to do next (performance-based examination), and the other was used to answer a question (knowledge-based examination). We could challenge that logic further:


Skill-based examination: Watching whether the student recognizes pulselessness and starts chest compressions.

Knowledge-based examination: "Your patient is pulseless. What is the next most appropriate action?"


We're testing the student on the same knowledge set but in different formats.


When you really boil this conundrum down to its bones, you come to the question of what is worth memorizing and what is not. Even when exams introduce higher-level questions, test-prep materials start copying those question types, and the memorization starts all over again.

I've certainly had the experience, and you probably have as well, of memorizing a concept for an exam only to forget that concept shortly afterward. If you're like me, this meme hits a little too close to home:

If you were forced to take every exam you've ever taken without any warning, how would you think you would do? Just the thought of this scenario turns my stomach. Even as I sit here and write this blog, I'm seriously questioning which year Columbus sailed the ocean blue - 2002, if my memory serves me correctly... 🤨


The point is that we teach and test a lot of rote memorization, which is becoming an increasingly antiquated way of displaying intelligence. As our ability to access answers increases with AI's large language models, we must evolve with the systems that serve us. Consider the calculator as an example.


All math exams used to be performed without a calculator. Why? Because "you're not always going to have a calculator in your pocket." That concept didn't age very well, and I certainly wouldn't want anyone to give me a medication dose they think they calculated correctly in their head. Now, there are apps that can scan a written problem, show you the solution, explain every step used to arrive at it, and explain the context of the question, all from a split-second image from your camera. We are now experiencing the dawn of LLMs - large language models. They are the equivalent of a calculator, but for words (and they're already in our pockets).


High-level math exams test your ability to work with a calculator. This is because as problems become more complex, you would waste too much time doing each calculation by hand. Some calculations are too complex to be performed without a calculator (otherwise, they would take all day). However, we're missing an essential piece of this comparison - background knowledge.


Background Knowledge

Just because you've been given a resource doesn't mean you know how to use it. You need a certain familiarity with a resource to use it effectively. However, there's a catch to that logic.


With text-based inputs, even a layperson can pass the USMLE (a human actually isn't even necessary). However, in real-life scenarios or simulations, the user of that resource needs to know what questions to ask. If the student were being examined on how to perform a skill, there would be a completely different workflow. The AI might only be questioned about peripheral items, such as finding normal vital signs for a pediatric patient, looking up a medication dose, showing how to mix that medication properly, displaying a protocol, interacting with a safety checklist, alerting a hospital, speaking to the clinician about vital sign trends, setting reminders, and so on.


We've been pushing clinicians to do simulations like they're really caring for a patient - to prepare exactly like they plan to practice in the field. As AI gets better, more relevant to our field, and integrated with our systems, can we really say that we're preparing students for the field without teaching them how to integrate with these AI systems?


Measuring Success

This still doesn't answer the question of what is worth memorizing and what is not. However, maybe that's the wrong question to ask. We need students to have a strong knowledge base, but we also need to avoid wasting their time with needless rote memorization. Where is the middle ground between these two? The answer is much simpler than we realize, and some people have already figured it out.


I've taken a few courses in which most of the exam is based on a visual interface. These interfaces don't need to be high-definition, and some of them are actually pretty simplistic. They involve drag-and-drop items, vital signs on a monitor, a patient you can see, a historian giving you information, etc. Here are a couple of examples I found through a quick Google image search:

These interfaces could still allow the student to use resources like an AI, but the final treatment decision would be up to the clinician. This would train them to work with their resources while avoiding the copy-and-paste method (which is what allowed AIs like ChatGPT to pass the USMLE). These interfaces also avoid the in-person stations that the NREMT is sunsetting, since many of those skills came down to rote memorization of the check-off sheets.


These scenarios could also be reasonably randomized, with generally accepted ranges that the system would reference as objective success points. These systems could be in VR (like the first image) or on a screen (like the second image). Acceptable time frames would be used to avoid over-use of the AI, ensuring the student has enough background knowledge to care for the patient. These simulations don't have to be limited to patient-provider interfaces, either. MCI scenarios could be viewed from a map perspective as the student organizes resources. Medication scenarios could include interactive interfaces for drawing up medications and preparing infusions. Cellular and mechanism-of-action questions could be viewed on a micro level to assess knowledge on a drag-and-drop interface.
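Here's a minimal sketch of what "randomized, with accepted ranges as objective success points" might look like under the hood. This is Python with made-up ranges and a made-up time limit, purely for illustration of the structure:

```python
import random

# Hypothetical scenario template - the ranges and time limit are illustration
# values only, not clinical targets.
SCENARIO = {
    "age_years": (1, 8),              # randomize a pediatric patient in this range
    "target_systolic_bp": (90, 110),  # accepted post-treatment range
    "target_spo2": (94, 100),         # accepted post-treatment range
    "time_limit_s": 300,              # time budget that limits over-reliance on the AI
}

def generate_patient(scenario: dict) -> dict:
    """Randomize the patient so the scenario can't simply be memorized."""
    low, high = scenario["age_years"]
    return {"age_years": random.randint(low, high)}

def grade(scenario: dict, final_vitals: dict, elapsed_s: float) -> bool:
    """Pass if every final vital lands in its accepted range within the time budget."""
    bp_low, bp_high = scenario["target_systolic_bp"]
    spo2_low, spo2_high = scenario["target_spo2"]
    return (bp_low <= final_vitals["systolic_bp"] <= bp_high
            and spo2_low <= final_vitals["spo2"] <= spo2_high
            and elapsed_s <= scenario["time_limit_s"])

patient = generate_patient(SCENARIO)
print(patient)
print(grade(SCENARIO, {"systolic_bp": 102, "spo2": 97}, elapsed_s=240.0))  # True
```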


A mobile app that has become very popular is called Brilliant. It uses images to test your knowledge. The example below asks the student which way the blue gear will rotate based on the knowledge that the yellow gear is rotating clockwise. It's very intuitive and makes you use your brain for real-world problem-solving. These applications are good enough to teach the hard sciences, so this method wouldn't be difficult to use for medical education.
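The reasoning behind that gear question is a nice example of one small rule doing all the work: two meshed gears always turn in opposite directions, so direction simply alternates along the chain. A tiny Python sketch of that rule (the gear positions here are hypothetical, since I can't reproduce the app's exact image):

```python
def gear_direction(first_direction: str, position: int) -> str:
    """Direction of the gear at `position` in a chain of meshed gears
    (the first gear is position 1). Meshed neighbors counter-rotate,
    so direction alternates with each step along the chain."""
    flipped = {"clockwise": "counterclockwise", "counterclockwise": "clockwise"}
    return first_direction if position % 2 == 1 else flipped[first_direction]

# A yellow gear (position 1) turns clockwise; a gear meshed directly
# with it (position 2) must turn the other way.
print(gear_direction("clockwise", 2))  # counterclockwise
```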


We already encounter little tests like this on an average day on the internet.

These image recognition tests ensure that a bot is not running the computer. The principle is the same - using an interactive image framework to place a layer of separation between the test and the human. However, humans still need background knowledge from everyday life to select the correct images. This method could be scaled up to the level we need to assess competence on complicated questions.


The Is-ought Fallacy

People often say, 'we've always done it this way.' That saying has become a four-letter word in medicine in recent years. The memeification of this saying stems from the is-ought problem, which is described as follows:


"The is-ought fallacy occurs when the assumption is made that because things are a certain way, they should be that way... In effect, this fallacy asserts that the status quo should be maintained simply for its own sake. It seeks to make a value of a fact or derive a moral imperative from describing a state of affairs."()

Exams have been the same for a long time, and so has the way we prepare for them. However, with the advent of AI, it's time to reassess how clinicians display competence. If we want clinicians to use references in the field to avoid mistakes, why are we testing them without those references? Perhaps I can play a bit of devil's advocate.


Bringing back the example of a calculator, you don't get to use one in lower-level math classes. You gradually build your skills and do problems by hand to prove your competence. Then, as you master that level of knowledge, you're allowed to automate those calculations at the next level. This baseline understanding of how the whole system of mathematics works is what allows mathematicians to make breakthroughs in their field. Returning to medicine, we must teach students from the ground up before introducing references. However, I don't think this has to be done with rote memorization.


Team discussion, debate, presentations, trial and error, real-life problem-solving, skillful lectures, simulation, games, reading, etc., are all critical components of building a solid foundation of knowledge. Some of the most interesting and complicated things I've ever learned came from watching a YouTube video and then trying it out myself - no exam needed. I'm not proposing we eliminate exams, but I believe there are better ways to engage students than showing them bullet points they'll be tested on the next day. However, our educational system is currently built this way - but ought it be?

End Game

"Begin with the end in mind" - Dr. Stephen R. Covey


When we think about our end game for students, what is it? Most would say to ensure they can take really good care of patients. It all comes down to obtaining the best possible outcome for every patient they encounter. If we have tools to assist clinicians in doing this, shouldn't we teach them how to use them?


We can fight AI with locked-down classrooms, no internet access, scanning papers for signs that an AI wrote them, and forbidding resources during testing, but I would question what we're trying to accomplish by doing so. If we're going to 'play as we practice,' we would want our students to learn to use the resources that will be available to them while they're still learning. We need to find a way to ensure baseline knowledge without wasting their time on too much memorization.


We are in the same situation as math teachers when the calculator was invented. How will we respond?

Conclusion

AI will be integrated into everything we do in medicine, perhaps sooner than we think. It will help keep our patients safe, our providers informed, and our communications connected, and it will change how we think about caring for patients (if implemented correctly).


Systems like ChatGPT are only the beginning, and many companies are already working on, or have already completed, integrations in other healthcare specialties. We will see the same integration happen in EMS. While the response to this might be fear - of losing our knowledge base, of over-automating decision-making, of cheating on tests and papers, and the like - there is also a very different outlook we could adopt. We could view this as an opportunity to improve patient safety, offload menial work, automate documentation and communications, and always have a smart partner in our pockets to bounce ideas off of.


One day, we'll wonder how we ever lived without it.



Videos

If you’re interested in the topic of AI, here are a few videos you might find entertaining.


My favorite news program is Breaking Points with Krystal and Saagar. They were recently guests on the JRE and had some great points about ChatGPT. I especially appreciate that Krystal used the calculator comparison (I swear I wrote that before I watched this video).

Check out the Breaking Points channel on YouTube:


Lex Fridman, one of my favorite people on earth, also did a JRE interview about ChatGPT that was a little more technical, but extremely informative and interesting.

Check out Lex Fridman's channel on YouTube:


Sam Altman, the CEO of OpenAI (creators of ChatGPT), is interviewed about this technology.




