Using FHIR to Optimize Clinical Artificial Intelligence

I have to admit, I have a lot of #FOMO with the ubiquitous hype of artificial intelligence (AI), especially its use in healthcare. It's exciting to see the variety of applications being explored, notably those built on large language model (LLM)-based foundation models, like ambient scribes and semantic search, but I'm most intrigued by clinical reasoning. At the same time, I'm extremely skeptical of deploying these models into production use cases in our healthcare systems from a safety and efficacy perspective.

Because of this, I've embarked on a bit of a side quest in 2024, focusing my efforts on diving deeper into Fast Healthcare Interoperability Resources (FHIR), because I think it allows us to aggregate higher quality data from disparate systems that can be used to create benchmark datasets for improving these foundation models.

In this article, I want to explore a few thoughts I have about training, evaluating, and deploying clinical AI, with a focus on clinical reasoning, into production systems. For the purposes of this article, I use LLMs, foundation models, and clinical AI interchangeably, with clinical AI referring specifically to the reasoning application.

The Three Epochs of AI

The volume and pace at which AI news is published, and its associated jargon, present a significant hurdle for many of us in healthcare trying to get a quick snapshot of what matters and what doesn't. I've largely tuned out most AI-related information because it's so difficult to keep up. To this end, I really appreciate the framing of how to think about AI and its implications in the Three Epochs of Artificial Intelligence in Health Care in JAMA by Michael Howell, MD, MPH, Greg Corrado, and Karen DeSalvo, where they categorize AI into the following:

AI 1.0 - Symbolic AI

Described as the "encoding of human knowledge into computational rules, as well as probabilistic models".

AI 2.0 - Deep learning

Described as the "models learn from examples labeled with ground truth" and are "task-specific, meaning they do one thing at a time, and they primarily focus on classification and prediction".

AI 3.0 - Foundation models

Described as having "fundamentally new (and potentially transformative) capabilities, as well as new kinds of risks, such as hallucinations". The biggest difference is that "these models can do many different kinds of tasks without being retrained on a new dataset".

Using this framework has helped me think through what a path towards deploying clinical AI could look like, especially since AI 1.0 and 2.0 are already being used throughout healthcare today. Conversely, AI 3.0 is new and warrants a fresh perspective on how we evaluate it.

The Opportunities

Making decision support tools used in direct patient care more efficient and scalable is, in my opinion, one of the primary value propositions of integrating clinical AI into healthcare. Reducing administrative burden is probably second, but not by much. To keep this section succinct, I have two major thoughts here:

  1. The magnitude, complexity, and pace at which new medical research and clinical practice guidelines are introduced

  2. The pace at which tools (e.g., EHRs, clinical decision support) are implemented and/or updated to reflect best practices

Clinical AI presents us with an opportunity to tackle both through its ability to ingest large amounts of medical literature (#1) and reason about it to provide decision support to our healthcare teams (#2). I'd love to dive deeper into the applications for reducing administrative burden, but I'll save that for a separate article as there's a lot we can explore there. In the interim, check out my MPH capstone from 2020, I Analyzed the 21st Century Cures Act, where I discuss reducing clinician burden attributable to health information technology & electronic health records.
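Before moving on, here's a hand-wavy illustration of pairing #1 and #2: a retrieval-style sketch where guideline passages are embedded, the relevant ones are retrieved for a clinical question, and the results are handed to a model to reason over. Everything in it is a hypothetical stand-in; the two guideline snippets are loose paraphrases, not authoritative guidance, and a production design would look very different.

```python
# Toy retrieval-augmented sketch: ground a model's reasoning (#2) in a
# guideline corpus it can ingest at scale (#1). All contents are hypothetical
# stand-ins; the snippets are loose paraphrases, not authoritative guidance.
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

guidelines = [
    "ACC/AHA 2017: consider antihypertensive therapy at BP >= 130/80 in higher-risk adults.",
    "ADA: metformin is a first-line pharmacologic option for type 2 diabetes.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedding model
corpus_vecs = embedder.encode(guidelines, normalize_embeddings=True)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the k guideline snippets most similar to the clinical question."""
    q_vec = embedder.encode([question], normalize_embeddings=True)
    scores = corpus_vecs @ q_vec.T  # cosine similarity (vectors are normalized)
    top = np.argsort(scores.ravel())[::-1][:k]
    return [guidelines[i] for i in top]

# The retrieved snippets would then be passed to an LLM as context, so its
# decision support leans on current guidance rather than stale training data.
print(retrieve("When should I start blood pressure medication?"))
```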

The Challenges

Discussing the challenges could easily be its own article, if not multiple, but I do want to focus on a select few.

Electronic Health Records (EHRs) & Data Silos

If you've ever worked with someone in healthcare informatics, you've probably heard them say:

If you've seen one EHR, you've seen one EHR

Having personally worked as a pharmacist end user and informatics pharmacist in various hospitals and health systems across multiple states, using EHRs from multiple vendors like Epic, Cerner, Allscripts, & McKesson, let me reinforce that idea: the data in one hospital is usually not generalizable to another hospital, even if they use the same EHR. The devil is always in the details, and the nuance is in how each local site's team implements the EHR. Essentially, how interoperable data is between organizations is heavily influenced by the individuals who implement and maintain our EHR systems.

I'm a huge fan of the HITECH Act of 2009, as it was likely why I had a job in 2015 working on the EHR Project (aka Plummer Project) at Mayo Clinic, but I do believe that the implementation timelines of HITECH were so aggressive that we now have a non-trivial number of data silos across the United States.

These data silos underpin why I'm quite interested in FHIR, because I imagine most clinical AI is trained on somewhat homogeneous datasets like the publicly available MIMIC, which is from one institution, or proprietary datasets from an institution's own EHR system. Check out Clinical Text Datasets for Medical Artificial Intelligence and Large Language Models - A Systematic Review to dig deeper. I think FHIR, because of the Cures Act, will enable us to build higher quality datasets that span multiple EHRs from multiple hospitals and health systems. In turn, these datasets convert the problem from a data silo issue to an optimization one, similar to what ImageNet did for computer vision & deep learning or what SWE-bench is likely doing for AI programmers. Y Combinator did an awesome round table on this last month that you can check out here: 10 People + AI = Billion Dollar Company?

Thus, FHIR -> better dataset -> better clinical AI
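To make that concrete, here's a minimal sketch of what pulling standardized data across organizations could look like. The base URLs and patient ID are hypothetical placeholders, but the search pattern itself is the standard FHIR R4 REST API that any conformant server exposes.

```python
import requests

def fetch_observations(fhir_base: str, patient_id: str) -> list[dict]:
    """Fetch laboratory Observations for one patient via a standard FHIR search."""
    resp = requests.get(
        f"{fhir_base}/Observation",
        params={"patient": patient_id, "category": "laboratory", "_count": 100},
        headers={"Accept": "application/fhir+json"},
        timeout=30,
    )
    resp.raise_for_status()
    bundle = resp.json()  # FHIR searches return a Bundle resource
    return [entry["resource"] for entry in bundle.get("entry", [])]

# Because every conformant server exposes the same resource shapes, the same
# code can, in principle, aggregate training data across organizations.
# Both base URLs and the patient ID below are hypothetical placeholders.
for base in ("https://hospital-a.example/fhir", "https://hospital-b.example/fhir"):
    observations = fetch_observations(base, "example-patient-id")
    print(base, len(observations))
```

That's the whole appeal in miniature: one query shape, many institutions, and the heterogeneity gets pushed into the standard instead of into every dataset pipeline.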

Hallucinations

Of course, when it comes to challenges with LLMs, hallucinations are probably the single biggest hindrance to safely deploying LLMs to production. More on that later.

Necessity of data standards

This has been on my mind ever since Peter Lee posed the question of whether FHIR is actually needed in a world with pervasive LLMs in the last 4 minutes of his March 2023 lecture: The Emergence of General AI for Medicine. In short, if LLMs like OpenAI's ChatGPT seem to be quite excellent at converting between formats to exchange data (e.g., org A -> FHIR -> org B), why do we need FHIR at all when we can go from org A -> LLM -> org B?

It seems like Peter wasn't the only one thinking about this, either, as Gabriel Brat, MD, MPH, Josh Mandel, MD, and Matthew McDermott recently published an article: Do We Need Data Standards in the Era of Large Language Models? It was in response to another article by Yikuan Li et al. titled FHIR-GPT Enhances Health Interoperability with Large Language Models, where they found their model improved the conversion of clinical narratives to FHIR resources compared to existing NLP pipelines. Isaac Kohane tweeted about it as well.
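To make the org A -> LLM -> org B idea concrete, here's a toy sketch of asking a general-purpose LLM to emit a FHIR resource from a free-text order. To be clear, this is my own rough approximation, not FHIR-GPT's actual pipeline: the note, the prompt, and the model choice are all assumptions, and the output is only a candidate resource until it passes a FHIR validator.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical one-line medication narrative; FHIR-GPT's actual prompts
# and pipeline are more involved than this sketch.
note = "Start lisinopril 10 mg by mouth once daily for hypertension."

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model; the model choice is an assumption
    messages=[
        {
            "role": "system",
            "content": (
                "Convert the clinical narrative into a single FHIR R4 "
                "MedicationRequest resource. Respond with JSON only."
            ),
        },
        {"role": "user", "content": note},
    ],
)
print(response.choices[0].message.content)  # candidate FHIR JSON, still unvalidated
```

Notice that even in this toy version you still need the FHIR spec to validate against before the output is safely exchangeable, which hints at why I don't think the standard disappears.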

In my mind, I don't think FHIR, or data standards for that matter, will ever fully go away, and I imagine we will continue to need them in the future. The original article seems to align with this in its conclusions. It's a great read and I encourage you to check it out for yourself. Besides, I'm betting that FHIR will be a prerequisite for us to further optimize our clinical AI models and that we should invest more into it to unlock more opportunities.

Evaluating Clinical AI

How we evaluate the safety and efficacy of clinical AI was my true impetus for writing this article. I'm still forming my opinions, but I wanted to highlight one of my most enjoyable reads of 2024: The Regulation of Clinical Artificial Intelligence by David Blumenthal and Bakul Patel. This ties back to the framework I shared at the beginning, as David and Bakul suggest that evaluation of clinical AI takes on a different shape than it has in the past and, consequently, should be regulated differently. In their article, they differentiate this by defining AI, for the purposes of regulation, as:

  1. Pregenerative artificial intelligence (PGAI) or

  2. Generative artificial intelligence (GAI)

The former likely represents AI 1.0 and 2.0, and the latter AI 3.0.

There's a lot of insightful information about the approach our federal government is taking to regulate this, via software as a medical device (SaMD) and Predetermined Change Control Plan (PCCP), but I want to jump right into their proposal for regulating GAI:

...regulate them less like devices and more like clinicians.

It's a fascinating idea to me and they provide some concrete ideas:

  1. Prescribed training on a specified body of information - I think of the clinical practice guidelines we have to learn in school

  2. Obtaining passing grades on relevant clinical examinations - I think of our licensing exams like the NAPLEX for pharmacists, the NCLEX for nurses, and the USMLE for physicians (see the sketch after this list).

  3. Undergoing a period of supervised use in clinical practice - I think of our clinical rotations and post-graduate training in residencies and fellowships

  4. Regulating updates to GAI-based clinical applications - I think of our requirement to be lifelong learners and take continuing education (CE) credits

  5. Reporting periodically to regulatory authorities - I think of our reporting of CEs to our professional boards

  6. Making results of these evaluations available to the public - I think of our publicly available professional licenses as healthcare professionals
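Riffing on idea #2 above, here's what the skeleton of such an "exam" could look like in code. Everything in it is a hypothetical stand-in: the two questions, the passing threshold, and the ask_model() hook you'd wire up to the system under evaluation. Real benchmarks, like the USMLE-style questions in MedQA, are far larger and more carefully curated.

```python
# Toy skeleton for idea #2: grading a clinical AI on exam-style questions.
# The questions, answer key, passing threshold, and ask_model() hook are all
# hypothetical stand-ins for a real, carefully curated licensing-exam benchmark.

exam = [
    {"question": "First-line therapy for uncomplicated hypertension?",
     "choices": {"A": "Lisinopril", "B": "Warfarin", "C": "Metformin"},
     "answer": "A"},
    {"question": "Reversal agent for warfarin over-anticoagulation?",
     "choices": {"A": "Naloxone", "B": "Vitamin K", "C": "Flumazenil"},
     "answer": "B"},
]

def ask_model(question: str, choices: dict[str, str]) -> str:
    """Placeholder: call the clinical AI under evaluation and return its letter choice."""
    raise NotImplementedError("wire this up to the model's API")

def grade(exam: list[dict], passing_score: float = 0.75) -> bool:
    """Score the model and compare against a passing threshold, like a licensing exam."""
    correct = sum(
        ask_model(item["question"], item["choices"]) == item["answer"]
        for item in exam
    )
    score = correct / len(exam)
    print(f"Score: {score:.0%} (pass >= {passing_score:.0%})")
    return score >= passing_score
```

The interesting part isn't the grading loop; it's that ideas #3 through #6 would wrap this in ongoing, supervised, publicly reported re-testing rather than a one-time premarket clearance.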

Circling back to hallucinations, David and Bakul make a compelling point about how our current systems are "imperfect" when it comes to regulating human clinical intelligence and how "virtually every clinician, no matter how distinguished or experienced, makes errors in diagnosis and treatment during their careers". I think it's a fair argument.

Conclusion

I'm still forming my opinions around the various topics I shared in this article, but I'm quite interested in further exploring FHIR as a solution for improving our training datasets for clinical AI and, subsequently, an evaluation framework and feedback loop for how we can safely and effectively evaluate this new era of artificial intelligence. As it stands, I'm likely more pro human-in-the-loop, and I believe 'human + AI' > 'human' or 'AI' alone.

Brian Fung

I’m a Health Data Architect / Informatics Pharmacist by day, and a content creator by night. I enjoy building things and taking ideas from conception to execution. My goal in life is to connect the world’s healthcare data.

https://www.briankfung.com/