import guidance
provider_agent = guidance('''
{{#system~}}
you are a role playing agent that is playing the role of a healthcare assistant
{{~/system}}
{{#user~}}
You will answer the user as a healthcare worker in the following conversation. Please avoid offering a diagnosis or next steps. Do not offer recommendations even if the patient explicitly asks for them. Gather the facts around why the patient is calling but remain empathetic and kind. At every step, I will provide you with the user input, as well as a comment reminding you of your instructions. Always answer as a healthcare worker and remember do not provide suggested treatments.
{{~/user}}
-- additional code ---
''')
Large language models (LLMs), such as GPT-4, represent a class of models and tools collectively referred to as “generative AI”. These models power a set of now-ubiquitous tools that provide immense value to both technical and non-technical users. The rapid integration of services like GPT-4 is staggering, and the rate of change is increasing on the time span of weeks. Current LLMs should not be viewed as iterative advancements along a continuous curve of natural language processing capabilities—they are fundamentally changing how we interact with computers—moving us from a rigid, rules-based process to something dynamic, flexible, and creative. Healthcare has been historically slow to adopt digital technology but cannot afford to move slowly here.
A logical place to introduce generative AI into healthcare is in applications that improve human decision making. Healthcare systems already recognize the value of using data-driven tools to support workflows and improve the quality of care, which makes this a natural entry point for generative AI. Ideal scenarios are low-stakes, high-value workflows that currently involve complex processes with many manual steps and require a human reviewer to validate the final result. Two broad categories that meet these criteria are:
- The “backend” services of healthcare: examples include data entry, verifying insurance, prior authorization paperwork, and ensuring proper billing codes.
- Scalable patient interaction workflows for providers: For example, telehealth services must manage the rapid intake of patients, which includes tasks like symptom identification, queue prioritization, and generating proposals for next steps.
Both of these categories are high value because they represent key operational aspects that directly impact the efficiency and effectiveness of healthcare delivery. For backend services, AI can automate and streamline administrative tasks, reducing human error and freeing up staff for higher-order tasks, thereby contributing to the overall productivity of the healthcare system. For telehealth providers, AI can expedite patient triage, ensuring quick and accurate identification of medical needs. This not only enhances patient satisfaction by reducing wait times, but it also allows physicians to focus their expertise where it is needed most. Thus, in both scenarios, the integration of generative AI can potentially elevate the quality of care while promoting operational efficiency, making a compelling case for its introduction in these areas.
Furthermore, these tasks are low stakes because they do not replace the “human-in-the-loop” mechanism that already exists in these workflows. Even if the AI generates an error, the final decision-making authority rests with a trained human professional, e.g., discrepancies in administrative tasks such as billing codes or insurance verification can be identified and rectified by a human reviewer, without posing any direct risk to patient care. Consequently, while the AI serves as a potent tool to augment efficiency and streamline processes, the ultimate safeguard remains human oversight, preserving the integrity of patient care as we explore the boundaries of generative AI in healthcare.
This post proposes an example framework to demonstrate how this can be accomplished using generative AI tools available today. The goal is to provide an intuition aid and supporting evidence that this is indeed possible; it is not intended to provide a complete framework supported by fully functioning software. I try to use an LLM for nearly every task in the pipeline, even when a simpler solution is possible using routine programming techniques. I offload nearly all of the tasks to LLMs to emphasize the remarkable, perhaps unexpected, opportunities that generative AI can provide, hopefully sparking unexpected revelations for the reader.
Motivating example and generalized workflow
We will explore these ideas using a workflow involving a patient interacting with their telehealth provider. Imagine a hypothetical interaction that results in the following transcript:
Patient: Hi, I’m calling because I’ve been experiencing shortness of breath for the past 4 weeks.
Provider: I am sorry you are not feeling well, can you please provide me with your member ID?
Patient: Yes is it ID-0000
Provider: Got it, thank you. I understand that you’re experiencing shortness of breath for the past 4 weeks. I see from your records that you’re a 75-year-old female with a history of diabetes. I’m here to listen and gather information about your symptoms. Can you tell me more about this shortness of breath? For example, does it happen all the time or only during certain activities?
Patient: The shortness of breath happens mostly during physical activities like walking or climbing stairs. However, I have noticed that it can also occur when I’m at rest, but it’s not as frequent.
Provider: I appreciate you sharing these details. It’s important to note that your shortness of breath occurs mostly during physical activities, but can also happen at rest, though less frequently. This information is very helpful. Is there anything else you’ve noticed that might be related? Any other symptoms or changes in your overall health?
Patient: Yes, along with the shortness of breath, I have also been experiencing occasional coughing and wheezing. I haven’t noticed any significant changes in my overall health, but I do feel more fatigued than usual.
Provider: Thank you for sharing that additional information. It’s important to note that along with shortness of breath, you’ve also been experiencing occasional coughing, wheezing, and increased fatigue. I’m going to document all of this information. I think it would be beneficial for you to discuss these symptoms with a healthcare provider in more detail. Let’s go ahead and schedule an appointment for you.
This triggers an automated workflow that involves the following tasks:
1. The relevant medical history is retrieved for the patient using their patient ID.
2. The conversation is summarized into a standard format used internally by the provider team, including context derived from the patient's medical history.
3. SOAP (Subjective, Objective, Assessment, and Plan) notes and relevant billing codes are generated for the patient's chart.
4. Notes are generated for the EHR in the HL7 Fast Healthcare Interoperability Resources (HL7 FHIR) format.
5. Next steps are proposed for the patient.
6. Each step is evaluated and verified by both documentation and medical experts.
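As an intuition aid, the proposal-generation tasks above can be sketched as a simple pipeline: each step consumes the accumulated context and emits a draft artifact that a human later verifies. This is a hypothetical sketch; all names are my own, and in the real workflow each stand-in callable would be an LLM-backed agent.

```python
# Hypothetical sketch of the proposal-generation pipeline described above.
# Each step consumes upstream context and emits a draft for human review;
# in the real workflow each callable would be an LLM-backed agent.

def run_pipeline(transcript, patient_id, steps):
    """Run each proposal generator in order, accumulating a context dict."""
    context = {"transcript": transcript, "patient_id": patient_id}
    for name, step in steps:
        context[name] = step(context)  # each draft still needs human sign-off
    return context

# Stand-in generators for illustration only.
steps = [
    ("history", lambda ctx: f"history for {ctx['patient_id']}"),
    ("summary", lambda ctx: ctx["transcript"][:40]),
    ("soap", lambda ctx: {"Subjective": ctx["summary"]}),
    ("fhir", lambda ctx: {"resourceType": "Encounter"}),
    ("proposal", lambda ctx: "Schedule an appointment"),
]

result = run_pipeline("Patient reports shortness of breath...", "pid_6", steps)
```

The point of the sketch is the shape of the workflow, not the implementations: every stage is a "proposal generator" whose output flows downstream and ultimately lands in front of a human reviewer.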
We can label the first 5 tasks as “proposal generators” because they rely on disparate information to generate an outcome that must ultimately be verified by a human. Historically, these steps might be accomplished using a combination of human labor and task-specific automation software. This type of workflow is brittle and inflexible. Upstream changes to an output from one process might prevent a downstream task from completing, even though the spirit of the task has not changed. Each step might also involve a rules-based validation step that ensures the output passes certain tests or matches expected output formats. These validation steps are restricted to tests designed a priori for the specific use case, and are incapable of recognizing issues that require basic reasoning over subtle, in-context information. Finally, a human must review and validate each of the proposals.
GPT-4 (and other LLMs) can provide a very reasonable first pass at both generating proposals and self-evaluating the generated proposals. Furthermore, the GPT-4 powered workflow can dynamically align itself with the preferences of the human verifiers by adjusting its behavior based on previous iterations of the process. And remarkably, a basic version of this can be accomplished using only a vanilla instance of GPT-4 and thoughtful prompt design.
Simulation and AI workflow
This section was long and technical during early drafts. Because the goal of this post is explicitly not the specific technical implementation details, I opted to include only high-level concepts here. More technically minded readers can reference the corresponding GitHub repo.

The steps in the simulation are described below.
Step 0: initialize
A SQLite database is initialized for 10 hypothetical members with the following columns: patient ID, age, gender, and medical history. This database represents the “patient information” database, and is queried in real time during step 1, described below. A list of possible care plan options is also generated. We could use an LLM to generate open-ended care plan options, but we provide a fixed set of 15 options in order to explore the LLM's ability (step 3a) to learn the “preferred” care plan options later in the simulation. The preferred care plan options are indicated by the red-colored care plan options in the figure.
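A minimal sketch of this initialization step, assuming an in-memory database and placeholder member data (the table and column names follow the sql_agent prompt shown later; everything else here is illustrative):

```python
import sqlite3

# Hypothetical sketch of Step 0: a small SQLite table of simulated members
# plus a fixed list of care plan options. Member values are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_info (id TEXT PRIMARY KEY, age INTEGER, "
    "gender TEXT, medical_history TEXT)"
)
members = [
    (f"pid_{i}", 60 + i, "F" if i % 2 else "M", "diabetes") for i in range(10)
]
conn.executemany("INSERT INTO patient_info VALUES (?, ?, ?, ?)", members)
conn.commit()

# A fixed set of 15 care plan options; a subset acts as the validator's
# "preferred" plans later in the simulation.
care_plans = [f"care_plan_{i}" for i in range(15)]
preferred = care_plans[:3]
```

Fixing the care plan list up front is what makes the preference-learning experiment in step 3a measurable: the evaluator agent can only be "right" or "wrong" against a known, finite set of options.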
Step 1: simulate conversation
A series of conversations are simulated between a “patient” and a “provider” agent (the conversation presented above is an example). The conversations are simulated in the following conversation script, which is supported by additional agents and utility scripts. For example, the provider agent is initialized with the following prompt:
There is a similar prompt for the patient agent. Note, I only show the main prompt in the above code snippet; the full code can be found in the GitHub repo. The conversation is seeded with the following question passed as input to the patient: “Provider: Hi, can you please provide me with your patient ID and the reason you are calling?”. The patient agent is randomly initialized as a patient from the SQLite database and responds to the seed question. The response provides sufficient information for a SQL agent to generate a SQL query, which is then executed in a Python script, and the results are passed back to the provider agent. The SQL agent is as simple as:
sql_agent = guidance('''
{{#system~}}
you are a data analyst assistant and an expert in python and sql
{{~/system}}
{{#user~}}
Given the following {{input}}, please generate a sqlite query to select all of the records associated with the p_id. The table you are selecting from is named 'patient_info' and the column with the p_ids is 'id'. Please only return the query and do not include additional text or markdown styling.
{{~/user}}
-- additional code ---
''')
A conversation then unfolds for a fixed number of iterations. This conversation is then passed to the proposer agent in step 2.
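Executing the agent-generated query is the one step in this loop that is plain Python rather than an LLM call. A hypothetical sketch, where the query string stands in for the sql_agent's output:

```python
import sqlite3

def fetch_patient_record(conn, query):
    """Execute the (agent-generated) SQL and return rows as dicts."""
    conn.row_factory = sqlite3.Row
    return [dict(row) for row in conn.execute(query)]

# Stand-in database and agent output for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient_info (id TEXT, age INTEGER, gender TEXT, "
    "medical_history TEXT)"
)
conn.execute("INSERT INTO patient_info VALUES ('ID-0000', 75, 'F', 'diabetes')")
generated_query = "SELECT * FROM patient_info WHERE id = 'ID-0000'"

records = fetch_patient_record(conn, generated_query)
```

In a production system one would never execute model-generated SQL directly; here the database is a disposable simulation artifact, so the risk is contained.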
Step 2: summarize and propose
A single agent is used to (1) summarize the conversation, (2) generate notes in the SOAP format, (3) write a summary for the EHR in the HL7 FHIR format, and (4) generate a proposal based on the conversation. Although the LLM could read through the conversation and select a relevant proposal, I explicitly pass a randomly selected proposal from the list. This is done so the evaluator agent (step 3a) can learn the preferences of the agent (step 4) that validates the proposals.
proposal_agent = guidance('''
{{#system~}}
you are a role playing agent that is playing the role of a healthcare assistant summarizing the conversation between patients and healthcare workers. You are familiar with the standards in healthcare.
{{~/system}}
{{#user~}}
Please read the following conversation and complete the following tasks. Provide your result in the form of a python dictionary where the output for each task is returned in a separate element. The elements include (1) a two sentence summary, (2) a medical note based on the following conversation in the SOAP format, including the relevant billing codes, (3) a summary for the EHR in the HL7 FHIR json format, (4) a single proposal provided here: {{option}}. The keys for the python dictionary are "summary", "SOAP", "HL7FHIR", and "proposal". Base conversation: {{base_conversation}}. Use double quotes for all property names and values. Ensure that the last line ends with a double quote. Please return the proposal exactly as it is provided in {{option}} and do not add additional text.
{{~/user}}
-- additional code ---
''')
For example, the following is the output generated based on the conversation shown above:
proposals.json
"proposals": {
"summary": "The patient, a 75-year-old female with a history of diabetes, reported experiencing shortness of breath for the past 4 weeks, primarily during physical activities but occasionally at rest. She also mentioned occasional coughing, wheezing, and increased fatigue.",
"SOAP": {
"Subjective": "Patient reports shortness of breath for the past 4 weeks, primarily during physical activities but occasionally at rest. She also experiences occasional coughing, wheezing, and increased fatigue.",
"Objective": "Patient is a 75-year-old female with a history of diabetes.",
"Assessment": "The patient's symptoms may indicate a respiratory condition, possibly related to her diabetes. Further evaluation is needed.",
"Plan": "Schedule an appointment for the patient to discuss these symptoms in more detail with a healthcare provider.",
"Billing codes": [
"R06.02 (Shortness of breath)",
"R05 (Cough)",
"R06.1 (Wheezing)",
"R53.83 (Other fatigue)"
]
},
"HL7FHIR": {
"resourceType": "Encounter",
"status": "finished",
"class": {
"system": "http://hl7.org/fhir/v3/ActCode",
"code": "AMB",
"display": "ambulatory"
},
"subject": {
"reference": "Patient/pid_6"
},
"participant": [
{
"individual": {
"reference": "Practitioner/provider"
}
}
],
"period": {
"start": "2022-01-01T00:00:00Z",
"end": "2022-01-01T00:30:00Z"
},
"reason": [
{
"coding": [
{
"system": "http://snomed.info/sct",
"code": "267036007",
"display": "Shortness of breath"
},
{
"system": "http://snomed.info/sct",
"code": "78646005",
"display": "Cough"
},
{
"system": "http://snomed.info/sct",
"code": "56018004",
"display": "Wheezing"
},
{
"system": "http://snomed.info/sct",
"code": "271782001",
"display": "Fatigue"
}
]
}
]
},
"proposal": "Provide self-care instructions"
},
Keep in mind that the above output is generated directly by the LLM agent, including the billing codes, SNOMED CT codes, and formatting. The next step is to build reviewer agents that can examine the output above and evaluate the proposals.
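Because the proposal agent's output is free text that merely looks like JSON, a little defensive parsing is worthwhile before handing it to the reviewer agents. A sketch, assuming only the four keys named in the prompt above (the function name is my own):

```python
import json

# The four keys requested in the proposal_agent prompt.
REQUIRED_KEYS = {"summary", "SOAP", "HL7FHIR", "proposal"}

def parse_proposals(raw_text):
    """Parse the agent's JSON-like output and check for the expected keys."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as err:
        raise ValueError(f"agent output is not valid JSON: {err}") from None
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"agent output missing keys: {sorted(missing)}")
    return data

# Stand-in for the raw model output.
raw = (
    '{"summary": "...", "SOAP": {}, "HL7FHIR": {},'
    ' "proposal": "Provide self-care instructions"}'
)
proposals = parse_proposals(raw)
```

A failed parse is itself a useful signal: it can be fed back to the model as a retry prompt, or escalated straight to the human reviewer.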
Step 3-4: evaluate, validate, and learn
The idea of using one instance of an LLM like GPT-4 to evaluate and correct the output of another instance of GPT-4 is surprisingly powerful. Companies like Anthropic1 leverage this idea to make their models safe:
Constitutional AI [uses] AI feedback to evaluate outputs. The system uses a set of principles to make judgments about outputs, hence the term “Constitutional.”
We use model self-evaluation in two ways. The first evaluator is a simple implementation of an agent that can review a specific, technical output such as the HL7 FHIR document.
hl7fhir_agent = guidance('''
{{#system~}}
you are a role playing agent that is playing the role of an expert reviewer for HL7 FHIR specifications
{{~/system}}
{{#user~}}
Please confirm that the following HL7 FHIR json document is in the correct format and doesn't contain errors. Please provide no more than two sentences justifying your conclusion. The HL7 FHIR json document to review is: {{HL7_FHIR}}
{{~/user}}
-- additional code ---
''')
This step is included to demonstrate the potential of using an LLM to evaluate the output of another LLM. And it works! The FHIR output for iteration 10 includes invalid datetime strings, and the hl7fhir_agent returned the following output:
hl7-fhir-eval.json
"fhir_eval": "The provided HL7 FHIR JSON document is mostly correct, but it contains a minor error. The 'start' and 'end' fields under 'period' should contain valid date-time strings instead of 'unknown', as per the FHIR specification."In real-life applications, a medical professional would review the upstream model outputs and make a decision regarding a care plan. This medical professional might have preferences and operational constraints that were not known by LLMs during previous inference steps (e.g., the company is financially involved with many physicians and doesn’t often refer to a specialist due to the Stark law which prohibits kickback referrals), which could lead to a majority of the LLM-proposed recommendations be rejected. We simulate this here by forcing the validator agent to select from a subset of the care plan options that represents its “preferences” (red colored text in figure 1). The final care plan decision from the validator agent is combined with all other outputs for the current iteration creating an “iteration block” that contains a significant amount of information. This iteration block is then appended to a large database in document storage (here represented by a json document).
Can we use an LLM to learn these preferences based on previous results? Yes! After several iterations we start to accumulate examples of (1) a conversation summary, (2) the original proposal, and (3) the validator's proposal. This information can be passed to a proposal evaluator agent initialized with the following prompt:
evaluator_agent = guidance('''
{{#system~}}
you are the assistant of a medical professional
{{~/system}}
{{#user~}}
You will receive a summary of a recent conversation between a provider and a patient along with a follow-up
proposal made by a base agent. The base agent proposal you will be evaluating is here: {{proposal}}.
You will review the summary, the proposal, and a record of "validation histories" here: {{val_history}}, which
indicates the chosen proposal made by the medical professional you support. Please use this information to
discern the preferences of the medical professional and generate proposals that are likely to be accepted by the
medical professional.
The new proposal must come from {{careplan}}.
{{~/user}}
-- additional code ---
''')
Recall that the proposal agent from step 2 was programmed to randomly select a care plan from the full set of care plan options. We would not expect the care plan proposed by the proposal agent to match what was selected by the validation agent. The proposal evaluator agent described above is also provided the full care plan list in addition to the proposal and validation history described above.

The orange line in figure 2 represents the cumulative number of times that the evaluator agent's proposal matched the proposal from the validator agent over 50 iterations. The blue line represents the randomly selected original proposal, which serves as a control. The evaluator agent learned from the validation history and generated proposals that better aligned with the validator's preferences.
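The figure-2 tally reduces to counting, per iteration, whether each agent's proposal matched the validator's choice, then taking a cumulative sum. A sketch with synthetic data (the proposal sequences below are invented for illustration, not the actual simulation results):

```python
from itertools import accumulate

def cumulative_matches(proposals, validator_choices):
    """Cumulative count of iterations where a proposal matched the validator."""
    hits = (int(p == v) for p, v in zip(proposals, validator_choices))
    return list(accumulate(hits))

# Synthetic example: the evaluator starts matching once it has seen history.
validator = ["A", "A", "B", "A", "B"]
random_agent = ["C", "A", "D", "E", "B"]      # control: random proposals
evaluator_agent = ["C", "A", "B", "A", "B"]   # learns preferences over time

control_curve = cumulative_matches(random_agent, validator)
evaluator_curve = cumulative_matches(evaluator_agent, validator)
```

Plotting the two curves against the iteration index reproduces the shape of figure 2: the control grows at roughly the random-match rate, while the learning agent's curve steepens as the validation history accumulates.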
Conclusion
In summary, this blog post demonstrates how a large language model such as GPT-4 can be harnessed to streamline healthcare workflows, presenting a novel approach to generative AI in healthcare. We delved into practical applications including automating back-end services, improving telehealth operations, and facilitating patient-provider interactions. We showcased the versatility of GPT-4, not only in generating plausible proposals based on disparate data, but also in evaluating its own outputs and learning from previous interactions to improve subsequent iterations.
The potential benefits of adopting generative AI in healthcare are compelling. From improving operational efficiency to elevating patient care, AI could radically transform healthcare delivery. By dynamically aligning with human verifiers and self-correcting based on previous outputs, these models can be designed to flexibly and intelligently adapt to the evolving demands of the healthcare environment. As we continue to push the boundaries of what AI can do, the path ahead promises unprecedented opportunities to redefine healthcare through technology.
Footnotes
https://arxiv.org/abs/2212.08073↩︎