Why Do LLMs Enable Digital Twins in Healthcare?
Understanding the DT-GPT Study and the Broader Shift Toward LLM-Driven Patient Forecasting
Digital twins—virtual replicas of patients that evolve over time and simulate plausible health futures—have long been considered a central ambition of precision medicine. Building such twins requires accurate, multivariable, temporally coherent forecasts of patient states, resilience to missing or noisy data, and the ability to generalize to new variables and clinical contexts. Until recently, digital twins were the domain of specialized machine learning pipelines or mechanistic disease models. These systems, however, tended to struggle with sparse real-world data, relied heavily on imputation, or lacked the flexibility needed to scale across clinical domains.
A new paper in npj Digital Medicine introduces Digital Twin–GPT (DT-GPT), an approach that leverages large language models (LLMs) to forecast patient trajectories using electronic health records encoded as text narratives. The system demonstrates state-of-the-art performance across three very different clinical settings, maintains inter-variable correlations, and shows emergent zero-shot forecasting capabilities. These characteristics are core prerequisites for functioning digital twins.
What follows is an expanded analysis of why LLMs are particularly well-suited to power digital twins, as illuminated by the DT-GPT study, and how this approach compares to prior forecasting models such as Foresight and Epic’s Cosmos.
1. How DT-GPT Works: Turning EHRs Into Language
A defining innovation of DT-GPT is its insistence on treating longitudinal EHR data as a language modeling problem rather than a traditional multivariate time-series problem.
1.1 Foundation model
DT-GPT fine-tunes BioMistral-7B, a 7B-parameter LLM further pre-trained on biomedical literature and curated medical corpora. This gives the model a prior over clinical relationships that can then be refined with patient-level data.
1.2 Encoding EHRs as narratives
Instead of tabular formats or dense tensors, patient histories are converted into chronological text templates describing:
past visits,
laboratory values,
treatments and lines of therapy,
vitals, diagnoses, ECOG scores, medications,
the forecast horizon and target variables.
This design, borrowed from recent LLM forecasting work, allows the model to interpret missingness, irregular measurement intervals, and inconsistent variable formats naturally—just as an LLM handles incomplete textual context.
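To make this concrete, here is a minimal sketch of how a patient history might be rendered as a chronological narrative. The template, field names, and wording are illustrative assumptions rather than the exact format used in the paper; the key property is that missing measurements are simply omitted rather than imputed.

```python
from datetime import date

def encode_history(patient: dict, targets: list[str], horizon_weeks: int) -> str:
    """Render a patient's longitudinal record as a chronological narrative.

    Simplified illustration of narrative encoding, not DT-GPT's exact template;
    field names and phrasing are hypothetical. Missing values are simply omitted,
    so the model sees the same kind of gaps a clinician would see in the chart.
    """
    lines = [f"Patient: {patient['age']}-year-old, diagnosis: {patient['diagnosis']}."]
    for visit in sorted(patient["visits"], key=lambda v: v["date"]):
        observed = ", ".join(f"{name} = {value}" for name, value in visit["labs"].items())
        lines.append(f"Visit on {visit['date']}: ECOG {visit.get('ecog', 'not recorded')}; {observed}.")
        if visit.get("therapy"):
            lines.append(f"Therapy administered: {visit['therapy']}.")
    lines.append(
        f"Forecast the following variables weekly for the next {horizon_weeks} weeks: "
        + ", ".join(targets) + "."
    )
    return "\n".join(lines)

example = {
    "age": 67,
    "diagnosis": "NSCLC",
    "visits": [
        {"date": date(2021, 3, 1), "ecog": 1, "labs": {"hemoglobin": 11.2}, "therapy": "carboplatin"},
        {"date": date(2021, 3, 8), "labs": {"leukocytes": 4.1}},  # ECOG not measured this week
    ],
}
print(encode_history(example, ["hemoglobin", "leukocytes"], horizon_weeks=13))
```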
1.3 Fine-tuning via next-token prediction
Fine-tuning is carried out with a masked cross-entropy loss applied only to the output portion of the prompt, where the model is asked to generate future values. There are no architectural modifications, imputations, or data normalization steps.
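A minimal sketch of that masking scheme, assuming a Hugging Face-style causal-LM setup (the paper's exact training code and hyperparameters are not reproduced here): the history tokens are assigned an ignore label so cross-entropy is computed only over the generated future values.

```python
import torch

IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def build_training_example(tokenizer, history_text: str, future_text: str, max_len: int = 2048):
    """Tokenize (history, future) and mask the history so cross-entropy is
    computed only on the future values the model must generate.

    Generic causal-LM fine-tuning sketch; DT-GPT's exact tokenization and
    settings may differ.
    """
    prompt_ids = tokenizer(history_text, add_special_tokens=False)["input_ids"]
    target_ids = tokenizer(future_text, add_special_tokens=False)["input_ids"] + [tokenizer.eos_token_id]

    input_ids = (prompt_ids + target_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + target_ids)[:max_len]

    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),        # loss is masked over the prompt portion
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
    }
```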
1.4 Diverse datasets
The model is evaluated on three datasets representing three temporal regimes:
NSCLC (Flatiron Health): 16,496 patients, weekly biomarker forecasting over 13 weeks
ICU (MIMIC-IV): 35,131 patients, hourly vital forecasting over 24 hours
ADNI (Alzheimer’s): 1,140 patients, cognitive decline forecasting over 24 months
These datasets differ sharply in missingness, density, timescale, and variable semantics, offering a comprehensive stress test.
2. Forecasting Results: Why LLMs Excel at Digital-Twin Tasks
2.1 Superior predictive performance across all tasks
Across 14 baseline models—ranging from LightGBM to Temporal Fusion Transformers, Transformers, CNNs, RNNs, TiDE, LLMTime, Time-LLM, and a 32B-parameter general LLM (Qwen3)—DT-GPT achieved the lowest scaled MAE in every dataset:
NSCLC: 3.4% improvement over LightGBM
ICU: 1.3% improvement over LightGBM
Alzheimer’s: 1.8% improvement over TFT
This is notable given the simplicity of the method and the relatively small parameter count (7B).
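For reference, one common way to define a scaled MAE is to normalize each variable's error by its dispersion in the training data so that labs with very different units contribute comparably. The sketch below uses that convention, which may differ from the paper's exact scaling.

```python
import numpy as np

def scaled_mae(y_true: np.ndarray, y_pred: np.ndarray, train_values: np.ndarray) -> float:
    """Mean absolute error, normalized per variable so errors are comparable
    across labs with very different units and ranges.

    Each variable's MAE is divided by its standard deviation in the training
    data; this is one common convention and may not match the paper's exact
    scaling. Arrays are (n_samples, n_variables), with NaN marking unobserved entries.
    """
    err = np.abs(y_true - y_pred)                      # per-entry absolute error
    per_var_mae = np.nanmean(err, axis=0)              # ignore missing ground truth
    scale = np.nanstd(train_values, axis=0) + 1e-8     # per-variable dispersion
    return float(np.mean(per_var_mae / scale))
```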
2.2 Preservation of biological correlations
A core requirement for digital twins is the preservation of inter-variable relationships. DT-GPT’s predictions maintain the correlation structure of the data extremely well:
R² = 0.98–0.99 for NSCLC and ICU
In contrast, many classical time-series models treat channels independently or require architectural interventions to capture cross-variable interactions.
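One plausible way to quantify correlation preservation, offered as an illustrative sketch rather than the paper's exact protocol, is to compare the off-diagonal entries of the observed and predicted correlation matrices:

```python
import numpy as np

def correlation_preservation_r2(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Quantify how well predictions preserve inter-variable correlation structure.

    Compares the off-diagonal entries of the Pearson correlation matrices of
    observed vs. predicted values via R^2. Illustrative sketch, not necessarily
    the paper's evaluation code. Arrays are (n_samples, n_variables).
    """
    corr_true = np.corrcoef(y_true, rowvar=False)
    corr_pred = np.corrcoef(y_pred, rowvar=False)
    mask = ~np.eye(corr_true.shape[0], dtype=bool)     # keep off-diagonal pairs only
    x, y = corr_true[mask], corr_pred[mask]            # true vs. predicted correlations
    ss_res = np.sum((x - y) ** 2)
    ss_tot = np.sum((x - x.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)
```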
2.3 Exploiting informative missingness
EHRs encode a key feature absent in most canonical time-series datasets: missingness patterns that reflect clinical choices. For example, a test not ordered may imply clinical reassurance. LLMs, trained on human language containing ambiguity and implicit information, are well-suited to interpret this form of signal. The paper explicitly draws this analogy.
2.4 Extreme robustness to sparsity and noise
The NSCLC dataset exhibits 94.4% baseline missingness. DT-GPT remains stable even as an additional 20% of observed values are masked.
It is also resilient to extensive textual noise, only degrading significantly after approximately 25 injected misspellings per patient history.
Traditional ML models routinely fail under these conditions.
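A robustness probe in this spirit can be simulated by randomly dropping a fraction of the observed values before encoding the narrative; the sketch below is illustrative and may differ from the paper's masking procedure.

```python
import random

def mask_observed_values(visits: list[dict], fraction: float = 0.20, seed: int = 0) -> list[dict]:
    """Randomly drop a fraction of observed lab values to probe robustness to
    extra missingness, in the spirit of the masking stress test described above.
    Illustrative sketch; the paper's exact procedure may differ.
    """
    rng = random.Random(seed)
    masked = []
    for visit in visits:
        kept = {name: value for name, value in visit["labs"].items() if rng.random() > fraction}
        masked.append({**visit, "labs": kept})
    return masked
```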
2.5 Multi-trajectory generation and uncertainty
Because DT-GPT is generative, it produces multiple possible future trajectories. While the paper uses simple averaging, the diversity of samples implicitly captures uncertainty and multimodality—two essential components of digital-twin behavior.
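Concretely, multi-trajectory generation amounts to sampling the model several times with stochastic decoding and aggregating the parsed outputs. The sketch below assumes a Hugging Face-style generate API and a hypothetical parse_forecast helper that converts generated text back into numeric trajectories.

```python
import torch

@torch.no_grad()
def sample_trajectories(model, tokenizer, prompt: str, n_samples: int = 30,
                        max_new_tokens: int = 256, temperature: float = 0.7):
    """Draw several plausible futures by sampling from the fine-tuned model,
    then parse and average them into a point forecast.

    Generic sampling sketch; the number of samples, decoding settings, and the
    parse_forecast helper are assumptions, not values from the paper.
    """
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,                  # stochastic decoding -> diverse trajectories
        temperature=temperature,
        num_return_sequences=n_samples,
        max_new_tokens=max_new_tokens,
    )
    texts = tokenizer.batch_decode(outputs[:, inputs["input_ids"].shape[1]:],
                                   skip_special_tokens=True)
    # parse_forecast is hypothetical: it must return a (horizon, n_vars) tensor per sample
    trajectories = [parse_forecast(t) for t in texts]
    return torch.stack(trajectories).mean(dim=0), trajectories
```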
3. Interpretability and Interaction: A Distinctive Advantage
One seldom-discussed requirement of digital twins is clinician interpretability. DT-GPT inherits the conversational abilities of an LLM, allowing clinicians to:
ask why the model produced a trajectory,
request the most influential variables,
explore counterfactuals.
The model identifies clinically meaningful predictors such as:
therapy type,
ECOG status,
leukocyte counts,
lactate dehydrogenase,
age.
These explanations align with known clinical relationships (e.g., chemotherapy-induced anemia, ECOG-linked prognosis), increasing clinician trust and supporting adoption.
4. Zero-Shot Forecasting: The Emergent Capability That Makes LLMs Uniquely Suitable
A key demonstration is zero-shot prediction of 69 clinical variables not included in fine-tuning. DT-GPT outperforms LightGBM on 13 of these despite LightGBM being trained on over 13,000 patients per variable.
The variables where DT-GPT excels are those:
highly correlated with the fine-tuned targets, or
clinically or physiologically related (e.g., ferritin-to-hemoglobin ratio, ALBI score components).
This suggests that:
The LLM is leveraging latent biomedical knowledge learned during pre-training.
It is inferring mechanistic or causal relationships even when supervised data are absent.
It can generalize to new tasks without retraining—an essential property for a generalizable digital twin.
No classical model in the benchmark possesses this capability.
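In practice, a zero-shot query requires nothing more than naming the new variable in the forecast instruction. Continuing the illustrative helpers sketched earlier (encode_history and sample_trajectories), with a hypothetical target variable:

```python
# Zero-shot query: a variable never used during fine-tuning is requested simply
# by naming it in the forecast instruction. Reuses the illustrative helpers
# sketched above; the variable choice ("albumin") is hypothetical.
prompt = encode_history(example, targets=["albumin"], horizon_weeks=13)
point_forecast, samples = sample_trajectories(model, tokenizer, prompt)
```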
5. Limitations and What They Mean for Digital-Twin Deployment
The paper is transparent about the system’s limitations.
5.1 Poor performance on rare acute events
Critically low hemoglobin (<7.5 g/dL) is predicted no better than chance (AUC ≈ 0.506) due to extremely low prevalence (1.2%).
This is consistent with the behavior of likelihood-trained language models, which tend to underfit rare tail events.
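To illustrate how such a rare-event AUC can be computed from trajectory forecasts, here is a toy sketch on synthetic data; it is not the paper's evaluation code, and the number it prints has no clinical meaning.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy sketch of the rare-event evaluation: an episode counts as an event if
# true hemoglobin drops below 7.5 g/dL anywhere in the horizon, and the model's
# risk score is its predicted nadir. Synthetic data, not paper results.
rng = np.random.default_rng(0)
hb_true = rng.normal(11.0, 1.5, size=(1000, 13))              # (patients, weeks)
hb_pred = hb_true + rng.normal(0.0, 1.0, size=hb_true.shape)  # noisy forecasts

event_true = (hb_true.min(axis=1) < 7.5).astype(int)
risk_score = -hb_pred.min(axis=1)                              # lower predicted nadir -> higher risk
print("AUC:", roc_auc_score(event_true, risk_score))
```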
5.2 Restricted number of forecasted variables
Sequence-length limits constrain the number of variables that can be embedded into the input narrative. True digital twins require much higher dimensionality—potentially thousands of structured variables.
5.3 Hallucination risk and inherited bias
The model may hallucinate rationales in explanations, and it reproduces underlying EHR biases, including demographic biases common in U.S. healthcare data.
A human-in-the-loop workflow is required for safe use.
6. How This Compares to Prior Digital-Twin and Forecasting Work
Foresight and Cosmos
Earlier GPT-based health-trajectory forecasting systems, such as the Foresight model and Epic's Cosmos, demonstrated that LLMs could ingest high-dimensional clinical codes and predict future states. Foresight, for example, drew on roughly 15,000–30,000 SNOMED concepts, which underscores how coarse DT-GPT's roughly 300-variable input is by comparison. Yet DT-GPT performs strongly across three independent domains despite having dramatically fewer inputs, suggesting an efficiency advantage conferred by narrative encoding and LLM priors.
Traditional ML forecasting models
Transformers, TFT, and gradient boosting models like LightGBM have delivered domain-specific state-of-the-art performance but generally require:
heavy preprocessing,
imputation,
strict channel alignment,
variable normalization or scaling,
bespoke architectures for each dataset.
DT-GPT operates with none of these requirements.
Causal or mechanistic models
While causal modeling provides interpretability and support for counterfactual reasoning, such models are limited by small datasets and structural assumptions. DT-GPT demonstrates that generative LLMs can implicitly learn relationships without explicit causal modeling, although future systems may combine the two approaches.
7. Why LLMs Are a Natural Fit for Digital Twins
From the evidence in the paper, several principles emerge:
LLMs learn from heterogeneous, irregular, and incomplete sequences, mimicking how clinicians reason from imperfect records.
They can integrate multimodal inputs using a single text interface, removing the need for complex feature engineering pipelines.
LLMs leverage pre-trained biomedical knowledge, enabling zero-shot performance and better generalization.
Conversational interfaces support interpretability and counterfactual exploration, which are essential for clinical decision making.
Generative modeling yields multi-trajectory simulation, mimicking the stochastic nature of disease progression.
Taken together, these properties form the bedrock of a digital twin: a high-dimensional, personalized, explanatory, and dynamically evolving model of the patient.
8. Conclusion
DT-GPT demonstrates that LLMs can reliably forecast real-world clinical trajectories across domains as different as oncology, critical care, and neurodegeneration. It performs robustly with minimal preprocessing, maintains biological coherence, provides interpretable rationales, and even predicts variables it was never trained on.
While the system remains limited by sequence length, rare-event under-prediction, and the ever-present risk of hallucination, its performance shows that LLM-based digital twins are rapidly moving from conceptual vision to operational reality.
Future work will require larger context windows, hybrid causal-generative architectures, multimodal ingestion (images, genomics, notes), and rigorous safety validation. But the trajectory is clear: LLMs are uniquely aligned with the demands of digital-twin modeling, and DT-GPT represents a significant milestone in that direction.