Internal-External Cross-Validation Tutorial
A Practical Guide to Validating Clinical Prediction Models
Introduction
Welcome to this interactive tutorial on Internal-External Cross-Validation (IECV) for clinical prediction models. This guide will walk you through:
- Understanding when and why to use IECV
- Running IECV with different model types
- Interpreting the results
- Creating publication-quality visualizations
What is IECV?
Internal-External Cross-Validation is a validation strategy for prediction models developed using multi-center or multi-study data. Unlike standard cross-validation that randomly splits data, IECV leverages the natural clustering in your data (hospitals, studies, time periods) to provide more realistic performance estimates.
IECV is ideal when:
- You have data from multiple centers/hospitals
- You want to estimate how well your model will perform at new centers
- You need to assess model transportability across settings
The IECV Process
Each center takes a turn being the “external” validation set, while the model is trained on all other centers. This mimics the real-world scenario of applying a model to a new hospital.
flowchart LR
A[Multi-center Data] --> B[Leave One Center Out]
B --> C[Train on N-1 Centers]
C --> D[Validate on Held-out Center]
D --> E[Repeat for Each Center]
E --> F[Pool Results]
Setup
First, let’s load the required packages and our IECV function.
# Load required packages
library(tidyverse)
library(tidymodels)
library(furrr)
library(probably)
library(dcurves)
library(bonsai)
library(cli)
library(patchwork)
# Load the IECV function
source("../R/iecv_modelling.R")
Now let’s load our example dataset - simulated multi-center patient data.
# Load the dataset
patient_data <- read_csv("../data/simulated_patient_data.csv")
# Quick overview
glimpse(patient_data)
Rows: 1,346
Columns: 7
$ patient_id <chr> "A001", "A002", "A003", "A004", "A005", "A006", "A007", "A…
$ center <chr> "Hospital_A", "Hospital_A", "Hospital_A", "Hospital_A", "H…
$ age <dbl> 73.5, 62.6, 48.7, 68.5, 69.0, 54.1, 80.7, 47.7, 63.8, 49.0…
$ sex <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0…
$ biomarker <dbl> 4.18, 4.79, 3.85, 3.97, 4.89, 3.04, 5.12, 5.12, 4.24, 7.84…
$ comorbidity <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0…
$ outcome <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
# Check the distribution across centers
patient_data %>%
group_by(center) %>%
summarise(
n_patients = n(),
n_events = sum(outcome),
event_rate = mean(outcome) * 100
) %>%
arrange(center) %>%
knitr::kable(
col.names = c("Center", "Patients", "Events", "Event Rate (%)"),
digits = 1,
caption = "Distribution of patients and outcomes by center"
)
| Center | Patients | Events | Event Rate (%) |
|---|---|---|---|
| Hospital_A | 198 | 171 | 86.4 |
| Hospital_B | 228 | 184 | 80.7 |
| Hospital_C | 225 | 162 | 72.0 |
| Hospital_D | 208 | 162 | 77.9 |
| Hospital_E | 249 | 211 | 84.7 |
| Hospital_F | 238 | 176 | 73.9 |
IECV requires at least 3 clusters to function properly. More clusters provide more validation folds and more stable estimates. Each cluster should have enough events for reliable metric estimation.
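As a rough illustration (not necessarily how iecv_modelling() is implemented internally), the leave-one-center-out folds that IECV uses can be built explicitly with rsample, which is loaded as part of tidymodels:
# Illustration only: leave-one-center-out resamples, one per center
iecv_folds <- group_vfold_cv(patient_data, group = center)
iecv_folds
# Each split trains on five centers and holds the remaining center out for validation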
Running IECV: Logistic Regression
Let’s start with a logistic regression model - the workhorse of clinical prediction modeling.
Basic Usage
# Run IECV with logistic regression
result_lr <- iecv_modelling(
data = patient_data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "logistic",
n_boot = 100, # Bootstrap replicates for CIs
n_cores = 2, # Parallel processing
verbose = TRUE # Show progress
)
Viewing Results
The print() method provides a quick summary:
print(result_lr)
Internal-External Cross-Validation Results
==========================================
Model: Logistic Regression
Clusters: 6
Metrics: auc, brier, cal_intercept, cal_slope
Bootstrap replicates: 100
Confidence level: 95 %
Summary across clusters:
# A tibble: 4 × 6
metric mean sd median min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUC 0.737 0.0218 0.745 0.701 0.759
2 Brier Score 0.149 0.0240 0.144 0.115 0.179
3 Calibration Intercept -0.0563 0.752 -0.0101 -0.933 1.15
4 Calibration Slope 1.06 0.157 1.09 0.798 1.27
For more detailed output, use summary():
summary(result_lr)
IECV Summary
============
Model: Logistic Regression
Per-cluster results:
# A tibble: 6 × 7
id n_validation n_events auc brier cal_intercept cal_slope
<chr> <int> <dbl> <chr> <chr> <chr> <chr>
1 Resample1 238 176 0.739 (0.675-0.… 0.17… -0.740 (-1.7… 1.074 (0…
2 Resample2 225 162 0.759 (0.690-0.… 0.17… -0.933 (-1.7… 1.269 (0…
3 Resample3 198 171 0.701 (0.612-0.… 0.13… 1.146 (0.635… 0.798 (0…
4 Resample4 208 162 0.724 (0.641-0.… 0.14… -0.199 (-0.8… 0.999 (0…
5 Resample5 249 211 0.752 (0.649-0.… 0.11… 0.210 (-0.67… 1.125 (0…
6 Resample6 228 184 0.751 (0.679-0.… 0.14… 0.179 (-0.37… 1.115 (0…
Pooled summary:
# A tibble: 4 × 6
metric mean sd median min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUC 0.737 0.0218 0.745 0.701 0.759
2 Brier Score 0.149 0.0240 0.144 0.115 0.179
3 Calibration Intercept -0.0563 0.752 -0.0101 -0.933 1.15
4 Calibration Slope 1.06 0.157 1.09 0.798 1.27
Understanding the Metrics
| Metric | Interpretation |
|---|---|
| AUC | Discrimination - how well the model separates events from non-events |
| Brier Score | Overall accuracy - lower is better |
| Calibration Intercept | Systematic bias - positive means underprediction |
| Calibration Slope | < 1 indicates overfitting, > 1 underfitting |
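To make these definitions concrete, here is a minimal sketch of how each metric can be computed by hand from the pooled out-of-fold predictions (the result_lr$predictions data frame described in the Extracting Predictions section below). This is for illustration only; iecv_modelling() computes the metrics per validation cluster and may differ in detail.
# Metrics computed by hand from pooled out-of-fold predictions
# (assumes columns outcome, predicted_prob, and linear_predictor)
preds <- result_lr$predictions

# AUC: how well predictions rank events above non-events
auc <- yardstick::roc_auc_vec(
  truth = factor(preds$outcome, levels = c(1, 0)),
  estimate = preds$predicted_prob
)

# Brier score: mean squared difference between predicted risk and the outcome
brier <- mean((preds$predicted_prob - preds$outcome)^2)

# Calibration intercept: logistic model with the linear predictor as an offset
cal_intercept <- unname(coef(
  glm(outcome ~ offset(linear_predictor), family = binomial, data = preds)
)[1])

# Calibration slope: coefficient of the linear predictor
cal_slope <- unname(coef(
  glm(outcome ~ linear_predictor, family = binomial, data = preds)
)[2])

c(auc = auc, brier = brier, cal_intercept = cal_intercept, cal_slope = cal_slope)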
Visualization
Forest Plots
Forest plots show performance across all validation centers, helping identify heterogeneity.
# All metrics in one view
plot(result_lr)
You can also plot individual metrics:
# Just AUC
plot(result_lr, type = "auc")
Calibration Plot
Calibration shows how well predicted probabilities match observed outcomes.
plot(result_lr, type = "calibration")
- Perfect calibration: points follow the diagonal line
- Above the line: model underpredicts risk
- Below the line: model overpredicts risk
- The ribbon shows the 90% confidence interval
Decision Curve Analysis
Decision curves help assess clinical utility - whether using the model improves decision-making compared to default strategies.
plot(result_lr, type = "dca")
- Net benefit > 0: Using the model is better than treating no one
- Model above “Treat All”: Using the model is better than treating everyone
- The range of threshold probabilities where the model curve is highest indicates where it adds most value
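To make the “net benefit” idea concrete, here is how it is computed at a single threshold probability, using the pooled out-of-fold predictions (sketch only; the threshold of 0.7 is chosen purely for illustration):
# Net benefit at one threshold: NB = TP/n - FP/n * pt / (1 - pt)
pt <- 0.7
preds <- result_lr$predictions
n <- nrow(preds)
tp <- sum(preds$predicted_prob >= pt & preds$outcome == 1)  # true positives
fp <- sum(preds$predicted_prob >= pt & preds$outcome == 0)  # false positives
tp / n - fp / n * pt / (1 - pt)
The decision curve repeats this calculation across a range of thresholds for the model, “Treat All”, and “Treat None”.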
Model Coefficients
For logistic regression, we can extract interpretable odds ratios:
tidy_final_model(result_lr) %>%
knitr::kable(digits = 3, caption = "Logistic Regression Coefficients (Odds Ratios)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.036 | 0.453 | -7.313 | 0.000 | 0.015 | 0.088 |
| age | 1.044 | 0.006 | 7.204 | 0.000 | 1.032 | 1.057 |
| sex | 1.631 | 0.145 | 3.375 | 0.001 | 1.230 | 2.172 |
| biomarker | 1.336 | 0.037 | 7.740 | 0.000 | 1.243 | 1.439 |
| comorbidity | 2.285 | 0.162 | 5.105 | 0.000 | 1.673 | 3.158 |
Comparing Model Types
One of the key features of iecv_modelling() is support for multiple model types. Let’s compare logistic regression with gradient boosting models.
XGBoost
result_xgb <- iecv_modelling(
data = patient_data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "xgboost",
n_boot = 100,
n_cores = 2,
verbose = TRUE
)
LightGBM
result_lgb <- iecv_modelling(
data = patient_data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "lightgbm",
n_boot = 100,
n_cores = 2,
verbose = TRUE
)
Performance Comparison
# Combine summaries
comparison <- bind_rows(
result_lr$summary %>% mutate(model = "Logistic"),
result_xgb$summary %>% mutate(model = "XGBoost"),
result_lgb$summary %>% mutate(model = "LightGBM")
) %>%
select(model, metric, mean, sd, min, max) %>%
arrange(metric, model)
comparison %>%
knitr::kable(
digits = 3,
caption = "Model Comparison: Pooled Performance Across Validation Centers"
)
| model | metric | mean | sd | min | max |
|---|---|---|---|---|---|
| LightGBM | AUC | 0.657 | 0.051 | 0.591 | 0.723 |
| Logistic | AUC | 0.737 | 0.022 | 0.701 | 0.759 |
| XGBoost | AUC | 0.704 | 0.033 | 0.645 | 0.742 |
| LightGBM | Brier Score | 0.173 | 0.017 | 0.146 | 0.192 |
| Logistic | Brier Score | 0.149 | 0.024 | 0.115 | 0.179 |
| XGBoost | Brier Score | 0.155 | 0.021 | 0.123 | 0.180 |
| LightGBM | Calibration Intercept | 0.709 | 0.633 | -0.159 | 1.585 |
| Logistic | Calibration Intercept | -0.056 | 0.752 | -0.933 | 1.146 |
| XGBoost | Calibration Intercept | 0.263 | 0.736 | -0.506 | 1.380 |
| LightGBM | Calibration Slope | 0.388 | 0.153 | 0.184 | 0.607 |
| Logistic | Calibration Slope | 1.064 | 0.157 | 0.798 | 1.269 |
| XGBoost | Calibration Slope | 0.767 | 0.177 | 0.441 | 0.944 |
# Visual comparison of AUC
auc_comparison <- bind_rows(
result_lr$cluster_results %>%
select(id, auc, auc_lower, auc_upper) %>%
mutate(model = "Logistic"),
result_xgb$cluster_results %>%
select(id, auc, auc_lower, auc_upper) %>%
mutate(model = "XGBoost"),
result_lgb$cluster_results %>%
select(id, auc, auc_lower, auc_upper) %>%
mutate(model = "LightGBM")
)
ggplot(auc_comparison, aes(x = auc, y = id, color = model, shape = model)) +
geom_point(size = 3, position = position_dodge(width = 0.5)) +
geom_errorbar(
aes(xmin = auc_lower, xmax = auc_upper),
width = 0.2,
position = position_dodge(width = 0.5)
) +
geom_vline(xintercept = 0.5, linetype = "dotted", color = "gray50") +
scale_color_manual(values = c("Logistic" = "#2E86AB", "XGBoost" = "#A23B72", "LightGBM" = "#F18F01")) +
labs(
x = "AUC (95% CI)",
y = "Validation Center",
title = "Model Comparison: AUC Across External Validations",
color = "Model",
shape = "Model"
) +
theme_minimal() +
theme(
legend.position = "bottom",
panel.grid.minor = element_blank()
)
SHAP Analysis for Tree Models
For tree-based models (XGBoost, LightGBM), we can use SHAP (SHapley Additive exPlanations) to understand feature importance and effects.
SHAP Summary Plot
# SHAP beeswarm plot for XGBoost
plot(result_xgb, type = "shap")
- Each dot represents a patient
- X-axis: SHAP value (contribution to prediction)
- Color: Feature value (red = high, blue = low)
- Features are sorted by importance (top = most important)
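For context, here is a standalone sketch of how SHAP values can be computed and plotted with the xgboost and shapviz packages (both appear in the session info). It refits a model on the full data purely for illustration and is not necessarily how plot(result_xgb, type = "shap") works internally:
# Illustration only: SHAP values for an xgboost model via shapviz
library(xgboost)
library(shapviz)

X <- data.matrix(patient_data[, c("age", "sex", "biomarker", "comorbidity")])
dtrain <- xgb.DMatrix(X, label = patient_data$outcome)

fit <- xgb.train(
  params = list(objective = "binary:logistic"),
  data = dtrain,
  nrounds = 50
)

shp <- shapviz(fit, X_pred = X)       # SHAP value for each patient and feature
sv_importance(shp, kind = "beeswarm") # beeswarm summary plot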
Variable Importance
# SHAP-based importance
variable_importance(result_xgb) %>%
knitr::kable(digits = 3, caption = "Variable Importance (Mean |SHAP Value|)")
| variable | mean_abs_shap |
|---|---|
| biomarker | 0.497 |
| age | 0.493 |
| comorbidity | 0.343 |
| sex | 0.238 |
SHAP Dependence Plots
Dependence plots show how a specific feature affects predictions:
plot_shap_dependence(result_xgb, feature = "age")
plot_shap_dependence(result_xgb, feature = "biomarker")
Calibration Comparison
Let’s compare calibration across all three models:
p1 <- plot(result_lr, type = "calibration") +
labs(title = "Logistic Regression") +
theme(legend.position = "none")
p2 <- plot(result_xgb, type = "calibration") +
labs(title = "XGBoost") +
theme(legend.position = "none")
p3 <- plot(result_lgb, type = "calibration") +
labs(title = "LightGBM") +
theme(legend.position = "none")
p1 + p2 + p3 + plot_layout(ncol = 3)
DCA Comparison
d1 <- plot(result_lr, type = "dca") +
labs(title = "Logistic Regression")
d2 <- plot(result_xgb, type = "dca") +
labs(title = "XGBoost")
d3 <- plot(result_lgb, type = "dca") +
labs(title = "LightGBM")
d1 + d2 + d3 + plot_layout(ncol = 3)
Extracting Predictions
You can access the out-of-fold predictions for further analysis:
# Get predictions
preds <- result_lr$predictions
head(preds, 10) %>%
knitr::kable(digits = 3, caption = "Sample Out-of-Fold Predictions")
| .row | fold | outcome | predicted_prob | linear_predictor |
|---|---|---|---|---|
| 1 | Resample3 | 1 | 0.862 | 1.828 |
| 2 | Resample3 | 1 | 0.656 | 0.647 |
| 3 | Resample3 | 1 | 0.531 | 0.122 |
| 4 | Resample3 | 0 | 0.761 | 1.159 |
| 5 | Resample3 | 1 | 0.731 | 1.001 |
| 6 | Resample3 | 1 | 0.536 | 0.143 |
| 7 | Resample3 | 1 | 0.841 | 1.663 |
| 8 | Resample3 | 1 | 0.783 | 1.286 |
| 9 | Resample3 | 1 | 0.631 | 0.537 |
| 10 | Resample3 | 1 | 0.849 | 1.727 |
These pooled out-of-fold predictions are useful for:
- Custom calibration analyses (see the example below)
- Threshold selection
- Building calibration plots at specific risk strata
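As an example, a simple decile-based calibration check can be built directly from these predictions (a sketch using the column names shown above):
# Compare mean predicted risk with the observed event rate
# within deciles of predicted probability
result_lr$predictions %>%
  mutate(risk_decile = ntile(predicted_prob, 10)) %>%
  group_by(risk_decile) %>%
  summarise(
    n = n(),
    mean_predicted = mean(predicted_prob),
    observed_rate = mean(outcome)
  )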
Best Practices
1. Sample Size
- Ensure each cluster has sufficient events (ideally 20+)
- More clusters = more stable estimates
- Consider pooling small clusters if needed (see the sketch below)
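One hypothetical way to pool small clusters before running IECV is to combine centers with fewer than a chosen number of events into a single cluster (the threshold of 20 events and the label "Pooled_small_centers" are illustrative):
# Pool centers with fewer than 20 events into one cluster
min_events <- 20
small_centers <- patient_data %>%
  group_by(center) %>%
  summarise(n_events = sum(outcome)) %>%
  filter(n_events < min_events) %>%
  pull(center)

patient_data_pooled <- patient_data %>%
  mutate(center = if_else(center %in% small_centers, "Pooled_small_centers", center))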
2. Bootstrap Replicates
# For exploratory analysis
n_boot = 50
# For publication
n_boot = 200
3. Model Selection
| Scenario | Recommended Model |
|---|---|
| Interpretability needed | Logistic regression |
| Complex non-linear relationships | XGBoost/LightGBM |
| Small sample size | Logistic regression |
| Large sample, many predictors | XGBoost/LightGBM |
4. Reporting Checklist
- Number of centers, patients, and events per center
- Model type and predictors used
- Per-center and pooled discrimination (AUC) and calibration (intercept and slope)
- Brier score
- Number of bootstrap replicates and confidence level
- Forest, calibration, and decision-curve plots
Summary
In this tutorial, you learned how to:
- Run IECV with iecv_modelling() for logistic regression, XGBoost, and LightGBM
- Interpret metrics including AUC, calibration, and Brier score
- Visualize results with forest plots, calibration curves, and decision curves
- Compare models to choose the best approach for your data
- Use SHAP to understand tree model predictions
IECV provides a robust framework for validating clinical prediction models when you have multi-center data. By treating each center as an external validation, you get realistic estimates of how your model will perform in new settings.
Session Info
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS 26.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Copenhagen
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.2 cli_3.6.5 bonsai_0.4.0 dcurves_0.5.1
[5] probably_1.2.0 furrr_0.3.1 future_1.68.0 yardstick_1.3.2
[9] workflowsets_1.1.1 workflows_1.3.0 tune_2.0.1 tailor_0.1.0
[13] rsample_1.3.1 recipes_1.3.1 parsnip_1.4.0 modeldata_1.5.1
[17] infer_1.1.0 dials_1.4.2 scales_1.4.0 broom_1.0.11
[21] tidymodels_1.4.1 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
[25] dplyr_1.1.4 purrr_1.2.0 readr_2.1.6 tidyr_1.3.2
[29] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] rlang_1.1.6 magrittr_2.0.4 otel_0.2.0
[4] compiler_4.5.0 mgcv_1.9-4 vctrs_0.6.5
[7] lhs_1.2.0 pkgconfig_2.0.3 crayon_1.5.3
[10] fastmap_1.2.0 backports_1.5.0 labeling_0.4.3
[13] utf8_1.2.6 rmarkdown_2.30 prodlim_2025.04.28
[16] tzdb_0.5.0 bit_4.6.0 xfun_0.55
[19] jsonlite_2.0.0 parallel_4.5.0 R6_2.6.1
[22] stringi_1.8.7 RColorBrewer_1.1-3 parallelly_1.46.0
[25] rpart_4.1.24 shapviz_0.10.3 xgboost_3.1.2.1
[28] Rcpp_1.1.0 knitr_1.51 future.apply_1.20.1
[31] lightgbm_4.6.0 Matrix_1.7-4 splines_4.5.0
[34] nnet_7.3-20 timechange_0.3.0 tidyselect_1.2.1
[37] rstudioapi_0.17.1 yaml_2.3.12 timeDate_4051.111
[40] codetools_0.2-20 listenv_0.10.0 lattice_0.22-7
[43] withr_3.0.2 S7_0.2.1 evaluate_1.0.5
[46] archive_1.1.12.1 survival_3.8-3 pillar_1.11.1
[49] generics_0.1.4 vroom_1.6.7 hms_1.1.4
[52] mirai_2.5.3 globals_0.18.0 class_7.3-23
[55] glue_1.8.0 tools_4.5.0 data.table_1.18.0
[58] modelenv_0.2.0 gower_1.0.2 grid_4.5.0
[61] ipred_0.9-15 nlme_3.1-168 DiceDesign_1.10
[64] viridisLite_0.4.2 lava_1.8.2 gtable_0.3.6
[67] GPfit_1.0-9 digest_0.6.39 nanonext_1.7.2
[70] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.9
[73] lifecycle_1.0.4 hardhat_1.4.2 bit64_4.6.0-1
[76] MASS_7.3-65 sparsevctrs_0.3.5