Internal-External Cross-Validation Tutorial
A Practical Guide to Validating Clinical Prediction Models
Introduction
Welcome to this interactive tutorial on Internal-External Cross-Validation (IECV) for clinical prediction models. This guide will walk you through:
- Understanding when and why to use IECV
- Running IECV with different model types
- Interpreting the results
- Creating publication-quality visualizations
What is IECV?
Internal-External Cross-Validation is a validation strategy for prediction models developed using multi-center or multi-study data. Unlike standard cross-validation that randomly splits data, IECV leverages the natural clustering in your data (hospitals, studies, time periods) to provide more realistic performance estimates.
IECV is ideal when:
- You have data from multiple centers/hospitals
- You want to estimate how well your model will perform at new centers
- You need to assess model transportability across settings
The IECV Process
Each center takes a turn being the “external” validation set, while the model is trained on all other centers. This mimics the real-world scenario of applying a model to a new hospital.
flowchart LR
A[Multi-center Data] --> B[Leave One Center Out]
B --> C[Train on N-1 Centers]
C --> D[Validate on Held-out Center]
D --> E[Repeat for Each Center]
E --> F[Pool Results]
Setup
First, let’s load the required packages and our IECV function.
# Load required packages
library(tidyverse)
library(tidymodels)
library(furrr)
library(probably)
library(dcurves)
library(bonsai)
library(cli)
library(patchwork)
# Load the IECV function
source("../R/iecv_modelling.R")
Now let’s load our example dataset - simulated multi-center patient data.
# Load the dataset
patient_data <- read_csv("../data/simulated_patient_data.csv")
# Quick overview
glimpse(patient_data)
Rows: 1,346
Columns: 7
$ patient_id <chr> "A001", "A002", "A003", "A004", "A005", "A006", "A007", "A…
$ center <chr> "Hospital_A", "Hospital_A", "Hospital_A", "Hospital_A", "H…
$ age <dbl> 73.5, 62.6, 48.7, 68.5, 69.0, 54.1, 80.7, 47.7, 63.8, 49.0…
$ sex <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0…
$ biomarker <dbl> 4.18, 4.79, 3.85, 3.97, 4.89, 3.04, 5.12, 5.12, 4.24, 7.84…
$ comorbidity <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0…
$ outcome <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
# Check the distribution across centers
patient_data %>%
group_by(center) %>%
summarise(
n_patients = n(),
n_events = sum(outcome),
event_rate = mean(outcome) * 100
) %>%
arrange(center) %>%
knitr::kable(
col.names = c("Center", "Patients", "Events", "Event Rate (%)"),
digits = 1,
caption = "Distribution of patients and outcomes by center"
)
| Center | Patients | Events | Event Rate (%) |
|---|---|---|---|
| Hospital_A | 198 | 171 | 86.4 |
| Hospital_B | 228 | 184 | 80.7 |
| Hospital_C | 225 | 162 | 72.0 |
| Hospital_D | 208 | 162 | 77.9 |
| Hospital_E | 249 | 211 | 84.7 |
| Hospital_F | 238 | 176 | 73.9 |
IECV requires at least 3 clusters to function properly. More clusters provide more validation folds and more stable estimates. Each cluster should have enough events for reliable metric estimation.
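As a rough illustration (not necessarily how iecv_modelling() is implemented internally), the leave-one-center-out folds that IECV uses can be built explicitly with rsample, which is loaded as part of tidymodels:
# Illustration only: leave-one-center-out resamples, one per center
iecv_folds <- group_vfold_cv(patient_data, group = center)
iecv_folds
# Each split trains on five centers and holds the remaining center out for validation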
Running IECV: Logistic Regression
Let’s start with a logistic regression model - the workhorse of clinical prediction modeling.
Basic Usage
# Run IECV with logistic regression
result_lr <- iecv_modelling(
data = patient_data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "logistic",
n_boot = 100, # Bootstrap replicates for CIs
n_cores = 2, # Parallel processing
verbose = TRUE # Show progress
)
Viewing Results
The print() method provides a quick summary:
print(result_lr)
Internal-External Cross-Validation Results
==========================================
Model: Logistic Regression
Clusters: 6
Metrics: auc, brier, cal_intercept, cal_slope
Bootstrap replicates: 100
Confidence level: 95 %
Summary across clusters:
# A tibble: 4 × 6
metric mean sd median min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUC 0.737 0.0218 0.745 0.701 0.759
2 Brier Score 0.149 0.0240 0.144 0.115 0.179
3 Calibration Intercept -0.0563 0.752 -0.0101 -0.933 1.15
4 Calibration Slope 1.06 0.157 1.09 0.798 1.27
For more detailed output, use summary():
summary(result_lr)
IECV Summary
============
Model: Logistic Regression
Per-cluster results:
# A tibble: 6 × 7
id n_validation n_events auc brier cal_intercept cal_slope
<chr> <int> <dbl> <chr> <chr> <chr> <chr>
1 Resample1 238 176 0.739 (0.675-0.… 0.17… -0.740 (-1.7… 1.074 (0…
2 Resample2 225 162 0.759 (0.690-0.… 0.17… -0.933 (-1.7… 1.269 (0…
3 Resample3 198 171 0.701 (0.612-0.… 0.13… 1.146 (0.635… 0.798 (0…
4 Resample4 208 162 0.724 (0.641-0.… 0.14… -0.199 (-0.8… 0.999 (0…
5 Resample5 249 211 0.752 (0.649-0.… 0.11… 0.210 (-0.67… 1.125 (0…
6 Resample6 228 184 0.751 (0.679-0.… 0.14… 0.179 (-0.37… 1.115 (0…
Pooled summary:
# A tibble: 4 × 6
metric mean sd median min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 AUC 0.737 0.0218 0.745 0.701 0.759
2 Brier Score 0.149 0.0240 0.144 0.115 0.179
3 Calibration Intercept -0.0563 0.752 -0.0101 -0.933 1.15
4 Calibration Slope 1.06 0.157 1.09 0.798 1.27
Understanding the Metrics
| Metric | Interpretation |
|---|---|
| AUC | Discrimination - how well the model separates events from non-events |
| Brier Score | Overall accuracy - lower is better |
| Calibration Intercept | Systematic bias - positive means underprediction |
| Calibration Slope | < 1 indicates overfitting, > 1 underfitting |
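To make these definitions concrete, here is a minimal sketch of how each metric can be computed by hand from the pooled out-of-fold predictions (the result_lr$predictions data frame described in the Extracting Predictions section below). This is for illustration only; iecv_modelling() computes the metrics per validation cluster and may differ in detail.
# Metrics computed by hand from pooled out-of-fold predictions
# (assumes columns outcome, predicted_prob, and linear_predictor)
preds <- result_lr$predictions

# AUC: how well predictions rank events above non-events
auc <- yardstick::roc_auc_vec(
  truth = factor(preds$outcome, levels = c(1, 0)),
  estimate = preds$predicted_prob
)

# Brier score: mean squared difference between predicted risk and the outcome
brier <- mean((preds$predicted_prob - preds$outcome)^2)

# Calibration intercept: logistic model with the linear predictor as an offset
cal_intercept <- unname(coef(
  glm(outcome ~ offset(linear_predictor), family = binomial, data = preds)
)[1])

# Calibration slope: coefficient of the linear predictor
cal_slope <- unname(coef(
  glm(outcome ~ linear_predictor, family = binomial, data = preds)
)[2])

c(auc = auc, brier = brier, cal_intercept = cal_intercept, cal_slope = cal_slope)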
Visualization
Forest Plots
Forest plots show performance across all validation centers, helping identify heterogeneity.
# All metrics in one view
plot(result_lr)
You can also plot individual metrics:
# Just AUC
plot(result_lr, type = "auc")
Calibration Plot
Calibration shows how well predicted probabilities match observed outcomes.
plot(result_lr, type = "calibration")
- Perfect calibration: points follow the diagonal line
- Above the line: model underpredicts risk
- Below the line: model overpredicts risk
- The ribbon shows the 90% confidence interval
Decision Curve Analysis
Decision curves help assess clinical utility - whether using the model improves decision-making compared to default strategies.
plot(result_lr, type = "dca")
- Net benefit > 0: Using the model is better than treating no one
- Model above “Treat All”: Using the model is better than treating everyone
- The range of threshold probabilities where the model curve is highest indicates where it adds most value
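To make the “net benefit” idea concrete, here is how it is computed at a single threshold probability, using the pooled out-of-fold predictions (sketch only; the threshold of 0.7 is chosen purely for illustration):
# Net benefit at one threshold: NB = TP/n - FP/n * pt / (1 - pt)
pt <- 0.7
preds <- result_lr$predictions
n <- nrow(preds)
tp <- sum(preds$predicted_prob >= pt & preds$outcome == 1)  # true positives
fp <- sum(preds$predicted_prob >= pt & preds$outcome == 0)  # false positives
tp / n - fp / n * pt / (1 - pt)
The decision curve repeats this calculation across a range of thresholds for the model, “Treat All”, and “Treat None”.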
Model Coefficients
For logistic regression, we can extract interpretable odds ratios:
tidy_final_model(result_lr) %>%
knitr::kable(digits = 3, caption = "Logistic Regression Coefficients (Odds Ratios)")
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.036 | 0.453 | -7.313 | 0.000 | 0.015 | 0.088 |
| age | 1.044 | 0.006 | 7.204 | 0.000 | 1.032 | 1.057 |
| sex | 1.631 | 0.145 | 3.375 | 0.001 | 1.230 | 2.172 |
| biomarker | 1.336 | 0.037 | 7.740 | 0.000 | 1.243 | 1.439 |
| comorbidity | 2.285 | 0.162 | 5.105 | 0.000 | 1.673 | 3.158 |
Comparing Model Types
One of the key features of iecv_modelling() is support for multiple model types. Let’s compare logistic regression with gradient boosting models.
XGBoost
result_xgb <- iecv_modelling(
data = patient_data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "xgboost",
n_boot = 100,
n_cores = 2,
verbose = TRUE
)
LightGBM
result_lgb <- iecv_modelling(
data = patient_data,
outcome = "outcome",
predictors = c("age", "sex", "biomarker", "comorbidity"),
cluster = "center",
model = "lightgbm",
n_boot = 100,
n_cores = 2,
verbose = TRUE
)
Performance Comparison
# Combine summaries
comparison <- bind_rows(
result_lr$summary %>% mutate(model = "Logistic"),
result_xgb$summary %>% mutate(model = "XGBoost"),
result_lgb$summary %>% mutate(model = "LightGBM")
) %>%
select(model, metric, mean, sd, min, max) %>%
arrange(metric, model)
comparison %>%
knitr::kable(
digits = 3,
caption = "Model Comparison: Pooled Performance Across Validation Centers"
)
| model | metric | mean | sd | min | max |
|---|---|---|---|---|---|
| LightGBM | AUC | 0.657 | 0.051 | 0.591 | 0.723 |
| Logistic | AUC | 0.737 | 0.022 | 0.701 | 0.759 |
| XGBoost | AUC | 0.704 | 0.033 | 0.645 | 0.742 |
| LightGBM | Brier Score | 0.173 | 0.017 | 0.146 | 0.192 |
| Logistic | Brier Score | 0.149 | 0.024 | 0.115 | 0.179 |
| XGBoost | Brier Score | 0.155 | 0.021 | 0.123 | 0.180 |
| LightGBM | Calibration Intercept | 0.709 | 0.633 | -0.159 | 1.585 |
| Logistic | Calibration Intercept | -0.056 | 0.752 | -0.933 | 1.146 |
| XGBoost | Calibration Intercept | 0.263 | 0.736 | -0.506 | 1.380 |
| LightGBM | Calibration Slope | 0.388 | 0.153 | 0.184 | 0.607 |
| Logistic | Calibration Slope | 1.064 | 0.157 | 0.798 | 1.269 |
| XGBoost | Calibration Slope | 0.767 | 0.177 | 0.441 | 0.944 |
# Visual comparison of AUC
auc_comparison <- bind_rows(
result_lr$cluster_results %>%
select(id, auc, auc_lower, auc_upper) %>%
mutate(model = "Logistic"),
result_xgb$cluster_results %>%
select(id, auc, auc_lower, auc_upper) %>%
mutate(model = "XGBoost"),
result_lgb$cluster_results %>%
select(id, auc, auc_lower, auc_upper) %>%
mutate(model = "LightGBM")
)
ggplot(auc_comparison, aes(x = auc, y = id, color = model, shape = model)) +
geom_point(size = 3, position = position_dodge(width = 0.5)) +
geom_errorbar(
aes(xmin = auc_lower, xmax = auc_upper),
width = 0.2,
position = position_dodge(width = 0.5)
) +
geom_vline(xintercept = 0.5, linetype = "dotted", color = "gray50") +
scale_color_manual(values = c("Logistic" = "#2E86AB", "XGBoost" = "#A23B72", "LightGBM" = "#F18F01")) +
labs(
x = "AUC (95% CI)",
y = "Validation Center",
title = "Model Comparison: AUC Across External Validations",
color = "Model",
shape = "Model"
) +
theme_minimal() +
theme(
legend.position = "bottom",
panel.grid.minor = element_blank()
)
SHAP Analysis for Tree Models
For tree-based models (XGBoost, LightGBM), we can use SHAP (SHapley Additive exPlanations) to understand feature importance and effects.
SHAP Summary Plot
# SHAP beeswarm plot for XGBoost
plot(result_xgb, type = "shap")
- Each dot represents a patient
- X-axis: SHAP value (contribution to prediction)
- Color: Feature value (red = high, blue = low)
- Features are sorted by importance (top = most important)
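For context, here is a standalone sketch of how SHAP values can be computed and plotted with the xgboost and shapviz packages (both appear in the session info). It refits a model on the full data purely for illustration and is not necessarily how plot(result_xgb, type = "shap") works internally:
# Illustration only: SHAP values for an xgboost model via shapviz
library(xgboost)
library(shapviz)

X <- data.matrix(patient_data[, c("age", "sex", "biomarker", "comorbidity")])
dtrain <- xgb.DMatrix(X, label = patient_data$outcome)

fit <- xgb.train(
  params = list(objective = "binary:logistic"),
  data = dtrain,
  nrounds = 50
)

shp <- shapviz(fit, X_pred = X)       # SHAP value for each patient and feature
sv_importance(shp, kind = "beeswarm") # beeswarm summary plot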
Variable Importance
# SHAP-based importance
variable_importance(result_xgb) %>%
knitr::kable(digits = 3, caption = "Variable Importance (Mean |SHAP Value|)")
| variable | mean_abs_shap |
|---|---|
| biomarker | 0.497 |
| age | 0.493 |
| comorbidity | 0.343 |
| sex | 0.238 |
SHAP Dependence Plots
Dependence plots show how a specific feature affects predictions:
plot_shap_dependence(result_xgb, feature = "age")
plot_shap_dependence(result_xgb, feature = "biomarker")
Calibration Comparison
Let’s compare calibration across all three models:
p1 <- plot(result_lr, type = "calibration") +
labs(title = "Logistic Regression") +
theme(legend.position = "none")
p2 <- plot(result_xgb, type = "calibration") +
labs(title = "XGBoost") +
theme(legend.position = "none")
p3 <- plot(result_lgb, type = "calibration") +
labs(title = "LightGBM") +
theme(legend.position = "none")
p1 + p2 + p3 + plot_layout(ncol = 3)
DCA Comparison
d1 <- plot(result_lr, type = "dca") +
labs(title = "Logistic Regression")
d2 <- plot(result_xgb, type = "dca") +
labs(title = "XGBoost")
d3 <- plot(result_lgb, type = "dca") +
labs(title = "LightGBM")
d1 + d2 + d3 + plot_layout(ncol = 3)
Extracting Predictions
You can access the out-of-fold predictions for further analysis:
# Get predictions
preds <- result_lr$predictions
head(preds, 10) %>%
knitr::kable(digits = 3, caption = "Sample Out-of-Fold Predictions")
| .row | fold | outcome | predicted_prob | linear_predictor |
|---|---|---|---|---|
| 1 | Resample3 | 1 | 0.862 | 1.828 |
| 2 | Resample3 | 1 | 0.656 | 0.647 |
| 3 | Resample3 | 1 | 0.531 | 0.122 |
| 4 | Resample3 | 0 | 0.761 | 1.159 |
| 5 | Resample3 | 1 | 0.731 | 1.001 |
| 6 | Resample3 | 1 | 0.536 | 0.143 |
| 7 | Resample3 | 1 | 0.841 | 1.663 |
| 8 | Resample3 | 1 | 0.783 | 1.286 |
| 9 | Resample3 | 1 | 0.631 | 0.537 |
| 10 | Resample3 | 1 | 0.849 | 1.727 |
These pooled out-of-fold predictions are useful for:
- Custom calibration analyses (see the example below)
- Threshold selection
- Building calibration plots at specific risk strata
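As an example, a simple decile-based calibration check can be built directly from these predictions (a sketch using the column names shown above):
# Compare mean predicted risk with the observed event rate
# within deciles of predicted probability
result_lr$predictions %>%
  mutate(risk_decile = ntile(predicted_prob, 10)) %>%
  group_by(risk_decile) %>%
  summarise(
    n = n(),
    mean_predicted = mean(predicted_prob),
    observed_rate = mean(outcome)
  )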
Best Practices
1. Sample Size
- Ensure each cluster has sufficient events (ideally 20+)
- More clusters = more stable estimates
- Consider pooling small clusters if needed (see the sketch below)
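One hypothetical way to pool small clusters before running IECV is to combine centers with fewer than a chosen number of events into a single cluster (the threshold of 20 events and the label "Pooled_small_centers" are illustrative):
# Pool centers with fewer than 20 events into one cluster
min_events <- 20
small_centers <- patient_data %>%
  group_by(center) %>%
  summarise(n_events = sum(outcome)) %>%
  filter(n_events < min_events) %>%
  pull(center)

patient_data_pooled <- patient_data %>%
  mutate(center = if_else(center %in% small_centers, "Pooled_small_centers", center))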
2. Bootstrap Replicates
# For exploratory analysis
n_boot = 50
# For publication
n_boot = 200
3. Model Selection
| Scenario | Recommended Model |
|---|---|
| Interpretability needed | Logistic regression |
| Complex non-linear relationships | XGBoost/LightGBM |
| Small sample size | Logistic regression |
| Large sample, many predictors | XGBoost/LightGBM |
4. Reporting Checklist
- Number of centers, patients, and events per center
- Model type and predictors used
- Per-center and pooled discrimination (AUC) and calibration (intercept and slope)
- Brier score
- Number of bootstrap replicates and confidence level
- Forest, calibration, and decision-curve plots
Summary
In this tutorial, you learned how to:
- Run IECV with iecv_modelling() for logistic regression, XGBoost, and LightGBM
- Interpret metrics including AUC, calibration, and Brier score
- Visualize results with forest plots, calibration curves, and decision curves
- Compare models to choose the best approach for your data
- Use SHAP to understand tree model predictions
IECV provides a robust framework for validating clinical prediction models when you have multi-center data. By treating each center as an external validation, you get realistic estimates of how your model will perform in new settings.
Session Info
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS 26.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Copenhagen
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.2 cli_3.6.5 bonsai_0.4.0 dcurves_0.5.1
[5] probably_1.2.0 furrr_0.3.1 future_1.68.0 yardstick_1.3.2
[9] workflowsets_1.1.1 workflows_1.3.0 tune_2.0.1 tailor_0.1.0
[13] rsample_1.3.1 recipes_1.3.1 parsnip_1.4.0 modeldata_1.5.1
[17] infer_1.1.0 dials_1.4.2 scales_1.4.0 broom_1.0.11
[21] tidymodels_1.4.1 lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0
[25] dplyr_1.1.4 purrr_1.2.0 readr_2.1.6 tidyr_1.3.2
[29] tibble_3.3.0 ggplot2_4.0.1 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] rlang_1.1.6 magrittr_2.0.4 otel_0.2.0
[4] compiler_4.5.0 mgcv_1.9-4 vctrs_0.6.5
[7] lhs_1.2.0 pkgconfig_2.0.3 crayon_1.5.3
[10] fastmap_1.2.0 backports_1.5.0 labeling_0.4.3
[13] utf8_1.2.6 rmarkdown_2.30 prodlim_2025.04.28
[16] tzdb_0.5.0 bit_4.6.0 xfun_0.55
[19] jsonlite_2.0.0 parallel_4.5.0 R6_2.6.1
[22] stringi_1.8.7 RColorBrewer_1.1-3 parallelly_1.46.0
[25] rpart_4.1.24 shapviz_0.10.3 xgboost_3.1.2.1
[28] Rcpp_1.1.0 knitr_1.51 future.apply_1.20.1
[31] lightgbm_4.6.0 Matrix_1.7-4 splines_4.5.0
[34] nnet_7.3-20 timechange_0.3.0 tidyselect_1.2.1
[37] rstudioapi_0.17.1 yaml_2.3.12 timeDate_4051.111
[40] codetools_0.2-20 listenv_0.10.0 lattice_0.22-7
[43] withr_3.0.2 S7_0.2.1 evaluate_1.0.5
[46] archive_1.1.12.1 survival_3.8-3 pillar_1.11.1
[49] generics_0.1.4 vroom_1.6.7 hms_1.1.4
[52] mirai_2.5.3 globals_0.18.0 class_7.3-23
[55] glue_1.8.0 tools_4.5.0 data.table_1.18.0
[58] modelenv_0.2.0 gower_1.0.2 grid_4.5.0
[61] ipred_0.9-15 nlme_3.1-168 DiceDesign_1.10
[64] viridisLite_0.4.2 lava_1.8.2 gtable_0.3.6
[67] GPfit_1.0-9 digest_0.6.39 nanonext_1.7.2
[70] htmlwidgets_1.6.4 farver_2.1.2 htmltools_0.5.9
[73] lifecycle_1.0.4 hardhat_1.4.2 bit64_4.6.0-1
[76] MASS_7.3-65 sparsevctrs_0.3.5