Internal-External Cross-Validation Tutorial

A Practical Guide to Validating Clinical Prediction Models

Author

CPM_Shortcuts

Published

January 12, 2026

Introduction

Welcome to this interactive tutorial on Internal-External Cross-Validation (IECV) for clinical prediction models. This guide will walk you through:

  1. Understanding when and why to use IECV
  2. Running IECV with different model types
  3. Interpreting the results
  4. Creating publication-quality visualizations

What is IECV?

Internal-External Cross-Validation is a validation strategy for prediction models developed using multi-center or multi-study data. Unlike standard cross-validation that randomly splits data, IECV leverages the natural clustering in your data (hospitals, studies, time periods) to provide more realistic performance estimates.

When to use IECV

IECV is ideal when:

  • You have data from multiple centers/hospitals
  • You want to estimate how well your model will perform at new centers
  • You need to assess model transportability across settings

The IECV Process

flowchart LR
    A[Multi-center Data] --> B[Leave One Center Out]
    B --> C[Train on N-1 Centers]
    C --> D[Validate on Held-out Center]
    D --> E[Repeat for Each Center]
    E --> F[Pool Results]

Each center takes a turn being the “external” validation set, while the model is trained on all other centers. This mimics the real-world scenario of applying a model to a new hospital.
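
Under the hood, each iteration is ordinary grouped resampling with one fold per cluster. As a minimal sketch of the scheme (not necessarily the internals of the iecv_modelling() function used below), rsample::group_vfold_cv() builds leave-one-group-out splits when v is left at its default of one fold per group; multi_center_data is a hypothetical data frame with a center column:

# One fold per distinct center: each split holds one center out for validation
folds <- rsample::group_vfold_cv(multi_center_data, group = center)

# The first split: train on all-but-one center, validate on the held-out one
rsample::analysis(folds$splits[[1]])    # training data (N-1 centers)
rsample::assessment(folds$splits[[1]])  # validation data (held-out center)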


Setup

First, let’s load the required packages and our IECV function.

# Load required packages
library(tidyverse)
library(tidymodels)
library(furrr)
library(probably)
library(dcurves)
library(bonsai)
library(cli)
library(patchwork)

# Load the IECV function
source("../R/iecv_modelling.R")

Now let’s load our example dataset: simulated multi-center patient data.

# Load the dataset
patient_data <- read_csv("../data/simulated_patient_data.csv")

# Quick overview
glimpse(patient_data)
Rows: 1,346
Columns: 7
$ patient_id  <chr> "A001", "A002", "A003", "A004", "A005", "A006", "A007", "A…
$ center      <chr> "Hospital_A", "Hospital_A", "Hospital_A", "Hospital_A", "H…
$ age         <dbl> 73.5, 62.6, 48.7, 68.5, 69.0, 54.1, 80.7, 47.7, 63.8, 49.0…
$ sex         <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0…
$ biomarker   <dbl> 4.18, 4.79, 3.85, 3.97, 4.89, 3.04, 5.12, 5.12, 4.24, 7.84…
$ comorbidity <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0…
$ outcome     <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0…
# Check the distribution across centers
patient_data %>%
  group_by(center) %>%
  summarise(
    n_patients = n(),
    n_events = sum(outcome),
    event_rate = mean(outcome) * 100
  ) %>%
  arrange(center) %>%
  knitr::kable(
    col.names = c("Center", "Patients", "Events", "Event Rate (%)"),
    digits = 1,
    caption = "Distribution of patients and outcomes by center"
  )
Distribution of patients and outcomes by center

| Center     | Patients | Events | Event Rate (%) |
|------------|---------:|-------:|---------------:|
| Hospital_A |      198 |    171 |           86.4 |
| Hospital_B |      228 |    184 |           80.7 |
| Hospital_C |      225 |    162 |           72.0 |
| Hospital_D |      208 |    162 |           77.9 |
| Hospital_E |      249 |    211 |           84.7 |
| Hospital_F |      238 |    176 |           73.9 |

Sample Size Considerations

IECV requires at least 3 clusters to function properly. More clusters provide more validation folds and more stable estimates. Each cluster should have enough events for reliable metric estimation.
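
A programmatic pre-flight check along these lines might look as follows (a sketch; the 20-event rule of thumb reappears under Best Practices below):

# Sketch: verify cluster count and flag clusters with few events before IECV
n_clusters <- n_distinct(patient_data$center)
stopifnot("IECV needs at least 3 clusters" = n_clusters >= 3)

patient_data %>%
  count(center, wt = outcome, name = "n_events") %>%  # events per center
  filter(n_events < 20)                               # clusters to watch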


Running IECV: Logistic Regression

Let’s start with a logistic regression model, the workhorse of clinical prediction modeling.

Basic Usage

# Run IECV with logistic regression
result_lr <- iecv_modelling(
  data = patient_data,
  outcome = "outcome",
  predictors = c("age", "sex", "biomarker", "comorbidity"),
  cluster = "center",
  model = "logistic",
  n_boot = 100,        # Bootstrap replicates for CIs
  n_cores = 2,         # Parallel processing
  verbose = TRUE       # Show progress
)

Viewing Results

The print() method provides a quick summary:

print(result_lr)
Internal-External Cross-Validation Results
==========================================

Model: Logistic Regression 
Clusters: 6 
Metrics: auc, brier, cal_intercept, cal_slope 
Bootstrap replicates: 100 
Confidence level: 95 %

Summary across clusters:
# A tibble: 4 × 6
  metric                   mean     sd  median    min   max
  <chr>                   <dbl>  <dbl>   <dbl>  <dbl> <dbl>
1 AUC                    0.737  0.0218  0.745   0.701 0.759
2 Brier Score            0.149  0.0240  0.144   0.115 0.179
3 Calibration Intercept -0.0563 0.752  -0.0101 -0.933 1.15 
4 Calibration Slope      1.06   0.157   1.09    0.798 1.27 

For more detailed output, use summary():

summary(result_lr)
IECV Summary
============

Model: Logistic Regression 

Per-cluster results:

# A tibble: 6 × 7
  id        n_validation n_events auc              brier cal_intercept cal_slope
  <chr>            <int>    <dbl> <chr>            <chr> <chr>         <chr>    
1 Resample1          238      176 0.739 (0.675-0.… 0.17… -0.740 (-1.7… 1.074 (0…
2 Resample2          225      162 0.759 (0.690-0.… 0.17… -0.933 (-1.7… 1.269 (0…
3 Resample3          198      171 0.701 (0.612-0.… 0.13… 1.146 (0.635… 0.798 (0…
4 Resample4          208      162 0.724 (0.641-0.… 0.14… -0.199 (-0.8… 0.999 (0…
5 Resample5          249      211 0.752 (0.649-0.… 0.11… 0.210 (-0.67… 1.125 (0…
6 Resample6          228      184 0.751 (0.679-0.… 0.14… 0.179 (-0.37… 1.115 (0…


Pooled summary:

# A tibble: 4 × 6
  metric                   mean     sd  median    min   max
  <chr>                   <dbl>  <dbl>   <dbl>  <dbl> <dbl>
1 AUC                    0.737  0.0218  0.745   0.701 0.759
2 Brier Score            0.149  0.0240  0.144   0.115 0.179
3 Calibration Intercept -0.0563 0.752  -0.0101 -0.933 1.15 
4 Calibration Slope      1.06   0.157   1.09    0.798 1.27 

Understanding the Metrics

| Metric                | Interpretation                                                       |
|-----------------------|----------------------------------------------------------------------|
| AUC                   | Discrimination: how well the model separates events from non-events  |
| Brier Score           | Overall accuracy; lower is better                                    |
| Calibration Intercept | Systematic bias; positive means underprediction                      |
| Calibration Slope     | < 1 indicates overfitting, > 1 underfitting                          |
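
All four can be reproduced by hand from the pooled out-of-fold predictions (see “Extracting Predictions” below). A sketch, assuming the predictions tibble contains outcome, predicted_prob, and linear_predictor columns; note that metrics pooled this way will differ slightly from the per-cluster summaries above:

preds <- result_lr$predictions

# AUC: discrimination, with outcome = 1 treated as the event
roc_auc_vec(factor(preds$outcome, levels = c(1, 0)), preds$predicted_prob)

# Brier score: mean squared error of the predicted probabilities
mean((preds$predicted_prob - preds$outcome)^2)

# Calibration slope: coefficient on the linear predictor
coef(glm(outcome ~ linear_predictor, family = binomial, data = preds))

# Calibration intercept: refit with the linear predictor as a fixed offset
coef(glm(outcome ~ offset(linear_predictor), family = binomial, data = preds))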

Visualization

Forest Plots

Forest plots show performance across all validation centers, helping identify heterogeneity.

# All metrics in one view
plot(result_lr)

You can also plot individual metrics:

# Just AUC
plot(result_lr, type = "auc")

Calibration Plot

Calibration shows how well predicted probabilities match observed outcomes.

plot(result_lr, type = "calibration")

Reading Calibration Plots
  • Perfect calibration: points follow the diagonal line
  • Above the line: model underpredicts risk
  • Below the line: model overpredicts risk
  • The ribbon shows the 90% confidence interval
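
For custom analyses, a decile-based calibration curve is easy to build by hand from the pooled out-of-fold predictions; this is a sketch, not the plot() method’s implementation:

result_lr$predictions %>%
  mutate(bin = ntile(predicted_prob, 10)) %>%
  group_by(bin) %>%
  summarise(
    mean_pred = mean(predicted_prob),  # mean predicted risk per decile
    obs_rate  = mean(outcome)          # observed event rate per decile
  ) %>%
  ggplot(aes(x = mean_pred, y = obs_rate)) +
  geom_abline(linetype = "dashed", color = "gray50") +  # perfect calibration
  geom_point(size = 2) +
  coord_equal(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(x = "Mean predicted probability", y = "Observed event rate")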

Decision Curve Analysis

Decision curves help assess clinical utility: whether using the model improves decision-making compared to default strategies.

plot(result_lr, type = "dca")

Interpreting Decision Curves
  • Net benefit > 0: Using the model is better than treating no one
  • Model above “Treat All”: Using the model is better than treating everyone
  • The range of threshold probabilities where the model curve is highest indicates where it adds most value
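
Net benefit at a threshold probability pt is (TP - FP * pt / (1 - pt)) / n. A hand computation for a single threshold (the plot above draws full curves; the dcurves package loaded in Setup provides dca() for the same across a threshold grid):

# Net benefit of "treat if predicted risk >= pt" at a single threshold
net_benefit <- function(outcome, prob, pt) {
  treat <- prob >= pt
  tp <- sum(treat & outcome == 1)  # true positives
  fp <- sum(treat & outcome == 0)  # false positives
  (tp - fp * pt / (1 - pt)) / length(outcome)
}

net_benefit(result_lr$predictions$outcome,
            result_lr$predictions$predicted_prob,
            pt = 0.5)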

Model Coefficients

For logistic regression, we can extract interpretable odds ratios:

tidy_final_model(result_lr) %>%
  knitr::kable(digits = 3, caption = "Logistic Regression Coefficients (Odds Ratios)")
Logistic Regression Coefficients (Odds Ratios)

| term        | estimate | std.error | statistic | p.value | conf.low | conf.high |
|-------------|---------:|----------:|----------:|--------:|---------:|----------:|
| (Intercept) |    0.036 |     0.453 |    -7.313 |   0.000 |    0.015 |     0.088 |
| age         |    1.044 |     0.006 |     7.204 |   0.000 |    1.032 |     1.057 |
| sex         |    1.631 |     0.145 |     3.375 |   0.001 |    1.230 |     2.172 |
| biomarker   |    1.336 |     0.037 |     7.740 |   0.000 |    1.243 |     1.439 |
| comorbidity |    2.285 |     0.162 |     5.105 |   0.000 |    1.673 |     3.158 |

Comparing Model Types

One of the key features of iecv_modelling() is support for multiple model types. Let’s compare logistic regression with gradient boosting models.

XGBoost

result_xgb <- iecv_modelling(
  data = patient_data,
  outcome = "outcome",
  predictors = c("age", "sex", "biomarker", "comorbidity"),
  cluster = "center",
  model = "xgboost",
  n_boot = 100,
  n_cores = 2,
  verbose = TRUE
)

LightGBM

result_lgb <- iecv_modelling(
  data = patient_data,
  outcome = "outcome",
  predictors = c("age", "sex", "biomarker", "comorbidity"),
  cluster = "center",
  model = "lightgbm",
  n_boot = 100,
  n_cores = 2,
  verbose = TRUE
)

Performance Comparison

# Combine summaries
comparison <- bind_rows(
  result_lr$summary %>% mutate(model = "Logistic"),
  result_xgb$summary %>% mutate(model = "XGBoost"),
  result_lgb$summary %>% mutate(model = "LightGBM")
) %>%
  select(model, metric, mean, sd, min, max) %>%
  arrange(metric, model)

comparison %>%
  knitr::kable(
    digits = 3,
    caption = "Model Comparison: Pooled Performance Across Validation Centers"
  )
Model Comparison: Pooled Performance Across Validation Centers

| model    | metric                |   mean |    sd |    min |   max |
|----------|-----------------------|-------:|------:|-------:|------:|
| LightGBM | AUC                   |  0.657 | 0.051 |  0.591 | 0.723 |
| Logistic | AUC                   |  0.737 | 0.022 |  0.701 | 0.759 |
| XGBoost  | AUC                   |  0.704 | 0.033 |  0.645 | 0.742 |
| LightGBM | Brier Score           |  0.173 | 0.017 |  0.146 | 0.192 |
| Logistic | Brier Score           |  0.149 | 0.024 |  0.115 | 0.179 |
| XGBoost  | Brier Score           |  0.155 | 0.021 |  0.123 | 0.180 |
| LightGBM | Calibration Intercept |  0.709 | 0.633 | -0.159 | 1.585 |
| Logistic | Calibration Intercept | -0.056 | 0.752 | -0.933 | 1.146 |
| XGBoost  | Calibration Intercept |  0.263 | 0.736 | -0.506 | 1.380 |
| LightGBM | Calibration Slope     |  0.388 | 0.153 |  0.184 | 0.607 |
| Logistic | Calibration Slope     |  1.064 | 0.157 |  0.798 | 1.269 |
| XGBoost  | Calibration Slope     |  0.767 | 0.177 |  0.441 | 0.944 |

# Visual comparison of AUC
auc_comparison <- bind_rows(
  result_lr$cluster_results %>%
    select(id, auc, auc_lower, auc_upper) %>%
    mutate(model = "Logistic"),
  result_xgb$cluster_results %>%
    select(id, auc, auc_lower, auc_upper) %>%
    mutate(model = "XGBoost"),
  result_lgb$cluster_results %>%
    select(id, auc, auc_lower, auc_upper) %>%
    mutate(model = "LightGBM")
)

ggplot(auc_comparison, aes(x = auc, y = id, color = model, shape = model)) +
  geom_point(size = 3, position = position_dodge(width = 0.5)) +
  geom_errorbar(
    aes(xmin = auc_lower, xmax = auc_upper),
    width = 0.2,
    position = position_dodge(width = 0.5)
  ) +
  geom_vline(xintercept = 0.5, linetype = "dotted", color = "gray50") +
  scale_color_manual(values = c("Logistic" = "#2E86AB", "XGBoost" = "#A23B72", "LightGBM" = "#F18F01")) +
  labs(
    x = "AUC (95% CI)",
    y = "Validation Center",
    title = "Model Comparison: AUC Across External Validations",
    color = "Model",
    shape = "Model"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )


SHAP Analysis for Tree Models

For tree-based models (XGBoost, LightGBM), we can use SHAP (SHapley Additive exPlanations) to understand feature importance and effects.

SHAP Summary Plot

# SHAP beeswarm plot for XGBoost
plot(result_xgb, type = "shap")

Reading SHAP Beeswarm Plots
  • Each dot represents a patient
  • X-axis: SHAP value (contribution to prediction)
  • Color: Feature value (red = high, blue = low)
  • Features are sorted by importance (top = most important)
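
If you have direct access to a fitted booster, the same views can be reproduced with the shapviz package (a hypothetical sketch: final_booster and the numeric predictor matrix X are stand-ins, not objects returned by iecv_modelling()):

library(shapviz)

shv <- shapviz(final_booster, X_pred = X)  # SHAP values for an xgboost model
sv_importance(shv, kind = "beeswarm")      # beeswarm, as above
sv_importance(shv)                         # bar chart of mean |SHAP| values
sv_dependence(shv, v = "age")              # dependence plot for one feature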

Variable Importance

# SHAP-based importance
variable_importance(result_xgb) %>%
  knitr::kable(digits = 3, caption = "Variable Importance (Mean |SHAP Value|)")
Variable Importance (Mean |SHAP Value|)

| variable    | mean_abs_shap |
|-------------|--------------:|
| biomarker   |         0.497 |
| age         |         0.493 |
| comorbidity |         0.343 |
| sex         |         0.238 |

SHAP Dependence Plots

Dependence plots show how a specific feature affects predictions:

plot_shap_dependence(result_xgb, feature = "age")

plot_shap_dependence(result_xgb, feature = "biomarker")


Calibration Comparison

Let’s compare calibration across all three models:

p1 <- plot(result_lr, type = "calibration") +
  labs(title = "Logistic Regression") +
  theme(legend.position = "none")

p2 <- plot(result_xgb, type = "calibration") +
  labs(title = "XGBoost") +
  theme(legend.position = "none")

p3 <- plot(result_lgb, type = "calibration") +
  labs(title = "LightGBM") +
  theme(legend.position = "none")

p1 + p2 + p3 + plot_layout(ncol = 3)


DCA Comparison

d1 <- plot(result_lr, type = "dca") +
  labs(title = "Logistic Regression")

d2 <- plot(result_xgb, type = "dca") +
  labs(title = "XGBoost")

d3 <- plot(result_lgb, type = "dca") +
  labs(title = "LightGBM")

d1 + d2 + d3 + plot_layout(ncol = 3)


Extracting Predictions

You can access the out-of-fold predictions for further analysis:

# Get predictions
preds <- result_lr$predictions

head(preds, 10) %>%
  knitr::kable(digits = 3, caption = "Sample Out-of-Fold Predictions")
Sample Out-of-Fold Predictions

| .row | fold      | outcome | predicted_prob | linear_predictor |
|-----:|-----------|--------:|---------------:|-----------------:|
|    1 | Resample3 |       1 |          0.862 |            1.828 |
|    2 | Resample3 |       1 |          0.656 |            0.647 |
|    3 | Resample3 |       1 |          0.531 |            0.122 |
|    4 | Resample3 |       0 |          0.761 |            1.159 |
|    5 | Resample3 |       1 |          0.731 |            1.001 |
|    6 | Resample3 |       1 |          0.536 |            0.143 |
|    7 | Resample3 |       1 |          0.841 |            1.663 |
|    8 | Resample3 |       1 |          0.783 |            1.286 |
|    9 | Resample3 |       1 |          0.631 |            0.537 |
|   10 | Resample3 |       1 |          0.849 |            1.727 |

These pooled out-of-fold predictions are useful for:

  • Custom calibration analyses
  • Threshold selection (see the sketch below)
  • Building calibration plots at specific risk strata
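
As an example of threshold selection, here is a sensitivity/specificity sweep over candidate cut-points, sketched from the pooled predictions loaded above:

thresholds <- seq(0.1, 0.9, by = 0.1)

tibble(
  threshold   = thresholds,
  sensitivity = map_dbl(thresholds,
                        ~ mean(preds$predicted_prob[preds$outcome == 1] >= .x)),
  specificity = map_dbl(thresholds,
                        ~ mean(preds$predicted_prob[preds$outcome == 0] < .x))
)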

Best Practices

1. Sample Size

  • Ensure each cluster has sufficient events (ideally 20+)
  • More clusters = more stable estimates
  • Consider pooling small clusters if needed (see the sketch below)
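
One way to pool small clusters before running IECV, using forcats (loaded via tidyverse); the 100-patient cutoff here is an arbitrary illustration:

# Collapse centers with fewer than 100 patients into a single pooled cluster
patient_data_pooled <- patient_data %>%
  mutate(center = fct_lump_min(center, min = 100, other_level = "Pooled_small"))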

2. Bootstrap Replicates

# For exploratory analysis
n_boot = 50

# For publication
n_boot = 200

3. Model Selection

| Scenario                         | Recommended Model   |
|----------------------------------|---------------------|
| Interpretability needed          | Logistic regression |
| Complex non-linear relationships | XGBoost/LightGBM    |
| Small sample size                | Logistic regression |
| Large sample, many predictors    | XGBoost/LightGBM    |

4. Reporting Checklist

  • Number of clusters, cluster sizes, and events per cluster
  • Per-cluster and pooled performance for both discrimination and calibration
  • Forest plots showing between-center heterogeneity
  • Bootstrap settings and the confidence level used for the intervals


Summary

In this tutorial, you learned how to:

  1. Run IECV with iecv_modelling() for logistic regression, XGBoost, and LightGBM
  2. Interpret metrics including AUC, calibration, and Brier score
  3. Visualize results with forest plots, calibration curves, and decision curves
  4. Compare models to choose the best approach for your data
  5. Use SHAP to understand tree model predictions

IECV provides a robust framework for validating clinical prediction models when you have multi-center data. By treating each center as an external validation, you get realistic estimates of how your model will perform in new settings.


Session Info

sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS 26.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Copenhagen
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] patchwork_1.3.2    cli_3.6.5          bonsai_0.4.0       dcurves_0.5.1     
 [5] probably_1.2.0     furrr_0.3.1        future_1.68.0      yardstick_1.3.2   
 [9] workflowsets_1.1.1 workflows_1.3.0    tune_2.0.1         tailor_0.1.0      
[13] rsample_1.3.1      recipes_1.3.1      parsnip_1.4.0      modeldata_1.5.1   
[17] infer_1.1.0        dials_1.4.2        scales_1.4.0       broom_1.0.11      
[21] tidymodels_1.4.1   lubridate_1.9.4    forcats_1.0.1      stringr_1.6.0     
[25] dplyr_1.1.4        purrr_1.2.0        readr_2.1.6        tidyr_1.3.2       
[29] tibble_3.3.0       ggplot2_4.0.1      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] rlang_1.1.6         magrittr_2.0.4      otel_0.2.0         
 [4] compiler_4.5.0      mgcv_1.9-4          vctrs_0.6.5        
 [7] lhs_1.2.0           pkgconfig_2.0.3     crayon_1.5.3       
[10] fastmap_1.2.0       backports_1.5.0     labeling_0.4.3     
[13] utf8_1.2.6          rmarkdown_2.30      prodlim_2025.04.28 
[16] tzdb_0.5.0          bit_4.6.0           xfun_0.55          
[19] jsonlite_2.0.0      parallel_4.5.0      R6_2.6.1           
[22] stringi_1.8.7       RColorBrewer_1.1-3  parallelly_1.46.0  
[25] rpart_4.1.24        shapviz_0.10.3      xgboost_3.1.2.1    
[28] Rcpp_1.1.0          knitr_1.51          future.apply_1.20.1
[31] lightgbm_4.6.0      Matrix_1.7-4        splines_4.5.0      
[34] nnet_7.3-20         timechange_0.3.0    tidyselect_1.2.1   
[37] rstudioapi_0.17.1   yaml_2.3.12         timeDate_4051.111  
[40] codetools_0.2-20    listenv_0.10.0      lattice_0.22-7     
[43] withr_3.0.2         S7_0.2.1            evaluate_1.0.5     
[46] archive_1.1.12.1    survival_3.8-3      pillar_1.11.1      
[49] generics_0.1.4      vroom_1.6.7         hms_1.1.4          
[52] mirai_2.5.3         globals_0.18.0      class_7.3-23       
[55] glue_1.8.0          tools_4.5.0         data.table_1.18.0  
[58] modelenv_0.2.0      gower_1.0.2         grid_4.5.0         
[61] ipred_0.9-15        nlme_3.1-168        DiceDesign_1.10    
[64] viridisLite_0.4.2   lava_1.8.2          gtable_0.3.6       
[67] GPfit_1.0-9         digest_0.6.39       nanonext_1.7.2     
[70] htmlwidgets_1.6.4   farver_2.1.2        htmltools_0.5.9    
[73] lifecycle_1.0.4     hardhat_1.4.2       bit64_4.6.0-1      
[76] MASS_7.3-65         sparsevctrs_0.3.5