In this practical we will explore the outputs from gen3sis using the island simulations we ran yesterday. We will learn how to use this data with common R packages for phylogenetic comparative methods, community phylogenetics, biogeography and much more.
The goal for today will be to produce:
A map of Species Richness from the simulation summary object
A plot of Lineages Through Time from the phylogeny
Plot of species trait values on a phylogenetic tree, by linking the species objects to the phylogeny
Maps of Phylogenetic Diversity by linking species objects to the landscape and phylogeny
First, let’s make sure we have the necessary packages loaded
Simulation summary object (sgen3sis.rds)
The first object we will look at today is the sgen3sis.rds file. This file contains a summary of the simulation, which we can plot (as we did yesterday) with the plot_summary function. This is the same object that you would have in memory after running a simulation with run_simulation.
# path to the simulation output folder used throughout this practical
config_dir <- here::here("output", "islands", "config_islands_simple_Day1Prac3_M1")
# load the simulation summary object produced by gen3sis
sim <- readRDS(file.path(config_dir, "sgen3sis.rds"))
# look at what the simulation summary contains
names(sim)
[1] "summary" "flag" "system" "parameters"
The first element is the simulation summary. This contains a record of speciation, extinction, and species richness through time (phylo_summary), the number of occupied grid cells through time (occupancy), and the species richness of each grid cell at the final time step (richness-final).
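These tables can be inspected directly. As a quick sketch (using the element names described above; exact columns depend on the gen3sis version):

```r
# per-time-step record of speciation, extinction, and richness
head(sim$summary$phylo_summary)
# number of occupied grid cells through time
head(sim$summary$occupancy)
```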
These data can be visualized with the plot_summary function
# Visualize the outputs
plot_summary(sim)
We can also use these data to map out patterns of species richness
# make sure the landscape is loaded
lc <- readRDS(here::here("data", "landscapes", "islands", "landscapes.rds"))
# remove cells with elevation below sea level at time step 0 (present-day) to see the outlines of the islands
na_mask <- is.na(lc$elevation[, "0"])
rich <- sim$summary$`richness-final`
rich[na_mask, 3] <- NA
# turn the summary table of x, y, richness values into a raster
richness <- rast(rich, type = "xyz")
# plot; the sea is dark blue, given by the hexadecimal colour code
plot(richness, col = c("grey", gen3sis::color_richness(12)), colNA = "#000033")
The next part of the simulation summary object is the flag, which tells us whether the simulation ran successfully. It should read “OK”.
sim$flag
[1] "OK"
Next is the system summary. This is information about the R version, R packages, and operating system used in the simulation. This ensures complete repeatability. It also tells us the runtime of the simulation.
Phylogeny object (phy.nex)
The phylogeny object is pretty straightforward. It is a nexus file containing the relationships between the species.
# read phy
phy <- read.nexus(file.path(config_dir, "phy.nex"))
# quick tree plot; axisPhylo adds time/depth along the x-axis
plot(phy, cex = 0.4)
axisPhylo()
From this object we can look at lineages-through-time plots and estimate trends in diversification. The gamma statistic is one way to detect diversification slowdowns or speedups, with positive values indicating that nodes are concentrated towards the tips (a speedup) and negative values indicating that nodes are concentrated towards the root (a slowdown) (Pybus and Harvey 2000). There are many other measures of phylogenetic tree shape, such as metrics of tree imbalance (Colless’s Index, the beta-splitting parameter, etc.)
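These statistics can also be computed directly. As a sketch, gamma is available via ape::gammaStat, and Colless’s index via the apTreeshape package (shown commented out, since that package may not be installed for this practical):

```r
# gamma statistic computed directly from the tree (same value reported by ltt)
ape::gammaStat(phy)
# tree imbalance: Colless's index (requires the apTreeshape package)
# apTreeshape::colless(apTreeshape::as.treeshape(phy))
```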
# ltt() both plots the lineage-through-time curve and returns summary statistics
ltt_M1 <- ltt(phy)
# look at the gamma statistic
print(paste0("Gamma = ", round(ltt_M1$gamma, 2)))
[1] "Gamma = -0.81"
# is there a significant deviation from constant rates?
print(paste0("P = ", round(ltt_M1$p, 2)))
[1] "P = 0.42"
Species objects (species_t_0.rds)
Now it’s time to get into the meat of the gen3sis outputs. Most of the information from the simulation is stored in the species objects. These are .rds files containing a list with information on every species, extinct or extant, that existed during the simulation. By default these are saved at every time step, but this can be fine-tuned to your needs. The naming convention is “species_t_0.rds” for time step 0 (present-day), “species_t_50.rds” for time step 50, and so on.
# load object
species_t_0 <- readRDS(file.path(config_dir, "species", "species_t_0.rds"))
# look at object class and length
class(species_t_0)
[1] "list"
length(species_t_0)
[1] 10
# compare to number of tips in the phylogeny
Ntip(phy)
[1] 10
There are 10 elements in the list, representing our 10 species, and this number matches the number of tips in our phylogeny. No species went extinct in this particular simulation, but if any had, they would match extinct tips in the phylogeny.
Let’s look at a single species
names(species_t_0[[1]])
[1] "id" "abundance" "traits" "divergence"
The species has an ID which allows us to match it to the phylogeny.
species_t_0[[1]]$id
[1] "1"
The species’ abundances are also linked to grid cells in the landscape, which can be matched with their corresponding names.
In this case the species has an abundance value of 1 in every cell it occupies. This is because we set abundances to be binary: 1 = present, 0 = absent. We can see the species occupies cells 247, 248, 249, etc.
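We can check this ourselves by printing the abundance vector (a sketch; the cell IDs shown depend on your simulation run):

```r
# named vector: names are the occupied grid-cell IDs, values are binary abundances
head(species_t_0[[1]]$abundance)
```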
The species also has values of its traits for each of its populations.
We can see which population each row is linked to by the rownames; here you can see 247, 248, 249, etc. Each of these populations has a dispersal trait of 5 (because we didn’t vary this), slightly different temp_niche_centre values (because these evolved stochastically under a Brownian motion model), and a temperature niche width of 1 (again, we didn’t vary this in model 1).
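Again as a sketch, the traits matrix can be inspected directly (row names are the occupied cell IDs):

```r
# one row per population (grid cell), one column per trait
head(species_t_0[[1]]$traits)
```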
So, if we want to map out the distribution of species 1 at time step 0, we just need to link those cell names (247, 248, etc.) to the landscape object and make a raster. Why don’t we try it and see which islands species 1 is found on?
par(mfrow = c(1, 3))
# first plot the islands out from the landscape object
patch_xyz <- rast(lc$patch[, c("x", "y", "0")], type = "xyz")
plot(patch_xyz, main = "Island Patches", col = palette()[c(2, 3, 4, 6)], colNA = "#000033")
# match occupied cell IDs from the species traits table back to landscape coordinates
species1_xyz <- lc$patch[which(rownames(lc$patch) %in% rownames(species_t_0[[1]]$traits)), c("x", "y", "0")]
# convert those occupied cells to a raster and extend to the full island extent for plotting
species1_xyz <- rast(species1_xyz, type = "xyz")
species1_xyz <- extend(species1_xyz, patch_xyz)
plot(species1_xyz, main = "Species 1 Distribution", colNA = "#000033")
# alternatively we can use the plot_species_presence function in gen3sis...
# for that we load the landscape object of the respective time step
landscape_t_0 <- readRDS(file.path(config_dir, "landscapes", "landscape_t_0.rds"))
gen3sis::plot_species_presence(species_t_0[[1]], landscape_t_0)
Linking the species object and phylogeny
Let’s try to link the species trait data to the phylogeny to start learning something about what exactly took place during our simulation! Let’s get the mean trait value of the temperature niche and also the islands that each species belongs to.
# build a species-level table that we can join to the phylogeny
daf <- data.frame("id" = paste0("species", sapply(species_t_0, function(x) x$id)),
                  "mean_temp" = NA,
                  "island1" = 0, "island2" = 0, "island3" = 0, "island4" = 0,
                  "island_start" = NA)
# take a look
head(daf)
id mean_temp island1 island2 island3 island4 island_start
1 species1 NA 0 0 0 0 NA
2 species2 NA 0 0 0 0 NA
3 species3 NA 0 0 0 0 NA
4 species4 NA 0 0 0 0 NA
5 species5 NA 0 0 0 0 NA
6 species6 NA 0 0 0 0 NA
# Use sapply on the species object to get their mean trait values
daf$mean_temp <- sapply(species_t_0, function(x){
  # average the temperature niche centre across all populations of each species
  mean(x$traits[, "temp_niche_centre"], na.rm = TRUE)
})
# Get the island patch id values in a for loop
for(i in 1:length(species_t_0)){
  # useful debugging shortcut:
  # i <- 1
  # recover the island patch IDs occupied by this species
  speciesi_xyz <- lc$patch[which(rownames(lc$patch) %in% rownames(species_t_0[[i]]$traits)), c("x", "y", "0")]
  # then pull out the unique values (note that species might occur on more than one island)
  islands <- unique(speciesi_xyz[, 3])
  # store simple presence/absence indicators for each island
  daf$island1[i] <- ifelse(1 %in% islands, 1, 0)
  daf$island2[i] <- ifelse(2 %in% islands, 1, 0)
  daf$island3[i] <- ifelse(3 %in% islands, 1, 0)
  daf$island4[i] <- ifelse(4 %in% islands, 1, 0)
}
# get the starting island from the traits object since we recorded this in the initialization step
daf$island_start <- sapply(species_t_0, function(x) unique(x$traits[, "start_island"]))
Plot out the continuously evolving temperature niche trait
# create a named trait vector so tip labels can be matched automatically
temp_niche <- daf$mean_temp
names(temp_niche) <- daf$id
# plot it out with dots = trait
dotTree(phy, temp_niche, ftype = "i", length = 8, fsize = 1.2, standardize = TRUE)
Not much variation in that trait, due to the combined effects of trait homogenization and a low rate of change. Let’s plot the island tip states on the phylogeny using the phytools package
par(mfrow = c(1, 2))
# helper to convert an island membership column into the matrix format expected by tiplabels()
formatIsland <- function(island, phy = phy, daf = daf){
  islandf <- as.factor(daf[, island])
  names(islandf) <- daf$id
  # reorder rows so the island states line up with the phylogeny tip order
  islandmat <- to.matrix(islandf, levels(islandf))
  islandmat <- islandmat[phy$tip.label, ]
  return(list(islandf, islandmat))
}
# first panel: tree with pie symbols showing present-day island occupancy
plotTree(phy, ftype = "i", offset = 1, fsize = 0.9, xlim = c(0, 75))
tiplabels(pie = formatIsland(island = "island1", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 2)], cex = 0.6, adj = 12 + 1)
tiplabels(pie = formatIsland(island = "island2", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 3)], cex = 0.6, adj = 12 + 3)
tiplabels(pie = formatIsland(island = "island3", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 4)], cex = 0.6, adj = 12 + 5)
tiplabels(pie = formatIsland(island = "island4", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 6)], cex = 0.6, adj = 12 + 7)
plot(patch_xyz, main = "Island Patches", col = palette()[c(2, 3, 4, 6)])
Interesting. What do you notice about the distribution of species on islands? Could you predict which island each lineage began on?
We actually know which islands each lineage started on because we recorded this as a trait (we could also look at past species objects to figure this out, but we have used a shortcut through the traits).
par(mfrow = c(1, 2))
# second comparison: starting island first, then present-day occupancy columns
plotTree(phy, ftype = "i", offset = 1, fsize = 0.4, xlim = c(0, 75))
my_cex <- 0.9
# add the starting island as the first column
tiplabels(pie = formatIsland(island = "island_start", phy = phy, daf = daf)[[2]], piecol = palette()[c(2, 3, 4, 6)], cex = my_cex * 3, adj = 12)
tiplabels(pie = formatIsland(island = "island1", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 2)], cex = my_cex, adj = 12 + 1)
tiplabels(pie = formatIsland(island = "island2", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 3)], cex = my_cex, adj = 12 + 3)
tiplabels(pie = formatIsland(island = "island3", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 4)], cex = my_cex, adj = 12 + 5)
tiplabels(pie = formatIsland(island = "island4", phy = phy, daf = daf)[[2]], piecol = palette()[c(8, 6)], cex = my_cex, adj = 12 + 7)
# add island plot
plot(patch_xyz, main = "Island Patches", col = palette()[c(2, 3, 4, 6)])
So what’s really apparent here is that the clade that originated on the green island has speciated allopatrically onto the red island multiple times in the recent past. The same is true for the red clade; however, the deeper divergence between species9 and species3 has had enough time to recolonize both islands.
Linking the species object, landscape, and phylogeny
Common spatial biodiversity analyses link information measured at the species level to maps of their distribution in space using presence-absence matrices, or PAMs. A PAM is typically a data frame in which each row represents a site (which could be an island or a grid cell) and each column represents a species. Cells take the value 1 if the species is present at the site and 0 otherwise.
# grid cell level PAM
# define site IDs from landscape row names (fallback to row indices if missing)
site_ids <- rownames(lc$elevation)
if (is.null(site_ids)) {
  site_ids <- as.character(seq_len(nrow(lc$elevation)))
}
# create an empty site-by-species presence-absence matrix
PAM <- matrix(0L,
              nrow = length(site_ids),
              ncol = length(species_t_0),
              dimnames = list(site_ids, paste0("species", sapply(species_t_0, function(x) x$id))))
# fill the matrix with presences by matching occupied cell IDs to rows
for (i in seq_along(species_t_0)) {
  present_sites <- names(species_t_0[[i]]$abundance)
  # if abundance names are unavailable, use trait row names as occupied cell IDs
  if (is.null(present_sites) || length(present_sites) == 0) {
    present_sites <- rownames(species_t_0[[i]]$traits)
  }
  present_sites <- intersect(site_ids, as.character(present_sites))
  if (length(present_sites) > 0) {
    PAM[present_sites, i] <- 1L
  }
}
PAM <- as.data.frame(PAM, check.names = FALSE)
# how does it look? show first occupied cells so the preview includes presences
occupied_rows <- which(rowSums(PAM) > 0)
print(PAM[head(occupied_rows, 10), seq_len(min(10, ncol(PAM)))])
# these functions treat PAM rows as sites and columns as species
# PD = total branch length represented in a site
pd_islands <- pd(PAM, phy)
# MPD and MNTD summarize how related co-occurring species are
mpd_islands <- mpd(PAM, cophenetic(phy))
mntd_islands <- mntd(PAM, cophenetic(phy))
# join the community metrics back to x/y coordinates for mapping
community_phylo <- cbind(lc$elevation[, c("x", "y")], pd_islands, mpd_islands, mntd_islands)
par(mfrow = c(2, 2))
# build one raster per metric from the x/y/value columns
sr_ras <- rast(community_phylo[, c("x", "y", "SR")], type = "xyz")
pd_ras <- rast(community_phylo[, c("x", "y", "PD")], type = "xyz")
mpd_ras <- rast(community_phylo[, c("x", "y", "mpd_islands")], type = "xyz")
mntd_ras <- rast(community_phylo[, c("x", "y", "mntd_islands")], type = "xyz")
plot(sr_ras, main = "Species Richness", colNA = "red")
plot(pd_ras, main = "Phylogenetic Diversity")
plot(mpd_ras, main = "Mean Phylogenetic Pairwise Distance")
plot(mntd_ras, main = "Mean Nearest Neighbour Phylogenetic Distance")
Sensitivity analysis
In this part we are going to load a data set from Skeels et al. (2022, Syst. Biol.) in which we simulated data with gen3sis under four alternative models to test the evolutionary speed hypothesis (ESH). The ESH proposes that rates of evolution are faster in lineages from warm regions like the tropics, because warmer temperatures and smaller body sizes are associated with faster life histories and higher mutation rates. The four models used there were:
M0 - the null. population divergence is independent of temperature and body size
M1 - Temperature Trailblazer. environmental temperature drives rate of population divergence
M2 - Size Shaper. body size drives the rate of population divergence
M3 - Synergistic Drivers. environmental temperature and body size together drive the rate of population divergence
Skeels et al. 2022 Syst. Biol. Figure 1
Not only did we change the overall model of evolution, we also varied key parameters: rates of niche evolution (sigma_squared_t), rates of body size evolution (sigma_squared_bs), dispersal, the temperature niche breadth (omega), the exponent of the divergence factor with temperature/body size (lambda), and the divergence threshold. Load in the data and take a look; the first six columns are the model parameters we varied.
# each row is one simulation replicate; the first six columns are input parameters
sim_data <- read.csv(here::here("data", "simulated_summary_statistics.csv"))
# look at the first few columns
head(sim_data)
This data set has 27 metrics used in our paper to measure patterns in the distribution of species, such as range size metrics, or correlations between temperature and diversity, as well as phylogenetic tree shape metrics, such as gamma, and measures of functional diversity, like body size variance. We predicted that these different models of evolution (M0-M3) should leave discernible signatures in these metrics. We can plot a few associations between biodiversity metrics and these model parameters to test this hypothesis.
# how does extant diversity differ among the four models?
ggplot(sim_data, aes(x = m, y = log(n_extant_diversity), fill = m)) +
  geom_point(alpha = 0.7, position = "jitter") +
  geom_boxplot(alpha = 0.7) +
  stat_smooth() +
  theme_classic()
# how about the temperature~diversity gradient?
ggplot(sim_data, aes(x = m, y = richness_temp0mya_cor, fill = m)) +
  geom_point(alpha = 0.7, position = "jitter") +
  geom_boxplot(alpha = 0.7) +
  stat_smooth() +
  theme_classic()
# how about the relationship between body size and diversification rate?
ggplot(sim_data, aes(x = m, y = bodysize_DR_cor, fill = m)) +
  geom_point(alpha = 0.7, position = "jitter") +
  geom_boxplot(alpha = 0.7) +
  stat_smooth() +
  theme_classic()
# how about the relationship between temperature and diversification rate?
ggplot(sim_data, aes(x = m, y = temp_DR_cor, fill = m)) +
  geom_point(alpha = 0.7, position = "jitter") +
  geom_boxplot(alpha = 0.7) +
  stat_smooth() +
  theme_classic()
❓ Question [max 5 min] Do these patterns fit our predictions? Which model effects look strongest, and which metrics seem least informative?
Think of this section in two steps:
Which model parameters influence biodiversity metrics?
Which metrics are weakly explained and likely need nonlinear models/interactions?
From the scatter/boxplot views, some associations are visible, but the strength differs by metric. In general, some metrics show clearer model separation than others. This is expected in eco-evolutionary simulations because multiple processes can produce similar emergent patterns.
Useful interpretation template:
Stronger-looking effects: metrics with clearer separation among models and tighter within-model spread.
Weaker/less informative effects: metrics with substantial overlap among models and high dispersion.
Key caveat: visual trends are helpful, but formal models (below) are needed to quantify effect size and uncertainty.
We can also look at how some of these metrics varied with the continuous model parameters such as dispersal ability.
# how is diversity related to the dispersal ability of a clade?
ggplot(sim_data, aes(x = dispersal, y = log(n_extant_diversity))) +
  geom_point(alpha = 0.7) +
  stat_smooth() +
  theme_classic()
# how does dispersal relate to the temperature~diversification rate correlation?
ggplot(sim_data, aes(x = dispersal, y = temp_DR_cor)) +
  geom_point(alpha = 0.7) +
  stat_smooth() +
  theme_classic()
To perform a simple kind of sensitivity test we might ask how each of the model parameters predicts linear changes in a biodiversity pattern using a multiple regression model. One example where we have a good idea of what the relationship should be is how the spread of temperature niches across species (e.g., the kurtosis of the distribution) relates to model parameters. We expect this to scale with the rate of temperature niche evolution: faster rates of change = more variation in the trait = more kurtosis.
# copy the data and standardize only the six input parameters
# this keeps coefficient magnitudes comparable across predictors
sim_data_scaled <- sim_data
sim_data_scaled[, 1:6] <- scale(sim_data_scaled[, 1:6])
# fit the multiple regression
lm1 <- lm(temp_kurtosis ~ divergence_threshold + lambda + omega +
            sigma_squared_bs + sigma_squared_t + dispersal,
          data = sim_data_scaled)
# look at model coefficients
summary(lm1)
🏋💻 Exercise [max 10 min] Interpret the regression output for lm1. Are our expectations met for temperature niche evolution, and what surprises do you see?
lm1 asks whether model parameters predict temp_kurtosis.
Kurtosis captures tail heaviness (how much probability is in extreme values).
Higher kurtosis means more extreme trait values relative to a normal-like distribution.
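A quick base-R illustration of tail heaviness (hypothetical data, not from the simulations; excess kurtosis computed by hand):

```r
# sample excess kurtosis: fourth standardised moment minus 3
kurt <- function(x) mean((x - mean(x))^4) / mean((x - mean(x))^2)^2 - 3
set.seed(1)
kurt(rnorm(1e5))      # close to 0 for a normal distribution
kurt(rt(1e5, df = 5)) # positive: the t distribution has heavier tails
```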
This puts predictors on comparable units (mean 0, sd 1), so coefficient magnitudes are interpretable side-by-side.
How to interpret coefficients:
A positive coefficient means temp_kurtosis tends to increase as that parameter increases (in standard deviation units).
A negative coefficient means the opposite.
P-values indicate evidence for a linear effect, conditional on the other predictors.
Expectation check:
If sigma_squared_t (rate of temperature niche evolution) is positive and supported, that matches the biological expectation: faster niche evolution can generate more extreme trait values and heavier tails in the niche distribution.
What can be surprising:
Some parameters may be weak/non-significant even if biologically important.
R^2 may be modest. This is common in complex systems where nonlinearities, interactions, and stochasticity drive much of the variation.
Let’s try a few other biodiversity patterns where the predictions are less clear.
# fit the multiple regressions
# Gamma statistic for phylogenetic tree shape
lm2 <- lm(gamma ~ divergence_threshold + lambda + omega + sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
# skewness of the range size distribution of species
lm3 <- lm(rs_skewness ~ divergence_threshold + lambda + omega + sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
# correlation between species range sizes and temperature
lm4 <- lm(rangesize_temp_cor ~ divergence_threshold + lambda + omega + sigma_squared_bs + sigma_squared_t + dispersal, data = sim_data_scaled)
# look at model coefficients
summary(lm2)
In all these cases there are interesting associations between model parameters and the biodiversity metrics. However, look at the R-squared values: they are highly variable and in some cases quite low. This means that much of the variance in these biodiversity metrics is not explained by a linear combination of our model parameters, which matters when interpreting the effects. There are other ways of inferring more complex relationships between model parameters and biodiversity metrics, such as including quadratic effects, exploring interaction terms, fitting non-linear models such as generalised additive models (GAMs), or even using machine learning methods such as neural networks, which allow for high-dimensional non-linear effects. We won’t cover these today, but they are all useful options to explore during sensitivity analysis.
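As a sketch of what one nonlinear alternative could look like (assuming the mgcv package is installed; this was not part of the original analysis):

```r
library(mgcv)
# replace each linear term with a smooth, letting effects on gamma be nonlinear
gam1 <- gam(gamma ~ s(divergence_threshold) + s(lambda) + s(omega) +
              s(sigma_squared_bs) + s(sigma_squared_t) + s(dispersal),
            data = sim_data_scaled)
summary(gam1)
```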
Model Selection
Once we have established that some biodiversity metrics show predictable relationships with model parameters or the generative model (e.g., M0-M3), we can use these metrics to perform model selection on empirical data. Here we use the match between observed and simulated biodiversity patterns to ask which model is most likely to have generated the real patterns.
Skeels et al. 2022 Syst. Biol. Figure 2
To do this we are going to perform a linear discriminant analysis (LDA), a classification tool that fits fairly quickly compared to some alternatives. We want to validate how well the model performs, so we will use 10-fold cross-validation repeated 10 times: we train the model on a subset of the data, repeating the process and optimising its predictive capacity, then assess how well the classifier does on a withheld portion of the data (the test set). If the model performs well, we should be able to accurately predict which models generated which biodiversity metrics.
if (!requireNamespace("caret", quietly = TRUE)) {
  stop("The 'caret' package is required for the Model Selection section. ",
       "Install it with install.packages('caret') and re-render.")
}
# keep the class label `m` plus the biodiversity summary statistics
sim_data_ms <- sim_data[, 7:ncol(sim_data)]
# drop the MRD summary variables to match the teaching example feature set
sim_data_ms <- sim_data_ms[, -which(grepl("MRDs", colnames(sim_data_ms)))]
# reuse the course-wide seed so this split stays reproducible across renders
set.seed(course_seed)
# create a stratified split based on the model label `m`
# p = .66 keeps about two thirds of rows for training
# list = FALSE returns integer row indices, and times = 1 makes one split
train_index <- caret::createDataPartition(sim_data_ms[["m"]], p = .66, list = FALSE, times = 1)
train_data <- sim_data_ms[train_index, ]
test_data <- sim_data_ms[-train_index, ]
# learn centering/scaling from the training set only, then apply the same transform to both sets
preprocessed_values <- caret::preProcess(train_data, method = c("center", "scale"))
train_transformed <- predict(preprocessed_values, train_data)
test_transformed <- predict(preprocessed_values, test_data)
# repeated 10-fold CV estimates expected training performance during model fitting
# class probabilities are kept so we can inspect uncertainty later
train_control <- caret::trainControl(method = "repeatedcv",
                                     number = 10,
                                     repeats = 10,
                                     classProbs = TRUE,
                                     savePredictions = TRUE,
                                     allowParallel = FALSE)
# use all biodiversity metrics as predictors of the generating model `m`
f1 <- formula(paste("m ~ ", paste(names(sim_data_ms)[2:length(names(sim_data_ms))], collapse = " + ")))
## LINEAR DISCRIMINANT ANALYSIS
# note we'll run an LDA because they're quick - many other kinds of models to choose from
set.seed(course_seed)
lda_train <- caret::train(f1, data = train_transformed, method = "lda", trControl = train_control, verbose = TRUE)
# evaluate on the held-out test set that was not used during fitting
lda_test <- predict(lda_train, test_transformed)
# compute one-vs-all precision/recall metrics for each class
lda_cm <- caret::confusionMatrix(data = lda_test, reference = as.factor(test_transformed[["m"]]), mode = "prec_recall")
# how well did the model perform
lda_cm
These are reported separately for each model class (m0, m1, m2, m3), treating one class at a time.
Precision: Of the cases predicted to be this class, how many were correct?
Recall: Of the cases that truly were this class, how many did the model recover?
F1: The harmonic mean of precision and recall; useful when you want both to be high.
Prevalence: The true proportion of this class in the test data.
Detection Rate: The proportion of all cases that were correctly assigned to this class.
Detection Prevalence: The proportion of all cases that the model predicted as this class.
Balanced Accuracy: The average of sensitivity and specificity; useful when comparing classes fairly.
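To make these definitions concrete, here is a toy one-vs-all calculation for a single class (hypothetical counts, not taken from the LDA output above):

```r
# toy counts for one class
TP <- 40  # true positives
FP <- 10  # false positives
FN <- 5   # false negatives
precision <- TP / (TP + FP)                          # 0.8
recall <- TP / (TP + FN)                             # ~0.889
f1 <- 2 * precision * recall / (precision + recall)  # ~0.842
```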
❓ Question [max 5 min] What is the overall classification accuracy, and are some models predicted better than others?
From the confusion matrix output shown here:
Overall accuracy: 0.7559 (~75.59%)
95% CI: (0.7143, 0.7942)
No-information rate: 0.2505 (chance level with 4 balanced classes)
So the classifier performs far above chance.
Are some models predicted better?
Yes. Using F1 score (higher is better):
m0: 0.8500 (best overall in this seeded run)
m1: 0.8326
m2: 0.7094
m3: 0.6360 (hardest)
If you look at other metrics, m1 has the highest precision (0.8762), while m0 has the highest recall (0.8718) and balanced accuracy (0.9059).
Main confusion pattern:
m2 and m3 are still the most frequently confused pair, suggesting these two models can produce similar biodiversity-statistic signatures.
Now we’ll use this model to predict the possible model of diversification in orders of terrestrial vertebrates.
# load the empirical vertebrate summary statistics
empirical_data <- read.csv(here::here("data", "order_empirical_summary_statistics.csv"))
# restrict to clades with enough species for the summaries to be informative
empirical_data <- na.omit(empirical_data[which(empirical_data$n_species >= 20), ])
# make labels lower case
empirical_data$taxon <- tolower(empirical_data$taxon)
# rename columns so the empirical data uses the same feature names as the simulation table
colnames(empirical_data)[which(colnames(empirical_data) == "taxon")] <- "m"
colnames(empirical_data)[which(colnames(empirical_data) == "rs_kutosis")] <- "rs_kurtosis"
colnames(empirical_data)[which(colnames(empirical_data) == "n_species")] <- "n_extant_diversity"
# change _p_cor to use posterior samples; could also change _m_cor to use MCC samples
colnames(empirical_data) <- gsub("_p_cor", "_cor", colnames(empirical_data))
colnames(empirical_data) <- gsub("DivRate", "DR", colnames(empirical_data))
colnames(empirical_data)[which(colnames(empirical_data) == "collessI_posterior")] <- "collessI"
colnames(empirical_data)[which(colnames(empirical_data) == "sackinI_mcc")] <- "sackinI"
colnames(empirical_data)[which(colnames(empirical_data) == "gamma_mcc")] <- "gamma"
# keep only features shared with the simulated training data, then match the same column order
empirical_data <- empirical_data[, which(colnames(empirical_data) %in% colnames(sim_data_ms))]
empirical_subset <- empirical_data[, match(colnames(sim_data_ms), colnames(empirical_data))]
# check names match
colnames(sim_data_ms)[which(!colnames(sim_data_ms) %in% colnames(empirical_subset))]
# apply the exact same preprocessing learned from the training set
simulated_transformed <- predict(preprocessed_values, sim_data_ms)
empirical_transformed <- predict(preprocessed_values, empirical_subset)
# keep the interface general so more than one fitted model could be compared later
model_set <- list(lda_train)
# predict on empirical
class_predictions <- predict(model_set, newdata = empirical_transformed, type = "raw", na.action = na.omit)
class_probabilities <- predict(model_set, newdata = empirical_transformed, type = "prob", na.action = na.omit)
# count how many empirical orders are assigned to each generating model
colSums(do.call(rbind, class_predictions)) # TODO define class prediction table
🏋💻 Exercise [max 10 min] Summarize which diversification models are most supported across vertebrate orders and propose one biological interpretation.
Use this workflow:
Count predicted classes with table(class_predictions[[1]]).
Inspect confidence with class_probabilities[[1]].
Report both dominant classes and uncertainty.
Example interpretation template:
If one model is most frequent across orders, that model has strongest support overall.
If many clades have split probabilities across two models, those models are not well-separated for those clades.
Biological interpretation (example):
If m3 (joint temperature + body size effects) dominates, this supports a multi-driver view of diversification where interacting mechanisms shape observed biodiversity patterns.
If m2 is common in some groups, body-size-linked divergence may be especially important in those clades.
Important caveat:
Model predictions are probabilistic, not absolute. Always report class probabilities, not only top-1 classes.