A binomial regression using mdoels to predict palmetto species from a lsit of observed variables.
This analysis looks at the survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017. It uses binary logistic regression to predict whether a palmetto is species Serenoa repens or Sabal etonia. Data was collected by the Environmental Data Initiative.
This analysis uses binary logistic regression to test feasibility of using variables plant height (height), canopy length (length), canopy width (width), and number of green leaves (green_lvs) to classify whether a palmetto is species Serenoa repens or Sabal etonia.
# Create 2 finalized data visualizations exploring differences in height, canopy length, canopy width, and green leaves for the two species. Note: species code 1 is Serenoa repens and species code 2 is Sabal etonia.
palmetto_clean <- palmetto %>%
select(year, species, height:green_lvs) %>%
drop_na() %>%
mutate(species = case_when(species == "1" ~ "Serenoa repens",
species == "2" ~ "Sabal etonia"))
#Creating summary of height to pull means for in-line code
palmetto_summary <- palmetto_clean %>%
group_by(species) %>%
summarize(count = n(),
mean_height = mean(height, na.rm = TRUE),
mean_width = mean(width, na.rm = TRUE),
mean_length = mean(length, na.rm = TRUE))
# Plot 1: Height beeswarm
height <- ggplot(data = palmetto_clean, aes(x = species, y = height)) +
geom_beeswarm(aes(color = species),
dodge.width = 0, cex = 0.4, size = 0.1,
show.legend = FALSE) +
geom_boxplot(fill = NA, width = 0.2, color = 'darkslategray',
outlier.color = NA) +
stat_summary(fun.y = "mean",
col = "darkslategray",
shape = 18) +
scale_color_manual(values = c("darkolivegreen3", "seagreen")) +
theme_bw() +
labs(x = "Species", y = "Height (cm)")
# Plot 2: Leaves histogram
leaves <- ggplot(data = palmetto_clean, aes(x = green_lvs)) +
geom_histogram(aes(fill = species), color = "darkslategray",
binwidth = 1, show.legend = FALSE) +
scale_fill_manual(values = c("darkolivegreen3", "seagreen")) +
facet_wrap(~ species) +
theme_bw() +
labs(x = "Green Leaves (number)", y = "Count")
# combine plots
plots <- plot_grid(height, leaves,
ncol = 2,
rel_widths = c(1, 1))
plot_grid(plots, ncol = 1,
rel_heights = c(0.1, 1, 0.1))
These graphs show that both species have a similar height, indicating this variable may not be significant in the model. The difference in mean height between the two species is only 2.1centimeters. The taller of the two species, Serenoa repens tend to have more green leaves and the range of leaf number is wider. Sabal etonia were more closely clustered around 3-4 leaves. This indicted that leaf number could be one of the strongest predictive variables in this model.
# Plot 1: width
width <- ggplot(data = palmetto_clean, aes(x = species, y = width)) +
geom_jitter(aes(color = species), width = 0.3, size = 0.5,
show.legend = FALSE) +
geom_boxplot(fill = NA, width = 0.2, color = 'darkslategray',
outlier.color = NA) +
stat_summary(fun.y = "mean",
col = "darkslategray",
shape = 18) +
scale_color_manual(values = c("darkolivegreen3", "seagreen")) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.title.y = element_text(size = 10)) +
labs(y = "Canopy Width (cm)")
# Plot 2: Length
length <- ggplot(data = palmetto_clean, aes(x = species, y = length)) +
geom_jitter(aes(color = species), width = 0.3, size = 0.5,
show.legend = FALSE) +
geom_boxplot(fill = NA, width = 0.2, color = 'darkslategray',
outlier.color = NA) +
stat_summary(fun.y = "mean",
col = "darkslategray",
shape = 18) +
scale_color_manual(values = c("darkolivegreen3", "seagreen")) +
theme_bw() +
theme(axis.title.x=element_blank(),
axis.title.y = element_text(size = 10)) +
labs(y = "Canopy Length (cm)")
# combine plots
plots <- plot_grid(width, length,
ncol = 2,
rel_widths = c(1, 1))
plot_grid(plots, ncol = 1,
rel_heights = c(0.1, 1, 0.1))
Both species show similar average canopy width and length with the Sabal etonia showing higher mean values for both. For the width, the difference in means between the two species is 7.3 centimeters. The difference in average length is 23.28 centimeters. This indicates that canopy length could be a stronger determinant of species in the model compared to canopy width.
Now knowing a few general characteristics about the two species, this analysis will build a binomial logistic regression to try and predict the species from the data.
# Perform the analysis twice, using cross validation to compare two models.
palmetto_sub <- palmetto %>%
select(year, species, height:green_lvs) %>%
mutate(species = as.factor(species)) %>%
mutate(species = fct_drop(species)) %>%
drop_na()
# Note: Sabal etonia is the '0' level
# Model 1: Log odds of plant type using plant height, canopy length, canopy width and green leaves as predictor variable.
f1 <- species ~ height + width + length + green_lvs
palmetto_blm1 <- glm(formula = f1,
data = palmetto_sub,
family = 'binomial')
# Log odds to probability
blm1_fitted <- palmetto_blm1 %>%
augment(type.predict = 'response')
# Model 2: Log odds of plant type using plant height, canopy width and green leaves (i.e., drop canopy length for this model)
f2 <- species ~ height + width + green_lvs
palmetto_blm2 <- glm(formula = f2,
data = palmetto_sub,
family = 'binomial')
# Log odds to probability
blm2_fitted <- palmetto_blm2 %>%
augment(type.predict = 'response')
# Assess the models with AICc first
aic <- aictab(list(palmetto_blm1, palmetto_blm2))
# Model 1 has a lower corrected AIC value, indicating better fit.
An initial look at the AIC values indicate Model 1 may be a better fit. The corrected AIC value for Model 1 (5195) is lower compared to model 2 (5987). The difference in these values is large, indicating that including canopy length could be important for this model.
# Assess the models with AICc first
aic <- aictab(list(palmetto_blm1, palmetto_blm2))
# Model 1 has a lower corrected AIC value, indicating better fit.
# Perform cross-validation with caret
set.seed(123)
tr_ctrl <- trainControl(method = "repeatedcv",
number = 10, repeats = 10)
model1 <- train(f1, data = palmetto_sub,
method = "glm", family = 'binomial',
trControl = tr_ctrl)
model2 <- train(f2, data = palmetto_sub,
method = "glm", family = 'binomial',
trControl = tr_ctrl)
Model 1 predicts the species of Palmetto with 91.7% percent accuracy. This is slightly higher than the 89.9% predictive accuracy of Model 2. Given the more favorable AIC value and higher predictive accuracy, this analysis will use Model 1.
# Train the final model (model 3) on the entire data set (not the folds).
final_model <- glm(formula = f1,
data = palmetto_sub,
family = 'binomial')
# Use tidy to clean this up
tidy_model <- tidy(final_model)
tidy_model$p.value <- ifelse(tidy_model$p.value < .001, paste("< .001"))
# Use kable to create a table of these stats
final_table <- tidy_model %>%
select(-statistic) %>%
mutate(term = case_when(term == '(Intercept)' ~ 'Intercept',
term == 'height' ~ 'Height',
term == 'width' ~ 'Width',
term == 'length' ~ 'Length',
term == 'green_lvs' ~ 'Green Leaves')) %>%
kable(col.names = c("Variable", "Coefficients",
"Standard Error", "P Value"),
digits = 3) %>%
kable_styling(bootstrap_options = "striped",
position = "left", full_width = FALSE)
final_table
Variable | Coefficients | Standard Error | P Value |
---|---|---|---|
Intercept | 3.227 | 0.142 | < .001 |
Height | -0.029 | 0.002 | < .001 |
Width | 0.039 | 0.002 | < .001 |
Length | 0.046 | 0.002 | < .001 |
Green Leaves | -1.908 | 0.039 | < .001 |
This table shows the coefficients associated with each variable from Model 1. These reflect differences in the characteristics visualized above. In particular, the large negative coefficient for the number of green leaves means it is less likely the species is Sabal etonia (the reference species) as the number of green leaves increases. This is confirmed in Figure 1.
# Use the fitted data set created above and establish a threshold at 50% where the model will predict species 2 above the threshold, and species 1 below the threshold.
blm1_thresh <- blm1_fitted %>%
mutate(predicted = case_when(.fitted >= .5 ~ "2",
.fitted < .5 ~ "1")) %>% # Now mutate a new column to show correct/incorrect
mutate(outcome = case_when(
species == predicted ~ "Correct",
species != predicted ~ "Incorrect"))
# Create a new df to show the counts correct/incorrect
thresh_summary <- blm1_thresh %>%
group_by(species, outcome) %>%
count(outcome) %>%
pivot_wider(names_from = outcome,
values_from = n) %>%
mutate(species = case_when(
species == 1 ~ "Serenoa repens",
species == 2 ~ "Sabal etonia")) %>%
column_to_rownames(var = "species") %>%
mutate(pct_correct = round(Correct/(Correct + Incorrect)*100, 0)) %>%
mutate(pct_correct = paste(pct_correct,"%", sep = ''))
# Pass this new data frame into a finalized table.
thresh_summary %>%
kable(col.names = c("Correct", "Incorrect",
"Percent Correct")) %>%
kable_styling(bootstrap_options = "striped",
position = "left", full_width = FALSE)
Correct | Incorrect | Percent Correct | |
---|---|---|---|
Serenoa repens | 5548 | 564 | 91% |
Sabal etonia | 5701 | 454 | 93% |
Model 1 was more accurate in predicting Sabal etonia comapred to Serenoa repens. The model correctly predicted that a palmetto was Serenoa repens 91% of the time. The accuracy of predicting Sabal etonia was 93%. This gives the average accuracy of 92% referenced above.
Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/f2f96ec76fbbd4b9db431c79a770c4d5