Conner Smith: Binomial Logistic Regression with Palmetto Species

show

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

library(tidyverse)
library(here)
library(ggbeeswarm)
library(broom)
library(caret)
library(AICcmodavg)
library(cowplot)
library(kableExtra)
library(janitor)

Overview

This analysis looks at the survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017. It uses binary logistic regression to predict whether a palmetto is species Serenoa repens or Sabal etonia. Data was collected by the Environmental Data Initiative.

Data Visualization

show

# Read in the data 

palmetto <- read_csv(here("_projects", "binomial_regression", "data", "palmetto.csv"))

This analysis uses binary logistic regression to test feasibility of using variables plant height (height), canopy length (length), canopy width (width), and number of green leaves (green_lvs) to classify whether a palmetto is species Serenoa repens or Sabal etonia.

Figure 1: Palmetto Height and Leaf Count by Species

show

# Create 2 finalized data visualizations exploring differences in height, canopy length, canopy width, and green leaves for the two species. Note: species code 1 is Serenoa repens and species code 2 is Sabal etonia. 

palmetto_clean <- palmetto %>% 
  select(year, species, height:green_lvs) %>% 
  drop_na() %>% 
  mutate(species = case_when(species == "1" ~ "Serenoa repens",
                   species == "2" ~ "Sabal etonia"))

#Creating summary of height to pull means for in-line code 

palmetto_summary <- palmetto_clean %>%
  group_by(species) %>% 
  summarize(count = n(),
            mean_height = mean(height, na.rm = TRUE),
            mean_width = mean(width, na.rm = TRUE),
            mean_length = mean(length, na.rm = TRUE))

# Plot 1: Height beeswarm 

height <- ggplot(data = palmetto_clean, aes(x = species, y = height)) +
  geom_beeswarm(aes(color = species), 
                dodge.width = 0, cex = 0.4, size = 0.1,
                show.legend = FALSE) +
  geom_boxplot(fill = NA, width = 0.2, color = 'darkslategray',
               outlier.color = NA) +
  stat_summary(fun.y = "mean",
               col = "darkslategray",
               shape = 18) +
  scale_color_manual(values = c("darkolivegreen3", "seagreen")) +
  theme_bw() +
  labs(x = "Species", y = "Height (cm)")

# Plot 2: Leaves histogram

leaves <- ggplot(data = palmetto_clean, aes(x = green_lvs)) +
  geom_histogram(aes(fill = species), color = "darkslategray",
                 binwidth = 1, show.legend = FALSE) +
  scale_fill_manual(values = c("darkolivegreen3", "seagreen")) +
  facet_wrap(~ species) +
  theme_bw() +
  labs(x = "Green Leaves (number)", y = "Count")

# combine plots

plots <- plot_grid(height, leaves,
                   ncol = 2,
                   rel_widths = c(1, 1))

plot_grid(plots, ncol = 1,
          rel_heights = c(0.1, 1, 0.1))

Figure 1: The left graph plots palmetto height by species with the box showing the median values as well as the the 75th and 25th percentile ranges. The diamond points represent the mean values. The right plot shows the distribution of leaf number by species.

These graphs show that both species have a similar height, indicating this variable may not be significant in the model. The difference in mean height between the two species is only 2.1centimeters. The taller of the two species, Serenoa repens tend to have more green leaves and the range of leaf number is wider. Sabal etonia were more closely clustered around 3-4 leaves. This indicted that leaf number could be one of the strongest predictive variables in this model.

Figure 2: Palmetto Canopy Width and Length by Species

show

# Plot 1: width

width <- ggplot(data = palmetto_clean, aes(x = species, y = width)) +
  geom_jitter(aes(color = species), width = 0.3, size = 0.5,
             show.legend = FALSE) +
  geom_boxplot(fill = NA, width = 0.2, color = 'darkslategray',
               outlier.color = NA) +
  stat_summary(fun.y = "mean",
               col = "darkslategray",
               shape = 18) +
  scale_color_manual(values = c("darkolivegreen3", "seagreen")) +
  theme_bw() +
  theme(axis.title.x=element_blank(), 
        axis.title.y = element_text(size = 10)) +
  labs(y = "Canopy Width (cm)")

# Plot 2: Length

length <- ggplot(data = palmetto_clean, aes(x = species, y = length)) +
  geom_jitter(aes(color = species), width = 0.3, size = 0.5,
             show.legend = FALSE) +
  geom_boxplot(fill = NA, width = 0.2, color = 'darkslategray',
               outlier.color = NA) +
  stat_summary(fun.y = "mean",
               col = "darkslategray",
               shape = 18) +
  scale_color_manual(values = c("darkolivegreen3", "seagreen")) +
  theme_bw() +
  theme(axis.title.x=element_blank(), 
        axis.title.y = element_text(size = 10)) +
  labs(y = "Canopy Length (cm)")

# combine plots

plots <- plot_grid(width, length,
                   ncol = 2,
                   rel_widths = c(1, 1))

plot_grid(plots, ncol = 1,
          rel_heights = c(0.1, 1, 0.1))

Figure 2: The graph shows the height and length of palmetto canopy across the two species.

Both species show similar average canopy width and length with the Sabal etonia showing higher mean values for both. For the width, the difference in means between the two species is 7.3 centimeters. The difference in average length is 23.28 centimeters. This indicates that canopy length could be a stronger determinant of species in the model compared to canopy width.

Model Analysis

Now knowing a few general characteristics about the two species, this analysis will build a binomial logistic regression to try and predict the species from the data.

show

# Perform the analysis twice, using cross validation to compare two models.

palmetto_sub <- palmetto %>% 
  select(year, species, height:green_lvs) %>% 
  mutate(species = as.factor(species)) %>% 
  mutate(species = fct_drop(species)) %>% 
  drop_na()

# Note: Sabal etonia is the '0' level

# Model 1: Log odds of plant type using plant height, canopy length, canopy width and green leaves as predictor variable.

f1 <- species ~ height + width + length + green_lvs

palmetto_blm1 <- glm(formula = f1, 
                    data = palmetto_sub, 
                    family = 'binomial')


# Log odds to probability

blm1_fitted <- palmetto_blm1 %>% 
  augment(type.predict = 'response')

# Model 2: Log odds of plant type using plant height, canopy width and green leaves (i.e., drop canopy length for this model)

f2 <- species ~ height + width + green_lvs

palmetto_blm2 <- glm(formula = f2, 
                    data = palmetto_sub, 
                    family = 'binomial')

# Log odds to probability

blm2_fitted <- palmetto_blm2 %>% 
  augment(type.predict = 'response')

# Assess the models with AICc first 

aic <- aictab(list(palmetto_blm1, palmetto_blm2))

# Model 1 has a lower corrected AIC value, indicating better fit.

An initial look at the AIC values indicate Model 1 may be a better fit. The corrected AIC value for Model 1 (5195) is lower compared to model 2 (5987). The difference in these values is large, indicating that including canopy length could be important for this model.

show

# Assess the models with AICc first 

aic <- aictab(list(palmetto_blm1, palmetto_blm2))

# Model 1 has a lower corrected AIC value, indicating better fit. 

# Perform cross-validation with caret

set.seed(123)

tr_ctrl <- trainControl(method = "repeatedcv", 
                        number = 10, repeats = 10)

model1 <- train(f1, data = palmetto_sub, 
                method = "glm", family = 'binomial',
                trControl = tr_ctrl)


model2 <- train(f2, data = palmetto_sub, 
                method = "glm", family = 'binomial',
                trControl = tr_ctrl)

Model 1 predicts the species of Palmetto with 91.7% percent accuracy. This is slightly higher than the 89.9% predictive accuracy of Model 2. Given the more favorable AIC value and higher predictive accuracy, this analysis will use Model 1.

Table 1: This table provides a summary of the coefficients associated with the different variables used in the model. Each coefficient had a p value less than 0.001. Serenoa repens is the reference species.

show

# Train the final model (model 3) on the entire data set (not the folds). 

final_model <- glm(formula = f1, 
                    data = palmetto_sub, 
                    family = 'binomial')

# Use tidy to clean this up

tidy_model <- tidy(final_model)

tidy_model$p.value <- ifelse(tidy_model$p.value < .001, paste("< .001"))

# Use kable to create a table of these stats

final_table <- tidy_model %>% 
  select(-statistic) %>% 
  mutate(term = case_when(term == '(Intercept)' ~ 'Intercept',
                          term == 'height' ~ 'Height',
                          term == 'width' ~ 'Width',
                          term == 'length' ~ 'Length',
                          term == 'green_lvs' ~ 'Green Leaves')) %>%
  kable(col.names = c("Variable", "Coefficients", 
                      "Standard Error", "P Value"),
        digits = 3) %>% 
  kable_styling(bootstrap_options = "striped", 
                position = "left", full_width = FALSE)
final_table

Variable	Coefficients	Standard Error	P Value
Intercept	3.227	0.142	< .001
Height	-0.029	0.002	< .001
Width	0.039	0.002	< .001
Length	0.046	0.002	< .001
Green Leaves	-1.908	0.039	< .001

This table shows the coefficients associated with each variable from Model 1. These reflect differences in the characteristics visualized above. In particular, the large negative coefficient for the number of green leaves means it is less likely the species is Sabal etonia (the reference species) as the number of green leaves increases. This is confirmed in Figure 1.

Table 2: This table provides a summary of the percentage of accurate predictions from Model 1 for each species. The average accuracy of the model was 92%

show

# Use the fitted data set created above and establish a threshold at 50% where the model will predict species 2 above the threshold, and species 1 below the threshold. 

blm1_thresh <- blm1_fitted %>% 
  mutate(predicted = case_when(.fitted >= .5 ~ "2", 
    .fitted < .5 ~ "1")) %>% # Now mutate a new column to show correct/incorrect
  mutate(outcome = case_when(
    species == predicted ~ "Correct",
    species != predicted ~ "Incorrect"))
  
# Create a new df to show the counts correct/incorrect

thresh_summary <- blm1_thresh %>% 
  group_by(species, outcome) %>% 
  count(outcome) %>% 
  pivot_wider(names_from = outcome,
              values_from = n) %>% 
  mutate(species = case_when(
    species == 1 ~ "Serenoa repens",
    species == 2 ~ "Sabal etonia")) %>% 
  column_to_rownames(var = "species") %>% 
  mutate(pct_correct = round(Correct/(Correct + Incorrect)*100, 0)) %>% 
  mutate(pct_correct = paste(pct_correct,"%", sep = ''))

# Pass this new data frame into a finalized table. 

thresh_summary %>% 
  kable(col.names = c("Correct", "Incorrect", 
                      "Percent Correct")) %>% 
  kable_styling(bootstrap_options = "striped", 
                position = "left", full_width = FALSE)

	Correct	Incorrect	Percent Correct
Serenoa repens	5548	564	91%
Sabal etonia	5701	454	93%

Model 1 was more accurate in predicting Sabal etonia comapred to Serenoa repens. The model correctly predicted that a palmetto was Serenoa repens 91% of the time. The accuracy of predicting Sabal etonia was 93%. This gives the average accuracy of 92% referenced above.

Data Citation

Abrahamson, W.G. 2019. Survival, growth and biomass estimates of two dominant palmetto species of south-central Florida from 1981 - 2017, ongoing at 5-year intervals ver 1. Environmental Data Initiative. https://doi.org/10.6073/pasta/f2f96ec76fbbd4b9db431c79a770c4d5