Using a binomial GLMM for partially paired categorical data
The motivation for this talk is a comparison of two models that predict the invasiveness of horticulturally interesting trees and shrubs. The data are a list of trees and shrubs introduced to a state for horticultural purposes and, for each species, whether it is invasive (i.e., has established new populations in non-horticultural sites) and 11 biological characteristics. Random Forest models predicted the probability of invasiveness from the biological characteristics. We developed two models, one for species in the Chicago area and one for species in Iowa. The biologists wanted to know whether one model made more accurate predictions than the other. The complication is that the species in the Chicago data set partially overlap those in the Iowa data set. If there were no overlap, the analysis would be a comparison of proportions from independent observations. If all species were on both lists, the analysis would be a comparison of proportions from paired observations. Our experience with these models is that some species (e.g., Berberis thunbergii) are simply hard to predict accurately, so the pairing cannot be ignored. To use all the data, we need an analysis that allows for partial overlap of the lists. Hence, the model needs to account for heterogeneity among species. One option is a Bernoulli GLMM with a random effect for species. I compare four approximate likelihood estimators (SAS/PQL, SAS/PQL with KR adjustments, SAS/Laplace, and R/Laplace) and a Bayesian estimator implemented using rstanarm.
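
As a rough sketch of the Bernoulli GLMM with a species random effect, one possible formulation is given below. The notation is assumed for illustration and does not come from the abstract: Y_ij is 1 if the prediction for species i under model j (Chicago or Iowa) was correct and 0 otherwise, x_j is an indicator for the Iowa model, and u_i is the species random effect. The abstract does not specify the response coding, the link function, or the fixed-effect structure, so the logit link and the single model indicator are also assumptions.

```latex
\[
  Y_{ij} \mid u_i \sim \mathrm{Bernoulli}(\pi_{ij}), \qquad
  \mathrm{logit}(\pi_{ij}) = \beta_0 + \beta_1 x_j + u_i, \qquad
  u_i \sim N(0, \sigma_u^2)
\]
```

Under this formulation, a species that appears on only one list contributes a single observation, while a species on both lists contributes two observations that share the same u_i, which is how the partial pairing would enter the likelihood. The coefficient \beta_1 is then the species-conditional log odds ratio of a correct prediction for the Iowa model relative to the Chicago model.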