I'm analyzing Medicare data -- my first real experience with a large dataset, where the number of observations of interest to me is in the millions. We also have repeated measures (clusters) to worry about, each containing 2 to 10 observations, give or take.

I'm struggling with performance in pretty much every approach I've tried on this dataset. One outcome of interest is a proportion. zoib (zero/one-inflated beta) is painfully slow, even on a stratified random sample of 2% of rows -- after an hour it was only 4% done fitting my null model. Boundary values (exact 0s and 1s) are common in the data, ruling out "transform and just do lmer."
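
A quick illustration of why the transform route is dead on arrival -- nothing assumed here beyond base R:

```r
# The logit transform sends exact 0s and 1s to -Inf/Inf, so any
# boundary rows would be dropped or mangled before lmer ever runs.
qlogis(c(0, 0.25, 0.5, 1))
#> [1]      -Inf -1.098612  0.000000       Inf
```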

What general tools are available for modeling bigger datasets in R? Because of data privacy agreements I'm required to do all of the computing on-prem, so unfortunately I don't know that I can take advantage of high-throughput computing on other servers, even if it were workable in this case.

#rstats #lme4 #zoib

@emjonaitis Doubt these packages can be used on the entire dataset, but here are some beta regression packages that handle clustering and include some optimizations (e.g., C++ backends, parallelization). Not sure about the 0s and 1s, though. Anyway, they might be worth checking out.
betaregscale, https://evandeilton.github.io/betaregscale/
cobin, https://github.com/changwoo-lee/cobin
glmmTMB, https://cran.r-project.org/web/packages/glmmTMB/index.html
GLMMadaptive, https://drizopoulos.github.io/GLMMadaptive/
betaregscale: Beta Regression for Interval-Censored Scale-Derived Outcomes

Maximum-likelihood estimation of beta regression models for responses derived from bounded rating scales. Observations are treated as interval-censored on (0, 1) after a scale-to-unit transformation, and the likelihood is built from the difference of the beta CDF at the interval endpoints. The complete likelihood supports mixed censoring types: uncensored, left-censored, right-censored, and interval-censored observations. Both fixed- and variable-dispersion submodels are supported, with flexible link functions for the mean and precision components. A compiled C++ backend (via Rcpp and RcppArmadillo) provides numerically stable, high-performance log-likelihood evaluation. Standard S3 methods (print(), summary(), coef(), fitted(), residuals(), predict(), plot(), confint(), vcov(), logLik(), AIC(), BIC()) are available for fitted objects.
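
To make the interval-censoring idea concrete, here's a minimal base-R sketch of that likelihood contribution -- the shape parameters and interval endpoints below are toy values for illustration, not part of the betaregscale API:

```r
# Per the description above, an interval-censored observation contributes
# P(lower < Y <= upper) under a Beta(a, b) model, i.e. the difference of
# the beta CDF at the interval endpoints.
a <- 2; b <- 5                   # toy shape parameters
lower <- 0.20; upper <- 0.25     # a rating-scale bin mapped into (0, 1)
pbeta(upper, a, b) - pbeta(lower, a, b)
#> [1] 0.1214245
```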

@erc_bk thanks for this! I'm trying glmmTMB now and we'll see. It was tractable on a stratified 25% sample (about 20 minutes), but the full dataset is slow going so far.
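
In case it helps anyone following along, here's a minimal sketch of the kind of call I mean -- `prop`, `x`, `cluster_id`, and `medicare` are placeholder names, and both `ordbeta()` (which handles exact 0s and 1s directly) and the `parallel` control need a reasonably recent glmmTMB:

```r
library(glmmTMB)

# Sketch only -- variable and data names are placeholders.
fit <- glmmTMB(
  prop ~ x + (1 | cluster_id),              # random intercept per cluster
  family = ordbeta(),                       # ordered beta: allows exact 0s and 1s
  data = medicare,
  control = glmmTMBControl(parallel = 4)    # OpenMP threads for on-prem speedup
)
summary(fit)
```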