I'm analyzing Medicare data -- my first real experience with a large dataset, where the number of observations of interest to me is in the millions. We have repeated measures/clusters to worry about, each ranging from 2 to 10 observations, give or take.

I'm struggling with performance issues in pretty much every approach I take to this dataset. One outcome of interest is a proportion. zoib is painfully slow, even when I take a (stratified) random sample of 2% of rows -- in an hour it's only 4% done fitting my null model. Boundary values (0,1) are common in the data, ruling out "transform and just do lmer."

What general tools are available for modeling bigger datasets in R? Because of data privacy agreements I'm required to do all of the computing on-prem, so unfortunately I don't know that I can take advantage of high throughput computing on other servers, if it were even workable in this case.

#rstats #lme4 #zoib

@emjonaitis
a local DBMS instance with appropriate indexes defined, accessed via the relevant R-DBMS interface, might be worth the hassle in this case? (Or pre-processing in the DBMS down to small extract files.)
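By way of illustration, a minimal sketch of the pre-process-in-the-DBMS idea using Python's built-in sqlite3 (the same pattern works from R via DBI/RSQLite or duckdb); the table, columns, and values here are all invented, not from the actual Medicare data:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # use a file path for a persistent on-prem DB
con.execute("CREATE TABLE claims (bene_id TEXT, yr INTEGER, prop REAL)")
con.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [("a", 2019, 0.0), ("a", 2020, 0.5),
     ("b", 2019, 1.0), ("b", 2020, 0.25)],
)
# An index on the clustering key makes per-cluster lookups and stratified pulls cheap.
con.execute("CREATE INDEX idx_bene ON claims (bene_id)")

# Aggregate inside the database; export only the small extract to the modeling tool.
extract = con.execute(
    "SELECT bene_id, COUNT(*) AS n_obs, AVG(prop) AS mean_prop "
    "FROM claims GROUP BY bene_id ORDER BY bene_id"
).fetchall()
# extract == [('a', 2, 0.25), ('b', 2, 0.625)]
```

The point is that the millions of rows stay on disk in the database; only the indexed lookups and the aggregated extract ever reach R.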

@BRicker @emjonaitis

I have had good experience dealing with datasets in the hundreds of millions, using duckdb as the local database. Also, judicious use of parquet partitions might help.

Large Data in R: Tools and Techniques — HBS RCS Large Data in R workshop, adapted from the original by Ben Sabath at FAS-RC in 2021 (repo: large_data_in_R)

@emjonaitis how many/what types of predictors do you have? If you really want to zoib that's just a really intense inference procedure.

Options: do "standard" beta regression, which is maybe a little more efficient since it's implemented with ML (rather than Bayesian) estimation? A common preprocessing step shrinks 0s and 1s a little bit toward 0.5, so it'll be slightly less exact, but that may not matter at all with this much data
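For reference, the shrink-toward-0.5 preprocessing usually cited here is the Smithson & Verkuilen (2006) transformation; a minimal sketch (the sample size n is invented):

```python
def squeeze(y, n):
    # Smithson & Verkuilen-style transformation: maps [0, 1] into (0, 1)
    # so boundary values become admissible for beta regression.
    return (y * (n - 1) + 0.5) / n

n = 1000  # invented sample size
squeezed = [squeeze(y, n) for y in (0.0, 0.5, 1.0)]
# boundary 0 -> 0.0005, midpoint 0.5 -> 0.5, boundary 1 -> 0.9995
```

With n in the millions the squeeze is tiny, which is why the inexactness may not matter here.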

Another option:

@emjonaitis procedures like the dbreg package (https://grantmcdermott.com/dbreg/) basically do the following: summarize the rows into buckets of profiles, and then perform weighted regression (weighted by how many people fall in each profile)

This might be possible for you? Even though there are many rows there might be an order of magnitude fewer unique rows. You would need to bin your continuous predictors, so this is also approximate

And not sure if this will even work with the nested structure
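A toy demonstration of the profile-bucketing idea (pure Python, invented data): ordinary regression on the full duplicated rows and weighted regression on the collapsed unique profiles give identical fits, because the weighted sums are the same:

```python
from collections import Counter

# Invented data with many duplicate rows, as you'd get after binning predictors.
full = ([(0, 1.0)] * 500 + [(1, 1.4)] * 300 +
        [(1, 2.0)] * 200 + [(2, 2.2)] * 400)

def fit(rows):
    """Closed-form simple weighted linear regression on (x, y, weight) rows."""
    rows = list(rows)
    sw = sum(w for _, _, w in rows)
    xbar = sum(w * x for x, _, w in rows) / sw
    ybar = sum(w * y for _, y, w in rows) / sw
    slope = (sum(w * (x - xbar) * (y - ybar) for x, y, w in rows) /
             sum(w * (x - xbar) ** 2 for x, _, w in rows))
    return slope, ybar - slope * xbar  # (slope, intercept)

# OLS on all 1400 rows, each with weight 1.
ols = fit((x, y, 1.0) for x, y in full)

# Collapse to 4 unique profiles, weighted by how many rows fall in each.
profiles = Counter(full)
wls = fit((x, y, n) for (x, y), n in profiles.items())
# ols and wls agree to floating-point precision.
```

The compression ratio (here 1400 rows down to 4 profiles) is what buys the speedup; whether something analogous survives the random-effects structure is the open question.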


@erikjan I gravitated toward zoib because of its ability to handle clustered data. I don't believe that betareg has that capability. The fixed portion of the model is simple (the effects of interest are basically a group effect and a group x time interaction) but I don't want to disregard the longitudinal aspect of the data by e.g. selecting one random obs per cluster, because the trajectories are of interest to the investigators.
@emjonaitis oh yeah, you're right, forgot that this is not in betareg
@emjonaitis this is a stretch, but do you think BRMS / Stan with Laplace approximation will be any faster?
@erikjan I haven't tried brms yet - thanks for the reminder.

@emjonaitis

For random sampling, nothing beats a properly configured index. The simplest form I used with csv (I know, not sophisticated) is a format where every record is given the same width in bytes, so you can just rand()-pick any row off the disk without loading the rest of the file.

Proper databases will have better indexing systems of course.

Always in Python rather than R, though.
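A minimal Python sketch of the fixed-width trick described above (the file contents, record width, and sample size are all invented):

```python
import os
import random
import tempfile

RECORD = 16  # fixed record width in bytes, newline included

# Write a toy fixed-width file: every record padded to the same byte length.
with tempfile.NamedTemporaryFile("wb", delete=False) as f:
    path = f.name
    for i in range(1000):
        f.write(f"row{i}".ljust(RECORD - 1).encode() + b"\n")

def sample_rows(path, k, seed=0):
    """Seek straight to k random records without scanning the whole file."""
    n_rows = os.path.getsize(path) // RECORD
    picks = random.Random(seed).sample(range(n_rows), k)
    out = []
    with open(path, "rb") as f:
        for i in picks:
            f.seek(i * RECORD)  # byte offset of row i -- the "index"
            out.append(f.read(RECORD).rstrip().decode())
    return out

rows = sample_rows(path, 5)
os.unlink(path)
```

Because each seek is O(1), drawing a sample costs the same whether the file has a thousand rows or a hundred million.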

@emjonaitis
At first I thought of something like Folding@home or BOINC, but proteins have a lot fewer privacy issues than people do.

Try and convince your higher ups to buy you a used bitcoin farm?

@emjonaitis Doubt these packages can be used for the entire dataset, but here are some beta regression packages that handle clustering and include some optimizations (e.g. C++, parallelization). Not sure about the 0s and 1s, though. Anyway, they might be worth checking out.
betaregscale, https://evandeilton.github.io/betaregscale/
cobin, https://github.com/changwoo-lee/cobin
glmmTMB, https://cran.r-project.org/web/packages/glmmTMB/index.html
GLMMadaptive, https://drizopoulos.github.io/GLMMadaptive/
Beta Regression for Interval-Censored Scale-Derived Outcomes

Maximum-likelihood estimation of beta regression models for responses derived from bounded rating scales. Observations are treated as interval-censored on (0, 1) after a scale-to-unit transformation, and the likelihood is built from the difference of the beta CDF at the interval endpoints. The complete likelihood supports mixed censoring types: uncensored, left-censored, right-censored, and interval-censored observations. Both fixed- and variable-dispersion submodels are supported, with flexible link functions for the mean and precision components. A compiled C++ backend (via Rcpp and RcppArmadillo) provides numerically stable, high-performance log-likelihood evaluation. Standard S3 methods (print(), summary(), coef(), fitted(), residuals(), predict(), plot(), confint(), vcov(), logLik(), AIC(), BIC()) are available for fitted objects.

@erc_bk thanks for this! I'm working on glmmTMB now and we'll see. It was tractable with a stratified sample of 25% (about 20 minutes) but 100% is slow going so far.