I'm analyzing Medicare data -- my first real experience with a large dataset, where the number of observations of interest to me is in the millions. We have repeated measures/clusters to worry about, each ranging from 2 to 10 observations, give or take.

I'm struggling with performance issues in pretty much every approach I take to this dataset. One outcome of interest is a proportion. zoib is painfully slow, even when I take a (stratified) random sample of 2% of rows -- after an hour it's only 4% of the way through fitting my null model. Boundary values (exact 0s and 1s) are common in the data, ruling out "transform and just do lmer."

What general tools are available for modeling bigger datasets in R? Because of data privacy agreements I'm required to do all of the computing on-prem, so unfortunately I don't know that I can take advantage of high throughput computing on other servers, if it were even workable in this case.

#rstats #lme4 #zoib

@emjonaitis
A local DBMS instance with appropriate indexes defined, accessed through the relevant R-DBMS interface, might be worth the hassle in this case? (Or pre-processing in the DBMS to produce small extract files.)
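A rough sketch of the indexed-database idea, in Python with the standard-library `sqlite3` module standing in for whatever DBMS is available on-prem. The table name `claims` and the columns `cluster_id`, `t`, and `y` are made up for illustration. The point is that an index on the cluster key lets you pull whole clusters into a small extract, so each sampled trajectory stays complete.

```python
import random
import sqlite3

# Hypothetical schema: one row per observation, several rows per cluster.
con = sqlite3.connect(":memory:")  # use a file path for real data
con.execute("CREATE TABLE claims (cluster_id INTEGER, t INTEGER, y REAL)")

# Invented data: 1000 clusters with 2 to 10 observations each.
random.seed(1)
rows = [
    (cid, t, random.random())
    for cid in range(1000)
    for t in range(random.randint(2, 10))
]
con.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)

# Index on the cluster key so per-cluster lookups do not scan the table.
con.execute("CREATE INDEX idx_claims_cluster ON claims (cluster_id)")

# Sample clusters (not rows) so every sampled cluster keeps all its rows.
all_ids = [r[0] for r in con.execute("SELECT DISTINCT cluster_id FROM claims")]
sample_ids = random.sample(all_ids, k=20)

placeholders = ",".join("?" * len(sample_ids))
extract = con.execute(
    f"SELECT cluster_id, t, y FROM claims "
    f"WHERE cluster_id IN ({placeholders}) ORDER BY cluster_id, t",
    sample_ids,
).fetchall()
```

The same pattern should carry over to other engines and to R's database interfaces, though the exact calls will differ.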

@BRicker @emjonaitis

I have had good experience dealing with datasets in the hundreds of millions, using duckdb as the local database. Also, judicious use of parquet partitions might help.

Large Data in R: Tools and Techniques

HBS RCS Large Data in R workshop. Adapted from the original by Ben Sabath at FAS-RC in 2021


@emjonaitis how many/what types of predictors do you have? If you really want to use zoib, that's just a really intense inference procedure.

Options: do "standard" beta regression, which is maybe a little more efficiently implemented with maximum likelihood (rather than Bayesian) estimation? It commonly involves shrinking 0 and 1 a little bit towards 0.5 as a preprocessing step, so it'll be a bit less exact, but it may not matter at all with this much data.
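The post does not say which formula it has in mind for that shrinking step, so the following is an assumption: one commonly used version is y' = (y(n - 1) + 0.5) / n, where n is the sample size. A minimal Python sketch:

```python
def squeeze(y, n):
    """Pull a proportion in [0, 1] into the open interval (0, 1).

    Assumed formula: (y * (n - 1) + 0.5) / n, with n the sample size.
    """
    if not 0.0 <= y <= 1.0:
        raise ValueError("y must be a proportion in [0, 1]")
    return (y * (n - 1) + 0.5) / n


ys = [0.0, 0.25, 1.0]
n = len(ys)
squeezed = [squeeze(y, n) for y in ys]  # all strictly between 0 and 1
```

With n in the millions the shift is tiny: for n = 1,000,000 an exact 0 moves to 5e-7, and 0.5 is left unchanged.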

Another option:

@emjonaitis procedures like the dbreg package (https://grantmcdermott.com/dbreg/) basically do the following: summarize the rows into buckets of profiles, and then perform weighted regression (weighted by how many people fall in each profile)

This might be possible for you? Even though there are many rows there might be an order of magnitude fewer unique rows. You would need to bin your continuous predictors, so this is also approximate

And not sure if this will even work with the nested structure
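To make the bucket idea concrete, here is a small pure-Python sketch. It is not dbreg's code or API, just the underlying trick as I understand it: collapse rows into unique predictor profiles, keep the count and mean outcome per profile, and fit a weighted least-squares line on the profiles. For an ordinary linear model this gives the same coefficients as fitting on all rows. The data and the single binned predictor `x` are invented for illustration, and `wls_line` is a hypothetical helper.

```python
from collections import defaultdict


def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b * x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b


# Invented data: a binned predictor x with many repeated values.
rows = [
    (x, 0.1 * x + 0.01 * ((7 * i) % 5))
    for i, x in enumerate([0, 1, 2, 3] * 250)
]

# Collapse to one record per profile: row count and outcome sum.
cells = defaultdict(lambda: [0, 0.0])
for x, y in rows:
    cells[x][0] += 1
    cells[x][1] += y

profiles = sorted(cells)
counts = [cells[p][0] for p in profiles]
means = [cells[p][1] / cells[p][0] for p in profiles]

# Fit once on every row (unit weights) and once on the compressed profiles.
full_fit = wls_line([r[0] for r in rows], [r[1] for r in rows], [1] * len(rows))
compressed_fit = wls_line(profiles, means, counts)
```

Here 1000 rows collapse to 4 profiles and the two fits agree up to floating-point error. The sketch ignores clustering entirely, so it says nothing about the nested-structure question.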


@erikjan I gravitated toward zoib because of its ability to handle clustered data. I don't believe that betareg has that capability. The fixed portion of the model is simple (the effects of interest are basically a group effect and a group x time interaction) but I don't want to disregard the longitudinal aspect of the data by e.g. selecting one random obs per cluster, because the trajectories are of interest to the investigators.
@emjonaitis oh yeah, you're right, forgot that this is not in betareg
@emjonaitis this is a stretch, but do you think brms / Stan with a Laplace approximation will be any faster?
@erikjan I haven't tried brms yet - thanks for the reminder.

@emjonaitis

For random sampling, nothing beats a properly configured index. The simplest form I've used is with CSV (I know, not sophisticated): a fixed-width format where every record is padded to the same width in bytes, so you can pick a random row number and seek straight to that row on disk without loading any of the rest of the file.

Proper databases will have better indexing systems of course.

This was always in Python, not R, though.
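A minimal Python sketch of the fixed-width trick, assuming every record is padded to the same number of bytes. The 16-byte width, the file name, and the `read_record` helper are all illustrative choices, not anything from the posts above.

```python
import os
import random
import tempfile

RECORD_WIDTH = 16  # bytes per record, including the trailing newline

# Write a small demo file; every line is padded to the same byte width.
path = os.path.join(tempfile.mkdtemp(), "fixed_width.csv")
with open(path, "wb") as f:
    for i in range(1000):
        line = f"{i},{i * i}".ljust(RECORD_WIDTH - 1) + "\n"
        f.write(line.encode("ascii"))


def read_record(path, index, width=RECORD_WIDTH):
    """Seek straight to record `index` and read only that record."""
    with open(path, "rb") as f:
        f.seek(index * width)
        return f.read(width).decode("ascii").rstrip().split(",")


# The record count follows from the file size, so no scan is needed.
n_records = os.path.getsize(path) // RECORD_WIDTH
random.seed(0)
picks = random.sample(range(n_records), k=5)
sample = [read_record(path, i) for i in picks]
```

Because the byte offset of row `i` is just `i * RECORD_WIDTH`, each lookup touches only the bytes of that one record, regardless of file size.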