I'm analyzing Medicare data -- my first real experience with a large dataset, where the number of observations of interest to me is in the millions. We have repeated measures/clusters to worry about, each ranging from 2 to 10 observations, give or take.

I'm struggling with performance issues in pretty much every approach I take to this dataset. One outcome of interest is a proportion. zoib is painfully slow, even when I take a (stratified) random sample of 2% of rows -- after an hour it's only 4% done fitting my null model. Boundary values (exact 0s and 1s) are common in the data, ruling out "transform and just do lmer."

What general tools are available for modeling bigger datasets in R? Because of data privacy agreements I'm required to do all of the computing on-prem, so unfortunately I don't know that I can take advantage of high throughput computing on other servers, if it were even workable in this case.

#rstats #lme4 #zoib

Erik-Jan 19h ago

@emjonaitis how many/what types of predictors do you have? If you really want to zoib that's just a really intense inference procedure.

Options: do "standard" beta regression, which is maybe implemented a little more efficiently with maximum likelihood (ML) estimation rather than Bayesian estimation? It commonly shrinks 0 and 1 a little bit towards 0.5 as a preprocessing step, so it'll be a bit less exact, but that may not matter at all with this much data.
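For illustration, here is a minimal sketch of that preprocessing step. It assumes the shrinkage meant here is the commonly used transformation (y * (n - 1) + 0.5) / n from Smithson and Verkuilen (2006), where n is the number of observations; the thread does not say which variant is intended. The helper name `squeeze_unit` and the toy data are hypothetical.

```r
# Hypothetical helper (not from the thread): pull exact 0s and 1s slightly
# toward 0.5 so every value lies strictly inside (0, 1).
# Assumes the (y * (n - 1) + 0.5) / n transformation; n defaults to the sample size.
squeeze_unit <- function(y, n = length(y)) {
  stopifnot(all(y >= 0 & y <= 1))
  (y * (n - 1) + 0.5) / n
}

# Toy data for illustration only
y <- c(0, 0.25, 0.5, 1)
y_adj <- squeeze_unit(y)
print(y_adj)
```

With millions of rows, n is large, so each value moves only by a tiny amount, which is why the loss of exactness may not matter much at this scale.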

Another option:

Erin Jonaitis 18h ago

@erikjan I gravitated toward zoib because of its ability to handle clustered data. I don't believe that betareg has that capability. The fixed portion of the model is simple (the effects of interest are basically a group effect and a group x time interaction) but I don't want to disregard the longitudinal aspect of the data by e.g. selecting one random obs per cluster, because the trajectories are of interest to the investigators.

Erik-Jan

@emjonaitis oh yeah, you're right, forgot that this is not in betareg