I'm analyzing Medicare data -- my first real experience with a large dataset, where the number of observations of interest to me is in the millions. We have repeated measures/clusters to worry about, each ranging from 2 to 10 observations, give or take.

I'm struggling with performance in pretty much every approach I take to this dataset. One outcome of interest is a proportion, and exact 0s and 1s are common, which rules out "transform and just do lmer." zoib is painfully slow: even on a stratified random sample of 2% of rows, after an hour it's only 4% of the way through fitting my null model.

What general tools are available for modeling bigger datasets in R? Because of data-privacy agreements I have to do all of the computing on-prem, so I probably can't take advantage of high-throughput computing on other servers, even if that were workable for this problem.

#rstats #lme4 #zoib

@emjonaitis How many predictors do you have, and of what types? If you really want to use zoib, that's just a really computationally intense inference procedure.

One option: "standard" (non-inflated) beta regression, which is maybe a little more efficiently implemented with ML (rather than Bayesian) estimation. The usual preprocessing step shrinks exact 0s and 1s a little toward 0.5 so the response lies strictly inside (0, 1); that's a bit less exact, but with this much data it may not matter at all.
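Here's a minimal sketch of that route, assuming the betareg package for ML beta regression; the data frame `dat`, outcome `prop`, and predictors `age` and `region` are all made-up names:

```r
library(betareg)

## Shrink exact 0s and 1s slightly toward 0.5 so every value lies
## strictly inside (0, 1), as in Smithson & Verkuilen (2006)
n <- nrow(dat)
dat$prop_adj <- (dat$prop * (n - 1) + 0.5) / n

## Plain (non-inflated) beta regression, fit by maximum likelihood
fit <- betareg(prop_adj ~ age + region, data = dat)
summary(fit)
```

Note this sketch ignores the repeated-measures/cluster structure entirely; you'd still need to handle that separately.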

Another option:

@emjonaitis procedures like the dbreg package (https://grantmcdermott.com/dbreg/) basically do the following: collapse the rows into unique predictor profiles, then run a weighted regression (weighted by how many people fall in each profile)

This might be possible for you: even though there are millions of rows, there may be an order of magnitude fewer unique profiles. You would need to bin your continuous predictors, so this is also approximate.

I'm not sure whether this will even work with the nested/clustered structure, though.
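A rough hand-rolled sketch of the compression idea (this is not dbreg's actual interface, just the underlying trick, and all column names are hypothetical): collapse the data to one row per unique binned-predictor profile, keep the per-profile outcome mean and count, and run a count-weighted fit.

```r
library(dplyr)

## Bin the continuous predictor, then collapse to one row per unique
## profile, keeping the profile mean of the outcome and the profile size
compressed <- dat |>
  mutate(age_bin = cut(age, breaks = 10)) |>
  group_by(age_bin, region) |>
  summarise(prop_mean = mean(prop), n_obs = n(), .groups = "drop")

## Weighted regression: each profile counts as many times as it occurred
fit <- lm(prop_mean ~ age_bin + region, data = compressed, weights = n_obs)
summary(fit)
```

For plain OLS the point estimates match the full-data fit; standard errors and anything that depends on within-profile variation need more care, and the clustering still isn't handled here.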
