The most annoying things in data analysis:

- R package/lib hell
- basic Python API instability
- conda solving environments for hours
- Lustre (everything about it)
- each R package having its own class for genetic/tree/geospatial data storage

I hit a freaking bingo today: all of them at once.

Are there any I missed?

#rstats #conda #lustre #dllHell #PackageHell #DependencyHell
#geospatial #bioinformatics #hpc

@tyx What exactly is R package/lib hell? Do you have some examples?
@gaborcsardi
The only thing that comes to my mind is dependencies. Unmaintained packages are a high risk for users and for maintainers of dependent packages. That's why I promote #rdatatable.
@tyx
@devSJR @gaborcsardi
Unless a reviewer asks you to run an analysis whose package was published in 2016 and hasn't been maintained since 2019. So you either make it work (i.e., strip all dependencies on dead packages and move to the new data formats, see point 5 in the OP list), or, preferably, set up a 2019-era environment in Docker and move data in and out of it via some stupid file-poll or netcat Rube Goldberg contraption (a sketch below).
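
For the record, a minimal sketch of that file-poll glue, with `legacy_analysis()` standing in for whatever the dead package exports (the paths and the function name are hypothetical):

```r
# Runs inside the frozen 2019 container: poll a shared volume for
# input, run the unmaintained package on it, write the result back.
while (TRUE) {
  if (file.exists("/shared/input.rds")) {
    dat <- readRDS("/shared/input.rds")
    res <- legacy_analysis(dat)        # hypothetical: the dead package's function
    saveRDS(res, "/shared/output.rds")
    file.remove("/shared/input.rds")
  }
  Sys.sleep(5)                         # poll every 5 seconds
}
```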
@tyx
It also happens with current packages. For example, changes on CRAN sometimes lead to package removals that pull other packages down with them. I was affected by this; the maintainer just did not react fast enough (it took a bunch of weeks).

@gaborcsardi
Yeah, sure: a package temporarily removed from CRAN (as happened today; or one that just never was there, as happened a week ago), depending on a C lib which recently had API changes (that's why it was removed, yeah). It requires a dozen other packages:
- some of them need to be built from source (go get some system libs too!)
- some are binaries, but require a more recent version of said lib
- some are of the wrong version (API changes, everyone!)

Also, some of them are needed by other packages in the same script. No, a different version from the one that first package is compatible with.
No problem, just some gcc and Makefile magic, or fixing the R package code; all together it takes hours to figure out.

On an HPC the PITA is squared by the module system, which ships some of the packages and libs (no, you cannot uninstall them), and by sluggish Lustre (in case you decide to set a separate lib path and download all the packages anew).
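
For reference, the separate lib path trick is roughly this (a sketch; the scratch location is hypothetical and site-specific):

```r
# Put a personal package library ahead of the module-provided ones,
# ideally on node-local scratch rather than Lustre.
my_lib <- file.path("/scratch", Sys.getenv("USER"), "R-library")
dir.create(my_lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(my_lib, .libPaths()))   # personal library is searched first
install.packages("data.table", lib = my_lib)
```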

All of this is manageable, sure, but it eats like half of my day on stupid technical crap every time I try to reproduce someone's pipeline.

One of these days I need to grab duct tape and a baseball bat and convince our admins to set up Docker.

@tyx Yeah, this is all very painful, but it is also not specific to R, or is it? I imagine most software has the same issue if you use external packages.

Btw. we do have some tools that let you "time-travel": e.g. https://packagemanager.posit.co/client/#/ gives you Linux binaries going back to 2017, or you can use Windows binaries even from CRAN.
Another one is https://github.com/r-hub/evercran, which comes with daily CRAN snapshots and works on older Debian containers, so you also get older versions of the system deps.
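
For example, pinning a session to a dated snapshot looks roughly like this (a sketch; the date and distro are arbitrary, and the URL scheme is the one Posit Package Manager documents):

```r
# Resolve packages as CRAN looked on a given date; the __linux__
# path asks Posit Package Manager to serve distro binaries.
options(repos = c(
  CRAN = "https://packagemanager.posit.co/cran/__linux__/focal/2020-03-01"
))
install.packages("dplyr")  # installs the version current on that date
```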


@gaborcsardi
>it is also not specific to R, or is it?
Well, if I were doing sysadmin stuff, I'd be ranting about blob vendor drivers and glibc versions. But I'm doing bioinformatics and data science, so I'm grumbling about the relevant tools. And yes, having apt, conda, and install.packages each with their own versions of packages is a mess with Python as well; I'll add it to the list next time.

Thanks for the tools! I use Docker and old distro repos, but evercran looks nice.

@tyx unstructured Excel files with multiple headers and empty rows in the middle.

And funky characters.
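
The usual cleanup dance, for the record (a sketch; the file name and the number of junk rows are hypothetical):

```r
library(readxl)
library(dplyr)

# Skip the decorative multi-row header, then drop the blank spacer rows.
raw   <- read_excel("collaborator_data.xlsx", skip = 3)
clean <- raw |> filter(!if_all(everything(), is.na))

# And normalize the funky characters while we're at it.
clean <- clean |>
  mutate(across(where(is.character), \(x) iconv(x, to = "UTF-8", sub = "byte")))
```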

@tyx this should take care of the conda solver: https://github.com/conda/conda-libmamba-solver
@paulouro
Thanks! I use Mamba where possible already. Saves tons of time.
@tyx you missed the data itself… Life would be simpler with quality control.