The most annoying things in data analysis:

- R package/lib hell
- basic Python API instability
- conda solving environments for hours
- Lustre (everything about it)
- each R package having its own class for genetic/tree/geospatial data storage

I hit a freaking bingo today: all of them at once.

Are there any I missed?

#rstats #conda #lustre #dllHell #PackageHell #DependencyHell
#geospatial #bioinformatics #hpc

@tyx What exactly is R package/lib hell? Do you have some examples?
@gaborcsardi
The only thing that comes to my mind is dependencies. Unmaintained packages are a high risk for users and for maintainers of dependent packages. That's why I promote #rdatatable.
@tyx
@devSJR @gaborcsardi
Unless a reviewer asks you to run an analysis whose package was published in 2016 and hasn't been maintained since 2019. So you either make it work (i.e., strip all dependencies on dead packages and move to the new data formats, see point 5 in the OP list), or, preferably, set up a 2019-era environment in Docker and move data in and out of it via some stupid file-poll or netcat Rube Goldberg contraption (a sketch below).
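
For the record, a minimal sketch of that file-poll glue, with `legacy_analysis()` standing in for whatever the dead package exports (the paths and the function name are hypothetical):

```r
# Runs inside the frozen 2019 container: poll a shared volume for
# input, run the unmaintained package on it, write the result back.
while (TRUE) {
  if (file.exists("/shared/input.rds")) {
    dat <- readRDS("/shared/input.rds")
    res <- legacy_analysis(dat)        # hypothetical: the dead package's function
    saveRDS(res, "/shared/output.rds")
    file.remove("/shared/input.rds")
  }
  Sys.sleep(5)                         # poll every 5 seconds
}
```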
@tyx
It also happens with current packages. For example, changes on CRAN sometimes lead to package removals that pull other packages down with them. I was affected by this; the maintainer just did not react fast enough (it took a bunch of weeks).

@gaborcsardi
Yeah, sure: a package temporarily removed from CRAN (as happened today; or one that just never was there, as happened a week ago), depending on a C lib which recently had API changes (that's why it was removed, yeah). It requires a dozen other packages:
- some of them need to be built from source (go get some system libs too!)
- some are binaries, but require a more recent version of said lib
- some are of the wrong version (API changes, everyone!)

Also, some of them are needed by other packages in the same script. No, a different version from the one that first package is compatible with.
No problem, just some gcc and Makefile magic, or fixing the R package code; all together it takes hours to figure out.

On an HPC the PITA is squared by the module system, which ships some of the packages and libs (no, you cannot uninstall them), and by sluggish Lustre (in case you decide to set a separate lib path and download all the packages anew).
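
For reference, the separate lib path trick is roughly this (a sketch; the scratch location is hypothetical and site-specific):

```r
# Put a personal package library ahead of the module-provided ones,
# ideally on node-local scratch rather than Lustre.
my_lib <- file.path("/scratch", Sys.getenv("USER"), "R-library")
dir.create(my_lib, recursive = TRUE, showWarnings = FALSE)
.libPaths(c(my_lib, .libPaths()))   # personal library is searched first
install.packages("data.table", lib = my_lib)
```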

All of this is manageable, sure, but it eats like half of my day on stupid technical crap every time I try to reproduce someone's pipeline.

One of these days I need to grab duct tape and a baseball bat and convince our admins to set up Docker.

@tyx Yeah, this is all very painful, but it is also not specific to R, or is it? I imagine most software has the same issue if you use external packages.

Btw. we do have some tools that let you "time-travel": e.g. https://packagemanager.posit.co/client/#/ gives you Linux binaries going back to 2017, or you can use Windows binaries even from CRAN.
Another one is https://github.com/r-hub/evercran, which comes with daily CRAN snapshots and works on older Debian containers, so you also get older versions of the system deps.
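
For example, pinning a session to a dated snapshot looks roughly like this (a sketch; the date and distro are arbitrary, and the URL scheme is the one Posit Package Manager documents):

```r
# Resolve packages as CRAN looked on a given date; the __linux__
# path asks Posit Package Manager to serve distro binaries.
options(repos = c(
  CRAN = "https://packagemanager.posit.co/cran/__linux__/focal/2020-03-01"
))
install.packages("dplyr")  # installs the version current on that date
```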


@gaborcsardi
>it is also not specific to R, or is it?
Well, if I were doing sysadmin stuff, I'd be ranting about blob vendor drivers and glibc versions. But I'm doing bioinformatics and data science, so I'm grumbling about the relevant tools. And yes, having apt, conda, and install.packages each with their own versions of packages is a mess with Python as well; I'll add it to the list next time.

Thanks for the tools! I use Docker and old distro repos, but evercran looks nice.

@tyx unstructured Excel files with multiple headers and empty rows in the middle.

And funky characters.
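
The usual cleanup dance, for the record (a sketch; the file name and the number of junk rows are hypothetical):

```r
library(readxl)
library(dplyr)

# Skip the decorative multi-row header, then drop the blank spacer rows.
raw   <- read_excel("collaborator_data.xlsx", skip = 3)
clean <- raw |> filter(!if_all(everything(), is.na))

# And normalize the funky characters while we're at it.
clean <- clean |>
  mutate(across(where(is.character), \(x) iconv(x, to = "UTF-8", sub = "byte")))
```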

@tyx this should take care of the conda solver: https://github.com/conda/conda-libmamba-solver
@paulouro
Thanks! I use Mamba where possible already. Saves tons of time.
@tyx you missed the data itself… Life would be simpler with quality control.