I developed an #RStats package to get data from a remote service. Functions allow caching results, so that they will just return the saved files if the request is the same. I tend to be conservative and set the cache option to FALSE by default, but now I'm wondering if it would be more user friendly (and friendlier to the remote API) to set the default to TRUE.

What say the R community?

Use cache by default: 75%
Don't use cache by default: 25%

Poll ended.

Based on the results and discussion, I'm turning cache on by default. The deciding factor was that the data itself doesn't really change; barring an exceptional update to the data, a given request returns the same data every time, so there's no big risk of getting stale data.
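As a sketch of what that default could look like (all names here are hypothetical, not the package's actual API): the request is encoded into a file name, a saved file is returned when it exists, and a message makes cache hits visible to the user.

```r
# Hypothetical cache-by-default wrapper; function and package names are
# illustrative only, not the real API.
fetch_data <- function(query, cache = TRUE,
                       cache_dir = tools::R_user_dir("mypkg", "cache")) {
  # Derive a file name from the request so identical requests share a file.
  key <- utils::URLencode(query, reserved = TRUE)
  cache_file <- file.path(cache_dir, paste0(key, ".rds"))

  if (cache && file.exists(cache_file)) {
    message("Values retrieved from cache.")
    return(readRDS(cache_file))
  }

  data <- request_from_api(query)  # placeholder for the real remote call

  if (cache) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    saveRDS(data, cache_file)
  }
  data
}
```

A real package would likely hash the full request (e.g. with {digest}) instead of URL-encoding it, but the shape of the logic is the same.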

#RStats

@eliocamp When I talked the maintainer of {ranger} into setting the default of num.threads to 2 rather than "all available cores", my argument was that the default should prevent users from "causing damage", which in that context meant accidentally over-parallelizing and overloading a shared workstation. His argument was that people don't read the docs and then complain that ranger is slow when the default is lower.

I'd still argue in favor of "do less and make people read the docs" 😬

@jemsu Is "doing less" caching by default or not caching by default in this context? 😅
@eliocamp Cache once and make fewer API calls afterwards 😅
@eliocamp I'd cache by default but convey that to the user ("Values retrieved from cache") in the hope of saving me some support questions ("Why am I getting the same results?").
@eliocamp Surface the choice to the user, make them choose, but provide a default.
@hye yes, the question is about the default.
@eliocamp @hye expiring cache after a certain time? Default cache.
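A time-based expiry like that can be a small helper. A sketch (the 5-day default is arbitrary; the helper name is hypothetical): a cached file counts as fresh only if it exists and is younger than the cutoff.

```r
# Sketch: treat cached files older than `max_age_days` as stale,
# so the caller knows to re-download instead of reading the cache.
cache_is_fresh <- function(path, max_age_days = 5) {
  if (!file.exists(path)) {
    return(FALSE)
  }
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path),
                                  units = "days"))
  age_days < max_age_days
}
```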
@eliocamp It's possible that the HTTP headers the service returns contain information on when the resources expire (I suspect these will tend to be relatively short). Some services also support HEAD requests, which can be used to check whether a resource has changed without transferring all of the data: the server sends only the headers, and the ETag or Last-Modified headers indicate whether the resource changed.
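A minimal sketch of that ETag check using {httr2} (assumed installed; the function name is hypothetical). A HEAD request transfers only the headers, so it's cheap to ask whether the cached copy is still current before re-downloading.

```r
library(httr2)

resource_changed <- function(url, cached_etag) {
  resp <- request(url) |>
    req_method("HEAD") |>
    req_perform()
  etag <- resp_header(resp, "ETag")
  # If the server sends no ETag, assume the resource changed.
  is.null(etag) || !identical(etag, cached_etag)
}
```

The same pattern works with Last-Modified, comparing it against the timestamp stored alongside the cached file.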
@dodecadron Not in this case. But the type of data is very stable, so it shouldn't really change over time.
@eliocamp IME with this type of package, the default tends to be `cache=FALSE`. Some packages, where the amount of data retrieved is large or slow, e.g. `tigris`, will produce a message at load to let the user know that they can turn on caching.
@eliocamp I use {httr2}'s caching by default in session. Between sessions, I expect the user to save and reload the data if that's necessary. https://codeberg.org/ropensci/read.abares/src/branch/main/R/retry_download.R
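For reference, {httr2}'s built-in caching is a one-liner: `req_cache()` stores responses on disk and respects the server's caching headers, so a package gets in-session (or longer-lived) caching without hand-rolling it. A minimal sketch with a placeholder URL:

```r
library(httr2)

# Responses are cached under `path`; `max_age` caps (in seconds) how long
# entries are reused even if the server's headers would allow longer.
resp <- request("https://example.org/data") |>
  req_cache(path = tempdir(), max_age = 3600) |>
  req_perform()
```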
@adamhsparks Do you expect users to use the function to download the data to some folder and then retrieve it via file paths in a different part of the code?
@eliocamp it reformats the data most of the time. I'd expect you'd use {targets} if you really needed some sort of cache beyond what I offer.
@eliocamp this was also part of the discussion during the review; the reviewers and I agreed on dropping it due to complexity and, like I said, {targets} exists.

@eliocamp I have an internal package that fetches meteorological data from a server. But since the spatial resolution is coarse (0.5 degrees) and I don't need up-to-the-minute data, we cache by default. Requests that are close by (in space) are served from the cache, and we set the cache timeout to 5 days. So I guess it depends a lot on your data.

But as others mentioned, alert the user that a cache is being used.