RE: https://fosstodon.org/@R_devs_news/115966889498732323
now that is very sensible
Doing numbers for living since 1998 -- currently those numbers are being applied to survey statistics. In the previous careers, it was solid state physics, process quality control, income inequality, spatial environmental statistics, and latent variable modeling.
The truth is in the code. #rstats #stata #python #datascience #ggplot #surveymethodology #econtwitter
@statstas or @skolenik at some other places.
| Personal website | https://staskolenikov.net/ |
| Google Scholar | https://scholar.google.com/citations?user=TuJeDtcAAAAJ&hl=en |
Thanks @kirill
OK more pointedly -- I see that duckplyr code boldly uses methods without namespace prefixing -- `count()` here (https://github.com/tidyverse/duckplyr/blob/fa9b12e72f234524042542039499d361c6a32b14/R/count.R#L92) and `select()` there (https://github.com/tidyverse/duckplyr/blob/fa9b12e72f234524042542039499d361c6a32b14/R/select.R#L39)... so it falls onto the S3 system to figure it out. The vignette (https://duckplyr.tidyverse.org/articles/duckdb.html) however shoves it down the throat with `conflicted::conflict_prefer("filter", "dplyr")`, and I don't think conflicted should be used in (any) package code; it belongs only in analytical code.
2. #' @importFrom duckplyr select filter mutate arrange count summarize
to take these functions and expect that duckplyr will figure out how to fall back onto dplyr when needed.
4/4
P.S. @kirill hope you can shed some light
If not... I can think of two relaxations of the package::function() style rule.
1. Ignore it entirely and just write dplyr / duckplyr verbs as is. This is a ticking bomb as filter or select that is not namespaced could just as well go back to stats for a time series function and MASS for God only knows what (I hope Ripley and Venables and Bates forgive me for such a reference... but that function is not really documented in its entry in MASS).
3/4
However, my understanding of the duckplyr approach to life and universe is that it overwrites the dplyr verbs. So if I explicitly declare dplyr:: namespacing in my package functions, I am denying duckplyr the opportunity to take over and provide that 5-10x speedup I am hoping to see. Should I expect that duckplyr::dplyr_verb() will work properly in this context?
2/maybe 4
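One way to probe this empirically is sketched below. It assumes that duckplyr registers its verbs as S3 methods on dplyr's generics for the class `duckplyr_df` (which the un-prefixed `count()` and `select()` calls in its source suggest), and that `duckplyr::duckdb_tibble()` creates such a frame. `add_doubled()` is a hypothetical package function, not anything from duckplyr. If the assumption holds, a `dplyr::`-prefixed verb called on a duckplyr frame should still reach duckplyr's method through ordinary S3 dispatch.

```r
# Hypothetical package function: the verb is deliberately namespaced to dplyr.
add_doubled <- function(df, x) {
  df |> dplyr::mutate(xx = {{ x }} * 2)
}

# Assumption: duckdb_tibble() returns a frame with class "duckplyr_df".
df <- duckplyr::duckdb_tibble(a = 1:3)
out <- add_doubled(df, a)

# Is a duckplyr method registered on dplyr's mutate() generic?
# getS3method() returns NULL (with optional = TRUE) when nothing is registered.
m <- utils::getS3method("mutate", "duckplyr_df",
                        optional = TRUE, envir = asNamespace("dplyr"))
has_duckplyr_method <- is.function(m)

# Does the result stay a duckplyr frame after the namespaced call?
still_duckplyr <- inherits(out, "duckplyr_df")

# Materialize to inspect the values.
res <- dplyr::collect(out)
```

If `has_duckplyr_method` comes back `TRUE`, the `dplyr::` prefix only selects the generic, and the class of the data decides which method runs. Whether a given expression is then translated to DuckDB or falls back to dplyr is a separate question that this sketch does not answer.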
Shouting to the void: How to properly namespace #duckdb / #duckplyr in my #rstats packages?
One of @hadleywickham's core style recommendations for package development is that every external function needs to be explicitly namespaced:
function_in_my_package <- function(df, x, ...) {
  df |> dplyr::mutate(xx = stringr::str_do_something(x))
  # implicit return
}
1/maybe 4
From https://duckplyr.tidyverse.org/articles/duckdb.html
3. use dd$fun() for functions internal to duckdb and SQL (https://cynkra.github.io/dd/reference/index.html) -- compute string distances on the server with dd$damerau_levenshtein() and dd$jaro_winkler_similarity()
4. Distinguish between "lavish" (materialize right away), "stingy" (never materialize) and "thrifty" (materialize with <1M cells) flavors of duckplyr frames (reset with read_parquet_duckdb(..., prudence = c(cells = 10000, rows = 1000)))
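
A minimal sketch of the flavors in item 4, using in-memory frames rather than a Parquet file. It assumes that `duckplyr::duckdb_tibble()` accepts a `.prudence` argument with the same values, and that a "stingy" frame raises an error on implicit materialization until `collect()` is called; both are my reading of the duckplyr documentation rather than something stated in the thread.

```r
# Two frames with the same data but different materialization policies.
lavish <- duckplyr::duckdb_tibble(x = 1:3, .prudence = "lavish")
stingy <- duckplyr::duckdb_tibble(x = 1:3, .prudence = "stingy")

# A lavish frame materializes on demand, so nrow() just works.
n_lavish <- nrow(lavish)

# A stingy frame is expected to refuse implicit materialization;
# try() captures the error instead of stopping the script.
stingy_attempt <- try(nrow(stingy), silent = TRUE)

# Explicit materialization is always allowed.
n_stingy <- stingy |> dplyr::collect() |> nrow()
```

For package code, a stingy frame is a handy test fixture: any step that silently pulls data into R memory surfaces as an error instead of a slowdown.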