RE: https://fosstodon.org/@R_devs_news/115966889498732323
now that is very sensible
Doing numbers for a living since 1998 -- currently those numbers are being applied to survey statistics. In previous careers, it was solid state physics, process quality control, income inequality, spatial environmental statistics, and latent variable modeling.
The truth is in the code. #rstats #stata #python #datascience #ggplot #surveymethodology #econtwitter
@statstas or @skolenik at some other places.
| Personal website | https://staskolenikov.net/ |
| Google Scholar | https://scholar.google.com/citations?user=TuJeDtcAAAAJ&hl=en |
2. #' @importFrom duckplyr select filter mutate arrange count summarize
to import these verbs from duckplyr and trust that duckplyr will figure out how to fall back onto dplyr when needed.
4/4
P.S. @kirill hope you can shed some light
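If option 2 works the way I hope, a package function could use bare verbs, with the roxygen tag generating the NAMESPACE imports. A sketch only -- the function and column names here are made up for illustration, and whether duckplyr's re-exports fall back cleanly is exactly the open question:

```r
#' Mean of non-missing x (illustrative only)
#'
#' @importFrom duckplyr filter summarize
my_mean_x <- function(df) {
  df |>
    filter(!is.na(x)) |>          # bare verb, resolved via NAMESPACE import
    summarize(mean_x = mean(x))
}
```

The `@importFrom` tag is package metadata: outside a built package it has no effect, which is why the verbs must be imported rather than relied on via search-path attachment.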
If not... I can think of two relaxations of the package::function() style rule.
1. Ignore it entirely and just write dplyr / duckplyr verbs as is. This is a ticking bomb: an un-namespaced filter() or select() could just as well resolve to stats::filter(), a time-series function, or to MASS::select(), which does God only knows what (I hope Ripley, Venables, and Bates forgive me for such a reference... that function is barely documented in its MASS entry).
3/4
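To make the hazard in option 1 concrete: in a session (or a package namespace) where dplyr is not imported, a bare filter() resolves to stats::filter(), which silently does something entirely different:

```r
# What an un-namespaced filter() silently becomes without dplyr imported:
# stats::filter() applies a linear (convolution) filter to a numeric series.
x <- stats::filter(1:10, rep(1, 3))  # centered moving sum, not row filtering
class(x)  # a "ts" object, with NA padding at the ends
x[2]      # 1 + 2 + 3 = 6
```

No error, no warning -- just a time series where you expected subsetted rows, which is why the namespacing rule exists in the first place.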
However, my understanding of duckplyr's approach to life, the universe, and everything is that it overrides the dplyr verbs. So if I explicitly use dplyr:: namespacing in my package functions, I am denying duckplyr the opportunity to take over and provide the 5-10x speedup I am hoping for. Should I expect duckplyr::dplyr_verb() to work properly in this context?
2/maybe 4
Shouting to the void: How to properly namespace #duckdb / #duckplyr in my #rstats packages?
One of @hadleywickham's core style recommendations for package development is that every external function must be explicitly namespaced:
function_in_my_package <- function(df, x, ...) {
df |> dplyr::mutate(xx = stringr::str_do_something(x))
  # implicit return
}
1/maybe 4
From https://duckplyr.tidyverse.org/articles/duckdb.html
3. Use dd$fun() for functions internal to DuckDB and SQL (https://cynkra.github.io/dd/reference/index.html) -- e.g., compute string distances on the server with dd$damerau_levenshtein() and dd$jaro_winkler_similarity()
4. Distinguish between "lavish" (materialize right away), "stingy" (never materialize), and "thrifty" (materialize with <1M cells) flavors of duckplyr frames (reset with read_parquet_duckdb(..., prudence = c(cells = 10000, rows = 1000)))
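A minimal sketch of point 4, assuming the duckplyr >= 1.0 API (as_duckdb_tibble() and the prudence argument are my reading of the docs, so treat the exact names as assumptions):

```r
library(duckplyr)

# "stingy" frames never materialize implicitly; you collect() explicitly.
df  <- as_duckdb_tibble(mtcars, prudence = "stingy")
out <- dplyr::count(df, cyl)   # stays a lazy duckplyr frame
res <- dplyr::collect(out)     # explicit materialization into an R tibble
```

The same pipeline on a "lavish" frame would materialize eagerly, and "thrifty" splits the difference by materializing only small results.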
Going through some #duckdb learning through https://duckplyr.tidyverse.org/articles/large.html
1. Get a DuckDB SQL server (file-backed via a tempfile here):
path_duckdb <- tempfile(fileext = ".duckdb")
con <- DBI::dbConnect(duckdb::duckdb(path_duckdb))
DBI::dbWriteTable(con, "mtcars", mtcars)
2. Try explain(), compute(), collect() (the dplyr verbs dispatched via dbplyr), and compute_parquet(), without bringing the data into memory (the result may be larger than memory)
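The two steps above can be sketched end to end -- this uses an in-memory DuckDB connection rather than the file-backed one, and assumes dplyr, dbplyr, and duckdb are installed:

```r
library(dplyr)

con <- DBI::dbConnect(duckdb::duckdb())      # in-memory this time
DBI::dbWriteTable(con, "mtcars", mtcars)

q <- tbl(con, "mtcars") |>                   # lazy tbl, translated by dbplyr
  filter(cyl == 6) |>
  summarize(mean_mpg = mean(mpg, na.rm = TRUE))

explain(q)          # print DuckDB's query plan
tmp <- compute(q)   # materialize as a temp table on the DuckDB side
res <- collect(q)   # bring only the (small) result into R

DBI::dbDisconnect(con, shutdown = TRUE)
```

Only collect() moves data into the R session; explain() and compute() keep the work on the DuckDB side, which is the point when the data may be larger than memory.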
Virtual #useR2025 has officially started, but you can still register throughout the day!
If the scheduled times are no good for your time zone, you can always catch up with the talks at a better time of day. The registration confirmation contains info on accessing the videos. #RStats
The first video is up and the rest are queued up to be released throughout the day!
Going after #arrow #parquet #rstats #DBI beehive:
is there a mostly lazy way to make a local parquet copy of a remote SQL table? I want to avoid collect(), but maybe it is unavoidable. I could not find a reasonable example in the Cookbook. arrow::write_dataset() does not want to take a tbl_Microsoft_SQL_Server as input.
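One workaround I can sketch (not a vetted recipe): skip the tbl layer and stream the table in chunks with DBI, writing each chunk as its own parquet file, so only one chunk is ever in memory. A DuckDB connection stands in below for the real SQL Server one:

```r
library(DBI)

# Stand-in connection; in the real setting this would be the
# odbc/DBI connection to Microsoft SQL Server.
con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "remote_table", mtcars)

dir.create(dest <- tempfile("local_copy"))
res <- dbSendQuery(con, "SELECT * FROM remote_table")
i <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 10)   # one chunk in memory at a time
  arrow::write_parquet(chunk, file.path(dest, sprintf("part-%03d.parquet", i)))
  i <- i + 1
}
dbClearResult(res)
dbDisconnect(con, shutdown = TRUE)

ds <- arrow::open_dataset(dest)   # read the local copy back lazily
```

This never holds the full table in R, though it is batch streaming rather than truly lazy; the chunk size n trades memory for the number of parquet files.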