Stas Kolenikov

482 Followers
297 Following
748 Posts

Doing numbers for a living since 1998 -- currently those numbers are being applied to survey statistics. In previous careers, it was solid state physics, process quality control, income inequality, spatial environmental statistics, and latent variable modeling.

The truth is in the code. #rstats #stata #python #datascience #ggplot #surveymethodology #econtwitter

@statstas or @skolenik at some other places.

Personal website: https://staskolenikov.net/
Google Scholar: https://scholar.google.com/citations?user=TuJeDtcAAAAJ&hl=en

2. #' @importFrom duckplyr select filter mutate arrange count summarize

to import these functions and expect duckplyr to figure out how to fall back onto dplyr when needed.
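In practice that would look something like this (a sketch -- the wrapper name and columns are made up; whether the fallback actually triggers is exactly my question):

```r
#' Count complete cases by group, letting duckplyr dispatch the verbs
#' @importFrom duckplyr filter count
#' @export
count_complete <- function(df, grp) {
  # bare verbs: resolved to the duckplyr versions via the NAMESPACE import,
  # which (I hope) fall back to dplyr when duckdb cannot handle the query
  df |>
    filter(!is.na({{ grp }})) |>
    count({{ grp }})
}
```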

4/4

P.S. @kirill, hope you can shed some light on this.

If not... I can think of two relaxations of the package::function() style rule.

1. Ignore it entirely and just write dplyr / duckplyr verbs as is. This is a ticking bomb: a filter or select that is not namespaced could just as well dispatch to stats for a time series function, or to MASS for God only knows what (I hope Ripley and Venables and Bates forgive me for such a reference... but that function is not really documented in its entry in MASS).
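For the record, the clash is real -- a sketch of both culprits, run in a fresh session before dplyr is attached:

```r
# stats::filter() is a linear filter for time series; without dplyr attached,
# a bare filter() call on a data frame hits this one and fails confusingly
stats::filter(1:10, rep(1/3, 3))  # moving average of a numeric vector

# MASS::select() also exists (ridge-regression model selection), so
# library(MASS) after library(dplyr) silently masks dplyr::select()
```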

3/4

However, my understanding of duckplyr's approach to life, the universe and everything is that it overrides the dplyr verbs. So if I explicitly declare dplyr:: namespacing in my package functions, I am denying duckplyr the opportunity to take over and provide the 5-10x speedup I am hoping to see. Should I expect duckplyr::dplyr_verb() to work properly in this context?

2/maybe 4

Shouting to the void: How to properly namespace #duckdb / #duckplyr in my #rstats packages?

One of @hadleywickham's core style recommendations for package development is that every external function be explicitly namespaced:

function_in_my_package <- function(df, x, ...) {
  df |> dplyr::mutate(xx = stringr::str_do_something(x))
  # implicit return
}

1/maybe 4

From https://duckplyr.tidyverse.org/articles/duckdb.html

3. Use dd$fun() for functions internal to duckdb and SQL (https://cynkra.github.io/dd/reference/index.html) -- e.g., compute string distances on the server with dd$damerau_levenshtein() and dd$jaro_winkler_similarity()

4. Distinguish between "lavish" (materialize right away), "stingy" (never materialize) and "thrifty" (materialize with <1M cells) flavors of duckplyr frames (reset with read_parquet_duckdb(..., prudence = c(cells = 10000, rows = 1000)))
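Putting 3 and 4 together, a sketch (file name and columns are made up; dd is the helper object from the cynkra/dd package, attached separately):

```r
library(dd)  # exports the dd$... accessors for duckdb-internal functions

flights <- duckplyr::read_parquet_duckdb(
  "flights.parquet",                         # hypothetical file
  prudence = c(cells = 10000, rows = 1000)   # only small results materialize
)

flights |>
  dplyr::mutate(
    # string similarity computed inside duckdb, not in R
    sim = dd$jaro_winkler_similarity(origin, dest)
  )
```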

Interoperability with DuckDB and dbplyr

Going through some #duckdb learning through https://duckplyr.tidyverse.org/articles/large.html

1. Get a DuckDB SQL server in memory:

path_duckdb <- tempfile(fileext = ".duckdb")
con <- DBI::dbConnect(duckdb::duckdb(path_duckdb))
DBI::dbWriteTable(con, "mtcars", mtcars)

2. Try explain(), compute(), collect() (the dplyr verbs dispatched by dbplyr) and compute_parquet() without bringing the data into memory (the result may be larger than memory)
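The lazy pipeline on top of that connection looks like this, as far as I can tell (standard dbplyr dispatch; nothing duckplyr-specific yet):

```r
mt <- dplyr::tbl(con, "mtcars")   # lazy reference, no rows in R yet

q <- mt |>
  dplyr::group_by(cyl) |>
  dplyr::summarize(mean_mpg = mean(mpg, na.rm = TRUE))

dplyr::explain(q)      # show the generated SQL / query plan
res <- dplyr::compute(q)   # materialize as a temp table on the server side
dplyr::collect(res)    # only this step brings the result into R's memory
```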

Large data

Virtual #useR2025 has officially started, but you can still register throughout the day!

If the scheduled times are no good for your time zone, you can always catch up with them when it's a better time of the day for you. The registration confirmation contains info to access the videos. #RStats

The first video is up and the rest are queued up to be released throughout the day!

The idea of resetting the #rstats OS variable named "OS" is slightly horrifying but thanks Bing Chat anyway

Going after #arrow #parquet #rstats #DBI beehive:

Is there a mostly lazy way to make a local parquet copy of a remote SQL table? I want to avoid collect(), but maybe it is unavoidable. I could not find a reasonable example in the Cookbook. arrow::write_dataset() does not want to take a tbl_Microsoft_SQL_Server as input.
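Best I have come up with so far: stream through DBI in chunks and write parquet parts (a sketch, not tested against an actual SQL Server; the table name and chunk size are made up):

```r
res <- DBI::dbSendQuery(con, "SELECT * FROM big_remote_table")
dir.create("parquet_out", showWarnings = FALSE)
part <- 0
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 100000)   # only 100k rows in R at a time
  if (nrow(chunk) == 0) break
  part <- part + 1
  arrow::write_parquet(
    chunk,
    file.path("parquet_out", sprintf("part-%04d.parquet", part))
  )
}
DBI::dbClearResult(res)
# arrow::open_dataset("parquet_out") then reads the parts lazily
```

Not collect()-free, but at least it is collect() in installments.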