Stas Kolenikov

482 Followers
297 Following
748 Posts

Doing numbers for a living since 1998 -- currently those numbers are being applied to survey statistics. In previous careers, it was solid state physics, process quality control, income inequality, spatial environmental statistics, and latent variable modeling.

The truth is in the code. #rstats #stata #python #datascience #ggplot #surveymethodology #econtwitter

@statstas or @skolenik at some other places.

Personal website: https://staskolenikov.net/
Google Scholar: https://scholar.google.com/citations?user=TuJeDtcAAAAJ&hl=en

2. #' @importFrom duckplyr select filter mutate arrange count summarize

to import these functions and expect duckplyr to figure out how to fall back onto dplyr when needed.
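In practice that would look something like this (a sketch -- the wrapper name and columns are made up; whether the fallback actually triggers is exactly my question):

```r
#' Count complete cases by group, letting duckplyr dispatch the verbs
#' @importFrom duckplyr filter count
#' @export
count_complete <- function(df, grp) {
  # bare verbs: resolved to the duckplyr versions via the NAMESPACE import,
  # which (I hope) fall back to dplyr when duckdb cannot handle the query
  df |>
    filter(!is.na({{ grp }})) |>
    count({{ grp }})
}
```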

4/4

P.S. @kirill, hope you can shed some light on this.

If not... I can think of two relaxations of the package::function() style rule.

1. Ignore it entirely and just write dplyr / duckplyr verbs as is. This is a ticking bomb: a filter or select that is not namespaced could just as well dispatch to stats for a time series function, or to MASS for God only knows what (I hope Ripley and Venables and Bates forgive me for such a reference... but that function is not really documented in its entry in MASS).
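For the record, the clash is real -- a sketch of both culprits, run in a fresh session before dplyr is attached:

```r
# stats::filter() is a linear filter for time series; without dplyr attached,
# a bare filter() call on a data frame hits this one and fails confusingly
stats::filter(1:10, rep(1/3, 3))  # moving average of a numeric vector

# MASS::select() also exists (ridge-regression model selection), so
# library(MASS) after library(dplyr) silently masks dplyr::select()
```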

3/4

However, my understanding of duckplyr's approach to life, the universe and everything is that it overrides the dplyr verbs. So if I explicitly declare dplyr:: namespacing in my package functions, I am denying duckplyr the opportunity to take over and provide the 5-10x speedup I am hoping to see. Should I expect duckplyr::dplyr_verb() to work properly in this context?

2/maybe 4

Shouting to the void: How to properly namespace #duckdb / #duckplyr in my #rstats packages?

One of @hadleywickham's core style recommendations for package development is that every external function be explicitly namespaced:

function_in_my_package <- function(df, x, ...) {
  df |> dplyr::mutate(xx = stringr::str_do_something(x))
  # implicit return
}

1/maybe 4

From https://duckplyr.tidyverse.org/articles/duckdb.html

3. Use dd$fun() for functions internal to duckdb and SQL (https://cynkra.github.io/dd/reference/index.html) -- e.g., compute string distances on the server with dd$damerau_levenshtein() and dd$jaro_winkler_similarity()

4. Distinguish between "lavish" (materialize right away), "stingy" (never materialize) and "thrifty" (materialize with <1M cells) flavors of duckplyr frames (reset with read_parquet_duckdb(..., prudence = c(cells = 10000, rows = 1000)))
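Putting 3 and 4 together, a sketch (file name and columns are made up; dd is the helper object from the cynkra/dd package, attached separately):

```r
library(dd)  # exports the dd$... accessors for duckdb-internal functions

flights <- duckplyr::read_parquet_duckdb(
  "flights.parquet",                         # hypothetical file
  prudence = c(cells = 10000, rows = 1000)   # only small results materialize
)

flights |>
  dplyr::mutate(
    # string similarity computed inside duckdb, not in R
    sim = dd$jaro_winkler_similarity(origin, dest)
  )
```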

Interoperability with DuckDB and dbplyr

Going through some #duckdb learning through https://duckplyr.tidyverse.org/articles/large.html

1. Get a DuckDB SQL server in memory:

path_duckdb <- tempfile(fileext = ".duckdb")
con <- DBI::dbConnect(duckdb::duckdb(path_duckdb))
DBI::dbWriteTable(con, "mtcars", mtcars)

2. Try explain(), compute(), collect() (the dplyr verbs dispatched by dbplyr) and compute_parquet() without bringing the data into memory (the result may be larger than memory)
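The lazy pipeline on top of that connection looks like this, as far as I can tell (standard dbplyr dispatch; nothing duckplyr-specific yet):

```r
mt <- dplyr::tbl(con, "mtcars")   # lazy reference, no rows in R yet

q <- mt |>
  dplyr::group_by(cyl) |>
  dplyr::summarize(mean_mpg = mean(mpg, na.rm = TRUE))

dplyr::explain(q)      # show the generated SQL / query plan
res <- dplyr::compute(q)   # materialize as a temp table on the server side
dplyr::collect(res)    # only this step brings the result into R's memory
```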

Large data

Virtual #useR2025 has officially started, but you can still register throughout the day!

If the scheduled times are no good for your time zone, you can always catch up with them when it's a better time of the day for you. The registration confirmation contains info to access the videos. #RStats

The first video is up and the rest are queued up to be released throughout the day!

The idea of resetting the #rstats OS variable named "OS" is slightly horrifying but thanks Bing Chat anyway

Going after #arrow #parquet #rstats #DBI beehive:

Is there a mostly lazy way to make a local parquet copy of a remote SQL table? I want to avoid collect(), but maybe it is unavoidable. I could not find a reasonable example in the Cookbook. arrow::write_dataset() does not want to take a tbl_Microsoft_SQL_Server as input.
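Best I have come up with so far: stream through DBI in chunks and write parquet parts (a sketch, not tested against an actual SQL Server; the table name and chunk size are made up):

```r
res <- DBI::dbSendQuery(con, "SELECT * FROM big_remote_table")
dir.create("parquet_out", showWarnings = FALSE)
part <- 0
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 100000)   # only 100k rows in R at a time
  if (nrow(chunk) == 0) break
  part <- part + 1
  arrow::write_parquet(
    chunk,
    file.path("parquet_out", sprintf("part-%04d.parquet", part))
  )
}
DBI::dbClearResult(res)
# arrow::open_dataset("parquet_out") then reads the parts lazily
```

Not collect()-free, but at least it is collect() in installments.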