I have a very large .csv file containing a numerical matrix. I need to calculate the mean of many selections of values in each row (e.g. in each row, the mean of the values at indices 1, 3, 52, 123; then, in the same row, of the values at 2, 3, 12, 29, 67, etc...)

My file is HUGE, even in row length (a row has like, 8000+ items), so I need this to be fast. I know the indexes I need to average at the start of the computation, but don't have enough memory to load the whole file at once.
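The per-row computation described above can be done in a single pass, holding only one row in memory at a time. A minimal sketch in Python using only the standard library (no code appears in the thread; the function name and the index groups here are illustrative):

```python
import csv

# Example index groups, known before the computation starts.
# These are placeholders -- substitute the real selections.
index_groups = [
    [1, 3, 52, 123],
    [2, 3, 12, 29, 67],
]

def row_group_means(path, groups):
    """Stream a headerless numeric CSV row by row and yield, for each
    row, the mean over each group of column indices. Memory use is
    bounded by a single row, regardless of file size."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            values = [float(x) for x in row]
            yield [sum(values[i] for i in g) / len(g) for g in groups]
```

Since the index groups are fixed up front, they are parsed into Python lists once and reused for every row; only the row itself is re-read on each iteration.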

@MrHedmad Sounds like an application for stream processing (https://en.wikipedia.org/wiki/Stream_processing), so in this case reading and handling the file in chunks. In #haskell there are good solutions built on the pipes and conduit libraries. For #rstats I can recommend the chunked and LaF packages.

No idea about #rust, but I think something like this must exist in every established language in one way or the other.
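As one concrete instance of the chunked-reading pattern the reply describes (my illustration, not from the thread, and assuming pandas is available and the CSV has no header row): pandas' read_csv accepts a chunksize argument, so only that many rows are ever in RAM at once while the row-wise means are computed with vectorised operations.

```python
import pandas as pd

def chunked_group_means(path, groups, chunksize=10_000):
    """Per-row means over each group of column indices, reading the
    headerless CSV in chunks of `chunksize` rows. Returns a DataFrame
    with one column per index group."""
    out = []
    for chunk in pd.read_csv(path, header=None, chunksize=chunksize):
        # .iloc[:, g] selects the group's columns; mean(axis=1) is row-wise.
        means = {j: chunk.iloc[:, g].mean(axis=1) for j, g in enumerate(groups)}
        out.append(pd.DataFrame(means))
    return pd.concat(out, ignore_index=True)
```

Collecting only the per-group means (a few floats per row) keeps the result small even when each input row has 8000+ values.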


@ClemensSchmid @MrHedmad {arrow} and parquet formats might help??

@djnavarro has some articles on her blog about using arrow https://blog.djnavarro.net/
