I have a very large .csv file with a numerical matrix saved within. I need to calculate the mean of many selections of values in each row (e.g. in each row, the mean of the values at indices 1, 3, 52, and 123; then, in the same row, the values at indices 2, 3, 12, 29, 67, etc.).

My file is HUGE, even in row length (a row has like, 8000+ items), so I need this to be fast. I know the indexes I need to average at the start of the computation, but don't have enough memory to load the whole file at once.

I want to do this in #rust but I don't know how to do it fast enough. Any tips?

@MrHedmad If I interpret your problem correctly, you probably don't want to use Parquet or Arrow for this (as others have suggested), because your operation is fundamentally row-based and not column-based.

You can use the `csv` crate in a "streaming" fashion by reusing your `StringRecord` in every iteration of row parsing:

https://docs.rs/csv/latest/csv/struct.Reader.html#method.read_record

Something like this:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=3f6ba982764f8e97a20bf718d4546f25

It will basically only require memory for one row at a time.
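
A dependency-free sketch of the same streaming idea (the `csv` crate's `read_record` with a reused `StringRecord` additionally handles quoting and avoids re-allocating; here a plain `BufRead` loop with a reused `Vec` shows the constant-memory pattern, with made-up data and index sets):

```rust
use std::io::{BufRead, BufReader, Cursor};

// Mean of the values at `indices` in one parsed row.
fn mean_at(row: &[f64], indices: &[usize]) -> f64 {
    indices.iter().map(|&i| row[i]).sum::<f64>() / indices.len() as f64
}

fn main() {
    // In-memory stand-in for the real file; in practice you'd use
    // BufReader::new(File::open("matrix.csv").unwrap()).
    let data = "1.0,2.0,3.0,4.0\n5.0,6.0,7.0,8.0\n";
    let reader = BufReader::new(Cursor::new(data));

    // The index selections are known before the computation starts.
    let selections: Vec<Vec<usize>> = vec![vec![0, 2], vec![1, 2, 3]];

    // One reused buffer: memory stays constant no matter how long the file is.
    let mut row: Vec<f64> = Vec::new();
    for line in reader.lines() {
        let line = line.unwrap();
        row.clear();
        row.extend(line.split(',').map(|s| s.trim().parse::<f64>().unwrap()));
        let means: Vec<f64> = selections.iter().map(|sel| mean_at(&row, sel)).collect();
        println!("{means:?}"); // e.g. [2.0, 3.0] for the first row
    }
}
```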

1/2

#Rust #RustLang #CSV #Performance


@MrHedmad If you want to go even further, you might be able to split your file into chunks and have threads operate on those chunks individually. However, this will be more complicated: how will you determine the correct boundaries to split on without producing invalid CSV?

Here is a very interesting discussion in the Rust forum about this (with a normal txt file, though, and not csv)

https://users.rust-lang.org/t/reading-a-file-4x-faster-using-4-threads-works-threaded-is-faster/41180
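
One common trick for picking the split points: seek to an approximate offset, then scan forward to the next newline so each thread starts on a row boundary. A sketch (it assumes no quoted fields contain embedded newlines, which holds for a purely numerical matrix):

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

// Find byte offsets that split a file of `len` bytes into `n` chunks,
// each boundary advanced past the next '\n' so no row is cut in half.
fn chunk_boundaries<R: Read + Seek>(f: &mut R, len: u64, n: u64) -> Vec<u64> {
    let mut bounds = vec![0u64];
    for i in 1..n {
        let mut pos = len * i / n; // naive split point
        f.seek(SeekFrom::Start(pos)).unwrap();
        let mut byte = [0u8; 1];
        // Scan forward until we hit the next newline.
        while f.read(&mut byte).unwrap() == 1 && byte[0] != b'\n' {
            pos += 1;
        }
        bounds.push(pos + 1); // first byte of the next full row
    }
    bounds.push(len);
    bounds
}

fn main() {
    // Cursor stands in for a real File here.
    let data = b"1,2,3\n4,5,6\n7,8,9\n10,11,12\n";
    let mut cur = Cursor::new(&data[..]);
    let bounds = chunk_boundaries(&mut cur, data.len() as u64, 2);
    println!("{bounds:?}"); // [0, 18, 27] -- the middle bound lands on a row start
}
```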

2/2

@MrHedmad seems like you'd be seeking into a file. Presumably you can hold a whole row in memory after seeking. Can you make an index so you know which byte to seek to when you want row X? I'm guessing an .h5 format file could do it, but you probably don't want that complexity.
@photocyte
Reading the file one line at a time is not an issue: I can hold a line, even a few hundred/thousand lines in memory at once. The issue - I think - is subsetting the resulting vector(s) of numbers.
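
Since the index sets are known up front, one way to keep the subsetting cheap is to parse only the union of columns that any selection touches, skipping the rest (an illustrative sketch, not from the thread; the data and selections are made up):

```rust
use std::collections::{HashMap, HashSet};

// Given one raw CSV line and the index sets to average, parse only the
// columns that some selection actually uses, then compute each mean.
fn selected_means(line: &str, selections: &[Vec<usize>]) -> Vec<f64> {
    // Union of all indices any selection needs.
    let needed: HashSet<usize> = selections.iter().flatten().copied().collect();
    let mut values: HashMap<usize, f64> = HashMap::new();
    for (i, field) in line.split(',').enumerate() {
        if needed.contains(&i) {
            // Only these fields pay the cost of float parsing.
            values.insert(i, field.trim().parse().unwrap());
        }
    }
    selections
        .iter()
        .map(|sel| sel.iter().map(|i| values[i]).sum::<f64>() / sel.len() as f64)
        .collect()
}

fn main() {
    // With 8000+ columns per row but only a few dozen indices used,
    // most fields are never parsed at all ("not-a-number" is skipped here).
    let means = selected_means("10,20,30,40,not-a-number", &[vec![0, 2], vec![1, 3]]);
    println!("{means:?}"); // [20.0, 30.0]
}
```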

@MrHedmad Sounds like an application for stream processing (https://en.wikipedia.org/wiki/Stream_processing), so in this case reading and handling files in chunks. In #haskell there are good solutions with the libraries pipes and conduit. For #rstats I can recommend the packages chunked and LaF.

No idea about #rust, but I think something like this must exist in every established language in one way or the other.

@ClemensSchmid @MrHedmad {arrow} and parquet formats might help??

@djnavarro has some articles on her blog about using arrow https://blog.djnavarro.net/

@MrHedmad

In #Rust I'd use the CSV crate https://docs.rs/csv/1.2.1/csv

The csv::Reader can read line by line with the records() function, which returns an Iterator.

Then I'd simply chain the further linewise calculations with map() or fold().

@hirsch
Even if it doesn't work or there are better ways, I still want to try it in Rust. Thanks for the pointers!
@MrHedmad to speed things up further, I'd use rayon parallel iterators to run the calculation linewise on all cores.
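
rayon's parallel iterators would be the ergonomic route; a dependency-free sketch of the same idea with `std::thread::scope`, splitting already-read lines across two worker threads (all data and names here are illustrative):

```rust
use std::thread;

// Mean of the values at `idx` in one parsed row.
fn mean_at(row: &[f64], idx: &[usize]) -> f64 {
    idx.iter().map(|&i| row[i]).sum::<f64>() / idx.len() as f64
}

fn main() {
    // In practice these lines would be read in batches from the file.
    let lines = ["1,2,3,4", "5,6,7,8", "9,10,11,12", "13,14,15,16"];
    let sel = [0usize, 3];
    let chunk_size = (lines.len() + 1) / 2; // two worker threads

    let means: Vec<f64> = thread::scope(|s| {
        // Spawn one scoped thread per chunk of lines.
        let handles: Vec<_> = lines
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    chunk
                        .iter()
                        .map(|line| {
                            let row: Vec<f64> =
                                line.split(',').map(|x| x.parse().unwrap()).collect();
                            mean_at(&row, &sel)
                        })
                        .collect::<Vec<f64>>()
                })
            })
            .collect();
        // Joining in spawn order keeps the results in row order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    });
    println!("{means:?}"); // [2.5, 6.5, 10.5, 14.5]
}
```

With rayon this collapses to something like `lines.par_iter().map(...)`, which also load-balances across all cores instead of using fixed chunks.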
@MrHedmad If you are comfortable writing SQL then #duckdb might be useful: https://duckdb.org/docs/data/csv/overview.html and if you are an #rstats person then you can query duckdb with #dplyr, too.
@thomas_sandmann
This looks interesting, and I'm OK with SQL. But at a glance I couldn't tell whether it needs to load the whole file in memory at once or not...
@MrHedmad duckdb tables can be stored in memory (less helpful) or be written to disk, e.g. by pointing the duckdb::duckdb() rstats function to a directory. Then you are only limited by your available disk space, not RAM. This example might be helpful: https://bwlewis.github.io/duckdb_and_r/taxi/taxi.html

@MrHedmad And if you prefer parquet files as a starting point, then duckdb can do the conversion as well: https://rmoff.net/2023/03/14/quickly-convert-csv-to-parquet-with-duckdb/ without reading the full file into memory.