I've got a `parmap` function implemented with about a 4x speedup on my 8-core, 16-thread machine.

I suspect I might be memory-bandwidth-bound, since the test compute is trivial, but I'm still pretty happy with these benchmark results.

All the way up to a 1000-element vector, OS process overhead dominates the Rust-managed benchmark. The Alan-managed benchmark (which is noisy due to a sample size of 1) can measure meaningful results from around a 100-element vector and up, and it times *only* the `map`/`parmap` operation: approximately 29ns per iteration in series, and an amortized 9ns per iteration in parallel.
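Concretely, the in-process measurement is shaped something like this (a hypothetical sketch, not the actual generated benchmark; `op` would be the `map` or `parmap` call under test):

```rust
use std::time::Instant;

// Time only the operation itself, excluding process startup, input
// allocation, and any other setup the OS-process benchmark pays for.
fn bench_once<F: FnOnce() -> Vec<i64>>(op: F) -> std::time::Duration {
    let start = Instant::now();
    let out = op();
    let elapsed = start.elapsed();
    std::hint::black_box(out); // keep the optimizer from deleting the work
    elapsed // per-iteration cost is this divided by the input length
}
```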

Anyone who remembers Alan v0.1 will understand just how *vast* the performance difference is. This is due to Rust's and LLVM's compiler optimizations, but I'm still excited to see Alan being useful for high-performance use cases.

Now, to build a proof-of-concept GPGPU mechanism and demo! :D

I just realized that I never turned on optimization passes in the generated binaries. Now I'm getting even better performance. :D

(Though the delta has shrunk to about 3.75x)

That last one that mentions 1 billion elements: that's 8GB of data, btw, that it's loading, modifying, and storing in less than 5 seconds, putting it at over 1.6GB/sec of computation. *That's* why I'm excited about this, since this is just on my laptop. :D

Looks like the theoretical limit is 25.6GB/s for my RAM, so still some room to #10xEngineer this. 😜

Exploring this a bit by manually editing the generated Rust code to inline the function that the `map` and `parmap` functions were calling, to see how it goes. Surprisingly, `map` shows a 30% performance improvement while `parmap` sees zero impact.

I haven't tried to decompile the binaries to assembly to study them (yet?), but I'm wondering if Rust's cross-crate optimization boundary prevented any optimization on the `map` path: the function gets used as-is, instead of the compiler recognizing that parts of it are dead weight that can be removed, as it might already be doing on the other path, which is a manual for loop written for each thread being run. It's possible that it really is a RAM bandwidth issue, though, and the extra CPU burn just isn't noticeable on the `parmap` path.
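If the crate boundary is the culprit, the usual fix is marking the callee `#[inline]` in its defining crate: non-generic functions otherwise don't have their bodies exported for cross-crate inlining (short of LTO), so callers have to use them as opaque calls. A hypothetical sketch of the shape, not the actual generated code:

```rust
// In the library crate. Without #[inline] (or link-time optimization),
// a non-generic pub fn's body isn't available to downstream crates, so
// the binary crate calls it as-is and can't strip its dead weight.
#[inline]
pub fn muli64(a: i64, b: i64) -> i64 {
    a * b // stand-in body; the real generated function carries much more
}
```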

💫 What am I even looking at here? 💫
MIR is decidedly less bad than LLVM IR, but this is still a massive amount of text to describe the `muli64` function.
So while I can understand the MIR layer decently well (and not at all at the LLVM IR and actual assembly levels), I've confirmed that this layer doesn't do any optimization passes, not even dead code removal: I intentionally left an unused function behind, compiled to this level, and it was still there. So it's not useful for figuring out whether anything is being inlined or not, and I'm gonna give up on this kind of low-level performance analysis for now.

Since today is basically exploration, I'm back on this. I did some work to modify the benchmark file to use Rayon instead of my home-grown parallel map function, since surely they've put a lot more thought into this than I have, and...

I get the exact same performance.

Literally.

So I'm apparently already at the optimal implementation one can get from a generalized parallel map function? Glad I checked that, because I was going to waste a week deep-diving into the performance issues here.
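For reference, the Rayon version is nearly a drop-in swap (a sketch assuming `rayon` as a dependency; the real benchmark file is generated code, and `x * 2` stands in for the trivial test compute):

```rust
use rayon::prelude::*;

// Rayon's parallel map: par_iter() replaces iter(), and Rayon's
// work-stealing thread pool handles the chunking and scheduling.
fn parmap_rayon(input: &[i64]) -> Vec<i64> {
    input.par_iter().map(|&x| x * 2).collect()
}
```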

But I'm pretty proud that my little `parmap` Rust implementation, using only the standard library, is equivalent to Rayon's parallel map functionality. :) Not quite to Python's level, but Rust's stdlib is pretty batteries-included (and I think all of the async/await stuff has distracted from "just use a pool of OS threads running sync-ish code").
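For the curious, a std-only `parmap` is shaped something like this (a minimal sketch, not the actual generated code): chunk the input across one scoped OS thread per hardware thread, then stitch the per-chunk results back together in order.

```rust
use std::thread;

fn parmap<T, U, F>(input: &[T], f: F) -> Vec<U>
where
    T: Sync,
    U: Send,
    F: Fn(&T) -> U + Sync,
{
    // One chunk per hardware thread, rounding up so no element is dropped.
    let n_threads = thread::available_parallelism().map_or(1, |n| n.get());
    let chunk_size = ((input.len() + n_threads - 1) / n_threads).max(1);
    let f = &f; // share the closure by reference across the threads
    thread::scope(|s| {
        // Spawn all workers first so the chunks actually run in parallel...
        let handles: Vec<_> = input
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(f).collect::<Vec<U>>()))
            .collect();
        // ...then join in spawn order, which preserves element order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```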