Went back to the #HaarWavelets stuff and found a way to control the rhythmicality of the noise: scale the variance of the decorrelated Gaussian random numbers that seed it. With sigma = 1 it's rhythmic; smaller sigma is more uniform. Maybe I could control it per octave instead of for all octaves in unison.
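Rough sketch of the per-octave idea, with a lot assumed: the post doesn't say how the Gaussians enter the synthesis, so here they are simply used as the detail coefficients of an orthonormal inverse Haar transform, with one sigma per octave. `haar_noise` and `sigmas` are made-up names. In this simplified form sigma only sets each octave's level; whatever makes sigma = 1 sound rhythmic in the real thing isn't reproduced.

```python
import numpy as np

def haar_noise(sigmas, rng):
    """Inverse Haar transform of independent Gaussian detail coefficients.

    sigmas[k] scales the Gaussians seeding octave k (coarsest first).
    Octave k has 2**k coefficients, so the output has 2**len(sigmas) samples.
    ASSUMPTION: Gaussians are used directly as wavelet coefficients.
    """
    signal = np.zeros(1)  # DC excluded: coarsest approximation is zero
    for sigma in sigmas:
        detail = sigma * rng.standard_normal(signal.size)
        out = np.empty(2 * signal.size)
        # One orthonormal inverse Haar step: each sample splits into a pair.
        out[0::2] = (signal + detail) / np.sqrt(2)
        out[1::2] = (signal - detail) / np.sqrt(2)
        signal = out
    return signal
```

Passing `np.ones(10)` gives all octaves in unison; something like `np.linspace(1.0, 0.2, 10)` would turn the sigma down progressively towards the finer octaves.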
Now I'm thinking of combining this with the FFT-based stuff. X would be the 10D energy-per-octave analysis (real; usually 11D, but I think it's probably best to exclude DC for this) and Y the 128D FFT analysis (complex, excluding DC) of the same audio. Then I can simulate X with #VectorAutoRegression like the attached (without generating audio) and feed that into the #VARMA to get Y for audio output.
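A minimal sketch of the "simulate X" half only, assuming a first-order VAR fitted by least squares to frames of the 10D octave energies. The order, the function names and the plain least-squares fit are all assumptions, not necessarily what the attached does, and the X-to-Y VARMA stage is left out.

```python
import numpy as np

def fit_var1(X):
    """Fit x[t] - mean = A @ (x[t-1] - mean) + e[t] by least squares.

    X has shape (frames, dims). Returns (A, cov, mean), where cov is the
    covariance of the residuals e. ASSUMPTION: order 1 is enough.
    """
    mean = X.mean(axis=0)
    centred = X - mean
    past, future = centred[:-1], centred[1:]
    # Solve past @ A.T ~= future in the least-squares sense.
    At, *_ = np.linalg.lstsq(past, future, rcond=None)
    resid = future - past @ At
    cov = np.cov(resid, rowvar=False)
    return At.T, cov, mean

def simulate_var1(A, cov, mean, steps, rng):
    """Generate `steps` new frames from the fitted VAR(1) model."""
    dims = A.shape[0]
    L = np.linalg.cholesky(cov)  # gives the innovations covariance cov
    x = np.zeros(dims)
    out = np.empty((steps, dims))
    for t in range(steps):
        x = A @ x + L @ rng.standard_normal(dims)
        out[t] = x + mean
    return out
```

Each simulated row is one frame of X, so no audio is generated at this stage; the frames would then be the input to whatever model produces Y.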