I'm working on a little #Uxntal optimisation algorithm that is fundamentally of little use. It's one of those "I can't help myself" things, but I think it is quite nice:
Where possible, it eliminates stores and loads in an <addr> <store with keep> ... <load> sequence, whatever the intervening instructions are. In other words, it separates the stack juggling that only exists to bring the address back to the top of the stack for the load from the other operations.
After that is done, I have another algorithm that can eliminate the store/load pairs altogether.
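To make the pattern concrete, here is a rough illustration in Python rather than Uxntal, so it can be run directly. It is not the algorithm itself, only a toy model of a handful of byte-mode Uxn operations (an assumed subset: literals, STZk, LDZ, SWP, DUP, INC) with one hand-written before/after pair. The `run` function, the token lists and the address 0x10 are all made up for the example, and the rewrite is only valid under the assumption that nothing else ever reads that address.

```python
# Toy model of a few byte-mode Uxn stack operations.
# Assumed subset only: no 16-bit mode, no return stack, no jumps.
def run(program, mem=None):
    """Execute a list of tokens and return (stack, memory)."""
    stack = []
    mem = dict(mem or {})
    for tok in program:
        if isinstance(tok, int):      # literal byte, like #2a
            stack.append(tok & 0xFF)
        elif tok == "STZk":           # store with keep: ( val addr -- val addr )
            val, addr = stack[-2], stack[-1]
            mem[addr] = val
        elif tok == "LDZ":            # load: ( addr -- val )
            addr = stack.pop()
            stack.append(mem.get(addr, 0))
        elif tok == "SWP":            # ( a b -- b a )
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif tok == "DUP":            # ( a -- a a )
            stack.append(stack[-1])
        elif tok == "INC":            # ( a -- a+1 )
            stack[-1] = (stack[-1] + 1) & 0xFF
        else:
            raise ValueError(f"unknown token: {tok}")
    return stack, mem


# <addr> <store with keep> ... <load>, roughly: #2a #10 STZk SWP INC SWP LDZ
# The two SWPs are only there to get the address out of the way and back again.
with_memory = [0x2A, 0x10, "STZk", "SWP", "INC", "SWP", "LDZ"]

# Hand-written equivalent with no store, no load and no address (hypothetical
# target of the rewrite, assuming address 0x10 is not read anywhere else).
without_memory = [0x2A, "DUP", "INC", "SWP"]

stack_a, mem_a = run(with_memory)
stack_b, mem_b = run(without_memory)
print([hex(b) for b in stack_a], mem_a)
print([hex(b) for b in stack_b], mem_b)
```

Both versions leave the same two bytes on the stack; the only difference is that the second never touches memory, which is exactly why the rewrite depends on the stored value not being needed elsewhere.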
I'll explain more when I actually get it to work.