I don't know why all the posts fail to summarize the results properly.

I had a similar idea in the back of my head, but here is a layman's explanation:

Standard attention threads the previous layer's output into the next layer's input. By adding a residual connection at each layer, the layers learn an update rule: each layer adds a correction to the running representation instead of replacing it.
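A minimal sketch of what this update rule looks like, assuming a toy sub-layer (the `layer` function here is a stand-in linear map, not any real transformer block):

```python
import numpy as np

def layer(x, W):
    # Hypothetical sub-layer: a simple linear map with a nonlinearity.
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
weights = [rng.standard_normal((d, d)) for _ in range(3)]

# Residual stack: each layer only sees the previous layer's output,
# and the skip connection turns f_l into an update: x <- x + f_l(x).
for W in weights:
    x = x + layer(x, W)
```

Note that after the first iteration, the original input survives only implicitly inside `x`; nothing downstream can access it directly.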

There is an obvious limitation here: only the first layer sees the original input, and every subsequent layer sees only the previous layer's output.

With attention residuals, the idea is to add a tiny attention operator that lets each layer choose between the original input and any of the previous layers' outputs, instead of being handed only the most recent one.
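A rough sketch of that idea, under my own assumptions (random vectors stand in for learned queries, and `layer` is again a toy sub-layer, not the actual mechanism from the paper): keep the input plus every layer output in a history, and let a tiny attention over the depth dimension mix them before each layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def layer(x, W):
    # Hypothetical sub-layer, same toy map as before.
    return np.tanh(x @ W)

rng = np.random.default_rng(1)
d = 4
x0 = rng.standard_normal(d)
history = [x0]                # original input + all previous layer outputs

for W in [rng.standard_normal((d, d)) for _ in range(3)]:
    q = rng.standard_normal(d)        # stand-in for a learned per-layer query
    H = np.stack(history)             # (depth, d)
    scores = H @ q / np.sqrt(d)       # one score per past representation
    mix = softmax(scores) @ H         # tiny attention over depth picks the blend
    out = mix + layer(mix, W)         # residual update on the chosen mix
    history.append(out)
```

The point of the sketch is just the shape of the computation: the per-layer attention scores over `history` are how a layer can reach back to the original input, something the plain residual stack cannot do.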