I don't know why all the posts fail to summarize the results properly.

I had a similar idea in the back of my head, but here is a layman's explanation:

Standard attention threads the previous layer's output into the next layer's input. By adding a residual connection at each layer, the layers learn an update rule: each layer adds a correction to the running representation instead of replacing it.
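A minimal sketch of what this update rule looks like, assuming a toy sub-layer (the `layer` function here is a stand-in linear map, not any real transformer block):

```python
import numpy as np

def layer(x, W):
    # Hypothetical sub-layer: a simple linear map with a nonlinearity.
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
weights = [rng.standard_normal((d, d)) for _ in range(3)]

# Residual stack: each layer only sees the previous layer's output,
# and the skip connection turns f_l into an update: x <- x + f_l(x).
for W in weights:
    x = x + layer(x, W)
```

Note that after the first iteration, the original input survives only implicitly inside `x`; nothing downstream can access it directly.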

There is an obvious limitation here: only the first layer sees the original input, and every subsequent layer sees only the previous layer's output.

With attention residuals, the idea is to add a tiny attention operator that lets each layer choose between the original input and any of the previous layers' outputs, instead of being handed only the most recent one.
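A rough sketch of that idea, under my own assumptions (random vectors stand in for learned queries, and `layer` is again a toy sub-layer, not the actual mechanism from the paper): keep the input plus every layer output in a history, and let a tiny attention over the depth dimension mix them before each layer.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def layer(x, W):
    # Hypothetical sub-layer, same toy map as before.
    return np.tanh(x @ W)

rng = np.random.default_rng(1)
d = 4
x0 = rng.standard_normal(d)
history = [x0]                # original input + all previous layer outputs

for W in [rng.standard_normal((d, d)) for _ in range(3)]:
    q = rng.standard_normal(d)        # stand-in for a learned per-layer query
    H = np.stack(history)             # (depth, d)
    scores = H @ q / np.sqrt(d)       # one score per past representation
    mix = softmax(scores) @ H         # tiny attention over depth picks the blend
    out = mix + layer(mix, W)         # residual update on the chosen mix
    history.append(out)
```

The point of the sketch is just the shape of the computation: the per-layer attention scores over `history` are how a layer can reach back to the original input, something the plain residual stack cannot do.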