Some initial thoughts after watching the #GPT-4 event and browsing through the "technical report" (technical, really? mostly marketing...).
With a 32k context window, the matrices in the attention layers will be huge. What GPU and memory resources are needed... and how much power/carbon footprint?
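A rough back-of-the-envelope sketch of why those matrices get so big: naive attention materializes a seq_len x seq_len score matrix per head, so memory grows quadratically with context length. The head count below is purely hypothetical (GPT-4's architecture details are not public), and this ignores memory-efficient attention tricks like FlashAttention.

```python
def attention_scores_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    """Size of the naive (seq_len x seq_len) attention score matrices
    for one layer, across all heads, assuming fp16 (2 bytes/element)."""
    return n_heads * seq_len * seq_len * bytes_per_elem

seq_len = 32_768   # the 32k context window
n_heads = 96       # hypothetical head count, roughly GPT-3 scale

per_layer = attention_scores_bytes(seq_len, n_heads)
print(f"~{per_layer / 2**30:.1f} GiB of scores per layer")  # ~192.0 GiB
```

At 32k tokens that is on the order of hundreds of GiB per layer for the raw score matrices alone, which is exactly why long-context models need attention variants that avoid materializing the full matrix.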
I think this model will be outdated before it is even widely deployed.
But I am not sure anyone else is training a model with this amount of data, so I guess we are stuck with it for a while...