It is really wild how fast one can implement stuff on the #GPU using just #openmp.
Talk about zero to hero in a couple of weeks. Some optimizations involve the #simd directive, and it is not entirely clear to me when to use it, while #gcc also uses some very non-transparent rules to map code to vector #ptx instructions. But the gain is worth the pain of not understanding these two (and a few other things).
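
For anyone curious what "just #openmp" looks like, here is a minimal sketch and not my actual code: the function name `saxpy` and its parameters are made up for illustration. It offloads a simple loop with a single combined directive, and the `simd` clause at the end is the part whose benefit I cannot always predict. If no GPU is available, the loop should simply fall back to the host.

```c
#include <stddef.h>

/* Hypothetical example: computes y = a*x + y for n elements.
 * The combined construct asks OpenMP to run the loop on a target
 * device (e.g. a GPU), spread it over teams and threads, and also
 * mark it as vectorizable via simd. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    /* map(to: ...) copies x to the device; map(tofrom: ...) copies y
     * to the device and back to the host when the region ends. */
    #pragma omp target teams distribute parallel for simd \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With #gcc this would typically be built with something like `gcc -fopenmp -foffload=nvptx-none`, assuming the compiler was installed with NVPTX offloading support. Without `-fopenmp` the pragma is ignored and the loop runs as plain serial C.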