GDC 2015: "Unleash the Benefits of OpenGL ES 3.1 and Android Extension Pack (AEP) (Presented by ARM)" by Hans-Kristian Arntzen, Tom Olson of ARM https://gdcvault.com/play/1022390/Unleash-the-Benefits-of-OpenGL

This presentation ran the gamut, from "wow, that's interesting" to "omg, that's so stupid".

The "wow that's interesting" part was when the presenter was discussing Mali hardware. They mentioned 2 interesting things:

1/5


1. Mali doesn't use warps (where multiple threads share the same program counter). Instead, each thread has its own program counter. This is fascinating; I've never heard of a GPU working this way. And it could definitely change how you write shaders!!

2. They were saying that shared memory (which I'm assuming is LDS) is not any faster than global memory, which is *wild*. I wonder if they just *don't have LDS* and just use global memory instead...

2/5

@GDCPresoReviews Other hardware has an IP (instruction pointer) per thread. However, you still only pick one of the IPs to go fetch & execute, and all the other lanes go idle that clock. So it is almost exactly the same in practical terms as normal SIMD as done by CPUs and AMD GPUs, it's just a different way to express it.

Nvidia sometimes claim to do this, but they give so few details, and it's basically impossible to tell from the outside, since the observed perf is the same.
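TomF's point can be sketched with a toy simulator (entirely my own illustration; the lowest-PC scheduling heuristic and all the names are assumptions, not documented vendor behavior): give every lane its own PC, but each step fetch only one PC, so lanes sitting at any other PC idle that step. Operationally that's the same as a shared PC plus an execution mask.

```python
# Toy model (my sketch, not how Mali/NVIDIA document it): each lane has its
# own pc, but per step the scheduler picks ONE pc to fetch; lanes whose pc
# differs simply idle that step -- operationally identical to SIMD masking.

def branch(lane):      # pc 0: if (tid == 0) goto 1 else goto 2
    return 1 if lane["tid"] == 0 else 2

def add_one(lane):     # pc 1: x += 1; then retire
    lane["x"] += 1
    return None

def mul_five(lane):    # pc 2: x *= 5; then retire
    lane["x"] *= 5
    return None

program = [branch, add_one, mul_five]

def step(lanes):
    active = [l for l in lanes if l["pc"] is not None]
    if not active:
        return False
    pc = min(l["pc"] for l in active)  # pick one pc (heuristic: lowest)
    for l in active:
        if l["pc"] == pc:              # only lanes at the chosen pc execute
            l["pc"] = program[pc](l)
    return True

lanes = [{"tid": t, "pc": 0, "x": 1} for t in range(4)]
while step(lanes):
    pass
print([l["x"] for l in lanes])  # -> [2, 5, 5, 5]
```

Note the two sides of the branch still execute serially: once the lanes' PCs diverge, only one PC's worth of lanes makes progress per step, exactly as if the rest were masked off.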

@TomF I’m not following. What’s the observable difference between having a PC per thread, but only moving some threads forward together at each cycle, vs having a shared PC but having some threads be disabled?

Like, if you take the shared PC model, and then pretend the disabled threads’ PC is just whatever the last instruction they executed before they got disabled, you’d end up at the former model (it seems to me)

@TomF

Thinking more, I guess the difference is, in the case of an if/else where half your threads are on one side of a branch and the other half are on the other side, you might be able to observe that the threads on both sides could be making interleaved progress.

Whereas with the shared PC model, you’d always observe half your threads would start and complete their side of the branch before the other threads ran any of their side

(Assuming you only have one warp)

@TomF

Suppose you spawn just 2 threads (pseudocode):

atomic int x = 1;
if (threadID == 0) {
    atomicAdd(inout x, 1);
    atomicAdd(inout x, 1);
} else {
    atomicMultiply(inout x, 5);
    atomicMultiply(inout x, 5);
}

I believe there are 6 possible orderings of the atomic operations, and all 6 produce a different final value of x.

So I suppose, in a shared PC model, some of those resulting values would never occur, whereas in the other model, all possibilities could occur
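The claim is easy to check by brute force (a quick sketch of my own; the labels A/B and the helper name are mine): enumerate every interleaving of thread 0's two atomic adds with thread 1's two atomic multiplies and compute the final x for each.

```python
from itertools import permutations

# Brute-force check: interleave thread 0's two atomic adds (+1, labeled A)
# with thread 1's two atomic multiplies (*5, labeled B) in every possible
# order, starting from x = 1, and collect the final value of x for each.

def final_x(order):
    x = 1
    for op in order:
        x = x + 1 if op == "A" else x * 5
    return x

orders = sorted(set(permutations("AABB")))
results = {"".join(o): final_x(o) for o in orders}

print(results)
# -> {'AABB': 75, 'ABAB': 55, 'ABBA': 51, 'BAAB': 35, 'BABA': 31, 'BBAA': 27}
```

Six interleavings, six distinct final values, matching the table below.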

@GDCPresoReviews @TomF
AABB 75
BBAA 27
ABBA 51
BAAB 35
ABAB 55
BABA 31