GDC 2015: "Unleash the Benefits of OpenGL ES 3.1 and Android Extension Pack (AEP) (Presented by ARM)" by Hans-Kristian Arntzen, Tom Olson of ARM https://gdcvault.com/play/1022390/Unleash-the-Benefits-of-OpenGL

This presentation spanned the gamut, from "wow that's interesting" to "omg that's so stupid".

The "wow that's interesting" part was when the presenter was discussing Mali hardware. They mentioned 2 interesting things:

1/5

1. Mali doesn't use warps (where multiple threads share the same program counter). Instead, each thread has its own program counter. This is fascinating; I've never heard of a GPU working this way. And it could definitely change how you write shaders!!

2. They were saying that shared memory (which I'm assuming is LDS) is not any faster than global memory, which is *wild*. I wonder if they just *don't have LDS* and just use global memory instead...

2/5

The "omg that's so stupid" part was when the Google presenter was describing the Android Extension Pack. AFAICT, it's completely useless. It's a "meta extension" that is just a collection of GLES extensions. If your app requires it (rather than requiring the specific extensions you need), you're artificially limiting the number of devices you can run on. And AFAICT you get nothing in return.

3/5

There was a part where the presenter said you can declare in your app manifest that you require it, and then devices that don't have it won't see your app in the app store. But: 1. it wasn't clear whether you can declare the *specific* extensions you require in the manifest, and 2. since you probably don't *actually* require *every* extension in the pack, you probably *want* your app to appear in the store even on devices that don't support the AEP.

So dumb.

4/5

I got the feeling that what Google was trying to do was to guarantee that every device running the next version of Android would support the AEP, and then they realized how many devices that would cause to become incompatible with the next version of Android, and reversed course.

Anyway...

Review: 2/10. It got some points for the hardware info about Mali, but that's it.

@GDCPresoReviews Other hardware has an IP per thread. However, you still only pick one of the IPs to go fetch & execute, and all the other lanes go idle that clock. So it is almost exactly the same in practical terms as normal SIMD as done by CPUs and AMD GPUs, it's just a different way to express it.

Nvidia sometimes claim to do this, but they give so few details, and it's basically impossible to tell from the outside, since the observed perf is the same.

@GDCPresoReviews And just to be different, older Intel GPUs could work in a bizarre hybrid of BOTH modes AT THE SAME TIME. I never really understood why. But they removed that stuff and their newer GPUs do (somewhat) normal predicated SIMD now.
@TomF @GDCPresoReviews I was going to ask whether this means you can maybe duplicate the ip into "execution units" to avoid everything talking to the same wires, but then I figure you can do that regardless of how you expose the instruction pointer architecturally.
@TomF @GDCPresoReviews I believe NVidia sometimes uses this to *not* reconverge as soon as possible, to avoid e.g. deadlocks between lanes in the same warp, by interleaving execution of both sides. That gets a lot more complicated on explicit predicated SIMD machines.
@GDCPresoReviews @bas This comparison was a significant chunk of my job at Intel, but my conclusion was that both versions are complex, equally gnarly in the details, and neither is an obvious win/lose. Also, there are plenty of hybrid versions.

@TomF I’m not following. What’s the observable difference between having a PC per thread, but only moving some threads forward together at each cycle, vs having a shared PC but having some threads be disabled?

Like, if you take the shared PC model, and then pretend the disabled threads’ PC is just whatever the last instruction they executed before they got disabled, you’d end up at the former model (it seems to me)

@TomF

Thinking more, I guess the difference is, in the case of an if/else where half your threads are on one side of a branch and the other half are on the other side, you might be able to observe that the threads on both sides could be making interleaved progress.

Whereas with the shared PC model, you’d always observe half your threads would start and complete their side of the branch before the other threads ran any of their side

(Assuming you only have one warp)

@TomF

Suppose you spawn just 2 threads (pseudocode):

atomic int x = 1;
if (threadID == 0) {
    atomicAdd(inout x, 1);
    atomicAdd(inout x, 1);
} else {
    atomicMultiply(inout x, 5);
    atomicMultiply(inout x, 5);
}

I believe there are 6 possible orderings of the atomic operations, and all 6 produce a different final value of x.

So I suppose, in a shared PC model, some of those resulting values would never occur, whereas in the other model, all possibilities could occur
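To sanity-check that claim, here's a quick Python sketch (not GPU code, just simulating the memory model) that enumerates every ordering of the four atomic ops that respects each thread's program order:

```python
from itertools import permutations

# Thread A does two atomic adds (+1); thread B does two atomic multiplies (*5).
# A legal interleaving is any ordering of the four ops that keeps each thread's
# own ops in order, i.e. exactly the distinct permutations of "AABB".
op = {"A": lambda v: v + 1, "B": lambda v: v * 5}

results = {}
for order in sorted(set(permutations("AABB"))):
    x = 1  # atomic int x = 1;
    for thread in order:
        x = op[thread](x)
    results["".join(order)] = x

print(results)
# → {'AABB': 75, 'ABAB': 55, 'ABBA': 51, 'BAAB': 35, 'BABA': 31, 'BBAA': 27}
```

All six final values are distinct, so the final value of x tells you exactly which interleaving the hardware produced; with a single warp in a shared-PC model you'd only ever observe the two non-interleaved outcomes (75 and 27).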

@GDCPresoReviews @TomF
AABB 75
BBAA 27
ABBA 51
BAAB 35
ABAB 55
BABA 31
@GDCPresoReviews In the explicit model, you still need some sort of stack/queue/list of IP+lane mask, and every now and then (e.g. flow control) you check that to see what to run next. It should be obvious that this is completely interchangeable with an IP-per-lane model. So it's "just" a question of implementation efficiency.
@GDCPresoReviews e.g. you don't just want to round-robin the IPs, because for a simple if() clause, half the lanes will take it, half won't, and they'll never converge back, so you run at half speed for the rest of the shader. Obviously bad. There are papers about finding the right convergence points and suchlike.
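A toy sketch of what such a scheduler could look like (purely hypothetical, in Python; the min-PC pick is one heuristic from the reconvergence literature, not a claim about what Mali or anyone else actually does): each lane has its own PC, and each cycle the scheduler runs every lane sitting at the lowest live PC, which lets an if/else reconverge at the join point instead of staying split forever.

```python
# Toy per-lane-PC machine (hypothetical; NOT a description of real hardware).
# Each "cycle" the scheduler picks the smallest live PC and executes that one
# instruction for every lane currently parked there; all other lanes idle.
# Picking min-PC drains both sides of an if/else before the join, so the full
# mask is back together when execution reaches the join point.

PROG = [
    ("branch_odd", 3),  # 0: odd lanes jump to the else block at PC 3
    ("then", None),     # 1: then-side work
    ("jmp", 4),         # 2: skip over the else block
    ("else", None),     # 3: else-side work
    ("join", None),     # 4: reconvergence point
]

def run(num_lanes):
    pcs = [0] * num_lanes
    trace = []  # one (pc, op, lanes executed) entry per cycle
    while min(pcs) < len(PROG):
        pc = min(pcs)
        mask = tuple(i for i in range(num_lanes) if pcs[i] == pc)
        op, arg = PROG[pc]
        trace.append((pc, op, mask))
        for i in mask:
            if op == "branch_odd":
                pcs[i] = arg if i % 2 else pc + 1
            elif op == "jmp":
                pcs[i] = arg
            else:
                pcs[i] = pc + 1
    return trace

for step in run(4):
    print(step)
# → (0, 'branch_odd', (0, 1, 2, 3))
#   (1, 'then', (0, 2))
#   (2, 'jmp', (0, 2))
#   (3, 'else', (1, 3))
#   (4, 'join', (0, 1, 2, 3))
```

Depending on pick order, a naive round-robin over live PCs can end up running the join twice with half masks; the min-PC pick is what forces the lanes back together here (for structured control flow, at least).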
@GDCPresoReviews caveat on this toot: be careful about outdated information. Don't assume that more modern Mali architectures work this way.