_Finally_ getting around to actually implementing tiled light culling. Which was the original reason I started this renderer... a year+ ago?

It is, of course, incredibly broken so far:

Incidentally, this is what it's _supposed_ to look like (culling is off here).

In classic programming fashion, the bug was _not_ in the complicated culling math (which I rewrote twice in an attempt to root it out). Instead, it was a bug in how I was computing the bit index of a light when adding it to the visibility mask.
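For reference, the indexing in question looks roughly like this (a Python sketch, not the actual shader code; the names are made up):

```python
# Hypothetical sketch of the per-tile visibility mask indexing.
# 32 words x 32 bits = 1024 possible lights per tile.
WORDS_PER_TILE = 32

def mark_visible(mask, light_index):
    word = light_index // 32   # which 32-bit word the light lands in
    bit = light_index % 32     # which bit within that word
    mask[word] |= 1 << bit

def is_visible(mask, light_index):
    return (mask[light_index // 32] >> (light_index % 32)) & 1 == 1
```

Mix up the divide and the modulo (or feed in the wrong index entirely) and bits silently land in the wrong words, which is exactly the kind of bug that looks like broken culling math.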

Works now!

Heatmap view, with the light count in the scene raised to 256.

There are two culling tests, and a light is only marked visible if it passes both:
- inside the planes of the subfrustum
- inside the viewspace AABB of the subfrustum

AABB only is decent, but falls flat on edges. Frustum only is _awful_. The two together do a pretty good job, though.
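The AABB half of the test is the standard closest-point check: clamp the sphere center to the box and compare the squared distance against the squared radius. A Python sketch (view-space coordinates assumed):

```python
def sphere_intersects_aabb(center, radius, box_min, box_max):
    # Clamp the sphere center to the box to find the closest point on
    # (or in) the box, then compare squared distance against radius^2.
    d2 = 0.0
    for c, lo, hi in zip(center, box_min, box_max):
        closest = min(max(c, lo), hi)
        d2 += (c - closest) ** 2
    return d2 <= radius * radius
```

The frustum-plane half is the usual signed-distance-per-plane test; the point of running both is that each one rejects a different set of false positives near tile corners and edges.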

Performance is interesting at 1440p.

Culling enabled:
- cull pass: 0.7ms
- lighting pass: 0.8ms

Culling disabled (just filling the bitmask for active lights):
- cull pass: 0.08ms
- lighting pass: ~4ms

Definitely a win (although with fewer lights / lower resolution it's a much narrower gap). Weird that cull and lighting take pretty much the same amount of time, though.

For further context, the gbuffer pass takes 2.5ms at 1440p. Which is quite an interesting reversal from what I'd expect.

Part of that is that I haven't really implemented some of the more complex shading yet (all bog-standard GGX and Lambert so far). But these are all sphere area lights, not point lights. I'm _trying_ to do expensive stuff in the lighting pass and apparently failing.

Turns out even the laptop version of a 3080 is _beefy_.

With 16x16 tiles at 1440p, we end up with 14400 tiles (interesting coincidence, that.)

32 4-byte words per tile to store the light bitmask, which all adds up to about 1.75MB of buffer data. Less than I expected, to be honest.

This gives us a limit of 1024 lights; we could expand this buffer easily, but the 1024 limit is _actually_ caused by the UBO storing light data. (64 byte struct, so we can only fit 1K of them.) Lots of that data could be packed, but that has other consequences.
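Spelling out the arithmetic (assuming 2560x1440 and the common 64KB uniform block limit, which is typical on desktop GPUs):

```python
width, height, tile = 2560, 1440, 16
tiles = (width // tile) * (height // tile)   # 160 * 90 = 14400 tiles

words_per_tile = 32                          # 32 x 32-bit words = 1024 bits
mask_bytes = tiles * words_per_tile * 4      # ~1.76 MB of bitmask data

ubo_limit = 64 * 1024                        # typical GL_MAX_UNIFORM_BLOCK_SIZE
light_struct = 64                            # bytes per light
max_lights = ubo_limit // light_struct       # the 1024-light ceiling
```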

I need to stress test this more, because the entire point was to have a baseline comparison for Fancy Light Grid Structure Idea, and if the naive thing is already really fast it's hard to motivate working on that...

Okay, filling out the scene with the maximum of 1024 lights makes for a more interesting performance result:

No culling:
- cull pass: 0.1ms
- lighting pass: 27ms 😬

With culling:
- cull pass: 4.5ms
- lighting pass: 2.4ms

Interesting that the culling is the bottleneck, not lighting. Definitely some inefficiencies in my implementation there, but the per-light cost should be pretty thin already.

Maybe I need to look into doing culling with raster instead?

At @mjp 's suggestion, added some worst-case alpha test to really stress things. All of the spheres now do stochastic alpha discard! Leads to some truly fantastic depth complexity:

Slightly different view, so the numbers aren't directly comparable to before but:

- cull pass: 4.6ms
- lighting pass: 4.2ms

Looking at the ratios, lighting definitely got worse relative to culling here. Which makes sense: we end up with 64+ lights in these tiles (which is the maximum my heatmap will display; entirely possible it's _much_ worse than that.)

As an aside: the completely ad-hoc alpha test of:
`interleaved_gradient_noise(gl_FragCoord.xy + gl_FragCoord.w)`
works shockingly well at separating the noise at different depth layers. I don't feed this pass a frame index yet, but I suspect if I animated it and let TAA munch away it'd do a really good stochastic alpha fade.
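For reference, the underlying function is Jimenez's interleaved gradient noise; a rough Python port of the usual GLSL one-liner (the `+ gl_FragCoord.w` offset above is my own ad-hoc addition, not part of the standard formula):

```python
import math

def fract(x):
    return x - math.floor(x)

def interleaved_gradient_noise(x, y):
    # Magic constants from Jimenez's "Next Generation Post Processing
    # in Call of Duty: Advanced Warfare" talk.
    return fract(52.9829189 * fract(0.06711056 * x + 0.00583715 * y))
```

A fragment survives the alpha test when the noise value is below its alpha, so alpha directly controls the fraction of surviving pixels, and the gradient structure of the noise keeps neighboring pixels well-separated.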

It's deeply satisfying to me how much I can get away with on this GPU and still hit 16ms.

Like... "every shader does discard" was a real, serious performance problem I've had to solve at work (protip: discard on phones is Very Bad), and this thing doesn't even blink. It's refreshing.

Also, confirmed: this makes for _really_ nice stochastic alpha once animated:

(Has the usual problem of stochastic alpha where it can only really do fade, not opacity, but still useful.)

An interesting debug view: instead of total light count in the tile, displaying the per-pixel false-positive count (lights in the tile that fail the range check at that pixel.)

It also occurs to me now that my heatmap would probably be more useful if it were logarithmic instead of linear.

Okay, remapping with:
`log2(float(count) + 1.0) / 2.0`

This gives blue at 3, green at 15, red at 63, and white at 255:
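Sanity-checking the remap: those counts are each one below a power of two, so they land exactly on integer stops (a small Python sketch; the color-stop values are my assumption about where the heatmap samples its gradient):

```python
import math

def remap(count):
    # Logarithmic heatmap remap: counts 3, 15, 63, 255 land exactly
    # on 1.0, 2.0, 3.0, 4.0 (assumed blue/green/red/white stops).
    return math.log2(float(count) + 1.0) / 2.0
```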

So in this worst case stress test, we have quite a few tiles with more than 255 _false positive_ lights.

Definitely some room for improvement, then!

Figured out why my tile culling was broken at 1080p (where the vertical resolution is not evenly divisible by 16.)

_Two_ separate off-by-one errors, in clamping for min/max depth pass, as well as in the pixel->tile lookup for shading.

There's an argument to be made that min/max depth should be computed in the same pass as culling. However, it turns out that the culling pass has some odd scaling behavior for different group sizes:

(1080p, 1024 lights)
- 32: 1.95ms
- 64: 2.75ms
- 128: 3.05ms
- 256: 2.99ms
- 1024: 2.73ms

In general, culling seems to scale poorly with increased thread count (although it peaks and starts getting better past 128?)

16x16 thread group would be 256, one of the worst cases.

Although, come to think of it, we only sample one channel for depth, and we're doing a reduction, so 8x8 with a single gather per thread should be sufficient. And then we can skip the bandwidth / barrier overhead of the minmax pass.
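A sketch of that layout, with Python standing in for the compute shader: `textureGather` returns a 2x2 quad from one channel, so 64 threads cover all 256 pixels of a 16x16 tile with one gather each.

```python
def tile_depth_min_max(depth, tile_x, tile_y):
    # 8x8 "threads", each reducing the 2x2 gather footprint it owns
    # within a 16x16 tile. In the real shader the 64 per-thread results
    # would reduce further via shared memory / subgroup ops; the loops
    # here just flatten that.
    d_min, d_max = float("inf"), float("-inf")
    for thread_y in range(8):
        for thread_x in range(8):
            for oy in range(2):      # the 2x2 quad one gather returns
                for ox in range(2):
                    d = depth[tile_y * 16 + thread_y * 2 + oy][tile_x * 16 + thread_x * 2 + ox]
                    d_min = min(d_min, d)
                    d_max = max(d_max, d)
    return d_min, d_max
```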

(There are additional benefits, too: having the depth information in the cull pass is actually really useful; it enables 2.5D culling.)

Implemented 2.5D culling. (Basic idea: after computing min/max depth for the tile, slice it into 32 ranges, build a bitmask of depths that are present. Build the same for lights. Only accept lights that intersect the depth mask.)
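The slicing logic, sketched in Python (names made up; the real thing lives in the cull shader):

```python
SLICES = 32

def range_to_mask(lo, hi, tile_min, tile_max):
    # Map a depth range [lo, hi] onto a bitmask of the 32 slices that
    # evenly divide the tile's [tile_min, tile_max] depth extent.
    scale = SLICES / (tile_max - tile_min)
    first = max(0, min(SLICES - 1, int((lo - tile_min) * scale)))
    last = max(0, min(SLICES - 1, int((hi - tile_min) * scale)))
    mask = 0
    for i in range(first, last + 1):
        mask |= 1 << i
    return mask

# The geometry mask is the OR of range_to_mask() over the tile's pixel
# depths; a light survives only if its own mask overlaps it:
def light_passes(light_mask, geometry_mask):
    return (light_mask & geometry_mask) != 0
```

This is what lets a tile with geometry at depth 0.1 reject a light sitting at depth 0.5, even though both are inside the tile's frustum and depth bounds.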

From these A/B shots (false positive heatmap), you can see that it does help. A little. But the mask for the light is very crude, just uses the min/max in view space, doesn't account for the bounding sphere.

It's pretty cheap to do (costs less than I saved moving the min/max depth computation into the culling shader), so it's probably worth having. Doesn't have a huge effect on lighting pass time, though.

I think that exhausts the "typical" tiled lighting culling options I'm aware of. Moving forward, options are to rasterize light volumes or to do a 3D grid instead of 2D.

Holy shit I figured out why my light culling was so slow. 4.45ms -> 0.20ms by changing a couple lines of code.

Do not use a uniform buffer / CBV if every thread is going to access a different index.

One of those tiny bits of trivia you pick up over time and then forget until you step on that particular rake again and get a sharp reminder and a bruised ego.

(Can you tell this is my first time implementing culling on the GPU instead of CPU?)

Weirdly this also seems to make the following lighting pass faster. Which doesn't make sense to me, there's a barrier between them so there shouldn't be any execution overlap.

(Unless the driver can move that barrier across my timestamp queries... which TBH seems like something OpenGL would allow.)

On further thought, these are memory barriers, not execution barriers, so it makes total sense that timestamp queries would not be affected (queries don't read any of the data being written by the shader.)

So reading two GL_TIMESTAMP queries is a top-of-pipe -> top-of-pipe timing, not top-of-pipe -> bottom-of-pipe timing. Hmm. I may need to rewrite my timers.

Annoyingly, while GL_TIME_ELAPSED gives proper TOP->BOP timing, it requires using a begin/end pair, and glEndQuery doesn't take an id.

So they can't be nested. I kinda get it (the end query probably flushes the pipeline, so you'd get bad timings from the outer query) but it's kind of annoying if you want to time both individual passes and the entire frame as a whole.

Okay, it's possible that I can hack around this by adding a `glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE);` before querying each end timestamp.

I'm hoping this doesn't, like, actually create a fence and instead is just signaling a shared fence with a monotonic counter. Would be horrifying to create a new fence each time just for the byproduct of an execution barrier.

Okay, no, compared to GL_TIME_ELAPSED the fence has a pretty steep tax (~40us), and because the timestamp then _includes_ the fence command itself, this spoils things.

I think the answer is to do GL_TIME_ELAPSED for per-pass timing (where I care about the details), and raw timestamp queries for total frame time (where TOP->TOP is... fine, I guess.)

Also, I guess if I have a GL_TIME_ELAPSED around the last pass, followed by the frame time GL_TIMESTAMP query, it doesn't _matter_ that it's top-of-pipe, because the other query already drained the pipeline?

@ataylor I assume graphics doesn't have per-command profiling info like compute?

@oblomov I'm not sure I understand the question. GPUs have a deep pipeline where work can overlap. Accurate timing requires knowing when timestamps get recorded relative to that work, which is true for both graphics and compute.

@ataylor with compute, at least in OpenCL, you can associate an event with each GPU command (compute kernel, memory copy, etc.), which can then be queried for the enqueue, submission, start, and completion times of that particular command. If timestamping is reliable, this gives you all the information you need for the runtime of each command, including command overlap. Do graphics APIs provide comparable information?

@oblomov not packaged nicely like that, but you can build what you need in more modern APIs like Vulkan and DX12.

OpenGL not so much, because the API doesn't expose constructs like "record this timestamp after all FS work preceding it has completed."