Mastodawn

John Henry Deppe Dec 10, 2024

I'm a grad student and I'm pursuing a project very similar to this expired SGI patent: https://patents.google.com/patent/US6105113A/en

I've seen you post a few things about TLBs in various places so I thought I might ask your opinion:

Why didn't the automatically-invalidated TLB come to pass?

US6105113A - System and method for maintaining translation look-aside buffer (TLB) consistency - Google Patents

A system and method for maintaining consistency between translational look-aside buffers (TLB) and page tables. A TLB has a TLB table for storing a list of virtual memory address-to-physical memory address translations, or page table entries (PTEs) and a hardware-based controller for invalidating a translation that is stored in the TLB table when a corresponding page table entry changes. The TLB table includes a virtual memory (VM) page tag and a page table entry address tag for indexing the list of translations. The VM page tag can be searched for VM pages that are referenced by a process. If a referenced VM page is found, an associated physical address is retrieved for use by the processor. The TLB controller includes a snooping controller for snooping a cache-memory interconnect for activity that affects PTEs. The page table entry address tag can be searched by a search engine in the TLB controller for snooped page table entry addresses. The TLB controller includes an updating module for invalidating or updating translations associated with snooped page table entry addresses. Translations in TLBs are thus updated or invalidated through hardware when an operating system changes a PTE, without intervention by an operating system or other software.


JohnMashey Dec 10, 2024

@DumbPseudonym
Hi, I’m at conference, about to sleep, so more Tues.
I ~spec’d the MMUs for MIPS R2000/3000, R4x00/R1x000, ie working with the logic designers, who gave me most of what I wanted & I explicitly didn’t ask for this hardware, on purpose.
I think this patent was of the form “we might want to do this some day, so we better patent it so we’re protected”.
But crucial: hardware that is complex, large or in critical path needs to be justified by frequency*cost of software.


JohnMashey

@DumbPseudonym
1/ In between sessions at #agu2024, so briefly:
Recall that the aforementioned MIPS CPU TLBs only did translations, and if they did not find a valid PID-VPN match, they trapped to software to refill. The TLB hardware had no connection to cache or memory, and never altered/deleted entries except under software control. Page Table Entries were normally kept in cacheable space, but that was not required; they could be in uncached space. In fact, one could use those TLBs with no normal page tables at all, by constructing PTEs on misses.
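
As a rough illustration of the software-refill idea described here, the following toy model may help. It is only a sketch under stated assumptions: the class and function names (`SoftTLB`, `access`, `make_mapping`), the 4 KiB page size, and the FIFO eviction are all invented for the example and are not taken from any MIPS manual or OS source.

```python
# Toy model of a software-refilled TLB. Every name here is hypothetical.
from collections import OrderedDict

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class TLBMiss(Exception):
    """Models the trap taken when no valid (pid, vpn) entry is present."""


class SoftTLB:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()  # (pid, vpn) -> pfn

    def translate(self, pid, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if (pid, vpn) not in self.entries:
            raise TLBMiss((pid, vpn))
        return (self.entries[(pid, vpn)] << PAGE_SHIFT) | offset

    def write_entry(self, pid, vpn, pfn):
        # Only software inserts or evicts entries; the "hardware" never does.
        if (pid, vpn) not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # FIFO eviction, for simplicity
        self.entries[(pid, vpn)] = pfn


def access(tlb, pid, vaddr, make_pte):
    """Translate an address, running the software refill handler on a miss."""
    try:
        return tlb.translate(pid, vaddr)
    except TLBMiss:
        vpn = vaddr >> PAGE_SHIFT
        tlb.write_entry(pid, vpn, make_pte(pid, vpn))
        return tlb.translate(pid, vaddr)


def make_mapping(pid, vpn):
    # No page table at all: the handler constructs the mapping on the fly.
    return vpn + 0x100


tlb = SoftTLB(capacity=2)
paddr = access(tlb, pid=1, vaddr=0x3ABC, make_pte=make_mapping)
```

The point of passing `make_pte` as a callback is that the refill policy lives entirely in software: it could walk a conventional page table, or compute the mapping with no table, as in this example.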


JohnMashey Dec 10, 2024

@DumbPseudonym
2/ Snooping caches, if multilevel, normally obey *inclusion* property, ie any line in L1 is also in L2…Ln, and there are usually duplicate tags for Ln, so the bus agent can snoop in dups w/o impacting main pipeline.
But the TLB does NOT obey inclusion property, on purpose.
That means the bus agent would likely need to keep duplicate VPNs to be able to snoop, or else every memory write by any CPU would need to stall every other CPU to check its TLB.
(More to come on OS behavior).


John Henry Deppe Dec 10, 2024

@JohnMashey Thank you! Software-loaded TLB is what got me interested. My OS class used simulated MIPS with the soft-loaded TLB and it seemed like there was a lot one could do with it.

Now I'm trying a software solution (on a RISC-V machine with hardware page walkers) where we designate virtual memory regions with madvise(), invalidate those PTEs so that accesses page-fault, and then software-insert a PTE. There's no TLB hierarchy, so now we can track per-CPU TLB residence and filter shootdowns with that.
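
One way to picture the residence-tracking idea is the sketch below. It is a guess at the general shape, not the poster's actual design: `ResidenceTracker`, `on_fault_insert`, and `shootdown_targets` are hypothetical names, and the model assumes every TLB fill goes through a software fault handler so that software can record which CPU loaded which page.

```python
# Hypothetical sketch: record which CPUs may hold a translation, then
# interrupt only those CPUs when the mapping changes.
from collections import defaultdict


class ResidenceTracker:
    def __init__(self):
        self.resident = defaultdict(set)  # vpn -> CPU ids that may cache it

    def on_fault_insert(self, cpu, vpn):
        # Called from the fault path when software inserts the PTE for `cpu`.
        self.resident[vpn].add(cpu)

    def shootdown_targets(self, vpn, initiator):
        # Remote CPUs that need an invalidate; the initiator flushes locally.
        targets = self.resident.pop(vpn, set())
        return sorted(targets - {initiator})


tracker = ResidenceTracker()
tracker.on_fault_insert(cpu=0, vpn=0x10)
tracker.on_fault_insert(cpu=3, vpn=0x10)
tracker.on_fault_insert(cpu=5, vpn=0x20)
targets = tracker.shootdown_targets(vpn=0x10, initiator=0)
```

The set is an over-approximation: a CPU stays listed even if its TLB has since evicted the entry, which costs an unnecessary interrupt but is never unsafe.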


John Henry Deppe Dec 10, 2024

@JohnMashey After the software approach, I think I want to try taking a cache coherence approach and the problem of finding PTE writes is one I'm contemplating. There's got to be something better than keeping duplicate VPNs as you say.

Watching OS people struggle with TLB coherence has convinced me that we architects should try and simplify the interface for them.


JohnMashey Dec 11, 2024

@DumbPseudonym
3/ This of course is an old issue; a reasonable discussion is in this 1989 thesis:
https://apps.dtic.mil/sti/tr/pdf/ADA632163.pdf
However, MIPS CPUs used TLB + physically indexed caches, so some comments don’t apply.


JohnMashey Dec 11, 2024

@DumbPseudonym
4/ Observe the fact that various companies successfully built shared memory multiprocessors using MIPS CPUs.
I’m one of the SGI Origin 3000s, which eventually reached 512p SMP nodes in Origin 3800. (Bigger ones were clustered).
https://en.m.wikipedia.org/wiki/SGI_Origin_3000_and_Onyx_3000
So, how can that be possible?
This depends on knowledge of typical program behavior & overall statistics.



JohnMashey Dec 11, 2024

@DumbPseudonym
5/ With few exceptions, user programs may grow stack & heap, and maybe shared memory regions, but generally don’t shrink… and PTEs rarely change (VERY important).
-Instruction pages are read-only.
-Data pages may start Clean, but CPU1 tries a Write and traps. OS changes memory PTE to Dirty, replaces copy in TLB. There’s no need to update other TLBs, as that particular inconsistency is safe; it just means other CPU(s) trap unnecessarily if they happen to have copies.
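
The Clean-to-Dirty case can be sketched as follows. This is a minimal model, assuming each CPU's TLB holds a snapshot of the memory PTE and that a write through a Clean entry traps; the `PTE` and `CPU` classes and their fields are invented for illustration.

```python
# Hypothetical model: a stale Clean TLB copy costs one extra trap,
# but never requires a cross-CPU shootdown.
class PTE:
    def __init__(self, pfn):
        self.pfn = pfn
        self.dirty = False


class CPU:
    def __init__(self, page_table):
        self.page_table = page_table  # shared "memory" PTEs
        self.tlb = {}                 # vpn -> (pfn, dirty) snapshot
        self.mod_traps = 0

    def load_tlb(self, vpn):
        pte = self.page_table[vpn]
        self.tlb[vpn] = (pte.pfn, pte.dirty)

    def write(self, vpn):
        if vpn not in self.tlb:
            self.load_tlb(vpn)
        if not self.tlb[vpn][1]:
            # Write to a page whose TLB copy is Clean: trap to the OS.
            self.mod_traps += 1
            self.page_table[vpn].dirty = True  # may already be Dirty
            self.load_tlb(vpn)                 # refresh the local copy only
        return self.tlb[vpn][0]


page_table = {7: PTE(pfn=0x42)}
cpu1, cpu2 = CPU(page_table), CPU(page_table)
cpu1.load_tlb(7)
cpu2.load_tlb(7)   # both CPUs now hold Clean copies
cpu1.write(7)      # traps, marks the memory PTE Dirty
cpu2.write(7)      # stale Clean copy: one redundant trap, then refreshed
cpu2.write(7)      # no trap this time
```

Here `cpu2` takes exactly one unnecessary trap and finds the memory PTE already Dirty, which is the safe, redundant work the post describes.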


JohnMashey Dec 11, 2024

@DumbPseudonym
6/ Likewise, there are no hardware-set reference bits, so they get simulated in the usual way, by marking a range of memory PTEs temporarily invalid. If TLBs happen to have valid PTEs, some references can get missed, but that’s OK as the effect is temporary, given TLB contention/flushing. A memory PTE has to stay unreferenced a long time before it becomes a candidate for page-out (if dirty) or drop (if really clean).
COW pages require more work if implemented.
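
A minimal sketch of simulated reference bits, assuming a periodic sweep that invalidates memory PTEs and a refill slow path that revalidates them. The names (`sweep`, `refill_fault`, `idle_sweeps`) and the threshold policy are hypothetical; real kernels differ in many details.

```python
# Hypothetical sketch of software-simulated reference bits.
class PTE:
    def __init__(self):
        self.valid = True        # the bit the refill path checks
        self.referenced = False  # software-only bit
        self.idle_sweeps = 0     # consecutive sweeps with no reference seen


def sweep(page_table, threshold):
    """Age every PTE; return vpns idle for `threshold` consecutive sweeps."""
    candidates = []
    for vpn, pte in page_table.items():
        if pte.referenced:
            pte.idle_sweeps = 0
        else:
            pte.idle_sweeps += 1
        pte.referenced = False
        pte.valid = False  # the next TLB refill will take the slow path
        if pte.idle_sweeps >= threshold:
            candidates.append(vpn)
    return sorted(candidates)


def refill_fault(page_table, vpn):
    """Slow-path refill: the page is still resident, so just revalidate."""
    pte = page_table[vpn]
    pte.valid = True
    pte.referenced = True


page_table = {1: PTE(), 2: PTE()}
sweep(page_table, threshold=3)
refill_fault(page_table, 1)   # page 1 is touched between sweeps
sweep(page_table, threshold=3)
refill_fault(page_table, 1)
result = sweep(page_table, threshold=3)
```

A CPU that still holds a valid TLB entry never calls `refill_fault`, so its references go unseen for a while. The threshold of several sweeps is what makes that imprecision tolerable in this sketch.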


JohnMashey Dec 11, 2024

@DumbPseudonym
7/ Of course, when a process exits, unshared pages are freed, as are all such memory PTEs. TLB entries included a 6-bit tag dynamically assigned so 64 processes could share TLB. Of course, there are often more than 64 processes, but only when switching among more than 64 does OS need to flush TLB and reassign tags. Anyway, when process exits, OS doesn’t reuse that tag until after a flush.
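
The tag-recycling rule can be sketched like this. It is a simplified single-CPU model with invented names (`AsidAllocator`, `asid_for`, `exit`); the only detail taken from the post is the 6-bit tag and the rule that a tag is not reused until after a flush.

```python
# Hypothetical sketch of 6-bit TLB tag assignment with flush-on-exhaustion.
ASID_BITS = 6
NUM_ASIDS = 1 << ASID_BITS  # 64 tags


class AsidAllocator:
    def __init__(self, flush_tlb):
        self.flush_tlb = flush_tlb  # callback that empties the TLB
        self.next_free = 0
        self.assigned = {}          # pid -> tag, valid until the next flush

    def asid_for(self, pid):
        if pid in self.assigned:
            return self.assigned[pid]
        if self.next_free == NUM_ASIDS:
            # All tags handed out since the last flush: flush and start over.
            self.flush_tlb()
            self.assigned.clear()
            self.next_free = 0
        asid = self.next_free
        self.next_free += 1
        self.assigned[pid] = asid
        return asid

    def exit(self, pid):
        # Forget the process, but do not recycle its tag before a flush,
        # since stale TLB entries may still carry that tag.
        self.assigned.pop(pid, None)


flushes = []
alloc = AsidAllocator(flush_tlb=lambda: flushes.append("flush"))
first = [alloc.asid_for(pid) for pid in range(NUM_ASIDS)]
flushes_before = len(flushes)
alloc.exit(5)                   # tag 5 is not handed out again yet
late_tag = alloc.asid_for(100)  # 65th distinct process forces a flush
```

Because `exit` never returns a tag to a free pool, no explicit TLB invalidation is needed when a process dies; the cost is deferred to the next wholesale flush.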


JohnMashey Dec 11, 2024

@DumbPseudonym
8/ Anyway, the bottom line is:
- All this depends on understanding the PTE state transitions and the frequencies under normal use
- And there are many cases where TLBs might have entries inconsistent with memory, in ways that are safe but might cause some redundant work (like several CPUs finding memory PTE already Dirty).
And again, it is a fact that some of the largest SMPs ever built used software TLB management!


John Henry Deppe Dec 11, 2024

@JohnMashey Indeed! My advisor worked on NUMAchine at Toronto and I understand the MIPS flexibility was very important for that project.

Thanks for all your comments! Your perspective is really valuable to me.