intel 💀

they added AVX512-VP2INTERSECT to tiger lake but it was slow as fuck (25 cycles per instruction)

it was faster to implement without using the instruction (https://arxiv.org/abs/2112.06342) so intel nuked it after tiger lake

AMD zen5 brought it back with speeds of 1 clock cycle per instruction

what the fuck is intel smoking

Faster-Than-Native Alternatives for x86 VP2INTERSECT Instructions

We present faster-than-native alternatives for the full AVX512-VP2INTERSECT instruction subset using basic AVX512F instructions. These alternatives compute only one of the output masks, which is sufficient for the typical case of computing the intersection of two sorted lists of integers, or computing the size of such an intersection. While the naïve implementation (compare the first input vector against all rotations of the second) is slower than the native instructions, we show that by rotating both the first and second operands at the same time there is a significant saving in the total number of vector rotations, resulting in the emulations being faster than the native instructions, for all instructions in the VP2INTERSECT subset. Additionally, the emulations can be easily extended to other types of inputs (e.g. packed vectors of 16-bit integers) for which native instructions are not available.

arXiv.org
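
for anyone wondering what the instruction actually computes, here's a rough scalar sketch in Python (not the paper's code, and not real intrinsics). The helper names (`vp2intersect_ref`, `rotate`, `cmpeq_mask`, `emulate_k1_naive`) are made up for illustration, and vectors are modelled as plain lists. It only shows the naïve "compare against all rotations of the second operand" emulation mentioned in the abstract; the paper's faster trick of rotating both operands is not reproduced here.

```python
def vp2intersect_ref(a, b):
    """Scalar model of the two output masks.

    Bit i of k1 is set if a[i] equals any element of b.
    Bit j of k2 is set if b[j] equals any element of a.
    """
    k1 = 0
    k2 = 0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                k1 |= 1 << i
                k2 |= 1 << j
    return k1, k2


def rotate(v, r):
    """Rotate the lanes of v by r positions (stand-in for a vector rotate/permute)."""
    return v[r:] + v[:r]


def cmpeq_mask(a, b):
    """Lane-wise equality compare, returning a bitmask (stand-in for a vector compare)."""
    m = 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x == y:
            m |= 1 << i
    return m


def emulate_k1_naive(a, b):
    """Naive emulation: OR together compares of a against every rotation of b.

    This yields only the first output mask (k1), which is what you need
    to pick out the matching elements of a.
    """
    k1 = 0
    for r in range(len(b)):
        k1 |= cmpeq_mask(a, rotate(b, r))
    return k1


a = [1, 3, 5, 7]
b = [3, 4, 5, 6]
print(vp2intersect_ref(a, b))   # both masks from the scalar model
print(emulate_k1_naive(a, b))   # first mask only, via rotations
```

with n lanes the naïve version needs n compares and n-1 rotations, which is the cost the paper says it cuts down by rotating both inputs.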
@niko absolutely classic intel
@xeno @niko guessing the original fast silicon implementation was broken so they fell back to a microcoded bodge. Which scared devs off so they scrapped it
@azonenberg @xeno i think the original implementation was just shitty silicon; from what i can tell, VP2INTERSECT was slow from the start
@niko @xeno what i mean is i think the original fast intel implementation was bugged and never saw the light of day, the 25 cycle version was the microcoded bandaid to work around that bug post silicon
@azonenberg @xeno ah okay yeah that's likely
@niko @xeno it's very common for intel to implement new instructions in a way that if they don't work under some/all conditions you can disable the hardware implementation and microcode it instead... This one probably just sucked so hard in the fallback nobody used it
@niko @azonenberg that feels like the most likely thing to me, and that would also be absolutely classic intel lol

still remember double-taking when I first read about intel’s super fancy awesome loop accelerator that had to be fused off before they even sold a single unit because it was so full of unpatchable security vulns
@xeno @azonenberg wait what i haven't heard of this one
@azonenberg @niko I don’t remember the name but it was like in the footnote of a paper; they spent god knows how much R&D basically building a separate mini processor into each core that could keep track of different loops happening to improve branch prediction and caching

some security researchers got their hands on engineering samples and they reported so many security vulnerabilities that even intel (intel!) was just like chuck it in the bin that’s too much

It was in the footnote of a paper by some researchers who had worked on it, they were kinda thankful intel had common sense but also kinda annoyed their wealth of awesome cool vulns only worked on silicon nobody ever could buy

So they talked about it in a paper and were like “here’s all these cool vulns, that we have techniques for blah, blah, and blah*.” with the note “*blah never actually shipped in a consumer product due to these vulnerabilities…”