This talks about the evolution of the Slug font rendering algorithm, and it includes an exciting announcement: The patent has been dedicated to the public domain.
https://terathon.com/blog/decade-slug.html
Low-level systems stuff. Reverse engineering, security research, bit twiddling, optimisation, SIMD, uarch. 64-bit ARM enthusiast.
he/they
| Blog | https://dougallj.wordpress.com |
| http://twitter.com∕dougallj∕status∕1590357240443437057.ê.cc/twitter.html | |
| Github | https://github.com/dougallj |
| Cohost | https://cohost.org/dougall |
We released two tech talks today going over how to take advantage of the new architecture, features and associated developer tools.
Accelerate your machine learning workloads with the M5 and A19 GPUs
https://developer.apple.com/videos/play/tech-talks/111432/
https://www.youtube.com/watch?v=wgJX1HndGl0
Boost your graphics performance with the M5 and A19 GPUs

Xerox scanners/photocopiers randomly alter numbers in scanned documents
Please see the “condensed time line” section (the next one) for a time line of how the Xerox saga unfolded. It for example depicts that I did not push the thing to the public right away, but gave Xerox a lot of time before I did so.
https://www.youtube.com/embed/c0O6UXrOZJo
I mean you can get fancy with them, but you can write a basic implementation in <100 lines of Python that is good enough to prove various 64-bit adder circuits logically equivalent, in a fraction of a second. It feels like cheating.
https://gist.github.com/rygorous/948308f7d998e5fd4e98344687580338
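For a rough feel of what an equivalence check like this can look like, here is a minimal sketch. The excerpt does not name the technique, so using a reduced ordered binary decision diagram (BDD) here is an assumption, and the code is not taken from the linked gist. The names `BDD`, `ripple_carry`, `kogge_stone` and `adders_equivalent` are hypothetical, chosen only for illustration. Because a reduced ordered BDD is canonical for a fixed variable order, two circuits compute the same function exactly when they end up with the same node ids.

```python
class BDD:
    """Tiny reduced ordered BDD. Nodes are ints; 0 is False, 1 is True."""

    def __init__(self):
        self.nodes = [None, None]  # node id -> (var, lo, hi); ids 0 and 1 are terminals
        self.unique = {}           # (var, lo, hi) -> node id (hash-consing table)
        self.cache = {}            # memo table for ite()

    def mk(self, var, lo, hi):
        if lo == hi:               # both branches agree: the test is redundant
            return lo
        key = (var, lo, hi)
        if key not in self.unique:
            self.nodes.append(key)
            self.unique[key] = len(self.nodes) - 1
        return self.unique[key]

    def var(self, i):
        return self.mk(i, 0, 1)

    def top(self, n):
        return self.nodes[n][0] if n > 1 else float("inf")

    def cofactor(self, n, v, value):
        if n <= 1 or self.nodes[n][0] != v:
            return n               # n does not test v at its root
        return self.nodes[n][2 if value else 1]

    def ite(self, f, g, h):
        """If-then-else: (f AND g) OR (NOT f AND h)."""
        if f == 1:
            return g
        if f == 0:
            return h
        if g == h:
            return g
        if g == 1 and h == 0:
            return f
        key = (f, g, h)
        if key in self.cache:
            return self.cache[key]
        v = min(self.top(f), self.top(g), self.top(h))
        lo = self.ite(*(self.cofactor(x, v, 0) for x in (f, g, h)))
        hi = self.ite(*(self.cofactor(x, v, 1) for x in (f, g, h)))
        result = self.mk(v, lo, hi)
        self.cache[key] = result
        return result

    def NOT(self, f):
        return self.ite(f, 0, 1)

    def AND(self, f, g):
        return self.ite(f, g, 0)

    def OR(self, f, g):
        return self.ite(f, 1, g)

    def XOR(self, f, g):
        return self.ite(f, self.NOT(g), g)


def ripple_carry(bdd, a, b):
    """Sum bits (LSB first) followed by the carry-out."""
    carry, out = 0, []
    for x, y in zip(a, b):
        p = bdd.XOR(x, y)
        out.append(bdd.XOR(p, carry))
        carry = bdd.OR(bdd.AND(x, y), bdd.AND(p, carry))
    return out + [carry]


def kogge_stone(bdd, a, b):
    """Parallel-prefix adder: same outputs, very different gate structure."""
    n = len(a)
    p0 = [bdd.XOR(x, y) for x, y in zip(a, b)]
    g, p = [bdd.AND(x, y) for x, y in zip(a, b)], list(p0)
    d = 1
    while d < n:
        g_new, p_new = list(g), list(p)
        for i in range(d, n):
            g_new[i] = bdd.OR(g[i], bdd.AND(p[i], g[i - d]))
            p_new[i] = bdd.AND(p[i], p[i - d])
        g, p, d = g_new, p_new, d * 2
    carries = [0] + g  # carry into bit i is the group generate of bits 0..i-1
    return [bdd.XOR(p0[i], carries[i]) for i in range(n)] + [carries[n]]


def adders_equivalent(width=64):
    bdd = BDD()
    # Interleaved variable order (a0, b0, a1, b1, ...) keeps adder BDDs small.
    a = [bdd.var(2 * i) for i in range(width)]
    b = [bdd.var(2 * i + 1) for i in range(width)]
    return ripple_carry(bdd, a, b) == kogge_stone(bdd, a, b)
```

Calling `adders_equivalent(64)` compares all 64 sum bits and the carry-out of the two adders. The interleaved variable order matters: with all `a` bits ordered before all `b` bits, adder BDDs grow exponentially.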
Correction: two instruction NEON float prefix sum.
I guess I'm a bit out of practice, was focusing on the complex ops too much, and two seemed too good to be true.
Three instruction NEON float prefix sum. I'd wanted to abuse FCMLA (floating-point complex multiply accumulate) for non-complex arithmetic for so long, and I finally came up with something :)
With two unnecessary multiplies to save one instruction, this may only work out on Apple CPUs, but it's a bit of fun.
(For loops you can broadcast the carried value with vfmaq_laneq_f32(scan, ones, prev, 3) for three multiplies saving two instructions. LLVM fights you on that, though.)
[oops, see reply]
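As a scalar reference for what the loop version computes, here is a small Python model. It does not reproduce the FCMLA trick or the exact instruction sequence from the post. It only models the lane arithmetic: a four-lane inclusive prefix sum, followed by the `vfmaq_laneq_f32(scan, ones, prev, 3)` step, which adds lane 3 of the previous result to every lane. The helper names `prefix_sum4`, `fma_lane` and `running_prefix_sum` are hypothetical, and the input length is assumed to be a multiple of four.

```python
def prefix_sum4(v):
    """Inclusive prefix sum across the lanes of one vector."""
    out, total = [], 0.0
    for x in v:
        total += x
        out.append(total)
    return out


def fma_lane(acc, b, v, lane):
    """Scalar model of vfmaq_laneq_f32: acc[i] + b[i] * v[lane] for each lane i."""
    return [a + x * v[lane] for a, x in zip(acc, b)]


def running_prefix_sum(data):
    """Prefix sum over a whole array, four lanes at a time.

    Assumes len(data) is a multiple of 4.
    """
    ones = [1.0] * 4
    prev = [0.0] * 4  # lane 3 holds the running total carried between blocks
    out = []
    for i in range(0, len(data), 4):
        scan = prefix_sum4(data[i:i + 4])
        # Multiplying by ones broadcasts prev[3] into every lane before the add.
        prev = fma_lane(scan, ones, prev, 3)
        out.extend(prev)
    return out
```

The multiply by `ones` is arithmetically redundant. In this model it stands in for the fused multiply-add doing the broadcast and the add in a single instruction.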
I've had some fun and learned a lot from the "Anthropic's Original Performance Take-Home" optimisation challenge (https://github.com/anthropics/original_performance_takehome)
The old scoreboard (https://www.kerneloptimization.fun/) required Twitter login and got overrun by python/rng-exploit submissions, so here's a new one that requires Mastodon for auth, in case anyone has been playing:
Let's do this.

Let's learn and grow. New things are cool! Links 'n' stuff down below. Lots of links. First, the "clean version." Please pass that around. https://youtu.be/Zgxb...