Dougall

@dougall
2.9K Followers
303 Following
1,045 Posts

Low-level systems stuff. Reverse engineering, security research, bit twiddling, optimisation, SIMD, uarch. 64-bit ARM enthusiast.

he/they

Blog: https://dougallj.wordpress.com
Twitter: http://twitter.com∕dougallj∕status∕1590357240443437057.ê.cc/twitter.html
Github: https://github.com/dougallj
Cohost: https://cohost.org/dougall

We released two tech talks today going over how to take advantage of the new architecture, features and associated developer tools.

Accelerate your machine learning workloads with the M5 and A19 GPUs

https://developer.apple.com/videos/play/tech-talks/111432/

https://www.youtube.com/watch?v=wgJX1HndGl0

Boost your graphics performance with the M5 and A19 GPUs

https://developer.apple.com/videos/play/tech-talks/111431/

https://www.youtube.com/watch?v=_5yEcJfB6nk

New blog post: A Decade of Slug
This talks about the evolution of the Slug font rendering algorithm, and it includes an exciting announcement: The patent has been dedicated to the public domain.
https://terathon.com/blog/decade-slug.html
@never_released @dougall @saagar @alexr @siracusa New in Xcode 26.4b3: 👋 M5 Pro/Max.
CPUFAMILY_ARM_SOTRA (H17S) contains P-cores and M-cores, and there's now a CLUSTER_TYPE_M enum to go with TYPE_E and TYPE_P.
I guess this deserves to be posted on a regular cadence for the benefit of anyone who hasn't seen it before: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_are_switching_written_numbers_when_scanning
Xerox scanners/photocopiers randomly alter numbers in scanned documents

Please see the “condensed time line” section (the next one) for a time line of how the Xerox saga unfolded. It for example depicts that I did not push the thing to the public right away, but gave Xerox a lot of time before I did so.

D. Kriesel
Shrunk the machine code a little here https://github.com/pkhuong/tiny_batcher The x86-64 build now clocks in at 199 bytes (224 on aarch64)! I think x86-64 might benefit from using high half byte registers, but hopefully simpler tricks can get us to <= 3 cache lines.
GitHub - pkhuong/tiny_batcher: a size-optimised sorting library for C and C++

I mean you can get fancy with them, but you can write a basic implementation in <100 lines of Python that is good enough to prove various 64-bit adder circuits logically equivalent, in a fraction of a second. It feels like cheating.

https://gist.github.com/rygorous/948308f7d998e5fd4e98344687580338

BDD implementation of some very basic circuit verification proving various 64-bit adder architectures equivalent
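
To give a feel for the "<100 lines of Python" claim, here is a minimal sketch of the idea. It is not the code from the gist: the `BDD` class, the helper names, the interleaved variable order, and the choice of ripple-carry vs. Kogge-Stone as the two adders are all assumptions made for illustration. Because nodes are hash-consed, two circuits compute the same function exactly when they end up with the same node ids.

```python
class BDD:
    """Tiny reduced ordered BDD; node ids 0 and 1 are the False/True terminals."""

    def __init__(self):
        self.nodes = [None, None]  # id -> (var, lo, hi); terminals hold no triple
        self.unique = {}           # (var, lo, hi) -> id, so equal functions share an id
        self.cache = {}            # memo table for ite()

    def mk(self, var, lo, hi):
        if lo == hi:               # both branches agree: the test is redundant
            return lo
        key = (var, lo, hi)
        if key not in self.unique:
            self.unique[key] = len(self.nodes)
            self.nodes.append(key)
        return self.unique[key]

    def var(self, i):
        return self.mk(i, 0, 1)

    def _top(self, u):
        return self.nodes[u][0] if u > 1 else float("inf")

    def _cof(self, u, v, bit):
        # Cofactor of u with variable v fixed to bit.
        if u <= 1 or self.nodes[u][0] != v:
            return u
        return self.nodes[u][2 if bit else 1]

    def ite(self, f, g, h):
        # if-then-else: (f and g) or (not f and h)
        if f == 1: return g
        if f == 0: return h
        if g == h: return g
        if g == 1 and h == 0: return f
        key = (f, g, h)
        if key not in self.cache:
            v = min(self._top(f), self._top(g), self._top(h))
            lo = self.ite(self._cof(f, v, 0), self._cof(g, v, 0), self._cof(h, v, 0))
            hi = self.ite(self._cof(f, v, 1), self._cof(g, v, 1), self._cof(h, v, 1))
            self.cache[key] = self.mk(v, lo, hi)
        return self.cache[key]

    def NOT(self, f): return self.ite(f, 0, 1)
    def AND(self, f, g): return self.ite(f, g, 0)
    def OR(self, f, g): return self.ite(f, 1, g)
    def XOR(self, f, g): return self.ite(f, self.NOT(g), g)


def ripple_carry(bdd, a, b):
    """Sum bits (LSB first) plus carry-out, built one full adder at a time."""
    carry, out = 0, []
    for x, y in zip(a, b):
        p = bdd.XOR(x, y)
        out.append(bdd.XOR(p, carry))
        carry = bdd.OR(bdd.AND(x, y), bdd.AND(p, carry))
    return out + [carry]


def kogge_stone(bdd, a, b):
    """Same outputs, built with a Kogge-Stone parallel-prefix carry network."""
    n = len(a)
    p = [bdd.XOR(x, y) for x, y in zip(a, b)]
    G = [bdd.AND(x, y) for x, y in zip(a, b)]
    P = list(p)
    d = 1
    while d < n:
        # New G uses the previous level's G and P; P is updated afterwards.
        G = [bdd.OR(G[i], bdd.AND(P[i], G[i - d])) if i >= d else G[i] for i in range(n)]
        P = [bdd.AND(P[i], P[i - d]) if i >= d else P[i] for i in range(n)]
        d *= 2
    sums = [p[0]] + [bdd.XOR(p[i], G[i - 1]) for i in range(1, n)]
    return sums + [G[n - 1]]


def evaluate(bdd, u, env):
    """Follow one path through the BDD for a concrete assignment {var: 0 or 1}."""
    while u > 1:
        var, lo, hi = bdd.nodes[u]
        u = hi if env[var] else lo
    return u


def prove_equivalent(n=64):
    bdd = BDD()
    # Interleave a_i and b_i in the variable order to keep adder BDDs small.
    a = [bdd.var(2 * i) for i in range(n)]
    b = [bdd.var(2 * i + 1) for i in range(n)]
    return ripple_carry(bdd, a, b) == kogge_stone(bdd, a, b)
```

Calling `prove_equivalent(64)` compares all 65 output bits by node id; the interleaved variable order is what keeps the intermediate BDDs small enough for this to be quick.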

Correction: two instruction NEON float prefix sum.

I guess I'm a bit out of practice, was focusing on the complex ops too much, and two seemed too good to be true.

Three instruction NEON float prefix sum. I'd wanted to abuse FCMLA (floating-point complex multiply accumulate) for non-complex arithmetic for so long, and I finally came up with something :)

With two unnecessary multiplies to save one instruction, this may only work out on Apple CPUs, but it's a bit of fun.

(For loops you can broadcast the carried value with vfmaq_laneq_f32(scan, ones, prev, 3) for three multiplies saving two instructions. LLVM fights you on that, though.)

[oops, see reply]
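
For anyone unfamiliar with the loop-carried part: a rough Python model of what `vfmaq_laneq_f32(scan, ones, prev, 3)` does per 4-lane block. This only models the broadcast of the carried value. The in-register scan itself (the FCMLA part) is replaced by a plain reference scan, and `scan4` and `prefix_sum` are hypothetical names for illustration.

```python
from itertools import accumulate


def vfmaq_laneq_f32(a, b, v, lane):
    """Model of the NEON intrinsic: a[i] + b[i] * v[lane] for each of 4 lanes."""
    return [ai + bi * v[lane] for ai, bi in zip(a, b)]


def scan4(x):
    """Stand-in for the in-register inclusive prefix sum of one 4-lane vector."""
    return list(accumulate(x))


def prefix_sum(data):
    """Inclusive prefix sum over a list whose length is a multiple of 4."""
    ones = [1.0] * 4
    prev = [0.0] * 4          # lane 3 holds the running total so far
    out = []
    for i in range(0, len(data), 4):
        scan = scan4(data[i:i + 4])
        # Multiply-by-one FMA broadcasts prev[3] into every lane of this block.
        prev = vfmaq_laneq_f32(scan, ones, prev, 3)
        out.extend(prev)
    return out
```

The multiply by `ones` is arithmetically redundant; it is there so that a single fused multiply-add does both the lane broadcast and the add.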

I've had some fun and learned a lot from the "Anthropic's Original Performance Take-Home" optimisation challenge (https://github.com/anthropics/original_performance_takehome)

The old scoreboard (https://www.kerneloptimization.fun/) required Twitter login and got overrun by python/rng-exploit submissions, so here's a new one that requires Mastodon for auth, in case anyone has been playing:

https://vliw-challenge.fly.dev/

GitHub - anthropics/original_performance_takehome: Anthropic's original performance take-home, now open for you to try!

You are being misled about renewable energy technology.

Let's learn and grow. New things are cool! Links 'n' stuff down below. Lots of links. First, the "clean version." Please pass that around. https://youtu.be/Zgxb...

YouTube