One of the things that I think is sad about the decimation of Twitter eng is that Twitter was doing a lot of interesting (and high ROI) eng work that, at younger companies, is mostly outsourced at great cost.

A few examples off the top of my head:

The now-gutted HWENG group was so good at server design that, in a meeting with Intel, the Intel folks couldn't believe the power envelope Twitter achieved, and Google thought we were lying about our costs during cloud price negotiations.

Twitter was operating long before gRPC existed, so they built Finagle. See https://kostyukov.net/posts/finagle-101/ and
https://twitter.com/vkostyukov/status/1523055814298062848 for more info on that.

Twitter still gets a lot of mileage out of owning its RPC layer.

On the odd comment that Twitter is slow in some countries due to unbatched RPCs: Twitter has world-class infra for understanding why that's not the case.

Many very good engineers pitched in to realize the vision, but the vision came from one person, whose two-year tour de force of technical and organizational execution took Twitter from being yet another company that didn't know how to use tracing to one at the leading edge of what's possible. One of the more impressive things I've seen in my career.

Another odd thing about the unbatched RPC comment is that Twitter created Stitch, which allows for, among other things, easy batching where appropriate.
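
Roughly, the trick behind Stitch-style batching (sketched here in plain Java with made-up names, not Stitch's actual API, which is Scala): individual lookups return futures immediately, and a later round satisfies every pending key with a single batched backend call instead of N unbatched RPCs.

```java
import java.util.*;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class StitchSketch {
    // Hypothetical sketch of the batching idea: lookups are deferred and
    // collected, then satisfied by one batched call per round.
    static class BatchingSource<K, V> {
        private final Function<Set<K>, Map<K, V>> batchFetch; // one RPC for many keys
        private final Map<K, CompletableFuture<V>> pending = new LinkedHashMap<>();

        BatchingSource(Function<Set<K>, Map<K, V>> batchFetch) {
            this.batchFetch = batchFetch;
        }

        // Callers get a future immediately; no RPC is issued yet.
        synchronized CompletableFuture<V> get(K key) {
            return pending.computeIfAbsent(key, k -> new CompletableFuture<>());
        }

        // Run one round: fetch every pending key in a single batched call.
        synchronized void runBatch() {
            if (pending.isEmpty()) return;
            Map<K, V> results = batchFetch.apply(pending.keySet());
            pending.forEach((k, f) -> f.complete(results.get(k)));
            pending.clear();
        }
    }

    public static void main(String[] args) {
        List<Set<Long>> rpcs = new ArrayList<>();
        BatchingSource<Long, String> users = new BatchingSource<>(keys -> {
            rpcs.add(new HashSet<>(keys)); // count backend calls
            Map<Long, String> out = new HashMap<>();
            keys.forEach(k -> out.put(k, "user-" + k));
            return out;
        });
        CompletableFuture<String> a = users.get(1L);
        CompletableFuture<String> b = users.get(2L);
        users.runBatch();
        // Two logical lookups, one backend call.
        System.out.println(rpcs.size() + " " + a.join() + " " + b.join());
    }
}
```

The real thing composes these lookups monadically so batching falls out of ordinary-looking code, but the core mechanism is this defer-then-flush round.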

I believe the last remaining eng who worked on that at Twitter was fired today.

Manhattan, Twitter's in-house DB, has such low tail latency that it caused a problem for Twitter's attempts to move to the cloud and switch to some cloud DB.

The hosted cloud option people wanted had much higher latency but services were architected to rely on a lower latency DB.

The JVM team did a bunch of wizardry, e.g., https://github.com/twitter/util/commit/3245a8e1a98bd5eb308f366678528879d7140f5e plus https://github.com/oracle/graal/pull/636 were worth a 5% to 15% CPU usage/cost reduction for Scala services, and there was a ton of work like that.

Twitter built EventBus, which was much cheaper to operate than Kafka, although it was killed in the push to move to off-the-shelf software.

https://news.ycombinator.com/item?id=26643392

At one point, I took a quick look and it seemed that moving to Kafka cost ~0.5M cores and an analogous amount of RAM.

Twitter was a real leader in cache, not just for a non-"hyperscale" company, but period.

https://twitter.com/danluu/status/1324416895013986305

Segcache, one of many great intern projects, solved a longstanding open problem in caching.

Speaking of intern projects that solve longstanding open problems that have plagued the industry, there's also https://danluu.com/cgroup-throttling/, whose kernel scheduler patch fix was implemented by an intern.

There have been maybe four major public attempts at solving this issue, but none of the solutions really solved the core problem. This intern's patch just solves the problem.

When I joined Twitter, having previously worked on a major web search engine, I was astounded at how small the search team was given what they were accomplishing.

http://www.jimmylin.umiacs.io/publications/Busch_etal_ICDE2012.pdf describes the original architecture; while I was at Twitter, ingest times were improved to ~1s, which allowed search to be used all over the product while still giving users the experience they expect, e.g., to serve their profile.

Of course there's Mesos+Aurora. A reason the attempted migration to k8s was hard is that, despite Mesos being abandonware, there are quite a few things it does better than k8s, e.g., you can run a 70k-node cluster, which reduces cost significantly.

The team was also amazing at debugging. Kernel team was another debugging standout, e.g., this trivial looking patch, https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=66f8209547cc11d8e139d45cb7c937c1bbcce182, figured out during https://web.archive.org/web/20221111175649/https://blog.twitter.com/engineering/en_us/topics/open-source/2020/hunting-a-linux-kernel-bug, as well as subtle stuff like https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/

For metrics data, there's https://twitter.com/danluu/status/1266712673648926721, which unlocked hundreds of millions a year of savings, a "weekend project" amount of work because of the work data infra people had done to allow trivial ETL jobs to run over many many PB of data a day.

When I compare notes with good engineers at other companies in the same size class, they're generally impressed by the expertise that Twitter used to have, which unlocked reliability (and profit) these other companies couldn't.

E.g., friends at a lot of younger web companies will say there's literally no one at the company with the kind of debugging expertise that multiple people on the kernel team had, and similar issues take 20x longer to debug and end up with some hacky mitigation instead of really getting fixed properly.

It's not like those companies don't have the same issues. You're going to run into the issues if you use Linux. The only question is if you're prepared to notice and handle the issues.

I'm missing quite a few examples, and this list is highly biased towards infra because I worked in infra and I'm writing this off the top of my head, but there was a lot of work that really influenced the industry in other areas as well, e.g., Bootstrap, or one of the OG core FB ads engineers telling me that a reason Twitter was considered scary in the early days was [some technical ads stuff that I don't actually care about], etc.
@danluu interesting details, thanks for sharing!
@danluu thank you for sharing so much great work that was done..... Such a shame.
@danluu
Thanks for this very interesting thread
@danluu Thanks for sharing these interesting insights. You mentioned a few times that other companies of comparable scale didn't have the same level of expertise and experience. Do you happen to know what Twitter's trick used to be for attracting these good people (prior to randomly firing everyone; I get that _that_ isn't a good talent acquisition strategy ;))?
@Xjs I think a lot of it just comes from the company being old and scaling up before it was possible to use off-the-shelf software to scale, which required a lot of expertise. A lot of younger companies don't develop that kind of expertise because they think they don't need it if they use off-the-shelf components, which I don't think is really true, although you can definitely get away with it in the sense that the business can be very successful.
@danluu As someone who's used Mesos (Marathon mostly), Nomad, and k8s, I still think Mesos is the closest to getting it right.

@danluu was the blast radius of a 70k-node cluster with a single control plane a problem?

Right now #HashiCorp Nomad is tough to scale above 10k nodes. While there are some not very complicated improvements to allow higher scale, we haven’t prioritized them as we’re worried folks will regret it during incidents.

Nomad’s scheduling failing does *not* affect running work, but if the Nomad failure is correlated with another system failure then losing your control plane is huge.

@danluu From what I can see, the finagle pool-shifting trick is basically what the Scala stdlib futures (and the follow-on elaborations like cats-effect IO/ContextShift) accomplish with the implicit ExecutionContext.
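
A minimal illustration of that pool-shifting idea (plain Java `CompletableFuture` rather than Finagle or cats-effect; pool names are made up): blocking work runs on a dedicated pool, and the continuation hops back to the main pool so the main pool's threads are never blocked.

```java
import java.util.concurrent.*;

public class PoolShift {
    public static void main(String[] args) {
        // Dedicated pool for blocking/expensive work.
        ExecutorService blockingPool = Executors.newFixedThreadPool(4,
                r -> new Thread(r, "blocking"));
        // Stand-in for the main (e.g., event-loop) pool.
        ExecutorService mainPool = Executors.newSingleThreadExecutor(
                r -> new Thread(r, "main"));

        CompletableFuture<String> result = CompletableFuture
                // Slow work runs off the main pool...
                .supplyAsync(() -> Thread.currentThread().getName(), blockingPool)
                // ...and the continuation is shifted back onto it.
                .thenApplyAsync(worker ->
                        worker + " -> " + Thread.currentThread().getName(),
                        mainPool);

        System.out.println(result.join()); // prints "blocking -> main"
        blockingPool.shutdown();
        mainPool.shutdown();
    }
}
```

In Scala the same hop is expressed by which implicit `ExecutionContext` is in scope for each `map`/`flatMap`.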

Hit the cgroup throttling issue in 2018 at HPE (Scala/Akka on Mesos), but worked around it with VM sizing, partly because of a rigorous segregation of I/O from CPU tasks.

@leviramsey Yeah, companies also get around it with core pinning, and one very large company ($500B+) worked around it by mostly avoiding colocation so they wouldn't need isolation, but I think it's a bit odd that the kernel scheduler's CPU isolation design really only makes sense for batch workloads when it's so common to use Linux + cgroups for servers (and, in principle, end users might want cgroups as well).
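
To make the throttling mechanics concrete, here's some arithmetic with illustrative (made-up) numbers for the CFS bandwidth controller: a container with a 2-CPU quota per 100ms period that briefly goes wide to 16 runnable threads burns its whole quota early in the period and then stalls.

```java
public class ThrottlingMath {
    public static void main(String[] args) {
        // Hypothetical cgroup settings, not from any real service.
        long periodUs = 100_000; // cpu.cfs_period_us
        long quotaUs  = 200_000; // cpu.cfs_quota_us, i.e., 2 CPUs' worth per period
        int runnableThreads = 16;

        // With 16 threads running in parallel, quota is consumed 16x faster
        // than wall-clock time, so the container runs briefly, then is
        // throttled for the rest of the period.
        double runMs   = (quotaUs / (double) runnableThreads) / 1000.0;
        double stallMs = periodUs / 1000.0 - runMs;
        System.out.printf("runs %.1f ms, then throttled for %.1f ms of each 100 ms period%n",
                runMs, stallMs);
        // -> runs 12.5 ms, then throttled for 87.5 ms of each 100 ms period
    }
}
```

That stall is why p99 latencies blow up even when average CPU usage looks far below the quota, and why pinning or oversized VMs "fix" it.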
@danluu
Do you have more background on the tracing stuff? Was this RPC specific?
@realn2s There's a write-up of some of the work at https://danluu.com/tracing-analytics/.
The tech crowd might want to note that Dan Luu is on Mastodon:
@danluu

#DanLuu

@danluu @dredmorbius the people actually on mastodon might already have noticed.

@dplattsf FWIW, it was news to me when I posted that.

Mentions are a chief discovery mechanism on Mastodon.

(And I've been here since 2016, across several instances.)

@danluu I've been interviewing a bunch of ex-Twitter infra people recently, and I'm genuinely staggered at how good they all are, solving problems I didn't expect them to be aware of, never mind able to solve, without much bigger engineering teams. It's a reminder how good motivation and leadership matters to engineers.
@bigvalen @danluu Twitter’s existential problems were not engineering related, they were related to psychology. STEM > humanities
@paninid @bigvalen @danluu it seems as if an equally if not more plausible conclusion is STEM <<< humanities, if you insist on a conclusion of this form at all.
@danluu Whew, that's next level! Thanks for sharing! 🙌
@danluu thanks for this background, super educational for me!
@danluu Excellent thread on how well Twitter infrastructure engineers solved a series of really thorny problems.
@danluu @ceejbot oh shit, it just occurred to me that the Twemoji project is probably headless now, to say nothing of Bootstrap.
@danluu this is such a beautiful stream of posts. celebrating the great stuff is amazing, under-done. thanks for sharing.
@danluu on the upside, the cracking of the Twitter egg means a lot of talent is going to go to some good companies & start some. Can’t wait to see what birdie alumni come up with.

this is an interesting thread! because i've always been curious what all those engineers were doing for the last decade after we left

a lot of it sounds like writing or migrating new services & platforms to eke out more performance (some things never change)

@robey we should have coffee sometime
@danluu maybe the fediverse can hire some of them with the excess that we're collecting for our server/admin costs.
@opencollective
https://social.coop/@bmann/109354481926521228
Boris Mann (@[email protected])

The @[email protected] platform is a key tool for pooling funds to drive shared action and support. Doing a search for Mastodon shows you many instances using it already https://opencollective.com/search?q=mastodon My home instance @SocialCoop uses it, and also then supports other projects on the platform. They've now launched a project to collect funds to become more #fediverse friendly https://opencollective.com/engineering/projects/fediverse-friendly-open-collective

social.coop

@danluu @kashaar eng? engineering?

running a social media site?

@danluu And the #recsys research of course! "#RecService: Distributed Real-Time Graph Processing at #Twitter"
https://www.usenix.org/system/files/conference/hotcloud18/hotcloud18-paper-grewal.pdf

Everyone likes to dunk on "the algorithm" here (me included), but the /explore page is exactly that.

@danluu
What Elon Musk is doing at Twitter is at best chaos engineering, at worst an out-of-control garbage collection.
I think Twitter can recover. But Elon Musk is a fever, and fevers can kill or cripple.
I think Elon Musk is, in principle, very encouraging of long-term, high-ROI investment. The only question is whether he can afford those investments, as they have happened on other people's dime at all his companies.
@danluu were they running it on z/OS?