Thank you, CrowdStrike, for helping to illustrate that Open Source is not the problem.
@bagder very challenging situation ... distributed system design is hard.
@jimfuller @bagder is "test patches in a lab before rolling them out to millions of systems worldwide" really a distributed systems problem?
@privateger @bagder I believe so ... distributed systems design is not just about how the systems operate but also about how rolling upgrades are applied to them ... in this case I would have expected containment/mitigation of a bad patch
@jimfuller @bagder It seems they've tried to do that actually. It stopped rolling out a few minutes after problems were reported, but it was far too late at that point.

@privateger @bagder @jimfuller

If that's true it wasn't a "rollout", or at least not a controlled one. A controlled rollout would mean pausing updates after each small, measured increment and checking that things were still going well before proceeding with the next chunk (increment size doesn't need to be constant, and often isn't, but does need to start small).

If you're combating an active 0-day attack you might be justified going full-throttle right off the bat, but do so knowing you're rolling the dice.
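
A minimal sketch of the kind of controlled rollout described above, in Python. Everything in it is an assumption made for illustration: the function name `staged_rollout`, the `apply_update` and `is_healthy` callbacks, and the stage sizes (start with 1 host, grow 10x per stage). It is not how any vendor actually ships updates, only one way to express "start small, measure, then proceed".

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    hosts: Sequence[str],
    apply_update: Callable[[str], None],   # hypothetical: pushes the update to one host
    is_healthy: Callable[[str], bool],     # hypothetical: post-update health check
    first_stage: int = 1,                  # assumed size of the first (canary) stage
    growth: int = 10,                      # assumed growth factor between stages
    max_failure_rate: float = 0.0,         # tolerated fraction of unhealthy hosts per stage
) -> Tuple[List[str], bool]:
    """Update hosts in growing increments and halt when a stage looks unhealthy.

    Returns (hosts_that_received_the_update, rollout_completed).
    """
    updated: List[str] = []
    stage_size = first_stage
    position = 0
    while position < len(hosts):
        stage = hosts[position:position + stage_size]
        for host in stage:
            apply_update(host)
        updated.extend(stage)

        # Measure before proceeding: count hosts in this stage that fail the check.
        failures = sum(1 for host in stage if not is_healthy(host))
        if failures / len(stage) > max_failure_rate:
            # Halt here; the remaining hosts are never touched.
            return updated, False

        position += len(stage)
        stage_size *= growth  # increments need not be constant, but they start small
    return updated, True
```

With a patch that breaks every host it touches, this sketch stops after the first stage, so the damage is limited to the canary host instead of the whole fleet. A real system would also need a wait period between applying a stage and checking its health, which is omitted here.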

@dveditz @privateger @bagder sounds like it was a virus definition update that triggered some regression. Of course, no matter how rarely that might cause a problem, it does not explain not testing the batch against a 'canary set' of hosts and doing a progressive rolling upgrade ... guessing the problem here is that they (CS/M$) think a certain part of their system is totally safe