Here you go.
@mvyrmnd @nixCraft I agree, there needs to be a business continuity plan. Here’s my problem: this incident was a famous, global app brought down by a low-level control file. That can happen at any level in the stack. It can happen with recovery sites, too.
It also reveals attack channels against any organization, business, or government.
Given the amount of work produced by electronic systems, it is mostly impossible for large organizations to…
@mvyrmnd @nixCraft More likely, I think Microsoft will need to set up some automatic filter to test these changes, rather than simply using their update channel to blindly deliver partner code.
The important take-away from this has to be the need for:
1. Automatic testing of changes in a representative environment
2. Quick, automatic rollback of bad code to the prior version (a rough sketch of 1 and 2 follows below)
3. Recognition that an attack channel has now been revealed to every customer
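Purely as an illustration of points 1 and 2 (not how Microsoft or any vendor actually does this), here's a minimal sketch of the canary-test-then-roll-back idea. The host names, version strings, and helper functions are all made up:

```python
# Hypothetical sketch: canary-test a vendor update, then roll back automatically
# if health checks fail. Hosts, versions, and helpers are illustrative only.

import sys
import time

CANARY_HOSTS = ["canary-01", "canary-02"]   # small, representative test fleet
PREVIOUS_VERSION = "agent-7.15"             # known-good version to fall back to
NEW_VERSION = "agent-7.16"

def deploy(host: str, version: str) -> None:
    """Placeholder: push the given agent version to one host."""
    print(f"deploying {version} to {host}")

def healthy(host: str) -> bool:
    """Placeholder: does the host still boot and report in after the update?"""
    return True

def main() -> None:
    # 1. Automatically test the change in a representative environment first.
    for host in CANARY_HOSTS:
        deploy(host, NEW_VERSION)

    time.sleep(5)  # stand-in for a real soak period before judging the canaries

    # 2. Quickly and automatically roll back to the prior version on failure.
    if not all(healthy(h) for h in CANARY_HOSTS):
        for host in CANARY_HOSTS:
            deploy(host, PREVIOUS_VERSION)
        sys.exit("Canary failed: rolled back, update NOT released to the wider fleet.")

    print("Canaries healthy: safe to continue the rollout.")

if __name__ == "__main__":
    main()
```

The whole point is that the update never reaches the broader channel until the canary fleet has survived it, and a failure puts the old version back without a human in the loop.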
@nixCraft
You're right, and disasters are not always like CrowdStrike.
I was once at one of the better-known Fortune 500 IT companies, which had been prioritizing new features over bug fixes for many years, and as a result their reliability had deteriorated and was starting to impact their reputation and sales.
So far, nothing surprising, but the really surprising outcome was that someone managed to convince top management of the nature of the long-term mistake, and they entered a 1.5-year cycle of doing literally nothing but bug fixes -- no new features handled at *all* -- which caused much weeping and gnashing of teeth and much general complaining. But they stuck with it.
And these days they're back to having a top-tier reputation. As Fortune 500 companies go, anyway; YMMV.
The point being that product bugs sometimes should be considered in the category of "outage" even when no single bug is severity 1.
“By failing to prepare, you are preparing to fail.”
While the closest he got to electronics was running about in thunderstorms, Ben Franklin’s advice is salient. Preparation is the difference between an emergency and a disaster.
On a much smaller scale, I'm in the process of prepping to upgrade a VPS. The script provided has a "dry-run" mode that lets you see the errors that need to be fixed before the final upgrade takes place.
Tech infrastructure really needs a method like this as well.
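I don't know what that particular script looks like inside, but the general dry-run pattern is roughly: work out everything the upgrade would do, report the problems, and only touch the system when the dry-run flag is off. A hypothetical Python sketch (the flag name and helper functions are illustrative, not the real VPS tooling):

```python
# Hypothetical sketch of a "dry-run" upgrade pattern: gather the planned changes,
# report problems, and only apply anything when the user has not asked for a dry run.

import argparse

def plan_upgrade() -> list[str]:
    """Placeholder: work out what the upgrade would change."""
    return ["update package index", "upgrade kernel", "migrate config /etc/app.conf"]

def check_for_problems() -> list[str]:
    """Placeholder: find anything that would make the real upgrade fail."""
    return []  # e.g. ["held-back package: libfoo", "not enough disk space on /boot"]

def apply(step: str) -> None:
    """Placeholder: actually perform one upgrade step."""
    print(f"applying: {step}")

def main() -> None:
    parser = argparse.ArgumentParser(description="Upgrade with an optional dry run.")
    parser.add_argument("--dry-run", action="store_true",
                        help="show what would happen without changing anything")
    args = parser.parse_args()

    problems = check_for_problems()
    for problem in problems:
        print(f"ERROR (fix before upgrading): {problem}")

    # Refuse to do a real upgrade while known problems remain.
    if problems and not args.dry_run:
        raise SystemExit("Refusing to upgrade until the errors above are fixed.")

    for step in plan_upgrade():
        if args.dry_run:
            print(f"would do: {step}")
        else:
            apply(step)

if __name__ == "__main__":
    main()
```

The point is you get the full list of errors up front, instead of discovering them halfway through a half-upgraded system.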
I guess you are right…
@nixCraft Except DR plans usually don't take bad software into account, and don't prevent this kind of incident.
A common DR strategy is to have another site with the storage synchronized in near real time. That saves you from problems caused by a destroyed data center (fire, flood, bomb), but being a mirror means that software problems are not mitigated at all.
A better solution is a rolling deploy, where you install the software update & patch on a few servers, and after some time, if everything is OK, install it on more servers.
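As a rough illustration (batch size, soak time, and the deploy/health-check helpers are all invented for the example), a rolling deploy over a fleet might look like:

```python
# Hypothetical sketch of a rolling deploy: patch a few servers, wait, verify,
# then move on to the next batch. Numbers and helpers are illustrative only.

import time

SERVERS = [f"web-{i:02d}" for i in range(1, 21)]   # 20 servers, deployed in batches
BATCH_SIZE = 2                                     # start with just a couple
SOAK_SECONDS = 5   # stand-in; a real rollout would wait much longer between batches

def install_update(server: str) -> None:
    """Placeholder: install the update/patch on one server."""
    print(f"updating {server}")

def is_healthy(server: str) -> bool:
    """Placeholder: check the server is still serving correctly."""
    return True

def rolling_deploy() -> None:
    for start in range(0, len(SERVERS), BATCH_SIZE):
        batch = SERVERS[start:start + BATCH_SIZE]
        for server in batch:
            install_update(server)

        time.sleep(SOAK_SECONDS)  # give problems time to show up

        if not all(is_healthy(s) for s in batch):
            raise SystemExit(f"Batch {batch} unhealthy: stopping the rollout here.")
        print(f"Batch {batch} OK, continuing.")

if __name__ == "__main__":
    rolling_deploy()
```

The key part is the stop condition: a bad patch takes out one small batch, not the whole fleet at once.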
@nixCraft
I know some places whose DR plan is "it's on the cloud so it's fine because if one computer goes wrong the others can still access the cloud"
Yeah...