The CrowdStrike IT outage is a good reminder that if you don't have a disaster recovery (DR) plan in place, there will be consequences. There will be many meetings and discussions about the need for DR, but by the end of the year, it will likely be forgotten amidst the usual job cuts, new priorities, and questions about IT budgets. This cycle will continue until another IT outage strikes. I speak the truth and nothing else. If I'm wrong, correct me below. #sysadmin #IT
As CTO of a major fintech firm ACME Corp, this is my disaster recovery plan. Oh, you expected me to hire more staff and build an actual plan? That's adorable. Perhaps you'd also like a unicorn to handle our cybersecurity? 😉
@nixCraft Amusingly, it's misspelled. Possibly indicating how much effort went into it?
@MyYeeHaa haha. sharp observation. it is good that some people still get humour
@nixCraft @MyYeeHaa looks like it was misspelled twice and corrected once. And hey, prompt engineering is hard work. /s
@nixCraft @MyYeeHaa Also, it was not opened. But the glass is full.
@MyYeeHaa @nixCraft that's what you get by picking such poor plans. Recover from drinking, not by drinking.
@nixCraft a lot of companies, not only small ones, shortcut most of the control steps. Quality and testing are used to sell but not to secure and improve the software. IMO, AI will not change that. Unfortunately, it is not only about software...
@nixCraft Oh man, I enjoy the picture, that's just too perfect 😂 I feel many organizations around the world have that recovery strategy 😂 😂
@nixCraft
Duly noted and stolen.
@nixCraft disaster recovery absolutely, but in an incident like this you also need its best friend, a Business Continuity Plan. What does the business do while you enact the DRP?
@mvyrmnd @nixCraft That's a Bizness Cruelty Plan. The workers will be laid off until security improves.

@mvyrmnd @nixCraft I agree, there needs to be a business continuity plan. Here’s my problem: this one was introduced by a famous, global app brought down by a low-level control file. This can happen at any level in the stack. It can happen with recovery sites, too.

It reveals attack channels against any organization, business, or government in general.

Given the amount of work produced by electronic systems, it is mostly impossible for large organizations to…

@mvyrmnd @nixCraft manual procedures. You can make the argument that individual departments can continue with manual procedures, but the reality is otherwise. Even with detailed instructions, I don’t see it working: trained employees don’t stay, written instructions are an artifact stuck in time as the business changes, and all the various parts still have to communicate properly. If anything, it will create a worse situation.
@mvyrmnd @nixCraft this kind of change is invisible to every organization. It was made worse by requiring people to be hands-on to correct the problem. If you use BitLocker, you need to retrieve the recovery key and then get that key back to the user to enter by hand. These are long, difficult keys, and it takes a pile of time to go through local security to get that code into the proper hands. It also means publishing this secure code via email and through people.

@mvyrmnd @nixCraft more likely, I think Microsoft will need to set up some automatic filter to test these changes before simply using their update channel to blindly deliver partner code.

The important takeaway from this has to be the need for:
1. Automatic testing of changes in a realistic environment before release
2. Quick, automatic rollback of bad code to the prior version (rough sketch below)
3. Recognition of the attack channel this revealed to every customer
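A rough sketch of what points 1 and 2 could look like in practice, assuming hypothetical push_update() and is_healthy() helpers rather than any vendor's real tooling: push the change to a small canary group first, let it soak, and automatically fall back to the last known-good version if the canaries go unhealthy.

```python
# Hypothetical sketch: push an update to a small canary group first,
# verify the machines stay healthy, and automatically fall back to the
# previous version if they don't. push_update() and is_healthy() are
# placeholders, not any vendor's real API.
import time

CANARY_HOSTS = ["canary-01", "canary-02"]

def push_update(host: str, version: str) -> None:
    """Placeholder: deliver the update package to one host."""
    print(f"pushing {version} to {host}")

def is_healthy(host: str) -> bool:
    """Placeholder: check the host after the update (boot status, heartbeat)."""
    return True

def canary_release(new_version: str, previous_version: str) -> bool:
    for host in CANARY_HOSTS:
        push_update(host, new_version)

    time.sleep(5)  # soak time before judging the update (shortened for the sketch)

    if all(is_healthy(h) for h in CANARY_HOSTS):
        return True  # safe to continue to the wider fleet

    # Automatic rollback: restore the last known-good version.
    for host in CANARY_HOSTS:
        push_update(host, previous_version)
    return False

if __name__ == "__main__":
    if not canary_release("channel-292", "channel-291"):
        print("canary failed; rollout halted and rolled back")
```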

@nixCraft Also, don't use products that tell you the damage they're going to do.
@nixCraft err…TESTED DR plan?
@bls @nixCraft “How many Oracle licenses do we need to buy for that?”

@nixCraft
You're right, and disasters are not always like CrowdStrike.

I was once at one of the better-known Fortune 500 IT companies, which had been prioritizing new features over bug fixes for many years, and as a result their reliability had deteriorated and was starting to impact their reputation and sales.

So far, nothing surprising, but the really surprising outcome was that someone managed to convince top management of the nature of the long-term mistake, and they entered a 1.5-year cycle of doing literally nothing but bug fixes -- no new features handled at *all* -- which caused much weeping and gnashing of teeth and much general complaining. But they stuck with it.

And these days they're back to having a top-tier reputation. As far as Fortune 500 goes, anyway; YMMV.

The point being that product bugs sometimes should be considered in the category of "outage" even when no single bug is severity 1.

@nixCraft considering how often internal IT stuff with my employer will be broken for months if not years, I'm pretty sure the disaster recovery plan is just "fuck it we ball"

@nixCraft

“By failing to prepare, you are preparing to fail.”

While the closest he got to electronics was running about in thunderstorms, Ben Franklin’s advice is salient. Preparation is the difference between an emergency and a disaster.

@nixCraft not just a plan, but a physical thing you can find in the world. Your plan will do you no good if it is also BSOD'd/encrypted. Also, it has to be tested. Oh, you plan to recover your 10 TB of data to a 2 TB server across town over a 10 Mbit backplane? Or recover from tapes you've overwritten twice a year for the last 10 years? Seen it, seen it, seen it.
@nixCraft The eyes of CEOs glaze over when DR is mentioned, because it won't have a short-term impact on their stock options. #IT

@nixCraft

On a much smaller scale, I'm in the process of prepping to upgrade a VPS. The script provided has a "dry-run" mode that lets you see the errors that need to be fixed before the final upgrade takes place.

Tech infrastructure really needs a method like this as well.
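For what it's worth, a minimal sketch of that dry-run pattern, with made-up pre-flight checks standing in for whatever a real upgrade script would actually verify:

```python
# Minimal sketch of a dry-run pattern: run every pre-flight check,
# report what would break, and only touch the system when --dry-run
# is absent. The individual checks are made-up examples.
import argparse
import shutil

def preflight_checks() -> list[str]:
    problems = []
    total, used, free = shutil.disk_usage("/")
    if free < 5 * 1024**3:
        problems.append("less than 5 GiB free on /")
    # ...more checks: held packages, unsupported config options, etc.
    return problems

def do_upgrade() -> None:
    print("running the real upgrade steps here")

def main() -> None:
    parser = argparse.ArgumentParser(description="VPS upgrade helper (sketch)")
    parser.add_argument("--dry-run", action="store_true",
                        help="report problems without changing anything")
    args = parser.parse_args()

    problems = preflight_checks()
    for p in problems:
        print(f"WOULD FAIL: {p}")

    if args.dry_run:
        print("dry run only; nothing was changed")
    elif problems:
        print("fix the problems above, then re-run")
    else:
        do_upgrade()

if __name__ == "__main__":
    main()
```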

@nixCraft A lot of organizations have disaster recovery plans but don't test them, have disaster recovery drills, or otherwise reinforce the need for practice and vigilance. They just write the plans and present them and go back to business as usual.
@nixCraft It's complicated because this is... not a scenario you'd generally plan for. There's a good chance you wanted to protect your disaster recovery plan and put CrowdStrike on your backup infrastructure!

@nixCraft

I guess you are right…

@nixCraft You also need a secondary service to fall back on, rather than relying on just one security service. If one service goes down and you have another, the best-case scenario is that the outage wouldn't be as impactful as this one has been.
@nixCraft Managers are reactive, not proactive. If you want them to get excited about a disaster plan you have to burn down the building across the street. This is why this was going to happen no matter what -- humans suck.
@TheKeymaker @nixCraft well, the problem is decision-makers who are at best #TechIlliterates, if not absolutely egoistic morons!
@nixCraft there will be a reckoning here. This is not just one person but a whole team. Who thought this was a good thing? It will be a bloodbath. Did they employ people on merit, those with the skills? This type of screw-up is unforgivable. But of course there's a certain tech journalist whose name has been banned on Mastodon; he says even Linux and Mac aren't immune to this. And we have had a close call with the xz hack.
@nixCraft I only know about the band Disaster Area.
What kind of music does Disaster Recovery play?

@nixCraft Except DR plans usually don't take bad software problems into account, and don't prevent this kind of incident.

A common strategy for DR is to have another site with the storage synchronized in near real time. That saves you from problems caused by a destroyed data center (fire, flood, bomb), but because it's a mirror, software problems are not mitigated at all.

A better solution is a rolling deploy, where you install software updates and patches on a few servers first, and after some time, if everything is OK, install them on more servers.
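A rough sketch of that rolling deploy, assuming stand-in install_patch() and server_ok() helpers in place of real tooling: patch a small batch, let it bake, check health, and only then continue, so a bad update stops after the first batch instead of taking the whole fleet down.

```python
# Rough sketch of a rolling deploy: patch a small batch of servers,
# wait, check that they are still fine, and only then move on to the
# next batch. install_patch() and server_ok() are stand-ins for
# whatever tooling you actually use.
import time

SERVERS = [f"app-{i:02d}" for i in range(1, 21)]
BATCH_SIZE = 3

def install_patch(server: str) -> None:
    print(f"patching {server}")

def server_ok(server: str) -> bool:
    return True  # real check: service up, error rate normal, host still boots

def rolling_deploy(servers: list[str]) -> None:
    for start in range(0, len(servers), BATCH_SIZE):
        batch = servers[start:start + BATCH_SIZE]
        for server in batch:
            install_patch(server)

        time.sleep(10)  # bake time before judging the batch

        if not all(server_ok(s) for s in batch):
            print("batch unhealthy; stopping the rollout here")
            return  # the rest of the fleet is untouched

if __name__ == "__main__":
    rolling_deploy(SERVERS)
```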

@nixCraft also, like backups that are never tested for recovery, a DR plan that never gets tested is not much better than not having any plan.

@nixCraft
I know some places whose DR plan is "it's on the cloud so it's fine because if one computer goes wrong the others can still access the cloud"

Yeah...

@nixCraft I was thinking about this. Wouldn’t the DR systems also be protected by CrowdStrike and have the same issue?
@nixCraft
Hurry, activate the DR plan. Oops, CrowdStrike is installed there too and took it down immediately after it was brought up.