Oh look, the asswipes at clownstrike shat out an analysis when everyone's going to DEF CON for hacker summer camp.

Ain't it -nice- to have an activity for those of us at home away from the cosplay-as-a-Problem convention?

Let's take a look at their whining and excuses for why they fucked up everyone's day a few weeks back.

https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf

So this starts out with "AI" right up front and, boy howdy, that is -not- a promising beginning.

And then we get to this description of the sensor activity:

So what's that -mean- anyway?

So you have the sensor - that is, the specific bit of software from CS that is installed on your endpoint.

And what that sensor does is correlate "context from its local graph store" - telemetry events that it's got in a database - into "behaviors and indicators of attack"

So it matches behavioral patterns.

And then it talks about "Rapid Response Content" that gets delivered from 'the cloud' to provide behavioral definitions "without requiring sensor code changes"

..........wait what.

Why would you need your -sensor- to take a code change just to update what it looks for? This is why we have configuration files and definition files as a concept.

Anyway, this 'Rapid Response Content' is about "behavioral heuristics" which --

Oh hey! As it happens, my actual literal job is making that specific thing for a competitor.

So I can tell you that the way that -I- do this is to look at the behavior of the malware under analysis, and chart out what it does - what files it accesses, what devices it hits, what signals it gets involved with, what system calls it makes, what libraries it's linked to, etc. - and then look into the context of what a -normal- workflow would look like in related areas, and then find the differences.

And from those differences I make a nice lil YAML file that gets sent to the sensor, that it uses to make those kinds of correlations between events and "things to be worried about".
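For illustration, a rule of that shape might look like the following - a minimal, hypothetical Python sketch where every field name and event key is invented for the example; this is not my employer's format, and certainly not CS's:

```python
# Hypothetical sketch: a behavioral detection rule shipped as *data*,
# not code. The sensor loads it and correlates telemetry against it.
RULE = {
    "id": "suspicious-pipe-create",
    "match": {
        "syscall": "CreateNamedPipe",
        "path_prefix": "\\\\.\\pipe\\",
    },
    "severity": "high",
}

def rule_matches(rule: dict, event: dict) -> bool:
    """Return True if a telemetry event satisfies every match condition."""
    m = rule["match"]
    return (
        event.get("syscall") == m["syscall"]
        and event.get("path", "").startswith(m["path_prefix"])
    )

event = {"syscall": "CreateNamedPipe", "path": "\\\\.\\pipe\\evil"}
assert rule_matches(RULE, event)
```

The point being: updating what the sensor hunts for means shipping a new data file, not recompiling the sensor.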

There's, y'know, a whole-ass testing process involved before that happens, but we'll get there.

The "Rapid Response Content" is the same shape as this - it gets sent to the endpoints "via the cloud" (a set of servers) and it changes the behavior of the endpoint sensor to look for specific behaviors.

"We have a pipeline and there's a category tag"
The next para talks about their internal numbering schema. They're not using semver or anything legible to outsiders, which means they have to explain why "Channel 291" keeps being referred to - in this case, it's the tag for detections around Windows named pipes.

The root cause is that the new template type defined 21 input fields, but the sensor only ever supplied 20 - so the first rule that actually referenced that 21st field sent the sensor reading past the end of its input.

This was not caught in testing because their test environment did not reflect the conditions of real customer environments, and because the test content used a wildcard for the missing parameter instead of a test path that would actually validate it.

Left unsaid, but very the fuck pertinent, is that the rest of us who give a fuck have this nasty habit of checking that something we're going to load into a process....has the -right fucking format- for the process, and then we use this cute little concept called an "error message" to let the operator know if -something is the fuck missing-.
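To be concrete about how little it takes - here's a hypothetical load-time shape check in Python (the field count is from the incident, everything else is invented; nobody's actual sensor code):

```python
EXPECTED_FIELDS = 21  # what the sensor's template type declares

def load_definition(fields: list) -> list:
    """Refuse to load content whose shape doesn't match what we expect."""
    if len(fields) != EXPECTED_FIELDS:
        # The cute little concept called an "error message":
        raise ValueError(
            f"definition has {len(fields)} fields, expected {EXPECTED_FIELDS}; "
            "refusing to load"
        )
    return fields

# A 20-field file gets rejected with an error instead of being handed
# to code that will happily read past the end of it.
try:
    load_definition(["x"] * 20)
except ValueError as err:
    print(err)
```

That's the whole trick. Validate at load, fail loudly, and the machine keeps booting.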

Innovative, I know. Top-right quadrant thinking.

Oh, whoops, my mistake:

They -allowed- wildcards in the 21st field initially, and then -disallowed- them but didn't test -that- change.

Nice touch with putting the dates in; y'all are at least compatible with the disaster podcast convention that shit starts getting serious when there's a timestamp, so, credit where it's due.

"And then we pushed an update that triggered the consequences of our prior fuckup in failing to bounds check, failing to lint configurations, failing to understand that a config file could be corrupted or wrong and providing an error handling mechanism, and failing to actually test our shit"
Who -wrote- this shit? I've seen -ciphertext- that's clearer than this shit
So, in summary, the shit I said above and they pinky-swear it can't happen again.

I'm....I'm gonna have to sit with this one for a moment.

Because what it says about their development processes is -fucking fascinating-

So what they're saying here is that the -sensor binary-, at the -time of compilation-, did not validate that the definitions file had the correct number of fields.

But

............you don't -do- that at compile time.

Before the compile, as part of your overall process for adding code - making sure that everything this code connects to has been adjusted - yeah, that's...that's how software review works.

On execution, when you're -loading- the definitions file, having it check that it's got the number that it was expecting, yes, I was screaming at that up above.

But neither of those happens at compile time. Why are they bringing up compile time?

Also, this is not one finding. This is -multiple- findings:

1. the actual lack of validation
2. the lack of effective review process exposed by this, where an invalid state was not caught during the development of the new type
3. the lack of effective testing - a test suite that never included, e.g., invalid configuration files -to exercise such a mechanism in normal operative contexts-

Three is more than one, guys.

WHY IS THIS A MITIGATION

YOU ARE NOT DOING IR RIGHT NOW. THIS IS A POST-INCIDENT REVIEW.

MITIGATIONS ARE DURING THE INCIDENT. POST-INCIDENT FINDINGS GET FUCKING

R E M E D I A T I O N S

YOU ARE USING THE WRONG WORDS FOR WHAT YOU ARE DOING

blah blah they made a patch so it bothers to lint its inputs hooray this does nothing to address the process problems that led to this fuckery you utter dipshits but at least someone told you lint exists moving on
yeah you know there are languages where this problem just doesn't happen?

The phrase "input pointer array" appears in the next para, which means "we are doing silly shit with C++ because we're leet yo"

Languages that don't make you do your own fucking pointer math exist for a fucking reason.

Their 'mitigation' here is to bother to check that they're still in allocated memory, something which is only a problem by their choice.
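For contrast, here's what that same out-of-bounds read looks like in a memory-safe language - a tiny illustrative sketch, with the 20-vs-21 numbers from the incident and everything else made up:

```python
# In a memory-safe language, reading past the end of the inputs is an
# immediate, catchable error - not a silent read of whatever happens
# to sit in adjacent memory.
inputs = ["a"] * 20  # the sensor supplied 20 values

handled = False
try:
    inputs[20]  # a rule asks for the 21st field (index 20)
except IndexError:
    handled = True  # the runtime did the bounds check for us

assert handled
```

No pointer math, no "check that we're still in allocated memory" mitigation needed - the problem class just doesn't exist.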

Oh boy, -test coverage-

So they talk about how their test cases weren't broad enough in the next para, and they promise swearsie-realsie that they'll put in test scenarios that "better reflect production usage"

Buuuut I don't see one -really fucking obvious standout test case- that, given the context above, really the fuck ought to be separated out:

They say nothing about whether they're gonna test the -failure- of the sensor.

If you ain't testing with invalid inputs and other abuses to bound the behavior of your binary, then you're not testing its full envelope of behavior and you cannot assert anything meaningful about its suitability for production.

Car manufacturers do crash tests to make sure you don't fucking impale your face on the steering column; this is the exact same fucking principle.
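What that looks like in practice is negative testing - deliberately feeding the loader garbage and asserting that it fails *cleanly*. A hypothetical sketch (the loader and its 21-field rule are invented stand-ins):

```python
import unittest

def load_definition(fields):
    """Stand-in loader: rejects content with the wrong field count."""
    if len(fields) != 21:
        raise ValueError("bad field count")
    return fields

class InvalidInputTests(unittest.TestCase):
    """Crash tests: prove the loader refuses malformed content."""

    def test_too_few_fields_is_rejected(self):
        with self.assertRaises(ValueError):
            load_definition(["x"] * 20)

    def test_empty_content_is_rejected(self):
        with self.assertRaises(ValueError):
            load_definition([])

    def test_valid_content_loads(self):
        self.assertEqual(len(load_definition(["x"] * 21)), 21)
```

If tests like these don't exist, you have no idea what your binary does when - not if - bad content reaches it.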

There's a -lot- of fascinating subtlety and discussion to be had around testing generally,

but this is kindergarten level horseshit. Maybe when they stop eating the crayons we can talk about the more interesting bits.

"a"?

So there's a logic error here, alright, but it sure the fuck ain't limited to their agent's parsing - and anyway, this is just repeating items 1 and 2 from a different level of abstraction.

This is turd-polishing.

More to the point:

Why the everliving fuck are you hard-coding a specific number of channels into your fucking agent,

when 'channels' are a tagging convention and have no pertinence to the detection logic,

and you could just -fucking allocate the resources to hold the content based on the configuration itself-

You -utter- -assholes-

You are -creating a problem for yourself- and then -doubling down on doing it wrong-
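Sizing the thing from the content is not hard. A hypothetical sketch - the config layout and channel numbers are invented for illustration:

```python
# Size the channel table from the content itself instead of baking a
# magic number into the agent binary.
def build_channel_table(config: dict) -> dict:
    """Allocate one slot per channel actually present in the config."""
    return {entry["channel"]: entry for entry in config["channels"]}

config = {
    "channels": [
        {"channel": 291, "kind": "named-pipe"},
        {"channel": 300, "kind": "registry"},
    ]
}

table = build_channel_table(config)
assert 291 in table and len(table) == 2
```

Add a channel to the content and the agent just... holds it. No recompile, no hard-coded count to get wrong.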

Anyway seeing as this "finding" is a dupe of 1 and 2 combined, the 'mitigations' are the same horseshit; this is clearly here to pad out the numbers and has no actual merit.
I wonder if they had an "ai" write it and then made an intern take out the Nigerianisms.
this is a dupe of 3.

.......

Mister Holmes, sir, we have a -mystery- on our hands!

Why, just this morning the lad Simpkins came into Scotland Yard with the most astonishing tale and -

Mister Holmes, the mudlarks are in an absolute uproar, you must have heard from the Irregulars -

London's entire sewer system has been -scoured utterly bare-

There is -no shit-, Sherlock!

yeah they don't even try to dress this one up

Problem is, they completely fail to talk at all about staged deployment for -any other part of the product- so uh.

Also, as one of their mitigations, they're deigning to let customers choose whether to accept the new content.

You know, the -base expectation- from -literally everyone else-

But only about this 'channel' content. Not anything with the actual definitions or the agent binary itself; none of that is mentioned at all.

Completely the fuck missing.
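For reference, staged deployment isn't exotic. A minimal sketch of ring-based rollout - the hash-bucketing scheme here is one common approach, invented details throughout:

```python
import hashlib

def in_ring(host_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a host into 0-99; ship new content
    only to hosts whose bucket falls inside the current ring."""
    bucket = int(hashlib.sha256(host_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# At 10%, roughly one host in ten gets the update first; the rest
# wait until telemetry from the early ring looks healthy.
early = [h for h in (f"host-{i}" for i in range(1000)) if in_ring(h, 10)]
assert 0 < len(early) < 1000
```

A bad push at 10% is a bad morning. A bad push at 100%, simultaneously, worldwide - well, we all saw what that is.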

So, see, what this -looks- like they're saying is that they've got third parties in to review the code and process.

But those are two separate clauses.

They have two third-parties in to review the -sensor code-

-and-

They are conducting a review of process.

But they are not actually -saying- that the third parties are involved in the process review at all - only the code review.

Perhaps someone ought to ask them to clear that the fuck up.

It's that sticky "we" there, y'see?

"We" -could- be implied to mean the set of crowdstrike, vendor 1, and vendor 2.

But "we" can also refer to crowdstrike the company, or to the personnel of that company.

"We" is one of those words that has -very- tricky scope to it, and can be used to lie to you right to your face.

This whole technical details section is exec-pandering crap.

-this- little fucker is funny tho, 'cuz it implies that if you have an input that cannot be parsed with regular expressions, clownstrike can't handle it.
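For the uninitiated, here's why that's funny - regular expressions can't validate arbitrarily nested structure; that's a textbook limit of the formalism. A toy demonstration:

```python
import re

# A regex only handles the nesting depths you hard-code into it.
FLAT = re.compile(r"^\((\w+)\)$")

assert FLAT.match("(x)")
assert not FLAT.match("((x))")  # one level deeper and the regex gives up

# Validating arbitrary nesting needs a real parser that tracks depth.
def balanced(s: str) -> bool:
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

assert balanced("((x))") and not balanced("((x)")
```

So "we validate with regexes" is an admission about the ceiling of what they can validate.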

The next part appears to be an extract from some guy at MS's blog about this shit -

https://www.microsoft.com/en-us/security/blog/2024/07/27/windows-security-best-practices-for-integrating-and-managing-security-tools/

whiiiich pads out the last half of the document and since it isn't clownstrike's work, but just shit they lifted from someone else's blog, doesn't matter


So yeah, only the first six pages have any content on them; two of the findings are duplicates and are just there to pad for length; they -missed- a bunch of other findings; they are committed to a known-broken sensor operations regime and have no clear plans to fix the underlying architectural issues exposed -by- this; and they don't have anyone left in the place who can fucking write worth a damn.

Complete fucking clownshoes. If I were their customer I would be calling for their literal, heart-ripped-from-chest blood for this.

Also:

Who the everliving fuck -audited- this pile of shit?

Who signed off that -this- was suitable for deployment to federal computers?

Who the fuck did their audit and why the fuck did they not catch -any- of this?

That "compile time" thing earlier is still bugging the shit out of me, especially because the para following doesn't talk about compilation at all.

........y'all.

I -really- suspect that this document was an LLM summary of some collation of internal documents that got some light editing.

Y'all, they didn't even respect us the fuck enough to have an actual human write this out for us.
Not to mention half the fucking document is -someone else's work-

......who has a very uncomfortable looking smile on his blog profile, damn. That must hurt to make that expression.

......there's nothing on here about licensing; anyone know if MS lets you just, like, steal half a blogpost to crap into your document to fill out space?

.....yeah, looking back, that fixation on defining why "channel 291" was pertinent?

A person wouldn't keep referring to it by number; they'd use an anaphoric phrase - "the updated channel" or similar.

And then with that mishmash in that other para with the whole 20 vs 21 thing.....

Yeah, that's not how a person would write that at all. Certainly not someone who actually understood what was going on.

@munin I have audited things knowing full well that they’re just gonna accept-risk as-designed bla-bla-bla all of it and then tell people they had it audited

@0xabad1dea

.......starting to have this creeping feeling that the standards I've held myself to in this area are way the fuck higher than ......standard.

@munin we have two different kinds of customers, those who genuinely want to shake the bugs out of their products before they ship and those who want to legally say “it was audited”. I really hate the latter but I don’t have much control over it
@0xabad1dea @munin yup. that pattern hasn't changed in the past decade and I don't expect it to change in the next decade either. on the one hand I'm glad compliance driven security makes them at least do the bare minimum with some sort of SLA on critical findings, but on the other hand it's deeply frustrating that the bar is so low.
@0xabad1dea @munin even within the same company I have the two same types of internal customers. Some are noncollaborative and will delay sending you material, source code/cleartext firmware or credentials to make your work more difficult.