Oh look, the asswipes at clownstrike shat out an analysis when everyone's going to DEF CON for hacker summer camp.

Ain't it -nice- to have an activity for those of us at home away from the cosplay-as-a-Problem convention?

Let's take a look at their whining and excuses for why they fucked up everyone's day a few weeks back.

https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf

So this starts out with "AI" right up front and, boy howdy, that is -not- a promising beginning.

And then we get to this description of the sensor activity:

So what's that -mean- anyway?

So you have the sensor - that is, the specific bit of software from CS that is installed on your endpoint.

And what that sensor does is correlate "context from its local graph store" - telemetry events that it's got in a database - into "behaviors and indicators of attack"

So it matches behavioral patterns.

And then it talks about "Rapid Response Content" that gets delivered from 'the cloud' to provide behavioral definitions "without requiring sensor code changes"

..........wait what.

Why would you need your -sensor- to have a code change to update what it looks for. This is why we have configuration files and definition files as a concept.

Anyway, this 'Rapid Response Content' is about "behavioral heuristics" which --

Oh hey! As it happens, my actual literal job is making that specific thing for a competitor.

So I can tell you that the way that -I- do this is to look at the behavior of the malware under analysis, and chart out what it does - what files it accesses, what devices it hits, what signals it gets involved with, what system calls it makes, what libraries it's linked to, etc. - and then look into the context of what a -normal- workflow would look like in related areas, and then find the differences.

And from those differences I make a nice lil YAML file that gets sent to the sensor, that it uses to make those kinds of correlations between events and "things to be worried about".
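To make that concrete, here's a completely made-up rule in that general shape. Every name, field, and value here is invented for illustration; it's not my employer's schema, and it's sure as hell not CrowdStrike's:

```yaml
# Hypothetical behavioral rule; schema and names are invented for illustration.
rule: suspicious_pipe_enum
description: process enumerates named pipes it has no business touching
match:
  all:
    - syscall: NtOpenFile
      path_glob: '\\Device\\NamedPipe\\*'
    - not:
        process_name: [services.exe, lsass.exe]
severity: medium
action: flag
```

The point is the shape: declarative data describing behavior, loaded by the sensor at runtime, no code change required.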

There's, y'know, a whole-ass testing process involved before that happens, but we'll get there.

The "Rapid Response Content" is the same shape as this - it gets sent to the endpoints "via the cloud" (a set of servers) and it changes the behavior of the endpoint sensor to look for specific behaviors.

"We have a pipeline and there's a category tag"
The next para talks about their internal numbering schema. They're not using semver or anything legible to outsiders, which means they have to explain why "Channel 291" is being referred to - in this case, it's the tag for detections covering Windows named pipes and equivalent IPC mechanisms.

The root cause is that they made it expect 21 arguments, but only gave it 20.

This was not caught in testing because their test environment did not represent the same conditions as their anticipated customer environment, and they put in a wildcard for the missing parameter instead of having a testing path that would validate that parameter.

Left unsaid, but very the fuck pertinent, is that the rest of us who give a fuck have this nasty habit of checking that something we're going to load into a process....has the -right fucking format- for the process, and then we use this cute little concept called an "error message" to let the operator know if -something is the fuck missing-.

Innovative, I know. Top-right quadrant thinking.
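Just so we're all on the same page about how basic this is, here's a hedged sketch of "check the format, then use an error message". Hypothetical code, nothing to do with their actual sensor; the field count and delimiter are invented for illustration:

```cpp
// Hypothetical sketch: validate a definitions record *before* handing it
// to the detection engine, and say something useful when it's wrong.
// The expected count and the comma delimiter are made up for illustration.
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

constexpr std::size_t kExpectedFields = 21;  // what the engine was built for

std::vector<std::string> parse_record(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, ',')) fields.push_back(field);
    if (fields.size() != kExpectedFields) {
        // The "error message" concept in action: refuse to load, explain why.
        throw std::runtime_error(
            "definitions record has " + std::to_string(fields.size()) +
            " fields, expected " + std::to_string(kExpectedFields) +
            "; refusing to load");
    }
    return fields;
}
```

Reject the file, log the complaint, keep running on the last known-good content. That's the whole trick.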

Oh, whoops, my mistake:

They -allowed- wildcards in the 21st field initially, and then -disallowed- them but didn't test -that- change.

Nice touch with putting the dates in; y'all are at least compatible with the disaster podcast convention that shit starts getting serious when there's a timestamp, so, credit where it's due.

"And then we pushed an update that triggered the consequences of our prior fuckup in failing to bounds check, failing to lint configurations, failing to understand that a config file could be corrupted or wrong and providing an error handling mechanism, and failing to actually test our shit"

Who -wrote- this shit? I've seen -ciphertext- that's clearer than this shit

So, in summary, the shit I said above and they pinky-swear it can't happen again.

I'm....I'm gonna have to sit with this one for a moment.

Because what it says about their development processes is -fucking fascinating-

So what they're saying here is that the -sensor binary-, at the -time of compilation-, did not validate that the definitions file had the correct number of fields.

But

............you don't -do- that at compile time.

Before the compile as part of your overall process for adding code, to make sure that everything that this code connects to has been adjusted, yeah, that's...that's how software review works.

On execution, when you're -loading- the definitions file, having it check that it's got the number that it was expecting, yes, I was screaming at that up above.

But neither of those are at compile time. Why are they bringing up compile time.

Also, this is not one finding. This is -multiple- findings:

1. the actual lack of validation
2. the lack of effective review process exposed by this, where an invalid state was not caught during the development of the new type
3. the lack of effective testing that did not include, e.g., invalid configuration files -to test such a mechanism in normal operative contexts-

Three is more than one, guys.

WHY IS THIS A MITIGATION

YOU ARE NOT DOING IR RIGHT NOW. THIS IS A POST-INCIDENT REVIEW.

MITIGATIONS ARE DURING THE INCIDENT. POST-INCIDENT FINDINGS GET FUCKING

R E M E D I A T I O N S

YOU ARE USING THE WRONG WORDS FOR WHAT YOU ARE DOING

blah blah they made a patch so it bothers to lint its inputs hooray this does nothing to address the process problems that led to this fuckery you utter dipshits but at least someone told you lint exists moving on

yeah you know there are languages where this problem just doesn't happen?

The phrase "input pointer array" appears in the next para, which means "we are doing silly shit with C++ because we're leet yo"

Languages that don't make you do your own fucking pointer math exist for a fucking reason.

Their 'mitigation' here is to bother to check that they're still in allocated memory, something which is only a problem by their choice.
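For the curious, the difference between their choice and the boring safe one is literally one method call. Hypothetical sketch, not their code; the sizes just mirror the 20-vs-21 mismatch from the RCA:

```cpp
// Simulated "input pointer array" access. Sizes are illustrative only,
// mirroring the 20-provided-vs-21-expected mismatch described in the RCA.
#include <stdexcept>
#include <vector>

int read_param_unchecked(const std::vector<int>& params, std::size_t i) {
    return params[i];    // i == 20 on a 20-element vector: undefined behavior
}

int read_param_checked(const std::vector<int>& params, std::size_t i) {
    return params.at(i); // i == 20: throws std::out_of_range, recoverable
}
```

One of these gives you an exception you can catch and log; the other gives you a kernel-adjacent crash and a very bad Friday.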

Oh boy, -test coverage-

So they talk about how their test cases weren't broad enough in the next para, and they promise swearsie-realsie that they'll put in test scenarios that "better reflect production usage"

Buuuut I don't see one -really fucking obvious standout test case- that, given the context above, really the fuck ought to be separated out:

They say nothing about whether they're gonna test the -failure- of the sensor.

If you ain't testing with invalid inputs and other abuses to bound the behavior of your binary, then you're not testing its full envelope of behavior and you cannot assert anything meaningful about its suitability for production.

Car manufacturers do crash tests to make sure you don't fucking impale your face on the steering column; this is the exact same fucking principle.
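The crash-test equivalent for a config loader looks something like this. `load_config` here is a stand-in I made up, not their parser; the interesting tests are the ones that feed it garbage and assert it fails loudly and safely:

```cpp
// Hedged sketch of "crash testing" a config loader: deliberately feed it
// invalid input and assert it rejects the input instead of crashing.
#include <optional>
#include <sstream>
#include <string>
#include <vector>

std::optional<std::vector<std::string>> load_config(const std::string& line,
                                                    std::size_t expected) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string f;
    while (std::getline(ss, f, ',')) fields.push_back(f);
    if (fields.size() != expected) return std::nullopt;  // reject, don't crash
    return fields;
}

// Negative tests: every one of these inputs is invalid on purpose.
bool rejects_garbage() {
    return !load_config("", 3) &&         // empty file
           !load_config("a,b", 3) &&      // too few fields
           !load_config("a,b,c,d", 3);    // too many fields
}
```

If your test suite never contains an input your parser is supposed to hate, you have no idea what your parser does when it meets one.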

There's a -lot- of fascinating subtlety and discussion to be had around testing generally,

but this is kindergarten level horseshit. Maybe when they stop eating the crayons we can talk about the more interesting bits.

"a"?

So there's a logic error here alright but it sure the fuck ain't with their agent's parsing, which....this is repeating items 1 and 2, but from a different level of abstraction.

This is turd-polishing.

More to the point:

Why the everliving fuck are you hard-coding a specific number of channels into your fucking agent,

when 'channels' are a tagging convention and have no pertinence to the detection logic,

and you could just -fucking allocate the resources to hold the content based on the configuration itself-
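For the record, "allocate based on the content itself" is not exotic. A hypothetical sketch, names invented, nothing to do with their binary:

```cpp
// Sketch: size the parameter storage from the content itself instead of
// a constant baked into the binary at compile time. Names are made up.
#include <sstream>
#include <string>
#include <vector>

std::vector<std::string> load_params(const std::string& record) {
    std::vector<std::string> params;  // grows to whatever the content carries
    std::stringstream ss(record);
    std::string p;
    while (std::getline(ss, p, ',')) params.push_back(p);
    return params;  // downstream logic asks params.size(), never a constant
}
```

Downstream code iterates over what it was actually given, and the "binary expects N, content has M" desync class of bug simply stops existing.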

You -utter- -assholes-

You are -creating a problem for yourself- and then -doubling down on doing it wrong-

Anyway seeing as this "finding" is a dupe of 1 and 2 combined, the 'mitigations' are the same horseshit; this is clearly here to pad out the numbers and has no actual merit.

I wonder if they had an "ai" write it and then made an intern take out the Nigerianisms.

this is a dupe of 3.

.......

Mister Holmes, sir, we have a -mystery- on our hands!

Why, just this morning the lad Simpkins came into Scotland Yard with the most astonishing tale and -

Mister Holmes, the mudlarks are in an absolute uproar, you must have heard from the Irregulars -

London's entire sewer system has been -scoured utterly bare-

There is -no shit-, Sherlock!

yeah they don't even try to dress this one up

Problem is, they completely fail to talk at all about staged deployment for -any other part of the product- so uh.

Also, as one of their mitigations, they're deigning to let customers choose whether to accept the new content.

You know, the -base expectation- from -literally everyone else-

But only about this 'channel' content. Not anything with the actual definitions or the agent binary itself; none of that is mentioned at all.

Completely the fuck missing.

So, see, what this -looks- like they're saying is that they've got third parties in to review the code and process.

But those are two separate clauses.

They have two third-parties in to review the -sensor code-

-and-

They are conducting a review of process.

But they are not actually -saying- that the third parties are involved in the process review at all - only the code review.

Perhaps someone ought to ask them to clear that the fuck up.

It's that sticky "we" there, y'see?

"We" -could- be implied to mean the set of crowdstrike, vendor 1, and vendor 2.

But "we" can also refer to crowdstrike the company, or to the personnel of that company.

"We" is one of those words that has -very- tricky scope to it, and can be used to lie to you right to your face.

This whole technical details section is exec-pandering crap.

-this- little fucker is funny tho, 'cuz it implies that if you have an input that cannot be parsed with regular expressions, clownstrike can't handle it.

The next part appears to be an extract from some guy at MS's blog about this shit -

https://www.microsoft.com/en-us/security/blog/2024/07/27/windows-security-best-practices-for-integrating-and-managing-security-tools/

whiiiich pads out the last half of the document and since it isn't clownstrike's work, but just shit they lifted from someone else's blog, doesn't matter

@munin I mean, just to be clear, C++ hasn't made you do your own pointer math for like 13 years, or more if one is competent. My point here is they'd find ways to fuck this up in any language. Because they fired hundreds to replace them with AI; fundamental process of incompetence.

@masonbially @munin They've added a lot of new abstractions on top of pointer math, sure. But they haven't meaningfully reduced the risk.

Some examples: std::span::operator[] does UB instead of bounds checking. Same for std::string_view. And both the iterator-based std::copy as well as std::ranges::copy do UB if the output buffer is too small. This new stuff may look nicer but it's exactly as dangerous as pointer math.

@muvlon @masonbially

hey so the pointer math is not the actual issue here; the actual issue is that they made an architectural choice to make the execution of their binary dependent on a fixed integer value hardcoded into the binary, instead of loading options in a way that did not introduce the possibility -of- desynching.

It's a language-independent fundamental architectural situation, showing that they are not coding this to professional standards.

@munin @masonbially There's many layers of fuckup here, as you've detailed very well. But I do think one way they could've fucked up less was using a bounds-checked access and recovering from the error as opposed to yolo-ing it with C++ and getting an unrecoverable BSOD.

@munin smells like they fired all their SREs and they're having someone else do their job

@munin Maybe they're using it to sound cool when they mean "PMAI (prevent

@munin I read that weirdness as "the sensor binary is compiling the loaded config files [into runtime bytecode/JIT code]" since a lot of AVs do that. And that compiler can validate that detail.

EDIT: after reading the rest of it, I'm not sure I can infer anything from what it says anymore. It really does read like whoever (or whatever) wrote it had no goddamn clue what they were writing about...

@becomethewaifu

this is why I suspect llm horseshit. individual bits -seem- to make sense but there's no cohesion.