Oh look, the asswipes at clownstrike shat out an analysis when everyone's going to DEF CON for hacker summer camp.

Ain't it -nice- to have an activity for those of us at home away from the cosplay-as-a-Problem convention?

Let's take a look at their whining and excuses for why they fucked up everyone's day a few weeks back.

https://www.crowdstrike.com/wp-content/uploads/2024/08/Channel-File-291-Incident-Root-Cause-Analysis-08.06.2024.pdf

So this starts out with "AI" right up front and, boy howdy, that is -not- a promising beginning.

And then we get to this description of the sensor activity:

So what's that -mean- anyway?

So you have the sensor - that is, the specific bit of software from CS that is installed on your endpoint.

And what that sensor does is correlate "context from its local graph store" - telemetry events that it's got in a database - into "behaviors and indicators of attack"

So it matches behavioral patterns.

And then it talks about "Rapid Response Content" that gets delivered from 'the cloud' to provide behavioral definitions "without requiring sensor code changes"

..........wait what.

Why would you need your -sensor- to have a code change to update what it looks for? This is why we have configuration files and definition files as a concept.

Anyway, this 'Rapid Response Content' is about "behavioral heuristics" which --

Oh hey! As it happens, my actual literal job is making that specific thing for a competitor.

So I can tell you that the way that -I- do this is to look at the behavior of the malware under analysis, and chart out what it does - what files it accesses, what devices it hits, what signals it gets involved with, what system calls it makes, what libraries it's linked to, etc. - and then look into the context of what a -normal- workflow would look like in related areas, and then find the differences.

And from those differences I make a nice lil YAML file that gets sent to the sensor, that it uses to make those kinds of correlations between events and "things to be worried about".
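Purely for illustration - field names invented by me, no relation to any vendor's actual schema - such a rule file might look something like:

```yaml
# Hypothetical behavioral rule. Invented names, illustrative only.
rule: suspicious-pipe-client
description: unexpected process opening a privileged named pipe
match:
  event: pipe_open
  pipe_name: "\\\\.\\pipe\\svc_control"
  process_not_in: [services.exe, svchost.exe]
severity: high
action: raise_indicator
```

The sensor loads this as data, matches it against telemetry, and raises an indicator on a hit. No sensor code change required - that's the entire point of definitions-as-data.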

There's, y'know, a whole-ass testing process involved before that happens, but we'll get there.

The "Rapid Response Content" is the same shape as this - it gets sent to the endpoints "via the cloud" (a set of servers) and it changes the behavior of the endpoint sensor to look for specific behaviors.

"We have a pipeline and there's a category tag"
The next para talks about their internal numbering schema. They're not using semver or anything legible to outsiders, which means they have to explain what "Channel 291" refers to - in this case, it's the tag for detections covering Windows named pipes.

The root cause is that they made it expect 21 arguments, but only gave it 20.

This was not caught in testing because their test environment did not match the conditions of their anticipated customer environments. Instead of a testing path that would actually validate the missing parameter, they put in a wildcard for it.

Left unsaid, but very the fuck pertinent, is that the rest of us who give a fuck have this nasty habit of checking that something we're going to load into a process....has the -right fucking format- for the process, and then we use this cute little concept called an "error message" to let the operator know if -something is the fuck missing-.
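For the avoidance of doubt, this is not rocket science. A sketch of that cute little concept - hypothetical code, obviously not their actual loader, and the pipe-delimited format is my invention:

```cpp
#include <cstddef>
#include <cstdio>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical loader: split a pipe-delimited definition record and
// refuse to load it unless the field count matches what this build of
// the engine expects. Reject, log, move on - don't crash the box.
constexpr std::size_t kExpectedFields = 21;

std::optional<std::vector<std::string>> load_record(const std::string& line) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string field;
    while (std::getline(ss, field, '|')) fields.push_back(field);
    if (fields.size() != kExpectedFields) {
        // The "error message" concept: tell the operator what is missing.
        std::fprintf(stderr, "definition rejected: expected %zu fields, got %zu\n",
                     kExpectedFields, fields.size());
        return std::nullopt;  // skip the bad record instead of dereferencing junk
    }
    return fields;
}
```

A malformed record costs you one detection rule and one log line, not eight and a half million blue screens.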

Innovative, I know. Top-right quadrant thinking.

Oh, whoops, my mistake:

They -allowed- wildcards in the 21st field initially, and then -disallowed- them but didn't test -that- change.

Nice touch with putting the dates in; y'all are at least compatible with the disaster podcast convention that shit starts getting serious when there's a timestamp, so, credit where it's due.

"And then we pushed an update that triggered the consequences of our prior fuckup in failing to bounds check, failing to lint configurations, failing to understand that a config file could be corrupted or wrong and providing an error handling mechanism, and failing to actually test our shit"
Who -wrote- this shit? I've seen -ciphertext- that's clearer than this shit
So, in summary, the shit I said above and they pinky-swear it can't happen again.

I'm....I'm gonna have to sit with this one for a moment.

Because what it says about their development processes is -fucking fascinating-

So what they're saying here is that the -sensor binary-, at the -time of compilation-, did not validate that the definitions file had the correct number of fields.

But

............you don't -do- that at compile time.

Before the compile, as part of your overall review process for adding code, making sure that everything this code connects to has been adjusted? Yeah, that's...that's how software review works.

On execution, when you're -loading- the definitions file, having it check that it's got the number of fields it was expecting? Yes, I was screaming at that up above.

But neither of those happens at compile time. Why are they bringing up compile time?

Also, this is not one finding. This is -multiple- findings:

1. the actual lack of validation
2. the lack of effective review process exposed by this, where an invalid state was not caught during the development of the new type
3. the lack of effective testing, which never exercised, e.g., invalid configuration files -to test such a mechanism in normal operative contexts-

Three is more than one, guys.
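And finding 3 is the cheap one to cover. A negative-test sketch - `validate()` here is a hypothetical stand-in for whatever acceptance check the sensor runs, not their code:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Stand-in acceptance check: does this pipe-delimited record carry
// exactly the number of fields the engine expects?
bool validate(const std::string& line, std::size_t expected_fields) {
    std::vector<std::string> fields;
    std::stringstream ss(line);
    std::string f;
    while (std::getline(ss, f, '|')) fields.push_back(f);
    return fields.size() == expected_fields;
}
```

The point is the test matrix, not the helper: every mechanism that loads content gets cases where the content is truncated, overlong, and empty - and the test asserts -rejection-, not just survival.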

WHY IS THIS A MITIGATION

YOU ARE NOT DOING IR RIGHT NOW. THIS IS A POST-INCIDENT REVIEW.

MITIGATIONS ARE DURING THE INCIDENT. POST-INCIDENT FINDINGS GET FUCKING

R E M E D I A T I O N S

YOU ARE USING THE WRONG WORDS FOR WHAT YOU ARE DOING

blah blah they made a patch so it bothers to lint its inputs hooray this does nothing to address the process problems that led to this fuckery you utter dipshits but at least someone told you lint exists moving on
yeah you know there are languages where this problem just doesn't happen?

The phrase "input pointer array" appears in the next para, which means "we are doing silly shit with C++ because we're leet yo"

Languages that don't make you do your own fucking pointer math exist for a fucking reason.

Their 'mitigation' here is to bother to check that they're still in allocated memory, something which is only a problem by their choice.

@munin I mean, just to be clear, C++ hasn't made you do your own pointer math for like 13 years, or more if one is competent. My point here is they'd find ways to fuck this up in any language. Because they fired hundreds to replace them with AI; it's fundamental process incompetence.

@masonbially @munin They've added a lot of new abstractions on top of pointer math, sure. But they haven't meaningfully reduced the risk.

Some examples: std::span::operator[] does UB instead of bounds checking. Same for std::string_view. And both the iterator-based std::copy as well as std::ranges::copy do UB if the output buffer is too small. This new stuff may look nicer but it's exactly as dangerous as pointer math.

@muvlon @masonbially

hey so the pointer math is not the actual issue here; the actual issue is that they made an architectural choice to make the execution of their binary dependent on a fixed integer value hardcoded into the binary, instead of loading options in a way that did not introduce the possibility -of- desynching.

It's a language-independent fundamental architectural situation, showing that they are not coding this to professional standards.
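One language-independent shape for fixing that - a sketch of mine, not their format - is to make the content self-describing, so the expected count travels with the data instead of being baked into the binary:

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Sketch of a self-describing record: the header line declares how many
// fields follow, so the reader never hardcodes "21" and the binary
// cannot silently desync from the content it loads.
struct Record { std::vector<std::string> fields; };

std::optional<Record> parse(const std::string& blob) {
    std::stringstream ss(blob);
    std::size_t declared = 0;
    if (!(ss >> declared)) return std::nullopt;  // header: declared field count
    ss.ignore();                                 // consume the newline after it
    Record r;
    std::string f;
    while (std::getline(ss, f, '|')) r.fields.push_back(f);
    if (r.fields.size() != declared) return std::nullopt;  // mismatch: reject
    return r;
}
```

Now a count mismatch is a rejected file, not a hardcoded assumption waiting to be violated.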

@munin @masonbially There's many layers of fuckup here, as you've detailed very well. But I do think one way they could've fucked up less was using a bounds-checked access and recovering from the error as opposed to yolo-ing it with C++ and getting an unrecoverable BSOD.