Chatting with a friend about Cloudflare's intermittent outages today, they brought up an interesting point: How many organizations have started relying on Cloudflare to do basic security blocking and tackling, like stopping SQL injection attacks at the edge? Maybe your devs were lazy about blocking this stuff in the past because CF was the control layer to compensate for that.

You might say, okay, but if CF is down, so are the sites relying on them, and that's true. But a lot of organizations will switch CF off during these times to keep their sites and services reachable and running. And my friend's point was that for those organizations, it might be worth taking a closer look at the traffic they received during this eight-hour outage window or whatever, and I think that's sound advice.

Can someone please explain what this means?

"A Cloudflare spokesperson said the "root cause" of the outage was an automatically generated configuration file used to manage threat traffic that "grew beyond an expected size of entries," which triggered a crash in the software system that handles traffic for several of its services."

https://www.cnbc.com/2025/11/18/cloudflare-down-outage-traffic-spike-x-chatgpt.html

Cloudflare just put up a blog post on today's incident.

https://blog.cloudflare.com/18-november-2025-outage/

Cloudflare outage on November 18, 2025

Cloudflare suffered a service outage on November 18, 2025. The outage was triggered by a bug in generation logic for a Bot Management feature file causing many Cloudflare services to be affected.

The Cloudflare Blog
@briankrebs So the top line is that their automated process for creating block lists from ClickHouse had a bug that duplicated entries, which broke things because the list was capped at 200 entries. Then another component that relied on that list being returned failed to gracefully handle a 5xx error.
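
To make that second failure concrete, here is a minimal Rust sketch of the pattern being described: a consumer with no fallback when the config fetch errors out. The function name, error type, and fallback entry are all illustrative assumptions, not Cloudflare's actual code.

```rust
enum FetchError {
    ServerError(u16), // stand-in for a 5xx from the config service
}

fn fetch_blocklist() -> Result<Vec<String>, FetchError> {
    // Pretend the generator upstream has tripped its 200-entry cap and
    // the service now answers every request with a 500.
    Err(FetchError::ServerError(500))
}

fn main() {
    // Brittle version: any fetch failure kills the worker outright.
    // let blocklist = fetch_blocklist().unwrap();

    // Graceful version: fall back to the last list loaded successfully.
    let last_known_good: Vec<String> = vec!["198.51.100.7".into()];
    let blocklist = match fetch_blocklist() {
        Ok(list) => list,
        Err(FetchError::ServerError(code)) => {
            eprintln!("blocklist fetch failed with HTTP {code}; using last known good");
            last_known_good
        }
    };
    println!("serving with {} entries", blocklist.len());
}
```

The design point is that a distribution failure should degrade to stale-but-serviceable data, not take the worker down.
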
@briankrebs
I can’t wait to find out how they reinvented strcpy for the 21st century.
@dcoderlt @briankrebs
Nah, it's gonna be a block list that gets memory-mapped before the service forks workers, combined with an auto-reload of the file causing workers to get so huge because of CoW that they get OOM-killed
@EndlessMason @dcoderlt @briankrebs
If it was because of CoW they'd be memory-(out-of)-killed. Mookilled, if you will.
@noodle @dcoderlt @briankrebs
Thanks for the offer, but I definitely won't.

@dcoderlt @briankrebs
Block list (ish)
File doubled in size causing
Unexpected worker death

I'm gonna call it "decent"

@briankrebs Their CSV file grew to 65536 rows? 🀣
@jamesb @briankrebs πŸ˜‚πŸ˜‚ that's Excel
@jamesb @briankrebs json file over 4GB
@cinebox @jamesb lol so... tired: too big to fail; wired: 4 gigs to fail?

@briankrebs

My translation is that control and operation of the Internet through single chokepoints is a terrible idea, antithetical to the original purpose of the Internet: a means of communication which could survive nuclear attack.

We need to put the Internet into the hands of the people again. Decentralize it and remove Big Tech from the picture altogether.

#Internet
#Cloudflare

@Quasit @briankrebs Trouble is, many Internet services have unlimited economies of scale and/or network effects. As long as that is true, business will naturally gravitate to the biggest operator. Decentralisation means getting rid of those economic forces.

@tokensane @briankrebs

Exactly my point. Those forces have already set off a global extinction event, and we're not immune to that.

No humans = No Internet

I think capitalism has run its course, frankly. It's a choice between extinction and an end to capitalism at this point. Seems to me it's a no-brainer!

@Quasit @briankrebs I see lots of talk about the end of capitalism, but never anything about what comes after.
@briankrebs they use some kind of automation/rules/ML etc. to generate blocking configuration, and they automatically just deploy it. It generated something so big that it crashed the deployment target system. That system had no protection or validation against loading large configs, and the producing system had no checks on output size.
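
In code terms, the two missing guards that post describes might look something like this Rust sketch; the function names and the 200-entry budget are illustrative assumptions, not Cloudflare's real interfaces:

```rust
const MAX_ENTRIES: usize = 200;

// Producer side: refuse to publish a generated config that exceeds
// the budget the consumers were built for.
fn publish(entries: &[String]) -> Result<(), String> {
    if entries.len() > MAX_ENTRIES {
        return Err(format!(
            "refusing to publish: {} entries > budget of {MAX_ENTRIES}",
            entries.len()
        ));
    }
    // ... hand off to the deployment system here ...
    Ok(())
}

// Consumer side: validate before swapping in, and keep serving on the
// previous config instead of crashing on an oversized one.
fn load(new: Vec<String>, current: &mut Vec<String>) {
    if new.len() > MAX_ENTRIES {
        eprintln!("oversized config rejected; keeping previous config");
        return;
    }
    *current = new;
}

fn main() {
    let mut current = vec!["rule-0".to_string()];
    let oversized: Vec<String> = (0..500).map(|i| format!("rule-{i}")).collect();

    if let Err(e) = publish(&oversized) {
        eprintln!("{e}");
    }
    load(oversized, &mut current);
    println!("still serving with {} rule(s)", current.len());
}
```

Either check alone would have contained the blast radius; having both is the belt-and-suspenders point.
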

@ottobackwards @briankrebs Or, for example, the validator choked and refused to distribute updated configs, and then nodes started stopping because they were set not to serve traffic if their blocklist is so old that it might be a security problem to serve with it.

Every line of defense is a potential cause of problems in a complex enough system. Hard to be super specific guessing from the outside, but a lot of "obvious" mitigations just wind up moving the breakpoint left or right with the same end.

@briankrebs From the hip, it sounds like a logging recursion, where the act of logging something was itself logged in such a way as to exponentially grow the file. Like remembering that you remembered remembering, etc.
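
That guess is easy to sketch. A toy Rust version (purely hypothetical, nothing to do with Cloudflare's actual pipeline) in which each log entry triggers two more meta-entries, so output grows as 2^depth; the depth guard exists only so the demo terminates, where the buggy real thing would have none:

```rust
fn log(msg: &str, depth: u32) {
    println!("LOG: {msg}");
    if depth < 3 {
        // The bug: each entry is itself logged, twice here (a "wrote"
        // record and a "flushed" record), so output grows as 2^depth.
        log(&format!("wrote: {msg}"), depth + 1);
        log(&format!("flushed: {msg}"), depth + 1);
    }
}

fn main() {
    log("request blocked", 0);
}
```
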

@Innomen @briankrebs

I did that once on AWS. Caught it before it got too out of hand…

@wiredog @briankrebs

It's arguably easier to do than not. The whole concept of memory is simpler than selectively forgetting. Recording is easier than editing, logically.

@briankrebs haha, sounds like the bug in Traefik, where SSL certs of a certain chain length would crash it, which almost killed our vaccination cert release. They knew about it but only fixed it when we put a large premium on it.
@briankrebs DevOps: when a mistake takes down 5,000 servers, to minimize any lack of consistency between them
@briankrebs it's hardly the first time a log file getting too big has blown up a server

@briankrebs PTFB v2

Someone hit the wrong Freakin' Button!

@numodular @briankrebs lol...

BREAKING News!..

Too much is too much, and too little is too little.

Now, back to our latest autocratic disintegration...

@briankrebs file got big and went boom and took all the RAM?
@briankrebs "Our use of vibe coding finally bit us in the butt"
@catsalad @briankrebs TBH I'm not so sure, but only because the idea that enterprise programmers need AI to write shit code is giving them too much credit IMO :D

@briankrebs

Abstract all you want, MAX_INT still matters.

@briankrebs

Sounds like maybe an out-of-memory error when allocating memory for the data being read. Maybe?

@briankrebs Translation: CF has adopted the strategy of making statements without saying anything.

("When I've got nothing to say, my lips are sealed." Talking Heads, Psycho Killer)

@briankrebs Some guesses, because that statement isn't really narrowing it down: Inefficient handling of large files caused an out-of-memory situation? Memory corruption because of a static buffer size? The config parser didn't check whether it could handle the whole file, and used partial data?
@briankrebs probably a malicious botnet that was leveraging Cloudflare services was detected by Cloudflare's threat detection services, so Cloudflare started trying to block Cloudflare, and they had some sort of endless recursion of Battling Cloudflare Services err... flare-up 🀑 πŸ”₯ #Cloudflare
@briankrebs Their k8s ConfigMap exceeded the 1 MB etcd limit. πŸ˜‚

@briankrebs three possibilities:

1. The software started detecting and tracking additional threats, and the automated process which produced the set of threats didn't have a size limit (or its size limit was larger than what the implementation could handle).

2. The software for detecting and tracking threats had a bug which flagged a bunch of non-malicious software as threats. The outage then proceeded as (1).

3. The software that detects and tracks threats had a bug that caused it to double-, triple-, or more report on the threats. There were no more threats than before, but the threat file grew to the point where the threat-blocking components couldn't read it. (See the sketch after this post.)

I'm also interested in which component couldn't handle the threat list -- was this standard server boxes with Cloudflare software, or e.g. edge network devices being asked to load a set of firewall rules that didn't fit in ASICs/SRAM?
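
For what it's worth, possibility (3) above is easy to demonstrate. A Rust sketch with made-up numbers, where an upstream duplication quadruples every row so a 60-threat set busts a 200-row cap:

```rust
use std::collections::BTreeSet;

const CAP: usize = 200;

fn main() {
    let real_threats: Vec<String> = (0..60).map(|i| format!("threat-{i}")).collect();

    // The bug: an upstream join/duplication emits every row four times.
    let reported: Vec<String> = real_threats
        .iter()
        .flat_map(|t| std::iter::repeat(t.clone()).take(4))
        .collect();
    assert!(reported.len() > CAP); // 240 rows, over the cap

    // Deduplicating before the cap check recovers the true size.
    let unique: BTreeSet<&str> = reported.iter().map(|s| s.as_str()).collect();
    assert!(unique.len() <= CAP); // 60 distinct threats, comfortably under
    println!("{} reported rows, {} distinct threats", reported.len(), unique.len());
}
```
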

@briankrebs regular expression turned out to be not very regular, i.e. the CrowdStrike thing

I don't think that's true, but it would be hilarious

@briankrebs Violation of the "no arbitrary limits" rule?

@briankrebs 65536 evil IPs? (Or an int overflow of IPs, but that would be wild.)

Most likely a randomly chosen limit in a database field that was reached.

Either way, I'm sure you'll have an article on it in a few days that will explain it to us.

@briankrebs I’d say that the root cause is propagating config files globally without staging them and verifying that they work, and rolling back changes that don’t work.
@briankrebs I'm gonna guess that someone parses that file by reading it into a fixed-size buffer, and no one else knew that until it got too big. Oh, and also the failure mode was... less than optimal.

@briankrebs service that was designed for less than 2^n entries achieved 2^n entries, where n was a reasonable choice at the time. Usually due to running out of integers of a certain size, or finding collisions in hash functions that were not supposed to collide with reasonable probability.

It happens with some frequency when more internet traffic than you could conceivably imagine runs through your servers. Ideally you catch that before bad stuff happens.
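
A toy Rust version of that failure mode, purely illustrative, with u16 standing in for whatever width the real system picked:

```rust
fn main() {
    let mut count: u16 = u16::MAX - 1; // 65_534, one step from the edge

    // checked_add makes the ceiling explicit instead of silently
    // wrapping (release builds) or panicking (debug builds) on `+`.
    count = count.checked_add(1).expect("still fits"); // 65_535 = 2^16 - 1
    match count.checked_add(1) {
        Some(c) => println!("count is now {c}"),
        None => println!("hit 2^16 entries; time for a wider type and a plan"),
    }
}
```
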

@sophieschmieg @briankrebs This feels like Cloudflare having a "CrowdStrike moment." I guess we can wait for the postmortem, though.
@briankrebs technocratic marketing speak for "it's DNS."

@briankrebs

What it means is that expecting Cloudflare to do your organization's security is like expecting the wealthiest .01% to keep hospital emergency rooms open. Not to get political or anything... <sigh>

@briankrebs I suspect no one outside of Cloudflare can, but my assumption is... they ran out of disk space
@briankrebs 65536^2 flow table entries, I suppose

@briankrebs

... They filled up the disk the service was running on?

That's how that reads to me and that's kinda funny.

@briankrebs
This is why you don't use FAT32 on your servers
@briankrebs I think the one thing we can deduce from just this with certainty is that "the software system that handles traffic for several of its services" was not properly tested. If there is a fixed "expected size of entries," then you must define and test the case of exceeding it (ignore the excess? raise a priority alert? etc.).
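
That test is cheap to write once the over-limit behavior is defined. A Rust sketch, with `load_features` and the cap as illustrative assumptions (run with `cargo test`):

```rust
const CAP: usize = 200;

// Policy under test: keep the first CAP entries rather than crashing.
// A real system would presumably also raise an alert alongside this.
pub fn load_features(mut entries: Vec<String>) -> Vec<String> {
    if entries.len() > CAP {
        entries.truncate(CAP);
    }
    entries
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn oversized_file_is_truncated_not_fatal() {
        let oversized: Vec<String> = (0..500).map(|i| format!("f{i}")).collect();
        assert_eq!(load_features(oversized).len(), CAP);
    }

    #[test]
    fn normal_file_passes_through() {
        let normal: Vec<String> = (0..60).map(|i| format!("f{i}")).collect();
        assert_eq!(load_features(normal).len(), 60);
    }
}
```
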

@briankrebs
"Someone was editing a zone file by hand and suffered a tragic clipboard paste just before saving. Shit got real."

#AlwaysDNS

@briankrebs having read through it, I'll try my best to explain...

They configure their fleet with a file that contains feature weights for a model that's built to classify and handle traffic from bots (a feature of Cloudflare is that they can curtail traffic from bots before it hits your website).

@briankrebs They were updating a backend database that holds the feature weights used to build that config file, and a normally fine query (that wasn't expecting the backend update) resulted in the file getting bloated from the 60 or so features to above 200. To complicate matters, their error-handling code for when the feature file grew above 200 entries didn't safely/gracefully handle the error. Instead, it would simply kill the process that was trying to read the file.
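
A stripped-down Rust sketch of that last part (the blog post has the real code; every name below is made up): a hard 200-feature limit whose error path is consumed through .unwrap(), so an oversized file becomes a panic that kills the worker.

```rust
const FEATURE_LIMIT: usize = 200;

fn append_feature(features: &mut Vec<f64>, value: f64) -> Result<(), &'static str> {
    if features.len() >= FEATURE_LIMIT {
        return Err("feature file exceeds preallocated limit");
    }
    features.push(value);
    Ok(())
}

fn main() {
    let mut features = Vec::with_capacity(FEATURE_LIMIT);

    // The normal case: ~60 features load fine.
    for i in 0..60 {
        append_feature(&mut features, i as f64).unwrap();
    }

    // The bad day: the duplicated query output pushes past 200, the
    // Err surfaces through .unwrap(), and the process dies right here.
    for i in 60..240 {
        append_feature(&mut features, i as f64).unwrap();
    }
}
```

Handling the Err (for example, by keeping the last good file) instead of unwrapping is the graceful path the post says was missing.
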

@briankrebs I'm no network pro by any means, but this was the first thing that came to mind when I grasped what had happened, after a friend pointed out to me, five minutes into it, that Cloudflare was down.

edit: corrected spelling errors due to dyslexia #meaculpa

@briankrebs This also makes the case for a separation of concerns, i.e., content delivery, connectivity, and security are all distinct concerns, and while there is overlap, it is advantageous to design these systems so that they aren't completely interdependent.