@vatine @mttaggart
In this specific case, the new config file was twice its normal size and exceeded the size limit that isabot is willing to load.
With your setup, you end up with two states in production:
- all new instances of the service fail to start
- old instances of the service that keep running with just "lol, whichever rules were loaded at the time".
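A minimal sketch of those two states in Python, assuming a loader with a hard size limit and a "keep last-known-good on reload" policy. `MAX_CONFIG_BYTES`, `Instance`, `ConfigTooLarge` and the 1,000-byte limit are all hypothetical, made up for illustration and not isabot's real code. A fresh start has nothing to fall back on, so it dies; a running instance's failed reload quietly keeps its stale rules.

```python
# Hypothetical sketch: names and the limit below are illustrative assumptions,
# not the real service's identifiers or numbers.
MAX_CONFIG_BYTES = 1_000


class ConfigTooLarge(Exception):
    pass


def parse_config(blob: bytes) -> bytes:
    """Reject any config larger than the loader's hard size limit."""
    if len(blob) > MAX_CONFIG_BYTES:
        raise ConfigTooLarge(f"{len(blob)} bytes > limit of {MAX_CONFIG_BYTES}")
    return blob


class Instance:
    def __init__(self, initial: bytes):
        # At startup there is no last-known-good config to fall back on,
        # so an oversized file means the instance never comes up.
        self.rules = parse_config(initial)

    def reload(self, blob: bytes) -> bool:
        """Keep-last-known-good reload: on failure, keep the old rules."""
        try:
            self.rules = parse_config(blob)
            return True
        except ConfigTooLarge:
            return False  # still serving whichever rules were loaded before


if __name__ == "__main__":
    doubled = b"x" * 1_200            # config twice the normal size
    running = Instance(b"x" * 600)    # instance started before the change
    print("running instance reloaded:", running.reload(doubled))
    try:
        Instance(doubled)             # any new or restarted instance
    except ConfigTooLarge as err:
        print("new instance failed to start:", err)
```

The running instance only picks up a new config again once one arrives that fits under the limit.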
The old instances with the old rules handle more and more of the traffic as time goes on. Any debugging attempt takes out a running instance, and any release that involves restarting the isabot service causes outages wherever it lands. Rolling back that change won't restore the targets of that rollout either, because the file is still too big.
If you roll out a new version of the service that generates the config, the running instances might pick up the new config, but only if it's small enough again.
This sounds much harder to operate than the service going down hard and saying why it went down.
Also, having your config generation routine fail to ship updates to prod because some pipeline "kinda guesses this might be a problem" relies, for a start, on massive amounts of hindsight, and it presents a huge risk of config propagation stalling because of some version/config mismatch between prod and the pipeline. It sounds like one of the most "can we just turn it off please", false-positive-laden annoyance factories imaginable.