Jacob

@jhscott
Exceptional
* The intended vibe here is that anything that pages frequently should be automated away

A take on pages.

Pages should be
1) Urgent
2) High Context
3) Exceptional

Urgency
* Should be a real problem that cannot be handled by less obtrusive means
* Should be sent to the right responder

High Context
* Should have a clear purpose/reason for existing
* Should be actionable / have runbooks whenever possible (systems are full of unknown unknowns, and some pages will correspond to novel failures where no a priori guidance is possible)

wdyt?

1/2

@sheetpima

Is there a "Scaling Cake Rule"* where you must either
1) overprovision to cover your maximum bursty spike or
2) accept capacity failures during scaleups

or are there techniques to dodge this tradeoff?

* e.g. you can't eat your cake and have it too

@sheetpima I'm generally curious about how folks handle autoscaling of spiky "online" (a.k.a. "Serving" from Google's Autopilot paper: https://dl.acm.org/doi/pdf/10.1145/3342195.3387524) services.

AFAICT it probably takes O(minutes) to scale up to a spike... are folks just comfortable with eating failures while that scaling happens?

@sheetpima
With autoscaling, after 15 minutes:
* A may scale down because it takes fewer resources to serve fast-failing requests
* B will almost certainly scale down because it has gotten no traffic

Now at minute 30, when X recovers, both A and B are underscaled, and recovery likely takes significantly longer than the static case.
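The timeline above can be sketched as a toy model (all numbers and the proportional-autoscaler policy are illustrative, not from any real system):

```python
# Toy model of the incident: A depends on X and B; X is down for 30 minutes.
# A simple proportional autoscaler sets replicas ~ apparent demand, with a
# floor of 1 replica. Both services start at 10 pods steady state.

def autoscale(current_replicas, apparent_demand, min_replicas=1):
    """Scale replicas toward apparent demand, never below min_replicas."""
    return max(min_replicas, apparent_demand)

steady_a, steady_b = 10, 10

# During the X outage: A's requests fail fast (cheap to serve), so A's
# apparent demand drops; B receives zero traffic, so its demand collapses.
demand_a_during_outage = 4   # hypothetical: fast-failing requests are cheap
demand_b_during_outage = 0   # no traffic reaches B at all

a = autoscale(steady_a, demand_a_during_outage)   # A scales down
b = autoscale(steady_b, demand_b_during_outage)   # B drops to the floor

# At minute 30, X recovers: real demand snaps back to 10 pods each,
# but A is at 4 and B is at 1, so both must scale up before recovery.
print(a, b)
```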

@sheetpima

Let's say A and B both have 10 k8s pods running steady state, and consider the following incident: X goes hard down for 30m.

So for 30m, A will make no requests to B, and all requests to A will fail.

Under static scaling:
* Both A and B will run with 10 pods.
* After 30m, recovery will be straightforward (in this simple example; there are obviously cases where it is not)

@sheetpima
🤣 Pete it was on my list to ping you on this so perfect. Sorry about my Mastodon latency lol.

I definitely don't want tight coupling!

But I am trying to reason about a potential regression in autoscaling compared to static scaling.

Consider a simple architecture. Service B is a stateless service with no dependencies. Service A depends on B and DB X (queries X before calling B).

@dustyburwell I don't believe so -- is there a specific chapter you think I should look at?

🤔 Where can I learn more about advanced topics in autoscaling? For example:
* Service A makes calls to Service B, B has no other clients
* I want to move both to autoscaling
* If A sees 50% less traffic during US evenings, I'd like A and B to scale down to save money
* If A has an incident, I want to avoid B scaling down and creating a thundering herd during recovery
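One knob aimed at the last bullet, for Kubernetes specifically, is the autoscaling/v2 HPA's scale-down controls. A sketch (service/metric names and numbers are made up for illustration):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: service-b            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: service-b
  minReplicas: 5             # floor so B never scales to near-zero
  maxReplicas: 50
  behavior:
    scaleDown:
      # Use the highest recommendation from the last hour, so a 30m
      # traffic dropout doesn't immediately shrink B.
      stabilizationWindowSeconds: 3600
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
```

This doesn't eliminate the tradeoff (you're paying for the floor and the slow scale-down), but it bounds how underscaled B can be when A's traffic returns.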

@marcbrooker any suggestions? RIP old Twitter...