Blog post now live
Creating a kubernetes autoscaling operator that responds to UPS events
This has been a really interesting project, thanks for all the feedback!
Now that I have a good number of services running on my Talos Kubernetes cluster and backups sorted, I wanted a solution to power failures. For a long time, I’ve used a UPS to give enough time to shut down servers gracefully. With services, databases and storage all on the same cluster, I wanted to investigate scaling down services and bringing them back up again when power is restored. This should happen in a specified order: deployments & statefulsets first, followed by databases and finally storage.
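A minimal sketch of that ordering, in Python for illustration only (this is not the operator's code). The tier names, the `TIER_ORDER` mapping and both function names are hypothetical. Scale up is assumed to be the exact reverse of scale down.

```python
from collections import defaultdict

# Hypothetical tiers: a lower number is scaled down earlier.
TIER_ORDER = {"workload": 0, "database": 1, "storage": 2}


def scale_down_stages(workloads):
    """Group (name, tier) pairs into ordered stages for scaling down."""
    stages = defaultdict(list)
    for name, tier in workloads:
        stages[TIER_ORDER[tier]].append(name)
    # One list per tier, lowest tier first; names sorted for stable output.
    return [sorted(stages[key]) for key in sorted(stages)]


def scale_up_stages(workloads):
    """Scale up in the reverse order: storage, then databases, then workloads."""
    return list(reversed(scale_down_stages(workloads)))
```

Each stage has to finish before the next one starts, so the databases never lose their clients mid-write and the storage is the last thing to go.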
Some final thoughts on this thread now I have recovered from the shock of it actually working:
NUT reports a UPS run time of 24 mins. In practice it is more like 6 minutes, even though the load does not increase dramatically when scaling down.
The low battery warning / shutdown message from the UPS arrives too late, so now we scale down as soon as the power goes.
The time to scale down has been reduced from 3min30sec to about a minute by doing more in parallel.
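One way "doing more in parallel" could look, as a sketch rather than the actual implementation: keep the stages strictly sequential, but scale everything inside a stage concurrently. `run_stages` and `scale_fn` are hypothetical names; `scale_fn` stands in for whatever call scales a single workload and waits for it to settle.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stages(stages, scale_fn, max_workers=8):
    """Run stages in order; scale the workloads within each stage concurrently."""
    results = []
    for stage in stages:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # list() blocks until every call in this stage has returned,
            # and re-raises the first exception if one of them failed.
            results.append(list(pool.map(scale_fn, stage)))
    return results
```

The total time then becomes roughly the slowest workload in each stage added together, instead of the sum of every workload.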
Successful test today 😃
I have some tweaking to do with the scale down / up event timings but everything worked as planned. Probably time for a blog post…
2. Scaling back up is configured to wait until the battery has recharged to 15 mins of runtime. The UPS takes about 3 hours to recharge to this level.
Thinking about options for this one while I wait for the UPS to recharge.
We could scale back up as soon as the UPS comes back online or provide an override endpoint to trigger scaling back up <- this is how my nut-exporter mock works for testing
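The scale-up options above can be sketched as a single decision function. This is an illustration under assumptions: the function name, its parameters and the idea of passing the override in as a flag are all hypothetical, and only the 15 minute threshold comes from the thread.

```python
def should_scale_up(on_line_power, runtime_seconds, override=False,
                    min_runtime_seconds=15 * 60):
    """Decide whether the cluster can be scaled back up."""
    if override:
        # Manual trigger, e.g. from an override endpoint.
        return True
    # Otherwise wait for mains power and a sufficiently recharged battery.
    return on_line_power and runtime_seconds >= min_runtime_seconds
```

Passing `min_runtime_seconds=0` gives the "scale back up as soon as the UPS comes back online" behaviour without a separate code path.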
Today's learnings...
1. The UPS is connected to a Synology NAS which exposes the status to the network with nut-exporter. When the UPS issues a low battery, the NAS goes into suspend mode before the operator can collect the status.
Improved approach: if nut-exporter has gone away and the last status was 'on battery', scale down the cluster.
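A rough sketch of that rule, with `None` standing for "nut-exporter could not be reached". The status strings and the function name are hypothetical, not what nut-exporter actually reports. The second branch reflects the later decision in this thread to scale down as soon as the power goes.

```python
from typing import Optional


def should_scale_down(current_status: Optional[str],
                      last_status: Optional[str]) -> bool:
    """Decide whether to scale down, given the latest and previous UPS status."""
    if current_status is None:
        # Exporter unreachable: only act if the UPS was last seen on battery,
        # so a plain network blip on mains power does not trigger a scale down.
        return last_status == "on_battery"
    return current_status == "on_battery"
```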
Today should be the day the UPS gets unplugged as a final test. But there are a couple more tests to run first:
- full scale down without rook
- full scale down including rook
- switch off the UPS 😱
I've made assumptions about the states the UPS should present when the power goes out (i.e. On Battery, Shutdown or Forced Shutdown) and the code should be able to handle various scenarios, but it will be interesting to see what actually happens..... 🤞
Ok thanks to the suggestion from @nicr9 I have completely refactored the code to use a ups-scaler Custom Resource Definition instead of labels and annotations.
Most of the work was in refactoring the tests as in addition to new API calls, the order of the previous calls all changed...
Unbelievably, I just deployed to the cluster and successfully completed a scale down / scale up for five test workloads at the first attempt. Beer o'clock now 🍻
Nope. Annotations disappear too.
All it takes is an operator restart 😢
Guess I’ll have to see if the labels can be added from the helm chart
Fortunately there looks to be an easy option - use annotations instead of labels. I’m not using the label selector functionality so it doesn’t really make a difference.
But first I need to spin up a test cluster and recreate the issue. If it is the operator, a simple restart would trigger removing the labels…….
One observation from today’s test that I need to figure out:
The rook operator removed custom labels from the ceph-exporter and csi-provisioner deployments when it was restarted. The annotations were untouched. Need to work out if this is by design or not…..
Would it matter if these #rook #ceph deployments are not scaled down?