How to build fault-tolerant Kubernetes clusters: a brief overview from the VK Cloud team

Cloud migration and the move to microservice architecture have made Kubernetes (k8s) the de facto standard for container management. According to 2025 data, 60% of large Russian companies already use the technology, and another 15% plan to adopt it. Moreover, 59% of companies name fault tolerance as a key criterion when choosing Kubernetes, yet only a few implement it in practice. The problem lies in underestimating systemic risks, from a control plane with no redundancy to misconfigured readiness-probe timings that let "half-alive" pods through to the load balancer. In this article we briefly cover the key principles of designing and operating fault-tolerant clusters, typical failure scenarios, and recommendations for eliminating risks at every level.

https://habr.com/ru/companies/vktech/articles/1042084/

#vk_cloud #kubernetes #отказоустойчивость #high_availability #devops #etcd #storage #statefulset #gitops #backup
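
The point about control plane redundancy comes down to majority arithmetic: etcd, which backs the control plane, only accepts writes while a majority of its members can reach each other. A minimal sketch of that arithmetic follows; the function names are hypothetical and not taken from the linked article.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print the table for small cluster sizes.
    for n in range(1, 8):
        print(f"members={n} quorum={quorum(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Going from three members to four does not raise the number of tolerated failures, which is why odd member counts are the usual choice.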

Possibly Dirty
#Linux #K8s #etcd

Friends don't let friends run production etcd on SATA disks (or basically anywhere other than local NVMe)!

I was deploying 50 Kubernetes virtual clusters (with vcluster) on top of a bare-metal cluster running Talos.

All the nodes had NVMe disks except one, which had a SATA SSD, and it did not go well at all.

I was expecting a difference, but not one that big: the SATA disks were saturated while the NVMe disks hovered between 1% and 5% I/O load.

The SATA disks were so overloaded that I had to kick their node out of the etcd cluster (because the API server was extremely slow).

Oh well, it'll be 100% NVMe on this cluster from now on!

#etcd #kubernetes #vcluster #ssd #nvme #homelab
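
etcd syncs its write-ahead log to disk before acknowledging a write, so sync latency tends to matter more than raw throughput. A rough way to compare two disks is to time small synced writes on each. The sketch below is an illustration only, not what the author ran; the block size, sample count, and function names are assumptions.

```python
import os
import statistics
import tempfile
import time


def measure_fsync_latency(directory, writes=200, block_size=2300):
    """Append small blocks to a temp file in `directory`, fsync after each
    write, and return the per-fsync latencies in milliseconds."""
    payload = os.urandom(block_size)
    latencies_ms = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            os.write(fd, payload)
            start = time.perf_counter()
            os.fsync(fd)  # force the data to stable storage
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies_ms


def summarize(latencies_ms):
    """Return median, an approximate 99th percentile, and max latency."""
    ordered = sorted(latencies_ms)
    p99_index = max(0, int(len(ordered) * 0.99) - 1)
    return {
        "median_ms": statistics.median(ordered),
        "p99_ms": ordered[p99_index],
        "max_ms": ordered[-1],
    }


if __name__ == "__main__":
    # Point this at a directory on the disk you want to test.
    print(summarize(measure_fsync_latency(tempfile.gettempdir())))
```

Run it once against a directory on each disk and compare the tail values; a dedicated tool such as fio gives more trustworthy numbers than this sketch.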

Your Kubernetes is down: can you find the root cause in 15 minutes?

Tuesday, 14:00. The Kubernetes cluster has stopped responding, the team is panicking, and you have 15 minutes to find the root cause. In this article we walk through the diagnosis of a real outage together with an SRE: we look at the logs, the etcd manifest, and the mistakes that even experienced engineers make. Try to solve the problem yourself first, then compare against the step-by-step walkthrough and see how ready you are for an incident like this.

https://habr.com/ru/companies/otus/articles/1031260/

#Kubernetes #etcd #kubelet #SRE #DevOps #productionинцидент #отказ_кластера #root_cause #control_plane #runbook


Hi everyone, my name is Sergey Proshchaev. I am a Tech Lead and head of Java | Kotlin development in FinTech & E-commerce, and I also teach courses on development and architecture...


We've released #etcd v3.7.0-beta.0!

https://etcd.io/blog/2026/etcd-370-beta/

This release includes RangeStream queries and more.

It also represents several milestones for our project: second regular annual release, our first beta in years, and the addition of long-requested user-visible features instead of just focusing on stability.

Please test it out and let us know how it works for you!

#kubernetes #CloudNative #database

Announcing etcd v3.7.0-beta.0

SIG-Etcd announces the availability of the first beta release of etcd v3.7.0. This new version of the popular distributed database and key Kubernetes component includes the long-requested RangeStream feature, as well as a refactoring and cleanup of multiple legacy components and interfaces. v3.7 will deliver improved security, better operational reliability, and an improved experience for working with large resultsets. First, however, the project needs users to test the beta. You can find v3.7.0-beta.0 here:


A distributed KV store based on etcd

Without going deep into the technical weeds, I will try to explain distributed KV stores in a popular-science style: what they are, where they are used, and why we chose etcd.

https://habr.com/ru/articles/1025994/

#etcd #KVхранилища #инфраструктура


Recently I had to choose a storage system for infrastructure data for a small project. The volume is a few thousand records; the main requirements are that the system must be distributed,...


AI at the edge is an infrastructure puzzle. Red Hat is helping solve it by contributing llm-d to the #CNCF, establishing "well-lit paths" for AI-RAN orchestration with SoftBank. 🐧

This is about optimization—making inference a first-class citizen alongside traditional containers.

Proud to see Red Hat continuing our legacy of open-source leadership, from #Kubernetes and #etcd to #KEDA and now #llmd.

Read more: https://www.redhat.com/en/blog/how-llm-d-brings-critical-resource-optimization-softbanks-ai-ran-orchestrator

#RedHat #AI #OpenSource #KubeCon #CloudNative

How llm-d brings critical resource optimization with SoftBank’s AI-RAN orchestrator

In Red Hat’s latest collaboration with SoftBank Corp., we have integrated llm-d into SoftBank’s AI-RAN orchestrator, AITRAS.

Red Hat is contributing llm-d to the #CNCF, turning fragmented AI into modular, interoperable microservices. 🐧

The goal? Make AI inference a first-class citizen in the same cloud-native environment as your traditional apps.

I love how Red Hat continues to fuel the #OpenSource ecosystem. From our roots in #Kubernetes and #etcd to newer projects like #KEDA and #CRI-O, we’re committed to building "well-lit paths" for everyone.

#RedHat #KubeCon #CloudNativeCon #AI #llmd

https://www.redhat.com/en/blog/why-were-contributing-llm-d-cncf-standardizing-future-ai?sc_cid=701f2000000txokAAA&utm_source=bambu&utm_medium=organic_social

Why we’re contributing llm-d to the CNCF: Standardizing the future of AI

Red Hat is contributing llm-d to the Cloud Native Computing Foundation (CNCF) as a Sandbox project to standardize high-performance, distributed AI inference serving within the cloud-native stack. This contribution aims to bridge the capabilities gap between AI experimentation and production by providing a specialized data-plane orchestration layer that maximizes infrastructure efficiency and enables flexible deployment on any choice of hardware.

#etcd is #k8s's key-value store, where all cluster state lives, including secrets. By default, secrets in etcd are not encrypted; base64 is only an encoding. If someone gets etcd access (backup file, snapshot, direct port access), they get all your secrets in plaintext. You can enable encryption at rest for etcd, but most people don't set it up, and it's still inside your #cluster.
#agenix decrypts the .age file → feeds the #secretbox key to kube-apiserver → apiserver uses it for etcd. The failure happened at the agenix layer (wrong key in the .age file), not in secretbox itself.
RBAC defeats: a compromised pod, a stolen kubeconfig, a rogue user, that is, anyone who tries to read secrets through the Kubernetes API without sufficient permissions. They hit the apiserver, RBAC says no, and they get a 403.
secretbox defeats: someone who bypasses the API entirely by stealing the etcd data directory, taking an etcd snapshot from a backup, or reading etcd directly over its client port without going through kube-apiserver. RBAC never runs in this scenario because the attacker never talked to kube-apiserver.
The critical insight: secretbox does nothing if the attacker has API access, and RBAC does nothing if the attacker has disk access. They cover completely non-overlapping attack surfaces.
The problem hit here would have been identical with #SQLite, because the encryption layer is in kube-apiserver, not in the storage backend. But the operational simplicity of SQLite would have made recovery easier, since inspecting and backing up the database is much more straightforward than #etcd snapshot management.
#kubernetes
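
To make the "base64 is not encryption" point concrete, here is a small sketch showing that a base64 value of the kind found under `data:` in a Secret manifest decodes back to plaintext with no key at all. The password and the helper name are made up for illustration.

```python
import base64

# Hypothetical value, encoded the way it would appear under `data:`
# in a Secret manifest.
encoded_password = base64.b64encode(b"s3cr3t-password").decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Decode a base64 Secret value; no key or credential is required."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    print("stored: ", encoded_password)
    print("decoded:", decode_secret_value(encoded_password))
```

Anyone holding the encoded string can do this, which is why the post treats API-level protection (RBAC) and encryption at rest (secretbox) as separate layers.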