Daniele Polencic

Why etcd breaks at scale in Kubernetes

February 2026



You might use Kubernetes for years without ever needing to think about etcd.

But once your cluster grows large enough, etcd can quickly become your main concern.

Why does etcd have trouble at scale?

What breaks, and why?

And how did the teams running the world's largest Kubernetes clusters handle it?

From one control plane to many

In Kubernetes, only the API server communicates directly with etcd.

The scheduler, controller manager, kubelet, kubectl, and your operators all communicate through the API server.

Only the API server reads from and writes to the database.

[Diagram: only the API server talks directly to etcd, while the scheduler and controller manager communicate through the API server.]

etcd is the API server's private backend: everything else sees Kubernetes through the API.

So, do you really need etcd?

If you have a single API server running on a single machine, the answer is no.

You could store your cluster state in SQLite, PostgreSQL, or even a flat file on disk.

The API server would read and write from it, and everything would work.

So why use etcd at all?

Because production clusters need high availability, and that means running multiple API servers.

If one API server crashes or goes offline for maintenance, another one takes over, and the cluster keeps running.

But this introduces a new problem.

If you have three API servers, they all need to read and write the same data.

[Diagram: three control-plane nodes, each with an API server and a separate local database, crossed out to show why independent copies cannot be used.]

You can't give each one its own private database, or they'd disagree about the cluster's state.

You need a shared database that all three API servers can talk to, and that database needs to stay consistent even when things go wrong.

[Diagram: three control-plane nodes where each API server connects to its local etcd, and the etcd instances replicate with each other to form shared state.]

What does "consistent" mean here?

Imagine a PersistentVolume is available in the cluster.

Two controllers, connected to different API servers, both read it and see it's free.

Both try to bind it to different PersistentVolumeClaims.

Without consistency, both writes succeed and two pods think they own the same disk.

And this is exactly what etcd solves.

etcd is a distributed key-value store that uses the Raft consensus algorithm to keep multiple nodes in sync.

[Diagram sequence:
  1. A worker asks for a persistent volume while two controllers on different API servers are present in the control plane.
  2. Both controllers observe the same volume as available and race to claim it at the same time.
  3. Both controllers attempt to bind their persistent volume claims to the same persistent volume, illustrating a consistency conflict.]
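The race above can be sketched with a revision-checked write, which is the shape of guarantee etcd provides: a write succeeds only if the key is still at the revision the writer last read. This is a toy in-memory model (the `store` type and its fields are invented for illustration), not the etcd client API:

```go
package main

import "fmt"

// Toy in-memory model of revision-checked writes: each key remembers the
// revision at which it was last modified.
type store struct {
	rev    int64
	values map[string]string
	modRev map[string]int64
}

// compareAndSet succeeds only if the key is still at expectedRev, mimicking
// an etcd transaction guarded by a mod-revision comparison.
func (s *store) compareAndSet(key, value string, expectedRev int64) bool {
	if s.modRev[key] != expectedRev {
		return false // someone else wrote first; the caller must re-read
	}
	s.rev++
	s.values[key] = value
	s.modRev[key] = s.rev
	return true
}

func main() {
	s := &store{values: map[string]string{}, modRev: map[string]int64{}}
	s.compareAndSet("pv-1", "available", 0)
	readRev := s.modRev["pv-1"] // both controllers read the same state

	// Both controllers race to bind the same volume at the revision they read.
	first := s.compareAndSet("pv-1", "bound-to-claim-a", readRev)
	second := s.compareAndSet("pv-1", "bound-to-claim-b", readRev)
	fmt.Println(first, second) // true false: only one writer can win
}
```

The losing controller gets a conflict, re-reads the volume, sees it is already bound, and moves on: no two pods end up owning the same disk.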

How does Raft work?

An etcd cluster elects a single leader node and all writes go to the leader.

The leader appends the write to its log, replicates it to the follower nodes, and only commits the write once a majority of nodes confirm they've persisted it.

If the leader crashes, the remaining nodes hold an election and pick a new leader.

As long as a majority of nodes are online, the cluster continues to function.

This gives you two things:

  1. Strong consistency: all clients always see the same data.
  2. High availability: the cluster survives node failures.

That guarantee is what makes multiple API servers possible.

[Diagram sequence:
  1. A client sends a write to the leader in a three-node etcd cluster; the value is received but not committed yet.
  2. The leader replicates the new key-value entry to both follower nodes.
  3. Followers acknowledge the replicated value and, after majority confirmation, the leader commits it to the log.]
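The sequence above can be modeled in a few lines. This is a toy model of the leader's commit rule (the `leader` type is invented for illustration, not etcd's actual Raft implementation): an entry is appended immediately, but only committed once the leader plus a majority of the cluster have persisted it.

```go
package main

import "fmt"

// Toy model of the leader's log: an entry is appended immediately but only
// committed once a majority of the cluster (leader included) has persisted it.
type leader struct {
	clusterSize int
	log         []string
	acks        []int // acks per entry, counting the leader's own disk write
	commitIndex int   // entries before this index are committed
}

func (l *leader) append(entry string) int {
	l.log = append(l.log, entry)
	l.acks = append(l.acks, 1) // the leader persists its own copy first
	return len(l.log) - 1
}

func (l *leader) ackFromFollower(index int) {
	l.acks[index]++
	if l.acks[index] >= l.clusterSize/2+1 && index+1 > l.commitIndex {
		l.commitIndex = index + 1 // majority reached: the write is committed
	}
}

func main() {
	l := &leader{clusterSize: 3}
	i := l.append("put /registry/pods/default/web")
	fmt.Println(l.commitIndex) // 0: appended, not yet committed
	l.ackFromFollower(i)       // one follower persisted it: 2 of 3 is a majority
	fmt.Println(l.commitIndex) // 1: committed
}
```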

If you want to see Raft in action, including how to build and break a multi-node etcd cluster yourself, check out How etcd works with and without Kubernetes.

But consensus has costs, and those costs are where the problems start.

The costs of consensus

etcd provides strong consistency, but every design decision that ensures consistency also imposes a ceiling on how far it can scale.

In a Raft cluster, there's always exactly one leader.

Every write request, no matter which node receives it, is forwarded to the leader.

[Diagram: a client write goes to one follower, gets forwarded to the leader, and then enters the replication cycle.]

The leader appends the write to its log, sends it to the follower nodes, and waits for a majority to confirm they've persisted it before committing.

What does this look like in practice?

It means write latency is at least one network round-trip to the followers plus a disk fsync on each node.

And it means write throughput is bounded by what one node can handle.

[Diagram: a single Raft leader replicating one write to many followers, illustrating how fan-out grows as the cluster gets larger.]

Adding more etcd nodes doesn't increase write capacity.

In fact, it makes it worse because the leader now has to replicate to more followers.

This is a key trade-off: you cannot scale writes horizontally in a Raft cluster.
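A quick sketch of why, assuming nothing beyond the quorum arithmetic: as the cluster grows, each write fans out to more followers and needs more acknowledgements, so per-write work goes up while write capacity stays flat.

```go
package main

import "fmt"

// Per-write cost as a Raft cluster grows: the leader replicates every write
// to every follower and must wait for a majority, so adding nodes adds work
// per write instead of adding write capacity.
func followers(n int) int  { return n - 1 }   // replication fan-out per write
func quorumAcks(n int) int { return n/2 + 1 } // acks needed to commit

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("nodes=%d replicate-to=%d acks-to-commit=%d\n",
			n, followers(n), quorumAcks(n))
	}
}
```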

For a typical Kubernetes cluster, this is fine: the API server writes metadata (pod specs, deployment definitions, and config maps).

These are small and relatively infrequent.

But in a cluster with tens of thousands of objects constantly churning, a single leader becomes a bottleneck.

The database lives in a single file

etcd stores all its data in bbolt, a B+ tree key-value store backed by a single file on disk.

etcd recommends a maximum database size of 8 GiB, and the default backend quota is just 2 GiB.

Individual values are capped at 1.5 MiB per request, and each key-value pair can be at most 1 MiB.

That means a single Kubernetes object (a Secret, a ConfigMap, a CRD instance) can't exceed 1 MiB when serialized.

If you've ever wondered why large ConfigMaps or Secrets get rejected, this is the wall they hit.

Why are these limits so strict?

Because Raft replicates everything.

When a follower falls too far behind (or a new node joins), the leader has to send it a full snapshot of the database.

If the database is 8 GiB, that snapshot is 8 GiB.

The larger the database, the longer snapshots take, the slower recovery is, and the higher the risk that something goes wrong during the transfer.

The database size limit is not arbitrary. It comes directly from how Raft manages replication and recovery.

For typical Kubernetes metadata (pod specs, service definitions, configmaps), a few gigabytes is generous.

But add CRDs, large secrets, many namespaces, and high churn, and you start bumping into that ceiling.

Every mutation creates a new revision

etcd uses multiversion concurrency control (MVCC).

Every time you write a key, etcd doesn't overwrite the old value: it creates a new revision of the entire dataset.

This is how resourceVersion works in Kubernetes.

Every object has a revision number, and controllers use it to watch for changes, resume watches after restarts, and detect conflicts during updates.

It's also the mechanism that makes rollbacks possible: Kubernetes can look back at previous revisions to know what changed.

But old revisions do not automatically go away.

Every write adds a new revision.

If you have 10,000 pods and each one gets updated once per minute, that's 10,000 new revisions per minute piling up in the database.

[Diagram: the API server applies a pod manifest to etcd, which stores multiple historical pod revisions under MVCC.]

This is why etcd requires compaction.

Compaction tells etcd to discard all revisions older than a certain point. Without it, the database grows monotonically regardless of how many keys you actually have.
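A toy model of the mechanism (the `mvcc` type is invented for illustration, not etcd's bbolt layout): every write appends a revision, and compaction discards revisions older than a cutoff while keeping the newest version of each key.

```go
package main

import "fmt"

// Toy MVCC store: every write appends a new revision instead of overwriting,
// and compaction discards revisions older than a cutoff.
type mvcc struct {
	rev       int64
	revisions map[string][]int64 // key -> revisions at which it was written
}

func (m *mvcc) put(key string) int64 {
	m.rev++
	m.revisions[key] = append(m.revisions[key], m.rev)
	return m.rev
}

// compact drops revisions older than cutoff, always keeping the newest
// revision of each key so the current state survives.
func (m *mvcc) compact(cutoff int64) {
	for key, revs := range m.revisions {
		var kept []int64
		for i, r := range revs {
			if r >= cutoff || i == len(revs)-1 {
				kept = append(kept, r)
			}
		}
		m.revisions[key] = kept
	}
}

func main() {
	m := &mvcc{revisions: map[string][]int64{}}
	for i := 0; i < 5; i++ {
		m.put("/registry/pods/default/web") // five status updates to one key
	}
	fmt.Println(len(m.revisions["/registry/pods/default/web"])) // 5 revisions
	m.compact(5)
	fmt.Println(len(m.revisions["/registry/pods/default/web"])) // 1 revision
}
```

One key, five updates, five revisions on disk until compaction runs: at 10,000 pods updated once per minute, that backlog grows by 10,000 revisions every minute.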

And even after compaction, the space isn't immediately reclaimed.

bbolt uses copy-on-write pages internally: when a page is freed, the space is marked as reusable, but the file doesn't shrink.

This is why etcd also requires defragmentation: a separate operation that rebuilds the database file to reclaim the freed space.

If compaction can't keep up with the mutation rate, the database grows faster than you can shrink it. Eventually, it hits the backend quota.

And when it hits the quota, etcd goes into alarm mode: no more writes until you recover space.

That means your entire Kubernetes control plane stops accepting changes.

No new pods, no scaling, no deployments.

Watch events come from the leader

Kubernetes controllers don't poll the API server.

They open long-lived watch connections and receive events as objects change.

Under the hood, the API server maintains watch connections to etcd.

[Diagram: the controller manager's watch connection to the API server receives change events sourced from etcd for a deployment.]

When a key changes, etcd streams the event to every watcher interested in that key range.

What happens if you have thousands of watchers?

Every time a pod status updates, etcd needs to evaluate which watchers care about that key and send them the event.

More objects multiplied by more controllers multiplied by more namespaces means more work for the leader on every single write.

At large scale, the leader can spend more time distributing watch events than actually processing writes.
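The fan-out cost can be sketched directly (watchers here are plain prefixes; real etcd matches key ranges with an interval tree): one write turns into one event per interested watcher.

```go
package main

import (
	"fmt"
	"strings"
)

// On every write, etcd must find which watchers cover the changed key and
// deliver the event to each one. Watchers here are plain prefixes.
type watcher struct {
	prefix string
	events int
}

// deliver sends the event for key to every watcher whose prefix covers it
// and returns how many events were sent for this single write.
func deliver(watchers []*watcher, key string) int {
	delivered := 0
	for _, w := range watchers {
		if strings.HasPrefix(key, w.prefix) {
			w.events++
			delivered++
		}
	}
	return delivered
}

func main() {
	// Ten controllers all watching the pods prefix.
	var ws []*watcher
	for i := 0; i < 10; i++ {
		ws = append(ws, &watcher{prefix: "/registry/pods/"})
	}
	// A single pod status update fans out to every interested watcher.
	fmt.Println(deliver(ws, "/registry/pods/default/web-0")) // 10
}
```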

What breaks at scale

These design decisions (Raft consensus, single-file storage, MVCC revisions, and watch fan-out) are all reasonable trade-offs for a metadata store.

But they compound at scale.

What does it look like when problems start to appear?

None of these issues are bugs; they are the results of the system's design.

For most Kubernetes clusters (hundreds of nodes, moderate churn), none of this matters. etcd handles it comfortably.

But for the people operating at 10,000+ nodes?

These are the walls they hit.

The first escape hatch: Kine and k3s

The first project to seriously question the etcd dependency was k3s, Rancher's lightweight Kubernetes distribution.

K3s needed to run on edge hardware, IoT devices, and single-node setups where operating a three-node etcd cluster was impractical.

But how can you remove etcd from Kubernetes when the API server is built to communicate with it?

The answer was Kine ("Kine is not etcd").

Kine is a shim that implements a subset of the etcd API and translates requests to a relational database: SQLite, PostgreSQL, MySQL/MariaDB, or NATS.

[Diagram: Kine sits between the Kubernetes API server and SQL backends such as SQLite, PostgreSQL, and MySQL.]

The key insight was simple but powerful: the Kubernetes API server doesn't talk to etcd's internals, but to the etcd gRPC API.

If you implement that API in front of a different database, the API server doesn't know the difference.
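As an illustration of the translation idea only (this is not Kine's actual schema or SQL), an etcd-style prefix Range can be answered by an ordinary SQL query over a key-value table:

```go
package main

import "fmt"

// rangeToSQL turns an etcd-style prefix Range into a SQL query over a plain
// key-value table. The table and column names are invented for illustration.
func rangeToSQL(prefix string) string {
	// etcd's prefix Range over [prefix, end) maps to a LIKE filter, and
	// ORDER BY reproduces etcd's key-ordered results.
	return "SELECT k, v FROM kv WHERE k LIKE '" + prefix + "%' ORDER BY k"
}

func main() {
	fmt.Println(rangeToSQL("/registry/pods/"))
}
```

The harder parts of the real shim are revisions and watches, which Kine reconstructs from a change log in the database rather than from a B+ tree.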

Kine only implements a subset of the etcd API, though.

It covers the access patterns Kubernetes actually uses, but it's not a general-purpose etcd replacement.

And you lose some of etcd's properties in the process.

For edge deployments and small clusters, this is a perfectly reasonable trade-off.

But the bigger insight is the pattern: you can decouple Kubernetes from etcd by reimplementing the etcd API, without touching Kubernetes itself.

That same pattern is exactly what the hyperscale cloud providers used.

What the hyperscalers built

In 2025, AWS announced support for clusters with up to 100,000 worker nodes.

Stock etcd can't handle 100,000 nodes.

This is a consequence of everything we've been discussing so far: the single-leader write path, the database size constraints, and the watch fan-out.

In their technical deep dive, AWS describes what they rebuilt.

But they kept the etcd API.

Why did they do this?

Because the alternative is worse.

The Kubernetes API server's storage layer is defined in k8s.io/apiserver/pkg/storage/interfaces.go, and that interface is built on top of etcd's model: revision-based operations, watch semantics, and prefix queries.

Changing that interface means forking the API server, and maintaining a fork of Kubernetes is an enormous ongoing cost.

It's cheaper to rewrite etcd than to rewrite Kubernetes.

So AWS did what Kine did, but at a different scale: they built a new storage engine that speaks the etcd API, and plugged it in where etcd used to be.

Google took a different approach to the same problem.

When they announced 65,000-node GKE clusters, they described replacing open-source etcd with a Spanner-based key-value backend that implements the etcd API.

Spanner is Google's globally distributed database.

It doesn't have bbolt's single-file size limit, or a single-leader write bottleneck.

It's designed for exactly the kind of throughput and durability that etcd caps out on.

They later pushed this to 130,000 nodes using the same approach.

But even with Spanner as the backend, Google still lists strict constraints for clusters of this size: no cluster autoscaler, headless services limited to 100 pods, and one pod per node.

Why is that?

Because the storage layer is only one bottleneck.

Replacing etcd with Spanner fixes the database ceiling, but the API server itself still has processing limits.

It still has to serialize and deserialize every object, evaluate admission controllers, and distribute watch events to every interested controller.

Beyond that, the scheduler processes pods one at a time, the kubelet reports back on every status update, and the network has bandwidth limits of its own.

Even with an infinitely scalable database, the rest of the system has its own ceilings.

That's why 65,000-node clusters come with operational restrictions that other clusters don't.

Why upstream Kubernetes is still coupled to etcd

If Kine, EKS, and GKE all managed to swap the backend, why can't upstream Kubernetes just make storage pluggable?

Here's what the API server's storage interface actually looks like (trimmed for readability):

interfaces.go

type Interface interface {
    Versioner() Versioner

    Create(ctx context.Context, key string, obj, out runtime.Object, ttl uint64) error

    Delete(ctx context.Context, key string, out runtime.Object,
        preconditions *Preconditions, validateDeletion ValidateObjectFunc,
        cachedExistingObject runtime.Object, opts DeleteOptions) error

    Watch(ctx context.Context, key string, opts ListOptions) (watch.Interface, error)

    Get(ctx context.Context, key string, opts GetOptions, objPtr runtime.Object) error

    GetList(ctx context.Context, key string, opts ListOptions, listObj runtime.Object) error

    GuaranteedUpdate(ctx context.Context, key string, destination runtime.Object,
        ignoreNotFound bool, preconditions *Preconditions,
        tryUpdate UpdateFunc, cachedExistingObject runtime.Object) error

    // TODO: Remove when storage.Interface will be separate from etc3.store.
    // Deprecated: Added temporarily to simplify exposing RequestProgress for watch cache.
    RequestWatchProgress(ctx context.Context) error

    GetCurrentResourceVersion(ctx context.Context) (uint64, error)

    CompactRevision() int64
}

Every operation is revision-aware: deletes and updates take preconditions, and the interface exposes the current resource version and the compaction revision directly.

The interface was designed for etcd. It works as an abstraction, but it's an etcd-shaped one.

But it's not just about code: changing the storage backend means the entire ecosystem needs to support the change.

So how did the cloud providers work around this?

They own the full stack: they compile their own API server binaries, wire in their own storage implementations, and run their own conformance suites.

For everyone running upstream Kubernetes, etcd is the only supported backend.

Kubernetes is slowly loosening the coupling

The community hasn't ignored the pressure.

Recent Kubernetes releases include significant work to reduce the API server's direct reliance on etcd.

Historically, every strongly consistent read hit etcd.

Even a simple kubectl get pods could generate etcd traffic.

Since Kubernetes 1.31, the API server can serve consistent reads from its in-memory watch cache when it can verify the cache is fresh.

Same consistency guarantees, but the read never reaches etcd.
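A sketch of the freshness check, heavily simplified (the types and names here are invented): the cache may serve a strongly consistent read only once it has provably caught up to the revision etcd reported.

```go
package main

import "fmt"

// Simplified freshness check: the API server's watch cache tracks the highest
// revision it has observed, and may serve a strongly consistent read only if
// it has caught up to the revision etcd reported when the request arrived.
type watchCache struct {
	revision int64 // highest revision the cache has seen
}

func (c *watchCache) canServeConsistent(latestEtcdRev int64) bool {
	return c.revision >= latestEtcdRev
}

func main() {
	cache := &watchCache{revision: 97}
	fmt.Println(cache.canServeConsistent(100)) // false: cache is stale
	cache.revision = 100                       // a progress notify catches it up
	fmt.Println(cache.canServeConsistent(100)) // true: serve from memory
}
```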

Large LIST operations buffer the entire response in memory before sending it.

For a cluster with 50,000 pods, that's a massive allocation.

Since Kubernetes 1.33, the API server can stream list responses incrementally, reducing memory spikes and improving concurrency.

The pattern is clear: Kubernetes is building a thicker layer between the API server and etcd.

Each release reduces the amount of direct etcd traffic generated by the API server. etcd is still there, but it handles less of the load.

So, does Kubernetes still need etcd?

After all this, what is the practical answer?

If you're running upstream Kubernetes, yes. etcd is the only supported storage backend, and the API server is deeply coupled to its semantics. That won't change anytime soon.

If you're running a managed service like GKE or EKS, the provider may have already replaced etcd's internals with something that scales further. You're using Kubernetes without being limited by etcd's ceilings.

If you're running k3s or similar distributions, Kine gives you the option to swap etcd for a relational database, with the understanding that it's a subset implementation.

For most clusters, etcd handles the workload comfortably. If you're pushing past that, the options exist. They're just not upstream yet.
