I recently observed DNS resolution errors on a large Kubernetes (K8s) cluster. This behavior was only happening on 0.1% of K8s nodes. But the fact that this behavior wasn't self-healing and crippled tenant workloads, combined with my penchant for chasing rabbits down holes, meant I wasn't going to let it go. I emerged having learned how K8s Services' Cluster IP feature actually works. Explaining this feature, my particular problem, and my speculative fix is the goal of this post.
The large K8s cluster is actually a Google Kubernetes Engine (GKE) cluster with master version 1.17.14-gke.400 and node version 1.17.13-gke.2600. This is a multi-tenant cluster with hundreds of nodes. Each node runs dozens of user workloads. Some users said DNS resolution within their Pods on certain nodes wasn't working. I was able to reproduce this behavior with the following steps.
GKE runs kube-dns Pods and a Service on the cluster that provide DNS, and it configures kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names. See the K8s docs here. First I get the kube-dns Service's Cluster IP. This is the IP address to which DNS queries from Pods are sent.
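The post's original listing was lost in extraction; a typical way to fetch this value, assuming kube-dns lives in the default kube-system namespace that GKE uses, looks like:

```shell
# Look up the Cluster IP of the kube-dns Service.
# Assumes the default kube-system namespace used by GKE.
kubectl get service kube-dns \
  --namespace kube-system \
  --output jsonpath='{.spec.clusterIP}'
```

The jsonpath output prints only the `.spec.clusterIP` field, e.g. an address like `10.0.0.10`, which is what the next step queries against.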
Then I make DNS queries against the Cluster IP from a Pod running on a broken node.
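A minimal sketch of such a query, assuming a shell inside a Pod on the broken node with `nslookup` available (e.g. a busybox image), and using `10.0.0.10` as a placeholder for the Cluster IP found above:

```shell
# From inside a Pod on the broken node, send a DNS query directly
# to the kube-dns Cluster IP (10.0.0.10 is a placeholder value).
nslookup kubernetes.default.svc.cluster.local 10.0.0.10
```

On a healthy node this returns the API server Service's IP; on a broken node the query times out, which is the symptom the rest of the post investigates.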