How Kubernetes Routes IP Packets to Services’ Cluster IPs


I recently observed DNS resolution errors on a large Kubernetes (K8s) cluster. This behavior was only happening on 0.1% of K8s nodes. But the behavior wasn’t self-healing, it crippled tenant workloads, and I have a penchant for chasing rabbits down holes, so I wasn’t going to let it go. I emerged having learned how the Cluster IP feature of K8s Services actually works. Explaining that feature, my particular problem, and my speculative fix is the goal of this post.

The Problem

The large K8s cluster is actually a Google Kubernetes Engine (GKE) cluster with master version 1.17.14-gke.400 and node version 1.17.13-gke.2600. This is a multi-tenant cluster with hundreds of nodes, each running dozens of user workloads. Some users said DNS resolution within their Pods on certain nodes wasn’t working. I was able to reproduce this behavior with the following steps.

Kubernetes schedules kube-dns Pods and a Service on the cluster to provide DNS, and it configures kubelets to tell individual containers to use the DNS Service’s IP to resolve DNS names (see the K8s docs). First I get the kube-dns Service’s Cluster IP. This is the IP address to which DNS queries from Pods are sent.

kubectl --context my-gke-cluster -n kube-system get services kube-dns
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   <redacted>   <none>        53/UDP,53/TCP   666d
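
Inside each Pod, kubelet wires this up by writing the Cluster IP as the nameserver in the container’s /etc/resolv.conf. Something like this (the IP below is a placeholder for the actual Cluster IP):

cat /etc/resolv.conf
nameserver 10.0.0.10   # placeholder for the kube-dns Cluster IP
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5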

Then I make DNS queries against the Cluster IP from a Pod running on a broken node.
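
For example, something like this (the Pod name and Cluster IP are placeholders); on the broken nodes, these queries timed out:

kubectl --context my-gke-cluster exec -it my-pod -- \
  nslookup kubernetes.default.svc.cluster.local 10.0.0.10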

My Hints and Solutions to the First Three Levels of Over the Wire Vortex


I recently found more wargames at OverTheWire. Here are my hints and solutions for the first three levels of Vortex. The levels are cumulative: we have to beat the previous level in order to access the next.

Vortex Level 0 -> Level 1

Hint 1 (how much data): Connect to the host and port and read all the bytes you can. How many bytes do you get?

Hint 2 (endianness): “…read in 4 unsigned integers in host byte order” means the bytes are already in host byte order, which on most machines is little-endian. If your system is also little-endian, you don’t need to do anything special when interpreting the bytes.

Hint 3 (expected reply): How many bytes is each integer? What is the sum of all four?
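
Putting the hints together, here’s a sketch of a solver in Java. The host and port are assumptions based on the level description (vortex.labs.overthewire.org, port 5842); double-check them on the level page. The sum naturally wraps modulo 2^32, which is what the server expects.

import java.io.DataInputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Vortex0 {
    public static void main(String[] args) throws Exception {
        // assumed host and port from the level description
        try (Socket s = new Socket("vortex.labs.overthewire.org", 5842)) {
            DataInputStream in = new DataInputStream(s.getInputStream());

            // read exactly 16 bytes: four 4-byte unsigned integers
            byte[] raw = new byte[16];
            in.readFully(raw);

            // interpret them as little-endian (host byte order on most machines)
            ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
            long sum = 0;
            for (int i = 0; i < 4; i++) {
                sum += Integer.toUnsignedLong(buf.getInt());
            }

            // reply with the 4-byte little-endian sum (wraps mod 2^32)
            ByteBuffer reply = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            reply.putInt((int) sum);
            OutputStream out = s.getOutputStream();
            out.write(reply.array());

            // print whatever comes back (the credentials for level 1)
            byte[] resp = new byte[256];
            int n = in.read(resp);
            if (n > 0) System.out.println(new String(resp, 0, n));
        }
    }
}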

My Solution to Exploit Exercises Protostar Final2 Level


This is an explanation of Protostar level Final2. I wrote a solution in April without an explanation. I reread it last night and had to spend half a day understanding it again. So next time I’ll write the explanation while it’s still fresh in my head.

The level’s description is

Remote heap level :)
Core files will be in /tmp.
This level is at /opt/protostar/bin/final2

How to Analyze Mobile App Traffic and Reverse Engineer Its Non-Public API


Have you ever wanted to analyze the traffic between a mobile app and its servers or reverse engineer a mobile app’s non-public API? Here’s one way.

The basic principle is to proxy the traffic from the app through a computer you control on which you can capture and analyze traffic. If the app you’re interested in is using an unencrypted protocol like HTTP, this is pretty easy. Just run a proxy on your computer and configure your mobile device to proxy network traffic through your computer’s IP.
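
For example, with mitmproxy (a free proxy that can also intercept TLS; the IP and port below are placeholders for your computer’s LAN address):

# on your computer: start the proxy
mitmproxy -p 8080

# on the mobile device: set the Wi-Fi network's HTTP proxy to your
# computer's LAN IP (e.g. 192.168.1.2) and port 8080

If the app uses HTTPS, you’d also need to install the proxy’s CA certificate on the device so the proxy can decrypt the TLS traffic.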

How to Exploit Dlmalloc Unlink(): Protostar Level Heap3


While stuck inside during social distancing, I’ve been making my way through LiveOverflow’s awesome YouTube playlist “Binary Exploitation / Memory Corruption.” His videos are structured around a well-known series of exploit exercises called “Protostar.” I took the time to truly understand each one before moving on to the next, since the exercises build on each other. For the past several days I’ve been trying to understand the “Heap3” level, a relatively complex level that requires manipulating the heap to redirect code execution to an arbitrary function. After rewatching the video many times and reading numerous other online explanations, I finally understand! That moment of understanding feels so gratifying.

Many other resources already explain the exploit well, but I’m writing my own explanation to reinforce my understanding and to celebrate.
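
For reference, the heart of the technique is dlmalloc’s unlink macro. Old versions (like the one Protostar’s libc uses, which predates safe-unlinking checks) look roughly like this, simplified from the dlmalloc source:

/* Remove chunk P from its doubly linked free list. */
#define unlink(P, BK, FD) {                                       \
    FD = P->fd;                                                   \
    BK = P->bk;                                                   \
    FD->bk = BK;  /* writes BK to the address FD + offset of bk */\
    BK->fd = FD;  /* writes FD to the address BK + offset of fd */\
}

If a heap overflow lets you control P->fd and P->bk, the first assignment becomes a write of an attacker-chosen value to an attacker-chosen address, which heap3 uses to redirect execution (e.g., pointing a GOT entry at the level’s winner() function).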

How to Expose a Localhost-only Endpoint on GKE


In my previous post I wrote about how to load test GKE Workload Identity. In this post I’ll describe how to get metrics from gke-metadata-server, the part of Workload Identity that runs on your GKE clusters’ nodes. This solution is a temporary workaround until GKE provides a better way to get metrics on gke-metadata-server.

Gke-metadata-server runs as a K8s DaemonSet. It exposes metrics about itself in Prometheus text-based format. I want to have an external scraper make HTTP requests to periodically collect these metrics. Unfortunately, the Prometheus HTTP server only listens on the Container’s localhost interface. So how can we expose these metrics, i.e. make the HTTP endpoint available externally?

tl;dr lessons learned

  • socat is awesome (see the sketch below).
  • If something you need is running on a computer you control, you can always find a way to extract info from it if you’re resourceful enough.
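
Here’s the shape of the socat trick, as a sketch: run socat next to the process (e.g., in the DaemonSet Pod or a privileged Pod on the node) and forward a reachable port to the loopback-only one. Both ports below are placeholders:

# listen on all interfaces on port 9999 and forward each connection
# to the localhost-only metrics endpoint (placeholder port 8080)
socat TCP-LISTEN:9999,fork,reuseaddr TCP:127.0.0.1:8080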

3 Levels of Load Testing GKE Workload Identity


I manage multitenant Google Kubernetes Engine (GKE) clusters for stateless backend services at work. Google recently graduated GKE’s Workload Identity (WI) feature to generally available (GA). When my team used WI during its beta stage, it seemed to fail when there were more than 16 requests per second (RPS) on one GKE node to retrieve Google access tokens.

Before we knew about this low RPS failure threshold, we told many internal engineering teams to go ahead and use the feature. In hindsight, we should’ve load-tested the feature before making it generally available internally, especially since it wasn’t even GA publicly.

My efforts to load test WI have grown more sophisticated over time. This post describes the progression. It’s like the “4 Levels of …” Epicurious YouTube videos. The goal here is to find out at what RPS WI starts to fail and to try to learn some generalizable lessons from load testing vendor-managed services.

tl;dr lessons learned

  • always load test new features above and beyond what you expect your production load will be
  • use proper load testing tools and not bash for loops (see the sketch below)
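
To illustrate the second lesson: a proper tool can hold a steady request rate and report latency percentiles, which a bash for loop can’t. A sketch with hey, run from inside a Pod (the endpoint is the standard GKE metadata-server token URL; the rate and duration are placeholders):

# ~50 RPS for 60 seconds: 10 workers at 5 queries per second each
hey -z 60s -c 10 -q 5 \
  -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token"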

Becoming a Better Public Speaker


At the beginning of this year I set a goal of becoming a better public speaker and more visible in both tech and other broader causes I believe in. I’m happy to say that in the last two months I gave three talks! Two were prepared talks with slides at tech conferences. The other was an unprepared conversation on a podcast. These were all technical and related to my work at Spotify. Outside of Spotify, I spoke for one minute at a mock political town hall in front of about 30 people and for ~15 minutes at a public policy forum in front of roughly the same number of people. But more on that later. Here are my technical talks. These talks wouldn’t have been possible without the help, feedback, and moral support of my Spotify colleagues.

1. Keynote at KubeCon + CloudNativeCon Europe 2019 in Barcelona on May 22, 2019

“How Spotify Accidentally Deleted All its Kube Clusters with No User Impact”

2. Kubernetes Podcast from Google on April 23, 2019

“Spotify, with David Xia”. Listen on Spotify.

3. Joint talk with Google at Google Next SF on April 11, 2019

“GKE Usage Metering: Whose Line Item Is It Anyway?”

More About Nginx DNS Resolution Than You Ever Wanted to Know


This is a post about Nginx’s DNS resolution behavior that I didn’t know about but wish I had before I started using Kubernetes (K8s).

Nginx caches statically configured domains once


I moved a backend service foo from running on a virtual machine to K8s. Foo’s clients include an Nginx instance running outside K8s configured with this upstream block.

upstream foo {
  # the domain whose A records point at foo's Pod IPs; the real name is
  # elided here, foo.example.com is a hypothetical stand-in
  server foo.example.com:443;
}

server {
  # ...other directives elided...

  location ~* /_foo/(.*) {
    proxy_pass https://foo/$1;
  }
}
K8s Pods can be rescheduled at any time, so their IPs aren’t stable. I’m supposed to use K8s Services to avoid caching these ephemeral Pod IPs. But in my case, for interoperability reasons, I was registering Pod IPs directly as A records for the domain. I started noticing that after my Pod IPs changed, either because of rescheduling or because of updating the Deployment, Nginx started throwing 502 Bad Gateway errors.

Root Problem

Nginx resolves statically configured domain names only once, at startup or configuration-reload time. So Nginx resolved the domain once at startup to several Pod IPs and cached those IPs forever.
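
A common workaround (a sketch, not necessarily the fix I ended up with; the resolver IP and domain are placeholders) is to combine a resolver directive with a variable in proxy_pass. Using a variable forces Nginx to re-resolve the name at request time and honor the DNS record’s TTL instead of caching the IPs forever:

resolver 10.0.0.2 valid=10s;   # placeholder DNS server IP

location ~* /_foo/(.*) {
  # a variable in proxy_pass bypasses the startup-time resolution
  set $foo_backend foo.example.com;
  proxy_pass https://$foo_backend/$1;
}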


Using Upstream Kubernetes Java Models Is Much Better Than Raw YAML


It’s been a while since I blogged about something tech related, but I had some free time today.

Recently, I’ve been trying to refactor an internal Spotify deployment tool my team built and maintains. This deployment tool takes Kubernetes (k8s) YAML manifests, changes them, and essentially runs kubectl apply. We add metadata like labels to the k8s manifests.

Right now this tool receives the input YAML as strings, converts them to Jackson ObjectNodes, and manipulates those ObjectNodes. The disadvantage of this is that there’s no k8s type safety. We might accidentally add a field to a Deployment that isn’t valid or remove something from a Service that’s required.

My refactor uses upstream k8s model classes from kubernetes-client/java which are themselves generated from the official Swagger spec. Here’s a helpful Yaml utility class that deserializes YAML strings into concrete classes and can also serialize them back into YAML strings. So helpful.
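
A sketch of what that looks like (the package names match recent kubernetes-client/java releases; the file path argument and the label key/value are made up):

import io.kubernetes.client.openapi.models.V1Deployment;
import io.kubernetes.client.util.Yaml;
import java.nio.file.Files;
import java.nio.file.Paths;

public class AddLabels {
    public static void main(String[] args) throws Exception {
        String manifest = new String(Files.readAllBytes(Paths.get(args[0])));

        // deserialize into a typed model instead of a Jackson ObjectNode
        V1Deployment deployment = Yaml.loadAs(manifest, V1Deployment.class);

        // the compiler now guarantees this field exists on a Deployment
        deployment.getMetadata().putLabelsItem("team", "my-team");

        // serialize back to a YAML string for kubectl apply
        System.out.println(Yaml.dump(deployment));
    }
}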

Unfortunately, there are some bugs in the YAML (de)serialization that prevent me from finishing this effort.

Nonetheless, it’ll be much nicer to change k8s resources in a type-safe way instead of parsing and rewriting raw YAML strings.