3 Levels of Load Testing GKE Workload Identity


I manage multitenant Google Kubernetes Engine (GKE) clusters for stateless backend services at work. Google recently graduated GKE’s Workload Identity (WI) feature to general availability (GA). When my team used WI during its beta stage, it seemed to fail whenever a single GKE node made more than 16 requests per second (RPS) to retrieve Google access tokens.

Before we knew about this low RPS failure threshold, we told many internal Spotify engineering teams to go ahead and use the feature. In hindsight, we should’ve load tested the feature before making it generally available internally, especially since it wasn’t even GA publicly.

My efforts to load test WI have grown more sophisticated over time. This post describes the progression. It’s like the “4 Levels of …” Epicurious YouTube videos. The goal here is to find out at what RPS WI starts to fail and to learn some generalizable lessons about load testing vendor-managed services.

tl;dr lessons learned

  • always load test new features above and beyond what you expect your production load will be
  • use proper load testing tools and not bash for loops

My specific GKE cluster configuration

  • GKE masters and nodes running version 1.15.9-gke.22
  • regional cluster in Google Cloud Platform (GCP) (not on-premise)
  • 4 GKE nodes that are n1-standard-32 GCE instances in one node pool
  • each node is configured to have a maximum of 32 Pods
  • cluster and node pool have WI enabled

High level of what Workload Identity is and how it works

Workloads on GKE often need to access GCP resources like PubSub or CloudSQL. To do so, your workload needs to use a Google Service Account (GSA) key that is authorized to access those resources. So you end up creating keys for all your GSAs and copy-pasting them into Kubernetes Secrets for your workloads. This is insecure and unmaintainable for a company with dozens of engineering teams and hundreds of workloads.

So GCP offers WI, which allows a Kubernetes Service Account (KSA) to be associated with a GSA. If a workload runs as that KSA, it transparently gets Google access tokens for the associated GSA. No more manually copy-pasting GSA keys!

How does this work? You have to enable WI on your cluster and node pool. This creates a gke-metadata-server DaemonSet in the kube-system namespace. gke-metadata-server is the entrypoint to the whole WI system. Here’s a nice Google Cloud Next conference talk with more details.
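For context, here’s a minimal sketch of what that setup usually looks like, using placeholder names (my-cluster, my-pool, my-namespace, my-ksa, and the my-gsa@my-project GSA that also appears in the load tests below). Flag names have shifted across gcloud releases, so treat these as illustrative rather than exact.

# Enable WI on the cluster and node pool (sketch; flags vary by gcloud version).
gcloud container clusters update my-cluster \
  --workload-pool=my-project.svc.id.goog
gcloud container node-pools update my-pool \
  --cluster my-cluster \
  --workload-metadata=GKE_METADATA

# Allow the KSA to impersonate the GSA.
gcloud iam service-accounts add-iam-policy-binding \
  my-gsa@my-project.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:my-project.svc.id.goog[my-namespace/my-ksa]"

# Annotate the KSA so gke-metadata-server knows which GSA's tokens to hand out.
kubectl annotate serviceaccount my-ksa \
  --namespace my-namespace \
  iam.gke.io/gcp-service-account=my-gsa@my-project.iam.gserviceaccount.com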

gke-metadata-server is the only part of WI that is exposed to GKE users, i.e. runs on machines you control. It’s like the Verizon FiOS box in your basement. You control your house, but there’s a little box that Verizon owns and operates in there. All other parts of WI run on GCP infrastructure that you can’t see. When I saw failures with WI, it all seemed to happen in gke-metadata-server. So that’s what I’ll load test.
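If you want to inspect the copy running in your own cluster, something like this should pull it:

kubectl --context [CONTEXT] -n kube-system get daemonset gke-metadata-server -o yaml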

Here’s the gke-metadata-server DaemonSet YAML for reference. As of this writing, the image is gke.gcr.io/gke-metadata-server:20200218_1145_RC0. You might see different behavior with different images.

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  creationTimestamp: "2019-10-15T17:04:40Z"
  generation: 8
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: gke-metadata-server
  name: gke-metadata-server
  namespace: kube-system
  resourceVersion: "138588210"
  selfLink: /apis/extensions/v1beta1/namespaces/kube-system/daemonsets/gke-metadata-server
  uid: e06885d8-ef6d-11e9-88c9-42010a8c0110
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: gke-metadata-server
  template:
    metadata:
      annotations:
        components.gke.io/component-name: gke-metadata-server
        components.gke.io/component-version: 0.2.21
        scheduler.alpha.kubernetes.io/critical-pod: ''
      creationTimestamp: null
      labels:
        addonmanager.kubernetes.io/mode: Reconcile
        k8s-app: gke-metadata-server
    spec:
      containers:
      - command:
        - /gke-metadata-server
        - --logtostderr
        - --token-exchange-endpoint=https://securetoken.googleapis.com/v1/identitybindingtoken
        - --identity-namespace=[redacted].svc.id.goog
        - --identity-provider-id=https://container.googleapis.com/v1/projects/[redacted]/locations/asia-east1/clusters/[redacted]
        - --passthrough-ksa-list=kube-system:container-watcher-pod-reader,kube-system:event-exporter-sa,kube-system:fluentd-gcp-scaler,kube-system:heapster,kube-system:kube-dns,kube-system:metadata-agent,kube-system:network-metering-agent,kube-system:securityprofile-controller,istio-system:istio-ingressgateway-service-account,istio-system:cluster-local-gateway-service-account,csm:csm-sync-agent,knative-serving:controller
        - --attributes=cluster-name=[redacted],cluster-uid=[redacted],cluster-location=asia-east1
        - --enable-identity-endpoint=true
        - --cluster-uid=[redacted]
        image: gke.gcr.io/gke-metadata-server:20200218_1145_RC0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            host: 127.0.0.1
            path: /healthz
            port: 54898
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        name: gke-metadata-server
        resources:
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/kubeconfig
          name: kubelet-credentials
          readOnly: true
        - mountPath: /var/lib/kubelet/pki/
          name: kubelet-certs
          readOnly: true
        - mountPath: /var/run/
          name: container-runtime-interface
        - mountPath: /etc/srv/kubernetes/pki
          name: kubelet-pki
          readOnly: true
        - mountPath: /etc/ssl/certs/
          name: ca-certificates
          readOnly: true
      dnsPolicy: Default
      hostNetwork: true
      nodeSelector:
        beta.kubernetes.io/os: linux
        iam.gke.io/gke-metadata-server-enabled: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: gke-metadata-server
      serviceAccountName: gke-metadata-server
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoExecute
        operator: Exists
      - effect: NoSchedule
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pki/
          type: Directory
        name: kubelet-certs
      - hostPath:
          path: /var/lib/kubelet/kubeconfig
          type: File
        name: kubelet-credentials
      - hostPath:
          path: /var/run/
          type: Directory
        name: container-runtime-interface
      - hostPath:
          path: /etc/srv/kubernetes/pki/
          type: Directory
        name: kubelet-pki
      - hostPath:
          path: /etc/ssl/certs/
          type: Directory
        name: ca-certificates
  templateGeneration: 8
  updateStrategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate

Level 1

What kind of load am I putting on gke-metadata-server? Since this DaemonSet exists to give out Google access tokens, I’ll send it HTTP requests asking for such tokens.
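For reference, a single token request to gke-metadata-server looks roughly like this; it’s the same endpoint and header the vegeta runs in Level 3 use, with my-gsa@my-project as a placeholder GSA.

# Ask the metadata server for an access token for the GSA bound to this Pod's KSA.
curl -s -H 'Metadata-Flavor: Google' \
  "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token"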

I built a Docker image with the following Dockerfile.

FROM google/cloud-sdk
ENTRYPOINT while true; do for i in {1..20}; do curl -X GET https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=$(gcloud auth print-access-token) & done; wait; done;

Then I created the following K8s Deployment YAML.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: wi-test
  namespace: [K8S_NAMESPACE]
spec:
  replicas: 7
  selector:
    matchLabels:
      app: wi-test
  template:
    metadata:
      labels:
        app: wi-test
    spec:
      nodeSelector:
        kubernetes.io/hostname: [NODE-NAME]
      containers:
      - image: my-docker-image
        name: workload-identity-test

I ran seven of these Pods on a single node (see the nodeSelector above) to target a single instance of gke-metadata-server.

This isn’t a great test because the Container does a lot of extra work: it runs gcloud to print a Google access token (there may be bottlenecks in the gcloud command itself, which is Python code) and curls the googleapis.com endpoint to get the token info (originally done to verify the token was valid). There are probably bottlenecks in using a shell to do this, too. All in all, this implementation doesn’t let you specify a fixed RPS; you’re at the mercy of how fast your Container, shell, gcloud, and the network will let you execute it. I also wasn’t able to run more Pods on a single node because I was hitting the maximum of 32 Pods per node. A bunch of other GKE system-level workloads like Calico already took up node capacity.

Level 2

Apply this one Pod

cat <<EOF | kubectl --context [CONTEXT] apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: wi-test
  namespace: [K8S_NAMESPACE]
spec:
  containers:
  - image: google/cloud-sdk
    name: wi-test
    command: [ '/bin/bash', '-c', '--' ]
    args: [ 'while true; do sleep 30; done;' ]
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: false
    resources:
      limits:
        cpu: 2
        memory: 4G
      requests:
        cpu: 2
        memory: 4G
EOF

Then kubectl exec in and run this command.

for i in {1..N}; do gcloud auth print-access-token & done; wait;

Everything seemed to work fine when N was 100. When N was 200, I got a few errors like the ones below. They look like client-side errors rather than server-side ones, though.

ERROR: gcloud failed to load: No module named 'ruamel.yaml.error'
gcloud_main = _import_gcloud_main()
import googlecloudsdk.gcloud_main
from googlecloudsdk.api_lib.iamcredentials import util as iamcred_util
from googlecloudsdk.api_lib.util import exceptions
from googlecloudsdk.core.resource import resource_printer
from googlecloudsdk.core.resource import yaml_printer
from googlecloudsdk.core.yaml import dict_like
from googlecloudsdk.core import yaml_location_value
from ruamel import yaml
from ruamel.yaml.main import * # NOQA
from ruamel.yaml.error import UnsafeLoaderWarning, YAMLError # NOQA

This usually indicates corruption in your gcloud installation or problems with your Python interpreter.

Please verify that the following is the path to a working Python 2.7 or 3.5+ executable:
/usr/bin/python3

If it is not, please set the CLOUDSDK_PYTHON environment variable to point to a working Python 2.7 or 3.5+ executable.

If you are still experiencing problems, please reinstall the Cloud SDK using the instructions here:
https://cloud.google.com/sdk/

ERROR: gcloud failed to load: cannot import name 'opentype' from 'pyasn1.type' (/usr/bin/../lib/google-cloud-sdk/lib/third_party/pyasn1/type/__init__.py)
from google.auth.crypt import _cryptography_rsa
import cryptography.exceptions


File "/usr/bin/../lib/google-cloud-sdk/lib/gcloud.py", line 67, in main
File "/usr/bin/../lib/google-cloud-sdk/lib/gcloud.py", line 48, in _import_gcloud_main
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 33, in <module>
File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/api_lib

gcloud doesn’t synchronize concurrent invocations with each other, and it sometimes writes files to disk. So this is also not a great load test: it still doesn’t let you achieve a specific RPS, and it has client-side bottlenecks.

Level 3

Use a proper HTTP load testing tool. A colleague told me about vegeta. It’s a seemingly good tool, but, more importantly, its commands are amazing. vegeta attack ....

I first start a golang Pod that just sleeps in a loop.

$ cat <<EOF | kubectl --context [CONTEXT] apply -f -
> apiVersion: v1
> kind: Pod
> metadata:
>   name: wi-test
>   namespace: [NAMESPACE]
> spec:
>   containers:
>   - image: golang:latest
>     name: wi-test
>     command: [ '/bin/bash', '-c', '--' ]
>     args: [ 'while true; do sleep 30; done;' ]
>     resources:
>       limits:
>         cpu: 2
>         memory: 4G
>       requests:
>         cpu: 2
>         memory: 4G
> EOF

pod/wi-test created

Then I get a shell in it.

kubectl --context [CONTEXT] -n [NAMESPACE] exec -it wi-test bash

Defaulting container name to wi-test.
Use 'kubectl describe pod/wi-test -n [NAMESPACE]' to see all of the containers in this pod.

root@wi-test:/go# go get github.com/tsenart/vegeta
root@wi-test:/go# vegeta -version

Version:
Commit:
Runtime: go1.14.1 linux/amd64
Date:

Let’s throw some load on WI! my-gsa@my-project.iam.gserviceaccount.com is the GSA associated with the KSA your workload runs as.

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 10 -duration=5s | vegeta report

Requests      [total, rate, throughput]         50, 10.20, 10.20
Duration      [total, attack, wait]             4.904s, 4.9s, 4.168ms
Latencies     [min, mean, 50, 90, 95, 99, max]  4.168ms, 6.137ms, 5.039ms, 9.591ms, 10.444ms, 31.452ms, 31.452ms
Bytes In      [total, mean]                     25300, 506.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:50
Error Set:

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 1000 -duration=5s | vegeta report
Requests      [total, rate, throughput]         5000, 1000.20, 127.51
Duration      [total, attack, wait]             31.175s, 4.999s, 26.176s
Latencies     [min, mean, 50, 90, 95, 99, max]  101.972ms, 11.003s, 7.652s, 30s, 30s, 30s, 30.001s
Bytes In      [total, mean]                     2011350, 402.27
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           79.50%
Status Codes  [code:count]                      0:1025  200:3975
Error Set:
Get "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 100 -duration=5s | vegeta report
Requests      [total, rate, throughput]         500, 100.20, 98.40
Duration      [total, attack, wait]             5.081s, 4.99s, 91.244ms
Latencies     [min, mean, 50, 90, 95, 99, max]  3.805ms, 106.449ms, 59.058ms, 306.334ms, 372.519ms, 506.703ms, 601.534ms
Bytes In      [total, mean]                     253000, 506.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:500
Error Set:

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 500 -duration=5s | vegeta report
Requests      [total, rate, throughput]         2500, 500.20, 43.29
Duration      [total, attack, wait]             34.072s, 4.998s, 29.074s
Latencies     [min, mean, 50, 90, 95, 99, max]  10.56ms, 12.579s, 756.03ms, 30s, 30s, 30s, 30.001s
Bytes In      [total, mean]                     746350, 298.54
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           59.00%
Status Codes  [code:count]                      0:1025  200:1475
Error Set:
Get "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 250 -duration=5s | vegeta report
Requests      [total, rate, throughput]         1250, 250.22, 28.52
Duration      [total, attack, wait]             34.996s, 4.996s, 30s
Latencies     [min, mean, 50, 90, 95, 99, max]  8.331ms, 6.347s, 376.419ms, 30s, 30s, 30s, 30.001s
Bytes In      [total, mean]                     504988, 403.99
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           79.84%
Status Codes  [code:count]                      0:252  200:998
Error Set:
Get "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 200 -duration=5s | vegeta report
Requests      [total, rate, throughput]         1000, 200.20, 28.28
Duration      [total, attack, wait]             32.43s, 4.995s, 27.435s
Latencies     [min, mean, 50, 90, 95, 99, max]  9.985ms, 2.739s, 188.509ms, 797.058ms, 30s, 30s, 30s
Bytes In      [total, mean]                     464002, 464.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           91.70%
Status Codes  [code:count]                      0:83  200:917
Error Set:
Get "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 150 -duration=5s | vegeta report
Requests      [total, rate, throughput]         750, 150.20, 146.53
Duration      [total, attack, wait]             5.118s, 4.993s, 125.078ms
Latencies     [min, mean, 50, 90, 95, 99, max]  3.747ms, 224.285ms, 171.325ms, 460.236ms, 549.18ms, 682.161ms, 892.25ms
Bytes In      [total, mean]                     379500, 506.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:750
Error Set:

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 175 -duration=5s | vegeta report
Requests      [total, rate, throughput]         875, 175.20, 24.46
Duration      [total, attack, wait]             34.097s, 4.994s, 29.103s
Latencies     [min, mean, 50, 90, 95, 99, max]  3.704ms, 1.687s, 231.652ms, 708.672ms, 2.432s, 30s, 30s
Bytes In      [total, mean]                     422004, 482.29
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           95.31%
Status Codes  [code:count]                      0:41  200:834
Error Set:
Get "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

root@wi-test:/go# echo "GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token" | vegeta attack -header 'Metadata-Flavor: Google' -rate 165 -duration=5s | vegeta report
Requests      [total, rate, throughput]         825, 165.20, 23.61
Duration      [total, attack, wait]             34.6s, 4.994s, 29.606s
Latencies     [min, mean, 50, 90, 95, 99, max]  3.483ms, 558.613ms, 222.111ms, 531.49ms, 622.473ms, 11.851s, 30s
Bytes In      [total, mean]                     413402, 501.09
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           99.03%
Status Codes  [code:count]                      0:8  200:817
Error Set:
Get "http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

After more bisection, I found that this specific instance of gke-metadata-server starts to fail at around 150 RPS. When it does fail, p99 latency skyrockets from under 1 second to 30 seconds, which is usually a sign of a rate limiter or quota.
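If you’d rather not bisect by hand, a sweep like the following sketch (the same vegeta invocation as above, just looped over rates) narrows down where the success ratio starts dropping.

# Sweep a few request rates against the same target and print a report for each.
TARGET="GET http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/my-gsa@my-project.iam.gserviceaccount.com/token"
for rate in 100 150 160 170 180 200 250; do
  echo "=== ${rate} RPS ==="
  echo "${TARGET}" \
    | vegeta attack -header 'Metadata-Flavor: Google' -rate "${rate}" -duration=5s \
    | vegeta report
done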

How have you tried load testing WI or other GKE features? What’re your favorite load testing tools for these cases, and what interesting behavior have you found?


Becoming a Better Public Speaker


At the beginning of this year, I set a goal of becoming a better public speaker and being more visible both in tech and in broader causes I believe in. I’m happy to say that in the last two months I gave three talks! Two were prepared talks with slides at tech conferences. The other was an unprepared conversation on a podcast. These were all technical and related to my work at Spotify. Outside of Spotify, I spoke for one minute at a mock political town hall in front of about 30 people and for roughly 15 minutes at a public policy forum in front of about the same number of people. But more on that later. Here are my technical talks. They wouldn’t have been possible without the help, feedback, and moral support of my Spotify colleagues.

1. Keynote at KubeCon + CloudNativeCon Europe 2019 in Barcelona on May 22, 2019

“How Spotify Accidentally Deleted All its Kube Clusters with No User Impact”

2. Kubernetes Podcast from Google on April 23, 2019

“Spotify, with David Xia”. Listen on Spotify here.

3. Joint talk with Google at Google Next SF on April 11, 2019

“GKE Usage Metering: Whose Line Item Is It Anyway?”


More About Nginx DNS Resolution Than You Ever Wanted to Know


This is a post about Nginx’s DNS resolution behavior I didn’t know about but wish I did before I started using Kubernetes (K8s).

Nginx caches statically configured domains once

Symptoms

I moved a backend service foo from running on a virtual machine to K8s. Foo’s clients include an Nginx instance configured with this upstream block.

upstream foo {
  server foo.example.com.;
}

server {
  ...

  location ~* /_foo/(.*) {
    proxy_pass https://foo/$1;
    ...
  }
}

K8s Pods can be rescheduled at any time, so their IPs aren’t stable. I’m supposed to use K8s Services to avoid caching these ephemeral Pod IPs, but in my case, for interoperability reasons, I was registering Pod IPs directly as A records for foo.example.com.. I noticed that after my Pod IPs changed, either because of rescheduling or because of an update to the Deployment, Nginx started throwing 502 Bad Gateway errors.

Root Problem

Nginx resolves statically configured domain names only once at startup or configuration reload time. So Nginx resolved foo.example.com. once at startup to several Pod IPs and cached them forever.

Solution
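A commonly used fix (sketched here with a placeholder resolver address, not necessarily the right config for every setup) is to make Nginx resolve the name at request time by putting the hostname in a variable and adding a resolver directive:

# 10.0.0.2 is a placeholder for whatever DNS server this Nginx instance can reach.
resolver 10.0.0.2 valid=10s;

server {
  ...

  location ~* /_foo/(.*) {
    # With a variable, Nginx re-resolves the name per request (honoring the
    # resolver's TTL) instead of caching the IPs once at startup.
    set $foo_backend foo.example.com.;
    proxy_pass https://$foo_backend/$1;
    ...
  }
}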


Using Upstream Kubernetes Java Models Is Much Better Than Raw YAML


It’s been a while since I blogged about something tech related, but I had some free time today.

Recently, I’ve been trying to refactor an internal Spotify deployment tool my team built and maintains. This deployment tool takes Kubernetes (k8s) YAML manifests, changes them, and essentially runs kubectl apply. We add metadata like labels to the k8s manifests.

Right now this tool receives the input YAML as strings, converts them to Jackson ObjectNodes, and manipulates those ObjectNodes. The disadvantage is that there’s no k8s type safety: we might accidentally add an invalid field to a Deployment or remove a required field from a Service.

My refactor uses upstream k8s model classes from kubernetes-client/java which are themselves generated from the official Swagger spec. Here’s a helpful Yaml utility class that deserializes YAML strings into concrete classes and can also serialize them back into YAML strings. So helpful.

Unfortunately, there are some bugs in the YAML (de)serialization that prevent me from finishing this effort.

Nonetheless, it’ll be much nicer to change k8s resources in a type-safe way instead of parsing and rewriting raw YAML strings.


Internet Meme Role Models


I’m compiling a list of Internet meme role models. Here’s what I have so far. These people (they must be real human beings) must have either gone out of their way to do the right thing in a smart manner, done something courageous (bonus points for being funny), or just been ridiculous. And they must be memeified. Exceptions will be made for exceptional but not-yet-memeified people like Hilde Lysiak.

Who else deserves to be on this list?

  • Snack Man defused a violent fight on the subway between a couple before anyone got hurt. He did it by standing in between them and munching on Pringles and Gummy Bears.
  • Ben Innes asked the EgyptAir Flight 181 hijacker, who commandeered the airplane with a fake explosive belt, if he could take a selfie with him. “I thought, why not? If he blows us all up it won’t matter anyway.”
  • Hilde Lysiak is ten years old and the writer and publisher of Orange Street News. She doesn’t take shit from anyone.
  • Salt Bae is…well you just have to see for yourself.

Now Max Knows How to Make a Latte


I work at a music company but am more interested in politics and history. Artists visit our office often.

So I, often ignorant of their fame, have casually interacted with them or criticized their milk-steaming technique when they’re using the office’s $20K espresso machine.

Only later am I told by their posse, “Did you know that was Mark Ronson/Bebe Rexha/Max Martin/etc?”

“It’s OK,” I say. “Now Max knows how to make a real latte.”


Making Dumplings With My Grandparents


Whenever I go back home to my parents’ house near Boston, if my maternal grandparents are there, they make hundreds of dumplings for me. I try to help out. We make everything from scratch including the skins. I’m good at rolling the skins but have much to learn on all other parts of the process. I’m becoming better at packing and closing the dumplings now though.

I’ve come to cherish this little tradition more and more. I need to plan my next trip to Boston!


Four Fascinating and Weird People


Here are the stories of four fascinating and weird people that will make you laugh, be inspired, or cringe. Chang and Eng Bunker were conjoined twins who married two sisters and were slave owners who sided with the Southern Confederacy. Rose Wilder Lane was the daughter of the author of the Little House children’s books, a founding member of the American Libertarian movement, and just an all-around boss ass bitch. John Harvey Kellogg was the inventor of corn flakes, a doctor, a zealous anti-masturbation campaigner, and a eugenicist.


Useful Site for TLS Server Test


My home server’s hard disk’s partition map was somehow corrupted. So I’m serving this website from Digital Ocean for now instead of my apartment. While rewriting the nginx server configs, I found this useful site that tests your server’s TLS configuration. It’ll give you a grade and warn you of weak encryption, key exchange protocols, cipher suites, etc.

Mozilla’s TLS configuration generator is useful for providing secure defaults.

I’m proud to say this site has an A.