Benchmarking Kafka and Google Cloud Pub/Sub Latencies


I’m helping a recently acquired team at work figure out if they can migrate from Kafka to Google Cloud Pub/Sub. Part of the exploration was figuring out the change in latencies, if any, from switching.

The team’s production setup is like this.

  • They paid an external company called Confluent to run a managed Kafka cluster in AWS Oregon.
  • This is the same region where this team ran all their backend services. Part of their migration also involves switching their workloads from AWS Oregon to GCP us-central1. If they choose to migrate to Pub/Sub, their services will be publishing and subscribing to messages across cloud providers and regions. So my latency benchmarks took that into account.
  • All their services are written in Golang.
  • Services run as containers in AWS Elastic Container Service.

I defined latency as the time elapsed from when a message is published to when it’s received by a subscriber. I didn’t count the extra time it takes for a subscriber to acknowledge the message. I used Golang and the same upstream libraries for Kafka and Pub/Sub that they used or would use, respectively, in production.

I published messages of various sizes at various rates from AWS EC2 instances in Oregon for five minutes. At the same time, five Google Compute Engine instances in us-central1 subscribed to these messages (pull-based) as fast as possible, with an initial burn-in period of one minute. I didn’t measure latency until the burn-in period elapsed, to avoid any effects that may arise from using a new topic or subscription or from not having enough messages flowing through the messaging service. This more closely mimicked message latency in production. I always took the percentile summary from the subscriber with the second-highest p99 latency. I created new Pub/Sub or Kafka topics for each series in the graphs below. Kafka topics always had eight partitions.
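To make the measurement concrete, here’s a minimal sketch of the subscriber side of such a benchmark, assuming the publisher writes its send time (nanoseconds since the Unix epoch) into the first eight bytes of each message payload. The project and subscription IDs are placeholders; the actual benchmarker linked below handles the burn-in, the percentile summaries, and the Kafka side.

package main

import (
	"context"
	"encoding/binary"
	"log"
	"time"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	// Placeholder project and subscription IDs.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	sub := client.Subscription("latency-bench")

	// The publisher is assumed to put its send time, as nanoseconds since the
	// Unix epoch, in the first 8 bytes of each message payload.
	err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
		sent := time.Unix(0, int64(binary.BigEndian.Uint64(m.Data[:8])))
		log.Printf("publish-to-receive latency: %v", time.Since(sent))
		m.Ack() // time to ack is deliberately not counted
	})
	if err != nil {
		log.Fatal(err)
	}
}

Note that comparing timestamps across machines like this only works if the publisher and subscriber clocks are reasonably synchronized.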

I took some inspiration from a blog post titled “Benchmarking Message Queue Latency” and also found the GCP post “Testing Cloud Pub/Sub clients to maximize streaming performance.” The latter linked to the code used to benchmark Pub/Sub. Unfortunately, after trying that tool many times and finding it wasn’t well documented and had various issues like this, I gave up and wrote my own simple latency benchmarker in Golang. This was probably better anyway, since it ensured I was using the same language and client libraries as the team I was helping.

My full results are in this Google sheet. The benchmarking code is at github.com/davidxia/cloud-message-latency.


Notes on Michael Lewis’ The Premonition


Last week I finished reading Michael Lewis’ The Premonition. The following parts of the book (with page numbers) stood out to me.

The U.S. Centers for Disease Control (CDC) is portrayed as a risk-averse bureaucracy that wants to study disease rather than take strong measures to control it. Sometimes this interest conflicts with local health officials, who want to save lives and see strong measures as necessary even when not all the evidence is available yet. Health officials are always firefighting and can’t wait for more data. Lewis compares them to platoon leaders during battle. (page 40)

Deadly mistakes often result from a combination of systemic and human failures. Lewis tells the story of a Veterans Affairs (VA) patient who was accidentally boiled alive in an Atlanta VA hospital. The hospital heated water to a specific temperature hot enough to kill certain bacteria but not hot enough to scald people. Bathtub faucets had a special valve that prevented water that was too hot from emerging. The water heating mechanism was broken, however, so the nurses compensated by adjusting the valve to a hotter temperature. Then one day, plumbers fixed the heating mechanism without telling the nurses. Normally a patient would tell the nurses when the water was too hot. But the nurses happened to be bathing a patient who was an older man with mental health problems. He always screamed no matter what, so the nurses didn’t think anything was wrong when he screamed this time. “An hour later, the man’s skin was peeling away, and he was dying of thermal burns.” (67) This is a powerful story. Unfortunately, I’m unable to find corroborating news articles, and Lewis doesn’t provide references or footnotes.

Why and how people learn:

…people don’t learn what is imposed upon them but rather what they freely seek, out of desire or need. For people to learn, they need to want to learn… “People in an organization learn,” said Carter. “They’re learning all kinds of things. But they aren’t learning what you are teaching them. You go to a formal meeting. The important conversation is not in the meeting. It’s in the halls during the breaks. And usually what’s important is taboo. And you can’t say it in the formal meeting.” (72-3)

Mushroom Foraging in Waltham


I collected mushrooms this morning with Gloria and my parents. It rained a lot last week in Waltham, MA, and many mushrooms had sprung up in nearby Prospect Hill Park. The forecast for today said highs would reach 32°C, so I wanted to go out in the morning while it was not too hot. Over my mother’s objections, Gloria, my father, and I picked different mushrooms and carried them back to the car. Next time we’ll bring a basket and a trowel. I took them back only to identify them. I am definitely neither experienced nor confident enough to eat any wild mushrooms I find.

Do not use this post as any basis for consuming mushrooms yourself. Some mushrooms are extremely poisonous and can be fatal if ingested.

Here are photos of what we found and my amateur guesses at what they are.


How to Install Grpcio Pip Package on Apple M1


I spent a long time figuring out how to install the latest grpcio Pip package (version 1.37.1) on my Apple M1 MacBook.

pip install grpcio

Looking in indexes: https://pypi.org/simple, https://artifactory.spotify.net/artifactory/api/pypi/pypi/simple
Collecting grpcio
  Downloading https://artifactory.spotify.net/artifactory/api/pypi/pypi/packages/packages/a0/d6/d04c6550debe23e2eaef0d9c4adccbb6e20d8cce6da40ae989fe8836e287/grpcio-1.37.1.tar.gz (21.7 MB)
     |████████████████████████████████| 21.7 MB 143 kB/s
Requirement already satisfied: six>=1.5.2 in ./.virtualenvs/spotify/lib/python3.9/site-packages (from grpcio) (1.12.0)
Building wheels for collected packages: grpcio
  Building wheel for grpcio (setup.py) ... error
  ERROR: Command errored out with exit status 1:
...

  third_party/zlib/gzlib.c:252:9: error: implicit declaration of function 'lseek' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
          LSEEK(state->fd, 0, SEEK_END);  /* so gzoffset() is correct */
          ^
  third_party/zlib/gzlib.c:14:17: note: expanded from macro 'LSEEK'
  #  define LSEEK lseek
                  ^
  third_party/zlib/gzlib.c:252:9: note: did you mean 'fseek'?
  third_party/zlib/gzlib.c:14:17: note: expanded from macro 'LSEEK'
  #  define LSEEK lseek
                  ^
  /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/stdio.h:162:6: note: 'fseek' declared here
  int      fseek(FILE *, long, int);
           ^
  third_party/zlib/gzlib.c:258:24: error: implicit declaration of function 'lseek' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
          state->start = LSEEK(state->fd, 0, SEEK_CUR);
                         ^
  third_party/zlib/gzlib.c:14:17: note: expanded from macro 'LSEEK'
  #  define LSEEK lseek
                  ^
  third_party/zlib/gzlib.c:359:9: error: implicit declaration of function 'lseek' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
      if (LSEEK(state->fd, state->start, SEEK_SET) == -1)
          ^
  third_party/zlib/gzlib.c:14:17: note: expanded from macro 'LSEEK'
  #  define LSEEK lseek
                  ^
  third_party/zlib/gzlib.c:400:15: error: implicit declaration of function 'lseek' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
          ret = LSEEK(state->fd, offset - state->x.have, SEEK_CUR);
                ^
  third_party/zlib/gzlib.c:14:17: note: expanded from macro 'LSEEK'
  #  define LSEEK lseek
                  ^
  third_party/zlib/gzlib.c:496:14: error: implicit declaration of function 'lseek' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
      offset = LSEEK(state->fd, 0, SEEK_CUR);
               ^
  third_party/zlib/gzlib.c:14:17: note: expanded from macro 'LSEEK'
  #  define LSEEK lseek
                  ^
  5 errors generated.

...

  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include -I/Users/dxia/.virtualenvs/spotify/include -I/Users/dxia/.pyenv/versions/3.9.1/include/python3.9 -c /var/folders/x1/f9sjnv7j43z73sdv5lsk3r8h0000gp/T/tmpyvic7ha6/a.c -o None/var/folders/x1/f9sjnv7j43z73sdv5lsk3r8h0000gp/T/tmpyvic7ha6/a.o
  Traceback (most recent call last):
    File "/Users/dxia/.pyenv/versions/3.9.1/lib/python3.9/distutils/unixccompiler.py", line 117, in _compile
      self.spawn(compiler_so + cc_args + [src, '-o', obj] +
    File "/private/var/folders/x1/f9sjnv7j43z73sdv5lsk3r8h0000gp/T/pip-install-1ha5py6y/grpcio_12658497b5464faa852de046ce91485a/src/python/grpcio/_spawn_patch.py", line 54, in _commandfile_spawn
      _classic_spawn(self, command)
    File "/Users/dxia/.pyenv/versions/3.9.1/lib/python3.9/distutils/ccompiler.py", line 910, in spawn
      spawn(cmd, dry_run=self.dry_run)
    File "/Users/dxia/.pyenv/versions/3.9.1/lib/python3.9/distutils/spawn.py", line 87, in spawn
      raise DistutilsExecError(
  distutils.errors.DistutilsExecError: command '/usr/bin/clang' failed with exit code 1

...

    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/dxia/.virtualenvs/spotify/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/x1/f9sjnv7j43z73sdv5lsk3r8h0000gp/T/pip-install-1ha5py6y/grpcio_12658497b5464faa852de046ce91485a/setup.py'"'"'; __file__='"'"'/private/var/folders/x1/f9sjnv7j43z73sdv5lsk3r8h0000gp/T/pip-install-1ha5py6y/grpcio_12658497b5464faa852de046ce91485a/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/x1/f9sjnv7j43z73sdv5lsk3r8h0000gp/T/pip-record-n4ihfdh1/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dxia/.virtualenvs/spotify/include/site/python3.9/grpcio Check the logs for full command output.

Fixed by setting the following (I use fish shell). I found the first four environment variables in this GitHub comment. The last two I added because I was seeing errors about the compiler not being able to find the openssl.h and re.h header files.

set -x GRPC_BUILD_WITH_BORING_SSL_ASM ""
set -x GRPC_PYTHON_BUILD_SYSTEM_RE2 true
set -x GRPC_PYTHON_BUILD_SYSTEM_OPENSSL true
set -x GRPC_PYTHON_BUILD_SYSTEM_ZLIB true
set -x CPPFLAGS "-I"(brew --prefix openssl)"/include -I"(brew --prefix re2)"/include"
set -x LDFLAGS "-L"(brew --prefix openssl)"/lib -L"(brew --prefix re2)"/lib"
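As I understand it, the GRPC_* variables tell the grpcio build to skip BoringSSL’s assembly and link against system copies of OpenSSL, re2, and zlib instead of compiling the bundled third_party sources, which is where the lseek errors above come from. The CPPFLAGS and LDFLAGS then point the compiler and linker at the Homebrew-installed headers and libraries.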

What I Recently Learned About Docker Networking and Debugging Networking Issues in General


This is a story about how I debugged a confounding local development environment issue, what I learned about Docker in the process, and the generally applicable debugging strategies and techniques that ultimately helped me solve it. Skip to the end if you only want the debugging strategies and techniques. The overall story, however, illustrates how they applied in this specific case.

Problem Statement and Use Case

A data infrastructure team at work provides a tool for starting a data pipeline job from a local development environment. Let’s call this tool foo. This tool depends on gcloud and docker. It creates a user-defined Docker network, runs a utility container called bar connected to that network, and then runs another container called qux that talks to bar to retrieve OAuth tokens from Google Cloud Platform (GCP).

Most developers run foo on their local workstations, e.g. MacBooks. But I have the newer MacBook with the Apple M1 ARM-based chip, and Docker Desktop for Mac had only recently added support for M1s. I didn’t want to deal with Docker weirdness. I also didn’t have a lot of free disk space on my 256GB MacBook and thus didn’t feel like clogging up my drive with lots of Java, Scala, and Docker gunk.

So I tried running foo on a GCE VM configured by our Puppet configuration files. When I ran foo, I got this error.


How Kubernetes Routes IP Packets to Services’ Cluster IPs


I recently observed DNS resolution errors on a large Kubernetes (K8s) cluster. This behavior was only happening on 0.1% of K8s nodes. But the fact that it wasn’t self-healing and crippled tenant workloads, combined with my penchant for chasing rabbits down holes, meant I wasn’t going to let it go. I emerged having learned how the Cluster IP feature of K8s Services actually works. Explaining this feature, my particular problem, and my speculative fix is the goal of this post.

The Problem

The large K8s cluster is actually a Google Kubernetes Engine (GKE) cluster with master version 1.17.14-gke.400 and node version 1.17.13-gke.2600. This is a multi-tenant cluster with hundreds of nodes, each running dozens of user workloads. Some users said DNS resolution within their Pods on certain nodes wasn’t working. I was able to reproduce this behavior with the following steps.

Kubernetes runs kube-dns Pods and a Service on the cluster to provide DNS, and it configures kubelets to tell individual containers to use the DNS Service’s IP to resolve DNS names. See the K8s docs here. First I get the kube-dns Service’s Cluster IP. This is the IP address to which DNS queries from Pods are sent.

kubectl --context my-gke-cluster -n kube-system get services kube-dns
NAME       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
kube-dns   ClusterIP   10.178.64.10   <none>        53/UDP,53/TCP   666d

Then I make DNS queries against the Cluster IP from a Pod running on a broken node.
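Any DNS client inside a Pod works for this. For illustration only, here’s a small Go snippet that sends lookups straight at the Cluster IP by overriding the resolver’s dial target; the IP is the one from the kubectl output above, and the hostname is just an example.

package main

import (
	"context"
	"fmt"
	"log"
	"net"
	"time"
)

func main() {
	// Send every query to the kube-dns Cluster IP from the kubectl output above,
	// bypassing whatever is in /etc/resolv.conf.
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, "10.178.64.10:53")
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Example name; any in-cluster or external name works.
	addrs, err := resolver.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
	if err != nil {
		log.Fatalf("lookup failed: %v", err)
	}
	fmt.Println(addrs)
}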


My Hints and Solutions to the First Three Levels of Over the Wire Vortex


I recently found more wargames at overthewire.org. Here are my hints and solutions for the first three levels of Vortex. The levels are cumulative. We have to beat the previous level in order to access the next.

Vortex Level 0 -> Level 1

Hint 1 (how much data): Connect to the host and port and read all the bytes you can. How many bytes do you get?

Hint 2 (endianness): “…read in 4 unsigned integers in host byte order” means the bytes are already in host byte order, i.e. little-endian. If your system is also little-endian, you don’t need to do anything special when interpreting the bytes.

Hint 3 (expected reply): How many bytes is each integer? What is the sum of all four?
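Putting the hints together, here’s a rough Go sketch of the kind of client the level calls for. Treat the details as assumptions to verify: the host and port come from the level’s description page, and the sum is sent back in the same 4-byte little-endian format.

package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"log"
	"net"
)

func main() {
	// Host and port are from the level's description page.
	conn, err := net.Dial("tcp", "vortex.labs.overthewire.org:5842")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Read four 4-byte unsigned integers. "Host byte order" on the game
	// server means little-endian.
	var sum uint32
	for i := 0; i < 4; i++ {
		var n uint32
		if err := binary.Read(conn, binary.LittleEndian, &n); err != nil {
			log.Fatal(err)
		}
		sum += n
	}

	// Send the sum back in the same 4-byte little-endian format.
	if err := binary.Write(conn, binary.LittleEndian, sum); err != nil {
		log.Fatal(err)
	}

	// Print whatever the server replies with.
	reply, _ := io.ReadAll(conn)
	fmt.Printf("%s\n", reply)
}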


My Solution to Exploit Exercises Protostar Final2 Level


This is an explanation of Protostar level Final2. I wrote a solution in April without an explanation. I read it last night and had to spend half a day to understand it again. So next time I’ll write the explanation while it’s still fresh in my head.

The level’s description is

Remote heap level :)
Core files will be in /tmp.
This level is at /opt/protostar/bin/final2


How to Analyze Mobile App Traffic and Reverse Engineer Its Non-Public API


Have you ever wanted to analyze the traffic between a mobile app and its servers or reverse engineer a mobile app’s non-public API? Here’s one way.

The basic principle is to proxy the traffic from the app through a computer you control on which you can capture and analyze traffic. If the app you’re interested in is using an unencrypted protocol like HTTP, this is pretty easy. Just run a proxy on your computer and configure your mobile device to proxy network traffic through your computer’s IP.
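For the unencrypted case, even a few lines of Go are enough to see the idea: a forward proxy that logs each request before passing it on. This is only a sketch for plain HTTP on a hypothetical port 8080; in practice a purpose-built proxy such as mitmproxy or Charles gives you far better inspection tools.

package main

import (
	"io"
	"log"
	"net/http"
)

// A minimal logging forward proxy for plain (unencrypted) HTTP traffic.
// Point the mobile device's HTTP proxy setting at this machine's IP, port 8080.
func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("%s %s", r.Method, r.URL)

		// Re-issue the proxied request to the origin server.
		r.RequestURI = "" // must be empty on outgoing client requests
		resp, err := http.DefaultTransport.RoundTrip(r)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()

		// Copy the origin's headers, status, and body back to the device.
		for k, values := range resp.Header {
			for _, v := range values {
				w.Header().Add(k, v)
			}
		}
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})

	log.Fatal(http.ListenAndServe(":8080", handler))
}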


How to Exploit Dlmalloc Unlink(): Protostar Level Heap3


While stuck inside during social distancing, I’ve been making my way through LiveOverflow’s awesome YouTube playlist “Binary Exploitation / Memory Corruption.” His videos are structured around a well-known series of exploit exercises here called “Protostar.” I took the time to truly understand each one before moving on to the next, as the exercises build on each other. For the past several days I’ve been trying to understand the “Heap3” level, a relatively complex level that requires manipulating the heap to redirect code execution to an arbitrary function. After rewatching the video many times and reading numerous other online explanations, I finally understand! That moment of understanding feels so gratifying.

Many other resources already explain the exploit well, but I’m writing my own explanation to reinforce my understanding and to celebrate.