I’m helping a recently acquired team at work figure out if they can migrate from Kafka to Google Cloud Pub/Sub. Part of the exploration was figuring out the change in latencies, if any, from switching.
The team’s production setup looks like this:
- They paid an external company called Confluent to run a managed Kafka cluster in AWS Oregon.
- This is the same region where the team runs all their backend services. Their migration also involves moving these workloads from AWS Oregon to GCP us-central1, so if they adopt Pub/Sub, their services will be publishing and subscribing to messages across cloud providers and regions. My latency benchmarks took that into account.
- All their services are written in Golang.
- Services run as containers in AWS Elastic Container Service.
I defined latency as the time elapsed from when a message is published to when it’s received by a subscriber, not counting the extra time it takes for the subscriber to acknowledge the message. The benchmark setup:
- I used Golang and the same upstream client libraries for Kafka and Pub/Sub that the team uses, or would use, in production.
- I published messages of various sizes at various rates from AWS EC2 instances in Oregon for five minutes. At the same time, five Google Compute Engine instances in us-central1 subscribed to these messages (pull-based) as fast as possible.
- Each run started with a one-minute burn-in period, and I didn’t record latencies until it elapsed. This avoids latency effects from a brand-new topic or subscription, or from too few messages flowing through the messaging service, and more closely mimics message latency in production.
- I always took the percentile summary of the subscriber with the second-highest p99 latency.
- I created new Pub/Sub or Kafka topics for each series in the graphs below. Kafka topics always had eight partitions.
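The core of this kind of measurement is stamping each message with its publish time and computing the elapsed time on receipt. Here’s a minimal sketch of that idea; the helper names are mine, an in-memory channel stands in for Kafka or Pub/Sub, and it assumes publisher and subscriber clocks are synchronized (cross-cloud clock skew adds error to real measurements):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// encodePublishTime prepends the publish timestamp (UnixNano, big-endian)
// to the payload so the subscriber can recover it.
func encodePublishTime(payload []byte) []byte {
	msg := make([]byte, 8+len(payload))
	binary.BigEndian.PutUint64(msg, uint64(time.Now().UnixNano()))
	copy(msg[8:], payload)
	return msg
}

// latencySince extracts the embedded timestamp and returns the
// publish-to-receive latency. Assumes synchronized clocks (e.g. NTP).
func latencySince(msg []byte) time.Duration {
	published := time.Unix(0, int64(binary.BigEndian.Uint64(msg)))
	return time.Since(published)
}

func main() {
	// An in-memory channel simulates the transport; in the real benchmark
	// the message travels through Kafka or Pub/Sub across regions.
	transport := make(chan []byte, 1)
	transport <- encodePublishTime([]byte("hello"))
	received := <-transport
	fmt.Println("latency:", latencySince(received), "payload:", string(received[8:]))
}
```

In a real run, each subscriber would append these latencies to a slice and summarize them into percentiles after the five-minute window.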
I took some inspiration from a blog post titled “Benchmarking Message Queue Latency” and also found the GCP post “Testing Cloud Pub/Sub clients to maximize streaming performance,” which linked to the code used to benchmark Pub/Sub. Unfortunately, after trying that tool many times and finding it wasn’t documented well and had various issues like this, I gave up and wrote my own simple latency benchmarker in Golang. This was probably better anyway, since it ensured I was using the same language and client libraries as the team I was helping.
With my specific test parameters, Kafka p99 latencies were 100-200ms and much lower than Pub/Sub latencies. In the worst cases, Pub/Sub latencies were almost an order of magnitude higher: p99 was approximately 0.5-1 seconds at the team’s current publisher throughput, which is relatively low at about 1KB/s. At higher throughputs the latencies dropped to 300-400ms. This conforms to Google’s documentation and the generally accepted knowledge that Pub/Sub performs better at higher message volumes. According to one of the team’s engineers, this latency is acceptable for all messages except one, which can be converted to a direct service-to-service request.
It was also interesting to see that message delivery was pretty evenly spread out over five subscribers with Pub/Sub. Kafka often had a few consumers that received twice as many messages as their peers.
After I finished benchmarking, I found PerfKitBenchmarker, an open source benchmarking tool used to measure and compare cloud offerings. It looks promising, but I haven’t tried it out yet.