Spotify’s Helios in a Nutshell

I work at Spotify on backend infrastructure. In this context, infrastructure is the shared plumbing and platform on which various Spotify systems run. In particular, I work on an open source tool called Helios. This project and, more importantly, my team are pretty awesome.

What is Helios?

Helios is a Docker orchestration framework. This means it’s a tool used to manage your Docker containers across a fleet of hosts.

Why did Spotify create Helios?

Let’s say you have 20 hosts, and you want to run a container from the Docker image “hello-world” on each of them. You’ve installed the Docker daemon on each. Now you SSH into each of them and run docker run hello-world. Doing this 20 times is tedious.

So you use a tool like Cluster SSH or Fabric and run the command once for all hosts. That works.

Your application grows and you need environment variables, exposed ports, and mounted volumes. Your Docker command becomes longer:

docker run --env="FOO=BAR" --publish=80:8080 --volume="/etc/default/config:/etc/default/config" hello-world

You need to remember this long command so you save it to a file in your code repository. This works OK.
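
One way to avoid memorizing the long command is to generate it from data. Here's a minimal Python sketch of that idea. The function name docker_run_command and its parameters are my own for illustration and have nothing to do with Helios itself. It only builds the argument list and doesn't run Docker.

```python
import shlex


def docker_run_command(image, env=None, ports=None, volumes=None):
    """Build a `docker run` argument list from plain Python data."""
    args = ["docker", "run"]
    for name, value in (env or {}).items():
        args.append(f"--env={name}={value}")
    # Docker's --publish takes host_port:container_port.
    for host_port, container_port in (ports or {}).items():
        args.append(f"--publish={host_port}:{container_port}")
    # Docker's --volume takes host_path:container_path.
    for host_path, container_path in (volumes or {}).items():
        args.append(f"--volume={host_path}:{container_path}")
    args.append(image)  # the image name goes after all options
    return args


command = docker_run_command(
    "hello-world",
    env={"FOO": "BAR"},
    ports={80: 8080},
    volumes={"/etc/default/config": "/etc/default/config"},
)
print(shlex.join(command))
```

Keeping the settings as data rather than as one long string is a small step toward what Helios calls a job.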

You notice your Docker containers sometimes crash when you start them. You run watch 'docker ps' to see which containers have crashed on which hosts. You have to tail the logs on those hosts to figure out what went wrong. Hm, this is becoming hard to manage.

One day, one of your hosts restarts. You don’t notice that the container is no longer running on that host until several days later. Maybe it’s time to think of a better solution.

These and many other reasons are why Spotify created Helios. Helios makes it easy to deploy to multiple hosts, and Helios keeps track of which containers are running where and will restart containers if they crash.
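
The core idea of "keep track of what should run where, and restart what's missing" can be sketched as a comparison between desired state and observed state. This is only an illustration of the concept, not Helios's actual implementation. The function name containers_to_restart and the host names are hypothetical.

```python
def containers_to_restart(desired, running):
    """Map each host to the containers it should run but currently does not."""
    plan = {}
    for host, wanted in desired.items():
        missing = sorted(set(wanted) - set(running.get(host, [])))
        if missing:
            plan[host] = missing
    return plan


# Desired state: every host should run hello-world.
desired = {"host1": ["hello-world"], "host2": ["hello-world"]}
# Observed state: host2 restarted and lost its container.
running = {"host1": ["hello-world"], "host2": []}

plan = containers_to_restart(desired, running)
print(plan)
```

Running a check like this on a schedule, instead of waiting for a human to notice, is what saves you from the "several days later" surprise.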

How does Helios work?

At a high level, Helios is made of a command line tool (CLI), masters, ZooKeeper, agents, and jobs. A job is a Docker image bundled with configuration like environment variables, exposed ports, mounted volumes, etc. Jobs are stored in ZooKeeper, a distributed coordination service that keeps data in a file-system-like tree of nodes. The CLI lets humans interact with the master. A Helios master receives commands like “deploy” and writes data to ZooKeeper. Helios agents run alongside Docker on each host, periodically ask ZooKeeper what they should do, and carry out the corresponding Docker actions like running and stopping containers.
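
To make "a Docker image bundled with configuration" concrete, here's a rough Python sketch of a job as a data structure that can be serialized for storage. The field names are my own assumptions for illustration and are not Helios's real job schema. JSON just stands in for whatever format is actually written to ZooKeeper.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Job:
    """Hypothetical job: an image plus the settings needed to run it."""
    name: str
    image: str
    env: dict = field(default_factory=dict)
    ports: dict = field(default_factory=dict)    # host port -> container port
    volumes: dict = field(default_factory=dict)  # host path -> container path


job = Job(
    name="hello",
    image="hello-world",
    env={"FOO": "BAR"},
    ports={"80": 8080},
    volumes={"/etc/default/config": "/etc/default/config"},
)

# Serialize the job so it could be stored, then read it back.
serialized = json.dumps(asdict(job), sort_keys=True)
restored = Job(**json.loads(serialized))
print(serialized)
```

Because the job is plain data, the master can validate it once and every agent reads back exactly the same definition.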

The lifecycle of a Helios command

I’ll use my drawing below to illustrate the lifecycle of the Helios deploy command.

A human user uses the CLI to deploy a Helios job to three agents.
The master checks that the job is valid and writes data to ZooKeeper saying that the job should be deployed to those three agents.
The agents periodically ask ZooKeeper what jobs they should deploy.
The agent processes on those hosts tell their local Docker daemons to create and start a container as specified in the job.
The Docker daemon creates and starts the container.
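
The deploy steps above can be simulated in a few lines of Python. This is a toy model under loud assumptions: Store is an in-memory stand-in for ZooKeeper, the Master and Agent classes are hypothetical, and "starting a container" is just adding a name to a set. None of it reflects Helios's real code.

```python
class Store:
    """In-memory stand-in for ZooKeeper: host name -> set of job names."""

    def __init__(self):
        self.assignments = {}


class Master:
    def __init__(self, store, known_jobs):
        self.store = store
        self.known_jobs = set(known_jobs)

    def deploy(self, job_name, hosts):
        # Validate the job, then record where it should run.
        if job_name not in self.known_jobs:
            raise ValueError(f"unknown job: {job_name}")
        for host in hosts:
            self.store.assignments.setdefault(host, set()).add(job_name)


class Agent:
    def __init__(self, host, store):
        self.host = host
        self.store = store
        self.running = set()  # stand-in for containers on the local Docker daemon

    def poll(self):
        # Ask the store what this host should run and start anything missing.
        wanted = self.store.assignments.get(self.host, set())
        to_start = sorted(wanted - self.running)
        self.running.update(to_start)
        return to_start


store = Store()
master = Master(store, known_jobs=["hello-world"])
agents = [Agent(f"agent{i}", store) for i in (1, 2, 3)]

master.deploy("hello-world", [agent.host for agent in agents])
first_poll = [agent.poll() for agent in agents]
second_poll = [agent.poll() for agent in agents]
print(first_poll)
print(second_poll)
```

Notice that the master never talks to the agents directly. It only writes to the store, and each agent picks up its work the next time it polls.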

Here’s the lifecycle of the helios status command.

A human user runs helios status -j JOB_ID.
The master asks ZooKeeper what the status of that job is and on which hosts the job is deployed.
Meanwhile, the three agents have periodically been asking their local Docker daemons whether the corresponding container is running.
Each Docker daemon checks whether the container is running and relays that information back to its Helios agent.
Agents one and three see that the container is running. Agent two sees that the container isn’t running. They write the job’s status to ZooKeeper.
The master queries ZooKeeper for these statuses and reports them back to the CLI user.
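
The status flow works in the opposite direction: agents write what they observe, and the master reads it back. Here's a small Python sketch of that, again with a plain dictionary standing in for ZooKeeper. The function names and the RUNNING / NOT_RUNNING strings are made up for this example and are not Helios's real status values.

```python
def report_status(store, job_id, host, container_running):
    """Agent side: record what the local Docker daemon reported."""
    state = "RUNNING" if container_running else "NOT_RUNNING"
    store.setdefault(job_id, {})[host] = state


def job_status(store, job_id):
    """Master side: read every host's status for a job, sorted by host."""
    return dict(sorted(store.get(job_id, {}).items()))


store = {}  # stand-in for ZooKeeper: job id -> {host: state}

# Agents one and three see the container running; agent two does not.
observations = {"agent1": True, "agent2": False, "agent3": True}
for host, running in observations.items():
    report_status(store, "JOB_ID", host, running)

status = job_status(store, "JOB_ID")
for host, state in status.items():
    print(f"{host}: {state}")
```

The master answers the CLI from whatever the agents last wrote, so the reported status can lag behind reality by up to one polling interval.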

Hopefully this was a gentle introduction to Helios. If you find it fits your use case or have questions, drop us a line at github.com/spotify/helios.

David Xia