Improving zero downtime on Kubernetes

Date posted: 7 February 2020
Reading time: 10 minutes
Daryl Porter


A few months ago, we began a piece of work on my last project setting out to accomplish two things:

  1. Improve the speed of our release process
  2. Allow us to run multiple versions of the application concurrently

These were two limitations of our initial deployment and release design and so in sharing this blog I am hoping the lessons learnt will be able to help you avoid the same issues.

Before jumping into the solution, I'll quickly describe our starting point.

Please note: This is a simplified view of the infrastructure, we'll only describe elements relevant to the zero downtime aspect of the solution.

Configuration - before

[Diagram: configuration before]

Design decisions

  - Separate namespaces for the web tier (app-publishing) & the application tier (blue & green)
  - Application namespaces (blue/green) would be immutable and therefore would only ever contain a single release
  - Application deployed via a single Helm chart
  - Use the Nginx Ingress controller to manage Kubernetes Ingress resources

Issues with this configuration

1. Ingress controller reloads cause 1 second of dropped traffic
The redirection mechanism, illustrated in red in the diagram above, redirected live traffic between the 'blue' and 'green' namespaces by updating the Ingress resource in app-publishing to point at the redirector service resource in the cold blue/green namespace, as shown below.
[Diagram: Ingress resource update during redirection]

Updating the Ingress resource to point at a new service means the Ingress controller would need to perform a reload of the Nginx configuration, which incurred on average one second of dropped connections, resulting in 500 errors (this and other causes of reloads can be found here).

This meant, in order to perform zero downtime releases, we had to remove Region1 from Traffic Manager, update it while offline, add it back into Traffic Manager & repeat for Region2. This led to the next issue with this setup.

2. Unable to support multiple versions
The application needs to be able to route user traffic in two scenarios:
  1. New users landing on the homepage should be routed to the latest stable version of the application.
  2. Existing user sessions requesting version specific assets should be routed to the requested version.

In the current design, we have no way of routing based on requested version and only have one Ingress flow for both scenarios of traffic.

This meant that during the window in which Region1 was updated and re-introduced to Traffic Manager, and Region2 was being drained (typically between 5 and 10 minutes), existing user sessions on Region2 would have no route to the version specific assets on Region1, resulting in a 404.

3. Slow release process
The last issue with this setup was the amount of time it took to perform a release. Although the only manual step in the release process, as mentioned above, was removing and re-adding regions in Traffic Manager, waiting for connections to drain before switching to the new version drastically slowed down our release process: redirecting to a new, already tested release could take up to 20 minutes.

The fix
The plan for resolving this was quite simple and can be summarised by three key points:

1. Remove the need for Nginx Ingress controller reloads during redirection.
The root of our redirection issues is the need for Nginx Ingress controller reloads, and the 1 second of downtime this incurs. We therefore need to ensure all Ingress resources remain static, meaning their downstream config cannot change after initial configuration. In doing this, we need to find another way to redirect traffic. This is where label selectors come in.

Label selectors allow us to route traffic to target pods within a single namespace without incurring a connection drop. This was the key to unlocking zero downtime deployments without Traffic Manager manipulation.

In order to facilitate this, we had to move away from blue/green namespaces to a single production namespace, meaning our immutable unit of release was no longer the entire namespace but instead the Helm release.
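
As a rough sketch of what this looks like (the names below are illustrative, not taken from our actual charts), the 'live' Service selects pods by a release label, so several Helm releases can coexist in the namespace while only one receives live traffic:

```yaml
# Hypothetical example: the live Service in the single production namespace.
# Every Helm release labels its pods with 'app' plus a unique 'release' value;
# only pods matching this selector receive live traffic.
apiVersion: v1
kind: Service
metadata:
  name: myapp-live
  namespace: production
spec:
  selector:
    app: myapp
    release: myapp-1-2-3   # updated at redirection time (see the patch further below)
  ports:
    - port: 80
      targetPort: 8080
```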

2. New default Ingress & service resources
We would introduce a new 'default' Ingress flow to handle new user traffic. The Ingress resources within this flow would remain static and we would instead utilise label selectors for zero downtime redirection.
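
A minimal sketch of that static default Ingress, assuming the hypothetical myapp-live Service above (host and class are illustrative):

```yaml
# Hypothetical default Ingress: sends new users landing on the homepage to
# whichever release the myapp-live Service currently selects.
# This resource never changes during redirection, so no Nginx reload occurs.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-default
  namespace: production
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp-live
                port:
                  number: 80
```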

3. Restructure Ingress resources
To facilitate the second use case of traffic routing, we would utilise URL path based routing to introduce new version specific Ingress flows for existing user sessions.
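
For example (again with illustrative names and paths, not our real ones), a version specific Ingress can match a version prefix in the URL and route it to a Service pinned to that Helm release:

```yaml
# Hypothetical version-specific Ingress: requests for assets under /1.2.3/
# always reach the pods of release 1.2.3, regardless of which release is live.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-1-2-3
  namespace: production
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /1.2.3
            pathType: Prefix
            backend:
              service:
                name: myapp-1-2-3   # Service whose selector is pinned to release 1.2.3
                port:
                  number: 80
```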

Configuration - after

[Diagram: configuration after]

Key points:

  - Dual region & Traffic Manager still exist; they are not shown above as they are now irrelevant to zero downtime
  - Single production namespace
  - Simplified design, with no services referencing resources in different namespaces
  - Default Ingress flow (left) handles the first use case: routing traffic landing on the service's homepage
  - Version specific Ingress handles the second use case: routing traffic requesting version specific assets, based on the URL path
  - Ingress resources remain static and no longer update during redirection
  - Redirection (in red) is now a label selector change using the kubectl patch command, run via a CI job (shown below)

[Diagram: CI job that patches the label selector]
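
As a rough sketch of what that CI step might run (service name, namespace and label values are illustrative, not taken from our actual pipeline):

```shell
# Hypothetical redirection step: point the live Service's selector at the pods
# of the newly released Helm chart. No Ingress change, so no Nginx reload.
kubectl --namespace production patch service myapp-live \
  --type merge \
  --patch '{"spec": {"selector": {"app": "myapp", "release": "myapp-1-2-4"}}}'
```

Because only the Service's label selector changes, the Nginx configuration generated by the Ingress controller stays the same and no connections are dropped.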



The result

  1. The manual step in the redirection process has been removed.
  2. Redirecting to a new version now takes under 1 second with no connection drop (the CI job which runs this takes under 5 seconds).
  3. New versions can be deployed to the production namespace at any time and tested offline before any live traffic is directed to them.
  4. We can support an unlimited number of concurrent versions.
