How Enabling a Health Check on our Azure App Service Caused >30s Response Times

by Tom Hyde | 31/03/2021

TL;DR: If you use containers in Azure App Service and want to use the Health Check feature, be careful not to redirect HTTP to HTTPS inside your container.

What happened

The other day, we released what appeared to be a fairly low-risk update to an ASP.NET Core web application running in an Azure App Service. All seemed well, but after a couple of minutes, average response times increased from the usual handful of milliseconds to more than 30 seconds. It seemed fairly clear that something must be wrong with the new code, so I duly redeployed the old version of the application, and response times returned to their normal, healthy level.

We then identified some features in the release that shouldn't have caused a problem, but we allowed ourselves to believe that they might have. So we created a new build containing only a few of the features, which surely couldn't have such an impact. We deployed this and response times went through the roof once more. What could be causing this?

We compared configuration between the staging slot we were swapping with production and the production slot itself, to see if there were any differences that could be causing the problem. Nothing. Except that Health Checks were enabled on the staging slot, and not on the production slot.

In 2020, Microsoft added a Health Check feature to Azure App Service. In the Azure Portal, we are told:

Health check increases your application's availability by removing unhealthy instances from the load balancer. If your instance remains unhealthy, it will be restarted.

Great! If some unknown bug causes my application to become unresponsive, Azure can restart the poorly instance while users continue to be directed to healthy ones, keeping the application available and leaving us to diagnose and fix the problem when we can get around to it.

The health check feature simply polls a health check endpoint, which we specify, on the application once a minute. The instance is considered "unhealthy" if the endpoint returns a non-2xx response. The application already has a health check endpoint which we monitor, so this should be really easy.
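For context, wiring up such an endpoint in ASP.NET Core looks roughly like this. This is a minimal sketch rather than our actual code: the /health path and the bare AddHealthChecks() registration are illustrative assumptions.

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;

public class Startup
{
    public void ConfigureServices(IServiceCollection services)
    {
        // Register the health check services; a real application would
        // typically add checks for databases, queues, etc. here.
        services.AddHealthChecks();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            // Responds 200 when healthy and 503 when unhealthy; Azure's
            // Health Check feature treats any non-2xx response as unhealthy.
            endpoints.MapHealthChecks("/health");
        });
    }
}
```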

So, to confirm that this was the issue without deploying any new code, we enabled the health check feature on the production slot. Sure enough, within a few minutes, average response times shot up to over 30 seconds. Leaving it for a while, we could see that the behaviour was cyclical: response times would be "fine" for a couple of minutes before shooting up again, recovering to a normal level, then going bad once more.

This does feel like the instance is being deemed unhealthy and restarted - yet the health check monitoring did not report anything unhealthy, nor was the App Service instance restarted.

Diving into Docker

The application is running in a Docker container, so we looked at the container logs. There, with a cadence matching the response time spikes we were seeing, we found:

Container for XXXXX site XXXXX is unhealthy, recycling site.

But whenever we manually navigated to the health check endpoint, all was well, with a 200 response code. What was causing Azure to determine that the application was unhealthy?

Looking into the Application Insights logs, we could see that our "manual" hits of the health check endpoint were returning 200, exactly as we would expect; however, there were also requests to the health check endpoint, coming from localhost, that were returning a 307 response code.

The obvious difference between these requests was that the external ones came in via HTTPS, whereas the "localhost" requests came in via HTTP. This seemed odd, since the documentation says:

If the site is HTTPS-Only enabled, the Health check request will be sent via HTTPS.

We did have the HTTPS-Only setting enabled, but because the health check request originates inside the App Service, it reaches the container over plain HTTP: the App Service front end terminates TLS, and the reverse proxy relays everything to the container over HTTP. On the face of it this shouldn't matter, since all requests are HTTP by the time the container receives them, regardless of whether the original request to the App Service was HTTP or HTTPS.

It does matter, though, because we use the UseHttpsRedirection middleware, which is what caused the application to answer the health check with an HTTP 307. For external requests this works just fine: UseHttpsRedirection sits after UseForwardedHeaders in the request pipeline, so it can see from the forwarded headers that the original request was over HTTPS and knows that there is no need to redirect. Not so for the health check requests coming from Azure. Although every request reaches the container over HTTP, what mattered was that these requests carried no evidence of an original HTTPS request, so the middleware redirected them.
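In sketch form, the relevant part of the pipeline looked something like this (simplified; the options shown are illustrative rather than a copy of our configuration):

```csharp
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.HttpOverrides;

public void Configure(IApplicationBuilder app)
{
    // Reads the X-Forwarded-* headers added by the App Service front end
    // and rewrites Request.Scheme (and the client IP) to reflect the
    // original external request.
    app.UseForwardedHeaders(new ForwardedHeadersOptions
    {
        ForwardedHeaders = ForwardedHeaders.XForwardedFor | ForwardedHeaders.XForwardedProto
    });

    // Runs after UseForwardedHeaders. For external HTTPS requests,
    // Request.Scheme is now "https", so no redirect is issued. The internal
    // health check request carries no X-Forwarded-Proto of "https", so
    // Request.Scheme stays "http" and the middleware responds with a 307.
    app.UseHttpsRedirection();

    // ... rest of the pipeline ...
}
```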

The health check returning a 307 caused Azure to determine that the application was unhealthy, and so it recycled the container. Requests went unanswered until the new container was up and running, hence the increased average response times during this period.

In order to get the application to work well with the health check feature in Azure App Service, we will remove the UseHttpsRedirection middleware, since HTTPS redirection is enforced at App Service level anyway.
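If the same image also needs to run outside App Service, an alternative to deleting the line outright would be to add the middleware only when not running in App Service. A sketch of that idea, using the WEBSITE_SITE_NAME environment variable that App Service sets on its instances (this guard is our suggestion, not part of the fix described above):

```csharp
// Fragment from Startup.Configure. WEBSITE_SITE_NAME is set by App Service,
// so its absence is a reasonable (if heuristic) signal that the app is
// running elsewhere, e.g. locally, where we may still want the redirect.
if (string.IsNullOrEmpty(Environment.GetEnvironmentVariable("WEBSITE_SITE_NAME")))
{
    app.UseHttpsRedirection();
}
```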

It took a lot of time to find that this site-killing issue, which cropped up when deploying new code, was caused by a feature meant to stop the site from going down and staying down!

Conclusion

When you are not using containers with App Service, all of the above would work just fine: when it is the App Service itself that performs the health check, the HTTPS-Only setting is respected.
However, when you use containers, the health check setting appears to be "passed down" to the container management layer. At that layer there is only HTTP traffic, so the container management layer will always probe the containers over HTTP. This makes sense, but it is very easy to get caught out by, and it took a lot of effort to get to the bottom of it.

