Azure Resilience Review
Disaster Recovery | Scalability | Monitoring

Disaster Recovery planning for the cloud is different from Disaster Recovery planning for on-premises or co-located hosting. In the cloud, you are more at risk from individual services failing than from a whole data centre failing, and you need an approach that is designed around that fact. NewOrbit has built systems in the cloud for more than a decade, including several with five-nines uptime requirements.

Planning for Disaster Recovery for a cloud-hosted application can be daunting. Different cloud technologies have different SLAs and different abilities to deal with failures (see Azure Failover and Resilience 101 for an overview).

Different businesses and different systems have different resilience requirements. Some may require no downtime at all whereas others can cope with many hours of downtime. There are often also different contractual requirements in play for different systems. It is a well-known adage that for every 9 you add to the end of the SLA target, the cost goes up by an order of magnitude, so it is very important to match your level of resilience with your business requirements to avoid spending more than you need to.
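
To make the "nines" concrete, the sketch below converts an availability target into the downtime it allows per year. It is a minimal illustration only: the function name is our own, a 365-day year is assumed, and it says nothing about cost, only about how quickly the downtime budget shrinks.

```python
# Illustrative only: converts an availability target into a downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the maximum downtime per year, in minutes, allowed by a target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {downtime_minutes_per_year(target):,.1f} minutes/year")
```

Each extra nine cuts the budget by a factor of ten: roughly 3.65 days at 99%, under 9 hours at 99.9%, under an hour at 99.99% and just over 5 minutes at 99.999%.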

Our process goes through the following steps:
  1. Understand your resilience requirements and your business constraints.
  2. Understand your current setup.
  3. Suggest a suitable setup based on your specific situation.
  4. Help you to plan how to implement this.

Each step is outlined in more detail below, with example questions. We will ask you many more questions during the consultation. Do bear in mind that many of the questions go further than many scenarios need; we will evaluate with you which ones are appropriate for your specific context.

Requirements and Constraints

What level of resilience does your business really need?

Area | Example questions
Real world Impact
  • What would be the real-world impact on your business, your customers and your users if the system was down for a minute? An hour? A day?
  • What would be the real-world impact on your business, your customers and your users if you lost one minute’s worth of data? An hour’s worth? A day’s worth? All your data?
RPO/RTO
  • Do you have any externally or internally imposed Recovery Point Objectives (RPO) and/or Recovery Time Objectives (RTO)?
Standards compliance
  • Do you need to meet external standards for recovery and resilience, e.g. FCA, ISO 22301 etc?
  • Would a data loss violate an individual’s rights under the GDPR?
Contracts
  • What level of SLA are you expected to provide? What happens if you don’t?
  • Are there specific requirements in your contracts, for example a requirement to have a fail-over data centre or off-site backup etc?
Constraints
  • What time and financial constraints do you have?
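
The RPO/RTO questions above can be turned into a simple check against your current backup and restore arrangements. The sketch below is a minimal illustration; the class and function names and all the figures are hypothetical, and it assumes the worst-case data loss equals one full backup interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, measured as time
    rto: timedelta  # maximum tolerable time to restore service


def meets_objectives(objectives, backup_interval, estimated_restore_time):
    """Worst-case data loss is one full backup interval; compare both limits."""
    return {
        "rpo_met": backup_interval <= objectives.rpo,
        "rto_met": estimated_restore_time <= objectives.rto,
    }


objectives = RecoveryObjectives(rpo=timedelta(hours=1), rto=timedelta(hours=4))
result = meets_objectives(
    objectives,
    backup_interval=timedelta(hours=24),        # hypothetical nightly backup
    estimated_restore_time=timedelta(hours=2),  # hypothetical restore estimate
)
print(result)  # {'rpo_met': False, 'rto_met': True}
```

In this made-up case a nightly backup cannot satisfy a one-hour RPO, even though the restore time fits comfortably within the RTO.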

Current situation

What is your current situation?

Area | Example questions
What is your application architecture?
  • What languages and frameworks do you use?
  • Monolithic or distributed?
  • Background jobs?
  • Do you use queues?
What does your infrastructure look like?
  • How is the system hosted?
  • Where is data stored?
  • Do you have any fail-over in place?
Infrastructure and deployment
  • How is your code deployed? Manual or CD?
  • Is your infrastructure setup scripted?
What monitoring do you have in place?

Architecture and Plan

A plan to improve the resilience of a system on Azure usually requires activity in one or more of the following areas, depending on requirements:

  • Deployment and code changes to facilitate hot failover for different sub-systems.
  • Deployment changes to make it easier/faster to re-deploy in a secondary Data Centre.
  • Code changes to make the system more resilient to failures in sub-systems.
  • Hosting and possibly code changes to allow for “warm” failovers.
  • Monitoring, especially early detection of impending failures.
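
As one illustration of a code change that makes a system more resilient to a failing sub-system, the sketch below shows a basic circuit breaker: after repeated failures it stops calling the dependency for a cool-down period and serves a fallback instead. This is a generic pattern, not a NewOrbit or Azure API; the class, its parameters and the simulated outage are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_seconds = reset_after_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_seconds:
                return fallback()  # still cooling down: skip the dependency
            self.opened_at = None  # cool-down over: allow one trial call
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0  # a success resets the failure count
        return result


def flaky_lookup():
    raise ConnectionError("database unavailable")  # simulated outage


breaker = CircuitBreaker(failure_threshold=2, reset_after_seconds=60)
for _ in range(3):
    print(breaker.call(flaky_lookup, fallback=lambda: "cached value"))
```

On the third call the breaker is already open, so the failing dependency is not called at all. Whether a cached or degraded response is acceptable depends on the sub-system and on the business requirements discussed earlier.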

One of the most important things to understand is that the biggest risk is not that an entire data centre fails but that individual services within a data centre fail. For example, SQL Azure could fail in the primary DC while the App Service is still running. It is also important to have a sense of the likelihood of a particular sub-system failing – something that goes beyond the official SLA numbers. A proper resilience plan for Azure will consider each of the individual services.
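
A rough way to see why each service needs to be considered individually is to combine their availabilities. The sketch below is a simplification that assumes failures are independent; the function names are our own and the figures are hypothetical, not published Azure SLAs.

```python
from math import prod


def composite_availability(availabilities):
    """Availability of a chain where every service must be up at once."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability when any one of several independent copies is enough."""
    return 1 - (1 - availability) ** copies


# Hypothetical figures for illustration, not published Azure SLAs.
services = {"web app": 0.9995, "database": 0.9999, "storage": 0.999}
chain = composite_availability(services.values())
print(f"Chain of all three: {chain:.4%}")
print(f"Database in two regions: {redundant_availability(0.9999, 2):.6%}")
```

The chain is always less available than its weakest member, which is why adding redundancy to the least reliable sub-system often matters more than planning for a whole data centre failure.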

Note: The outcome from this review is a plan for how to improve the resilience of the system in accordance with the business requirements. Sometimes there is an external demand for a “disaster recovery plan” that has been “tested”. These usually make the simplifying assumption that the whole DC has failed and that everything must be moved. It is our experience that these plans – and in particular the testing of them – take several person-weeks of effort to put into place. If requested, NewOrbit can help you with this (at additional cost) but note that most of the effort will come from the people who actually operate the system on a daily basis.

Implement

If desired, NewOrbit can help you to implement parts or all of the plan:

  • We can introduce your team to the selected tools and help you design the solution.
  • We can second an Azure developer to your team to pair-program on the initial implementation of a particular technology.
  • We can provide you with development and design capacity to help you build parts of the solution.
  • We can be your Azure Cloud Solution Provider, providing you with Azure hosting and giving you access to Azure experts and support as needed.

Interested in learning more?

Contact us to optimise your system's resilience


Contact Us

NewOrbit Ltd.
Hampden House
Chalgrove
OX44 7RW


020 3757 9100

Copyright © NewOrbit Ltd.