Disaster recovery
This document lists the technical risks to our digital services and the mitigations we have in place.
For any issue affecting a service (or services), always check whether any dependent services may also be affected.
Application bug
A software or configuration defect is deployed and impacts users.
Impact | Loss of some functionality in a service |
Prevention | Code review. Implement sufficient unit tests and integration tests in production-like environments. |
Detection | Ideally it is caught by a failing smoke test before release; otherwise by error tracking software like Sentry or, in the worst-case scenario, by a user contacting support after release |
Remediation | The quickest action is to roll back the problematic change or roll forward with a fix |
Application crash
The application may crash because of a bug, memory leak, high utilisation…
Impact | It may or may not impact end users, as a service may run multiple application instances. |
Prevention | Crashes may happen because of high memory, CPU or disk usage. These metrics should be monitored, with alerts raised in advance, so the crash can be avoided entirely. |
Detection | If the whole application crashes, endpoint monitoring like StatusCake would notify of a total outage impacting users. A crash of a single application instance may be reported by infrastructure monitoring. |
Remediation | The quickest action is to roll back the problematic change or roll forward with a fix. Ideally the platform detects a failing application and restarts it. For example, Kubernetes detects the failure by running frequent healthchecks, then deploys a new container and kills the failed one (see the sketch below). If there is no such feature, the application may be restarted manually. If the restart doesn’t work, the application and infrastructure must be investigated manually. AKS also uses rolling deployments, so a new deployment only becomes active if the startup probe (set to a service healthcheck) is successful. |
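As a rough illustration, such probes might be declared with the Terraform Kubernetes provider along these lines (the application name, namespace, image, /healthcheck path and port are placeholders, not our actual configuration):

```hcl
# Illustrative sketch: a Deployment whose instances the platform restarts
# when the healthcheck keeps failing.
resource "kubernetes_deployment" "app" {
  metadata {
    name      = "app"
    namespace = "app"
  }

  spec {
    replicas = 2 # more than 1 instance limits the impact of a single crash

    selector {
      match_labels = { app = "app" }
    }

    template {
      metadata {
        labels = { app = "app" }
      }

      spec {
        container {
          name  = "app"
          image = "ghcr.io/example-org/app:latest" # placeholder image

          # Kubernetes restarts the container if this probe keeps failing.
          liveness_probe {
            http_get {
              path = "/healthcheck"
              port = "3000"
            }
            period_seconds    = 10
            failure_threshold = 3
          }

          # A failing instance is removed from the load balancer while it
          # recovers, so users are not routed to it.
          readiness_probe {
            http_get {
              path = "/healthcheck"
              port = "3000"
            }
            period_seconds = 10
          }
        }
      }
    }
  }
}
```

A startup probe can be added in the same way to gate rolling deployments on a successful healthcheck.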
Data corruption
The data in the database is corrupted because of a bug, human error, malicious activity… and cannot be recovered.
Impact | Some data may be lost, updated with incorrect values or presented to the wrong users. |
Prevention | Azure Postgres keeps backups of the database and transaction logs. We can recreate the database from a daily backup or a point-in-time (1 second resolution) restore |
Detection | Smoke tests may detect corruptions in some critical data. |
Remediation | Access to the service should be stopped immediately. The data may be fixed manually if the change is simple. If the change is complex, or if we don’t know the extent of the issue, it may be necessary to restore the database from a backup, whether daily, hourly or point-in-time using the transaction logs (see the sketch below). |
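As a rough illustration, a point-in-time restore of an Azure Postgres flexible server can be expressed in Terraform along these lines (the server names, resource group and timestamp are placeholders; the same restore can also be run from the Azure portal or CLI):

```hcl
# Illustrative sketch: create a new server restored from a point-in-time
# backup of an assumed existing production server, just before the corruption.
resource "azurerm_postgresql_flexible_server" "restored" {
  name                = "app-production-restored"               # placeholder name
  resource_group_name = azurerm_resource_group.production.name  # assumed to exist
  location            = azurerm_resource_group.production.location

  create_mode                       = "PointInTimeRestore"
  source_server_id                  = azurerm_postgresql_flexible_server.production.id
  point_in_time_restore_time_in_utc = "2024-01-01T09:00:00Z" # placeholder timestamp
}
```

Once the restored server has been verified, traffic can be pointed at it (or the data copied back) before access to the service is re-enabled.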
Loss of database instance
It is possible to lose the database instance and the associated backups. For example, if the database server is deleted from Azure because of human or automation error, the whole instance is deleted, including its backups.
Impact | Users can’t read or write any data. All data is lost. |
Prevention | To protect against human error, users should only be allowed to access production when they need to. To protect against automation errors, changes should be thoroughly reviewed, in pull requests or review apps. Keep a daily backup of the production databases in a secure place like an Azure storage account. Production backups should only be accessible by authorised users. |
Detection | Endpoint monitoring may point to a healthcheck page checking the connection to the database. Or smoke tests running in production may detect it.
Remediation | Restore the database from the external daily backup, or the most recent backup available
Accidental resource deletion
We use Terraform to provision resources, but a code change or a user with privileges could delete resources accidentally or otherwise.
Impact | Applications may be unavailable. Data may be lost. |
Prevention | Approved PIM request required for production Azure access. Pull requests require at least 1 approval. Soft delete and versioning enabled for key vaults and storage accounts. Azure resource locks placed on important resources (see the sketch below). |
Detection | Endpoint monitoring may point to a healthcheck page that is now failing. Or smoke tests running in production may detect it. |
Remediation | Recovery depends on the resource deleted: either restore the correct version, or redeploy and restore data from backup
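As a rough illustration, a resource lock could be declared in Terraform along these lines (the lock name and resource group reference are placeholders):

```hcl
# Illustrative sketch: a CanNotDelete lock on an assumed production resource
# group. The lock must be removed explicitly before anything in scope can be
# deleted, which guards against accidental deletion by users or automation.
resource "azurerm_management_lock" "production" {
  name       = "production-cannot-delete"
  scope      = azurerm_resource_group.production.id # assumed to exist
  lock_level = "CanNotDelete"
  notes      = "Prevents accidental deletion of production resources"
}
```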
Loss of Azure/AWS availability zone
We deploy to the UK South or West Europe regions, which each have 3 separate availability zones (AZs). It may happen that one of them becomes unavailable: either network, compute or storage services are affected.
Impact | Applications may be slow or unavailable |
Prevention | Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZs. AKS deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy. Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover). Azure Key Vault uses replication within the region and to a paired region (automatic failover). Azure Postgres can be configured as zone redundant, with the active and standby instances in different zones. Azure Redis can be zone redundant only if using the Premium SKU. Postgres and Redis use automatic failover; Postgres can also be failed over manually. Cluster public IP addresses should be configured with zone redundancy (see the sketch below). |
Detection | Endpoint monitoring checking for uptime and response time |
Remediation | If not handled automatically by the platform, redeploy applications and fail over clusters |
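As a rough illustration, some of the zone redundancy measures above might look like this in Terraform (assuming a recent azurerm 3.x provider; names, SKUs, sizes and the resource group are placeholders, and some required arguments are omitted for brevity):

```hcl
# AKS nodes spread across the region's three availability zones.
resource "azurerm_kubernetes_cluster" "main" {
  name                = "app-production-aks"
  location            = "uksouth"
  resource_group_name = azurerm_resource_group.production.name # assumed to exist
  dns_prefix          = "app-production"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D2s_v3"
    node_count = 3
    zones      = ["1", "2", "3"]
  }

  identity {
    type = "SystemAssigned"
  }
}

# Zone redundant Postgres: the standby instance runs in a different zone and
# failover is automatic. (Administrator credentials omitted for brevity.)
resource "azurerm_postgresql_flexible_server" "production" {
  name                = "app-production-psql"
  location            = "uksouth"
  resource_group_name = azurerm_resource_group.production.name
  sku_name            = "GP_Standard_D2s_v3"
  storage_mb          = 32768
  version             = "14"

  high_availability {
    mode = "ZoneRedundant"
  }
}

# Zone redundant public IP address for cluster ingress.
resource "azurerm_public_ip" "ingress" {
  name                = "app-production-ip"
  location            = "uksouth"
  resource_group_name = azurerm_resource_group.production.name
  allocation_method   = "Static"
  sku                 = "Standard"
  zones               = ["1", "2", "3"]
}
```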
Loss of Azure/AWS region
In some rare cases, an entire region might become unavailable.
Impact | Applications may be unavailable |
Prevention | For critical applications, it is possible to deploy to 2 different regions, synchronise the data, and configure a DNS-based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up. Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of the data in a separate region (see the sketch below). Any critical application data kept in a storage account should be GRS/GZRS. Azure Key Vault maintains a copy of its contents in another region. For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover.
Detection | Endpoint monitoring checking for uptime |
Remediation | Start services in the backup region, trigger DNS failover
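As a rough illustration, a geo-redundant storage account for such backups might be declared along these lines (the account name and resource group are placeholders):

```hcl
# Illustrative sketch: GZRS keeps zone redundant copies in the primary region
# plus a copy in the paired region, so backups survive the loss of a region.
resource "azurerm_storage_account" "backups" {
  name                     = "appproductionbackups" # placeholder, must be globally unique
  resource_group_name      = azurerm_resource_group.production.name # assumed to exist
  location                 = "uksouth"
  account_tier             = "Standard"
  account_replication_type = "GZRS"
}
```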
Azure issues impacting delivery
We often rely on Azure for:
- Terraform state in Azure Storage
- Infrastructure and application secrets in Azure Key Vault
- Daily production database backups in Azure Storage
Impact | This would not impact the running applications but would prevent us from deploying new versions and backing up the database. |
Prevention | Enable soft delete on Key Vaults. Secrets are versioned. Enable container soft delete and enable versioning for blobs (see the sketch below). |
Detection | The pipelines or deployments may fail |
Remediation | Key Vaults with soft delete are recoverable. Secrets are versioned and recoverable in case of corruption or deletion. Azure Storage accounts can be recovered for 14 days if they were deleted by mistake. Storage containers with soft delete are recoverable. Versioned files in storage blobs are recoverable in case of corruption or deletion
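As a rough illustration, the soft delete and versioning settings above might be declared like this in Terraform (names and the resource group are placeholders):

```hcl
# Needed to look up the current tenant for the Key Vault.
data "azurerm_client_config" "current" {}

# Illustrative sketch: Key Vault with soft delete and purge protection, so
# deleted secrets and the vault itself can be recovered.
resource "azurerm_key_vault" "main" {
  name                       = "app-production-kv" # placeholder
  location                   = "uksouth"
  resource_group_name        = azurerm_resource_group.production.name # assumed to exist
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  sku_name                   = "standard"
  soft_delete_retention_days = 90
  purge_protection_enabled   = true
}

# Illustrative sketch: storage account for Terraform state and backups, with
# blob versioning plus blob and container soft delete.
resource "azurerm_storage_account" "state" {
  name                     = "appproductiontfstate" # placeholder
  resource_group_name      = azurerm_resource_group.production.name
  location                 = "uksouth"
  account_tier             = "Standard"
  account_replication_type = "GRS"

  blob_properties {
    versioning_enabled = true

    delete_retention_policy {
      days = 14
    }

    container_delete_retention_policy {
      days = 14
    }
  }
}
```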
Denial of service
An attacker may send a high number of requests to overload the service and make it unavailable.
Impact | The service is unavailable or slow for users |
Prevention | Every resource in Azure is protected by Azure’s infrastructure DDoS (Basic) Protection. Depending on the criticality of the service, it is possible to use Azure DDoS Protection Standard instead. Public IP addresses can have DDoS protection enabled individually (see the sketch below). |
Detection | Endpoint monitoring checking for uptime and response time |
Remediation | Protection measures are triggered automatically. It is also possible to analyse the traffic pattern and change the application accordingly. |
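As a rough illustration, per-IP DDoS protection could be enabled along these lines (assuming a recent azurerm 3.x provider; the name and resource group are placeholders):

```hcl
# Illustrative sketch: IP protection enabled on a single public IP address.
resource "azurerm_public_ip" "service" {
  name                 = "service-production-ip" # placeholder
  location             = "uksouth"
  resource_group_name  = azurerm_resource_group.production.name # assumed to exist
  allocation_method    = "Static"
  sku                  = "Standard"
  ddos_protection_mode = "Enabled"
}
```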
Unauthorised access
A malicious actor steals credentials, or an ex-employee still has working credentials, and gains privileged access to the live environment.
Impact | They may break the app, read or change confidential data |
Prevention | Separate the production environment and tighten its security. Non-production environments should only hold test or anonymised data. Use Azure PIM to give users temporary access. Make sure the offboarding process is followed. Use single sign-on and 2FA where possible. Use Azure RBAC for AKS, and separate service namespaces and resources into service AD groups, so that developers can only access their own services (see the sketch below). Do not give databases a public IP. |
Detection | Azure audit logs |
Remediation | Revoke access of the suspicious user and investigate their actions. Rotate secrets they may know and possibly restore the database to a known good state. |
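As a rough illustration, restricting a team to its own namespace might look like this with the Terraform Kubernetes provider (the namespace name and AD group object ID are placeholders):

```hcl
# Illustrative sketch: grant an Azure AD group edit rights in a single
# service namespace only, so developers cannot touch other services.
resource "kubernetes_role_binding" "service_developers" {
  metadata {
    name      = "service-developers"
    namespace = "my-service" # placeholder namespace
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "edit" # built-in ClusterRole, scoped here to one namespace
  }

  subject {
    kind      = "Group"
    name      = "00000000-0000-0000-0000-000000000000" # placeholder AD group object ID
    api_group = "rbac.authorization.k8s.io"
  }
}
```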
Disclosure of secrets
Different kinds of sensitive information may be posted online accidentally by a developer, on a website like Pastebin or committed to a public GitHub repository. Examples:
- Deployment secrets like AWS API key
- Application secrets like Google API key
- Application data like a database dump
Impact | A malicious actor may gain access to the system, break the app, read or change confidential data, deploy extra applications. |
Prevention | Secrets should be stored in a secure location like Azure Key Vault (see Managing secrets), Azure DevOps variables or GitHub secrets. They should not be stored in a local file; if one is needed temporarily, make sure to exclude it with .gitignore. Use a Terraform remote state backend in a secure Azure Storage account, not local files (see the sketch below). Do not expose databases on the internet and refrain from downloading production dumps. Store database backups in a secure storage account and use only anonymised data for non-production environments. Install git-secrets locally. Configure GitGuardian on the repository. |
Detection | Blocked by git-secrets or a notification from GitGuardian. Alerts on overspending. |
Remediation | Remove the secrets from the public place, rotate all the exposed secrets, and investigate whether they were used. In the case of GitHub, remove them from the repository and its history, and make a request to GitHub to remove any pull request or branch that includes them, as those can exist beyond the life of the repo. |
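As a rough illustration, a remote state backend in Azure Storage is configured along these lines (the resource group, account, container and key names are placeholders):

```hcl
# Illustrative sketch: state is kept in a secure storage account rather than
# in local files, so it never needs to be committed to the repository.
terraform {
  backend "azurerm" {
    resource_group_name  = "app-production"       # placeholder
    storage_account_name = "appproductiontfstate" # placeholder
    container_name       = "terraform-state"      # placeholder
    key                  = "production.tfstate"
  }
}
```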
SSL certificate expiry
Each service must have a valid SSL certificate, otherwise clients cannot connect. Certificates have an expiry date and are not valid after that date.
Impact | Users can’t access the website. Or they may ignore browser warnings and could then be tricked into a malicious website. |
Prevention | Set up auto-renewal of certificates stored in Azure Key Vault (see the sketch below). Services using Azure Front Door are configured with a custom domain, which generates a certificate and renews it automatically. If not auto-renewed, set up monitoring of the expiry date. Certificates created on DfE’s GlobalSign are monitored by Operations and owners receive notifications. |
Detection | Email from Operations or notification from monitoring |
Remediation | If not auto-renewed, issue a new DigiCert certificate and install it on the website
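As a rough illustration, an auto-renewing certificate in Key Vault might be declared like this (the issuer, subject and vault reference are placeholders; a real set-up would use the configured certificate authority rather than a self-signed issuer):

```hcl
# Illustrative sketch: Key Vault renews the certificate automatically 30 days
# before it expires.
resource "azurerm_key_vault_certificate" "service" {
  name         = "service-cert"            # placeholder
  key_vault_id = azurerm_key_vault.main.id # assumed to exist

  certificate_policy {
    issuer_parameters {
      name = "Self" # placeholder; use the configured CA issuer in practice
    }

    key_properties {
      exportable = true
      key_size   = 2048
      key_type   = "RSA"
      reuse_key  = true
    }

    lifetime_action {
      action {
        action_type = "AutoRenew"
      }
      trigger {
        days_before_expiry = 30
      }
    }

    secret_properties {
      content_type = "application/x-pkcs12"
    }

    x509_certificate_properties {
      subject            = "CN=service.example.gov.uk" # placeholder
      validity_in_months = 12
      key_usage          = ["digitalSignature", "keyEncipherment"]
    }
  }
}
```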
Traffic spike
A sudden spike in user traffic due to an announcement, a product launch or a coincidence may overload the system.
Impact | The system is slow or unresponsive |
Prevention | Set up response time monitoring. Run load testing to determine bottlenecks and know how to scale up. Use a CDN for web page caching, and internal caching like Redis or Memcached. |
Detection | Alert from response time monitoring, high CPU or memory usage, instances crashing |
Remediation | Scale applications and services horizontally and vertically (see the sketch below). Disable expensive features
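As a rough illustration, horizontal scaling can be automated with a pod autoscaler, sketched here with the Terraform Kubernetes provider (the deployment name, namespace and thresholds are placeholders):

```hcl
# Illustrative sketch: add replicas when average CPU usage is high, up to a
# maximum, and scale back down when the spike passes.
resource "kubernetes_horizontal_pod_autoscaler" "app" {
  metadata {
    name      = "app"
    namespace = "app" # placeholder
  }

  spec {
    min_replicas = 2
    max_replicas = 10

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "app" # placeholder deployment
    }

    target_cpu_utilization_percentage = 70
  }
}
```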
DfE Sign-In failure
DfE Sign-in is a single-sign-on solution for many websites.
Impact | Users cannot connect to the website |
Prevention | Implement a login workaround via magic link or username/password |
Detection | Smoke test failure. DfE Sign-in status page |
Remediation | Activate the login workaround |
GitHub
GitHub is our code repository, continuous integration system (GitHub actions) and Docker registry (GHCR).
Impact | Users are not impacted, but we would not be able to deploy via automation |
Prevention | Plan to be able to deploy manually. Have DockerHub or Azure container registry ready as backup registry. |
Detection | GitHub status page |
Remediation | Build and deploy manually |
DockerHub
DockerHub is a docker container registry. New application version images are built, pushed to DockerHub then used for deployment.
If DockerHub is down it won’t impact the running service, but we won’t be able to deploy new versions, for example for bugfixes or reverting to older versions.
Impact | Users are not impacted, but we would not be able to deploy via automation |
Prevention | Have GitHub container registry or Azure container registry ready as backup registry. |
Detection | DockerHub status page. Notification of pipeline failures |
Remediation | Build and deploy manually |
Monitoring and logging failure
We rely on services like Logit.io, StatusCake, Prometheus ecosystem, Skylight, Sentry, Azure Monitor.
Impact | Users are not impacted, but we would lose visibility of our systems |
Prevention | |
Detection | |
Remediation | Check manually |
GOV.UK Notify
GOV.UK Notify is used to communicate with our users via emails, texts and letters.
Impact | We can’t send communications to our users. Asynchronous sending jobs may queue until it’s available again |
Prevention | |
Detection | GOV.UK Notify status page |
Remediation |
Google BigQuery
Our services load web analytics and database data into Google BigQuery via an event stream. BQ Data is then available for analysis using various tools.
Impact | Unable to send data to BQ. Reporting unavailable or out of date. |
Prevention | |
Detection | Sentry errors. Daily monitoring of out of date data by the BI team. Daily checksums to confirm service database tables match data kept in BQ. GCP status page. |
Remediation | Missing data can be manually reloaded when BQ is available. GCP Basic support. |
Google API
Various Google APIs are used by our services, including Geocoding, Indexing, and Analytics.
Impact | Some application functionality will be degraded or unavailable. |
Prevention | |
Detection | Sentry errors. Check the Maps API status. |
Remediation | GCP Basic support. |