The DfE technical guidance and its content is intended for internal use by the DfE community.

Disaster recovery

This document is intended to list technical risks to our digital services and the mitigations we have in place.

Application bug

A software or configuration defect is deployed and impacts users.

Impact Loss of some functionality in a service
Prevention Code review. Implement sufficient unit tests and integration tests in production-like environments.
Detection Ideally, a failing smoke test reports it before release. Otherwise it may be reported by error tracking software like Sentry or, in the worst case, by a user contacting support after release.
Remediation The quickest action is to roll back the problematic change or roll forward with a fix.
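
As an illustration of the smoke test mentioned under Detection, here is a minimal sketch in Python. The function names, the example paths and the injectable `fetch` callable are assumptions made for this example, not part of any DfE tooling.

```python
from typing import Callable, Iterable
from urllib.request import urlopen


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request (raises on network or HTTP errors)."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def smoke_test(base_url: str, paths: Iterable[str],
               fetch: Callable[[str], int] = http_status) -> list[str]:
    """Return the paths that did not answer with HTTP 200."""
    failures = []
    for path in paths:
        try:
            status = fetch(base_url + path)
        except Exception:
            status = None  # network error, timeout or HTTP error status
        if status != 200:
            failures.append(path)
    return failures
```

A pipeline step could call `smoke_test("https://service.example", ["/healthcheck"])` (hypothetical URL and path) after deployment and fail the release if the returned list is not empty.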

Application crash

The application may crash because of a bug, memory leak, high utilisation…

Impact End users may not be affected if the service runs multiple application instances.
Prevention Crashes may happen because of high memory, CPU or disk usage. These metrics should be monitored, with alerts raised early enough to avoid the crash entirely.
Detection If the whole application crashes, endpoint monitoring like StatusCake would notify of a total outage impacting users. The crash of a single application instance may be reported by monitoring.
Remediation The quickest action is to roll back the problematic change or roll forward with a fix. Ideally the platform detects a failing application and restarts it.
For example, Kubernetes detects the failure by running frequent healthchecks, then deploys a new container and kills the failed one.
If there is no such feature, the application may be restarted manually. If the restart doesn’t work, the application and infrastructure must be investigated manually.
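
The restart-on-failed-healthcheck behaviour described above can be sketched as follows. This is a simplified illustration, not how Kubernetes is implemented: the class name, the `check` and `restart` callables and the default threshold of 3 consecutive failures are assumptions chosen for the example.

```python
from typing import Callable


class LivenessMonitor:
    """Restart an instance after a number of consecutive failed health checks."""

    def __init__(self, check: Callable[[], bool], restart: Callable[[], None],
                 failure_threshold: int = 3) -> None:
        self.check = check
        self.restart = restart
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    def probe(self) -> None:
        """Run one health check and restart once the threshold is reached."""
        try:
            healthy = self.check()
        except Exception:
            healthy = False  # a check that raises counts as a failure
        if healthy:
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.restart()
            self.consecutive_failures = 0
```

Requiring several consecutive failures avoids restarting an instance because of a single slow response; a successful check resets the counter.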

Data corruption

The data in the database is corrupted because of a bug, human error, malicious activity… and cannot be recovered.

Impact Some data may be lost, updated with incorrect values or presented to the wrong users.
Prevention Azure Database for PostgreSQL keeps backups of the database and transaction logs. We can recreate the database from a daily backup or a point-in-time restore (1s resolution).
Detection Smoke tests may detect corruption in some critical data.
Remediation Access to the service should be stopped immediately.
The data may be fixed manually if the change is simple. If the change is complex or if we don’t know the extent of the issue, it may be necessary to recover the database from a backup, whether daily, hourly or point-in-time using transaction logs.
Restore the database from the latest snapshot or to a point in time.
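
When restoring to a point in time, the target is usually the last moment before the first bad change. The sketch below shows one way to compute it, assuming a hypothetical helper name, a one-second safety margin and a known start of the retention window; none of these come from Azure tooling.

```python
from datetime import datetime, timedelta, timezone


def choose_restore_point(first_bad_change: datetime,
                         earliest_restorable: datetime,
                         margin: timedelta = timedelta(seconds=1)) -> datetime:
    """Pick a restore time just before the first known bad change."""
    # Truncate to whole seconds to match the 1s restore resolution.
    target = (first_bad_change - margin).replace(microsecond=0)
    if target < earliest_restorable:
        raise ValueError("corruption predates the backup retention window")
    return target
```

The resulting timestamp would then be passed to whichever restore mechanism the service uses. Any writes made after that time are lost and may need to be replayed manually.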

Loss of database instance

It is possible to lose the database instance and the associated backups. For example, if the database server is deleted from Azure through human or automation error, the whole instance is deleted, including its backups.

Impact Users can’t read or write any data. All data is lost.
Prevention To protect against human errors, users should only be allowed to access production when they need to.
To protect against automation errors, changes should be thoroughly reviewed, in pull requests or review apps.
Keep a daily backup of the production databases in a secure place like an Azure storage account. Production backups should only be accessible by authorised users.
Detection Endpoint monitoring may point to a healthcheck page checking the connection to the database. Or smoke tests running in production may detect it.
Remediation Restore the database from the external daily backup, or from the most recent backup available.
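
An external daily backup only helps if it is actually being produced. The following sketch checks that the newest backup is recent enough. The function names and the 26 hour tolerance are assumptions for illustration, and listing the backup timestamps from the storage account is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def latest_backup(backup_times: Iterable[datetime]) -> Optional[datetime]:
    """Return the most recent backup time, or None if there are no backups."""
    return max(backup_times, default=None)


def backup_is_fresh(backup_times: Iterable[datetime], now: datetime,
                    max_age: timedelta = timedelta(hours=26)) -> bool:
    """True if the newest backup is no older than max_age."""
    newest = latest_backup(backup_times)
    return newest is not None and now - newest <= max_age
```

The tolerance is slightly longer than 24 hours so that a backup job finishing a little later than the previous day does not raise a false alarm.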

Loss of Azure/AWS availability zone

We deploy to the UK South or West Europe regions, which each have 3 separate availability zones (AZ). One of them may become unavailable, affecting network, compute or storage services.

Impact Applications may be slow or unavailable
Prevention Applications should be built with failure in mind: deploy multiple application instances and deploy databases in cluster mode. Spread them across multiple AZs for high availability.
Our AKS clusters are spread across 3 AZs. Scale applications to more than 1 replica and enable zone redundancy.
Detection Endpoint monitoring checking for uptime and response time
Remediation If not handled automatically by the platform, redeploy applications and fail over clusters
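
To make the "more than 1 replica, spread across AZs" advice concrete, the sketch below checks whether a placement of replicas would survive the loss of any single zone. The function name and the representation of zones as strings are assumptions for this example.

```python
from collections import Counter
from typing import Iterable


def survives_single_zone_loss(replica_zones: Iterable[str],
                              min_remaining: int = 1) -> bool:
    """True if at least min_remaining replicas are left after losing any one zone."""
    counts = Counter(replica_zones)
    total = sum(counts.values())
    if total == 0:
        return False
    return all(total - in_zone >= min_remaining for in_zone in counts.values())
```

Two replicas in the same zone fail this check, while two replicas in different zones pass it. This is why scaling out alone is not enough without zone redundancy.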

Loss of Azure/AWS region

In some rare cases, an entire region might become unavailable.

Impact Applications may be unavailable
Prevention For critical applications, it is possible to deploy to 2 different regions, synchronise the data and configure DNS-based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up.
Detection Endpoint monitoring checking for uptime
Remediation Start services in backup region, trigger DNS failover

Azure issues impacting delivery

We often rely on Azure for:

  • Terraform state in Azure Storage
  • Infrastructure and application secrets in Azure Key Vault
  • Daily production database backups in Azure Storage
Impact This would not impact the running applications but would prevent us from deploying new versions and backing up the database.
Prevention Enable soft delete on Key Vaults. Secrets are versioned.
Enable container soft delete
Enable versioning for blobs
Detection The pipelines or deployments may fail
Remediation Key Vaults with soft delete are recoverable. Secrets are versioned and recoverable in case of corruption or deletion
Azure Storage accounts can be recovered for 14 days if they were deleted by mistake
Storage containers with soft delete are recoverable
Versioned files in storage blobs are recoverable in case of corruption or deletion

Denial of service

An attacker may send a high number of requests to overload the service and make it unavailable.

Impact The service is unavailable or slow for users
Prevention Every resource in Azure is protected by Azure’s infrastructure DDoS (Basic) Protection
Depending on the criticality of the service, it is possible to use Azure DDoS Protection Standard instead.
Detection Endpoint monitoring checking for uptime and response time
Remediation Protection measures are triggered automatically. It is also possible to analyse the traffic pattern and change the application accordingly.

Unauthorised access

A malicious actor steals credentials, or an ex-employee still has working credentials, and gains privileged access to the live environment.

Impact They may break the app, read or change confidential data
Prevention Separate the production environment and tighten its security. Non-production environments should only hold test or anonymised data.
Revoke access every day or use Azure PIM to give users temporary access. Make sure the offboarding process is followed. Use single sign-on and 2FA when possible.
Do not give databases a public IP.
Detection Azure audit logs
Remediation Revoke access of the suspicious user, investigate their actions
Rotate secrets they may know and possibly restore the database to a known good state.

Disclosure of secrets

Different kinds of sensitive information may be posted online accidentally by a developer, for example on a website like Pastebin or in a commit to a public GitHub repository. Examples:

  • Deployment secrets like AWS API key
  • Application secrets like Google API key
  • Application data like a database dump
Impact A malicious actor may gain access to the system, break the app, read or change confidential data, deploy extra applications.
Prevention Secrets should be stored in a secure location like Azure Key Vault (see Managing secrets), Azure DevOps variables or GitHub secrets. They should not be stored in a local file. If a local file is temporarily necessary, make sure to exclude it with .gitignore.
Use Terraform remote state backend in a secure Azure Storage account and not local files.
Do not expose databases on the internet and refrain from downloading production dumps. Store database backups in a secure storage account and use only anonymised data for non-production environments.
Install git-secrets locally. Configure GitGuardian on the repository.
Detection Commit blocked by git-secrets or notification from GitGuardian. Alert on overspending.
Remediation Remove the secrets from the public place, rotate all the exposed secrets, investigate if they were used.
In the case of GitHub, remove the secret from the repository and its history, then ask GitHub to remove any pull request or branch that includes it, as those can exist beyond the life of the repo.
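
To illustrate the kind of check a secret scanner performs, here is a minimal Python sketch that looks for two widely published key formats. It is not how git-secrets or GitGuardian are implemented. The pattern set and the function name are assumptions, and real scanners ship many more rules.

```python
import re

# Illustrative patterns only, based on commonly published key prefixes.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"(?<![0-9A-Z])AKIA[0-9A-Z]{16}(?![0-9A-Z])"),
    "Google API key": re.compile(r"AIza[0-9A-Za-z_\-]{35}"),
}


def find_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every line that looks like it holds a secret."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, name))
    return findings
```

A check like this can run as a pre-commit hook so that the commit is rejected before the secret reaches the remote repository. Pattern matching produces false positives and misses unknown formats, so it complements rather than replaces secure secret storage.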

SSL certificate expiry

Each service must have a valid SSL certificate, otherwise clients cannot connect. Certificates have an expiry date and are not valid after that date.

Impact Users can’t access the website. Or they may ignore browser warnings and could then be tricked into visiting a malicious website.
Prevention Set up auto renewal of certificates stored in Azure Key Vault. Services using Azure Front Door are configured with a custom domain which generates a certificate and renews it automatically. If not auto renewed, set up monitoring of the expiry date. Certificates created on DfE’s GlobalSign are monitored by Operations and owners receive notifications.
Detection Email from Operations or notification from monitoring
Remediation If not auto renewed, issue a new DigiCert certificate and install it on the website
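
Where a certificate is not auto renewed, expiry monitoring can be as simple as the sketch below, which uses only the Python standard library. The function names are assumptions for this example, and the alert threshold is left to the caller.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between now and a certificate's notAfter value."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(hostname: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate served by hostname."""
    # The default context verifies the certificate, so an already expired
    # certificate raises ssl.SSLCertVerificationError instead of returning.
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("service.example"), datetime.now(timezone.utc))` (hypothetical hostname) and notify the owners when the result drops below a chosen number of days.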

Traffic spike

A sudden spike in user traffic due to an announcement, a product launch or a coincidence may overload the system.

Impact The system is slow or unresponsive
Prevention Set up response time monitoring.
Run load testing to determine bottlenecks and know how to scale up.
Use CDN for web page caching and internal caching like Redis or Memcached.
Detection Alert from response time monitoring, high CPU or memory usage, instances crashing
Remediation Scale applications and services horizontally and vertically
Disable expensive features

DfE Sign-In failure

DfE Sign-in is a single sign-on solution for many websites.

Impact Users cannot connect to the website
Prevention Implement login workaround via magic link or username/password
Detection Smoke test failure. DfE Sign-in status page
Remediation Activate the login workaround

GitHub

GitHub is our code repository, continuous integration system (GitHub Actions) and Docker registry (GHCR).

Impact Users are not impacted, but we would not be able to deploy via automation
Prevention Plan to be able to deploy manually. Have DockerHub or Azure Container Registry ready as a backup registry.
Detection GitHub status page
Remediation Build and deploy manually

DockerHub

DockerHub is a Docker container registry. New application version images are built, pushed to DockerHub, then used for deployment.

If DockerHub is down it won’t impact the running service, but we won’t be able to deploy new versions, for example for bugfixes or reverting to older versions.

Impact Users are not impacted, but we would not be able to deploy via automation
Prevention Have GitHub Container Registry or Azure Container Registry ready as a backup registry.
Detection DockerHub status page
Notification of pipeline failures
Remediation Build and deploy manually

Monitoring and logging failure

We rely on services like Logit.io, StatusCake, the Prometheus ecosystem, Skylight and Sentry.

Impact Users are not impacted, but we would lose visibility of our systems
Prevention
Detection
Remediation Check manually

GOV.UK Notify

GOV.UK Notify is used to communicate with our users via emails, texts and letters.

Impact We can’t send communications to our users
Asynchronous sending jobs may queue until it’s available again
Prevention
Detection GOV.UK Notify status page
Remediation

Google BigQuery

TBD

Google API

TBD