Disaster recovery
This document lists the technical risks to our digital services and the mitigations we have in place.
For any issue affecting a service (or services), always check whether any dependent services may also be affected.
Application bug
A software or configuration defect is deployed and impacts users.
Impact | Loss of some functionality in a service |
Prevention | Code review. Implement sufficient unit tests and integration tests in production-like environments. |
Detection | Ideally it is caught by a failing smoke test before release; otherwise by error tracking software like Sentry or, in the worst-case scenario, by a user contacting support after release |
Remediation | The quickest action is to roll back the problematic change or roll forward with a fix |
Application crash
The application may crash because of a bug, memory leak, high utilisation…
Impact | It may or may not impact end users, as a service may run multiple application instances. |
Prevention | Crashes may happen because of high memory, CPU or disk usage. These metrics should be monitored, with alerts raised in advance, so the crash can be avoided entirely. |
Detection | If the whole application crashes, endpoint monitoring like StatusCake would notify of a total outage impacting users. A crash of a single application instance may be reported by infrastructure monitoring. |
Remediation | The quickest action is to roll back the problematic change or roll forward with a fix. Ideally the platform detects a failing application and restarts it. For example, Kubernetes detects the failure by running frequent healthchecks, then deploys a new container and kills the failed one (see the sketch below). If there is no such feature, the application may be restarted manually. If the restart doesn’t work, the application and infrastructure must be investigated manually. AKS also uses rolling deployments, so a new deployment only becomes active if the startup probe (set to a service healthcheck) is successful. |
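As a rough illustration, such probes might be declared with the Terraform Kubernetes provider along these lines (the application name, namespace, image, /healthcheck path and port are placeholders, not our actual configuration):

```hcl
# Illustrative sketch: a Deployment whose instances the platform restarts
# when the healthcheck keeps failing.
resource "kubernetes_deployment" "app" {
  metadata {
    name      = "app"
    namespace = "app"
  }

  spec {
    replicas = 2 # more than 1 instance limits the impact of a single crash

    selector {
      match_labels = { app = "app" }
    }

    template {
      metadata {
        labels = { app = "app" }
      }

      spec {
        container {
          name  = "app"
          image = "ghcr.io/example-org/app:latest" # placeholder image

          # Kubernetes restarts the container if this probe keeps failing.
          liveness_probe {
            http_get {
              path = "/healthcheck"
              port = "3000"
            }
            period_seconds    = 10
            failure_threshold = 3
          }

          # A failing instance is removed from the load balancer while it
          # recovers, so users are not routed to it.
          readiness_probe {
            http_get {
              path = "/healthcheck"
              port = "3000"
            }
            period_seconds = 10
          }
        }
      }
    }
  }
}
```

A startup probe can be added in the same way to gate rolling deployments on a successful healthcheck.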
Data corruption
The data in the database is corrupted because of a bug, human error, malicious activity… and cannot be recovered.
Impact | Some data may be lost, updated with incorrect values or presented to the wrong users. |
Prevention | Azure Postgres keeps backups of the database and transaction logs. We can recreate the database from a daily backup or a point-in-time (1 second resolution) restore |
Detection | Smoke tests may detect corruptions in some critical data. |
Remediation | Access to the service should be stopped immediately. The data may be fixed manually if the change is simple. If the change is complex, or if we don’t know the extent of the issue, it may be necessary to restore the database from a backup, whether daily, hourly or point-in-time using the transaction logs (see the sketch below). |
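As a rough illustration, a point-in-time restore of an Azure Postgres flexible server can be expressed in Terraform along these lines (the server names, resource group and timestamp are placeholders; the same restore can also be run from the Azure portal or CLI):

```hcl
# Illustrative sketch: create a new server restored from a point-in-time
# backup of an assumed existing production server, just before the corruption.
resource "azurerm_postgresql_flexible_server" "restored" {
  name                = "app-production-restored"               # placeholder name
  resource_group_name = azurerm_resource_group.production.name  # assumed to exist
  location            = azurerm_resource_group.production.location

  create_mode                       = "PointInTimeRestore"
  source_server_id                  = azurerm_postgresql_flexible_server.production.id
  point_in_time_restore_time_in_utc = "2024-01-01T09:00:00Z" # placeholder timestamp
}
```

Once the restored server has been verified, traffic can be pointed at it (or the data copied back) before access to the service is re-enabled.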
Loss of database instance
It is possible to lose the database instance and the associated backups. For example, if the database server is deleted from Azure because of human or automation error, the whole instance is deleted, including its backups.
Impact | Users can’t read or write any data. All data is lost. |
Prevention | To protect against human error, users should only be allowed to access production when they need to. To protect against automation errors, changes should be thoroughly reviewed, in pull requests or review apps. Keep a daily backup of the production databases in a secure place like an Azure storage account. Production backups should only be accessible by authorised users. |
Detection | Endpoint monitoring may point to a healthcheck page checking the connection to the database. Or smoke tests running in production may detect it.
Remediation | Restore the database from the external daily backup, or the most recent backup available
Accidental resource deletion
We use Terraform to provision resources, but a code change or a user with privileges could delete resources accidentally or otherwise.
Impact | Applications may be unavailable. Data may be lost. |
Prevention | Approved PIM request required for production Azure access. Pull requests require at least 1 approval. Soft delete and versioning enabled for key vaults and storage accounts. Azure resource locks placed on important resources (see the sketch below). |
Detection | Endpoint monitoring may point to a healthcheck page that is now failing. Or smoke tests running in production may detect it. |
Remediation | Recovery depends on the resource deleted: either restore the correct version, or redeploy and restore data from backup
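As a rough illustration, a resource lock could be declared in Terraform along these lines (the lock name and resource group reference are placeholders):

```hcl
# Illustrative sketch: a CanNotDelete lock on an assumed production resource
# group. The lock must be removed explicitly before anything in scope can be
# deleted, which guards against accidental deletion by users or automation.
resource "azurerm_management_lock" "production" {
  name       = "production-cannot-delete"
  scope      = azurerm_resource_group.production.id # assumed to exist
  lock_level = "CanNotDelete"
  notes      = "Prevents accidental deletion of production resources"
}
```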
Loss of Azure/AWS availability zone
We deploy to the UK South or West Europe regions, which each have 3 separate availability zones (AZs). It may happen that one of them becomes unavailable: either network, compute or storage services are affected.
Impact | Applications may be slow or unavailable |
Prevention | Applications should be built with failure in mind: AKS clusters should be configured with nodes spread across multiple AZs. AKS deployments should use a zone topology spread constraint. Scale applications to more than 1 replica to enable zone redundancy. Azure storage accounts should be ZRS or GZRS if zone redundancy is required (automatic and manual failover). Azure Key Vault uses replication within the region and to a paired region (automatic failover). Azure Postgres can be configured as zone redundant, with the active and standby instances in different zones. Azure Redis can be zone redundant only if using the Premium SKU. Postgres and Redis use automatic failover; Postgres can also be failed over manually. Cluster public IP addresses should be configured with zone redundancy (see the sketch below). |
Detection | Endpoint monitoring checking for uptime and response time |
Remediation | If not handled automatically by the platform, redeploy applications and fail over clusters |
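As a rough illustration, some of the zone redundancy measures above might look like this in Terraform (assuming a recent azurerm 3.x provider; names, SKUs, sizes and the resource group are placeholders, and some required arguments are omitted for brevity):

```hcl
# AKS nodes spread across the region's three availability zones.
resource "azurerm_kubernetes_cluster" "main" {
  name                = "app-production-aks"
  location            = "uksouth"
  resource_group_name = azurerm_resource_group.production.name # assumed to exist
  dns_prefix          = "app-production"

  default_node_pool {
    name       = "default"
    vm_size    = "Standard_D2s_v3"
    node_count = 3
    zones      = ["1", "2", "3"]
  }

  identity {
    type = "SystemAssigned"
  }
}

# Zone redundant Postgres: the standby instance runs in a different zone and
# failover is automatic. (Administrator credentials omitted for brevity.)
resource "azurerm_postgresql_flexible_server" "production" {
  name                = "app-production-psql"
  location            = "uksouth"
  resource_group_name = azurerm_resource_group.production.name
  sku_name            = "GP_Standard_D2s_v3"
  storage_mb          = 32768
  version             = "14"

  high_availability {
    mode = "ZoneRedundant"
  }
}

# Zone redundant public IP address for cluster ingress.
resource "azurerm_public_ip" "ingress" {
  name                = "app-production-ip"
  location            = "uksouth"
  resource_group_name = azurerm_resource_group.production.name
  allocation_method   = "Static"
  sku                 = "Standard"
  zones               = ["1", "2", "3"]
}
```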
Loss of Azure/AWS region
In some rare cases, an entire region might become unavailable.
Impact | Applications may be unavailable |
Prevention | For critical applications, it is possible to deploy to 2 different regions, synchronise the data, and configure a DNS-based failover or GSLB. We don’t usually protect against this risk as it is not worth the complexity of the required set-up. Production Postgres backups are kept in a GRS Azure storage account, which maintains copies of the data in a separate region (see the sketch below). Any critical application data kept in a storage account should be GRS/GZRS. Azure Key Vault maintains a copy of its contents in another region. For storage accounts and key vaults, failover is automatic and transparent. Storage accounts also support manual failover.
Detection | Endpoint monitoring checking for uptime |
Remediation | Start services in the backup region, trigger DNS failover
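As a rough illustration, a geo-redundant storage account for such backups might be declared along these lines (the account name and resource group are placeholders):

```hcl
# Illustrative sketch: GZRS keeps zone redundant copies in the primary region
# plus a copy in the paired region, so backups survive the loss of a region.
resource "azurerm_storage_account" "backups" {
  name                     = "appproductionbackups" # placeholder, must be globally unique
  resource_group_name      = azurerm_resource_group.production.name # assumed to exist
  location                 = "uksouth"
  account_tier             = "Standard"
  account_replication_type = "GZRS"
}
```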
Azure issues impacting delivery
We often rely on Azure for:
- Terraform state in Azure Storage
- Infrastructure and application secrets in Azure Key Vault
- Daily production database backups in Azure Storage
Impact | This would not impact the running applications but would prevent us from deploying new versions and backing up the database. |
Prevention | Enable soft delete on Key Vaults. Secrets are versioned. Enable container soft delete and enable versioning for blobs (see the sketch below). |
Detection | The pipelines or deployments may fail |
Remediation | Key Vaults with soft delete are recoverable. Secrets are versioned and recoverable in case of corruption or deletion. Azure Storage accounts can be recovered for 14 days if they were deleted by mistake. Storage containers with soft delete are recoverable. Versioned files in storage blobs are recoverable in case of corruption or deletion
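As a rough illustration, the soft delete and versioning settings above might be declared like this in Terraform (names and the resource group are placeholders):

```hcl
# Needed to look up the current tenant for the Key Vault.
data "azurerm_client_config" "current" {}

# Illustrative sketch: Key Vault with soft delete and purge protection, so
# deleted secrets and the vault itself can be recovered.
resource "azurerm_key_vault" "main" {
  name                       = "app-production-kv" # placeholder
  location                   = "uksouth"
  resource_group_name        = azurerm_resource_group.production.name # assumed to exist
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  sku_name                   = "standard"
  soft_delete_retention_days = 90
  purge_protection_enabled   = true
}

# Illustrative sketch: storage account for Terraform state and backups, with
# blob versioning plus blob and container soft delete.
resource "azurerm_storage_account" "state" {
  name                     = "appproductiontfstate" # placeholder
  resource_group_name      = azurerm_resource_group.production.name
  location                 = "uksouth"
  account_tier             = "Standard"
  account_replication_type = "GRS"

  blob_properties {
    versioning_enabled = true

    delete_retention_policy {
      days = 14
    }

    container_delete_retention_policy {
      days = 14
    }
  }
}
```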
Denial of service
An attacker may send a high number of requests to overload the service and make it unavailable.
Impact | The service is unavailable or slow for users |
Prevention | Every resource in Azure is protected by Azure’s infrastructure DDoS (Basic) Protection. Depending on the criticality of the service, it is possible to use Azure DDoS Protection Standard instead. Public IP addresses can have DDoS protection enabled individually (see the sketch below). |
Detection | Endpoint monitoring checking for uptime and response time |
Remediation | Protection measures are triggered automatically. It is also possible to analyse the traffic pattern and change the application accordingly. |
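As a rough illustration, per-IP DDoS protection could be enabled along these lines (assuming a recent azurerm 3.x provider; the name and resource group are placeholders):

```hcl
# Illustrative sketch: IP protection enabled on a single public IP address.
resource "azurerm_public_ip" "service" {
  name                 = "service-production-ip" # placeholder
  location             = "uksouth"
  resource_group_name  = azurerm_resource_group.production.name # assumed to exist
  allocation_method    = "Static"
  sku                  = "Standard"
  ddos_protection_mode = "Enabled"
}
```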
Unauthorised access
A malicious actor steals credentials, or an ex-employee still has working credentials, and gains privileged access to the live environment.
Impact | They may break the app, read or change confidential data |
Prevention | Separate the production environment and tighten its security. Non-production environments should only hold test or anonymised data. Use Azure PIM to give users temporary access. Make sure the offboarding process is followed. Use single sign-on and 2FA where possible. Use Azure RBAC for AKS, and separate service namespaces and resources into service AD groups, so that developers can only access their own services (see the sketch below). Do not give databases a public IP. |
Detection | Azure audit logs |
Remediation | Revoke access of the suspicious user and investigate their actions. Rotate secrets they may know and possibly restore the database to a known good state. |
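As a rough illustration, restricting a team to its own namespace might look like this with the Terraform Kubernetes provider (the namespace name and AD group object ID are placeholders):

```hcl
# Illustrative sketch: grant an Azure AD group edit rights in a single
# service namespace only, so developers cannot touch other services.
resource "kubernetes_role_binding" "service_developers" {
  metadata {
    name      = "service-developers"
    namespace = "my-service" # placeholder namespace
  }

  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "edit" # built-in ClusterRole, scoped here to one namespace
  }

  subject {
    kind      = "Group"
    name      = "00000000-0000-0000-0000-000000000000" # placeholder AD group object ID
    api_group = "rbac.authorization.k8s.io"
  }
}
```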
Disclosure of secrets
Different kinds of sensitive information may be posted online accidentally by a developer, on a website like Pastebin or committed to a public GitHub repository. Examples:
- Deployment secrets like AWS API key
- Application secrets like Google API key
- Application data like a database dump
Impact | A malicious actor may gain access to the system, break the app, read or change confidential data, deploy extra applications. |
Prevention | Secrets should be stored in a secure location like Azure Key Vault (see Managing secrets), Azure DevOps variables or GitHub secrets. They should not be stored in a local file; if one is needed temporarily, make sure to exclude it with .gitignore. Use a Terraform remote state backend in a secure Azure Storage account, not local files (see the sketch below). Do not expose databases on the internet and refrain from downloading production dumps. Store database backups in a secure storage account and use only anonymised data for non-production environments. Install git-secrets locally. Configure GitGuardian on the repository. |
Detection | Blocked by git-secrets or a notification from GitGuardian. Alerts on overspending. |
Remediation | Remove the secrets from the public place, rotate all the exposed secrets, and investigate whether they were used. In the case of GitHub, remove them from the repository and its history, and make a request to GitHub to remove any pull request or branch that includes them, as those can exist beyond the life of the repo. |
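As a rough illustration, a remote state backend in Azure Storage is configured along these lines (the resource group, account, container and key names are placeholders):

```hcl
# Illustrative sketch: state is kept in a secure storage account rather than
# in local files, so it never needs to be committed to the repository.
terraform {
  backend "azurerm" {
    resource_group_name  = "app-production"       # placeholder
    storage_account_name = "appproductiontfstate" # placeholder
    container_name       = "terraform-state"      # placeholder
    key                  = "production.tfstate"
  }
}
```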
SSL certificate expiry
Each service must have a valid SSL certificate, otherwise clients cannot connect. Certificates have an expiry date and are not valid after that date.
Impact | Users can’t access the website. Or they may ignore browser warnings and could then be tricked into a malicious website. |
Prevention | Set up auto-renewal of certificates stored in Azure Key Vault (see the sketch below). Services using Azure Front Door are configured with a custom domain, which generates a certificate and renews it automatically. If not auto-renewed, set up monitoring of the expiry date. Certificates created on DfE’s GlobalSign are monitored by Operations and owners receive notifications. |
Detection | Email from Operations or notification from monitoring |
Remediation | If not auto-renewed, issue a new DigiCert certificate and install it on the website
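As a rough illustration, an auto-renewing certificate in Key Vault might be declared like this (the issuer, subject and vault reference are placeholders; a real set-up would use the configured certificate authority rather than a self-signed issuer):

```hcl
# Illustrative sketch: Key Vault renews the certificate automatically 30 days
# before it expires.
resource "azurerm_key_vault_certificate" "service" {
  name         = "service-cert"            # placeholder
  key_vault_id = azurerm_key_vault.main.id # assumed to exist

  certificate_policy {
    issuer_parameters {
      name = "Self" # placeholder; use the configured CA issuer in practice
    }

    key_properties {
      exportable = true
      key_size   = 2048
      key_type   = "RSA"
      reuse_key  = true
    }

    lifetime_action {
      action {
        action_type = "AutoRenew"
      }
      trigger {
        days_before_expiry = 30
      }
    }

    secret_properties {
      content_type = "application/x-pkcs12"
    }

    x509_certificate_properties {
      subject            = "CN=service.example.gov.uk" # placeholder
      validity_in_months = 12
      key_usage          = ["digitalSignature", "keyEncipherment"]
    }
  }
}
```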
Traffic spike
A sudden spike in user traffic due to an announcement, a product launch or a coincidence may overload the system.
Impact | The system is slow or unresponsive |
Prevention | Set up response time monitoring. Run load testing to determine bottlenecks and know how to scale up. Use a CDN for web page caching, and internal caching like Redis or Memcached. |
Detection | Alert from response time monitoring, high CPU or memory usage, instances crashing |
Remediation | Scale applications and services horizontally and vertically (see the sketch below). Disable expensive features
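As a rough illustration, horizontal scaling can be automated with a pod autoscaler, sketched here with the Terraform Kubernetes provider (the deployment name, namespace and thresholds are placeholders):

```hcl
# Illustrative sketch: add replicas when average CPU usage is high, up to a
# maximum, and scale back down when the spike passes.
resource "kubernetes_horizontal_pod_autoscaler" "app" {
  metadata {
    name      = "app"
    namespace = "app" # placeholder
  }

  spec {
    min_replicas = 2
    max_replicas = 10

    scale_target_ref {
      api_version = "apps/v1"
      kind        = "Deployment"
      name        = "app" # placeholder deployment
    }

    target_cpu_utilization_percentage = 70
  }
}
```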
DfE Sign-In failure
DfE Sign-in is a single-sign-on solution for many websites.
Impact | Users cannot connect to the website |
Prevention | Implement a login workaround via magic link or username/password |
Detection | Smoke test failure. DfE Sign-in status page |
Remediation | Activate the login workaround |
GitHub
GitHub is our code repository, continuous integration system (GitHub actions) and Docker registry (GHCR).
Impact | Users are not impacted, but we would not be able to deploy via automation |
Prevention | Plan to be able to deploy manually. Have DockerHub or Azure container registry ready as backup registry. |
Detection | GitHub status page |
Remediation | Build and deploy manually |
DockerHub
DockerHub is a docker container registry. New application version images are built, pushed to DockerHub then used for deployment.
If DockerHub is down it won’t impact the running service, but we won’t be able to deploy new versions, for example for bugfixes or reverting to older versions.
Impact | Users are not impacted, but we would not be able to deploy via automation |
Prevention | Have GitHub container registry or Azure container registry ready as backup registry. |
Detection | DockerHub status page. Notification of pipeline failures |
Remediation | Build and deploy manually |
Monitoring and logging failure
We rely on services like Logit.io, StatusCake, Prometheus ecosystem, Skylight, Sentry, Azure Monitor.
Impact | Users are not impacted, but we would lose visibility of our systems |
Prevention | |
Detection | |
Remediation | Check manually |
GOV.UK Notify
GOV.UK Notify is used to communicate with our users via emails, texts and letters.
Impact | We can’t send communications to our users. Asynchronous sending jobs may queue until it’s available again |
Prevention | |
Detection | GOV.UK Notify status page |
Remediation |
Google BigQuery
Our services load web analytics and database data into Google BigQuery via an event stream. BQ Data is then available for analysis using various tools.
Impact | Unable to send data to BQ. Reporting unavailable or out of date. |
Prevention | |
Detection | Sentry errors. Daily monitoring of out of date data by the BI team. Daily checksums to confirm service database tables match data kept in BQ. GCP status page. |
Remediation | Missing data can be manually reloaded when BQ is available. GCP Basic support. |
Google API
Various Google APIs are used by our services, including Geocoding, Indexing, and Analytics.
Impact | Some application functionality will be degraded or unavailable. |
Prevention | |
Detection | Sentry errors. Check the Maps API status. |
Remediation | GCP Basic support. |