

Data Principles

Context

Our work is guided by the DfE Enterprise Data Architecture Principles - please read them before continuing. This guidance explains how we apply those principles in digital services - it extends the DfE Principles, rather than replacing them.

Principles

Data is an Asset

Data is valuable. The kinds of data we deal with usually fall into a few categories:

  • Transactional data is the data at the heart of the services we provide: the status of applications, for instance. Its value is obvious - without it, the services can’t work.
    • Sometimes, the term Master data is used to refer to data about things that have long histories of interaction with the system, such as people and schools, and the term “transactional” data is then used to refer to data about more transient things like applications to courses.
  • Reference data, sometimes known simply as “lists”, is largely read-only data used to support services: lists of GCSE subjects, administrative districts, institutions, and so on. Reference data allows us to improve user experiences with autocompletion, improve the quality of data by letting users pick from a list rather than entering a name that might be wrong, and so on - but outdated reference data is worse than useless.
  • Analytical data is the “logs” of activity generated by users using our services. Its value is immense if used properly: we can use it to understand our users’ behaviour, identify ways to improve our services, and justify the value of those services.
    • We may also hear the terms Processed, Generated or Computed data - these are not different kinds of data, but instead refer to data that has been generated from other data in some way (for instance, summarising it, “cleaning” it by removing invalid records, anonymising it by removing personal information, etc). Data that is not the outcome of a process on some other data, but comes directly from an original source, is often referred to as Raw data.

Data is Secure

Our security concerns around data mainly boil down to:

  • Protecting data from loss or error. Risks here include physical damage to systems, malicious attackers attempting to destroy, alter, or deny access to data, accidental human error in data entry or modification, and bugs in software causing data loss. This risk applies to all data we manage.
  • Protecting sensitive data from being accessed by unauthorised parties. This includes malicious attackers attempting to circumvent access controls, as well as data accidentally being left unprotected - due to access controls not being applied correctly, or copies of the data being left in public places. This risk only applies to sensitive data such as personal information, but as we deal with analytical data we must be particularly aware that data may be more sensitive than it initially appears: sensitive data can often be inferred from seemingly innocuous data, particularly if it can be cross-referenced with other data.

Data is Shared, Data is Obtainable

Generally, the services we work with already make data available to the end users who need it, so our main concern is making sure that data is obtainable internally. As this means we are both sharing and using the data, we can consider these two principles as one.

Our desire to share data has two main drivers:

  • Creating more “joined-up” services for our users, to avoid duplication of effort and scope for error on their part (known as “Tell us once”: users shouldn’t have to tell us the same thing more than once).

  • Obtaining a “big picture” view of our services as a whole from within.

For instance:

  • Transactional data needs to be made available so that analytical and reference data can be generated from it.
  • Transactional data needs to be made available between services, to present a single joined-up “system” to end users and avoid duplication.
  • Analytical data needs to be available for aggregation (combining data from multiple services to create data about the entire service line) as well as for actual analysis.
  • Reference data needs to be obtainable as painlessly as possible, to unleash its value in our services.

To facilitate all of these, we need to think about structuring our data in ways that make it easy to connect with other data sets, such as agreeing on common ways to identify and structure problem-domain objects; and choosing good technical mechanisms to access the data (such as shared databases, or APIs).
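
As an illustration (the project, dataset, and column names here are invented), if two services both record the school a record relates to using the same identifier - say, a URN - then cross-referencing them in a shared store such as BigQuery becomes a single join rather than a bespoke matching exercise:

```sql
-- Hypothetical example: two services' datasets share a school URN column,
-- so their records can be linked directly on that common ID.
SELECT
  apps.application_id,
  placements.placement_id,
  apps.school_urn
FROM `my-project.service_a_dataset.applications` AS apps
JOIN `my-project.service_b_dataset.placements` AS placements
  ON placements.school_urn = apps.school_urn;
```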

Data is Analysable

We need to be mindful of how the data we gather can be used to drive analysis, and ensure that enough contextual information is captured to make sense of the data and link it to data from other services.

Data use is Ethical

It’s uncontroversial to state that data use must be ethical, but it’s not always obvious what ethical issues might arise. Regulations such as GDPR provide some ideas, but we must go above and beyond that bare minimum, and always ensure that our use of data does not compromise our mission: education. If we set a poor example of how data should be used, then we have failed.

For example, our use of data must never be more than what the person the data is about has knowingly consented to (acknowledging that most people don’t actually read privacy statements and other fine print).

If we ask for their address to send them a letter, and the user interface says “Please enter the address you’d like us to send your letter to”, then aggregating that address data to work out which regions of the country use our services the most is probably fine (IF done in compliance with all regulations, which means it’ll need to be mentioned in the privacy policy), because people generally assume that sort of thing might happen.

But using it to also send them a birthday card every year, even if it mentions that somewhere in the privacy policy, wouldn’t be ethical. Even if we think it’s a lovely idea, some people don’t want to be reminded of their birthday, and the user interface did not make it clear that was going to happen when they entered their address, so the recipient might find it an unwelcome surprise.

Rules

We are bound by the following regulations, and you should make yourself familiar with them:

  • GDPR (General Data Protection Regulation)
  • Data Protection Act
  • Freedom of Information Act
  • DfE Data Policies
  • Service teams are responsible for performing the DPIA (Data Protection Impact Assessment) for sensitive data they collect, including its further storage and processing as part of larger projects such as service line analytics.
  • The Data Infrastructure team, however, are responsible for security of that data once it has entered cross-team infrastructure. Therefore, they must provide information to the service teams to enable them to correctly assess the data protection impact of handing data over.

Guidelines

This section is intended as a way for us to collaborate on useful approaches to implementing the principles - please submit PRs to remove guidelines that are outdated, add new ones, or improve existing ones!

Keep things neat and tidy

Keep things fresh: Review database schemas at suitable intervals; ensure we know how each column of each table is being used, keep documentation up to date, spot any accidental duplication or inconsistencies, and archive anything that is no longer being used.
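
One way to support such a review in BigQuery (a sketch only - the project and dataset names are placeholders) is to query the INFORMATION_SCHEMA views, which list every field alongside its description, so undocumented or suspicious-looking columns are easy to spot:

```sql
-- List every field in a dataset with its description; blanks in the
-- description column are candidates for documentation or clean-up.
SELECT
  table_name,
  field_path,
  data_type,
  description
FROM `my-project.my_dataset`.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
ORDER BY table_name, field_path;
```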

Use common IDs: Attempt to find and use existing IDs for problem-domain objects, to simplify analytic cross-referencing between different services, and to facilitate transactional data sharing.

Use common formats: Attempt to re-use common structures for common data objects such as addresses, to simplify transactional data sharing.
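
For example (a sketch - the table and fields are hypothetical, not an agreed DfE standard), agreeing one address structure and re-using it wherever addresses appear lets services exchange and compare that data without field-mapping code:

```sql
-- Hypothetical common address structure, re-used wherever an address is stored.
CREATE TABLE IF NOT EXISTS `my-project.my_dataset.providers` (
  provider_id STRING NOT NULL,
  name        STRING,
  address     STRUCT<
    line1    STRING,
    line2    STRING,
    town     STRING,
    postcode STRING
  >
);
```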

Avoid unnecessary duplication of data: Carefully consider whether to use existing databases, or create new ones, where existing services deal with the same data objects as you.

For instance, a central database of all teachers (and candidate teachers) might not be practical to share between all services for a number of reasons, but if not, we need to think about how to synchronise common teacher data such as names and contact details across services, how to cross-reference them for analytics, and so on.

Track dependencies: As we are working on a system made by bringing many parts together, fragility due to unknown dependencies is a danger - try to make dependencies explicit, and design things with stable interfaces so the components that depend on them don’t need to be updated when things change.

For instance, in shared data repositories such as BigQuery, consider creating views onto your actual data tables that expose simplified versions of the data, organised for consumption rather than updating, thereby decoupling the internal and external representations.
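
A minimal sketch of that pattern (names invented): the view presents a stable, consumption-friendly interface, so the underlying table can be restructured without breaking its consumers:

```sql
-- Consumers query the view; the raw table behind it can change shape
-- as long as the view keeps exposing the same columns.
CREATE OR REPLACE VIEW `my-project.my_dataset.applications_for_analysis` AS
SELECT
  application_id,
  school_urn,
  status,
  DATE(submitted_at) AS submitted_on  -- expose a simple date, hide the raw timestamp
FROM `my-project.my_dataset.applications_raw`;
```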

Openness and Communication

Share data: Where practical, legal, and ethical, publish anything that isn’t sensitive: if the information might be useful, and releasing it won’t cause any harm, then share it.

Don’t share security details: Other than the obvious issue of sensitive personal information, information that exposes the way we protect sensitive information needs to be carefully considered.

For instance, it’s great to publish the fact that we use locked-down service accounts to enforce the principle of least privilege.

But we shouldn’t publish the actual list of accounts, their names, and their permissions: that could help an attacker find weaknesses, and would in any case be a pointless duplication of data that’s already stored in configuration databases.

Communicate: It’s one thing to document your work, but if nobody knows it’s there, they won’t go looking for that documentation. Make sure what your team is working on is talked about in Slack (eg, #weeknotes and other topic-specific channels) so that the people you could work with find out about it.

Collaborate: Try to find people working on projects related to yours; not only so you can re-use their work where possible, but also to make sure your approaches align so you can integrate your systems later.

There are various DfE forums where you can share your findings, including:

  • The BAT/GIT show and tell, held every fortnight
  • The #weeknotes Slack channel
  • “Show the thing”

Document things once they are stable: but don’t slow down things that are in flux with documentation - instead, just make sure people know who’s working on them so they can ask for what they need. (But best of all is to make things that are self-explanatory, so you don’t need to document them at all…)

Dataform contains version-controlled table and field metadata, which are propagated into BigQuery; use that where appropriate.

Positive User Experience

Use the data they’ve already given us: We do not ask users to enter data that we already have (but we should ask them to confirm what we have is still valid, and update it if not).

Use data to help them choose: When the user has to pick something from a fixed list (eg, a school or a course), we should have a canonical list for them to pick from (with, for instance, autocompletion if it’s a long list); and those lists should be complete, correct, and consistent across services.

Know how we use their data: We know how and where data supplied by users will be used, so we can inform them honestly.

Security and Responsibility

Know the landscape: Familiarise yourself with the risk register (FIXME: Link to it when it exists), and ensure that you record any new risks.

Protect sensitive information: Share only the minimum of sensitive / personal information required to meet our objectives.

In particular, only put such things into shared repositories such as BigQuery if there is a proportionate need to do so.

Because such shared repositories will inevitably include more and more sensitive information as our objectives proliferate, carefully restrict access to such repositories so that users can only access sensitive information they have a legitimate need for.

Providing database views to underlying raw data tables, and/or column-level access control, are useful tools to consider for fine-grained access control.
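
For instance (a hypothetical sketch, not a prescribed pattern), a view can expose only the non-sensitive columns of a raw table, so analysts granted access to the view never see the personal fields at all:

```sql
-- The view deliberately selects no name, email, date-of-birth or address
-- columns; access to the raw table stays restricted to those who need it.
CREATE OR REPLACE VIEW `my-project.my_dataset.candidates_nonsensitive` AS
SELECT
  candidate_id,
  application_status,
  region
FROM `my-project.my_dataset.candidates_raw`;
```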

Collect data for a reason: Make sure the data is collected for a specific purpose, to make a decision, to test a hypothesis or to answer a specific question. Don’t collect data just because it’s there.

General Advice

Follow the Government Design Principles. A good example of their application is HESA’s Data Principles.

Checkpoint questions

Whenever a new data store is created, or an existing one extended with new kinds of data, the following questions should be asked and answered, and the answers documented as part of the documentation for that data set:

  • Is the data collected, modified and served with a clear purpose?
  • Is the usage of data responsible and ethical? Would the people or organisations this data refers to be pleased with how we use, store, and share it?
  • Is the data available (capable of being found, as well as accessed once found) to anybody who might possibly be interested in it, even the general public? If not, why?
  • Does this data duplicate / overlap with any other data set we have access to? If so, what is the rationale behind duplicating it?
  • Are you sharing your findings with colleagues at the DfE?
  • Are you using sensitive data? If yes, is it according to the sensitive data guidelines?
  • Is the presence of sensitive data in the data set clearly indicated, so that people finding it know to treat it carefully?
  • Are you familiar with the procedures for reporting breaches or vulnerabilities?
  • Who should have access to this data (and what level of access), what for, and what measures are in place to ensure that nobody gains more access than they should?
  • How long should this data be retained for, and is there a mechanism to destroy it after that point? (A sketch of one such mechanism follows this list.)
  • What agents / systems provide this data?
    • In particular, if this data is a copy of master data stored elsewhere, how will it be updated, and is a mechanism in place to track how stale the data is so problems can be spotted?
  • What agents / systems use this data?
    • What assumptions do they make about it (eg, what is the interface exposed to them), so we can tell if a change to this data’s structure needs to be notified to its users?
  • Who is responsible for correctness and management of this data set going forward?
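
On the retention question above, one mechanism worth considering in BigQuery (a sketch - the table name and the 365-day period are placeholders, and this form assumes a time-partitioned table) is to set a partition expiration, so old data is destroyed automatically rather than relying on a manual clean-up:

```sql
-- Partitions older than the agreed retention period are deleted automatically.
ALTER TABLE `my-project.my_dataset.request_events`
SET OPTIONS (
  partition_expiration_days = 365
);
```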

Further Reading

Next Steps: Things We Would Like To Add

These are areas we would like to choose guidelines for, but haven’t decided what they are yet. If you are working in this area and wish there were guidelines to follow, then you are perfectly placed to join the discussion and help write those guidelines!

  • Cookie compliance management
  • Suggest some naming conventions for BigQuery / Dataform tables
  • Define more specific guidelines around sensitive data
    • For example, define our policy surrounding IDs for problem domain objects such as people - are they sensitive data, or should we freely use them in URLs etc?
  • Where we recommend writing documentation, also recommend where that documentation should go
  • Define more guidelines about visualisation / UX / accessibility
  • Things to help us be consistent in how we present the same kind of thing in different reports might be good
  • Extra background material on data principles to look through for any gems: https://drive.google.com/drive/folders/1rQy0yX-ys8U7Y5bS7rRTp-SY5elaEU
  • Clarify when we need to get a DSA / ATO / do a DPIA for a new project