May 31, 2026 · 7 min read
The most sensitive data in any study is the data you didn’t plan to expose.

Obscure First: Pseudonymising Confidential Information Before You Analyse It

In behavioural research the open-ends and respondent records carry the real confidentiality risk. The best practice is to obscure identities at the point of capture — so the sensitive form is never the form you store, share, or feed to a model.
The practice

Pseudonymise confidential information before analysis, not after. Obscure identities at capture so the sensitive form is never the stored or shared form — and keep the key that re-identifies people separate from the data itself.


The risk hides in the free text

Structured fields are the obvious place identity lives — a name column, an email, an account. They are also the easy case. The harder, and often more sensitive, risk sits in the open-ended responses: the verbatims where a respondent names their employer, mentions a person, or describes something that identifies them or a third party in plain prose.

Because that material is unstructured, it is the part teams most often forget to protect — and the part most likely to end up in an export, a shared deck, or the prompt of an AI assistant. A good confidentiality practice treats free text as the primary risk surface, not an afterthought.

Obscure at capture, not at presentation

The common mistake is to clean data on the way to the screen while keeping a raw copy in a working store underneath. That leaves a shadow of the real identities sitting in a cache, one query away from exposure. The stronger practice is to obscure earlier: pseudonymise confidential values before the data is written to the store the analysis runs against, so the protected form is the stored form.

When you do this, the entire working surface of your analysis (every record that can be searched, every response that can be read) is already pseudonymised by the time it exists. Protecting the stored form, not merely the displayed form, is what makes the guarantee hold up under scrutiny rather than only in a demo.
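As a minimal sketch of obscuring at capture, the Python below swaps identifying fields for tokens before a record is appended to the working store. The names `TokenVault`, `capture`, `IDENTIFYING_FIELDS`, and the in-memory list standing in for the store are all hypothetical and chosen for illustration. The sketch covers structured fields only; free text would need the detection layers described later.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for the re-identification map."""

    def __init__(self):
        self._token_by_value = {}

    def token_for(self, value: str) -> str:
        # Normalise so the same person always maps to the same token.
        key = value.strip().lower()
        if key not in self._token_by_value:
            self._token_by_value[key] = f"tok_{secrets.token_hex(4)}"
        return self._token_by_value[key]


# Assumption: these are the structured fields that carry identity.
IDENTIFYING_FIELDS = ("name", "email")


def capture(record: dict, vault: TokenVault, working_store: list) -> None:
    """Pseudonymise a record, then store it. The raw form is never written."""
    protected = dict(record)
    for field in IDENTIFYING_FIELDS:
        if field in protected:
            protected[field] = vault.token_for(protected[field])
    working_store.append(protected)


vault = TokenVault()
working_store = []
capture(
    {"name": "Ada Example", "email": "ada@example.com", "answer": "Too slow"},
    vault,
    working_store,
)
```

The point of the design is that `capture` is the only path into the store, so there is no raw copy for a later query to reach.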

Keep the key separate from the data

Pseudonymisation is only as good as the separation between the tokens and the map that reverses them. If the key that turns “token 47” back into a real person travels alongside the data — or is held by whoever you sent the data to — you have not really protected anything; you have added a step.

The practice that matters is to hold the re-identification map in a single, encrypted store under your own control, readable only where the translation genuinely has to happen, and never shipped to a third party. Then even if the analysis data were exposed, what leaked would be pseudonymous tokens, not a directory of the people behind them.
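One way to picture the separation is the sketch below. It assumes tokens are derived with a keyed hash (HMAC-SHA256 from Python's standard library), which is an illustrative choice rather than a method the article prescribes. The `ReidentificationMap` class, the demo key, and `export_payload` are hypothetical, and encrypting the map at rest is left out for brevity.

```python
import hashlib
import hmac


def make_token(value: str, secret_key: bytes) -> str:
    """Derive a stable token from a value with a keyed hash (HMAC-SHA256)."""
    normalised = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()
    return f"tok_{digest[:12]}"


class ReidentificationMap:
    """Hypothetical key store. It lives apart from the analysis data."""

    def __init__(self, secret_key: bytes):
        self._secret_key = secret_key
        self._real_by_token = {}

    def pseudonymise(self, value: str) -> str:
        token = make_token(value, self._secret_key)
        # Keep the first spelling seen for this token.
        self._real_by_token.setdefault(token, value)
        return token

    def reidentify(self, token: str) -> str:
        # Only code running inside your own environment should call this.
        return self._real_by_token[token]


# Demo key only. A real key would come from a secrets manager.
key_store = ReidentificationMap(secret_key=b"demo-key-do-not-use")

analysis_rows = [
    {"respondent": key_store.pseudonymise("Ada Example"), "answer": "Too slow"},
    {"respondent": key_store.pseudonymise("ada example "), "answer": "Still slow"},
]

# What leaves your environment: tokens and answers, never the map or the key.
export_payload = list(analysis_rows)
```

Because `export_payload` is built from `analysis_rows` alone, a recipient holds tokens but neither the secret key nor the map needed to reverse them.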

Layered detection beats one clever trick

No single technique catches everything. Structured fields can be recognised by what they are. Contact patterns — emails, phone numbers, government identifiers — can be caught by shape wherever they appear, even buried in a sentence. And a continuously growing registry of the real entities in your data — the customers, projects, and people that actually exist — can be swept across all free text so that once a name is known anywhere, it is obscured everywhere it appears.

Stacking those layers is what turns “we tried to redact PII” into a defensible posture. Be honest about the limit, too: automated detection is probabilistic, not infallible.¹ That is exactly why the guarantee should rest on the separated key — so a missed string is a bounded fragment, never the means to re-identify anyone at scale.
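The layering might look roughly like the sketch below: a shape-based pass for contact patterns, followed by a sweep of known entities across the free text. The regular expressions are deliberately simple and will miss real-world variants. The registry contents, placeholder tokens, and function names are hypothetical.

```python
import re

# Layer 1: contact patterns caught by shape (simplified, illustrative patterns).
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_PATTERN = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def obscure_by_shape(text: str) -> str:
    text = EMAIL_PATTERN.sub("[EMAIL]", text)
    return PHONE_PATTERN.sub("[PHONE]", text)


def obscure_known_entities(text: str, registry: dict) -> str:
    """Layer 2: sweep every known entity name across the text."""
    # Longest names first, so "Acme Corp" is replaced before a bare "Acme".
    for name in sorted(registry, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        text = pattern.sub(registry[name], text)
    return text


def obscure_free_text(text: str, registry: dict) -> str:
    return obscure_known_entities(obscure_by_shape(text), registry)


# Hypothetical registry: real entity name -> stable token.
registry = {"Dana Whitfield": "[PERSON_1]", "Acme Corp": "[ORG_1]"}

verbatim = (
    "Call Dana Whitfield at Acme Corp on +44 20 7946 0958 "
    "or dana@acme.example."
)
obscured = obscure_free_text(verbatim, registry)
print(obscured)
```

Neither layer is complete on its own: the patterns know nothing about names, and the registry only catches entities it has already seen. That gap is why the separated key, not the detection, carries the guarantee.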

You don’t need identities to get the insight

The reassuring part is that obscuring identities costs you almost nothing analytically. An assistant — or an analyst — does not need to know a respondent’s name to tell you what that respondent thought. Grouping, counting, contrasting, theme-finding all work just as well on stable tokens as on real names, provided the same person maps to the same token everywhere.

This is what makes the practice practical rather than precious. Obscure first, analyse on tokens, and translate back to real records only inside your own environment when you actually need to act. You keep the speed and depth of the analysis, and your respondents keep the confidentiality they were promised.
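To illustrate why stable tokens are enough for analysis, the sketch below counts distinct respondents per theme using tokens alone. The rows, token values, and theme labels are made-up sample data, and `respondents_per_theme` is a hypothetical helper.

```python
from collections import defaultdict

# Sample data: each response carries a respondent token, never a name.
responses = [
    {"respondent": "tok_a1", "theme": "pricing"},
    {"respondent": "tok_b2", "theme": "onboarding"},
    {"respondent": "tok_a1", "theme": "pricing"},
    {"respondent": "tok_c3", "theme": "pricing"},
]


def respondents_per_theme(rows: list) -> dict:
    """Count distinct respondents per theme, deduplicating on the token."""
    tokens_by_theme = defaultdict(set)
    for row in rows:
        tokens_by_theme[row["theme"]].add(row["respondent"])
    return {theme: len(tokens) for theme, tokens in tokens_by_theme.items()}


theme_counts = respondents_per_theme(responses)
print(theme_counts)
```

The deduplication works only because `tok_a1` refers to the same person in both rows, which is the stability requirement described above.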

| Decision | Obscure last (common) | Obscure first (best practice) |
| --- | --- | --- |
| When identities are removed | On the way to the screen / export | At capture, before anything is stored |
| What sits in the working store | A raw copy of the real data | Tokens only, with no raw shadow copy |
| Where the re-identification key lives | Alongside the data, or with a third party | Separate, encrypted, under your control |
| Consequence of a missed string | Potentially a full record exposed | A bounded fragment, never re-identification at scale |
| Effect on the insight | None | None; analysis runs on stable tokens |

Two ways to handle confidential information in analysis, and why the order of operations matters.

¹ Pseudonymisation reduces identifiability but does not render data anonymous; under the GDPR and UK GDPR, pseudonymised data remains personal data and is processed accordingly. Detection of confidential values in free text is probabilistic and layered; no automated method identifies every possible identifier with complete accuracy. Statements describe practices as designed to operate under normal conditions and are not warranties.