Obscure First: Pseudonymising Confidential Information Before You Analyse It
In behavioural research the open-ends and respondent records carry the real confidentiality risk. The best practice is to pseudonymise before analysis, not after: obscure identities at the point of capture, so the sensitive form is never the form you store, share, or feed to a model, and keep the key that re-identifies people separate from the data itself.
The risk hides in the free text
Structured fields are the obvious place identity lives — a name column, an email, an account. They are also the easy case. The harder, and often more sensitive, risk sits in the open-ended responses: the verbatims where a respondent names their employer, mentions a person, or describes something that identifies them or a third party in plain prose.
Because that material is unstructured, it is the part teams most often forget to protect — and the part most likely to end up in an export, a shared deck, or the prompt of an AI assistant. A good confidentiality practice treats free text as the primary risk surface, not an afterthought.
Obscure at capture, not at presentation
The common mistake is to clean data on the way to the screen while keeping a raw copy in a working store underneath. That leaves a shadow of the real identities sitting in a cache, one query away from exposure. The stronger practice is to obscure earlier: pseudonymise confidential values before the data is written to the store the analysis runs against, so the protected form is the stored form.
When you do this, the entire working surface of your analysis (every record that can be searched, every response that can be read) is already pseudonymised by the time it exists. Protection at rest, not merely at the point of presentation, is what makes the guarantee hold up under scrutiny rather than only in a demo.
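As a minimal sketch of obscuring at capture, the Python below pseudonymises identifying fields before a record is written, so the working store never holds the raw values. The field names, the `capture` helper, and the in-memory list standing in for the store are all assumptions made for illustration, not a prescribed implementation.

```python
# Hypothetical field names; a real schema will differ.
IDENTIFYING_FIELDS = {"name", "email"}


def make_tokeniser():
    """Return a tokenise function plus the map it builds (real value -> token)."""
    mapping = {}

    def tokenise(value):
        # The same input always yields the same token, so records stay linkable.
        if value not in mapping:
            mapping[value] = f"token_{len(mapping) + 1}"
        return mapping[value]

    return tokenise, mapping


def capture(record, tokenise, working_store):
    """Pseudonymise identifying fields, then write; the raw record is never stored."""
    protected = {
        field: tokenise(value) if field in IDENTIFYING_FIELDS else value
        for field, value in record.items()
    }
    working_store.append(protected)


tokenise, key_map = make_tokeniser()
working_store = []  # stands in for the store the analysis runs against

capture({"name": "Ada Example", "email": "ada@example.org", "score": 7}, tokenise, working_store)
capture({"name": "Ada Example", "email": "ada@example.org", "score": 9}, tokenise, working_store)
```

Both stored records carry the same token for the same person, and nothing in `working_store` contains the original name or email. In this sketch `key_map` lives in the same process purely for brevity.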
Keep the key separate from the data
Pseudonymisation is only as good as the separation between the tokens and the map that reverses them. If the key that turns “token 47” back into a real person travels alongside the data — or is held by whoever you sent the data to — you have not really protected anything; you have added a step.
The practice that matters is to hold the re-identification map in a single, encrypted store under your own control, readable only where the translation genuinely has to happen, and never shipped to a third party. Then even if the analysis data were exposed, what leaked would be opaque tokens, not a directory of the people behind them.
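One way to picture that separation is a small vault object that owns both the secret and the reverse map, while everything handed to analysis carries tokens only. The sketch below is an assumption-laden illustration: `ReidentificationVault` is a hypothetical name, tokens are derived with a keyed HMAC from Python's standard library, and encryption of the vault at rest is deliberately left out of scope.

```python
import hashlib
import hmac
import secrets


class ReidentificationVault:
    """Hypothetical key store: the only place that can map tokens back to people."""

    def __init__(self):
        self._secret = secrets.token_bytes(32)  # never leaves the vault
        self._reverse = {}                      # token -> real value

    def tokenise(self, value):
        # Keyed hash: stable for the same input, not derivable without the secret.
        digest = hmac.new(self._secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = "tok_" + digest[:12]  # truncated for readability in this sketch
        self._reverse[token] = value
        return token

    def reidentify(self, token):
        # Only call this inside your own environment, when you need to act.
        return self._reverse[token]


vault = ReidentificationVault()
token = vault.tokenise("Ada Example")

# This is the only shape of data that should be shared or sent to a third party.
shareable = [{"respondent": token, "answer": "Checkout felt too slow"}]
```

Whoever receives `shareable` sees a stable token and the response text, but has neither the secret nor the reverse map, so the translation back to a person can only happen where the vault lives.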
Layered detection beats one clever trick
No single technique catches everything. Structured fields can be recognised by what they are. Contact patterns — emails, phone numbers, government identifiers — can be caught by shape wherever they appear, even buried in a sentence. And a continuously growing registry of the real entities in your data — the customers, projects, and people that actually exist — can be swept across all free text so that once a name is known anywhere, it is obscured everywhere it appears.
Stacking those layers is what turns “we tried to redact PII” into a defensible posture. Be honest about the limit, too: automated detection is probabilistic, not infallible.¹ That is exactly why the guarantee should rest on the separated key — so a missed string is a bounded fragment, never the means to re-identify anyone at scale.
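The layering can be sketched in a few lines of Python: shape-based patterns run first, then a sweep of known entity names across the free text. The regular expressions here are deliberately simple assumptions that will miss real-world variants, and `obscure_text`, `make_tokeniser`, and the sample registry are hypothetical names introduced for illustration.

```python
import re

# Illustrative shapes only; production patterns need far more care.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def make_tokeniser():
    mapping = {}

    def tokenise(value):
        key = value.strip().lower()  # the same entity maps to the same token
        if key not in mapping:
            mapping[key] = f"[ID_{len(mapping) + 1}]"
        return mapping[key]

    return tokenise


def obscure_text(text, registry, tokenise):
    # Layer 1: contact patterns caught by shape, wherever they appear.
    text = EMAIL_RE.sub(lambda m: tokenise(m.group(0)), text)
    text = PHONE_RE.sub(lambda m: tokenise(m.group(0)), text)
    # Layer 2: sweep known entities, longest first so full names win over parts.
    for name in sorted(registry, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda m, n=name: tokenise(n), text)
    return text


tokenise = make_tokeniser()
registry = {"Ada Example", "Acme Ltd"}  # grows as new entities are discovered
verbatim = (
    "Email ada@example.org or ring +44 20 7946 0958. "
    "Ada Example at Acme Ltd agreed; ada example said so twice."
)
cleaned = obscure_text(verbatim, registry, tokenise)
```

Anything neither layer recognises passes through untouched, which is the probabilistic limit described above.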
You don’t need identities to get the insight
The reassuring part is that obscuring identities costs you almost nothing analytically. An assistant — or an analyst — does not need to know a respondent’s name to tell you what that respondent thought. Grouping, counting, contrasting, theme-finding all work just as well on stable tokens as on real names, provided the same person maps to the same token everywhere.
This is what makes the practice practical rather than precious. Obscure first, analyse on tokens, and translate back to real records only inside your own environment when you actually need to act. You keep the speed and depth of the analysis, and your respondents keep the confidentiality they were promised.
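To illustrate that grouping and counting need only stable tokens, the sketch below counts theme mentions and distinct respondents per theme without ever touching a real name. The token values and theme labels are invented sample data.

```python
from collections import Counter, defaultdict

# Hypothetical coded responses; respondents appear only as tokens.
responses = [
    {"respondent": "tok_3f9a", "theme": "pricing"},
    {"respondent": "tok_3f9a", "theme": "onboarding"},
    {"respondent": "tok_b27c", "theme": "pricing"},
    {"respondent": "tok_b27c", "theme": "pricing"},
    {"respondent": "tok_e410", "theme": "support"},
]

# How often each theme is mentioned overall.
mentions_per_theme = Counter(r["theme"] for r in responses)

# How many distinct people raised each theme; this relies on token stability.
respondents_per_theme = defaultdict(set)
for r in responses:
    respondents_per_theme[r["theme"]].add(r["respondent"])

reach = {theme: len(people) for theme, people in respondents_per_theme.items()}
```

Here pricing is mentioned three times but by two distinct respondents, a distinction that holds only because the same person maps to the same token everywhere.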
| Decision | Obscure last (common) | Obscure first (best practice) |
|---|---|---|
| When identities are removed | On the way to the screen / export | At capture, before anything is stored |
| What sits in the working store | A raw copy of the real data | Tokens only — no raw shadow copy |
| Where the re-identification key lives | Alongside the data, or with a third party | Separate, encrypted, under your control |
| Consequence of a missed string | Potentially a full record exposed | A bounded fragment — never re-identification at scale |
| Effect on the insight | None | None — analysis runs on stable tokens |
