Page MenuHomePhabricator

Use a bloom filter (or low-resolution hashing?) to reserve unique identifiers in a GDPR-compliant way with reasonable technical foundations
Open, WishlistPublic

Description

We generally want to avoid ever reissuing unique identifiers (usernames, instance names, email addresses, etc) because we run into a million kinds of different security and user confusion mess if we do.

Retaining these tokens is arguably at odds with the GDPR, which requires deletion of personally identifiable data. I suspect this won't actually hold up in court if it's ever tested -- probably all reasonable use, including ours, will be exempted as having a "lawful basis" (here, probably fraud/abuse prevention and security concerns).

How does GitHub delete your email address from all the repositories where it appears? How does Gmail delete your email address from all the inboxes it hosts? If I register "evanpriestley.com", can I compel the registrar to delete all evidence that I ever registered it because it's personally identifiable? Can I compel anyone running a Bitcoin node to destroy the blockchain because it contains the slanderous claim "epriestley likes zucchini"? Can I compel Twitch to unban my IP address after I run viewbots from it under the GDPR? I think the answer to all these questions is almost certainly "no", or "yes, the EU claims; in practice: no, of course not".

I generally suspect the GDPR will ultimately be "Cookie Law v2" -- a toothless piece of compliance busywork -- but we'll see. (If it actually has very sharp teeth, that's probably good for us, since we're generally very far on the extreme of data protection anyway.)

However, unique identifier reservation is a case where we could be more extreme. Generally, we'll retain the identifier in the original form and prevent re-reservation with a UNIQUE KEY.

A step beyond this is to hash it, but for all identifiers we reasonably want to protect, a normal hash (e.g., SHA256) is effectively recoverable: the input domain of, e.g., reasonable instance names is very small, and you can reasonably recover all the identifiers by exhaustively hashing everything. I'm consequently hesitant to pursue this approach because it lacks any real technical foundation.

We could possibly give this approach a technical foundation by using a bloom filter -- or just dramatically lowering the resolution of the hash by reducing the number of bits we retain. e.g., if we use SHA256 but throw away all but the first 20 bits, we'll get piles and piles of collisions and some level of real deniability on the hash being reversible. (Low-resolution hashing probably gives us better tools to work with -- and is simpler -- than an actual bloom filter.)

So the reservation algorithm would look like this:

  • Poorly hash the username and check the "tombstones" table. If it collides, reject it.
  • If it does not collide, continue normally.

The deletion algorithm would look like this:

  • Poorly hash the username and add it to the "tombstones" table, with the originating user PHID.
  • Remove the username from the user table.

The undeletion algorithm would look like this:

  • Ask for a new username.
  • Check for a tombstone other than the one with the user PHID.
  • Remove the tombstone for the user PHID.
  • Continue normally.

One cost is that every disabled user blocks tons of other valid usernames, but this is necessary for the hash to be plausibly irreversible. Another cost is that un-disabling users is a pain, but this is a requirement if we're plausibly destroying the data.

We could maybe make tombstoning identifiers like this optional.

I'd ideally like to see the EU actually assess damages against some company for, e.g., reserving usernames before spending time on this, since it strains the bounds of plausibility to me that individual concerns will be stronger than lawful basis concerns in the context of identifier reservation, but this is at least a reasonably sound technical approach which we could pursue without much disruption to anything.