Page MenuHomePhabricator

Simplify and correct some challenge TTL lockout code

Authored by epriestley on Dec 14 2018, 3:57 PM.



Depends on D19889. Ref T13222. Some of this logic is either not-quite-right or a little more complicated than it needs to be.

Currently, we TTL TOTP challenges after three timesteps -- once the current code could no longer be used. But we actually have to TTL it after five timesteps -- once the most-future acceptable code could no longer be used. Otherwise, you can enter the most-future code now (perhaps the attacker compromises NTP and skews the server clock back by 75 seconds) and then an attacker can re-use it in three timesteps.

Generally, simplify things a bit and trust TTLs more. This also makes the "wait" dialog friendlier since we can give users an exact number of seconds.

The overall behavior here is still a little odd because we don't actually require you to respond to the challenge you were issued (right now, we check that the response is valid whenever you submit it, not that it's a valid response to the challenge we issued), but that will change in a future diff. This is just moving us generally in the right direction, and doesn't yet lock everything down properly.

Test Plan
  • Added a little snippet to the control caption to list all the valid codes to make this easier:
$key = new PhutilOpaqueEnvelope($config->getFactorSecret());
$valid = array();
foreach ($this->getAllowedTimesteps() as $step) {
  $valid[] = self::getTOTPCode($key, $step);

    'Valid Codes: '.implode(', ', $valid)));
  • Used the most-future code to sign L3.
  • Verified that L4 did not unlock until the code for L3 left the activation window.

Diff Detail

rP Phabricator
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline


Since the dialog can now say "Wait Patiently", it's super easy to legitimately hit 10 bad submissions per hour by clicking "Wait" over and over again to see how much longer you need to wait, especially now that this change gives you an exact "17 seconds" countdown.

Currently, it's not trivial to detect that the user is submitting a "wait" (i.e., their form submission has no new responses) vs a real response. This might become easier in the future. I think the ideal fix here is to only penalize them for this action if they're actually submitting new responses, but currently we cost them an action any time they submit the form.

If detecting "you're actually giving an answer" gets easier in the future I'll probably do that and reduce this a bit, but 100 challenges / hour vs 10 challenges / hour has no real security implication and this makes testing a lot easier.


This bit was somewhat untrue in the overlapping-windows case (we want a challenge at timestep 3 to lock out a new challenge being issued at timestep 6, since codes 4 and 5 are valid responses to both). Just use the TTL, which is simpler and lets us tell users how long they need to wait in exact seconds instead of coarse 30-second blocks.

amckinley added inline comments.

Anecdotally, I've definitely failed MFA challenges for Google properties when the authenticator app changed the key within a second of my entering the code. Does the human-visible Google Authenticator countdown correspond to multiple "timesteps" in TOTP RFC-ese? I should probably just RTFM, but what does this change do in human terms to the length during which an MFA code is valid?

This revision is now accepted and ready to land.Dec 17 2018, 11:47 PM

Does the human-visible Google Authenticator countdown correspond to multiple "timesteps" in TOTP RFC-ese?

Yeah -- in Google Authenticator, the little pie-shaped countdown timer is 30 seconds and the code refreshes once it finishes counting down. This is one timestep.

We've always accepted the previous two timesteps (T-2, T-1), the current timestep (T+0), and the next two timesteps (T+1, T+2). RFC 6238 more or less recommends this, first talking about transmission delays and then about clock sync:

The validation system should compare OTPs not only with
the receiving timestamp but also the past timestamps that are within
the transmission delay.
We RECOMMEND that at most one
time step is allowed as the network delay.
Because of possible clock drifts between a client and a validation
server, we RECOMMEND that the validator be set with a specific limit
to the number of time steps a prover can be "out of synch" before
being rejected.
This limit can be set both forward and backward from the calculated
time step on receipt of the OTP value. If the time step is
30 seconds as recommended, and the validator is set to only accept
two time steps backward, then the maximum elapsed time drift would be
around 89 seconds, i.e., 29 seconds in the calculated time step and
60 seconds for two backward time steps.

So we're pretty much doing what the RFC recommends, except it suggests only going backwards. We have no reason to believe that the server is ahead of the client (vs behind the client) and at least in ~2013 I think we ran into a lot of clock skew issues with arc install-certificate (e.g. in T3025#31011 perhaps) which suggested that installing on a server with a "roughly" accurate clock and no NTP was common.

We could reasonably reduce this to 1 timestep in either direction, or 0. If Google rejects immediately after the code vanishes, it presumably uses 0.

As of this diff, you can do 100 attempts per hour. Doing some napkin math: there are 1,000,000 possible codes and 5 of them will be accepted (with the -2 ... +2 window), so you have a ~1/200,000 chance of hitting a code with a given attempt or ~1/2,000 with all the attempts. So the chance you don't hit a valid code with all attempts is ~1,999/2,000. That lines up with a ~50% chance to hit a valid code is about 60 days if you max out your attempts at every timestep. We can make this ~600 days by fixing the form submission logic to drop the rejections back to 10 (which I expect to do later in this series), and ~3,000 days by narrowing the window to only the current timestep. The cost is some amount of user confusion/failure around slow typing or slow copy/pasting on mobile.

I think a better general defense here is probably putting a threshold system into the rate limiting code so you're notified after one failure, and doing login alerts when new sessions are established -- a lot of the math around TOTP ends up feeling a little uncomfortable to me because the response token is so short.

Reducing this window generally makes our "wait for a bunch of seconds for the challenge to cycle" behavior more user-friendly, so I'm sort of inclined to at least try dropping it from 2 to 1. I think it would be reasonable to drop it to 0 and just assume everyone's clocks are in reasonable shape, but at least in 2013 this article suggested that a lot of phones were only in the general realm of accurate.

So maybe we:

  • Fix the rate limiter thing later in this series.
  • Drop the window to 1 for now.
  • If no one complains for a while, drop the window to 0.
  • Do login and MFA failure alerts at some point.

Seem reasonable?

So maybe we:

That all sounds reasonable. I don't even really have a problem with the current (-2, +2) implementation; I mostly just wanted to make sure I understood what this change was doing.

For archaeologists, D19898 tightens those limits as described.

This revision was automatically updated to reflect the committed changes.