Page MenuHomePhabricator

When tasks fail permanently, they should log the failure even if `phd.verbose` is not enabled
Closed, ResolvedPublic

Description

This isn't exactly a bug, but permanent failures from the daemons currently call $this->log(...), which only does anything if phd.verbose is enabled. I think this is probably not the right balance of verbosity concerns in the modern daemons, as all other log stuff is spammy/advisory/debug-ish/diagnostic but permanent failures are materially useful in common diagnostic situations.

Instead, we should probably log permanent failures at the default verbosity level.

Existing sources of permanent failure are worth at least a cursory review before we ship this since they're pretty easy to grep for, but I don't anticipate any issues.

Event Timeline

Existing sources of permanent failure are worth at least a cursory review before we ship this since they're pretty easy to grep for, but I don't anticipate any issues.

There are a few cases which are little borderline:

  • The Asana and JIRA bridges can fail if an object has no users with linked accounts. We might want to calm this out of the log (perhaps by treating it as successful no-op bridging) eventually.
  • Some Drydock stuff isn't quite as unambiguous as "bad ID", although I believe it's all similar at first glance.
  • Commit parse tasks abort if a commit is unreachable by the time they run. This is rare, and seems fairly reasonable to log, although we could calm it eventually if it proves too noisy.
  • Search indexing fails permanently fairly quickly, without retrying as hard as other tasks (because the next time an object is updated, it will be indexed again). This could be tweaked/calmed.
  • In silent mode, SMS tasks fail as permanent failures. Certainly fine for now since SMS isn't really hooked up to anything at application level.

Overall, I'm comfortable with just changing the default behavior and then tweaking these cases as necessary. There are a fairly large number of other sites (~70 total) but the remainder all seem totally unambiguous cases of "bad ID / bad config which can definitely never work".

It looks like the current behavior rose out of D16268 / T11309, although it was something of a side effect of those changes.