Page MenuHomePhabricator

Record of hibernating daemon was garbage collected
Closed, ResolvedPublic

Description

We've missed two normal Phacility deployments for mundane reasons, so daemon uptime is currently ~20 days, which is unusual because we usually deploy a minimum of once per week.

An instance is reporting that they're seeing the "Phabricator Daemons Are Not Running" setup warning, but the daemon processes appear to be running normally on the actual host. I believe this is because the records were garbage collected (we have a 7-day retention period on daemon records).

Here's the actual table state:

mysql> select * from daemon_log limit 100;
+-----+-----------------------------+-----------------------+-------+-----------------+--------------+-------------+--------------+--------+---------------+--------------+
| id  | daemon                      | host                  | pid   | argv            | explicitArgv | dateCreated | dateModified | status | runningAsUser | daemonID     |
+-----+-----------------------------+-----------------------+-------+-----------------+--------------+-------------+--------------+--------+---------------+--------------+
| 700 | PhabricatorTriggerDaemon    | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1493943117 |   1495024078 | wait   | ubuntu        | 13170:wzdfmm |
| 716 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494424008 |   1494424028 | exit   | ubuntu        | 13170:o3nmex |
| 717 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494498596 |   1494498620 | exit   | ubuntu        | 13170:23fllc |
| 718 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494498603 |   1494498620 | exit   | ubuntu        | 13170:z5riov |
| 719 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494511057 |   1494511092 | exit   | ubuntu        | 13170:cgtowd |
| 720 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494511061 |   1494511093 | exit   | ubuntu        | 13170:nhnq7j |
| 721 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494511065 |   1494511093 | exit   | ubuntu        | 13170:yj3q47 |
| 722 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494514712 |   1494514730 | exit   | ubuntu        | 13170:pmcej2 |
| 723 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494584118 |   1494584135 | exit   | ubuntu        | 13170:sdcc5s |
| 724 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494590494 |   1494590512 | exit   | ubuntu        | 13170:6eipr4 |
| 725 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494591398 |   1494591416 | exit   | ubuntu        | 13170:yncejx |
| 726 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494593012 |   1494593029 | exit   | ubuntu        | 13170:yoir4s |
| 727 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494597530 |   1494597565 | exit   | ubuntu        | 13170:pfyt7k |
| 728 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494597534 |   1494597564 | exit   | ubuntu        | 13170:tfcats |
| 729 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494597538 |   1494597564 | exit   | ubuntu        | 13170:42vbcw |
| 730 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494832677 |   1494832695 | exit   | ubuntu        | 13170:4xmv6d |
| 731 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494841226 |   1494841248 | exit   | ubuntu        | 13170:6tt2cp |
| 732 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494841230 |   1494841248 | exit   | ubuntu        | 13170:iv6oww |
| 733 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494853581 |   1494853601 | exit   | ubuntu        | 13170:e6eol6 |
| 734 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494856113 |   1494856130 | exit   | ubuntu        | 13170:fkcruu |
| 735 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494928972 |   1494929006 | exit   | ubuntu        | 13170:te44vi |
| 736 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494928976 |   1494929007 | exit   | ubuntu        | 13170:bvu6gh |
| 737 | PhabricatorTaskmasterDaemon | repo004.phacility.net | 13170 | ["-l","XXXXXX"] | []           |  1494928980 |   1494929005 | exit   | ubuntu        | 13170:i27spt |
+-----+-----------------------------+-----------------------+-------+-----------------+--------------+-------------+--------------+--------+---------------+--------------+
23 rows in set (0.01 sec)

The GC query does this:

queryfx(
  $conn_w,
  'DELETE FROM %T WHERE dateCreated < %d AND status != %s LIMIT 100',
  $table->getTableName(),
  $this->getGarbageEpoch(),
  PhabricatorDaemonLog::STATUS_RUNNING);

Probably the simplest fix is to use dateModified instead of dateCreated, since I think daemons are required to modify this column via heartbeat every ~6 minutes? This could use some validation.

Event Timeline

I papered over this in the short term by restarting daemons for all instances;

$ phage remote restart --hosts repo001-24 --limit 8 --timeout 450

I believe we see the same issue in our environment, but I didn't think much of it/rule out actual problems with our setup and just restarted the daemons the first few times it's happened.

Yeah, some workarounds are:

  • Restart the daemons once every GC cycle, per whatever the daemon.processes GC is configured for (default: 7 days).
  • Change the daemon.processes GC cycle length to be longer (or indefinite). In the modern codebase, I believe the table fills very, very slowly.
  • Or just "Ignore" the setup warning.

Ah this probably explains what I have observed on our installation too.