A related issue here is exemplified in https://discourse.phabricator-community.org/t/importing-libphutil-repository-on-fresh-phabricator-triggers-an-error/2391/, which basically amounts to:
Feb 15 2019
Both tables have this key:
I believe we haven't seen more of this in two years, and "make the worker always exit in less than 2 hours" is a more-or-less reasonable remedy. Getting one extra email every two hours also isn't a huge problem even if we do get this wrong.
As of D19748, I'm not aware of any change of size X that requires more than 8X bytes of memory to parse. This isn't ideal, but it's a fair bit better than the 32X in the original report.
Presumably resolved elsewhere by D19503.
No clue how to reproduce this and we haven't seen anything similar since.
Jan 14 2019
Jan 5 2019
Jan 4 2019
Jan 2 2019
Jul 19 2018
EC2 volume ddata005.phacility.net filled up, causing problems for instances hosted on db005, leading to PHI771. I'll dig back into the CloudWatch monitoring stuff I set up a few months ago and make the db hosts report storage metrics the same way the repo hosts already do.
Apr 20 2018
Apr 12 2018
Apr 9 2018
Mar 6 2018
This log is now available at HEAD of master.
Mar 5 2018
Mar 1 2018
Feb 14 2018
We no longer offer free instances so I don't currently plan to pursue this.
Jan 29 2018
D18962 uses this to implement "Export Data" for large result sets.
Jan 27 2018
Jan 24 2018
Jan 4 2018
Oct 12 2017
Jul 27 2017
Agreed. I haven't experienced the problem since I upgraded, so I think it was resolved by an earlier fix, even if it wasn't the identified fix (which should have already been in my install when I did have the problems). There's nothing that needs to be addressed here.
We aren't going to implement a bin/phd start-missing-daemon command.
Jul 9 2017
Jun 23 2017
I think a minimal reproduction case which is typical of this example is:
Looks to be just the presence of the "?" in the text
"XXX://123456 XXX XXX XXX://123456 XXX XXX"
You and me both. I am super confused.
Well, the stack trace says PhabricatorYoutubeRemarkupRule, so I'm confused about what the issue is.
LOL @chad, I literally reproduced this issue here by trying to paste the above line without the backticks. It refuses to let me comment.
I'm not sure how, without giving you the entire commit message, which I can't do. I think the key would be having a commit with a line (probably the first one) that looks like this: "XXX://123456 XXX XXX XXX? XXX://123456 XXX XXX XXX"
How can we reproduce this issue locally?
It also seems like there may be an issue with the current parsing logic, since the "detected URI" contains spaces, which I don't think are valid in a URI; it should have been detected as two separate URIs with some text in the middle.
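As a rough illustration only (this is a generic whitespace-bounded matcher, not the actual Remarkup rule), a pattern that stops at whitespace does find two separate URIs in the example line, which is the behavior described above:

```
# Sketch only: a generic whitespace-bounded URI pattern, not Remarkup's parser.
# It finds two separate scheme://... tokens in the example text.
echo 'XXX://123456 XXX XXX XXX? XXX://123456 XXX XXX XXX' \
  | grep -oE '[A-Za-z]+://[^[:space:]]+'
# Expected output:
#   XXX://123456
#   XXX://123456
```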
The ambiguous URI is a URI to a company app on macOS. The only part of the URI that matters is the xxx://123456. The rest is just the title of the item referenced by the URI, and this title contains a "?", which, mixed with T12526, may be causing this issue. There may also be a place where you now need to catch exceptions thanks to the URI parsing logic changes. Just guessing here from what I can tell in the code.
Jun 21 2017
I think we should have our crontabs in version control regardless of whether or not we add tmpreaper to them, so I'll make a task for that.
If you want to move forward with that:
This should do the trick. It runs off atime by default. We could just set the time period to several days if we wanted to. Alternatively, if the filenames for extremely long-running jobs are predictable, there's a --protect '<shell_pattern>' argument we could use to avoid cleaning up those files.
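For example (a sketch only; the protect pattern, the three-day period, and the directory are illustrative, not a decided configuration), a daily cron job could run something like:

```
# Sketch only: reap files under /tmp not accessed (atime) in 3+ days, while
# protecting a hypothetical predictable long-running-job filename pattern.
# This would be run from cron, e.g. once a day.
tmpreaper --protect '/tmp/long-job-*' 3d /tmp
```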
Every repo host is equally affected, so I'd like to deploy crontabs as part of the regular deployment process if we use them as part of the approach here. That would require first codifying a handful of custom crontabs, including one on secure which regenerates documentation daily on only one host. This codification should happen anyway eventually, but it's a little bit of work, and wasted effort if we're switching to Chef/Salt/Ansible/etc soon anyway.
Should we just add a crontab entry to clean /tmp to paper this over until we get it fixed for real?
Jun 20 2017
Here's another clue, from the relevant host's error log:
Jun 19 2017
Jun 14 2017
Oh, this doesn't isolate things because they're on different databases, and thus we establish different connections. The daemon insert does not happen inside a transaction.
Jun 12 2017
See also T4124 for another Solaris issue.
Jun 8 2017
Existing sources of permanent failure are worth at least a cursory review before we ship this since they're pretty easy to grep for, but I don't anticipate any issues.
Jun 7 2017
May 26 2017
May 23 2017
Ah, this probably explains what I've observed on our installation too.
May 18 2017
Yeah, some workarounds are:
I believe we see the same issue in our environment, but I didn't think much of it or rule out actual problems with our setup, and just restarted the daemons the first few times it happened.
May 17 2017
I papered over this in the short term by restarting daemons for all instances;
Apr 24 2017
One thing I noticed: all three daemons (Taskmaster, Trigger, PullLocal) are currently listed as "Waiting" on my install, and they also show up in the output of phd status. When this problem occurred, I didn't look at the Daemons app in the web UI, but I did notice that Taskmaster was not listed in the phd status output. I'm guessing that behaviour is not normal, and it perhaps provides a little insight into what's going on here.
I've adjusted my monitoring to just alert me instead of restarting the daemons when there's an issue, so if/when this happens again I can investigate more fully and provide more information. The code from D17397 had definitely landed when I experienced this, as I saw it in the source code when I investigated. I've upgraded to current stable now.
Apr 23 2017
It is intentional that daemons shut down when they aren't doing anything. See T12298. They will be restarted automatically when work becomes ready.
I made a diff (D17780) that adds bin/phd check, which runs the setup check that the web UI runs, writes the result to the console, and exits with an indicative status. This at least allows the circumstance to be detected, and I can fix up the problem with bin/phd restart. This might be good enough: even though using bin/phd start or having Phabricator self-repair through the Overseer would be better, the problem is likely too rare to warrant work on more complex options.
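For example, a monitoring script could key off the exit status (a sketch only; the path and the alert/restart choice are placeholders, not a prescribed setup):

```
# Sketch only: use the exit status of `bin/phd check` (from D17780) to detect
# the stuck state, then alert and/or restart. Paths are placeholders.
if ! /path/to/phabricator/bin/phd check; then
  echo "phd check reported a problem with the daemons" >&2
  /path/to/phabricator/bin/phd restart
fi
```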
Apr 18 2017
Apr 17 2017
Apr 12 2017
Running multiple different versions of Phabricator on a single host is not currently supported. We should probably handle this situation better than we do, and there's no technical reason we can't support it, but the use case is very rare.
Apr 10 2017
I still had this problem, in a fresh install. Had to run
Apr 9 2017
- When you click "Delete File", we currently delete the file in the web process. Since we've supported enormous files and pluggable storage backends for a while, this could take an arbitrarily long amount of time to complete.
- Instead, we want to flag the file as "deleted", hide it in the web UI, and queue up a task in the daemons to actually get rid of the data.