Page MenuHomePhabricator

High-latency file storage can potentially break user profile pictures horribly
Closed, ResolvedPublic

Description

NOTE: This is a bit of an odd report, and I was only able to trip it by requiring custom code to some degree, although the error itself could manifest more generally in any storage subsystem depending on the network/host environment. It is highly unlikely to occur in most cases, but should it happen, I don't see any recovery method beyond fixing the database.

The story: When Phabricator is first initialized, there are no files stored anywhere. However, aside from actually uploading files, some interactions will generate them on demand. One of these "magical" file auto-uploads is the default set of Pokemon pictures, for default user profiles.

When you visit https://phab.domain.org/people/picture/1/ for the first time to edit your profile picture (say, right after you initialize Phabricator and have set up the first admin account), if the default set of 9 profile pictures has not yet been initialized, the HTTP handler which is serving this request takes the time to generate them, and upload them to the configured file storage. (If you look at the list of all files in another tab as this is happening, you can see the files being created.)

If the file storage subsystem has a particularly high latency, then the upload of these 9 small pictures can, under certain network conditions, take more than 30 seconds, which is the default HTTP handler timeout before your request goes 'boom'. If this occurs, any attempt to further edit your profile pictures will fail, due to Phabricator attempting to insert a row with a duplicate unique key.

Specific details: In my particular case, I am using my libphutil-backblaze library to upload my data to their storage system. This library requires a few roundtrip HTTP requests per upload call, so it has a bit of a higher latency than something like S3. It seemed like a few uploads took a particularly long time at first when I was testing something, which caused this break.

Unfortunately when trying to track this down I reinitialized my database and figured out what was wrong with another attempt, which succeeded. I also lost the Stack trace due to my automation, although I'll try to reproduce it shortly.

However, the following should work I think (I will follow up with a confirmation)

  1. Apply this patch:
diff --git a/src/applications/files/engine/PhabricatorLocalDiskFileStorageEngine.php b/src/applications/files/engine/PhabricatorLocalDiskFileStorageEngine.php
index d9f7caf..47e06f4 100644
--- a/src/applications/files/engine/PhabricatorLocalDiskFileStorageEngine.php
+++ b/src/applications/files/engine/PhabricatorLocalDiskFileStorageEngine.php
@@ -59,6 +59,7 @@ final class PhabricatorLocalDiskFileStorageEngine
     }

     AphrontWriteGuard::willWrite();
+    sleep(45);
     Filesystem::writeFile($root.'/'.$name, $data);

     return $name;
  1. Initialize a brand new Phabricator installation that is very shiny.
  1. Vist https://phab.domain.org/people/picture/1/, and wait for the creation of the default set of Pokemon profile pictures to time out the request.
  1. Attempt to reload the page, and notice it is broken (due to an invalid attempt to create a row with a duplicate unique key, I believe).

I will follow up with an exact stack trace shortly after I cajole another occurrence of this out while my automation runs, I just wanted to jot all this down first.

Event Timeline

What I found interesting about this bug is, I've had some obvious high latency with this storage engine - but incremental storage uploads for large files work fantastic in every use case I've tried, e.g. cancelling requests and all kinds of stuff. I guess this 'magical' default profile picture creation can't use the 'resume upload' mechanics?

T8124 is probably adjacent.

Yeah, T8124 is definitely the error user-facing I saw when this occurred, except there's no possible way to edit your profile pic at that point as the exception takes up the whole page.

T10907 is also sort of adjacent.

T10907 is also sort of adjacent.

wow, actual ghosts, being busted

spooky

I can't really reproduce this, exactly, I think (?), but I suspect I can fix this and T8124 by just making this code simpler.

(I can only reproduce a similar condition by manually deleting a row from phabricator_file while leaving a row in phabricator_transformedfile pointing at it.)

D16271 probably either fixes or moots this.

iiam

Just as a note: I had this happen again on another very low-power instance (same file storage backend, encryption enabled etc), but with the latest HEAD, simply reloading with Ctrl+R after the webserver timed out the request made it continue and work perfectly. So yes, this is fixed it seems.