
Facebook Updated Again? (Umbrella UI Feedback Task)
Closed, ResolvedPublic

Description

(It sounds like Facebook has updated to HEAD or somewhere near it, or something. This is an umbrella task for feedback about UI changes.)

In the last four months we've built a lot of new stuff, including per-object privacy and repository hosting. Many of these new features involve UI changes, reorganizations, or new UI elements.

We developed and deployed most of these features over the course of the last four months, sought feedback as we built them, incorporated that feedback, and ended up with designs that all the installs giving us feedback were happy with. Facebook hasn't had anyone staying in touch with the upstream during this time. After the debacle last summer we met with Facebook and created the "FacebookPOC" list to solve this issue of Facebook being surprised by updates. Over the last four months, we've CC'd the list on essentially every relevant-to-Facebook or otherwise notable change we've made in the upstream.

This install (secure.phabricator.com) also stays near HEAD, so you can poke around and check out what we're working on at any time (almost all content has been public -- no login required -- since we built that feature). Additionally, we publish monthly changelogs to the Facebook page, phabricator.org site, and twitter feed.

Basically, the best time to provide UI feedback about features is when we are building them, and we try to make it very easy to keep an eye on what we're building. That ship has sailed on most of the changes since September, in that we consider them stable and shipped, and we can no longer be as flexible about making UI and functional changes as we could have been if feedback had been more timely.

That said, we're still open to feedback. Just don't expect miracles if you're complaining about stuff we built four months ago and which every other install has been happy with since then.

(Last time, a small amount of the feedback was legitimate bug reports rather than UI feedback. We're thrilled to receive these no matter how late they are.)

So far:

  • New vertical display of reviewers (implemented for T1279) "feels more sparse". See T1279 for discussion of this feature.
  • Request to reduce/hide policy information on objects (like "All Users" in the subheader of many revisions). See T603 for discussion of this feature.

Event Timeline

epriestley claimed this task.
epriestley raised the priority of this task to Normal.
epriestley updated the task description.
epriestley added projects: Facebook, Design.
epriestley added subscribers: epriestley, asukhachev, Unknown Object (MLST) and 3 others.
epriestley added a subscriber: schrockn.

Preemptive +@schrockn

It looks like the 'z' shortcut to bring up the comment composer doesn't work anymore. Is this intentional?

From what I can tell, Facebook's install updated from some time in late August / early September to some version prior to Oct 19 -- so what you're running now is about 4 months old instead of 5 months old. You can check this install for HEAD, where "z" should be working. The "z" feature was broken briefly before Oct 19 and fixed in D7366.

I'm not sure what Facebook's upgrade plan is; all our knowledge of what's going on comes from people showing up in IRC with problems which couldn't have existed in early September.

Basically, if you're seeing issues with anything, see if you can reproduce them on this install. If stuff works on this install, it will probably start working once Facebook's install moves further forward. When it gets near HEAD, the transaction/comment style in Differential will change to look like this install's (we made that change a week or so ago); until it does, things are at least two weeks behind.

The 'z' shortcut is fixed -- we moved to the 10/19 version since there is a big migration we need to take special care of, and I cherry-picked the fix for 'z' yesterday.
The biggest UI complaint so far is the big white header in Diffusion -- it seems to waste space on just a few menu items which are seldom used.

This thing, you mean? I'm not sure what it looked like on 10/19.

Screen_Shot_2014-02-21_at_11.43.39_AM.png (28 KB)

Ah. T4426 has some discussion about building accordion menus, which would limit the size of that menu.

We've also been rolling out action buttons into the header itself, they just haven't been deployed everywhere.

Hey Evan and Co. -- I think it's better to share our plan for upgrading Phabricator with you, so you won't see unexpected comments from Facebook. We want to upgrade to the January 2014 version this time, but there are a couple of big data migrations we need to take extra care with. So the whole process will have five pushes:

  • Move production to just before D7139
  • Do the migration part of D7139 live. The original script takes too long for us, due to the size of our inline comment table and the location of the DB tier, so we are changing the script a little to batch the database access and speed it up (we can do it in 3 hours now); see the sketch after this list. To do it live, the inline comment storage code is changed a little so it does double writing to both tables whenever inline comments are added, edited, or deleted.
  • Remove the double-writing code from the last step, and apply all the code changes up to D7513
  • Do the migration part of D7513 live. Again, we do similar things: batch the database access and do double writing.
  • Remove the double-writing code from the last step, and apply all the code changes up to some point in January 2014 (somewhere around January 13, as I remember)
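A minimal sketch of the "batch the database access" idea from the second push, assuming the old inline comment table is differential_inlinecomment with a standard auto-increment id column (both names are assumptions for the example, not necessarily the exact schema D7139 touches):

  -- Walk the old table in fixed-size ID windows instead of selecting it all
  -- at once; each window is migrated, then the cursor advances.
  SELECT *
    FROM differential_inlinecomment
    WHERE id > @last_migrated_id
    ORDER BY id ASC
    LIMIT 1000;

  -- Migrate the returned rows into the new storage, set @last_migrated_id to
  -- the highest id in the batch, and repeat until the query returns no rows.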

Those approaches sound reasonable. Two suggestions:

To do it live, the inline comment storage code is changed a little so it does double writing to both tables whenever inline comments are added, edited, or deleted.

Double writing this may be complicated. An alternative might be to just disable inline comments for a few hours? You could also do something like this:

  • With production Phabricator running the old code, put the new code somewhere and run the migration. Save the largest ID you migrate.
  • Copy the migrated table to the production instance.
  • Upgrade Phabricator and migrate just the new inlines which didn't get caught by the migration (i.e., those with larger IDs than the saved ID in step 1).

That should give you a minute or two of downtime instead, without requiring any double writes.
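As a rough sketch of that ID-based catch-up, again assuming an old table named differential_inlinecomment (a placeholder name for this example):

  -- Step 1: record the high-water mark when the offline migration runs;
  -- anything created after this still needs a catch-up pass.
  SELECT MAX(id) FROM differential_inlinecomment;

  -- Step 3: after copying the migrated table in and upgrading, migrate only
  -- the stragglers created since then (assuming the saved value was 118200).
  SELECT *
    FROM differential_inlinecomment
    WHERE id > 118200;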

Again, we do similar things: batch the database access and do double writing.

You shouldn't need any custom code for double writes here -- the new code should automatically populate the column. You can fill in the older records at your leisure (we still don't actually use this column even at HEAD, but will soon).

After the migrations you discuss, here are the bigger ones ahead of you:

  • 20140130.mail.1.retry.sql, 20140130.mail.2.next.sql, 20140201.gc.1.mailsent.sql, 20140201.gc.2.mailreceived.sql: These four migrations can be made much faster by preemptively deleting the majority of the records from the tables they affect. In old code, these tables (metamta_mail, metamta_receivedmail) were allowed to grow without bound. After these patches, we garbage collect records from these tables (after 90 days, I think). However, the patches apply to the entire table, and apply before GC starts. You can apply them far more quickly by identifying the largest ID in each table which is more than 90 days old (or pick some other window; 30, 14, or 7 days would be safe, but you might want to keep some data around), then deleting all smaller IDs; see the sketch after this list. These records will be deleted by Phabricator afterward anyway; the process will just be less efficient.
  • 20140211.dx.2.migcommenttext.php, 20140212.dx.1.armageddon.php: We can talk about these when you get there, I don't see a super easy way to deal with them.
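As a concrete sketch of the first item, assuming these tables have the standard id and dateCreated columns and using a 90-day window:

  -- Find the newest mail row older than ~90 days...
  SELECT MAX(id)
    FROM metamta_mail
    WHERE dateCreated < UNIX_TIMESTAMP() - (90 * 86400);

  -- ...then delete everything at or below that ID, ideally in bounded chunks
  -- so the table stays responsive (repeat until no rows are affected).
  DELETE FROM metamta_mail WHERE id <= @cutoff_id LIMIT 10000;

  -- Do the same for metamta_receivedmail before applying the four patches.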

Also, if you're in Menlo Park, I'm happy to come help you perform these migrations in person.

Yeah, you're right that we don't need anything special for D7513 -- I think just batching the DB queries to make the migration faster is enough. For D7139, double writing is not that hard, I think.
Also, thanks for the heads-up on the new big migrations -- I'll look into them later, once the current ones are done. Unfortunately I live in Seattle, so I have to communicate with you this way or on IRC.

I've put up a paste, P1075, with the custom code we plan to apply during the data migration for D7139. Please take a look if you have some time. I've tested it on a test tier and it works fine for me.

That patch looks good to me. I don't think any of the migrations had anything particularly tricky or subtle about them, so if they work in normal cases there's a good chance they'll work in all cases.

One more complaint from Facebook: could we bring the author name back to blame view?

A few more issues have popped up during the past few weeks -- most are general issues and not really related to the rebase, but I'm putting them here for some initial discussion anyway. I'll file individual tasks later on.

  1. Data compression. Some tables now take hundreds of GB and keep growing. We need to address this since the total DB size has already reached 2.2 TB and the hard limit is 2.5 TB. The two biggest are the file blobs and the differential hunks. We can enable compression at the table level in MySQL, but it's not as efficient as field-level compression from PHP code.
  2. In the configerator repo we have a folder with more than 16k files (all the sitevars). Opening it in Diffusion floods and DOSes the whole site. Should we limit the requests sent out here? Maybe only the files visible on the screen should trigger requests.
  3. Recently we had some issues with Herald rules when a commit is huge (a branch cut, for example). Basically, if a commit rule has a condition on the diff content, it will fail there, and after the daemon restarts it will fail there again, and so on. We should handle this type of failure better, or maybe not fetch the diff content at all if we know it's going to be enormous.
  4. Related to 3, the repeatedly failing Herald worker generates some huge emails on each run and floods the active task queue. All other tasks are essentially blocked. The task queue is roughly FIFO and doesn't take task types into account.

Data compression. Some tables now take hundreds of GB and keep growing. We need to address this since the total DB size has already reached 2.2 TB and the hard limit is 2.5 TB. The two biggest are the file blobs and the differential hunks. We can enable compression at the table level in MySQL, but it's not as efficient as field-level compression from PHP code.

  • Facebook should move file storage out of MySQL. It is not intended as a filestore for large installs. We ship with Local Disk and S3 storage engines, or we can help you write something against another file store (like Haystack or whatever). There are utilities to help you migrate files between storage engines. I'd guess compression won't help all that much since a lot of this data is probably images. This migration can happen without downtime.
  • We should probably move hunks to file storage too; generally we need to come up with a more nuanced approach for them because of T4045. I plan to get to this somewhat soon as it blocks T1191. How large are the hunk and file data tables, specifically?
  • At HEAD, we've added garbage collection to other tables which previously grew without bound (like user logs, conduit logs, and metamta). This may give you some headroom.
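For reference, the table-level compression mentioned above looks roughly like this in MySQL; it requires InnoDB with innodb_file_per_table and the Barracuda file format, and (as noted) it generally compresses worse than compressing the blob in application code before writing it. The table name differential_hunk is an assumption for the example:

  -- Rebuild the hunk table with InnoDB page compression; on a table that is
  -- hundreds of GB this ALTER can take a very long time.
  ALTER TABLE differential_hunk
    ROW_FORMAT=COMPRESSED
    KEY_BLOCK_SIZE=8;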

In the configerator repo we have a folder with more than 16k files (all the sitevars). Opening it in Diffusion floods and DOSes the whole site. Should we limit the requests sent out here? Maybe only the files visible on the screen should trigger requests.

This is T4366, right?

Recently we had some issues with Herald rules when a commit is huge (a branch cut, for example). Basically, if a commit rule has a condition on the diff content, it will fail there, and after the daemon restarts it will fail there again, and so on. We should handle this type of failure better, or maybe not fetch the diff content at all if we know it's going to be enormous.

I think this should be fixed in HEAD -- see T4276, D7885, D7887.

Related to 3, the repeatedly failing Herald worker generates some huge emails on each run and floods the active task queue. All other tasks are essentially blocked. The task queue is roughly FIFO and doesn't take task types into account.

We should fix the root issues in cases like this (Herald should already be fixed). I'm very hesitant to make the task queue more complicated so that it might perform better when there are severe failures. See T3557 for discussion.

Mail retry behavior should generally be improved and simpler at HEAD, too.

Ah, yes, it is T4366, sorry.
Both tables are 400+ GB (data + index). Yeah, I'll explore the alternatives you mentioned for the files.
Actually, we already have D7885 and D7887 in our instance. I'll try to dig up more detailed information on how it is failing.

One more question on task daemon behavior: when a task fails (like the Herald worker) and the daemon restarts after a short period, will it retry the failed task immediately even though the lease hasn't expired yet? I read the code and my understanding is that we always take unleased tasks first, but in production it seems to be failing repeatedly. I'm a little confused about that.

will it retry the failed task immediately even though the lease hasn't expired yet

It should not, no. Your understanding is consistent with my expectations.
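To make the lease mechanics concrete, the claim query in a lease-based queue looks roughly like this (table and column names here are illustrative, not necessarily Phabricator's exact schema); a task that just failed but still holds an unexpired lease simply isn't eligible to be picked up again:

  -- Claim one runnable task: either never leased, or the lease has expired.
  UPDATE worker_activetask
    SET leaseOwner = 'daemon-123',
        leaseExpires = UNIX_TIMESTAMP() + 3600
    WHERE leaseOwner IS NULL
       OR leaseExpires < UNIX_TIMESTAMP()
    ORDER BY id ASC
    LIMIT 1;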

There's nothing really actionable left here, I think.

  • On changesets, we now support compression and nearly support archiving.
  • The queue now has explicit priorities, although I'm not sure that's relevant.
  • There are tasks elsewhere for stuff like huge directory lists.