Page MenuHomePhabricator

Change notification server so it only alerts appropriate clients
Closed, ResolvedPublic

Description

In the Aphlict server code, right now if there's a notification for anybody, every client gets a push and then polls to see if it needed to update. In case there are a lot of phabricator tabs open, the requests are flooding the phabricator servers and have made service unavailable.

Event Timeline

sowedance assigned this task to epriestley.
sowedance raised the priority of this task from to Needs Triage.
sowedance updated the task description. (Show Details)
sowedance added a subscriber: sowedance.

In case there are a lot of phabricator tabs open, the requests are flooding the phabricator servers and have made service unavailable.

Can you quantify/substantiate this? While the scalability issue you mention is definitely a real one that we have plans to resolve, I don't want to jump the gun or fix the wrong thing if this isn't really the problem.

epriestley added a subscriber: Unknown Object (MLST).Jan 15 2014, 8:17 PM

We've had to disable the broadcast because the load was so high, we were essentially ddosing ourselves. Peak load at the load balancer with the notification server broadcasting was about 20x the peak with it disabled.

(by load, i mean request rate. should have used clearly language there)

Good enough for me. Steps to mitigate this are:

  • Restore LocalConnection support to aphlict.swf so each client machine issues one connection instead of each tab, and each push is responded to once per machine instead of once per tab. This was removed in D2704 because the students in 2011 couldn't get it working, but I suspect this was an expertise issue and there is no real technical reason we can't get it working (Facebook used LocalConnection to play sounds successfully for a number of years, although it seems to have dropped it recently and in doing so reintroduced the "sounds play multiple times from each tab" bug).
  • Allow clients to subscribe to object PHIDs (generally, their user PHID and any objects they're looking at).
  • Send notification update messages only to subscribed clients.

While we're in there, we should improve infrastructure:

  • We need a slightly more sophisticated channel to support Conpherence realtime updates (T4083).
  • Some of the debugging workflow for notifications is unclear right now.

(What's the urgency / deployment plan on your end? Are you planning to cherry-pick these changes?)

We'd probably cherry-pick, but I don't think this is at all urgent.

I disabled the broadcasting in early december and I don't think anybody's noticed or been bothered enough to file a task about it.

epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
epriestley edited this Maniphest Task.
chad edited this Maniphest Task.

I'm pretty sure our install is running into performance issues due to this as well.

@epriestley, do you remember what the issues with LocalConnection were? If I have time, I might try to fix this.

I was never directly aware of them, the student-interns were just like "we couldn't get it to work". If you fish through history, the commit prior to them removing that stuff should have what I believed was a working implementation. It might be approximately functional, or it might be a total mess.

The bigger issue is probably fixing the notification server itself so it notifies relevant clients instead of all clients. I think that's not too difficult, but I think it involves some compatibility breaks so I wanted to have a chunk of time to try to move everything over all at once (I added versioning a while ago to support this).

If you have a bit of time, I can probably put a bit of time in too and hopefully condense the breakage to a short period of time.

I'm going to try to do some work on this, hopefully.

joshuaspence raised the priority of this task from Low to Normal.May 25 2014, 10:34 PM

Increasing the priority because of T5063.

@epriestley, do you have any recollection of how to compile the ActionSctipt? We should add some developer documentation here (comments somewhere noticeable would suffice) because it's not as straight forward as I was expecting.

I believe that I need to download the Adobe Flex SDK + mxmlc compiler but its not clear where I can obtain these. There are NPM packages for mxmlc and flex-sdk, which seem easier than downloading from Adobe.

Sounds like you figured things out -- I think the best modern answer is probably to download the Apache Flex SDK: http://flex.apache.org

We should probably move the build script to bin/aphlict build and make the error message about missing mxmlc be more informative (i.e., "here's where to get it, yada yada").

Did you end up using the npm flex-sdk? The Apache installer was a bit of a pain (download a DMG to install an installer, then install the SDK after agreeing to like 8 licenses). Completely manageable, but if flex-sdk was easier we could recommend that instead.

Yeah I figured it out. flex-sdk was the way to go. It was pretty straight-forward to setup (besides a few cryptic npm errors).

./bin/aphlict build sounds reasonable (after D9226).

Cool.

Do you have the time/ambition/clarity to keep moving this forward? The next couple of steps might be a bit fuzzy. I can put some time into this too, but I don't want to step on your toes if you have a clear plan for moving forward.

Yeah, I should be able to put a bit of work into this.

Yeah, I should be able to put a bit of work into this.

awesometown

Before I go about implementing this, I want to discuss my possible approach to make sure that it is reasonable.

  1. The Aphlict client needs to be "aware" of the logged-in user.
    • Probably eventually there should be some sort of authentication involved here, after which it would be reasonable to send the message contents from the Aphlict server to the Aphlict clients (rather than just broadcasting the key).
    • Initially, we can be really dumb here and just believe whatever the client tells us. Since the client will have to make a call to /notification/individual/ in order to retrieve the contents of the notification, there shouldn't be any real harm in doing this. If the client lies about the user ID, then they will not be able to access the contents of the notification anyway.
    • We will probably need to ensure that the LocalConnection that is used to route notifications between multiple tabs is aware of the user ID as well. This is probably not likely to present issues except in a dev environment, when multiple users are logged in in different windows (or perhaps the LocalConnection cannot be shared between incognito Chrome sessions... I'm not really sure). This would be as simple as change the name of the master LocalConnection from aphlict_master to aphlict_master_${USER_ID}.
  2. The Aphlict server (aphlict_server.js) will need to store the user ID for all of the listeners.
  3. The broadcast function in aphlict_server.js will need to be modified to be function broadcast(data, recipients) where recipients is an array containing the user IDs that should receive the notification. This should be able to be determined using the PhabricatorNotificationQuery class.

@epriestley, wdyt?

Yeah, my plan is basically one step up from what you describe:

  • Update broadcast messages to include the PHIDs they should be broadcast to.
    • This already exists, because it's how we deliver non-realtime notifications. For example, revision updates are broadcast to the revision PHID itself, the author, actor, subscribers, etc.
  • Update broadcast messages to include some metadata about the message type (e.g., notification vs conpherence vs other things we add in the future).
  • The actual data on broadcasts stays as just IDs/PHIDs/non-authenticated stuff.
  • This is slightly more involved than just "users", because a user looking at T123 should receive real-time notifications about it, even if they aren't a subscriber/author/etc.

Then, update the client:

  • It now sends "listen to" and "stop listening to" messages to the server.
  • "listen to" is a list of PHIDs the client cares about.
  • "stop listening" is a list of PHIDs the client no longer cares about.
  • When the user visits a page, the browser sends "listen to" for the object PHID on the page (if one exists) and the user PHID (if not already listening).
  • When the user leaves a page, the browser sends "stop listening" (ideally; not a big deal if not).
  • We do some reference counting on the client to keep this not-too-spammy I guess.

So the server now knows which PHIDs a message is about, and which clients care about those PHIDs. It can broadcast messages to just the relevant clients, and we don't need to do any auth stuff.

This is a little bit involved, but shouldn't be too bad, and we'd need it anyway even if we do auth eventually to cover the "you can see T123 and are looking at it right now, but aren't associated with it in any way" case.

Ok. I'm working on a diff but I might leave some of that stuff out initially.

One issue that I am not sure of is how to access the user PHID from the JavaScript?

We will probably need to ensure that the LocalConnection that is used to route notifications between multiple tabs is aware of the user ID as well. This is probably not likely to present issues except in a dev environment, when multiple users are logged in in different windows (or perhaps the LocalConnection cannot be shared between incognito Chrome sessions... I'm not really sure). This would be as simple as change the name of the master LocalConnection from aphlict_master to aphlict_master_${USER_ID}.

Oh, and I think this is reasonable. This just needs to be non-confusing, not super-secure (since multiple open browsers on the same computer aren't meaningfully isolated from one another in a security sense and we aren't sending any authenticated data over this channel).

This starts to get trickier if we want to have just the master do the ajax request or something like that, and then send the results to the clients. But we can cross that bridge when we come to it.

@epriestley, I've moved some of your wishlist into T5284 since it's not-quite-related to this ticket. This ticket itself is almost complete IMO.

This is basically done... there is still some touching up to be done but this is more related to T5284.