
Improve reliability of real-time notification delivery
Closed, Resolved (Public)

Description

Just consolidating reports of real-time chat notifications not arriving reliably. See some discussion in T4083 and, uh, "General Chat", which I can't really link to yet; I'll file an issue for that shortly.

This is expected to probably work some of the time, but not necessarily all of the time. In particular:

  • Conpherence is generally churning a lot in connection with the "Conpherence v2" project (T7565).
  • I believe subscription management (which controls which notifications you receive) is finicky after D11769. There are some TODOs in the code about this. Prior to D11769, this was masked by bugs which oversubscribed users on the notification server. The notifications we send are now much closer to the correct set of notifications, but I believe the behavior may omit notifications in some cases.
  • There are some known bugs with handling behavior (like T6713 and some things connected to T7708) so we may not update the UI properly even if we do receive the event notification.
  • If you're using the chat column, you activate Quicksand which essentially breaks everything. In particular, subscriptions definitely won't update properly in Quicksand (T7680) and the column itself likely mis-handles them. This stuff falls under the general umbrella of T7573.

This will be reliable before we ship Conpherence v2, but a lot of the infrastructure is still in a buildout phase right now.

I'll poke at this and see if I can reproduce it never working, but if it works sometimes and is just unreliable, that's roughly in line with expectations until we're able to start stabilizing the Conpherence changes.

Event Timeline

I can get this to work some of the time, so it's not 100% broken; it's just really, really finicky right now. We'll shore it up as the rest of the Conpherence changes move forward.

We have been attempting to use Conpherence more often, and the real-time updates of the threads only seem to work for the first and sometimes second messages, and then stop. There are no errors in the browser's JS console, and other real-time notifications (e.g. task/diff updates, reload-page popups, etc.) appear to work reliably and do not appear to stop.

Repro

  1. User1 and User2 open an existing thread in the main Conpherence app.
  2. User1 types a message.
  3. Conpherence updates locally for User1, and User2 gets the message in the main Conpherence app.
  4. User2 replies and Conpherence updates locally for User2, but User1 does not get the message.
  5. User1, without having received the message from User2, types a message and presses submit. The Conpherence app updates locally for User1 with both the message from User2 and the message User1 just created, but User2 does not receive User1's message in real time. (This is then repeatable by both users.)

If either user refreshes the page at any time, all of the messages load.

I think this is fixed at HEAD now? Specific reproduction instructions are much appreciated if folks can still get errors here.

Migrated over from rocket.chat (to pick up all the other awesome things within Phabricator) and this is one of the first 'issues' I'm seeing with users. Almost everyone is connecting from Fedora 23 boxes with the latest Firefox release (44). It currently feels like you have to _send_ a message to get the updates in the chat room.

[...]. It currently feels like you have to _send_ a message to get the updates in the chat room.

This sums up our experience as well. This, in combination with the fact that there is no indication of whether the other user saw the message, makes it kind of unsuitable for messages where timely delivery is required.

I'd like to not have the users drop back to other means like Hangouts. I was happy to wind down the rocket instance, but I might have to bring it back if situations arise where real-time comms are required.

I'd be generally interested to know if there are any server-side modifications I should be making to lessen the effect. It seems like for now I'm just gonna have to mash F5 periodically to see if someone is talking to me.

It sounds like neither of you have actually configured Phabricator's real-time infrastructure. See:

https://secure.phabricator.com/book/phabricator/article/notifications/

Correct... I'll dive into that, although it doesn't look like it's a quick/simple config change. I'll report back when I have it running.

We had (and have) the notification server up and running.
Notifications are working in general, but not reliably for chat messages.

We discovered one more thing, namely that the issue seems to be I/O-related:

In general our server is a bit on the slow side I/O-wise.
In maybe 90% of the cases, chat notifications are off by one, and the receiver needs to refresh the page or send a message themselves to see everything. Sometimes it is enough for the sender to send another message to make their second-to-last message appear for the receiver.

When we move the MySQL DB to a RAM disk for testing, making I/O fast but unreliable, the situation changes.
In maybe 90% of the cases, chat notifications now work as expected.
There seems to be a timing/queuing/concurrency issue somewhere with the notification overtaking the chat message.
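
To make the suspected race concrete, here is a minimal toy simulation of how a notification that overtakes the database write would produce exactly the off-by-one behavior described above. This is a sketch under assumptions, not Phabricator code: `Thread`, `Client`, `fetch_since`, and the retry approach are all hypothetical names invented for illustration, and the real cause may be different.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Hypothetical stand-in for the message store (not Phabricator code)."""
    committed: list = field(default_factory=list)  # visible to readers
    pending: list = field(default_factory=list)    # written, not yet visible

    def write(self, message_id):
        # The message is written but, on slow I/O, not yet visible to readers.
        self.pending.append(message_id)

    def commit(self):
        # The write becomes visible to other connections.
        self.committed.extend(self.pending)
        self.pending.clear()

    def fetch_since(self, last_seen_id):
        return [m for m in self.committed if m > last_seen_id]


@dataclass
class Client:
    last_seen_id: int = 0

    def on_notify_naive(self, thread):
        # Fetch exactly once. If the write is not visible yet, the message
        # is missed until some later notification triggers another fetch.
        new = thread.fetch_since(self.last_seen_id)
        if new:
            self.last_seen_id = max(new)
        return new

    def on_notify_with_retry(self, thread, expected_id, wait, attempts=5):
        # Assumed mitigation: the notification carries the message ID, and
        # the client keeps fetching until that ID is actually visible.
        for _ in range(attempts):
            new = thread.fetch_since(self.last_seen_id)
            if new and max(new) >= expected_id:
                self.last_seen_id = max(new)
                return new
            wait()
        return []


def demo():
    # Naive client: each notification arrives before its own commit.
    thread = Thread()
    naive = Client()
    thread.write(1)
    first = naive.on_notify_naive(thread)   # message 1 not visible yet
    thread.commit()
    thread.write(2)
    second = naive.on_notify_naive(thread)  # sees message 1, one behind
    thread.commit()

    # Retrying client: `wait` stands in for time passing until the commit.
    thread2 = Thread()
    patient = Client()
    thread2.write(1)
    got = patient.on_notify_with_retry(thread2, expected_id=1, wait=thread2.commit)
    return first, second, got
```

In this toy model the naive client always lags one message behind, which matches the "send another message to make the previous one appear" symptom, while the retrying client picks the message up once the write becomes visible. Faster storage narrows the window between write and commit, which would be consistent with the RAM disk observation.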