
Prevent rapid AWS calls being made due to SSH test returning quickly
Abandoned, Public

Authored by hach-que on Jun 25 2015, 5:46 AM.

Details

Reviewers
epriestley
Group Reviewers
Blessed Reviewers
Summary

This prevents the EC2 blueprint from hammering the AWS API when SSH or WinRM start returning immediately.

We observe this scenario frequently, but we don't have access to Amazon's internal implementation to find out why their networking behaves this way. (Does the connection terminate while EC2 is assigning IPs? Does it terminate because the box temporarily brings up a firewall that REJECTs instead of DROPs?) The possible reasons a connection might be refused immediately, rather than waiting for the timeout, are numerous and impossible to know, and may vary between users.

This bug fix is required because, when connections are refused, we don't want to hammer the EC2 API in a fast-running loop and consume all of the API credits. Once the API credits / limits for the time period have been exhausted, AWS blocks API access for every application on the account, so when this scenario occurs, Phabricator essentially ends up DoS'ing any other applications using the AWS API.
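
As a rough illustration of the idea (not the code in this diff, which targets the PHP blueprint), the loop can be padded so that each iteration takes at least a minimum interval even when the connection test returns instantly. This is a minimal Python sketch; `wait_for_remote`, `probe`, `min_interval` and `max_attempts` are hypothetical names chosen for the example, and `probe` stands in for the SSH/WinRM test together with whatever AWS API calls accompany it.

```python
import time


def wait_for_remote(probe, min_interval=5.0, max_attempts=60,
                    clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it succeeds, at most once per min_interval seconds.

    probe is a hypothetical callable: it returns True once the host is
    reachable and False otherwise. Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        started = clock()
        if probe():
            return attempt
        # A refused connection returns almost immediately, so pad the
        # iteration up to min_interval. A probe that already waited out
        # its own timeout has used the interval and adds no extra sleep.
        remaining = min_interval - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    raise TimeoutError("remote host did not become reachable")
```

Sleeping only for the remainder of the interval, rather than for a fixed time after every attempt, keeps the slow (timeout) case unchanged while capping the fast (refused) case at one iteration per interval.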

Test Plan

Tested in production.

Event Timeline

hach-que retitled this revision to Prevent rapid AWS calls being made due to SSH test returning quickly.
hach-que updated this object.
hach-que edited the test plan for this revision.
hach-que added a reviewer: epriestley.
hach-que edited edge metadata.

Fix issues with calculation of sleep time

epriestley edited edge metadata.

(under certain, unknown conditions).

I don't want to upstream this kind of stuff.

This revision now requires changes to proceed. Aug 8 2015, 6:55 PM

(under certain, unknown conditions).

I don't want to upstream this kind of stuff.

To be clear, "certain, unknown conditions" doesn't mean "it doesn't happen". We observe this scenario frequently; we just don't have access to Amazon's internal implementation to find out why their networking behaves this way. (Does the connection terminate while EC2 is assigning IPs? Does it terminate because the box temporarily brings up a firewall that REJECTs instead of DROPs?) The possible reasons a connection might be refused immediately, rather than waiting for the timeout, are numerous and impossible to know, and may vary between users.

But this bug fix is still required because, when connections are refused, we don't want to hammer the EC2 API in a fast-running loop and consume all of the API credits. Once the API credits / limits for the time period have been exhausted, AWS blocks API access for every application on the account, so when this scenario occurs, Phabricator essentially ends up DoS'ing any other applications using the AWS API.

hach-que edited edge metadata.