Page MenuHomePhabricator

When exporting more than 1,000 records, export in the background
ClosedPublic

Authored by epriestley on Jan 29 2018, 7:31 PM.
Tags
None
Referenced Files
F14757439: D18962.id.diff
Wed, Jan 22, 1:08 AM
Unknown Object (File)
Tue, Jan 21, 10:34 AM
Unknown Object (File)
Mon, Jan 6, 12:16 PM
Unknown Object (File)
Mon, Dec 30, 8:20 AM
Unknown Object (File)
Mon, Dec 30, 6:34 AM
Unknown Object (File)
Sat, Dec 28, 6:29 PM
Unknown Object (File)
Dec 21 2024, 11:09 PM
Unknown Object (File)
Dec 21 2024, 10:47 AM
Subscribers
None

Details

Summary

Depends on D18961. Ref T13049. Currently, longer exports don't give the user any feedback, and exports that take longer than 30 seconds are likely to timeout.

For small exports (up to 1,000 rows) continue doing the export in the web process.

For large exports, queue a bulk job and do them in the workers instead. This sends the user through the bulk operation UI and is similar to bulk edits. It's a little clunky for now, but you get your data at the end, which is far better than hanging for 30 seconds and then fataling.

Test Plan

Exported small result sets, got the same workflow as before. Exported very large result sets, went through the bulk flow, got reasonable results out.

Diff Detail

Repository
rP Phabricator
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

To step back a minute, how do we feel about long-running requests taking up space on the web tier? It would be nice to live in a world where the max request time is bound tightly to a few seconds, and everything else runs in an "async" tier. Related to that, should we add rate limits to the export functionality? Seems like a good DoS vector as-is.

This revision is now accepted and ready to land.Jan 29 2018, 10:16 PM

To step back a minute, how do we feel about long-running requests taking up space on the web tier?

All web requests should currently be limited to 30s by max_execution_time in php.ini. (We currently don't set this explicitly and just use the default value.)

The exception is DiffusionServeController, which explicitly calls set_time_limit(0) so that uploads can take however long they want, but that's the only callsite to set_time_limit() in the codebase.

I think "execution time" only starts when we actually start executing, but after the enable_post_data_reading changes connected to T13008 that should generally be very shortly after the PHP process starts. In a default configuration, POST requests will read the entire body before "execution time" begins, I believe, but we now start executing as soon as the request can be routed.

The mod_reqtimeout changes from T13008 also prevent clients from connecting and then hanging or trickling data very slowly -- they must maintain a minimum upload rate. And they'll take up "maximum simultaneous connection" rate limit slots while they do. So they can eat up a few slots by uploading a 100GB file at the minimum rate, but they'll get limited on open connections from their remote address and/or instance before they harm other instances, at least in theory. Trickling data up also doesn't take much on our side beyond process slots.

(Separately, we also have query timeouts that kill long-running queries, and usually have time limits on other service/subprocess sorts of things. We also have mechanisms like "overheating" where pathological queries just fail instead of consuming resources, see T11773. And stuff like Conduit won't let you request large page sizes, and Calendar won't let you upload an event that repeats every second, and so on.)

So I believe it should be very hard to meaningfully take up too much space or too many resources on the web tier already, at least in a naive way by just making some big request. There are probably still be some ways you can do slow-ish things, but I think we've covered all the major stuff and have the tools to stop new slow-ish things from happening as they come up.

Obviously, you can still be legitimately abusive, e.g. write 100 million Herald rules and then every request will probably start timing out. But I think it's mostly pretty hard to be accidentally abusive, and mostly pretty hard to harm other instances too much.

It would be nice to live in a world where the max request time is bound tightly to a few seconds, and everything else runs in an "async" tier.

I think we already live in that world, at least mostly. "A few seconds" is "30 seconds", and "async tier" is "daemons on the repo tier", but this is more or less the goal of the Bulk workflow. See also T5166.

That task lists a few other things which we currently don't support because they'd be unboundedly slow, but which we can support with the bulk infrastructure. In some of those cases, we just haven't implemented these features because they'd require unbounded web process time.

We could reduce "30 seconds" to something tighter -- maybe "10 seconds" -- but one thing I think we'd get pushback on is the Maniphest Reports stuff, which is slow and awful but sees some use. After Facts happens I'm not aware of any other slow stuff that runs in web processes.

Related to that, should we add rate limits to the export functionality? Seems like a good DoS vector as-is.

The jobs queue up in the daemon queue at BULK priority, so you're only competing with other bulk jobs submitted by your instance (and lower priority tasks, which are just commit imports today). The total parallelism is limited by the number of workers your instance has. So queueing up a billion of these will only hurt your instance by slowing down how quickly lower-priority jobs complete.

You could repeatedly export 999-record result sets to make them happen in a web process, but these build pretty quickly (feels like it takes about a second?) and there are a zillion things you can do which take a second or two so I don't think this is exploring any new ground as a DOS vector.

so that uploads can take however long they want

This is Git LFS uploads, specifically. (Which aren't enabled in the cluster today, so it's even more moot.)

The terminology here also ended up being kind of confusing. There are two separate pieces of infrastructure here:

The BulkJob infrastructure implements an async tier in the daemons.

The Bulk Editor is built on top of that and lets you edit a bunch of tasks (or, in theory, other objects) at once, using the BulkJob infrastructure to do the edit via the daemon queue instead of on the web tier.

This new change uses the BulkJob infrastructure to move large/slow exports off the web tier.

Future changes may use this infrastructure to do recursive document moves in Phriction, build repository tarballs, etc.

Cool, that all makes sense 👍

This revision was automatically updated to reflect the committed changes.