
Decide the fate of FutureGraph
Closed, ResolvedPublic

Description

One limitation I'm concerned about with the new Arcanist Hardpoint infrastructure is that it makes parallelizing service calls difficult. I'm generally satisfied with it from a power/extensibility standpoint and I think the complexity is manageable, but I'm uncomfortable about performance.

A 2013 series of changes starting with D5104 built FutureGraph, which may be a reasonable approach here. That stuff never moved forward because the motivation was parallelizing queries (which we generally have less to gain from) and we didn't get anywhere I was satisfied with on the call chaining API. However, the performance motivation is now more material and call chaining is less of a concern with the simpler Conduit futures.

Revisions and Commits

rARC Arcanist
D21083
D21082
D21080
D21079
D21078
D21075
D21074
D21072
D21071
D21070
D21058
D21053
D21046
D21038
D21036
D21035
D21034
D21033
D21031
D21032
rP Phabricator
D21054

Event Timeline

avivey added a subscriber: avivey.Dec 8 2016, 6:41 PM
epriestley moved this task from Backlog to v3 on the Diffusion board.Jan 18 2017, 6:59 PM
epriestley edited projects, added Diffusion (v3); removed Diffusion.
epriestley edited projects, added Diffusion; removed Diffusion (v3).Feb 2 2017, 3:54 PM

A problem in moving forward here is that we ultimately do very little complex data access in Phabricator, and what complex data access we do perform can often be faked. We likely have more use cases in Arcanist and provisioning code: API calls are slower, workflows are more parallel/interactive, and we can't just fake it all with AJAX.

The basic "future graph" problem is that if we have a block of code like this today:

$list_a = id(new AQuery())
  ->needB(true)
  ->execute();

$list_x = id(new XQuery())
  ->needY(true)
  ->execute();

It will execute four queries in sequence:

SELECT * FROM a ...
SELECT * FROM b WHERE id IN (...)
SELECT * FROM x ...
SELECT * FROM y WHERE id IN (...)

Query "B" depends on query "A", and query "Y" depends on query "X", but the two pairs are otherwise independent of one another and could execute in parallel.

An ideal API would let the code above execute in parallel (so queries "A" and "X" were issued first, then "B" and "Y" as the first queries came back) without making the block itself significantly more complex -- and it should be easy to add a new thing which also executes in parallel without changing the existing code. That ("make it easy to use") is the hard part. We also "can't" use yield. This constraint is somewhat artificial, but I'd like to preserve it.
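The interleaving a FutureGraph would produce can be illustrated with plain PHP generators (this is just a toy scheduler to show the desired ordering, not the proposed API -- all names here are invented):

```php
// Each chain yields the name of the query it wants to issue next. A trivial
// round-robin scheduler drains the chains together, so "A" and "X" are
// issued before "B" and "Y" -- the parallel ordering described above.

function chain_ab() {
  yield 'SELECT ... FROM a';   // query "A"
  yield 'SELECT ... FROM b';   // query "B", depends on "A"
}

function chain_xy() {
  yield 'SELECT ... FROM x';   // query "X"
  yield 'SELECT ... FROM y';   // query "Y", depends on "X"
}

function run_interleaved(array $chains) {
  $issued = array();
  while ($chains) {
    foreach ($chains as $key => $chain) {
      if (!$chain->valid()) {
        unset($chains[$key]);
        continue;
      }
      $issued[] = $chain->current();
      $chain->next();
    }
  }
  return $issued;
}

$issued = run_interleaved(array(chain_ab(), chain_xy()));
// "A" and "X" are issued in the first round; "B" and "Y" in the second.
```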


A related issue is the "need/attach" pattern. Currently, callers use needX() on Query classes to request that properties be loaded, then attachX()/getX() on the objects. This is mostly fairly good. Two weaknesses are:

  • There's a moderate amount of boilerplate.
  • There's no nice way to say "load and attach this thing if it isn't already loaded and attached", and no way to easily build this given how the code is structured today. This leads to a low background level of "// Reload the objects to get such-and-such property." in the codebase, and public static function loadAndAttach(...).

It should be reasonable to say "this object has some named list of attachable properties", and to have a single definition of how to load-and-attach a property that can serve both needX() and loadAndAttachX(), where needX() would just become a shorthand for calling loadAndAttachX() on query result pages.
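A minimal sketch of that single load-and-attach definition might look like this (all names here are hypothetical, not the real Phabricator API):

```php
// A storage object declares attachable properties by key, and
// loadAndAttachProperty() is idempotent: "load and attach this thing if it
// isn't already loaded and attached". needX() on a Query would then just
// call this across each page of results.

final class DemoObject {
  private $attached = array();

  public function hasAttachedProperty($key) {
    return array_key_exists($key, $this->attached);
  }

  public function attachProperty($key, $value) {
    $this->attached[$key] = $value;
    return $this;
  }

  public function getProperty($key) {
    if (!$this->hasAttachedProperty($key)) {
      throw new Exception("Property '{$key}' is not attached.");
    }
    return $this->attached[$key];
  }

  // Only invokes the loader if the property is not already attached.
  public function loadAndAttachProperty($key, $loader) {
    if (!$this->hasAttachedProperty($key)) {
      $this->attachProperty($key, $loader());
    }
    return $this->getProperty($key);
  }
}
```

This would eliminate the "// Reload the objects to get such-and-such property." pattern: a second load-and-attach of the same property is a no-op.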

In the experimental / wilds branch, there's some "Hardpoint" code which accomplishes some of these goals. A "Ref" is like a storage object, a "HardpointLoader" is like a "Query" object. I think most of this code is largely reasonable. I'd like to move toward a world which looks like this, maybe with different names:

  • Ref: A storage object with hardpoints.
    • DAO: A database-backed storage object with hardpoints and query-based data access.
  • Loader: Defines how to look up data related to a ref.
    • Query: A database-backed looker-upper.

In particular, "Ref" is pretty overloaded and "[Lisk]DAO" isn't really connected to anything else these days, so I'm less excited about those names.

(A theoretical weakness is that there's currently no pattern to return some arbitrarily constrained variation of an attachment, e.g. "load and attach the first 10 diffs" or "load and attach all subprojects that start with 'q'". We never (?) need to do this today and I'm not sure we ever will -- in particular, MySQL generally cannot reasonably execute this kind of query against a list of things being attached to -- but it is maybe something to keep in mind.)

So: can we build "Ref + Hardpoint + Loader" in a way that works with Futures without making them substantially more complicated?

We expect that Loader ultimately should have some function that looks like this:

public function loadHardpoints(array $objects, $hardpoint_key) {
  $phids = mpull($objects, 'getPHID');

  $data = id(new Query())
     ->withParentPHIDs($phids)
     ->execute();

  foreach ($objects as $object) {
    $data_for_object = idx($data, $object->getPHID());
    $object->attachHardpoint($hardpoint_key, $data_for_object);
  }
}

With yield, this can be made future-oriented like this:

$data_query = id(new Query())
   ->withParentPHIDs($phids);

yield $data_query;
 
$data = $data_query->resolve();

The advantage of yield is that the rest of the function state is preserved "for free". We can make this work without yield like this:

if (!$this->dataQuery) {
  $phids = mpull($objects, 'getPHID');
 
  $data_query = id(new Query())
    ->withParentPHIDs($phids);

  $this->dataQuery = $data_query;

  return $data_query;
}

But this means we need to manually store all function state in object properties, and explicitly write a function which rebuilds function state. Prior to yield, we made this work at Facebook, but it was pretty messy.

For example, if the Loader has three steps and steps 1 and 3 need $phids, you either compute it twice (slow) or store it (boilerplate/mess). The boilerplate is also all totally unique so it's super error prone.
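A minimal sketch of this state-in-properties style, with hypothetical names, shows the problem: locals like $phids have to be promoted to object properties so later steps can rebuild their state.

```php
// Each call to step() either returns a "query" for the caller to execute in
// parallel, or finishes the work. State that would normally live in locals
// ($phids here) must be stored on the object so step 2 can see it.

final class ManualStateLoader {
  private $phids;
  private $pendingQuery;
  private $result;

  public function step(array $objects) {
    if ($this->pendingQuery === null) {
      // Step 1: build the query and hand it back for parallel execution.
      $this->phids = array_keys($objects);
      $this->pendingQuery = 'query-for-'.implode(',', $this->phids);
      return $this->pendingQuery;
    }
    // Step 2: rebuilt from object properties, not local variables.
    $this->result = array_fill_keys($this->phids, 'attached');
    return null;  // done
  }

  public function getResult() {
    return $this->result;
  }
}
```

Every loader ends up with its own ad-hoc property soup like this, which is the "totally unique boilerplate" problem.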

This also overloads the meaning of return. We can fix the overload of return by replacing return $data_query with throw new YieldToQueryException($data_query), but this isn't much cleaner.

Another way to make this work looks more like this:

  $data = id(new Query())
     ->withParentPHIDs($phids)
     ->thenCall(array($this, 'continueLoadingStep2'));

  return $data;
}

public function continueLoadingStep2(...) {
  foreach ($objects as $object) {
    ...
  }
}

This is roughly the direction the original FutureGraph series went in. This allows you to potentially use argument passing to transfer state, but some of the arguments you're interested in passing ($this, the resolved result of the Future) can't be referenced when you declare the call. D5028 used this style and is okay, but there's a whole lot of legwork there.


I think complex loading sequences don't need to be definable in a procedural way. For example: load all objects of type X, then load all sub-objects of those objects with unusual constraint Q. We basically never actually do this. If we do, it's fine for the API to require that "unusual constraint Q" can be held in the state of some Query object and the API looks like needSubobjects($matching_this_subquery), and you just have to do some kind of weird special case if your subquery really depends on your primary object results.

That is, we might want to load all highly-utilized volumes on all downed nodes behind all west-coast load balancers. The non-future version of this looks like:

$balancers = load_balancers('coast = west');

$nodes = load_nodes($balancers, 'state = down');

$volumes = load_volumes($nodes, 'fullness > 95');

The future version has an issue where it's difficult to pass $balancers to load_nodes(...). We can do this in the thenCall() style, which is approximately just de-sugared continuations:

$balancers_future = load_balancers('coast = west');

$balancers_future->thenCall('loadNodesForTheseBalancers', 'state = down', ...);

Argument passing is a mess. Argument passing is a huge mess if we don't know the state we want to load ahead of time.

We can do the continuation style (except: not supported by PHP versions we support):

$balancers_future = load_balancers('coast = west');

$nodes_future = $balancers_future->thenCall(function($future) {
  $balancers = $future->resolve();
  return load_nodes($balancers, 'state = down');
});

We can do the yield style (except: not supported by PHP versions we support):

$balancers_future = load_balancers('coast = west');

yield $balancers_future;

$balancers = $balancers_future->resolve();

$nodes_future = load_nodes($balancers, 'state = down');

yield $nodes_future;

All of these are quite messy and we essentially never actually do any of them. The far messier versions are when you want to load all highly-utilized volumes on downed nodes if more than 16 load balancers are in service, and something else if fewer than 16 load balancers are in service. This happened sometimes at Facebook, I believe, but we literally have zero queries that do this in Phabricator, I think. (There are likely a lot of reasons, but at least some of those reasons are rooted in Phabricator having much more flexibility to mutate storage, far more rarely needing to follow an Edge as the first step in a query, and almost always being able to order objects reachable via Edges by Edge age instead of "machine-learned friendship score" or whatever.)

In particular, we have zero cases like "get all of the user's friends, and all of the user's events, and all of the RSVP'd users to those events, and then intersect the attending users with the user's friends, and then order that result list by friendship score, and then load the full profiles for those friends and events, and then suggest that list to the user as a reminder that you're attending Joe's Birthday tonight with Amy and Ryan." And if we do have cases like that in the future, it's fine to write an AmyAndRyanFriendshipAttenderQuery to deal with it.


yield seems like the obviously-least-bad of these options, but requires a minimum PHP version of 5.5, which introduced generators. This is probably realistic to require. Raising the minimum version also gets us traits (introduced in PHP 5.4, released March 1, 2012), which would simplify some things.

How closely can we emulate yield without yield? Suppose this:

$data = id(new Query())
   ->withParentPHIDs($phids)
   ->execute();

...becomes this:

$query = $this->getQuery('xyz', new WhateverQuery());
if ($query->isUnresolved()) {
  $query->withParentPHIDs(...);
}
$data = $query->resolve();

The "trick" here is that resolve() throws some YieldException() the first time it executes. Maybe this isn't completely terrible?
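A toy version of the trick, with invented names, might look like this: resolve() throws on the first pass, the engine catches the exception, executes the query, and re-enters the loader from the top; the second pass finds the query resolved and proceeds.

```php
final class DemoYieldException extends Exception {}

final class DemoQuery {
  private $result;

  public function isUnresolved() {
    return ($this->result === null);
  }

  public function setResult($result) {
    $this->result = $result;
  }

  public function resolve() {
    if ($this->isUnresolved()) {
      throw new DemoYieldException();
    }
    return $this->result;
  }
}

function run_loader(DemoQuery $query, $loader) {
  while (true) {
    try {
      return $loader($query);
    } catch (DemoYieldException $ex) {
      // "Execute" the query, then re-enter the loader from the top.
      $query->setResult('data');
    }
  }
}

$passes = 0;
$result = run_loader(new DemoQuery(), function($query) use (&$passes) {
  $passes++;
  $data = $query->resolve();  // throws on the first pass only
  return $data;
});
// The loader body runs twice: once up to the throw, once to completion.
```

The cost is the same as the manual-state version: any work the loader does before resolve() is repeated on re-entry, so anything expensive still has to be cached on the object.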

Even with yield I think we're stuck with $data = (yield $result)->resolve(); which isn't great.

So that might leave us with:

  • Follow the outline in D5104 + D5105 + D5107 + D5108-ish to turn FutureIterator into FutureGraph and resolve all futures via FutureGraph.
  • Follow the outline on experimental to structurally define Ref and Loader APIs, but with a mind toward integration with DAO and Query in the future. I have a separate HardpointList change which is likely closer in various ways.
  • Make the core loading workflow future-aware.
  • Probably require PHP 5.4 regardless.
  • Use whichever of yield or throw feels less messy.

> Probably require PHP 5.4 regardless.

Looking at this more closely: PHP 5.5 gets us finally, which is a fairly meaningful piece of sugar, so I'm inclined to make that the minimum version instead.

Somewhere in experimental or wilds, I introduced ArcanistConduitEngine. This has some weird fake future stuff going on, so this is probably now ripe.

One adjacent issue is T13177, which amounts to "move service profiler integration to Future". This makes good sense.

The core idea in D5104 + D5105 is that $future->resolve() and id(new FutureIterator(array($future)))->next() (like, roughly) execute meaningfully different code paths.

An adjacent issue is that the Future->resolve() prototype takes $timeout, which is "maximum amount of time to spend in resolution". The entire codebase appears to have no nontrivial callers of this parameter (all callers are in unit tests to make sure the parameter works, etc). Generally, any caller which wants to return control to the caller prior to resolution can reasonably bear the burden of wrapping the future in an explicit Iterator. I'm going to remove this parameter to simplify things.

In the specific case of the Hardpoints, we currently often have code which loads objects but doesn't do anything with them. For example, most Query classes use didFilterResults() to fill things-that-sure-look-like-hardpoints, but few do anything with the results.

In this particular case, we can avoid the whole yield issue by reentering the graph inside the call. That is, if some callsite says:

Loader::loadSynchronous(X);

The only costs to continuing resolution inside Loader::loadSynchronous() are that, until we resolve the future:

  1. we can't reach the code after this call; and
  2. we can't reach code after any parent call in the callstack; and
  3. eventually, we'll blow the callstack.

(1) doesn't matter, since the call is explicitly a synchronous barrier anyway.

I'm fine with putting a hard depth cap in place to prevent (3), callers "should never" be resolving futures thousands-deep. The two times you truly need to write "resolve all ancestors in this arbitrarily large graph" code in the codebase, you can use futures directly.
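A hard depth cap might be as simple as this (names hypothetical; finally requires PHP 5.5, consistent with the minimum version discussed above):

```php
// Each nested synchronous resolution bumps a depth counter and fails
// loudly rather than blowing the callstack.

final class DemoResolver {
  const MAX_DEPTH = 64;

  private $depth = 0;

  public function loadSynchronous($work) {
    if ($this->depth >= self::MAX_DEPTH) {
      throw new Exception('Synchronous future resolution is too deep.');
    }
    $this->depth++;
    try {
      return $work($this);
    } finally {
      $this->depth--;
    }
  }
}
```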

Condition (2) isn't universally free. We can have code like this:

$iterator = new FutureIterator(array($a, $b));
foreach ($iterator as $future) {
  if ($future === $a) {
    $iterator->addFuture($c);
  }
}

If B uses a synchronous barrier, we won't return execution to the top level and queue C until B resolves, even if A resolves much sooner.

However, maybe we have very little of this code, particularly after restructuring into Hardpoints. I think this kind of code is generally hard to read, understand, and maintain and probably not desirable in most cases.

After D21036, FutureIterator is approximately ready to become FutureGraph, so I'm going to take a closer look at how Hardpoint can be structured to work with Future.

Here's an actual example of loadHardpoints($objects, $hardpoint):

public function loadHardpoints(array $refs, $hardpoint) {
  $api = $this->getQuery()->getRepositoryAPI();

  $futures = array();
  foreach ($refs as $ref_key => $ref) {
    $hash = $ref->getCommitHash();

    $futures[$ref_key] = $api->execFutureLocal(
      'log -n1 --format=%C %s --',
      '%s%n%n%b',
      $hash);
  }

  $iterator = $this->newFutureIterator($futures);

  $results = array();
  foreach ($iterator as $ref_key => $future) {
    list($stdout) = $future->resolvex();
    $results[$ref_key] = $stdout;
  }

  return $results;
}

Without any changes, it's possible for this to be restructured so that:

  • newFutureIterator() adds the futures to an existing pool of futures, then returns a "View" object into a subset of that pool.
  • foreach ($iterator ... works on all the futures in the entire pool, but only returns execution here when a future which is part of the given "View" resolves.

This isn't perfect: foreach ($iterator ... becomes a barrier for additional work that could otherwise be done in parallel using futures that were queued higher in the callstack. But it does mean that the default behavior would be fairly good. If this piece of code was blocked by something lower in the callstack inside iteration, it wouldn't affect its behavior in any significant way.
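The pool-plus-view idea can be sketched in miniature (all names invented; "futures" are simulated as countdown timers so the sketch is self-contained):

```php
// All futures live in one shared pool. A view only waits on its own keys,
// but stepping the view advances every future in the pool, so work queued
// elsewhere in the callstack keeps making progress.

final class DemoPool {
  private $futures = array();  // key => remaining "ticks" until resolution
  private $done = array();

  public function addFutures(array $futures) {
    $this->futures += $futures;
    return new DemoView($this, array_keys($futures));
  }

  // Advance every future in the pool by one tick.
  public function step() {
    foreach ($this->futures as $key => $ticks) {
      if ($ticks <= 1) {
        unset($this->futures[$key]);
        $this->done[$key] = true;
      } else {
        $this->futures[$key] = $ticks - 1;
      }
    }
  }

  public function isDone($key) {
    return isset($this->done[$key]);
  }
}

final class DemoView {
  private $pool;
  private $keys;

  public function __construct(DemoPool $pool, array $keys) {
    $this->pool = $pool;
    $this->keys = $keys;
  }

  // Block until every future in this view resolves; the whole pool keeps
  // stepping in the meantime.
  public function resolveAll() {
    do {
      $this->pool->step();
      $ready = true;
      foreach ($this->keys as $key) {
        if (!$this->pool->isDone($key)) {
          $ready = false;
        }
      }
    } while (!$ready);
  }
}
```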

Some of the other styles we could use to express this are:

Yield an Iterator
  $iterator = $this->newFutureIterator($futures);

+ yield $iterator;

  foreach ($iterator as ...

We can yield the iterator to mean "resolve all futures on this iterator before returning execution here".

This isn't perfect either: it allows us to continue higher in the call stack, but means we can not continue on these futures until they all resolve.

We could fix that with something like this:

Yield an Iterator + Working Set
while ($iterator->isWorking()) {
  yield $iterator;
  foreach ($iterator->getReadyFutures() as $future) {
    // ...
  }
}

Since we can $iterator->addFuture() and hold state in the method I think this allows us to resolve any sequence of futures "perfectly", but this is rapidly getting pretty icky.

We can use callbacks, which is the approach that D5111 + D5112 originally pursued:

Callbacks
$future->thenCall($this, 'processLog', $object, ...);

I stylistically dislike callbacks; elsewhere, JavaScript has moved from "callback hell" to futures to async/await.

A "light" version of callbacks is to give loadHardpoints() separate "generate futures" and "resolve/attach one future" calls. This isn't necessarily bad but means hardpoints can only have one round of future dispatch in the general case. Maybe this is fine?

We can sort of implement a variant of "await" but since "await" is just sugar on futures and we have to desugar it, I think there's no benefit and we end up with syntax that looks almost exactly the same as callbacks.

We could also move the processing logic into FutureProxy objects; the vast majority of post-resolution processing can likely be handled by a few templates. But this is really another flavor of callbacks.

I'd like to choose a design here by building a general-purpose system and then sugaring it for the common cases, but there are about five different reasonable ways to build the general-purpose system and it's not totally clear that any of them would actually have any callers.


We do actually have some nontrivial loaders from arc browse in experimental. The rough logic for one is this (the others are largely part of this sequence):

  • You type arc browse master.
  • To figure out which revision master corresponds to, we load the (local, client) commit identified by the name master.
    • Then, we load revisions associated with that commit's hash and tree hash.
    • Also, we load that commit's message.
      • Then, we load revisions identified by the commit's message.

The longest path may result in this call sequence:

  • git log -n1 ... --format=X master (commit hash)
  • git log -n1 ... --format=Y master (commit message -- this currently uses %s%n%n%b but should be updated to %B). This could be done in one step if the first loader was slightly smarter.
  • differential.query (to pull the revision object)

Each service call currently blocks and resolves synchronously, although we execute all the service calls of the same kind in parallel.

In this particular case, this is probably actually near-optimal, since we expect differential.query to dominate the cost of resolution here and don't want to dispatch a separate call per-object.

The code generally looks like this:

$this->newQuery($refs)
  ->needHardpoints(
    array(
      'commitRef',
    ))
  ->execute();

$commit_refs = array();
foreach ($refs as $ref) {
  $commit_refs[] = $ref->getCommitRef();
}

$this->newQuery($commit_refs)
  ->needHardpoints(
    array(
      'message',
    ))
  ->execute();

$map = array();
foreach ($refs as $ref_key => $ref) {

...where each execute() is synchronous.

This particular case could be simplified slightly by letting needHardpoints() express more complex queries (load hardpoint A, then load hardpoint B on those hardpoints).

This whole mess is also probably mooted by the realization that looping over git log appears to be expressible with a single execution of git show -q --format=X ... A B -- (except that if one name is bad, this entire command is poisoned and fails, and we would need to fall back to individual commands).

I have some code which runs and looks plausible (i.e., not covered in piles of callback garbage), at least:

$engine->requestHardpoints(
  array($object),
  array('A', 'B'));

public function loadHardpoint(array $objects, $hardpoint) {
  $future = new ExecFuture('sleep 5 && echo %s', $hardpoint);

  yield $this->yieldFuture($future);

  list($stdout) = $future->resolvex();
  $stdout = trim($stdout);

  $result = array();
  foreach ($objects as $object_key => $object) {
    $result[$object_key] = $stdout;
  }

  return $result;
}
$ time ./test.php  --trace
>>> [0] (+0) <exec> $ sleep 5 && echo 'A'
>>> [1] (+1) <exec> $ sleep 5 && echo 'B'
<<< [0] (+5,006) <exec> 5,006,335 us
<<< [1] (+5,006) <exec> 5,005,445 us
string(1) "A"
string(1) "B"

real	0m5.048s
user	0m0.031s
sys	0m0.017s

This appears to cleanly address the "complex dependencies" use case, where you want some piece of data which exists in a well-known place but is complicated to build, like the revisions which correspond to some symbol when the user types arc browse quack.

On its own, this doesn't cover the other major use case -- where you don't have direct access to the place where the data should be attached. For example, you have a User and want to load all of their ExternalAccount objects and all the AccountIdentifier objects on all those objects. You can't directly request AccountIdentifier objects since this isn't a hardpoint on User and you don't have references to the ExternalAccount objects yet.

This use case is fairly rare in Phabricator and can likely be addressed by making the request more complex -- instead of requesting hardpoint "A", you make a structured request for "A" and hardpoint "B" on the results of that request. This runs into additional complexities (if we load some "A", should we queue "B" loads immediately, or wait for the entire set to finish? The answer depends on the nature of "A" and "B" and the query) but I think there's nothing here that's very hard to overcome.
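A structured request like this could plausibly be expressed as plain data and resolved level by level (everything here is invented for illustration). The property we care about is that each level issues one batched load for the whole set, rather than one load per object:

```php
// Resolve "load hardpoint A, then hardpoint B on the results of that
// request": each level invokes its loader ONCE over the combined set.

function resolve_request(array $objects, array $request, array $loaders) {
  $loader = $loaders[$request['hardpoint']];
  $children = $loader($objects);  // one batched load per level
  if (isset($request['then'])) {
    resolve_request($children, $request['then'], $loaders);
  }
  return $children;
}

$calls = array();
$loaders = array(
  'externalAccounts' => function(array $users) use (&$calls) {
    $calls[] = 'externalAccounts';
    $accounts = array();
    foreach ($users as $user) {
      $accounts[] = $user.':account';
    }
    return $accounts;
  },
  'accountIdentifiers' => function(array $accounts) use (&$calls) {
    $calls[] = 'accountIdentifiers';
    $ids = array();
    foreach ($accounts as $account) {
      $ids[] = $account.':id';
    }
    return $ids;
  },
);

$request = array(
  'hardpoint' => 'externalAccounts',
  'then' => array(
    'hardpoint' => 'accountIdentifiers',
  ),
);

$accounts = resolve_request(array('alice', 'bailey'), $request, $loaders);
```

The "queue 'B' immediately vs. wait for all of 'A'" question from above would live inside resolve_request(): this sketch waits for the entire level to finish before descending.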

I'm also not dealing with a bunch of other complexities here, like error propagation when a particular hardpoint fails to load. But the abstraction has room to represent hardpoint state and I suspect none of this is terribly difficult to navigate.

Authors can make one notable error here -- omitting the yield keyword. If you omit yield, your code works correctly but may not run as quickly as it could. We could reasonably add a lint warning about this ("result of yield...() method unused") since it seems fair to define a convention that methods beginning "yield..." expect to be used with yield.

ftdysa added a subscriber: ftdysa.Feb 29 2020, 10:48 PM

XHPAST currently can't build an AST for $result = yield ..., even though this is a valid construct. This is probably a straightforward fix.

> XHPAST currently can't build an AST for $result = yield ..., even though this is a valid construct. This is probably a straightforward fix.

See discussion in T13492. This isn't actually a valid construct in PHP 5.5, where the syntax must be $result = (yield ...);.

Generators can't "return" until PHP 7:

https://wiki.php.net/rfc/generator-return-expressions

Prior to PHP 7, return from a generator fatals:

PHP Fatal error: Generators cannot return values using "return" in /core/data/drydock/workingcopy-75/repo/arcanist/src/query/ArcanistCommitUpstreamHardpointQuery.php on line 19

This is definitely not great, although probably not exceptionally hard to work around.
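One plausible workaround (not necessarily what the final code does): yield the final result wrapped in a sentinel object instead of returning it, and have the engine recognize the sentinel.

```php
// Before PHP 7, "return $value;" inside a generator fatals. Yielding a
// wrapper as the last value stands in for the return.

final class DemoResult {
  public $value;

  public function __construct($value) {
    $this->value = $value;
  }
}

function demo_query() {
  yield 'future-1';              // engine resolves this
  yield 'future-2';
  yield new DemoResult('done');  // stands in for "return 'done';"
}

function drive($generator) {
  foreach ($generator as $yielded) {
    if ($yielded instanceof DemoResult) {
      return $yielded->value;
    }
    // ...otherwise, resolve the yielded future and continue...
  }
  return null;
}
```

(On PHP 7+, Generator::getReturn() makes this wrapper unnecessary.)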

epriestley closed this task as Resolved.Apr 11 2020, 2:41 PM

"FutureGraph" is dead. Long live "HardpointEngine".

The current implementation does not cover these things:

Declarative, Plan-Based Queries / "Query Language": You can't currently translate "load all users, then load all their revisions with the word 'node', then load all failing unit tests for those revisions" into a "Request" for "HardpointEngine" and run it in parallel with other queries which may need to load similar (or identical) resources. The API imagines supporting some of this eventually, but I haven't prototyped it and it may be fairly complicated. This is currently omitted mostly because there are no use cases in workflows running on it today, and it's not clear this is really a use case which needs much support anywhere in Phabricator.

(HardpointEngine can execute this request plan perfectly fine today, the code to do it just isn't a tidy, declarative "plan"-like piece of query language that can go at the application top level.)

Hardpoint Fetch Deduplication: If you issue two separate requests for ref A's hardpoint H at the same time, we may currently execute the resource fetch twice (and then discard the second result): hardpoints currently do not get marked into a "pending" state. This is probably an easy fix but I'm waiting until it shows up on a profile so I don't need to build a fake test case.
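The shape of the fix is probably just keying in-flight fetches, sketched here with invented names (a real version would store a pending future under the key rather than executing the loader synchronously):

```php
// A second request for the same (ref, hardpoint) pair reuses the in-flight
// fetch instead of issuing it twice.

final class DemoFetchCache {
  private $pending = array();
  private $fetches = 0;

  public function fetch($ref, $hardpoint, $loader) {
    $key = $ref.'/'.$hardpoint;
    if (!array_key_exists($key, $this->pending)) {
      $this->fetches++;
      $this->pending[$key] = $loader();
    }
    return $this->pending[$key];
  }

  public function getFetchCount() {
    return $this->fetches;
  }
}
```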

Request Deduplication: If you issue a request for branch A's commit and a separate request for branch B's commit at the same time, this currently resolves as two separate requests which can run in parallel but can't batch together. In theory, the engine can sometimes merge incoming requests for hardpoint H on object O into any similar request which has not been started yet. We likely don't issue any of these today and I'm sort of waiting for a profile here, too.

Ref Deduplication: If you issue a request for branch A's commit and symbol B's commit, and they happen to be the same commit, we currently build two separate refs for this commit. In theory, refs are append-only and the engine could generate and attach the same object in both cases. This circumstance probably never happens today. Duplicating refs like this likely doesn't matter much -- the cost of the ref duplication itself is very small. The most likely path to real costs is probably that we later load hardpoints for both refs when we don't strictly need to. If the "fetch deduplication" was fixed, most cases of this might be mooted.

Hardpoint Extensions: Hardpoints aren't currently modular, but the API imagines they will become modular.

Debugging Tools: You can use arc inspect and --trace, but there's no way to dump the request execution graph out of a HardpointEngine today. It's likely easy to add something simple, at least, but this feels somewhat premature today.

Yielding Database Queries / Phabricator Support: There's currently no yieldable database query and PhabricatorQuery does not extend ArcanistHardpointQuery. I expect to do this sooner or later (part of the motivation behind this change was to formalize "make sure hardpoint X is loaded on object O") but this is mostly aspirational for now.

Exceptions and Failures: There's no explicit support for propagating exception/failures states out of the Query layer today, but HardpointList can freely store arbitrary amounts of hardpoint metadata so I anticipate this being reasonable to deal with once the code gets there. Today, all queries in Arcanist just fail the request entirely if they fail, which I believe is a sort of broadly reasonable thing for them to do. Needs are more nuanced in Phabricator (where things can fail for policy reasons) but in Arcanist we generally don't expect git show X to fail for any reason other than "catastrophe occurred" or "X does not exist".