
Improve the performance of custom field queries
Closed, Resolved · Public

Description

The normal usage pattern I'm seeing is that instead of making "sibling aware" queries to retrieve values from storage, custom field callers iterate over their objects and issue queries individually. There are comments to this effect littered throughout the codebase.
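
Concretely, the shape is something like the sketch below (the loop and variable names are illustrative, not a specific call site):

foreach ($objects as $object) {
  $fields = PhabricatorCustomField::getObjectFields(
    $object,
    PhabricatorCustomField::ROLE_CONDUIT);

  $fields
    ->setViewer($viewer)
    ->readFieldsFromStorage($object);
}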

In most contexts, when you're viewing or interacting with custom fields, you're doing so with a single object, so these inefficiencies don't usually stack up. There are a few notable exceptions, mainly Differential API calls. Each of these API methods will execute ~20 + 2N queries when called.

That's not so, so, so terrible, until you start writing your own custom fields that have storage backing. A typical API call for revisions results in ~800 queries being run on our production instance right now. Couple that with the fact that our engineers love to hammer the API and...

I've screwed around a little with replacing custom field data exposed as "fields" in Conduit with "attachments", using engine extensions to sneak them into misbehaving search engines. Interestingly, search engine attachments are given the opportunity to avoid issuing N+1 queries, but engine extensions are not.

Event Timeline

I think the issue is that there's no callable thing for bulk-loading custom fields right now, which is why SearchEngineExtension doesn't have a hook for it. Other SearchEngine extensions don't need it since none currently load any data.

Having such a hook (likely structured very similarly to the EditEngineExtension hook) is desirable once we have something for at least one extension to do with it.

Although doing this as an attachment should be possible, I think you'll hit the same wall immediately -- specifically, that there's no replacement that you can just drop in here to make it do a bulk load:

$fields = PhabricatorCustomField::getObjectFields(
  $object,
  PhabricatorCustomField::ROLE_CONDUIT);

$fields
  ->setViewer($this->getViewer())
  ->readFieldsFromStorage($object);

There should be no particular technical difficulty in building some kind of bulk variant and then using it in SearchEngine, although I don't have a granular technical attack on it in mind so it's hard for me to spell out exactly the shape of an upstreamable patch. I'd like to fix or "fix" this (from the current load code) too:

// NOTE: We assume all fields share the same storage. This isn't guaranteed
// to be true, but always is for now.

On its own, fixing the SearchEngineExtension won't fix differential.query, since it has a separate, similar copy of this code. Fixing that is probably simple and unlikely to trigger surprises, and since moving arc to the EditEngine APIs is probably still some ways away (roughly, T10945), it's worth fixing once a "do all that, but in bulk" replacement is available, since arc will benefit in the meantime.

This is brittle but it works if your objects all use the native storage.

<?php

final class CIBulkCustomFieldStorage extends Phobject {

  /**
   * Bulk-load storage values for the given custom field keys across many
   * objects at once, instead of issuing one query per object. Loaded values
   * are written directly onto each object's fields.
   */
  public static function readFieldsFromStorage(
    array $objects,
    array $field_keys,
    $role) {

    // Map each requested field key (as accepted by getObjectField()) to its
    // storage index, so a single query can cover every field on every object.
    $storage_indexes = array();
    foreach ($objects as $object) {
      foreach ($field_keys as $field_key) {
        $object_field = PhabricatorCustomField::getObjectField(
          $object,
          $role,
          $field_key);
        $storage_index = $object_field->getFieldIndex();
        $storage_indexes[$storage_index] = $field_key;
      }
    }

    if (!$storage_indexes) {
      return;
    }

    // NOTE: Like the upstream code this is based on, this assumes all fields
    // share the same storage table, so any one field (here, the last one
    // visited above) can provide the storage object.
    $table = $object_field->newStorageObject();

    // Load the stored values for every object and every field with a single
    // query, then group them by object PHID. Unsaved objects have no PHID
    // and so can have no stored values.
    $values = array();
    $phids = array_filter(mpull($objects, 'getPHID'));
    if ($phids) {
      $values = $table->loadAllWhere(
        'objectPHID IN (%Ls) AND fieldIndex IN (%Ls)',
        $phids,
        array_keys($storage_indexes));
      $values = mgroup($values, 'getObjectPHID');
    }

    // Write the loaded values back onto each object's fields. Missing rows
    // write null so every field still gets initialized, and falsey stored
    // values like "0" are preserved rather than discarded.
    foreach ($storage_indexes as $storage_index => $field_key) {
      foreach ($objects as $object) {
        if (!$object->getPHID()) {
          continue;
        }
        $field_values = mpull(
          idx($values, $object->getPHID(), array()),
          'getFieldValue',
          'getFieldIndex');
        $storage_value = idx($field_values, $storage_index);
        $field = PhabricatorCustomField::getObjectField(
          $object,
          $role,
          $field_key);

        $field->setValueFromStorage($storage_value);
        $field->didSetValueFromStorage();
      }
    }
  }

}
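
A call site might look roughly like this (hypothetical; $revisions is a placeholder, and the field key is the one used by the custom field shown further down):

CIBulkCustomFieldStorage::readFieldsFromStorage(
  $revisions,
  array('integrator:dev-tests'),
  PhabricatorCustomField::ROLE_CONDUIT);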

Example attachment:

<?php

final class CIDevelopmentTestSearchEngineAttachment
  extends PhabricatorSearchEngineAttachment {

  public function getAttachmentName() {
    return pht('Development Tests');
  }

  public function getAttachmentDescription() {
    return pht('Show devtests associated with this object.');
  }

  public function loadAttachmentData(array $objects, $spec) {
    // Issue one edge query for the whole page of objects, then hand the
    // executed query to getAttachmentForObject() for per-object lookups.
    $phids = mpull($objects, 'getPHID');
    $edge_query = id(new PhabricatorEdgeQuery())
      ->withSourcePHIDs($phids)
      ->withEdgeTypes(
        array(
          CIRevisionHasDevelopmentTestEdgeType::EDGECONST,
          CICommitHasDevelopmentTestEdgeType::EDGECONST,
        ));
    $edge_query->execute();
    return array(
      'edge_query' => $edge_query,
    );
  }

  public function getAttachmentForObject($object, $data, $spec) {
    return $data['edge_query']->getDestinationPHIDs(
      array($object->getPHID()));
  }

}

Underlying field:

<?php

final class CIDevelopmentTestsDifferentialCustomField
  extends DifferentialCoreCustomField {

  public function getFieldKey() {
    return 'integrator:dev-tests';
  }

  public function getFieldKeyForConduit() {
    return 'devTestPHIDs';
  }

  public function canDisableField() {
    return false;
  }

  public function shouldAppearInEditView() {
    return false;
  }

  public function getFieldName() {
    return pht('Dev Tests');
  }

  public function getFieldDescription() {
    return pht(
      'Tasks corresponding to extended testing performed on this revision, '.
      'such as track testing on a vehicle, or staging area deployments.');
  }

  public function shouldAppearInPropertyView() {
    return true;
  }

  public function renderPropertyViewLabel() {
    return $this->getFieldName();
  }

  protected function readValueFromRevision(DifferentialRevision $revision) {
    if (!$revision->getPHID()) {
      return array();
    }

    return PhabricatorEdgeQuery::loadDestinationPHIDs(
      $revision->getPHID(),
      CIRevisionHasDevelopmentTestEdgeType::EDGECONST);
  }

  public function getApplicationTransactionType() {
    return PhabricatorTransactions::TYPE_EDGE;
  }

  public function getApplicationTransactionMetadata() {
    return array(
      'edge:type' => CIRevisionHasDevelopmentTestEdgeType::EDGECONST,
    );
  }

  public function getNewValueForApplicationTransactions() {
    $edges = array();
    foreach ($this->getValue() as $phid) {
      $edges[$phid] = $phid;
    }

    return array('=' => $edges);
  }

  public function getRequiredHandlePHIDsForPropertyView() {
    return $this->getValue();
  }

  public function renderPropertyViewValue(array $handles) {
    return $this->renderHandleList($handles);
  }

  public function shouldAppearInCommitMessage() {
    return true;
  }

  public function shouldAllowEditInCommitMessage() {
    return true;
  }

  public function getCommitMessageLabels() {
    return array(
      'Dev Test',
      'Dev Tests',
    );
  }

  public function parseValueFromCommitMessage($value) {
    return $this->parseObjectList(
      $value,
      array(
        ManiphestTaskPHIDType::TYPECONST,
      ));
  }

  public function getRequiredHandlePHIDsForCommitMessage() {
    return $this->getRequiredHandlePHIDsForPropertyView();
  }

  public function renderCommitMessageValue(array $handles) {
    return $this->renderObjectList($handles);
  }

}

That looks broadly correct to me. In fixing this upstream, I just want to try to remove the assumption that the first object's first field's storage is universally the right storage. That's always true today and for the foreseeable future, it's just obviously not a Fundamental Axiom of the System. Let me review D16346, then I'll spend 10 minutes on this; I'm pretty sure it's straightforward rather than a weird snowbally mess, and that the upstreamable patch is only cosmetically different from that one.

Ah, there are really a couple of issues here.

Current cost looks like it's O(2N + FN) queries, where F is the number of "normal" custom fields you have (using regular database storage). I have a patch for that.

The 2N is coming from loading edges for "depends on" (type 41) and "projects" (type 5).
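
To put rough numbers on that (illustrative assumptions, not measurements): at N = 100 revisions per page and F = 6 storage-backed custom fields, that's about 200 + 600 = 800 queries, the same ballpark as the figure in the description.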

In differential.revision.search, projects can be loaded efficiently via attachment. There's also no use case for "Depends On" yet, exactly. I'm inclined to do this:

  • Stop these fields from loading information for Conduit.
  • Hard-code them back into differential.query to maintain compatibility, using efficient queries.

That should make both API methods issue about O(1) queries.
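
The "efficient queries" part might look roughly like this sketch (the edge type class names here are assumptions; the point is one edge query per page of results instead of one per revision per field):

$edge_query = id(new PhabricatorEdgeQuery())
  ->withSourcePHIDs(mpull($revisions, 'getPHID'))
  ->withEdgeTypes(
    array(
      DifferentialRevisionDependsOnRevisionEdgeType::EDGECONST,
      PhabricatorProjectObjectHasProjectEdgeType::EDGECONST,
    ));
$edge_query->execute();

foreach ($revisions as $revision) {
  $depends_on = $edge_query->getDestinationPHIDs(
    array($revision->getPHID()),
    array(DifferentialRevisionDependsOnRevisionEdgeType::EDGECONST));
  $projects = $edge_query->getDestinationPHIDs(
    array($revision->getPHID()),
    array(PhabricatorProjectObjectHasProjectEdgeType::EDGECONST));
  // ... attach to the per-revision result dictionary ...
}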

I'm not entirely sure I can do that cleanly, but I think I can remove both fields from the Conduit role.

That sort of leaves you in trouble with custom edge-based fields like the one above, but a reasonable short term approach is the one you've already taken:

  • Don't expose the field via Conduit directly.
  • Expose it as an Attachment instead.
  • (Or you can auto-include it with an Extension shortly, using similar code, if it's 100% core.)

I suspect that the long-term will not look hugely different from this, but probably more edge stuff will happen automatically via some sort of generality layer on top of edges (some discussion in T5873, maybe).

I think we can't provide an edge.write or edge.read sort of API since there are security problems with doing that, but we can have objects and extensions expose some sort of lightweight EdgeWhichTheAPICanInteractWith API that configures read/write and generates strings and documentation, and then that can automatically generate appropriate attachments or whatever. Maybe. I don't think this is too important for now, and the existing "projects" and "subscribers" attachments are better than any generic attachment is likely to be.

There's still quite a bit of room here. Locally, I see a 220ms call for differential.revision.search. Here are savings which look fairly easy to capture:

Pointless Typechecking: 30ms is spent on this:

PhutilTypeSpec::checkMap(
  $options,
  array(
    'withDisabled' => 'optional bool',
  ));

I'll just remove that; it's a tiny guard rail which in no way justifies 15% of the page cost.

JIRA Provider: 20ms is spent checking if the JIRA field is enabled. This isn't completely trivial to fix.

Auth Stuff: 25ms is spent loading the viewer token, object, and settings. I believe this could be significantly more efficient (we do similar logic faster on web requests).

Also:

  • 20ms of exclusive time in buildFieldList() seems high.
  • 11ms doing PhutilOpaqueEnvelope on some JIRA stuff? This might be particular to my install since I have weird JIRA things going on. This cost is unnecessary.
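
(Adding those up: 30 + 20 + 25 + 20 + 11 = 106ms of the 220ms call.)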

So theoretically that's ~50% of the call weight if all of that vanished.

I don't immediately have an attack on the JIRA thing (it should be less expensive in normal cases), and don't want to do the Auth stuff without more planning, and nothing else leaps out at me as having a similar sort of value for the cost.

One other issue is that we're loading flags and drafts because the web UI needs them, but I'd like to split that apart in a more general way with more planning.

I think this can maybe get ~20-40% faster with careful application optimizations; after that, interpreter startup cost becomes the mountain we have to move. The easiest way to move it is likely by modifying PHP (some out-of-date discussion in T2312 -- we no longer run the preload stuff since Phabricator is fast enough without it), although there are application-level ways to do it too (basically, run an "API daemon" that listens on a unix domain socket, holds a warm PHP process, and serves requests from it). We actually have most of the infrastructure we'd need for that, but it's just real dumb, and I'd rather try to make PHP interpreters reusable (moving startup costs to before requests run was pretty easy in the past, and that's a big chunk of the benefit).
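
For concreteness, that daemon shape is roughly the sketch below; none of this exists today, and the socket path, names, and wire format are all assumptions:

<?php

// Hypothetical "API daemon" sketch: bind a unix domain socket, pay
// interpreter and library startup costs once, then serve one request per
// connection from the already-warm process.

$socket_path = '/var/tmp/phabricator-api.sock';
@unlink($socket_path);

$server = stream_socket_server('unix://'.$socket_path, $errno, $errstr);
if (!$server) {
  fwrite(STDERR, "Unable to bind {$socket_path}: {$errstr}\n");
  exit(1);
}

// Expensive startup work (loading libraries, reading configuration) would
// happen here, once, before any request arrives.

while (true) {
  $client = @stream_socket_accept($server, 60);
  if (!$client) {
    // The accept timed out with no pending client; keep waiting.
    continue;
  }

  $request = stream_get_contents($client);

  // A real daemon would decode $request, dispatch to the named Conduit
  // method, and encode its result; this placeholder just returns an empty
  // Conduit-shaped response.
  $response = json_encode(array('result' => null, 'error_code' => null));

  fwrite($client, $response);
  fclose($client);
}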

epriestley claimed this task.

I deployed this stuff here. I neglected to get a production timing beforehand so I can't really claim it's actually faster, but it seems fast-ish now. Let me know if things look better and/or I ruined everything when you get around to deploying it.

There's room to improve this (and pretty much everything else) further, but none of it looks like such an efficient use of time that it needs to happen right away.