Page MenuHomePhabricator

Maniphest Task Edit Edge Transactions should create metadata based on edit type (add or remove)
Closed, WontfixPublic

Description

A lack of descriptive metadata currently exists in the Maniphest Transaction data for core:edge transactions. Whether a task edit transaction adds or removes a project or both, for example, is undefined. The only metadata that is currently provided is EDGECONST written as type:"41". If a single task edit includes both an add and a remove, perhaps these should be written as two separate transactions assigned with distinct metadata types. Providing the logic that compares the old value with the new and writes metadata with a unique type depending on the result should be possible.

The use case for this is to provide projects with a history of scope changes, including task adds and removes. Specifically, because a remove type core:edge transaction is very relevant to the edge (project), it may be useful to also write this event to the project transaction table. Otherwise, the project has no way to track changes of this type since the task is no longer in its scope. This would provide a limited subset of maniphest transaction data to the project.

This is the questionable code from ManiphestTaskEditController lines 284-303.

foreach ($changes as $type => $value) {
          $transaction = clone $template;
          $transaction->setTransactionType($type);
          if ($type == ManiphestTransaction::TYPE_PROJECT_COLUMN) {
            $transaction->setNewValue($value['new']);
            $transaction->setOldValue($value['old']);
          }  else if ($type == PhabricatorTransactions::TYPE_EDGE) {
            $project_type =
              PhabricatorProjectObjectHasProjectEdgeType::EDGECONST;
            $transaction
              ->setMetadataValue('edge:type', $project_type)
              ->setNewValue(
                array(
                  '=' => array_fuse($value),
                ));
          } else {
            $transaction->setNewValue($value);
          }
          $transactions[] = $transaction;
        }

Event Timeline

ChristopherHJohnson raised the priority of this task from to Needs Triage.
ChristopherHJohnson updated the task description. (Show Details)
ChristopherHJohnson moved this task to Backlog on the Transactions board.
epriestley claimed this task.
epriestley added a subscriber: epriestley.

This change isn't useful for the upstream, and isn't the direction we intend to pursue. See discussion here:

Instead, our plan is to build a new "Facts" application which can do "fact extraction" in an offline process. It can pull pertinent details out of transactions, including making adds and removes distinct.

We don't anticipate ever using the transaction record for any charting/analytical purposes. It encodes most of the interesting data in JSON. This is good for flexibility, but means it can't ever be queried quickly or scale very well. Extracting this data into a denormalized, query-able form should give us far greater scalability and flexibility.

I think that promoting Facts to address this problem is a red herring. I agree that Facts offers many advantages for offline processing of transactions. But I do not understand the reasoning for having such arcane and basically useless metadata for maniphest transactions. If you are trying to discourage the direct usage of transaction queries for performance reasons, this is also sensible. However, we need more descriptive data about task history that can only be provided with transaction metadata. Why not fix the problem where it exists? If you are reluctant to add more logic into TaskEditController, I understand that as well. This is a class that is desperate for refactoring.

But I do not understand the reasoning for having such arcane and basically useless metadata for maniphest transactions.

Can you walk me through how the data is useless? The data is not in a queryable form, but it is complete: it fully captures the state change -- and we use that fact to render strings like "added project x" and "removed project y". To find adds and removes, you perform set differences on the "old value" and "new value". PHIDs only present in one of the two values were added or removed.

However, we need more descriptive data about task history that can only be provided with transaction metadata.

Can you walk me through why you need this, and why you couldn't use Facts for it if Facts existed?

Specifically, Facts will support a query like "find all points in time where a task was added to or removed from project X". It will be able to satisfy this query quickly, using a denormalized schema without requiring decoding JSON. Critically, it will be able to find tasks which previously were in the project, which is impossible when using the transaction record unless you query every record on every task.

Why not fix the problem where it exists?

I don't understand what problems are not better solved by the upstream plan.

Can you walk me through how the data is useless? The data is not in a queryable form, but it is complete: it fully captures the state change -- and we use that fact to render strings like "added project x" and "removed project y". To find adds and removes, you perform set differences on the "old value" and "new value". PHIDs only present in one of the two values were added or removed.

The new and old value data is great, but the metadata is lacking. Describing a transaction with an arbitrary numerical type assignment is not useful to anyone unless you know what the number refers to. Since the metadata field is JSON, it could easily include the supplementary information that has been requested in this task.

Can you walk me through why you need this, and why you couldn't use Facts for it if Facts existed?

I have used Facts and understand its capabilities. However, unless Facts provides can provide data views that are "project task centric" it will remain a generic aggregator. And, Facts cannot just add metadata that does not exist to begin with. The factType inference of "core:edge" ["project removed"] still has to be created with the offline processing of the transaction table and this still remains an expensive and rather cumbersome operation.

What is needed is a task tracking mechanism from the project point of reference. Wikimedia is project oriented, so most questions to the data start with "For my project, I need to know this ..." If there were a reflexive write of task transaction metadata to some project specific storage, (even some form of non-db log), this would be very useful, and much more elegant and smaller than having to query millions of Facts for the same information.

I don't understand what problems are not better solved by the upstream plan.

From the application point of view, the upstream plan is solid, but what may not be obvious is that the uses for this transaction data can be very specific and narrow. Maybe try looking at it from the bottom up instead of the top down.

Maybe I'm still just not understanding what you need. You can get all the data you want today like this, right?

  1. Load all transactions for the task with type "core:edge".
  2. Discard any with metadata edge types that don't use the PhabricatorProjectObjectHasProjectEdgeType::EDGECONST value (41), since these are some other kind of edge edit.
  3. Use array_diff() on the old and new values to figure out if the transaction added or removed projects.

If we wrote two different types of transactions, you'd still need to do steps (1) and (2). You'd only have a slightly easier time with step (3). I don't understand how needing to do array_diff() makes this data any less useful. It's complete, and it's really slow to query. If you didn't have to array_diff(), it would still be really slow to query. How would separating these transaction types help?

If there were a reflexive write of task transaction metadata to some project specific storage, (even some form of non-db log), this would be very useful, and much more elegant and smaller than having to query millions of Facts for the same information.

This, particularly, does not make sense to me. I don't see how writing a flat log to disk can possibly be more useful or elegant than issuing one query against Facts to get exactly the data you want.

Facts will perform a reflexive write, cheaply and easily, in the background. It will write in a highly-queryable form, where you can phrase your query as "For project X, ...".

Can you give me a specific example of a query that you believe Facts will be unable to satisfy, or will satisfy less elegantly than parsing a flat logfile on disk?

Can you give me a specific example of a query that you believe Facts will be unable to satisfy, or will satisfy less elegantly than parsing a flat logfile on disk?

The Facts schema has these fields (factType, objectPHID, objectA, valueX, valueY, epoch). We can assume that these fields are populated with history data where objectPHID represents a taskPHID and objectA represents a projectPHID. factType will be a type assignment (I assume) of "project removed" or "project added". The query starts with for objectA return * where factType= 'project removed' and epoch is between epoch 1 and epoch 2.

This will return a list of tasks removed for a project within a certain time period, correct? If this can be done like this, this is exactly what is needed. The only information that is missing is the comprehensive association with the task data at that point in time (including custom field data), and tracking metadata like what user performed the transaction. How can a comprehensive project task history be recreated from individual Facts?

How would separating these transaction types help?

I think that if the transactions were logged with descriptive metadata about the transaction type and written reflexively to project specific storage, then this could be used as an sequential event log (like Feed), but with values that could be queried. I am not suggesting that a flat log is the best solution, but the relational db option just seems too cumbersome for the use case which always starts with a single project.

This will return a list of tasks removed for a project within a certain time period, correct? If this can be done like this, this is exactly what is needed.

Yes. The schema is nonfinal, but that's roughly the expectation.

Another way this query might be represented is by having the fact extractor create rows with objectPHID = <the project>. Then you'd query for objectPHID = <project> AND factType IN (task-added, task-removed) and use objectA to figure out which tasks were affected. Basically the same thing, just with the objects flipped. These could be created either instead of or in addition to the rows you describe, depending on what use cases we actually have. It's OK for one transaction record to emit a lot of different facts (e.g., both task-centric and project-centric rows, and rows describing multiple different types of facts), since it runs asynchronously in the background, fact rows are small, and the infrastructure anticipates that events may produce many different facts.

The only information that is missing is the comprehensive association with the task data at that point in time (including custom field data), and tracking metadata like what user performed the transaction. How can a comprehensive project task history be recreated from individual Facts?

If you need a small number of specific values (like the value of a "points" field), we might reasonably write that to valueX or valueY. If the upstream extractor can't write the correct value (for instance, because it needs to be derived from several custom fields in a complex way), you can create a similar, custom extractor which does write the correct value. You could use several similar facts to extract several values. This can't do everything, but I believe it is flexible enough to satisfy almost all reasonable use cases.

The originating transaction ID would probably be stored in valueX or valueY, or we might add a new field to store sourceID or similar. The current schema just represents best guesses about how the fact table should actually look, and I think I created it prior to the universal adoption of transactions. You could use this to query the original transactions and get authorship, content source, etc., information.

There is no way to efficiently query the entire state of an object at an arbitrary point in time, and it's vanishingly unlikely that we'll ever build this. If you need this, Phabricator probably does not have the infrastructure fundamentals required to provide it: transactions are not rewindable, and there's no formal versioning of objects or caching infrastructure for old states. I'm not confident I could design a system which would give you scalable access to task state as it existed in the past fast enough to do anything in a web UI with it. Certainly, if this were a design goal, I would build transactions very differently than I have.