
Separate "accumulate(...)" from Fact functions
AbandonedPublic

Authored by epriestley on Apr 18 2019, 9:07 PM.

Details

Reviewers
amckinley
Maniphest Tasks
T13279: Build Charting for Facts
Summary

Depends on D20446. Ref T13279. Currently, the raw ETL fact data is just changes to counts, e.g. a "+1" when a task is created or a "-1" when a task is closed.

We accumulate these changes into a line as part of the "fact()" function, but we can do this more cleanly by making accumulation a separate function.

The raw, unaccumulated functions become "impulse" functions, i.e. each point is like an acceleration "impulse" which we can accumulate to plot speed, since "accumulate" is really "crappy, low-budget integrate() that only works in super easy cases".
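To make the impulse idea concrete, here is a minimal Python sketch (not the PHP in this revision) of accumulating impulses into a line. The name `accumulate_impulses` and the sample data are made up for illustration; the only assumption taken from the summary is that each raw point is a signed change to a count.

```python
from itertools import accumulate as running_sum


def accumulate_impulses(impulses):
    """Turn (epoch, delta) impulses into (epoch, running total) points."""
    # Impulses must be applied in time order for the totals to make sense.
    ordered = sorted(impulses, key=lambda point: point[0])
    totals = running_sum(delta for _, delta in ordered)
    return [(epoch, total) for (epoch, _), total in zip(ordered, totals)]


# Hypothetical data: "+1" when a task is created, "-1" when one is closed.
impulses = [(1, +1), (2, +1), (4, -1), (5, +1)]
open_tasks = accumulate_impulses(impulses)
```

Each output point depends on every earlier impulse, not just on the impulse at the same epoch.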

The "accumulate()" function can only operate on discrete "impulse" functions because I'm not expecting us to be able to chart "accumulate(mul(x, 2))" and have it figure out that ∫(2x)dx = x^2 and chart that. We can actually run "accumulate()" on sampled real functions and get a numerical approximation, but this is silly and far afield from the useful set of problems we're trying to solve, so just prevent it.
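For reference, the "numerical approximation" case looks roughly like the following Python sketch. It is a plain left Riemann sum over evenly spaced samples, which is an assumption about how the sampling would work and not a description of the actual chart code; `accumulate_sampled` is a hypothetical name.

```python
def accumulate_sampled(f, start, stop, steps):
    """Left Riemann sum of f over [start, stop] using evenly spaced samples."""
    width = (stop - start) / steps
    total = 0.0
    points = []
    for i in range(steps):
        x = start + i * width
        # Treat each sample as an impulse of size f(x) * width.
        total += f(x) * width
        points.append((x + width, total))
    return points


# Accumulating 2x over [0, 10] should land close to x^2 = 100 at x = 10.
points = accumulate_sampled(lambda x: 2 * x, 0.0, 10.0, 1000)
end_x, end_y = points[-1]
```

The result is only as good as the sample density, which is part of why this isn't worth supporting.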

The name "impulse" may change, since I'm still not totally sure how functions will end up organized. I'm just trying to move toward a reasonable definition of "add(x, y)" that works when X and Y are functions like "open tasks in project X" and "open tasks in project Y", and that produces a sensible line.

Test Plan

Here's accumulate(scale(x(), 2)) for kicks. This is not allowed, but does draw a fairly accurate chart numerically approximating x^2:

Here's accumulate(fact(open-tasks)), which is just the same thing that fact(open-tasks) used to be (the spike is when I used bin/lipsum to create a lot of tasks):

Diff Detail

Repository
rP Phabricator
Branch
chart11
Lint
Lint OK
Severity  Location                                                      Code   Message
Advice    src/applications/fact/chart/PhabricatorChartFunction.php:158  XHP16  TODO Comment
Unit
Unit Tests OK
Build Status
Buildable 22674
Build 31072: Run Core Tests
Build 31071: arc lint + arc unit

Event Timeline

epriestley created this revision. Apr 18 2019, 9:07 PM
epriestley requested review of this revision. Apr 18 2019, 9:09 PM

Also featured here is "load the data for all functions in the call tree, not just top-level functions".

epriestley abandoned this revision. Apr 18 2019, 11:15 PM

I think I'm going to tackle this a little differently. Mostly more rambling:

accumulate(...) is not really a function of the value of fact(open-tasks) at a point. That is, accumulate() cannot produce a y-value at point X given only fact(open-tasks) evaluated at X, because it also needs every earlier impulse. accumulate() is a sort of functor on the "open tasks" dataset.

Perhaps we're better off looking at this on two dimensions: each function is really a "functor" with configuration arguments, and these functors are chained together.

So sin(x()) is x | sin. Easy enough.

scale(x(), 2) is x | scale(2).

scale(shift(scale(cos(x()), 1), 2), 3) is x | cos | scale(1) | shift(2) | scale(3).

accumulate(fact(open-tasks)) is really x | accumulate(open-tasks), not x | fact(open-tasks) | accumulate.

The start of a pipe is always x (or constant(...), but we can pipe x to constant without any issues).
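A rough Python sketch of what evaluating a chain could look like, compared against the nested form. This assumes scale(n) multiplies by n and shift(n) adds n, which the comment above doesn't actually define, and `run_chain` is a hypothetical helper.

```python
import math
from functools import reduce


def scale(factor):
    # Assumption: scale(n) multiplies its input by n.
    return lambda y: y * factor


def shift(offset):
    # Assumption: shift(n) adds n to its input.
    return lambda y: y + offset


def run_chain(chain, x):
    """Feed x through each stage in order, left to right."""
    return reduce(lambda value, stage: stage(value), chain, x)


# x | cos | scale(1) | shift(2) | scale(3)
chain = [math.cos, scale(1), shift(2), scale(3)]


def nested(x):
    # The same expression written as composition.
    return scale(3)(shift(2)(scale(1)(math.cos(x))))
```

Here `run_chain(chain, x)` and `nested(x)` compute the same thing; the chain form just makes the order of stages explicit and easy to walk.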

So far, so good. This gets tricky with sum(), though.

We'd like to actually pass functions to sum() as real arguments, so sum(cos(x), sin(x)) is truly x | sum(cos, sin). sum evaluates the pipe as (x | cos) + (x | sin). That's fine.
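In Python terms, a sum stage that takes functions as real arguments might look like this sketch. `chain_sum` is a made-up name, and the branches here are bare functions, not full chains, to keep it short.

```python
import math


def chain_sum(*branches):
    """Return a stage that evaluates every branch at x and adds the results."""
    return lambda x: sum(branch(x) for branch in branches)


# x | sum(cos, sin) evaluates as (x | cos) + (x | sin).
stage = chain_sum(math.cos, math.sin)
value = stage(0.0)
```

The point is that sum receives the same x for every branch, so the branches have to agree on which x-values get evaluated.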

But x is also a list of distinct samples, not some abstract mathematical ideal.

So I think the real approach here is:

  • Get rid of "source functions" as arguments. Functions are sources (like x) or not (like cos). (Actually, I'm unsure if we even care about this distinction anymore.)
  • Restructure functions to use chains instead of composition.
  • To find the list of points we're going to evaluate, we walk the chain until we hit a function with a domain, and fall back to sampling if we don't find one? I think this works. But what we actually care about is whether the function wants to suggest samples or not. We can figure the domain out from the samples.

So we actually figure out the domain like this:

  • Walk the chain until we find a function which has samples.
  • If we find one, those are our input samples. Their extent is our domain.
  • If we don't find one, pick a default domain (or we can do another pass and have functions guess a domain, some day).
  • If we have a domain but no samples, use linear samples (or we can walk the chain and ask functions to guess some reasonable samples, some day).
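The steps above can be sketched in Python as follows. Stages are plain dicts with an optional "samples" key, which is purely a stand-in for whatever the real function objects would expose; the default domain and sample count are also arbitrary assumptions. The sketch ignores any transformation that earlier stages apply to x before it reaches the stage suggesting samples.

```python
def pick_samples(chain, default_domain=(0.0, 100.0), count=5):
    """Return (domain, samples) for a chain of stage dicts."""
    # Walk the chain until we find a stage which suggests samples.
    for stage in chain:
        samples = stage.get("samples")
        if samples:
            ordered = sorted(samples)
            # The extent of the suggested samples is the domain.
            return (ordered[0], ordered[-1]), ordered
    # No stage suggested samples: use a default domain and linear samples.
    low, high = default_domain
    step = (high - low) / (count - 1)
    return (low, high), [low + i * step for i in range(count)]


# Hypothetical chains: one with a data-backed stage, one purely mathematical.
fact_chain = [{"name": "x"}, {"name": "fact", "samples": [30, 10, 20]}]
plain_chain = [{"name": "x"}, {"name": "cos"}]

fact_result = pick_samples(fact_chain)
plain_result = pick_samples(plain_chain)
```

A later refinement could let stages return a guessed domain or guessed sample positions from the fallback branch.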

The last two steps are refinements, since fact(...) can guess that "last 90 days" is a good domain, and weird functions like atan() could guess that better sample density near the part of the graph where it goes "woosh" gives us a better graph shape.

Now we have samples, and we shove them through the chain and make each function image them. Whatever we get out of the other end is our actual data.

I've sort of headed in this direction by exploring things anyway, and I think this is going to look significantly more robust as an approach.

On selecting a domain, you can build a chain like this:

x | shift(1000) | fact(Q)

Then, if fact(Q) supplies samples or a domain, we're out of luck.

We can only "fix" this if shift(1000) can be inverted. I'm not going to worry about this for now since I don't think this chain makes a ton of sense. I think it's only useful for "show today's activity against yesterday's activity", and we should accomplish that by having two different X axes overlaid, not by manually shifting the points around so that "evaluate(April 19)" pops out the datapoint for April 18.
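Purely to illustrate what "can be inverted" would mean here, a Python sketch: if every stage ahead of fact(Q) carries an inverse, the samples fact(Q) wants can be mapped back to x-values. The dict shape, the `pull_back` helper and the sample values are all hypothetical, and this is not something the comment above proposes to build.

```python
def shift(offset):
    """A stage plus its inverse, so samples can be mapped back toward x."""
    return {
        "forward": lambda v: v + offset,
        "inverse": lambda v: v - offset,
    }


def pull_back(samples, earlier_stages):
    """Map samples wanted by a later stage back through the earlier stages."""
    for stage in reversed(earlier_stages):
        inverse = stage.get("inverse")
        if inverse is None:
            # Not invertible: we cannot recover which x-values to evaluate.
            return None
        samples = [inverse(s) for s in samples]
    return samples


# x | shift(1000) | fact(Q), where fact(Q) wants these input values.
fact_samples = [5000, 6000, 7000]
x_samples = pull_back(fact_samples, [shift(1000)])
stuck = pull_back(fact_samples, [{"forward": abs}])
```

With a non-invertible stage in front of fact(Q), `pull_back` returns None, which is the "out of luck" case.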