
Separate chart functions into a class tree

Authored by epriestley on Apr 17 2019, 12:23 PM.



Depends on D20440. Ref T13279. Create a class to represent a chartable function: something we can get some data points out of.

Then, make the chart chart two functions.

For now, the only supported function is "fact(key)", which pulls data from the Facts ETL pipeline, identified by "key", and takes no other arguments.

In future changes, I plan to support things like "fact(key, PHID-PROJ-xyz)", "constant(1000)" (e.g. to draw a goal line), "sum(fact(...), fact(...))" (to combine data from several projects), and so on.

The UI may not expose this level of power for a while (or maybe ever), but until we get close enough to the UI that these features become a ton of extra work, I'm going to try to keep things fairly flexible and modular.
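To make the shape of that class tree concrete, here is a minimal sketch of what "a class to represent a chartable function" could look like. This is in Python rather than Phabricator's PHP, and the names (ChartFunction, FactFunction, and so on) are hypothetical illustrations, not the classes actually introduced in this diff:

```python
# Hypothetical sketch: each chart function can produce data points,
# and composite functions wrap other functions.

class ChartFunction:
    def get_datapoints(self):
        raise NotImplementedError

class FactFunction(ChartFunction):
    """fact(key): pulls (time, value) points from a facts store by key."""
    def __init__(self, key, store):
        self.key = key
        self.store = store  # e.g. {'tasks.open': [(t0, y0), ...]}

    def get_datapoints(self):
        return self.store.get(self.key, [])

class ConstantFunction(ChartFunction):
    """constant(1000): a flat goal line, evaluated at given timestamps."""
    def __init__(self, value, timestamps):
        self.value = value
        self.timestamps = timestamps

    def get_datapoints(self):
        return [(t, self.value) for t in self.timestamps]

class SumFunction(ChartFunction):
    """sum(f, g, ...): pointwise sum of several functions."""
    def __init__(self, *functions):
        self.functions = functions

    def get_datapoints(self):
        totals = {}
        for f in self.functions:
            for t, y in f.get_datapoints():
                totals[t] = totals.get(t, 0) + y
        return sorted(totals.items())
```

The point of the tree is that the chart renderer only ever talks to the ChartFunction interface, so "chart two functions" and "sum two facts" fall out of composition rather than special cases.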

Test Plan

Diff Detail

rP Phabricator
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

epriestley created this revision.Apr 17 2019, 12:23 PM
epriestley requested review of this revision.Apr 17 2019, 12:25 PM
  • Slight sequence adjustment.
  • Remove debugging code.
Harbormaster completed remote builds in B22618: Diff 48748.
amckinley accepted this revision.Apr 17 2019, 10:50 PM
amckinley added inline comments.

I'm pretty much just not going to comment about hard-coded stuff like this because everything is so much in flux. I'm assuming you're on top of vaguely-defined "scalability" stuff like how to downsample extremely frequent datasets, caching chart results (or CDN'ing them), etc etc.




You're trying to remove $every-1 / $every of the datapoints, right? So if $count = 100 and $limit = 20, you'd remove datapoints 1-4, 6-9, 11-14, etc?

So shouldn't this be if (!($ii % $every) && ($ii != $count))?

Also, I'm not sure if it's faster to iterate through all the data and call unset() or to build a new array by stepping $ii += $every. My interview-candidate self tells me that both run in linear space and time, so that's super-helpful.

This revision is now accepted and ready to land.Apr 17 2019, 10:50 PM

Yeah -- for now, I am mostly just cheating, because even our most frequent datasets generally only have a couple million total points. If you want to use our AMAZING charting tools for "total clock cycles executed on all 100K machines" or something, you can easily sample the data yourself before pushing it in (and there's no push-it-in API yet anyway). This isn't a major use case anyway, since there are already 200 other tools that count clock cycles.

There's a lot of performance layers we can add later as necessary, notably:

  • sampling or aggregating in the ETL layer, to produce sampled or aggregated variations of facts. Most of the datasets are so small that I think it'll be a while before we need this.
  • since charts are immutable and none of our data changes quickly, we can cache datasets by chart key very near the display layer and pretty much just dump JSON out of the cache for like 10 minutes after we build it as a fairly safe default.

But the biggest defense here is only having teeny tiny little small data.

we can cache datasets by chart key very near the display layer

You could probably cache the actual SVG element itself, right? I bet in most cases that would be smaller than the dataset itself.

just dump JSON out of the cache for like 10 minutes

As an ops person, that all sounds good as long as you provide some way to force-fetch the most recent data point(s). If the fact is something like "attempted logins per second" and I'm trying to blacklist the right stuff that's slamming us, it is miserable to sit there hitting refresh, waiting for what you know is a cache timer to expire.

epriestley added inline comments.Apr 18 2019, 2:20 PM

That's the intent. I think it's sort of working, but let me see:

If count = 100, limit = 20, we'll get every = ceil(100 / 20) = 5.

$ii % $every will then be nonzero (truthy) for points 1-4 inclusive, 6-9 inclusive, etc. So !($ii % $every) will be true for points 0, 5, 10, and so on.

Since the body discards the point rather than retaining it, I think the code as written is roughly doing the right thing?

If I add a phlog(), I get this:

Started with 528 points and limit 100, 88 remain.

So this code is definitely garbage, but seems to be keeping and discarding roughly the right number of points.
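As a sanity check, the keep/discard logic being described can be sketched like this (a Python rendition for illustration; the actual code is PHP and discards via unset() rather than building a new array):

```python
import math

def downsample_modulo(points, limit):
    """Keep points where (index % every) == 0, mirroring the
    !($ii % $every) condition discussed above."""
    count = len(points)
    if count <= limit:
        return points
    every = math.ceil(count / limit)
    return [p for ii, p in enumerate(points) if ii % every == 0]
```

With 528 input points and a limit of 100, every = ceil(528 / 100) = 6, and indices 0, 6, ..., 522 survive: exactly 88 points, matching the phlog() output above. The undershoot comes from the ceil(): rounding every up means we discard slightly more than necessary.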

I expect to fix these issues later in the series:

  • We should always keep the first and last points (if the limit is at least 2).
  • We shouldn't discard too many points due to rounding errors -- if the limit is 100 and we start with more than 100 points, we should end with 100 points, not 88.
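One way to address both of those bullets is to pick evenly spaced indices directly instead of using a modulo stride, which pins the first and last points and lands on exactly the limit. This is a sketch of that idea, not the fix the series eventually lands:

```python
def downsample_even(points, limit):
    """Keep exactly `limit` points (when limit >= 2), always including
    the first and last, by choosing evenly spaced indices."""
    count = len(points)
    if count <= limit or limit < 2:
        return points
    step = (count - 1) / (limit - 1)
    indices = {round(ii * step) for ii in range(limit)}
    return [points[ii] for ii in sorted(indices)]
```

Because step > 1 whenever we downsample at all, consecutive rounded indices never collide, so the output size is exactly limit: 528 points with limit 100 yields 100 points, not 88.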

It would also be nice to prefer to discard tightly-packed points over loosely-packed points (since we lose less resolution this way) but I'm not sure if there's an easy algorithm for that which performs reasonably well for large inputs. At first glance, I don't see an easy way to do it that doesn't require re-evaluating the remaining points after we remove one.

Maybe the algorithm is:

  • Iterate over the points and compute maximum distance to either neighbor.
  • Sort the list.
  • While we have too many points:
    • Remove the first point.
    • Remove the adjacent points from the list.
    • Recompute their maximum neighbor distances.
    • Insert them back into the correct position in the list.

I think the "meaty" part of that is O(R log N) where R is the number of points we remove and N is the total number of points, and is dominated by O(N log N) in step 2, where we sort the list, so we end up somewhere reasonable.
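The steps above can be sketched with a linked list plus a heap with lazy invalidation (rather than a sorted list with re-insertion, which is equivalent in cost). This is a Python illustration using only x-coordinates as the "packing" measure, with hypothetical names throughout:

```python
import heapq

def thin_tightly_packed(xs, limit):
    """Repeatedly remove the interior point whose maximum distance to
    either surviving neighbor is smallest. Stale heap entries are
    skipped via a per-point version counter, so each removal is
    O(log N) after the initial O(N log N) heapify."""
    n = len(xs)
    if n <= limit:
        return list(xs)
    prev = list(range(-1, n - 1))   # doubly linked list over indices
    nxt = list(range(1, n + 1))
    alive = [True] * n
    version = [0] * n

    def score(i):
        return max(xs[i] - xs[prev[i]], xs[nxt[i]] - xs[i])

    # Endpoints are never candidates, so they are always kept.
    heap = [(score(i), i, 0) for i in range(1, n - 1)]
    heapq.heapify(heap)
    remaining = n
    while remaining > limit and heap:
        s, i, v = heapq.heappop(heap)
        if not alive[i] or v != version[i]:
            continue  # stale entry from before a neighbor was removed
        alive[i] = False
        remaining -= 1
        p, q = prev[i], nxt[i]
        nxt[p], prev[q] = q, p
        # Re-score the two neighbors and push fresh heap entries.
        for j in (p, q):
            if 0 < j < n - 1 and alive[j]:
                version[j] += 1
                heapq.heappush(heap, (score(j), j, version[j]))
    return [xs[i] for i in range(n) if alive[i]]
```

On input [0, 1, 2, 3, 10, 20, 30] with limit 5, the two most tightly packed interior points (x = 1 and x = 2) are removed first, leaving the widely spaced points intact.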

Rather than removing the "most tightly packed point", we might actually want to remove the point which has the shortest distance to the line between the two adjacent points.

That is, if we have dataset <1, 100>, <2, 101>, <3, 102>, we can remove the middle point without changing the shape of the graph at all, because the line between <1, 100> and <3, 102> intersects <2, 101>.
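That "distance to the line between the two adjacent points" test is the standard collinearity measure used by line-simplification algorithms such as Douglas-Peucker. A minimal sketch:

```python
def distance_to_chord(p, a, b):
    """Perpendicular distance from point p to the line through a and b.
    Zero means p lies exactly on the chord and can be removed without
    changing the shape of the graph."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    # Cross product gives twice the signed triangle area; divide by
    # the chord length to get the height (the perpendicular distance).
    return abs(dx * (ay - py) - dy * (ax - px)) / (dx * dx + dy * dy) ** 0.5
```

For the dataset in the text, the middle point of <1, 100>, <2, 101>, <3, 102> has distance zero to the chord from <1, 100> to <3, 102>, so removing it is free.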

As an even more advanced technique, we may want to remove the point which causes the smallest change in the area under the line when it is removed. This considers both "packing" and "unusualness".

That is, if we have points like this:

(sketch omitted from this transcript; the original comment attached a drawing of a line chart with two labeled candidate points, "A" and "B")

...we might prefer to remove point "B" over point "A", even though "B" may have a longer distance to the line between the adjacent points, because removing "B" changes the visual shape of the graph by a smaller amount.

(The point between "A" and "B" might really be the best point to remove in this particular drawing.)

But removing the point which causes the smallest area change would require that I know how to compute the area inside a triangle given its vertices, which I don't, so maybe we can hold these features until the charts can serve at least one actual use case.
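For the record, that triangle area does have a simple closed form (the "shoelace" formula), and removing the point whose triangle with its neighbors has the smallest area is essentially the Visvalingam-Whyatt line-simplification algorithm. A sketch:

```python
def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c (shoelace formula).
    For a polyline monotone in x, this is exactly how much the area
    under the line changes if the middle vertex b is removed."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    return abs(ax * (by - cy) + bx * (cy - ay) + cx * (ay - by)) / 2.0
```

Collinear points give area zero, so this measure subsumes the "distance to the chord" idea: the <1, 100>, <2, 101>, <3, 102> example from above scores 0.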

This revision was automatically updated to reflect the committed changes.