Development Notes (2015 Week 52)

ApplicationEditor has been promoted to stable, search/indexing updates have landed, bin/lipsum got a little fancier, and subproject infrastructure is taking shape.

This is a companion to the 2015 Week 52 (Late December) Changelog.


ApplicationEditor

Earlier this month, we made large infrastructure changes to how object creation and editing works in some applications, including Maniphest. I've held promotion to stable for several weeks to let these changes ("ApplicationEditor") settle, but these changes now appear stable and I've promoted them.

See T9905 for high-level discussion. See prior notes (Week 50 and Week 51) for more context.

The new features are also discussed from a user perspective in the documentation.

The code itself didn't change much this week, but there were some minor fixups and polish. T10004 has a list of errata which have been identified and (mostly) resolved.

I expect the changes will be minor and mostly positive unless you rely on "Can Edit <Field>" policies (discussed in T10003) or EDITTASK events (discussed in T9860), but these more disruptive changes are important for the architectural quality of the project in the long term.

Engines: Indexing, Destruction and Hovercards

The indexing, destruction, and hovercard pipelines have been restructured to use the modern Engine + EngineExtension pattern. This is primarily a code quality change with little user-facing impact, although future third-party applications will now have parity with upstream applications in these workflows.

Previously, indexing and destruction both had blocks of code like this:

if ($object instanceof UpstreamInterfaceOne) {
  // Hard-coded behavior for one upstream application...
}

if ($object instanceof ProprietaryInterfaceTwo) {
  // Hard-coded behavior for another application...
}

This pattern is usually a strong sign of badness: it spreads application behavior across the codebase, prevents third-party code from interacting with the workflow, and usually means that responsibilities haven't been modeled and separated as cleanly as they could be.

The current desired approach in modern code is this:

// Load all extensions provided by applications and third-party code.
$extensions = PipelineExtensions::loadAll();

foreach ($extensions as $extension) {
  if ($extension->shouldDoStuffWithObject($object)) {
    $extension->doStuffWithObject($object);
  }
}
In this case, there were no real architectural issues with the destruction or hovercard pipelines, so extracting extensions from them was straightforward and just made the code more standard and extensible.
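To make the registry-and-extensions pattern above concrete, here is a minimal runnable sketch (in Python for illustration; all class and method names are invented for this example, not Phabricator's actual API). Each extension declares which objects it handles, and the pipeline just iterates over every registered extension:

```python
class Extension:
    """Base class: each application (or third-party package) provides one."""
    def should_handle(self, obj):
        raise NotImplementedError

    def handle(self, obj):
        raise NotImplementedError


class TaskExtension(Extension):
    def should_handle(self, obj):
        return obj.get("type") == "task"

    def handle(self, obj):
        return "indexed task: %s" % obj["name"]


class FileExtension(Extension):
    def should_handle(self, obj):
        return obj.get("type") == "file"

    def handle(self, obj):
        return "indexed file: %s" % obj["name"]


def load_all_extensions():
    # In a real system this would discover subclasses dynamically, so
    # third-party code can participate without modifying the pipeline.
    return [TaskExtension(), FileExtension()]


def run_pipeline(obj):
    results = []
    for extension in load_all_extensions():
        if extension.should_handle(obj):
            results.append(extension.handle(obj))
    return results
```

The pipeline itself never names any particular application, which is the property the `instanceof` blocks lacked.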

The indexing pipeline had minor architectural weirdness, mostly because it was trying to do too little. For example, Conpherence had an indexing step that only caused side effects, because the indexing pipeline did not anticipate the need to do general-purpose indexing and focused only on fulltext indexing.

I generalized indexing to cover a larger class of indexing behaviors, and made fulltext indexing a substep. I lifted the side-effect, non-fulltext indexes in Conpherence and Projects up above fulltext indexing, then the remaining fulltext stuff was straightforward to modernize.

Hovercards were in less-bad shape than the other pipelines and were technically modular through events, but they are now structured as the more standard EngineExtensions instead of event listeners.

Modernizing the indexing pipeline allowed us to make several additional improvements. Notably, we now lock/version during index rebuilds, and have preliminary support for ngram indexes to improve substring matching.

Indexing: Locking and Versioning

Previously, the indexing pipeline had poor behavior if you had a script post hundreds of comments to a task in a very short period of time: we would do too much work rebuilding indexes, and multiple processes could try to rebuild an object's index at once. Although we eventually converged to the right state, this took a long time and could be confusing. This issue is primarily discussed in T9890.

We now hold a lock while reindexing an object, and use a simple versioning mechanism to skip redoing work when it has already been done. This should improve behavior in pathological situations by letting most of the updates exit immediately without doing any real work.
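The lock-and-version mechanism can be sketched roughly like this (a minimal Python illustration under the assumptions above; the class and storage are invented stand-ins, not the actual implementation, which would track versions in a database table):

```python
import threading

class IndexVersionTracker:
    """Tracks the last indexed version of each object and skips
    rebuilds that another process has already performed."""
    def __init__(self):
        self._lock = threading.Lock()
        self._indexed = {}   # object id -> last version indexed
        self.rebuilds = 0    # count of real rebuilds performed

    def reindex(self, object_id, current_version):
        # Hold a lock while reindexing so concurrent workers can't
        # rebuild the same object's index at the same time.
        with self._lock:
            if self._indexed.get(object_id, -1) >= current_version:
                # Index is already up to date: exit immediately
                # without doing any real work.
                return False
            self._do_rebuild(object_id)
            self._indexed[object_id] = current_version
            return True

    def _do_rebuild(self, object_id):
        self.rebuilds += 1  # stand-in for the expensive rebuild
```

In the pathological hundreds-of-comments case, most queued updates find the version already current and return immediately, so only a few rebuilds actually run.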

Indexing: Ngram Indexes for Substring Search

Previously, we didn't have a way to run efficient substring queries on tables that might become too large to reasonably run LIKE "%substring%" against. This issue is primarily discussed in T9979, but see also D14411.

MySQL can use keys for LIKE "prefix%", but not for LIKE "%substring%", and these queries can become quite slow when a table has a large number of rows. A recent post on the GitHub engineering blog discusses problems they encountered when users were able to turn a prefix% query into a %substring% query in an unintended way.

Here, we want to issue a %substring% query intentionally, against tables (including the file table) which may be large enough to slow down LIKE substantially (many installs have millions of rows in this table).

I explored a few potential approaches; of them, an ngram index seems like a promising way forward (it appears to be roughly 100x faster than LIKE "%substring%" on the 200,000-row dataset on this install for reasonable queries, and should scale sub-linearly to larger datasets). T9979 has additional discussion and details. A great resource for understanding these indexes is Regular Expression Matching with a Trigram Index, or How Google Code Search Worked.
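The core idea of an ngram (here, trigram) index can be sketched in a few lines (a Python illustration of the general technique, not Phabricator's implementation): break each string into its 3-character substrings, map each trigram to the documents containing it, and answer a substring query by intersecting those posting lists before verifying candidates.

```python
def trigrams(text):
    """Break a string into its 3-character substrings (ngrams, n=3)."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Maps each trigram to the set of document ids containing it."""
    def __init__(self):
        self.index = {}
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.index.setdefault(gram, set()).add(doc_id)

    def search(self, substring):
        grams = trigrams(substring)
        if not grams:
            # Queries shorter than 3 characters need a full scan.
            return set(self.docs)
        # Intersect posting lists, then verify candidates: trigram
        # overlap alone can produce false positives.
        candidates = None
        for gram in grams:
            postings = self.index.get(gram, set())
            candidates = postings if candidates is None else candidates & postings
        return {d for d in candidates
                if substring.lower() in self.docs[d].lower()}
```

Each posting-list intersection touches only rows that share a trigram with the query, which is why this beats scanning every row with LIKE "%substring%" as tables grow.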

This will probably need some tweaks, but the first product implementation is now available in Owners ("Name Contains" when searching for packages). This will eventually be available more widely in other applications where substring search could be helpful (for example, "File Name Contains" in Files).

Aside: LIKE Injection

LIKE injection is tricky, but qsprintf() has supported the %> (LIKE prefix), %~ (LIKE substring) and %< (LIKE suffix) conversions since around 2008. I think I just caught the need for these by anticipating it while trying to write a LIKE at some point at Facebook, so the story behind them isn't very interesting.
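For readers unfamiliar with why LIKE needs its own escaping: percent, underscore, and backslash are metacharacters inside LIKE patterns, so user input must be escaped before being wrapped in a pattern. A rough Python sketch of the kind of escaping conversions like %~ and %> perform (the function names here are invented for illustration):

```python
def escape_like(value):
    """Escape LIKE metacharacters so user input matches literally.
    Backslash must be escaped first, then percent and underscore."""
    return (value
            .replace("\\", "\\\\")
            .replace("%", "\\%")
            .replace("_", "\\_"))


def like_substring(value):
    """Build a %substring% pattern, as a %~ conversion would."""
    return "%" + escape_like(value) + "%"


def like_prefix(value):
    """Build a prefix% pattern, as a %> conversion would; MySQL can
    satisfy this form with a key."""
    return escape_like(value) + "%"
```

Without this escaping, a user searching for "50%_off" would accidentally match far more rows than intended, since % and _ act as wildcards.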

A more interesting story is the %K conversion that qsprintf() supports, which escapes a comment. Facebook put the current URL in a comment at the beginning of web queries, to make it easier to figure out where a query was coming from. For example, you might catch a slow query like this in MySQL or in the slow query log:

/* /help/profiles.php?section=3 */ SELECT * FROM a JOIN b JOIN c ...

...and the comment would make it clear where to start looking to find and fix the problem query.

However, this comment wasn't escaped at all and included the entire query string, so for several months in 2007 or 2008 you could visit any page with a query string something like this:

*/ SELECT * FROM dark_secrets; --

...and run whatever SQL you wanted on the first connection the page happened to establish.
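The fix is to escape the one sequence that can terminate the comment early. A minimal Python sketch of the kind of escaping a conversion like %K must perform (the function name is invented for illustration):

```python
def escape_sql_comment(text):
    """Wrap text in a /* ... */ SQL comment safely. The dangerous
    sequence is "*/", which would close the comment early and let the
    rest of the text execute as SQL; "/*" is also neutralized to avoid
    nested-comment surprises."""
    text = text.replace("/*", "/ *").replace("*/", "* /")
    return "/* " + text + " */"
```

With this in place, the malicious query string above ends up inert inside the comment instead of terminating it.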


bin/lipsum

lipsum is a tool that generates fairly-realistic synthetic test data to populate a development environment with plausible objects, so that testing user interfaces and interactions is a little more realistic and every application isn't an empty, barren wasteland. It isn't perfect, but it can be a good start toward getting a development environment off the ground and helping it feel more real:

$ ./bin/lipsum generate project paste
 GENERATORS  Selected generators: Projects, Pastes.
 WARNING  This command generates synthetic test data, including user accounts.
 It is intended for use in development environments so you can test features
 more easily. There is no easy way to delete this data or undo the effects of
 this command. If you run it in a production environment, it will pollute your
 data with large amounts of meaningless garbage that you can not get rid of.

    Are you sure you want to generate piles of garbage? [y/N] y

 LIPSUM  Generating synthetic test objects forever. Use ^C to stop when satisfied.
Generated "Project": Data Center Performance
Generated "Project": Ion Cannon Automation
Generated "Paste": P239 administrate_shards.txt
Generated "Project": Improve Security Monetization
Generated "Project": Drones
Generated "Paste": P240 plan_users.txt
Generated "Project": Accelerating Robot
Generated "Project": Hardware Automation
Generated "Paste": P241 account_shard.php
Generated "Paste": P242 update_ancient_lost_hosts.txt
Generated "Paste": P243 administrate_disk_memory_memory.php
Generated "Paste": P244 probe_accounts.php

lipsum has been around for a while, but since it's purely a development tool it doesn't get updated very often. This week saw some updates to the top-level UX to make it easier to use, and improvements to the Paste and Projects generators to fix some bugs, modernize them, and let them generate better data.

Upcoming: Subprojects and Milestones

The next major project I am pursuing is subprojects/milestones. The top-level task is T3670, with an implementation log in T10010.

I made significant progress on internal infrastructure changes to support these features. Policies, queries, data structures, and new materialization for project membership are all in place, working in a way that appears to be correct and consistent, and reasonably well covered by unit tests. However, most of the work on this project is likely to ultimately be frontend changes to workboards and typeaheads, so the bulk of the work is still ahead of us, even though the technical backend part is substantially complete.

I stopped short of any product/UI changes this week, so there's no way to actually create these projects yet. This is mostly because I wanted to be confident about promoting this week and didn't want risky changes to Projects to jeopardize that, but partly because of product ambiguity that I want to plan our way through more carefully (discussed in T10054). I expect subprojects to get at least some UI this week, although they're likely to remain prototypes for a minimum of one more week.

Written by epriestley on Dec 28 2015, 8:10 PM.
